| title | DevOps AI Agent Simulator |
|---|---|
| emoji | 🤖 |
| colorFrom | blue |
| colorTo | green |
| sdk | docker |
| pinned | false |
| app_port | 8000 |
| base_path | /web |
| tags | |

A production-grade, OpenEnv-compliant reinforcement learning environment and interactive platform for building, evaluating, and visualizing AI agents specialized in SRE (Site Reliability Engineering) and Auto-Healing.
This project simulates real-world DevOps system failures, providing a rich training ground for AI agents to diagnose and resolve incidents including memory leaks, database locks, and CPU scaling bottlenecks.
Ensure you have uv installed.

    uv run uvicorn server.app:app

The API server listens on port 8000 (standard for OpenEnv).

    uv run streamlit run streamlit_app.py

The Streamlit UI is accessible at http://localhost:8501.

    # Requires GROQ_API_KEY or OpenAI-compatible endpoint
    uv run python3 inference.py

The platform is built on the openenv-core SDK and uses a Dual-State Model:
- Observable State: CPU (%), Memory (%), DB Latency, and Service Status (API/DB/Cache).
- Hidden Root Causes: Memory leaks, database locks, and CPU scaling bottlenecks.
- Failure Propagation: Metrics evolve dynamically based on hidden failures. Once a memory leak pushes memory above 85%, the probability of a service crash rises.
- Multi-Step Chains: Resolving 'Expert' tasks requires specific action sequences (e.g., investigating logs → clearing cache → restarting the API).
- Log Generator: A 3-depth log system that only reveals the true 'Root Cause' message after the agent executes `check_logs`.
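
The dual-state idea in the list above can be illustrated with a toy model. This is a minimal sketch, not the code in `env/environment.py`: the class names, the `"memory_leak"` label, the latency unit, the growth rate, and both crash probabilities are assumptions chosen for illustration. Only the 85% memory threshold and the rule that the root cause appears after `check_logs` come from the description.

```python
import random
from dataclasses import dataclass


@dataclass
class ObservableState:
    cpu: float            # percent
    memory: float         # percent
    db_latency_ms: float  # unit is an assumption


@dataclass
class ToyEnv:
    """Toy dual-state model: the agent sees `obs`, never `hidden_cause`."""
    obs: ObservableState
    hidden_cause: str     # hypothetical label, e.g. "memory_leak"
    logs_checked: bool = False

    def crash_probability(self) -> float:
        # Assumed numbers: 5% baseline, 40% once memory exceeds 85%.
        return 0.40 if self.obs.memory > 85.0 else 0.05

    def tick(self, rng: random.Random) -> bool:
        """Advance one step; return True if the service crashed."""
        if self.hidden_cause == "memory_leak":
            # Assumed growth of 5 points per step, capped at 100%.
            self.obs.memory = min(100.0, self.obs.memory + 5.0)
        return rng.random() < self.crash_probability()

    def read_logs(self) -> str:
        # The root-cause line is only shown once check_logs has been run.
        if self.logs_checked:
            return f"ROOT CAUSE: {self.hidden_cause}"
        return "WARN: elevated resource usage"
```

Passing a seeded `random.Random` into `tick` keeps the toy simulation reproducible, which is convenient when comparing agents on the same scenario.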

Available actions:
- `restart_service:[api|database|cache]`
- `scale_up:cpu`
- `optimize_database`
- `clear_cache`
- `check_logs`
- `do_nothing` (only valid when system is healthy)
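
Action strings of this shape are easy to validate before they reach the environment. The parser below is a sketch under assumptions: `parse_action` and its error messages are hypothetical helpers, not part of the project, and it does not check the "only valid when system is healthy" rule for `do_nothing`, since that depends on environment state.

```python
from typing import Optional, Tuple

VALID_SERVICES = {"api", "database", "cache"}
SIMPLE_ACTIONS = {"optimize_database", "clear_cache", "check_logs", "do_nothing"}


def parse_action(raw: str) -> Tuple[str, Optional[str]]:
    """Split an action string into (name, target) and validate it."""
    name, _, target = raw.strip().partition(":")
    if name == "restart_service":
        if target not in VALID_SERVICES:
            raise ValueError(f"unknown service: {target!r}")
        return name, target
    if name == "scale_up":
        if target != "cpu":
            raise ValueError(f"unknown resource: {target!r}")
        return name, target
    if name in SIMPLE_ACTIONS and not target:
        # Actions without a target are returned with target None.
        return name, None
    raise ValueError(f"unknown action: {raw!r}")
```

For example, `parse_action("restart_service:api")` returns `("restart_service", "api")`, while `parse_action("restart_service:web")` raises `ValueError`.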
Episodes are graded out of 1.0 via a deterministic 5-component scoring model:
- Outcome (35%): Reaching a 'healthy' status.
- Logic Match (25%): Using the correct corrective action for the specific hidden cause.
- Efficiency (15%): Minimizing unnecessary steps.
- Health Bonus (15%): Maximizing resource optimization.
- Diagnostic (10%): Rewarding log investigation before fixing.
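
The weights above sum to 1.0, so the episode score is a weighted sum of the five components. The sketch below assumes each component is already normalized to the range 0 to 1; the dictionary keys and the `episode_score` function are hypothetical names, and how the real grader computes each component is not shown here.

```python
from typing import Dict

# Weights taken from the list above.
WEIGHTS = {
    "outcome": 0.35,
    "logic_match": 0.25,
    "efficiency": 0.15,
    "health_bonus": 0.15,
    "diagnostic": 0.10,
}


def episode_score(components: Dict[str, float]) -> float:
    """Weighted sum of component scores; missing components count as 0."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = components.get(name, 0.0)
        value = min(1.0, max(0.0, value))  # clamp to [0, 1]
        total += weight * value
    return round(total, 4)
```

As a worked case, an agent that reaches a healthy state with the correct fix (outcome 1.0, logic match 1.0), scores 0.5 on efficiency and 0.8 on the health bonus, and skips the logs would get 0.35 + 0.25 + 0.075 + 0.12 + 0 = 0.795.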
The project includes a production-grade inference engine (inference.py) that supports:
- Groq (LLaMA-3): Blazing-fast inference for real-time SRE responses.
- Reasoning Protocol: The agent identifies the `root_cause`, cites `evidence_logs`, and proposes a `plan` before taking an `action`.
- Safety Fallbacks: Hardcoded SRE heuristics protect the system if the LLM fails or hallucinations are detected.
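
One way to combine the reasoning protocol with a safety fallback is to require the four fields in the model's reply and fall back to a heuristic when the reply is malformed. This is a sketch under assumptions: it assumes the reply is a JSON object, and `fallback_action`, `choose_action`, the observation keys, and the thresholds are hypothetical, not the logic in `inference.py`.

```python
import json
from typing import Any, Dict

REQUIRED_FIELDS = ("root_cause", "evidence_logs", "plan", "action")


def fallback_action(obs: Dict[str, float]) -> str:
    """Hypothetical heuristic used when the model reply is unusable."""
    if obs.get("memory", 0.0) > 85.0:
        return "restart_service:api"
    if obs.get("cpu", 0.0) > 85.0:  # threshold is an assumption
        return "scale_up:cpu"
    return "check_logs"


def choose_action(llm_reply: str, obs: Dict[str, float]) -> str:
    """Return the model's action if the reply is well formed, else a fallback."""
    try:
        decision: Any = json.loads(llm_reply)
    except json.JSONDecodeError:
        return fallback_action(obs)
    if not isinstance(decision, dict):
        return fallback_action(obs)
    if any(field not in decision for field in REQUIRED_FIELDS):
        return fallback_action(obs)
    action = decision["action"]
    return action if isinstance(action, str) else fallback_action(obs)
```

Defaulting to `check_logs` when nothing looks wrong keeps the fallback non-destructive, which fits the diagnostic-first scoring described earlier.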
.
├── env/ # Core Logic
│ ├── environment.py # Rewards, State Transitions, Failures
│ ├── tasks.py # Dynamic Scenario Generation
│ └── models.py # Pydantic Schemas
├── server/
│ └── app.py # FastAPI Endpoints (/reset, /step, /auto-run)
├── streamlit_app.py # Interactive Web UI
├── inference.py # AI Agent & Reasoning Engine
├── openenv.yaml # Manifest for Hugging Face Deployment
└── requirements.txt # Dependencies
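
The `/reset` and `/step` endpoints listed for `server/app.py` could be exercised from a small client. The sketch below only builds the HTTP requests with the standard library; the JSON payload shape (`{"action": ...}`), the use of POST, and the `build_request` helper are assumptions, so check the server's `/docs` page for the actual schema.

```python
import json
import urllib.request
from typing import Any, Dict


def build_request(base_url: str, endpoint: str,
                  payload: Dict[str, Any]) -> urllib.request.Request:
    """Build a JSON POST request for one of the environment endpoints."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}{endpoint}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed payloads; the real request schema may differ.
reset_req = build_request("http://localhost:8000", "/reset", {})
step_req = build_request("http://localhost:8000", "/step", {"action": "check_logs"})

# With the server running, a request can be sent with:
#     with urllib.request.urlopen(step_req) as resp:
#         print(json.load(resp))
```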
This environment is optimized for deployment as a Docker Space:

    # Register the environment and push to HF
    openenv push

The deployed space includes:
- Interactive UI at `/web`
- FastAPI Documentation at `/docs`
- OpenEnv Specification at `/spec`

MIT License. Created for the OpenEnv DevOps/SRE Auto-Healing Benchmark.