🤖 DevOps AI Simulator (v6.2.0-PROD)

title: DevOps AI Agent Simulator
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web

A production-grade, OpenEnv-compliant reinforcement learning environment and interactive platform for building, evaluating, and visualizing AI agents specialized in SRE (Site Reliability Engineering) and Auto-Healing.

This project simulates real-world DevOps system failures, providing a rich training ground for AI agents to diagnose and resolve incidents including memory leaks, database locks, and CPU scaling bottlenecks.

🚀 Quick Start

Ensure you have uv installed.

1. Start the Environment Server (FastAPI)

uv run uvicorn server.app:app

The server listens on port 8000, which is uvicorn's default and the standard port for OpenEnv.

2. Launch the Interactive Dashboard (Streamlit)

uv run streamlit run streamlit_app.py

The dashboard is available at http://localhost:8501.

3. Run AI Agent Inference (CLI)

# Requires GROQ_API_KEY or an OpenAI-compatible endpoint
uv run python3 inference.py

🏗️ Technical Architecture

The platform is built on the openenv-core SDK and uses a Dual-State Model:

Observable State: CPU (%), Memory (%), DB Latency, and Service Status (API/DB/Cache).
Hidden Root Causes: Memory leaks, database locks, and CPU scaling bottlenecks.
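A minimal sketch of how the two halves of the Dual-State Model could be kept apart. The class names, field names, and the millisecond unit for DB latency are assumptions for illustration and are not taken from env/models.py.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class RootCause(Enum):
    """Hidden failure modes named in this README."""
    NONE = "none"
    MEMORY_LEAK = "memory_leak"
    DATABASE_LOCK = "database_lock"
    CPU_BOTTLENECK = "cpu_bottleneck"


@dataclass
class Observation:
    """What the agent is allowed to see on each step (hypothetical field names)."""
    cpu_percent: float
    memory_percent: float
    db_latency_ms: float  # unit is an assumption
    service_status: Dict[str, str] = field(
        default_factory=lambda: {"api": "up", "database": "up", "cache": "up"}
    )


@dataclass
class SimState:
    """Full simulator state: the observation plus the hidden root cause."""
    observation: Observation
    root_cause: RootCause = RootCause.NONE

    def observe(self) -> Observation:
        # Only the observable half leaves the simulator.
        return self.observation


state = SimState(
    Observation(cpu_percent=40.0, memory_percent=91.0, db_latency_ms=12.0),
    root_cause=RootCause.MEMORY_LEAK,
)
visible = state.observe()
```

Here `visible` carries the metrics and service status but has no `root_cause` attribute, so the agent has to infer the cause from the metrics and logs.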

🔍 Simulation Engine

Failure Propagation: Metrics evolve dynamically based on hidden failures. Once a memory leak pushes memory usage above 85%, the probability of a service crash increases.
Multi-Step Chains: Resolving 'Expert' tasks requires specific action sequences (e.g., investigating logs → clearing cache → restarting the API).
Log Generator: A three-level log system that reveals the true 'Root Cause' message only after the agent executes check_logs.
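The propagation and log-gating ideas above can be sketched as follows. Only the 85% threshold and the check_logs gate come from this README. The baseline risk, the linear curve, the leak rate, and the assumption that the root-cause message is the last log entry are illustrative choices, not the logic in env/environment.py.

```python
import random

CRASH_THRESHOLD = 85.0  # memory percentage named in the README


def crash_probability(memory_percent: float) -> float:
    """Hypothetical curve: small baseline risk, rising linearly above the threshold."""
    if memory_percent <= CRASH_THRESHOLD:
        return 0.01
    # Scale from 0.01 at 85% up to 0.51 at 100%.
    excess = min(memory_percent, 100.0) - CRASH_THRESHOLD
    return 0.01 + 0.5 * excess / (100.0 - CRASH_THRESHOLD)


def step_memory_leak(memory_percent: float, rng: random.Random, leak_rate: float = 3.0):
    """Advance one tick: the leak grows memory, then a crash is sampled."""
    memory_percent = min(100.0, memory_percent + leak_rate)
    crashed = rng.random() < crash_probability(memory_percent)
    return memory_percent, crashed


def visible_logs(logs_by_depth: list, checked: bool) -> list:
    """Hide the deepest (root-cause) entry until check_logs has been executed."""
    return logs_by_depth if checked else logs_by_depth[:-1]
```

Passing a seeded `random.Random` keeps the crash sampling reproducible across runs, which matters if episodes are meant to be graded deterministically.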

📊 Environment Specification

🎮 Available Actions

restart_service:[api|database|cache]
scale_up:cpu
optimize_database
clear_cache
check_logs
do_nothing (only valid when system is healthy)
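As an illustration, the action strings listed above could be validated before they reach the environment. This is a sketch: `parse_action` is a hypothetical helper, and whether the real environment rejects an invalid action or merely penalizes it is not stated here.

```python
from typing import Optional, Tuple

# Action names and targets as listed above.
TARGETED_ACTIONS = {
    "restart_service": {"api", "database", "cache"},
    "scale_up": {"cpu"},
}
SIMPLE_ACTIONS = {"optimize_database", "clear_cache", "check_logs", "do_nothing"}


def parse_action(raw: str, system_healthy: bool = False) -> Tuple[str, Optional[str]]:
    """Split 'name:target' and reject anything outside the documented action set."""
    name, _, target = raw.strip().partition(":")
    if name in TARGETED_ACTIONS:
        if target not in TARGETED_ACTIONS[name]:
            raise ValueError(f"invalid target {target!r} for {name}")
        return name, target
    if name in SIMPLE_ACTIONS and not target:
        if name == "do_nothing" and not system_healthy:
            raise ValueError("do_nothing is only valid when the system is healthy")
        return name, None
    raise ValueError(f"unknown action {raw!r}")
```

For example, `parse_action("restart_service:api")` returns `("restart_service", "api")`, while `parse_action("restart_service:disk")` raises `ValueError`.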

📈 Scoring & Rewards

Episodes are graded out of 1.0 via a deterministic 5-component scoring model:

Outcome (35%): Reaching a 'healthy' status.
Logic Match (25%): Using the correct corrective action for the specific hidden cause.
Efficiency (15%): Minimizing unnecessary steps.
Health Bonus (15%): Maximizing resource optimization.
Diagnostic (10%): Rewarding log investigation before fixing.
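The weights above sum to 1.0, so the final grade can be read as a weighted average of five component scores. The sketch below assumes each component is itself a value in [0, 1]. The component keys and the `episode_score` helper are hypothetical names, and how each component is actually measured is not specified here.

```python
# Weights from the scoring list above.
WEIGHTS = {
    "outcome": 0.35,
    "logic_match": 0.25,
    "efficiency": 0.15,
    "health_bonus": 0.15,
    "diagnostic": 0.10,
}


def episode_score(components: dict) -> float:
    """Weighted sum of component scores, each clamped to [0, 1]."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, components.get(name, 0.0)))
        total += weight * value
    return round(total, 4)


# An agent that fixed the system with the right action, but skipped the logs.
example = episode_score(
    {"outcome": 1.0, "logic_match": 1.0, "efficiency": 0.5, "health_bonus": 0.8}
)
```

In this example the score works out to 0.35 + 0.25 + 0.075 + 0.12 + 0 = 0.795, which shows that skipping check_logs costs at most the 10% Diagnostic share.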

🤖 AI Agent Integration

The project includes a production-grade inference engine (inference.py) that supports:

Groq (LLaMA-3): Low-latency inference for real-time SRE responses.
Reasoning Protocol: The agent identifies the root_cause, cites evidence_logs, and proposes a plan before taking an action.
Safety Fallbacks: Hardcoded SRE heuristics protect the system if the LLM call fails or a hallucinated action is detected.
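A rough sketch of how the reasoning protocol and the safety fallback might fit together, assuming the LLM replies with a JSON object. The JSON shape, the `action` key, the function names, and the fallback rules (including the 90% CPU cutoff) are assumptions for illustration, not the contents of inference.py.

```python
import json

# Every action string documented in this README.
VALID_ACTIONS = {
    "restart_service:api", "restart_service:database", "restart_service:cache",
    "scale_up:cpu", "optimize_database", "clear_cache", "check_logs", "do_nothing",
}
REQUIRED_FIELDS = ("root_cause", "evidence_logs", "plan", "action")


def heuristic_action(obs: dict) -> str:
    """Hypothetical fallback rules; the thresholds are illustrative only."""
    if obs.get("memory_percent", 0) > 85:
        return "restart_service:api"
    if obs.get("cpu_percent", 0) > 90:
        return "scale_up:cpu"
    return "check_logs"


def choose_action(llm_reply: str, obs: dict) -> str:
    """Use the LLM's action only if the reply is well-formed and the action is known."""
    try:
        reply = json.loads(llm_reply)
    except json.JSONDecodeError:
        return heuristic_action(obs)  # the LLM returned something unparseable
    if not isinstance(reply, dict) or any(k not in reply for k in REQUIRED_FIELDS):
        return heuristic_action(obs)  # reasoning fields are missing
    action = reply["action"]
    if isinstance(action, str) and action in VALID_ACTIONS:
        return action
    return heuristic_action(obs)  # hallucinated or malformed action
```

Checking the action against a closed set is one simple way to catch a hallucinated command such as `reboot_datacenter` before it reaches the environment.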

📁 Project Structure

.
├── env/                   # Core Logic
│   ├── environment.py     # Rewards, State Transitions, Failures
│   ├── tasks.py           # Dynamic Scenario Generation
│   └── models.py          # Pydantic Schemas
├── server/
│   └── app.py             # FastAPI Endpoints (/reset, /step, /auto-run)
├── streamlit_app.py       # Interactive Web UI
├── inference.py           # AI Agent & Reasoning Engine
├── openenv.yaml           # Manifest for Hugging Face Deployment
└── requirements.txt       # Dependencies
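The /reset and /step endpoints listed in the tree above could be exercised with a small standard-library client like the one below. The request bodies (an empty object for /reset and `{"action": ...}` for /step) and the use of POST are assumptions; the schemas the server actually accepts are shown at /docs once it is running.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # port from the Quick Start section


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the environment server."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post(path: str, payload: dict) -> dict:
    """Send the request and decode the JSON response; requires a running server."""
    with urllib.request.urlopen(build_request(path, payload), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical payloads: adjust to the schemas shown at /docs.
reset_req = build_request("/reset", {})
step_req = build_request("/step", {"action": "check_logs"})
```

With the server from the Quick Start running, `post("/reset", {})` followed by `post("/step", {"action": "check_logs"})` would drive one step of an episode, provided the payload shapes match.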

☁️ Deploying to Hugging Face Spaces

This environment is optimized for deployment as a Docker Space:

# Register the environment and push to HF
openenv push

The deployed space includes:

Interactive UI at /web
FastAPI Documentation at /docs
OpenEnv Specification at /spec

📄 License

MIT License. Created for the OpenEnv DevOps/SRE Auto-Healing Benchmark.
