javatcoding1/devops-autoheal

title: DevOps AI Agent Simulator
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags: openenv, devops, sre, reinforcement-learning, ai-agent

🤖 DevOps AI Simulator (v6.2.0-PROD)

A production-grade, OpenEnv-compliant reinforcement learning environment and interactive platform for building, evaluating, and visualizing AI agents specialized in SRE (Site Reliability Engineering) and Auto-Healing.

This project simulates real-world DevOps system failures, providing a rich training ground for AI agents to diagnose and resolve incidents including memory leaks, database locks, and CPU scaling bottlenecks.


🚀 Quick Start

Ensure you have uv installed.

1. Start the Environment Server (FastAPI)

uv run uvicorn server.app:app

The server listens on port 8000, which is uvicorn's default and the standard port for OpenEnv.

2. Launch the Interactive Dashboard (Streamlit)

uv run streamlit run streamlit_app.py

The dashboard is then accessible at http://localhost:8501.

3. Run AI Agent Inference (CLI)

# Requires GROQ_API_KEY or an OpenAI-compatible endpoint
uv run python3 inference.py

🏗️ Technical Architecture

The platform is built on the openenv-core SDK and utilizes a Dual-State Model:

  1. Observable State: CPU (%), Memory (%), DB Latency, and Service Status (API/DB/Cache).
  2. Hidden Root Causes: Memory leaks, database locks, and CPU scaling bottlenecks.
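
The split between the two states can be sketched as a pair of dataclasses. This is only an illustration of the idea, not the project's actual schema in env/models.py: the class names, field names, and the millisecond unit for DB latency are all assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class RootCause(Enum):
    # The three hidden root causes named in this README.
    MEMORY_LEAK = "memory_leak"
    DATABASE_LOCK = "database_lock"
    CPU_BOTTLENECK = "cpu_bottleneck"


@dataclass
class Observation:
    """What the agent is allowed to see (hypothetical field names)."""
    cpu_percent: float
    memory_percent: float
    db_latency_ms: float  # unit is an assumption; the README only says "DB Latency"
    service_status: Dict[str, str] = field(
        default_factory=lambda: {"api": "up", "database": "up", "cache": "up"}
    )


@dataclass
class SimulatorState:
    """Full simulator state: the observation plus the hidden root cause."""
    observation: Observation
    hidden_root_cause: Optional[RootCause] = None

    def observe(self) -> Observation:
        # Only the observable half ever leaves the environment.
        return self.observation


state = SimulatorState(
    observation=Observation(cpu_percent=42.0, memory_percent=91.0, db_latency_ms=120.0),
    hidden_root_cause=RootCause.MEMORY_LEAK,
)
visible = state.observe()
```

Keeping the root cause on a separate object means the agent has to infer it from metrics and logs rather than read it off the observation.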

🔍 Simulation Engine

  • Failure Propagation: Metrics evolve dynamically based on hidden failures. For example, once a memory leak pushes memory above 85%, the probability of a service crash increases.
  • Multi-Step Chains: Resolving 'Expert' tasks requires specific action sequences (e.g., investigating logs → clearing cache → restarting the API).
  • Log Generator: A three-level log system that reveals the true 'Root Cause' message only after the agent executes check_logs.
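
A minimal sketch of the failure-propagation idea follows. Only the 85% memory threshold comes from this README; the baseline crash chance, the linear ramp, the leak rate, and the function names are illustrative assumptions and not the logic in env/environment.py.

```python
import random

MEMORY_CRASH_THRESHOLD = 85.0  # from the README: memory > 85% raises crash risk


def crash_probability(memory_percent: float) -> float:
    """Hypothetical crash chance: small baseline, rising linearly above the threshold."""
    base = 0.01  # assumed baseline
    if memory_percent <= MEMORY_CRASH_THRESHOLD:
        return base
    # Ramp from the baseline at 85% up to 0.5 at 100% (illustrative numbers only).
    capped = min(memory_percent, 100.0)
    excess = (capped - MEMORY_CRASH_THRESHOLD) / (100.0 - MEMORY_CRASH_THRESHOLD)
    return base + excess * (0.5 - base)


def step_memory_leak(memory_percent, api_status, rng, leak_rate=3.0):
    """Advance one tick: the leak grows memory, then the API may crash."""
    memory_percent = min(100.0, memory_percent + leak_rate)
    if api_status == "up" and rng.random() < crash_probability(memory_percent):
        api_status = "down"
    return memory_percent, api_status


# Usage: a seeded generator keeps a simulated episode reproducible.
rng = random.Random(0)
memory, status = 80.0, "up"
for _ in range(5):
    memory, status = step_memory_leak(memory, status, rng)
```

Passing the random generator in explicitly, rather than using the module-level one, makes it easy to replay the same failure trajectory when comparing agents.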

📊 Environment Specification

🎮 Available Actions

  • restart_service:[api|database|cache]
  • scale_up:cpu
  • optimize_database
  • clear_cache
  • check_logs
  • do_nothing (only valid when the system is healthy)
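
The action strings above follow a name[:target] pattern, which could be validated roughly as below. The parse_action helper and its choice to raise ValueError are assumptions made for illustration; the real environment may instead penalize an invalid action rather than reject it.

```python
from typing import Optional, Tuple

# Action names and their allowed targets, taken from the list above.
# An empty set means the action takes no target.
VALID_ACTIONS = {
    "restart_service": {"api", "database", "cache"},
    "scale_up": {"cpu"},
    "optimize_database": set(),
    "clear_cache": set(),
    "check_logs": set(),
    "do_nothing": set(),
}


def parse_action(raw: str, system_healthy: bool = False) -> Tuple[str, Optional[str]]:
    """Split 'name:target' and validate it against the action table (hypothetical helper)."""
    name, _, target = raw.strip().partition(":")
    if name not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {name!r}")
    allowed = VALID_ACTIONS[name]
    if allowed and target not in allowed:
        raise ValueError(f"{name} needs one of {sorted(allowed)}, got {target!r}")
    if not allowed and target:
        raise ValueError(f"{name} takes no target")
    if name == "do_nothing" and not system_healthy:
        raise ValueError("do_nothing is only valid when the system is healthy")
    return name, (target or None)


print(parse_action("restart_service:api"))
print(parse_action("check_logs"))
```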

📈 Scoring & Rewards

Episodes are graded out of 1.0 via a deterministic 5-component scoring model:

  • Outcome (35%): Reaching a 'healthy' status.
  • Logic Match (25%): Using the correct corrective action for the specific hidden cause.
  • Efficiency (15%): Minimizing unnecessary steps.
  • Health Bonus (15%): Maximizing resource optimization.
  • Diagnostic (10%): Rewarding log investigation before fixing.
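
As a worked example of the weighting, the sketch below combines five component scores into a single episode grade. The weights are the ones listed above; the component key names, the assumption that each component is a value in [0, 1], and the episode_score function are illustrative and not taken from the project's grader.

```python
# Weights from the scoring list above; they sum to 1.0.
WEIGHTS = {
    "outcome": 0.35,
    "logic_match": 0.25,
    "efficiency": 0.15,
    "health_bonus": 0.15,
    "diagnostic": 0.10,
}


def episode_score(components: dict) -> float:
    """Weighted sum of component scores, each clamped to [0, 1] (hypothetical helper)."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, components[name]))
        total += weight * value
    return round(total, 4)


# An agent that fixed the incident with the right action after checking logs,
# but took extra steps (efficiency 0.5) and left some headroom unused (health 0.8):
score = episode_score({
    "outcome": 1.0,
    "logic_match": 1.0,
    "efficiency": 0.5,
    "health_bonus": 0.8,
    "diagnostic": 1.0,
})
```

Here the grade works out to 0.35 + 0.25 + 0.075 + 0.12 + 0.10 = 0.895 out of 1.0.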

🤖 AI Agent Integration

The project includes a production-grade inference engine (inference.py) that supports:

  • Groq (LLaMA-3): Low-latency inference for real-time SRE responses.
  • Reasoning Protocol: The agent identifies the root_cause, cites evidence_logs, and proposes a plan before taking an action.
  • Safety Fallbacks: Hardcoded SRE heuristics take over if the LLM call fails or the model returns a hallucinated (invalid) action.
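
One way the reasoning protocol and the safety fallback could fit together is sketched below, assuming the LLM is asked to reply with a JSON object. The field names root_cause, evidence_logs, and plan come from this README; the action field, the JSON format, the choose_action and heuristic_action helpers, and the heuristic thresholds are assumptions and do not reflect what inference.py actually does.

```python
import json

REQUIRED_FIELDS = ("root_cause", "evidence_logs", "plan", "action")

# Fully expanded action strings from the "Available Actions" list.
KNOWN_ACTIONS = {
    "restart_service:api", "restart_service:database", "restart_service:cache",
    "scale_up:cpu", "optimize_database", "clear_cache", "check_logs", "do_nothing",
}


def heuristic_action(observation: dict) -> str:
    """Hypothetical fallback rules; thresholds are illustrative."""
    if observation.get("memory_percent", 0) > 85:
        return "restart_service:api"
    if observation.get("cpu_percent", 0) > 90:
        return "scale_up:cpu"
    return "check_logs"


def choose_action(llm_text, observation: dict) -> str:
    """Use the LLM's action only if its reply is well formed and the action is known."""
    try:
        reply = json.loads(llm_text)
    except (json.JSONDecodeError, TypeError):
        return heuristic_action(observation)  # LLM call failed or returned non-JSON
    if not isinstance(reply, dict) or any(key not in reply for key in REQUIRED_FIELDS):
        return heuristic_action(observation)  # reasoning fields are missing
    action = reply["action"]
    if not isinstance(action, str) or action not in KNOWN_ACTIONS:
        return heuristic_action(observation)  # hallucinated action
    return action


obs = {"cpu_percent": 40.0, "memory_percent": 92.0}
good_reply = json.dumps({
    "root_cause": "memory_leak",
    "evidence_logs": ["heap usage growing without release"],
    "plan": "clear the cache, then restart the API",
    "action": "clear_cache",
})
print(choose_action(good_reply, obs))
print(choose_action("I think you should reboot everything", obs))
```

Validating the reply before acting keeps a malformed or invented action from ever reaching the environment.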

📁 Project Structure

.
├── env/                   # Core Logic
│   ├── environment.py     # Rewards, State Transitions, Failures
│   ├── tasks.py           # Dynamic Scenario Generation
│   └── models.py          # Pydantic Schemas
├── server/
│   └── app.py             # FastAPI Endpoints (/reset, /step, /auto-run)
├── streamlit_app.py       # Interactive Web UI
├── inference.py           # AI Agent & Reasoning Engine
├── openenv.yaml           # Manifest for Hugging Face Deployment
└── requirements.txt       # Dependencies

☁️ Deploying to Hugging Face Spaces

This environment is optimized for deployment as a Docker Space:

# Register the environment and push to HF
openenv push

The deployed space includes:

  • Interactive UI at /web
  • FastAPI Documentation at /docs
  • OpenEnv Specification at /spec

📄 License

MIT License. Created for the OpenEnv DevOps/SRE Auto-Healing Benchmark.
