---
title: Orbital Thruster Environment Server
sdk: docker
pinned: false
app_port: 7860
base_path: /web
colorFrom: blue
colorTo: indigo
tags:
---
OpenEnv benchmark for Theme #2: (Super) Long-Horizon Planning & Instruction Following. The agent must track mission directives over a long episode, preserve fuel for delayed objectives, recover from anomalies, and finish in precision hold.
Submission links
- Hugging Face Space: https://huggingface.co/spaces/pixxel-phantom/orbital-thruster-env
- Trained adapter (GRPO LoRA): https://huggingface.co/pixxel-phantom/orbital-thruster-grpo-fast
- Training notebook: `training/train_orbital_grpo.ipynb`
Pitch: early waste breaks later phases. A controller that looks good on short-horizon pointing can still fail the flagship mission because it burns fuel before the retarget, mishandles the anomaly, or reaches the final hold phase with no reserve left.
Modern mission-operations control is not one action repeated forever. It is a chain of directives:
- Detumble after deployment.
- Respect a quiet coast window.
- Repoint to a new relay geometry.
- Recover from an injected gyro-bias anomaly.
- Finish with stable precision hold.
This benchmark turns that story into a verifier-backed environment with explicit milestones, delayed checkpoints, and anti-shortcut rewards.
The environment keeps the existing orbital control core:
- 13 discrete thruster actions plus idle
- deterministic seeded disturbances
- limited RCS fuel
- dense physical reward from pointing, stability, fuel, and overshoot (illustrative sketch below)
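To make the reward shaping concrete, here is an illustrative sketch of how a dense term could combine those four signals; the weights, signs, and argument names are assumptions, not the repository's actual coefficients.

```python
# Illustrative shape of a dense physical reward combining the four signals above.
# Weights, signs, and argument names are assumptions, not the repo's values.
def dense_physical_reward(pointing_error: float, body_rate: float,
                          fuel_used: float, overshoot: float) -> float:
    return (
        -1.0 * pointing_error   # pointing: penalize attitude error
        - 0.5 * body_rate       # stability: penalize residual body rates
        - 0.1 * fuel_used       # fuel: penalize propellant spent this step
        - 0.2 * overshoot       # overshoot: penalize blowing past the target
    )
```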
On top of that, the mission-ops pivot adds:
`mission_brief`, `active_directive`, `pending_directives_count`, `milestones_completed`, `anomaly_flags`, `fuel_reserve_target`, `phase_deadline_step`, `reward_breakdown`, and `episode_metrics`.
Each action now also includes a required `control_mode`, one of: `detumble`, `slew`, `brake`, `trim`, `hold`, `recover`, `safe_hold` (example below).
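For illustration, a single candidate action might look like the following sketch; `control_mode` and the textual `reason` are documented in this README, while the `thruster_action` field name and its integer encoding are assumptions about the schema.

```python
# Hypothetical action payload; only `control_mode` and `reason` are documented
# above, the remaining field name and encoding are assumptions.
candidate_action = {
    "thruster_action": 4,    # one of the 13 discrete pulses (0 assumed to be idle)
    "control_mode": "slew",  # required: detumble | slew | brake | trim | hold | recover | safe_hold
    "reason": "retarget window open; begin the large-angle slew toward the relay geometry",
}
```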
Four tasks are exposed:
- `detumble_satellite` (easy): stabilize a newly deployed spacecraft and finish with ample reserve.
- `retarget_180_flip` (medium): survive a delayed maneuver window, execute the large flip, and settle cleanly.
- `long_horizon_precision_hold` (hard): preserve a fine-pointing envelope under long disturbance exposure.
- `mission_ops_long_horizon` (hard): a single episode that chains detumble, coast discipline, retargeting, anomaly recovery, and final precision hold.
This flagship task is the main demo task for the hackathon.
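A quick way to poke at the flagship task is a short HTTP smoke test. The sketch below assumes a `/reset` route that accepts a task name alongside the `/step` route listed in the validation checks; the exact payload keys may differ from the real schema.

```python
# Smoke test of the flagship task over HTTP (routes and payload keys assumed).
import requests

BASE = "http://localhost:7860"

obs = requests.post(f"{BASE}/reset", json={"task": "mission_ops_long_horizon"}).json()
print(obs.get("mission_brief"), obs.get("active_directive"))

step = requests.post(
    f"{BASE}/step",
    json={"action": {"control_mode": "detumble", "reason": "damp deployment rates first"}},
).json()
print(step.get("reward_breakdown"), step.get("milestones_completed"))
```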
The environment now logs rubric-style reward columns instead of a single opaque scalar:
`physical_tracking_reward`, `fuel_discipline_reward`, `milestone_completion_reward`, `control_mode_reward`, `anomaly_recovery_reward`, and `anti_stall_penalty`.
These are surfaced per step in `reward_breakdown` and aggregated in `state.reward_columns`. That makes it easy to show judges not only that reward improved, but which behaviors improved.
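As a sketch of what that aggregation looks like (the environment already does this server-side), per-step breakdown dicts can be rolled up into episode-level columns:

```python
# Roll per-step reward_breakdown dicts up into episode-level column totals,
# mirroring what state.reward_columns exposes. Values below are made up.
from collections import defaultdict

def aggregate_reward_columns(step_breakdowns: list[dict]) -> dict:
    totals = defaultdict(float)
    for breakdown in step_breakdowns:
        for column, value in breakdown.items():
            totals[column] += value
    return dict(totals)

episode = [
    {"physical_tracking_reward": 0.4, "fuel_discipline_reward": 0.1, "anti_stall_penalty": -0.05},
    {"physical_tracking_reward": 0.6, "milestone_completion_reward": 1.0, "anti_stall_penalty": 0.0},
]
print(aggregate_reward_columns(episode))
```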
Three baselines are supported end-to-end:
- seeded random controller
- deterministic PD controller
- tuned PD controller
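For context, the PD baselines amount to a classic proportional-derivative attitude law. The sketch below is illustrative only: the gains, argument names, and quantization step are not the repository's actual values.

```python
# Minimal sketch of the kind of PD attitude law the PD baselines represent.
import numpy as np

def pd_pointing_command(angle_error, body_rate, kp: float = 0.8, kd: float = 2.5):
    """Torque request opposing both pointing error and body rate."""
    return -kp * np.asarray(angle_error) - kd * np.asarray(body_rate)

# A baseline controller would then quantize this continuous request onto the
# 13 discrete thruster pulses, or choose idle inside a small deadband.
```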
The current intended story is:
- the deterministic PD controller clears easy
- the tuned PD controller clears medium
- both heuristics fail the flagship mission
Current fixed-seed baseline snapshot:
| Policy | Easy | Medium | Hard Hold | Flagship Mission |
|---|---|---|---|---|
| Random | fail | fail | fail | fail |
| Deterministic PD | pass | fail | fail | fail |
| Tuned PD | pass | pass | fail | fail |
Run the fixed-seed evaluation:
```bash
python training/evaluate_baselines.py
```

Artifacts are written to:
- `outputs/baseline_eval/baseline_summary.csv`
- `outputs/baseline_eval/baseline_summary.png`
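The CSV can be inspected directly, e.g. with pandas; the column names depend on what `evaluate_baselines.py` writes, so treat this as a sketch.

```python
import pandas as pd

# Load the fixed-seed baseline summary written by evaluate_baselines.py.
summary = pd.read_csv("outputs/baseline_eval/baseline_summary.csv")
print(summary.to_string(index=False))
```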
Stack: TRL (SFTTrainer → GRPOTrainer) with PEFT QLoRA, using the real OpenEnv environment as the verifier.
Base model: Qwen/Qwen2.5-1.5B-Instruct (L4 GPU via HF Jobs). Override via ORBITAL_BASE_MODEL env var.
Why this model: strong JSON adherence (we score on JSON validity), fits 4-bit QLoRA on a single L4, fast iteration for deadline training, mature TRL integration.
Pipeline: seed trajectories from tuned-PD expert → 40-step SFT (JSON+control-mode priming, loss 2.33→0.55) → 60-step GRPO with 5 independent reward funcs (reward 0.84→2.30) → eval vs baselines.
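A condensed sketch of the GRPO stage follows; the notebook is the authoritative version. The dataset, hyperparameters, and the stubbed reward function are placeholders, and the sketch assumes the current TRL `GRPOTrainer` / `GRPOConfig` API together with a PEFT `LoraConfig`.

```python
# Condensed, illustrative GRPO setup (see training/train_orbital_grpo.ipynb for the real one).
from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def reward_format(completions, **kwargs):
    # Placeholder body: the real function scores strict JSON + valid enums (see table below).
    return [1.0 for _ in completions]

seed_prompts = Dataset.from_list(
    [{"prompt": "Mission brief: detumble after deployment. Reply with a JSON action."}]
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",   # overridable via ORBITAL_BASE_MODEL
    reward_funcs=[reward_format],          # the notebook passes all five reward functions
    args=GRPOConfig(
        output_dir="outputs/grpo",
        max_steps=60,
        per_device_train_batch_size=8,
        num_generations=8,
    ),
    train_dataset=seed_prompts,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```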
Reward funcs (independent, summed by GRPO — anti-hacking design):
| Function | Signal |
|---|---|
| `reward_format` | strict JSON parse + valid enums + reason |
| `reward_env_step` | replay history, score candidate via real env reward |
| `reward_mode_match` | `control_mode` ∈ recommended set for the active directive |
| `reward_anti_spam` | penalty if the same action appears ≥ 4× in the last 7 steps |
| `reward_fuel_discipline` | low-fuel → idle bonus, low-fuel → large-pulse penalty |
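As an example of what one of these looks like under the anti-hacking design, here is a sketch of the format reward. The enum values come from this README; the exact scoring weights and completion handling are assumptions, not the notebook's code.

```python
# Sketch of a format reward: strict JSON parse, enum validation, required reason.
import json

VALID_MODES = {"detumble", "slew", "brake", "trim", "hold", "recover", "safe_hold"}

def reward_format(completions, **kwargs):
    """TRL GRPO reward function: returns one float per sampled completion."""
    scores = []
    for completion in completions:
        # Handle both plain-text and chat-style completions.
        text = completion[0]["content"] if isinstance(completion, list) else completion
        try:
            action = json.loads(text)
        except json.JSONDecodeError:
            scores.append(0.0)
            continue
        ok = (
            action.get("control_mode") in VALID_MODES
            and isinstance(action.get("reason"), str)
            and action["reason"].strip() != ""
        )
        scores.append(1.0 if ok else 0.25)
    return scores
```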
Entry points:
- `training/train_orbital_grpo.ipynb` - main notebook (SFT → GRPO → eval, end-to-end)
- `training/hf_job_train.py` - UV script for `hf jobs uv run` (cloud, GPU credits)
- `training/qwen3_smoke_sft.py` / `training/qwen3_grpo_train.py` - script entrypoints
- `training/eval_trained_model.py` - trained-vs-baseline comparison
Run on cloud:
```bash
hf jobs uv run --flavor l4x1 --timeout 4h --secrets HF_TOKEN \
  -e ORBITAL_BASE_MODEL=Qwen/Qwen3-4B-Instruct-2507 -d training/hf_job_train.py
```

Training-only deps: `training/requirements.txt`.
Training completed on HF Jobs (L4 GPU, Qwen/Qwen2.5-1.5B-Instruct, 40 SFT + 60 GRPO steps):
SFT phase: loss 2.33 → 0.55, accuracy 0.53 → 0.80 (139s)
GRPO phase: loss 0.077 → 0.037, total reward 0.84 → 2.30 (287s). reward_format reached 1.0 (perfect JSON).
Trained-vs-baseline comparison on all four tasks (cells show episode reward, plus pass/fail for the baseline policies):

| Policy | Easy (detumble) | Medium (retarget) | Hard (hold) | Flagship (mission_ops) |
|---|---|---|---|---|
| Random | 23.9 / fail | 3.2 / fail | -25.3 / fail | -53.5 / fail |
| Deterministic PD | 17.6 / pass | 97.4 / fail | 21.1 / fail | 89.8 / fail |
| Tuned PD | 34.2 / pass | 120.1 / pass | 27.5 / fail | 115.8 / fail |
| Trained (GRPO) | 9.2 | 38.3 | 88.0 | 22.6 |
The trained model learned perfect JSON formatting (reward_format=1.0) and a conservative fuel strategy (fuel_used=0), outperforming random on the medium, hard, and flagship tasks and posting the strongest hard-task reward of any policy. With more GRPO steps, milestone completion would likely improve.
Generated plots:
- `outputs/baseline_eval/baseline_summary.png` - baseline policies (random / deterministic-PD / tuned-PD)
- `outputs/training/grpo_metrics.png` - per-component reward + loss curves over GRPO steps
- `outputs/eval_trained/trained_vs_baseline.png` - trained policy vs all 3 baselines on all 4 tasks
Install and run the server locally:

```bash
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 7860
```

Validation, seed-data, evaluation, and training scripts:

```bash
python validate.py
python training/generate_seed_trajectories.py
python training/evaluate_baselines.py
python training/qwen3_smoke_sft.py
python training/qwen3_grpo_train.py
```

Docker:

```bash
docker build -t orbital-thruster-env .
docker run -p 7860:7860 orbital-thruster-env
```

`API_BASE_URL` and `MODEL_NAME` can be overridden at runtime. `HF_TOKEN` is required for remote inference.
```powershell
$env:API_BASE_URL = "https://router.huggingface.co/v1"
$env:MODEL_NAME = "Qwen/Qwen3-8B"
$env:HF_TOKEN = "hf_xxx"
python inference.py
```

The validation script now checks:
- four tasks present
- mission-planning observation fields exposed
- action schema requires `control_mode`
- reward rubric surfaced on `/step`
- cumulative reward columns surfaced on `/state`
```bash
python validate.py
```
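For reference, the checks above boil down to assertions of this shape; the snippet is illustrative and not the script's actual code.

```python
import requests

# The /state route should expose the cumulative rubric columns.
state = requests.get("http://localhost:7860/state").json()
assert "reward_columns" in state, "cumulative reward columns missing from /state"
```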
