Experiment Log
Last updated: 2026-03-08 UTC
Scope
Current focus is the 3 vs 4 comparison:
- 3: robertzty/EgoNormia-Cosmos-Reason2-2B-v6b-shortcot as SFT only
- 4: the same checkpoint plus GRPO on the local EgoNormia 2-turn OpenEnv
Data Alignment
The environment is now aligned to the same metadata family used by the earlier SFT pipeline.
- Env metadata source of truth: final_data.json
- Heldout split: verified_split.json
- Heldout size: 200
- Env overlap with heldout: 200/200
Important note:
- train-norm-updated.parquet has only 1743 rows
- final_data.json has 1853 scenes
- the earlier SFT pipeline used final_data.json + verified_split, not the parquet-only view
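The 200/200 overlap figure can be re-checked with a few lines of Python. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that final_data.json is a dict keyed by scene ID and that verified_split.json stores the heldout IDs as a list under a key such as "heldout". Neither layout is confirmed by this log, so adjust `load_ids` to the real schema.

```python
import json
from pathlib import Path


def heldout_overlap(env_scene_ids, heldout_ids):
    """Split heldout IDs into those present in the env metadata and those missing."""
    env_ids = set(env_scene_ids)
    matched = [sid for sid in heldout_ids if sid in env_ids]
    missing = [sid for sid in heldout_ids if sid not in env_ids]
    return matched, missing


def load_ids(final_data_path, split_path, heldout_key="heldout"):
    # ASSUMPTION: final_data.json is a dict keyed by scene ID, and
    # verified_split.json holds a list of IDs under `heldout_key`.
    env_meta = json.loads(Path(final_data_path).read_text())
    split = json.loads(Path(split_path).read_text())
    return list(env_meta), list(split[heldout_key])


if __name__ == "__main__":
    # Toy IDs only; real usage would call load_ids(...) on the two files.
    matched, missing = heldout_overlap(["a", "b", "c"], ["a", "c", "z"])
    print(f"overlap = {len(matched)}/{len(matched) + len(missing)}, missing = {missing}")
```

A full overlap corresponds to an empty `missing` list; any ID that shows up there would point to a mismatch between the env metadata and the split file.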
SFT Baseline
Official/OpenEnv caveat
Do not directly compare the OpenEnv 2-turn metrics below with the earlier official-style SFT metrics such as:
78.5 / 88.5 / 70.5 / 0.6450 / 100.0
Those are a different evaluation protocol. The numbers below are from the OpenEnv 2-turn benchmark-mode evaluator.
Text-only env eval
Output:
Summary:
- accuracy = 0.555
- avg_reward = 0.4487
- avg_sensibility = 0.6900
- avg_taxonomy_match = 0.2995
- avg_justification_alignment = 0.3867
- avg_rubric_average = 0.5504
Image-conditioned env eval
Output:
Summary:
- model: robertzty/EgoNormia-Cosmos-Reason2-2B-v6b-shortcot
- accuracy = 0.685
- avg_reward = 0.5305
- avg_action_selection = 0.685
- avg_sensibility = 0.765
- avg_taxonomy_match = 0.3029
- avg_justification_alignment = 0.4800
- avg_rubric_average = 0.6205
This is the current SFT only baseline to use for OpenEnv.
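To make the gap between the text-only and image-conditioned summaries explicit, the reported numbers can be diffed metric by metric. This is a small sketch: the dict literals copy the values listed above, while `metric_deltas` is a hypothetical helper and not part of the eval script. The same helper could be pointed at the SFT only and GRPO summaries once the latter exists.

```python
# Values copied from the two OpenEnv 2-turn summaries in this log.
TEXT_ONLY = {
    "accuracy": 0.555,
    "avg_reward": 0.4487,
    "avg_sensibility": 0.6900,
    "avg_taxonomy_match": 0.2995,
    "avg_justification_alignment": 0.3867,
    "avg_rubric_average": 0.5504,
}
IMAGE_CONDITIONED = {
    "accuracy": 0.685,
    "avg_reward": 0.5305,
    "avg_action_selection": 0.685,
    "avg_sensibility": 0.765,
    "avg_taxonomy_match": 0.3029,
    "avg_justification_alignment": 0.4800,
    "avg_rubric_average": 0.6205,
}


def metric_deltas(baseline, candidate):
    """Return candidate - baseline for every metric present in both summaries."""
    shared = sorted(set(baseline) & set(candidate))
    return {name: round(candidate[name] - baseline[name], 4) for name in shared}


if __name__ == "__main__":
    for name, delta in metric_deltas(TEXT_ONLY, IMAGE_CONDITIONED).items():
        print(f"{name}: {delta:+.4f}")
```

Only metrics present in both summaries are compared, so avg_action_selection is skipped here because the text-only summary does not report it.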
GRPO Status
What was fixed
The GRPO training path in train_grpo_reason2.py was upgraded to:
- use real image-conditioned 2-turn observations instead of text-only prompts
- pack a full episode as one training example
- preserve environment-feedback masking via env_mask
- pass multimodal tensors needed by Qwen3VL / Cosmos-Reason2
- add mm_token_type_ids handling on the training path
- downscale training-time images with --image-max-edge to avoid H100 OOM on the very wide frame_all_prev.jpg / frame_all_during.jpg panoramas
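Two of the fixes above are easy to illustrate in isolation. The sketch below is not the code in train_grpo_reason2.py; both helper names are hypothetical. It assumes that --image-max-edge caps the longer image edge while keeping the aspect ratio, and that env_mask is 1 for model-generated tokens and 0 for environment-feedback tokens, so feedback tokens drop out of the loss average.

```python
def scaled_size(width, height, max_edge):
    """Shrink (width, height) so the longer edge is at most max_edge, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # already small enough, leave untouched
    scale = max_edge / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def masked_mean(per_token_loss, env_mask):
    """Average the loss over tokens whose mask is 1.

    ASSUMPTION: mask == 1 marks model-generated tokens and mask == 0 marks
    environment feedback that should not contribute to the gradient.
    """
    kept = [loss for loss, keep in zip(per_token_loss, env_mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


if __name__ == "__main__":
    # A very wide panorama shrinks a lot; the 3200x400 size is made up.
    print(scaled_size(3200, 400, 320))
    # Tokens 2 and 4 are treated as environment feedback and ignored.
    print(masked_mean([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 0]))
```

For very wide panoramas the pixel count falls with the square of the scale factor, which is presumably why capping the edge is enough to avoid the OOM.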
Smoke test result
A real 1-step GRPO smoke test completed successfully on the H100.
Run config:
- model: robertzty/EgoNormia-Cosmos-Reason2-2B-v6b-shortcot
- env mode: benchmark
- scene_limit = 1
- scene_repeats = 2
- num_generations = 2
- per_device_train_batch_size = 2
- gradient_accumulation_steps = 1
- max_steps = 1
- max_completion_length = 128
- image_max_edge = 320
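With num_generations = 2, each prompt contributes a group of only two completions, which has a direct effect on the GRPO learning signal. The sketch below shows the usual group-relative advantage (reward minus group mean, divided by group standard deviation). It is a simplified illustration: the function name, the `eps` value, and the use of the population standard deviation are assumptions, and the trainer's exact normalization may differ.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of sampled completions.

    ASSUMPTION: population std and eps=1e-4 are illustrative choices only.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Two completions with different rewards: one is pushed up, one down.
    print(group_advantages([0.2, 0.8]))
    # Identical rewards: both advantages are zero, so no learning signal.
    print(group_advantages([0.5, 0.5]))
```

With a group of two, the advantages are either roughly equal and opposite or exactly zero, so any scene where both samples receive the same reward contributes nothing to that step.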
Observed training summary:
- train_loss = 0.01375
- train_runtime = 138.4s
Artifacts:
- output dir: /tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu
- checkpoint: /tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu/checkpoint-1
- final weights: /tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu/model.safetensors
This means the TRL + OpenEnv + Cosmos-Reason2 training path is now operational.
Cosmos Predict
Cosmos Predict integration remains available as the world-model backend for train mode, but it is not the current training truth source.
Current positioning:
- benchmark mode: main training/eval truth path
- train mode + Cosmos Predict: rollout/demo/scaling path
Recommended Next Step
Run the real 4 experiment:
- v6b SFT only on the fixed 200-scene heldout
- v6b + GRPO on non-heldout EgoNormia, then evaluate on the same heldout
Suggested starting command:
cd /root/openenv-hack/egosocial_env
HF_HOME=/tmp/hf TRANSFORMERS_CACHE=/tmp/hf/hub \
.venv/bin/python scripts/train_grpo_reason2.py \
--model-id /tmp/hf/hub/models--robertzty--EgoNormia-Cosmos-Reason2-2B-v6b-shortcot/snapshots/358d42e154c403a07f3cc2bac9e4f17551146484 \
--trust-remote-code \
--env-mode benchmark \
--scene-limit 256 \
--scene-repeats 2 \
--num-generations 2 \
--per-device-train-batch-size 2 \
--gradient-accumulation-steps 1 \
--max-steps 30 \
--max-completion-length 128 \
--image-max-edge 320 \
--output-dir /tmp/egosocial_train/grpo_v6b_benchmark_run
Then evaluate the resulting checkpoint with:
cd /root/openenv-hack/egosocial_env
HF_HOME=/tmp/hf TRANSFORMERS_CACHE=/tmp/hf/hub \
.venv/bin/python scripts/eval_reason2.py \
--model-id /tmp/egosocial_train/grpo_v6b_benchmark_run \
--trust-remote-code \
--bf16 \
--env-mode benchmark \
--output-path /tmp/egosocial_eval/eval_grpo_v6b_verified_image_200.json
Active Run
An actual long GRPO benchmark run has been started.
Run directory:
Command:
cd /root/openenv-hack/egosocial_env
HF_HOME=/tmp/hf TRANSFORMERS_CACHE=/tmp/hf/hub \
.venv/bin/python scripts/train_grpo_reason2.py \
--model-id /tmp/hf/hub/models--robertzty--EgoNormia-Cosmos-Reason2-2B-v6b-shortcot/snapshots/358d42e154c403a07f3cc2bac9e4f17551146484 \
--trust-remote-code \
--bf16 \
--env-mode benchmark \
--scene-limit 0 \
--scene-repeats 2 \
--num-generations 2 \
--per-device-train-batch-size 2 \
--gradient-accumulation-steps 1 \
--max-steps 60 \
--max-completion-length 128 \
--image-max-edge 320 \
--output-dir /tmp/egosocial_train/grpo_v6b_benchmark_run
Observed status at launch:
- training loop entered successfully
- progress reached 0/60
- GPU visible on H100
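For a rough sense of how much data a 60-step run touches, the batch flags can be turned into a prompt count. This is a back-of-the-envelope sketch, not a statement about the script's internals: `prompts_per_step` is a hypothetical helper, and it assumes a single GPU and that each optimizer step consumes per_device_train_batch_size x gradient_accumulation_steps completions, grouped into sets of num_generations per prompt.

```python
def prompts_per_step(per_device_batch, grad_accum, num_generations, num_devices=1):
    """Estimate how many distinct prompts feed one optimizer step.

    ASSUMPTION: one step consumes per_device_batch * grad_accum * num_devices
    completions, and each prompt accounts for num_generations of them.
    """
    completions = per_device_batch * grad_accum * num_devices
    if completions % num_generations != 0:
        raise ValueError("completions per step must be divisible by num_generations")
    return completions // num_generations


if __name__ == "__main__":
    per_step = prompts_per_step(per_device_batch=2, grad_accum=1, num_generations=2)
    max_steps = 60
    print(f"prompts per step: {per_step}")
    print(f"prompt groups over the run: {per_step * max_steps}")
    print(f"completions over the run: {per_step * max_steps * 2}")
```

Under these assumptions the run sees one prompt group per step, so 60 steps would cover at most 60 prompt groups; raising gradient_accumulation_steps would be one way to widen each step.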
Known Remaining Gap
The OpenEnv 2-turn evaluator is now valid for the 3 vs 4 experiment, but it is still a different protocol from the earlier official-style SFT benchmark. If needed for reporting, add a separate official-style eval table rather than mixing the two protocols.