# Experiment Log

Last updated: 2026-03-08 UTC

## Scope

Current focus is the `3 vs 4` comparison:

- `3`: `robertzty/EgoNormia-Cosmos-Reason2-2B-v6b-shortcot` as `SFT only`
- `4`: the same checkpoint plus `GRPO` on the local `EgoNormia` 2-turn OpenEnv

## Data Alignment

The environment is now aligned to the same metadata family used by the earlier SFT pipeline.

- Env metadata source of truth: [final_data.json](/root/openenv-hack/egosocial_env/data/egonormia/final_data.json)
- Heldout split: [verified_split.json](/root/openenv-hack/egosocial_env/data/splits/verified_split.json)
- Heldout size: `200`
- Env overlap with heldout: `200/200`

Important note:

- `train-norm-updated.parquet` has only `1743` rows
- `final_data.json` has `1853` scenes
- the earlier SFT pipeline used `final_data.json + verified_split`, not the parquet-only view

## SFT Baseline

### Official/OpenEnv caveat

Do not directly compare the OpenEnv 2-turn metrics below with the earlier official-style SFT metrics such as:

- `78.5 / 88.5 / 70.5 / 0.6450 / 100.0`

Those are a different evaluation protocol. The numbers below are from the OpenEnv 2-turn benchmark-mode evaluator.

### Text-only env eval

Output:

- [/tmp/egosocial_eval/eval_v6b_verified_finaldata.json](/tmp/egosocial_eval/eval_v6b_verified_finaldata.json)

Summary:

- `accuracy = 0.555`
- `avg_reward = 0.4487`
- `avg_sensibility = 0.6900`
- `avg_taxonomy_match = 0.2995`
- `avg_justification_alignment = 0.3867`
- `avg_rubric_average = 0.5504`

### Image-conditioned env eval

Output:

- [/tmp/egosocial_eval/eval_v6b_verified_finaldata_image_200.json](/tmp/egosocial_eval/eval_v6b_verified_finaldata_image_200.json)

Summary:

- model: `robertzty/EgoNormia-Cosmos-Reason2-2B-v6b-shortcot`
- `accuracy = 0.685`
- `avg_reward = 0.5305`
- `avg_action_selection = 0.685`
- `avg_sensibility = 0.765`
- `avg_taxonomy_match = 0.3029`
- `avg_justification_alignment = 0.4800`
- `avg_rubric_average = 0.6205`

This is the current `SFT only` baseline to use for OpenEnv.

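The two summaries above can be lined up metric by metric. The following is a minimal sketch that hardcodes the values from this log rather than parsing the eval JSON files, since their schema is not described here. The helper name `metric_deltas` and the dict names are illustrative, not part of the repository.

```python
# Summary metrics copied by hand from the two eval summaries in this log.
# The on-disk JSON schema is not documented here, so nothing is parsed.
TEXT_ONLY = {
    "accuracy": 0.555,
    "avg_reward": 0.4487,
    "avg_sensibility": 0.6900,
    "avg_taxonomy_match": 0.2995,
    "avg_justification_alignment": 0.3867,
    "avg_rubric_average": 0.5504,
}
IMAGE_CONDITIONED = {
    "accuracy": 0.685,
    "avg_reward": 0.5305,
    "avg_action_selection": 0.685,
    "avg_sensibility": 0.765,
    "avg_taxonomy_match": 0.3029,
    "avg_justification_alignment": 0.4800,
    "avg_rubric_average": 0.6205,
}


def metric_deltas(baseline, candidate):
    """Return candidate - baseline for every metric present in both dicts."""
    shared = [name for name in baseline if name in candidate]
    return {name: round(candidate[name] - baseline[name], 4) for name in shared}


if __name__ == "__main__":
    # Metrics reported in only one summary (avg_action_selection) are skipped.
    for name, delta in metric_deltas(TEXT_ONLY, IMAGE_CONDITIONED).items():
        print(f"{name:<30} {delta:+.4f}")
```

The same helper could later take the `3` and `4` summaries once the GRPO eval exists, provided both come from the OpenEnv 2-turn evaluator.
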
## GRPO Status

### What was fixed

The GRPO training path in [train_grpo_reason2.py](/root/openenv-hack/egosocial_env/scripts/train_grpo_reason2.py) was upgraded to:

- use real image-conditioned 2-turn observations instead of text-only prompts
- pack a full episode as one training example
- preserve environment-feedback masking via `env_mask`
- pass multimodal tensors needed by `Qwen3VL/Cosmos-Reason2`
- add `mm_token_type_ids` handling on the training path
- downscale training-time images with `--image-max-edge` to avoid H100 OOM on the very wide `frame_all_prev.jpg` / `frame_all_during.jpg` panoramas

### Smoke test result

A real `1-step` GRPO smoke test completed successfully on the H100.

Run config:

- model: `robertzty/EgoNormia-Cosmos-Reason2-2B-v6b-shortcot`
- env mode: `benchmark`
- `scene_limit = 1`
- `scene_repeats = 2`
- `num_generations = 2`
- `per_device_train_batch_size = 2`
- `gradient_accumulation_steps = 1`
- `max_steps = 1`
- `max_completion_length = 128`
- `image_max_edge = 320`

Observed training summary:

- `train_loss = 0.01375`
- `train_runtime = 138.4s`

Artifacts:

- output dir: [/tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu](/tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu)
- checkpoint: [/tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu/checkpoint-1](/tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu/checkpoint-1)
- final weights: [/tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu/model.safetensors](/tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu/model.safetensors)

This means the `TRL + OpenEnv + Cosmos-Reason2` training path is now operational.

## Cosmos Predict

Cosmos Predict integration remains available as the world-model backend for `train` mode, but it is not the current training truth source.

Current positioning:

- `benchmark mode`: main training/eval truth path
- `train mode + Cosmos Predict`: rollout/demo/scaling path

## Recommended Next Step

Run the real `4` experiment:

- `v6b SFT only` on the fixed 200-scene heldout
- `v6b + GRPO` on non-heldout `EgoNormia`, then evaluate on the same heldout

Suggested starting command:

```bash
cd /root/openenv-hack/egosocial_env
HF_HOME=/tmp/hf TRANSFORMERS_CACHE=/tmp/hf/hub \
.venv/bin/python scripts/train_grpo_reason2.py \
  --model-id /tmp/hf/hub/models--robertzty--EgoNormia-Cosmos-Reason2-2B-v6b-shortcot/snapshots/358d42e154c403a07f3cc2bac9e4f17551146484 \
  --trust-remote-code \
  --env-mode benchmark \
  --scene-limit 256 \
  --scene-repeats 2 \
  --num-generations 2 \
  --per-device-train-batch-size 2 \
  --gradient-accumulation-steps 1 \
  --max-steps 30 \
  --max-completion-length 128 \
  --image-max-edge 320 \
  --output-dir /tmp/egosocial_train/grpo_v6b_benchmark_run
```

Then evaluate the resulting checkpoint with:

```bash
cd /root/openenv-hack/egosocial_env
HF_HOME=/tmp/hf TRANSFORMERS_CACHE=/tmp/hf/hub \
.venv/bin/python scripts/eval_reason2.py \
  --model-id /tmp/egosocial_train/grpo_v6b_benchmark_run \
  --trust-remote-code \
  --bf16 \
  --env-mode benchmark \
  --output-path /tmp/egosocial_eval/eval_grpo_v6b_verified_image_200.json
```

## Active Run

An actual long GRPO benchmark run has been started.

Run directory:

- [/tmp/egosocial_train/grpo_v6b_benchmark_run](/tmp/egosocial_train/grpo_v6b_benchmark_run)

Command:

```bash
cd /root/openenv-hack/egosocial_env
HF_HOME=/tmp/hf TRANSFORMERS_CACHE=/tmp/hf/hub \
.venv/bin/python scripts/train_grpo_reason2.py \
  --model-id /tmp/hf/hub/models--robertzty--EgoNormia-Cosmos-Reason2-2B-v6b-shortcot/snapshots/358d42e154c403a07f3cc2bac9e4f17551146484 \
  --trust-remote-code \
  --bf16 \
  --env-mode benchmark \
  --scene-limit 0 \
  --scene-repeats 2 \
  --num-generations 2 \
  --per-device-train-batch-size 2 \
  --gradient-accumulation-steps 1 \
  --max-steps 60 \
  --max-completion-length 128 \
  --image-max-edge 320 \
  --output-dir /tmp/egosocial_train/grpo_v6b_benchmark_run
```

Observed status at launch:

- training loop entered successfully
- progress reached `0/60`
- GPU visible on H100

## Known Remaining Gap

The OpenEnv 2-turn evaluator is now valid for the `3 vs 4` experiment, but it is still a different protocol from the earlier official-style SFT benchmark.

If needed for reporting, add a separate official-style eval table rather than mixing the two protocols.

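One way to keep the two protocols from being mixed in reporting code is to tag every result with its protocol and refuse cross-protocol comparisons. This is a sketch only: `EvalResult`, `compare`, and the protocol strings `openenv-2turn` and `official-style` are hypothetical names chosen for illustration, not identifiers from the repository. The official-style record is left without metrics because this log does not name the fields behind its figures.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvalResult:
    """One evaluation summary, tagged with the protocol that produced it."""
    label: str
    protocol: str  # hypothetical tags: "openenv-2turn" or "official-style"
    metrics: dict = field(default_factory=dict)


def compare(baseline, candidate):
    """Return candidate - baseline per shared metric; reject mixed protocols."""
    if baseline.protocol != candidate.protocol:
        raise ValueError(
            f"cannot compare {baseline.label!r} ({baseline.protocol}) with "
            f"{candidate.label!r} ({candidate.protocol}): different protocols"
        )
    shared = baseline.metrics.keys() & candidate.metrics.keys()
    return {k: candidate.metrics[k] - baseline.metrics[k] for k in sorted(shared)}


# Values taken from the image-conditioned SFT baseline in this log.
sft_openenv = EvalResult(
    label="3: SFT only",
    protocol="openenv-2turn",
    metrics={"accuracy": 0.685, "avg_reward": 0.5305},
)
# Metrics intentionally empty: the log does not name the official-style fields.
sft_official = EvalResult(label="SFT only (official)", protocol="official-style")

if __name__ == "__main__":
    try:
        compare(sft_official, sft_openenv)
    except ValueError as err:
        print(f"refused: {err}")
```

Once the `4` eval finishes, its summary could be wrapped in an `EvalResult` with the same `openenv-2turn` tag and passed to `compare` against `sft_openenv`.
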