
Experiment Log

Last updated: 2026-03-08 UTC

Scope

Current focus is the 3 vs 4 comparison:

  • 3: robertzty/EgoNormia-Cosmos-Reason2-2B-v6b-shortcot as SFT only
  • 4: the same checkpoint plus GRPO on the local EgoNormia 2-turn OpenEnv

Data Alignment

The environment is now aligned to the same metadata family used by the earlier SFT pipeline.

Important note:

  • train-norm-updated.parquet has only 1743 rows
  • final_data.json has 1853 scenes
  • the earlier SFT pipeline used final_data.json + verified_split, not the parquet-only view
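The size mismatch above can be checked mechanically. The following is a minimal sketch, assuming each view exposes some scene identifier; the toy IDs and the `coverage_gap` helper are hypothetical and not part of the repo.

```python
def coverage_gap(parquet_ids, json_ids):
    """Return the IDs that appear in only one of the two metadata views."""
    parquet_set, json_set = set(parquet_ids), set(json_ids)
    return {
        "only_in_json": sorted(json_set - parquet_set),
        "only_in_parquet": sorted(parquet_set - json_set),
    }


# Counts reported in this log. A parquet file with 1743 rows can cover at most
# 1743 distinct scenes, so at least this many JSON scenes are absent from it.
JSON_SCENES = 1853
PARQUET_ROWS = 1743
min_missing = JSON_SCENES - PARQUET_ROWS  # 110

# Toy IDs (hypothetical) to show the shape of the result.
gap = coverage_gap(["a", "b"], ["a", "b", "c"])
```

Running the same check on the real ID columns would show which scenes the parquet-only view drops, which matters when the heldout split is drawn from final_data.json.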

SFT Baseline

Official/OpenEnv caveat

Do not directly compare the OpenEnv 2-turn metrics below with the earlier official-style SFT metrics such as:

  • 78.5 / 88.5 / 70.5 / 0.6450 / 100.0

Those come from a different evaluation protocol. The numbers below are from the OpenEnv 2-turn benchmark-mode evaluator.

Text-only env eval

Output:

Summary:

  • accuracy = 0.555
  • avg_reward = 0.4487
  • avg_sensibility = 0.6900
  • avg_taxonomy_match = 0.2995
  • avg_justification_alignment = 0.3867
  • avg_rubric_average = 0.5504

Image-conditioned env eval

Output:

Summary:

  • model: robertzty/EgoNormia-Cosmos-Reason2-2B-v6b-shortcot
  • accuracy = 0.685
  • avg_reward = 0.5305
  • avg_action_selection = 0.685
  • avg_sensibility = 0.765
  • avg_taxonomy_match = 0.3029
  • avg_justification_alignment = 0.4800
  • avg_rubric_average = 0.6205

This is the current SFT-only baseline to use for OpenEnv.
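To make the gap between the two evaluation modes explicit, the reported summaries can be differenced per metric. This sketch uses only the numbers listed in this log; the variable names are illustrative.

```python
# Summaries copied from the two OpenEnv 2-turn evals reported in this log.
text_only = {
    "accuracy": 0.555,
    "avg_reward": 0.4487,
    "avg_sensibility": 0.6900,
    "avg_taxonomy_match": 0.2995,
    "avg_justification_alignment": 0.3867,
    "avg_rubric_average": 0.5504,
}
image_conditioned = {
    "accuracy": 0.685,
    "avg_reward": 0.5305,
    "avg_sensibility": 0.765,
    "avg_taxonomy_match": 0.3029,
    "avg_justification_alignment": 0.4800,
    "avg_rubric_average": 0.6205,
}

# Image-conditioned minus text-only, for metrics reported in both runs.
deltas = {
    name: round(image_conditioned[name] - text_only[name], 4)
    for name in text_only
    if name in image_conditioned
}

for name, delta in deltas.items():
    print(f"{name:32s} {delta:+.4f}")
```

Accuracy moves by 0.13 while avg_taxonomy_match moves by only 0.0034, so the images mainly help action selection and justification, not taxonomy matching.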

GRPO Status

What was fixed

The GRPO training path in train_grpo_reason2.py was upgraded to:

  • use real image-conditioned 2-turn observations instead of text-only prompts
  • pack a full episode as one training example
  • preserve environment-feedback masking via env_mask
  • pass multimodal tensors needed by Qwen3VL/Cosmos-Reason2
  • add mm_token_type_ids handling on the training path
  • downscale training-time images with --image-max-edge to avoid H100 OOM on the very wide frame_all_prev.jpg / frame_all_during.jpg panoramas
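The episode packing and env_mask items above can be illustrated with a small sketch. It assumes a convention where the mask is 1 for model-generated tokens and 0 for environment feedback; the actual polarity and tensor layout in train_grpo_reason2.py are not shown in this log, and `pack_episode` and `masked_mean` are hypothetical helpers.

```python
def pack_episode(segments):
    """Flatten (token_ids, from_model) segments into one sequence plus a mask.

    The mask is 1 for tokens the policy generated and 0 for environment
    feedback, so a masked loss only trains on the policy's own output.
    """
    input_ids, env_mask = [], []
    for token_ids, from_model in segments:
        input_ids.extend(token_ids)
        env_mask.extend([1 if from_model else 0] * len(token_ids))
    return input_ids, env_mask


def masked_mean(per_token_loss, mask):
    """Average the per-token loss over unmasked (mask == 1) positions only."""
    total = sum(loss * m for loss, m in zip(per_token_loss, mask))
    count = sum(mask)
    return total / count if count else 0.0


# A toy 2-turn episode with made-up token IDs.
episode = [
    ([11, 12, 13], True),  # turn 1: model answer
    ([90, 91], False),     # environment feedback between turns
    ([21, 22], True),      # turn 2: model revision
]
ids, mask = pack_episode(episode)
loss = masked_mean([1.0, 1.0, 1.0, 5.0, 5.0, 2.0, 2.0], mask)
```

Packing both turns into one example keeps the turn-2 tokens conditioned on the feedback, while the mask keeps the feedback tokens out of the gradient.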

Smoke test result

A real 1-step GRPO smoke test completed successfully on the H100.

Run config:

  • model: robertzty/EgoNormia-Cosmos-Reason2-2B-v6b-shortcot
  • env mode: benchmark
  • scene_limit = 1
  • scene_repeats = 2
  • num_generations = 2
  • per_device_train_batch_size = 2
  • gradient_accumulation_steps = 1
  • max_steps = 1
  • max_completion_length = 128
  • image_max_edge = 320
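For intuition on image_max_edge = 320, here is a sketch of the size computation, assuming the flag caps the longer edge and preserves aspect ratio. The real resizing code is not shown in this log, and the 3200x400 panorama size is a made-up example.

```python
def scaled_size(width, height, max_edge):
    """Return (width, height) with the longer edge capped at max_edge.

    Aspect ratio is preserved. Images already within the limit, or a
    non-positive max_edge, leave the size unchanged.
    """
    longest = max(width, height)
    if max_edge <= 0 or longest <= max_edge:
        return width, height
    new_width = max(1, round(width * max_edge / longest))
    new_height = max(1, round(height * max_edge / longest))
    return new_width, new_height


# Hypothetical wide panorama: the long edge shrinks to 320, the short edge to 40.
panorama = scaled_size(3200, 400, 320)
small = scaled_size(256, 128, 320)
```

For very wide panoramas the short edge becomes tiny under this scheme, so the edge limit trades visual detail for memory.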

Observed training summary:

  • train_loss = 0.01375
  • train_runtime = 138.4s

Artifacts:

This means the TRL + OpenEnv + Cosmos-Reason2 training path is now operational.

Cosmos Predict

Cosmos Predict integration remains available as the world-model backend for train mode, but it is not the current training truth source.

Current positioning:

  • benchmark mode: main training/eval truth path
  • train mode + Cosmos Predict: rollout/demo/scaling path

Recommended Next Step

Run the real 3 vs 4 experiment:

  • v6b SFT only on the fixed 200-scene heldout
  • v6b + GRPO on non-heldout EgoNormia, then evaluate on the same heldout
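When reading the 3 vs 4 result, it helps to know how much accuracy can move from sampling noise alone on 200 scenes. This is a rough sketch using the binomial standard error, assuming the 0.685 image-conditioned baseline was measured on the same 200-scene heldout (the log does not state this explicitly).

```python
import math


def accuracy_stderr(p, n):
    """Binomial standard error of an accuracy p measured on n independent items."""
    return math.sqrt(p * (1.0 - p) / n)


HELDOUT_SCENES = 200   # fixed heldout size named above
sft_accuracy = 0.685   # image-conditioned SFT-only accuracy from this log

se = accuracy_stderr(sft_accuracy, HELDOUT_SCENES)
half_width_95 = 1.96 * se  # normal-approximation 95% half-width

print(f"stderr = {se:.4f}, 95% half-width = {half_width_95:.4f}")
```

Under these assumptions the half-width is roughly six accuracy points, so a small GRPO gain on this heldout should be reported with its uncertainty. Because both models are scored on the same scenes, a paired comparison would be tighter than this unpaired estimate.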

Suggested starting command:

cd /root/openenv-hack/egosocial_env
HF_HOME=/tmp/hf TRANSFORMERS_CACHE=/tmp/hf/hub \
.venv/bin/python scripts/train_grpo_reason2.py \
  --model-id /tmp/hf/hub/models--robertzty--EgoNormia-Cosmos-Reason2-2B-v6b-shortcot/snapshots/358d42e154c403a07f3cc2bac9e4f17551146484 \
  --trust-remote-code \
  --env-mode benchmark \
  --scene-limit 256 \
  --scene-repeats 2 \
  --num-generations 2 \
  --per-device-train-batch-size 2 \
  --gradient-accumulation-steps 1 \
  --max-steps 30 \
  --max-completion-length 128 \
  --image-max-edge 320 \
  --output-dir /tmp/egosocial_train/grpo_v6b_benchmark_run

Then evaluate the resulting checkpoint with:

cd /root/openenv-hack/egosocial_env
HF_HOME=/tmp/hf TRANSFORMERS_CACHE=/tmp/hf/hub \
.venv/bin/python scripts/eval_reason2.py \
  --model-id /tmp/egosocial_train/grpo_v6b_benchmark_run \
  --trust-remote-code \
  --bf16 \
  --env-mode benchmark \
  --output-path /tmp/egosocial_eval/eval_grpo_v6b_verified_image_200.json

Active Run

A longer GRPO benchmark run (60 steps) has been started.

Run directory:

Command:

cd /root/openenv-hack/egosocial_env
HF_HOME=/tmp/hf TRANSFORMERS_CACHE=/tmp/hf/hub \
.venv/bin/python scripts/train_grpo_reason2.py \
  --model-id /tmp/hf/hub/models--robertzty--EgoNormia-Cosmos-Reason2-2B-v6b-shortcot/snapshots/358d42e154c403a07f3cc2bac9e4f17551146484 \
  --trust-remote-code \
  --bf16 \
  --env-mode benchmark \
  --scene-limit 0 \
  --scene-repeats 2 \
  --num-generations 2 \
  --per-device-train-batch-size 2 \
  --gradient-accumulation-steps 1 \
  --max-steps 60 \
  --max-completion-length 128 \
  --image-max-edge 320 \
  --output-dir /tmp/egosocial_train/grpo_v6b_benchmark_run

Observed status at launch:

  • training loop entered successfully
  • progress reached 0/60
  • H100 GPU visible to the training process
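The run's data budget follows from the flags in the command above. This sketch assumes a single GPU and the common GRPO convention that the per-device batch counts completions, so each step sees batch / num_generations distinct prompts; the exact accounting depends on the TRL version and on how the script applies --scene-repeats, neither of which is confirmed here.

```python
def grpo_budget(per_device_batch, grad_accum, num_generations, max_steps, num_devices=1):
    """Estimate prompts and completions consumed by a GRPO run.

    Assumes the effective batch is measured in completions and must divide
    evenly into groups of num_generations completions per prompt.
    """
    completions_per_step = per_device_batch * grad_accum * num_devices
    if completions_per_step % num_generations:
        raise ValueError("effective batch must be divisible by num_generations")
    prompts_per_step = completions_per_step // num_generations
    return {
        "prompts_per_step": prompts_per_step,
        "total_prompts": prompts_per_step * max_steps,
        "total_completions": completions_per_step * max_steps,
    }


# Flags from the active run: batch 2, accumulation 1, 2 generations, 60 steps.
budget = grpo_budget(per_device_batch=2, grad_accum=1, num_generations=2, max_steps=60)
```

Under these assumptions the run sees about 60 prompt groups and 120 completions, a small fraction of the 1853 scenes in final_data.json, so it is better read as a first signal than as a converged GRPO result.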

Known Remaining Gap

The OpenEnv 2-turn evaluator is now valid for the 3 vs 4 experiment, but it is still a different protocol from the earlier official-style SFT benchmark. If needed for reporting, add a separate official-style eval table rather than mixing the two protocols.