# Experiment Log

Last updated: 2026-03-08 UTC

## Scope

Current focus is the `3 vs 4` comparison:

- `3`: `robertzty/EgoNormia-Cosmos-Reason2-2B-v6b-shortcot` as `SFT only`
- `4`: the same checkpoint plus `GRPO` on the local `EgoNormia` 2-turn OpenEnv

## Data Alignment

The environment is now aligned to the same metadata family used by the earlier SFT pipeline.

- Env metadata source of truth: [final_data.json](/root/openenv-hack/egosocial_env/data/egonormia/final_data.json)
- Heldout split: [verified_split.json](/root/openenv-hack/egosocial_env/data/splits/verified_split.json)
- Heldout size: `200`
- Env overlap with heldout: `200/200`
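
The `200/200` overlap figure can be re-checked whenever either file changes. The sketch below is a minimal version of that check; it works on plain ID lists because the schemas of `final_data.json` and `verified_split.json` are not shown in this log, and the `scene_XXXX` IDs and the `heldout_coverage` helper are hypothetical stand-ins, not names from the repo.

```python
def heldout_coverage(env_ids, heldout_ids):
    """Return (covered, total, missing) for heldout IDs found in the env metadata."""
    env_set = set(env_ids)
    heldout_set = set(heldout_ids)
    missing = sorted(heldout_set - env_set)
    covered = len(heldout_set) - len(missing)
    return covered, len(heldout_set), missing


# Synthetic IDs standing in for the real files (assumption: one ID per scene).
env_ids = [f"scene_{i:04d}" for i in range(1853)]
heldout_ids = [f"scene_{i:04d}" for i in range(0, 1000, 5)]  # 200 IDs

covered, total, missing = heldout_coverage(env_ids, heldout_ids)
print(f"Env overlap with heldout: {covered}/{total}")
if missing:
    print("Heldout IDs absent from env metadata:", missing[:10])
```

Anything less than full coverage would mean some heldout scenes cannot be evaluated in the env, so the `missing` list is worth printing rather than only the ratio.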

Important note:

- `train-norm-updated.parquet` has only `1743` rows
- `final_data.json` has `1853` scenes, `110` more than the parquet has rows
- the earlier SFT pipeline used `final_data.json + verified_split`, not the parquet-only view

## SFT Baseline

### Official/OpenEnv caveat

Do not directly compare the OpenEnv 2-turn metrics below with the earlier official-style SFT metrics such as:

- `78.5 / 88.5 / 70.5 / 0.6450 / 100.0`

Those come from a different evaluation protocol. The numbers below are from the OpenEnv 2-turn benchmark-mode evaluator.

### Text-only env eval

Output:

- [/tmp/egosocial_eval/eval_v6b_verified_finaldata.json](/tmp/egosocial_eval/eval_v6b_verified_finaldata.json)

Summary:

- `accuracy = 0.555`
- `avg_reward = 0.4487`
- `avg_sensibility = 0.6900`
- `avg_taxonomy_match = 0.2995`
- `avg_justification_alignment = 0.3867`
- `avg_rubric_average = 0.5504`

### Image-conditioned env eval

Output:

- [/tmp/egosocial_eval/eval_v6b_verified_finaldata_image_200.json](/tmp/egosocial_eval/eval_v6b_verified_finaldata_image_200.json)

Summary:

- model: `robertzty/EgoNormia-Cosmos-Reason2-2B-v6b-shortcot`
- `accuracy = 0.685`
- `avg_reward = 0.5305`
- `avg_action_selection = 0.685`
- `avg_sensibility = 0.765`
- `avg_taxonomy_match = 0.3029`
- `avg_justification_alignment = 0.4800`
- `avg_rubric_average = 0.6205`

This is the current `SFT only` baseline to use for OpenEnv.
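
To make the gap between the two SFT evals explicit, the summaries above can be diffed metric by metric. The values below are copied from this log; the `metric_deltas` helper is illustrative only and is not part of the eval scripts.

```python
# Summary values copied from the two eval sections above.
text_only = {
    "accuracy": 0.555,
    "avg_reward": 0.4487,
    "avg_sensibility": 0.6900,
    "avg_taxonomy_match": 0.2995,
    "avg_justification_alignment": 0.3867,
    "avg_rubric_average": 0.5504,
}
image_conditioned = {
    "accuracy": 0.685,
    "avg_reward": 0.5305,
    "avg_action_selection": 0.685,
    "avg_sensibility": 0.765,
    "avg_taxonomy_match": 0.3029,
    "avg_justification_alignment": 0.4800,
    "avg_rubric_average": 0.6205,
}


def metric_deltas(baseline, candidate):
    """Return candidate minus baseline for every metric present in both."""
    return {
        name: round(candidate[name] - baseline[name], 4)
        for name in baseline
        if name in candidate
    }


deltas = metric_deltas(text_only, image_conditioned)
for name, delta in deltas.items():
    print(f"{name:32s} {delta:+.4f}")
```

The same helper can be reused for the `3 vs 4` comparison once the GRPO eval summary exists, as long as both summaries come from the OpenEnv 2-turn evaluator.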

## GRPO Status

### What was fixed

The GRPO training path in [train_grpo_reason2.py](/root/openenv-hack/egosocial_env/scripts/train_grpo_reason2.py) was upgraded to:

- use real image-conditioned 2-turn observations instead of text-only prompts
- pack a full episode as one training example
- preserve environment-feedback masking via `env_mask`
- pass multimodal tensors needed by `Qwen3VL/Cosmos-Reason2`
- add `mm_token_type_ids` handling on the training path
- downscale training-time images with `--image-max-edge` to avoid H100 OOM on the very wide `frame_all_prev.jpg` / `frame_all_during.jpg` panoramas
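
The role of `env_mask` in the list above can be illustrated with a small, framework-free sketch. It assumes the convention that a mask value of `1` marks tokens the policy generated and `0` marks tokens injected by the environment (for example the feedback between turn 1 and turn 2); the actual convention and loss in `train_grpo_reason2.py` may differ, and `masked_mean_loss` is a hypothetical helper.

```python
def masked_mean_loss(token_losses, env_mask):
    """Average per-token loss over policy-generated tokens only.

    Assumed convention: env_mask[i] == 1 for tokens the model produced,
    0 for tokens inserted by the environment.
    """
    if len(token_losses) != len(env_mask):
        raise ValueError("token_losses and env_mask must have the same length")
    kept = [loss for loss, keep in zip(token_losses, env_mask) if keep == 1]
    if not kept:
        return 0.0  # nothing to train on in this episode
    return sum(kept) / len(kept)


# One packed 2-turn episode: turn-1 answer, env feedback, turn-2 answer.
token_losses = [0.5, 0.25, 0.75, 4.0, 4.0, 1.0, 0.5]
env_mask = [1, 1, 1, 0, 0, 1, 1]

loss = masked_mean_loss(token_losses, env_mask)
unmasked = sum(token_losses) / len(token_losses)
print(f"masked loss = {loss:.4f}, unmasked loss = {unmasked:.4f}")
```

Without the mask, the high-loss feedback tokens would dominate the average and the policy would be pushed to predict text it never wrote.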

### Smoke test result

A real `1-step` GRPO smoke test completed successfully on the H100.

Run config:

- model: `robertzty/EgoNormia-Cosmos-Reason2-2B-v6b-shortcot`
- env mode: `benchmark`
- `scene_limit = 1`
- `scene_repeats = 2`
- `num_generations = 2`
- `per_device_train_batch_size = 2`
- `gradient_accumulation_steps = 1`
- `max_steps = 1`
- `max_completion_length = 128`
- `image_max_edge = 320`
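
For intuition about what `image_max_edge = 320` does to a very wide frame, here is a sketch of an aspect-preserving resize calculation. It is an assumption about how the flag behaves (cap the longest edge, keep the aspect ratio, never upscale), and the `3840x480` panorama size is a made-up example, not a measured dimension of `frame_all_prev.jpg`.

```python
def fit_max_edge(width, height, max_edge):
    """Return (width, height) scaled so the longest edge is at most max_edge."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # already small enough; do not upscale
    scale = max_edge / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


# Hypothetical wide panorama, capped at the smoke-test setting.
original = (3840, 480)
resized = fit_max_edge(*original, max_edge=320)
pixel_ratio = (original[0] * original[1]) / (resized[0] * resized[1])
print(f"{original} -> {resized}, about {pixel_ratio:.0f}x fewer pixels")
```

Because vision-token count grows with pixel area, capping the long edge on panoramas cuts memory far more than it would on near-square frames.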

Observed training summary:

- `train_loss = 0.01375`
- `train_runtime = 138.4s`

Artifacts:

- output dir: [/tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu](/tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu)
- checkpoint: [/tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu/checkpoint-1](/tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu/checkpoint-1)
- final weights: [/tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu/model.safetensors](/tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu/model.safetensors)

This means the `TRL + OpenEnv + Cosmos-Reason2` training path is now operational.

## Cosmos Predict

Cosmos Predict integration remains available as the world-model backend for `train` mode, but it is not the current training truth source.

Current positioning:

- `benchmark mode`: main training/eval truth path
- `train mode + Cosmos Predict`: rollout/demo/scaling path

## Recommended Next Step

Run the real `3 vs 4` experiment:

- `v6b SFT only` on the fixed 200-scene heldout
- `v6b + GRPO` on non-heldout `EgoNormia`, then evaluate on the same heldout
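
Keeping the GRPO training pool disjoint from the heldout is what makes the comparison fair. A minimal sketch of that partition follows; it uses synthetic IDs, and `split_scene_ids` is a hypothetical helper rather than a function from the repo, which may already enforce the split elsewhere.

```python
def split_scene_ids(all_ids, heldout_ids):
    """Partition scene IDs into (train_pool, eval_pool) with no overlap."""
    heldout = set(heldout_ids)
    train_pool = [sid for sid in all_ids if sid not in heldout]
    eval_pool = [sid for sid in all_ids if sid in heldout]
    return train_pool, eval_pool


# Synthetic stand-ins sized like the counts reported in Data Alignment.
all_ids = [f"scene_{i:04d}" for i in range(1853)]
heldout_ids = [f"scene_{i:04d}" for i in range(0, 1000, 5)]  # 200 IDs

train_pool, eval_pool = split_scene_ids(all_ids, heldout_ids)
assert not set(train_pool) & set(eval_pool), "heldout leaked into training"
print(f"train pool: {len(train_pool)}, eval pool: {len(eval_pool)}")
```

With `1853` scenes and a fully covered `200`-scene heldout, this leaves `1653` scenes available for GRPO.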

Suggested starting command:

```bash
cd /root/openenv-hack/egosocial_env
HF_HOME=/tmp/hf TRANSFORMERS_CACHE=/tmp/hf/hub \
.venv/bin/python scripts/train_grpo_reason2.py \
  --model-id /tmp/hf/hub/models--robertzty--EgoNormia-Cosmos-Reason2-2B-v6b-shortcot/snapshots/358d42e154c403a07f3cc2bac9e4f17551146484 \
  --trust-remote-code \
  --env-mode benchmark \
  --scene-limit 256 \
  --scene-repeats 2 \
  --num-generations 2 \
  --per-device-train-batch-size 2 \
  --gradient-accumulation-steps 1 \
  --max-steps 30 \
  --max-completion-length 128 \
  --image-max-edge 320 \
  --output-dir /tmp/egosocial_train/grpo_v6b_benchmark_run
```

Then evaluate the resulting checkpoint with:

```bash
cd /root/openenv-hack/egosocial_env
HF_HOME=/tmp/hf TRANSFORMERS_CACHE=/tmp/hf/hub \
.venv/bin/python scripts/eval_reason2.py \
  --model-id /tmp/egosocial_train/grpo_v6b_benchmark_run \
  --trust-remote-code \
  --bf16 \
  --env-mode benchmark \
  --output-path /tmp/egosocial_eval/eval_grpo_v6b_verified_image_200.json
```
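
When reading the resulting accuracy against the `0.685` image-conditioned baseline, it helps to know roughly how much noise a 200-scene heldout carries. The sketch below uses a plain binomial approximation, which assumes scenes are independent and scored as hit or miss; it is a rough guide, not part of the evaluator.

```python
import math


def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy estimated from n examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


baseline_accuracy = 0.685  # image-conditioned SFT baseline from this log
heldout_size = 200

se = accuracy_standard_error(baseline_accuracy, heldout_size)
half_width = 1.96 * se  # approximate 95% interval half-width
print(f"standard error = {se:.4f}, 95% half-width = {half_width:.4f}")
```

Under these assumptions, a GRPO gain of only a few accuracy points falls within single-run noise. Since both arms are scored on the same scenes, a paired per-scene comparison would give a tighter read than comparing the two aggregate numbers.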

## Active Run

A longer (`60`-step) GRPO benchmark run has been started.

Run directory:

- [/tmp/egosocial_train/grpo_v6b_benchmark_run](/tmp/egosocial_train/grpo_v6b_benchmark_run)

Command:

```bash
cd /root/openenv-hack/egosocial_env
HF_HOME=/tmp/hf TRANSFORMERS_CACHE=/tmp/hf/hub \
.venv/bin/python scripts/train_grpo_reason2.py \
  --model-id /tmp/hf/hub/models--robertzty--EgoNormia-Cosmos-Reason2-2B-v6b-shortcot/snapshots/358d42e154c403a07f3cc2bac9e4f17551146484 \
  --trust-remote-code \
  --bf16 \
  --env-mode benchmark \
  --scene-limit 0 \
  --scene-repeats 2 \
  --num-generations 2 \
  --per-device-train-batch-size 2 \
  --gradient-accumulation-steps 1 \
  --max-steps 60 \
  --max-completion-length 128 \
  --image-max-edge 320 \
  --output-dir /tmp/egosocial_train/grpo_v6b_benchmark_run
```

Observed status at launch:

- training loop entered successfully
- progress bar shows `0/60` (no optimizer step completed yet)
- H100 GPU visible to the training process

## Known Remaining Gap

The OpenEnv 2-turn evaluator is now valid for the `3 vs 4` experiment, but it is still a different protocol from the earlier official-style SFT benchmark. If needed for reporting, add a separate official-style eval table rather than mixing the two protocols.