train-norm-updated.parquet has only 1743 rows
final_data.json has 1853 scenes
the earlier SFT pipeline used final_data.json + verified_split, not the parquet-only view
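
The gap between the two sources can be checked by comparing scene IDs. This is a minimal sketch, assuming each source can be reduced to a collection of scene ID strings and that the parquet file holds one row per scene. The function name and the toy IDs are hypothetical, and file loading is left out because neither schema is shown here.

```python
def missing_scene_ids(parquet_ids, json_ids):
    """Return scene IDs present in final_data.json but absent from the parquet view."""
    return sorted(set(json_ids) - set(parquet_ids))


# Toy IDs for illustration only (not real scene IDs).
json_ids = ["scene_a", "scene_b", "scene_c"]
parquet_ids = ["scene_a", "scene_c"]
print(missing_scene_ids(parquet_ids, json_ids))  # ['scene_b']

# Counts reported above: under the one-row-per-scene assumption,
# the parquet view is short by this many scenes.
print(1853 - 1743)  # 110
```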

SFT Baseline

Official/OpenEnv caveat

Do not directly compare the OpenEnv 2-turn metrics below with the earlier official-style SFT metrics such as:

78.5 / 88.5 / 70.5 / 0.6450 / 100.0

Those numbers come from a different evaluation protocol. The numbers below come from the OpenEnv 2-turn benchmark-mode evaluator.

Text-only env eval

Output:

/tmp/egosocial_eval/eval_v6b_verified_finaldata.json

Summary:

accuracy = 0.555
avg_reward = 0.4487
avg_sensibility = 0.6900
avg_taxonomy_match = 0.2995
avg_justification_alignment = 0.3867
avg_rubric_average = 0.5504

Image-conditioned env eval

Output:

/tmp/egosocial_eval/eval_v6b_verified_finaldata_image_200.json

Summary:

model: robertzty/EgoNormia-Cosmos-Reason2-2B-v6b-shortcot
accuracy = 0.685
avg_reward = 0.5305
avg_action_selection = 0.685
avg_sensibility = 0.765
avg_taxonomy_match = 0.3029
avg_justification_alignment = 0.4800
avg_rubric_average = 0.6205

This is the current SFT-only baseline to use for OpenEnv.
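
The two summaries can be lined up metric by metric. The sketch below copies the values reported above and prints the image-conditioned minus text-only difference for each metric both runs report. It assumes the two evals cover the same scenes, which the output filenames do not confirm. The helper name metric_deltas is hypothetical.

```python
# Summary values copied from the two evals above.
text_only = {
    "accuracy": 0.555,
    "avg_reward": 0.4487,
    "avg_sensibility": 0.6900,
    "avg_taxonomy_match": 0.2995,
    "avg_justification_alignment": 0.3867,
    "avg_rubric_average": 0.5504,
}
image_conditioned = {
    "accuracy": 0.685,
    "avg_reward": 0.5305,
    "avg_action_selection": 0.685,
    "avg_sensibility": 0.765,
    "avg_taxonomy_match": 0.3029,
    "avg_justification_alignment": 0.4800,
    "avg_rubric_average": 0.6205,
}


def metric_deltas(baseline, candidate):
    """Return candidate minus baseline for every metric present in both runs."""
    shared = sorted(set(baseline) & set(candidate))
    return {name: round(candidate[name] - baseline[name], 4) for name in shared}


for name, delta in metric_deltas(text_only, image_conditioned).items():
    print(f"{name}: {delta:+.4f}")
```

avg_action_selection is skipped because the text-only summary does not report it.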

GRPO Status

What was fixed

The GRPO training path in train_grpo_reason2.py was upgraded to:

use real image-conditioned 2-turn observations instead of text-only prompts
pack a full episode as one training example
preserve environment-feedback masking via env_mask
pass multimodal tensors needed by Qwen3VL/Cosmos-Reason2
add mm_token_type_ids handling on the training path
downscale training-time images with --image-max-edge to avoid H100 OOM on the very wide frame_all_prev.jpg / frame_all_during.jpg panoramas
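
The image downscaling step can be illustrated with a small size calculation. This is a sketch of one plausible reading of --image-max-edge: cap the longer edge and preserve the aspect ratio. It is not taken from train_grpo_reason2.py, and the function name scaled_size and the example dimensions are hypothetical.

```python
def scaled_size(width, height, max_edge):
    """Return (width, height) shrunk so the longer edge is at most max_edge.

    The aspect ratio is preserved. Images already within the limit are
    returned unchanged, so nothing is ever upscaled.
    """
    if max_edge <= 0:
        raise ValueError("max_edge must be positive")
    longest = max(width, height)
    if longest <= max_edge:
        return width, height
    scale = max_edge / longest
    # Clamp to 1 pixel so a very wide panorama never collapses to zero height.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(scaled_size(3200, 400, 320))  # (320, 40)
print(scaled_size(300, 200, 320))   # (300, 200)
```

For a wide panorama the width is the limiting edge, so the pixel count passed to the vision encoder drops roughly with the square of the scale factor.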

Smoke test result

A real 1-step GRPO smoke test completed successfully on the H100.

Run config:

model: robertzty/EgoNormia-Cosmos-Reason2-2B-v6b-shortcot
env mode: benchmark
scene_limit = 1
scene_repeats = 2
num_generations = 2
per_device_train_batch_size = 2
gradient_accumulation_steps = 1
max_steps = 1
max_completion_length = 128
image_max_edge = 320
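
The batch settings above imply a small amount of arithmetic. The sketch assumes the script forwards these values unchanged to TRL's GRPO trainer on a single process, where completions in a step are grouped by prompt and the step's completion count must be divisible by num_generations. The helper name prompts_per_step is hypothetical.

```python
def prompts_per_step(per_device_batch, grad_accum, num_generations, num_processes=1):
    """Unique prompts per optimizer step when each prompt gets num_generations completions."""
    completions = per_device_batch * grad_accum * num_processes
    if completions % num_generations != 0:
        raise ValueError("completions per step must be divisible by num_generations")
    return completions // num_generations


# Smoke-test settings: batch size 2, no accumulation, 2 generations per prompt.
print(prompts_per_step(2, 1, 2))  # 1
```

Under these assumptions each step scores two completions of a single prompt, which matches a smoke test restricted to one scene.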

Observed training summary:

train_loss = 0.01375
train_runtime = 138.4s

Artifacts:

output dir: /tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu
checkpoint: /tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu/checkpoint-1
final weights: /tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu/model.safetensors

This means the TRL + OpenEnv + Cosmos-Reason2 training path is now operational.

Cosmos Predict

Cosmos Predict integration remains available as the world-model backend for train mode, but it is not the current source of truth for training.

Current positioning:

benchmark mode: main training/eval truth path
train mode + Cosmos Predict: rollout/demo/scaling path

Recommended Next Step

Run the real 3 vs 4 experiment:

v6b SFT only on the fixed 200-scene heldout
v6b + GRPO on non-heldout EgoNormia, then evaluate on the same heldout

Suggested starting command:

cd /root/openenv-hack/egosocial_env
HF_HOME=/tmp/hf TRANSFORMERS_CACHE=/tmp/hf/hub \
.venv/bin/python scripts/train_grpo_reason2.py \
  --model-id /tmp/hf/hub/models--robertzty--EgoNormia-Cosmos-Reason2-2B-v6b-shortcot/snapshots/358d42e154c403a07f3cc2bac9e4f17551146484 \
  --trust-remote-code \
  --env-mode benchmark \
  --scene-limit 256 \
  --scene-repeats 2 \
  --num-generations 2 \
  --per-device-train-batch-size 2 \
  --gradient-accumulation-steps 1 \
  --max-steps 30 \
  --max-completion-length 128 \
  --image-max-edge 320 \
  --output-dir /tmp/egosocial_train/grpo_v6b_benchmark_run

Then evaluate the resulting checkpoint with:

cd /root/openenv-hack/egosocial_env
HF_HOME=/tmp/hf TRANSFORMERS_CACHE=/tmp/hf/hub \
.venv/bin/python scripts/eval_reason2.py \
  --model-id /tmp/egosocial_train/grpo_v6b_benchmark_run \
  --trust-remote-code \
  --bf16 \
  --env-mode benchmark \
  --output-path /tmp/egosocial_eval/eval_grpo_v6b_verified_image_200.json

Active Run

A full-length GRPO benchmark run has been started.

Run directory:

/tmp/egosocial_train/grpo_v6b_benchmark_run

Command:

cd /root/openenv-hack/egosocial_env
HF_HOME=/tmp/hf TRANSFORMERS_CACHE=/tmp/hf/hub \
.venv/bin/python scripts/train_grpo_reason2.py \
  --model-id /tmp/hf/hub/models--robertzty--EgoNormia-Cosmos-Reason2-2B-v6b-shortcot/snapshots/358d42e154c403a07f3cc2bac9e4f17551146484 \
  --trust-remote-code \
  --bf16 \
  --env-mode benchmark \
  --scene-limit 0 \
  --scene-repeats 2 \
  --num-generations 2 \
  --per-device-train-batch-size 2 \
  --gradient-accumulation-steps 1 \
  --max-steps 60 \
  --max-completion-length 128 \
  --image-max-edge 320 \
  --output-dir /tmp/egosocial_train/grpo_v6b_benchmark_run

Observed status at launch:

training loop entered successfully
progress reached 0/60
H100 GPU visible to the training process

Known Remaining Gap

The OpenEnv 2-turn evaluator is now valid for the 3 vs 4 experiment, but it is still a different protocol from the earlier official-style SFT benchmark. If needed for reporting, add a separate official-style eval table rather than mixing the two protocols.