	# Experiment Log

	Last updated: 2026-03-08 UTC

	## Scope

	Current focus is the `3 vs 4` comparison:

	- `3`: `robertzty/EgoNormia-Cosmos-Reason2-2B-v6b-shortcot` as `SFT only`
	- `4`: the same checkpoint plus `GRPO` on the local `EgoNormia` 2-turn OpenEnv

	## Data Alignment

	The environment is now aligned to the same metadata family used by the earlier SFT pipeline.

	- Env metadata source of truth: [final_data.json](/root/openenv-hack/egosocial_env/data/egonormia/final_data.json)
	- Heldout split: [verified_split.json](/root/openenv-hack/egosocial_env/data/splits/verified_split.json)
	- Heldout size: `200`
	- Env overlap with heldout: `200/200`

	Important note:

	- `train-norm-updated.parquet` has only `1743` rows
	- `final_data.json` has `1853` scenes
	- the earlier SFT pipeline used `final_data.json + verified_split`, not the parquet-only view
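
The alignment figures above can be re-checked with a few lines of Python. This is a minimal sketch, assuming the scene IDs have already been extracted into plain Python collections. The helper name `overlap_report` and the toy IDs are illustrative only, since the actual schemas of `final_data.json` and `verified_split.json` are not shown in this log.

```python
def overlap_report(env_ids, heldout_ids):
    """Summarise how many heldout scene IDs are present in the env metadata.

    Both arguments are iterables of scene IDs (hypothetical format).
    """
    env = set(env_ids)
    heldout = set(heldout_ids)
    return {
        "heldout_size": len(heldout),
        "overlap": len(heldout & env),
        "missing": sorted(heldout - env),
    }


# Toy IDs for illustration only; real IDs would come from the two JSON files.
report = overlap_report(["s1", "s2", "s3", "s4"], ["s2", "s4"])
print(f"Env overlap with heldout: {report['overlap']}/{report['heldout_size']}")

# Count difference between the two metadata views, using the counts in this log.
parquet_rows, final_data_scenes = 1743, 1853
count_gap = final_data_scenes - parquet_rows
print(f"final_data.json has {count_gap} more entries than the parquet")
```

A non-empty `missing` list would mean some heldout scenes cannot be served by the environment, which is the condition the `200/200` figure rules out.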

	## SFT Baseline

	### Official/OpenEnv caveat

	Do not directly compare the OpenEnv 2-turn metrics below with the earlier official-style SFT metrics such as:

	- `78.5 / 88.5 / 70.5 / 0.6450 / 100.0`

	Those come from a different evaluation protocol. The numbers below are from the OpenEnv 2-turn benchmark-mode evaluator.

	### Text-only env eval

	Output:

	- [/tmp/egosocial_eval/eval_v6b_verified_finaldata.json](/tmp/egosocial_eval/eval_v6b_verified_finaldata.json)

	Summary:

	- `accuracy = 0.555`
	- `avg_reward = 0.4487`
	- `avg_sensibility = 0.6900`
	- `avg_taxonomy_match = 0.2995`
	- `avg_justification_alignment = 0.3867`
	- `avg_rubric_average = 0.5504`

	### Image-conditioned env eval

	Output:

	- [/tmp/egosocial_eval/eval_v6b_verified_finaldata_image_200.json](/tmp/egosocial_eval/eval_v6b_verified_finaldata_image_200.json)

	Summary:

	- model: `robertzty/EgoNormia-Cosmos-Reason2-2B-v6b-shortcot`
	- `accuracy = 0.685`
	- `avg_reward = 0.5305`
	- `avg_action_selection = 0.685`
	- `avg_sensibility = 0.765`
	- `avg_taxonomy_match = 0.3029`
	- `avg_justification_alignment = 0.4800`
	- `avg_rubric_average = 0.6205`

	Use the image-conditioned result as the current `SFT only` baseline for OpenEnv comparisons.
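
To make the gap between the two eval modes explicit, the per-metric differences can be computed directly from the summaries above. The sketch below only uses numbers reported in this log; the helper name `metric_deltas` is illustrative. Metrics present in only one summary (here `avg_action_selection`) are skipped.

```python
# Metrics copied from the text-only and image-conditioned summaries in this log.
text_only = {
    "accuracy": 0.555,
    "avg_reward": 0.4487,
    "avg_sensibility": 0.6900,
    "avg_taxonomy_match": 0.2995,
    "avg_justification_alignment": 0.3867,
    "avg_rubric_average": 0.5504,
}
image_conditioned = {
    "accuracy": 0.685,
    "avg_reward": 0.5305,
    "avg_action_selection": 0.685,
    "avg_sensibility": 0.765,
    "avg_taxonomy_match": 0.3029,
    "avg_justification_alignment": 0.4800,
    "avg_rubric_average": 0.6205,
}


def metric_deltas(baseline, candidate):
    """Return candidate - baseline for every metric present in both dicts."""
    return {
        name: round(candidate[name] - baseline[name], 4)
        for name in baseline
        if name in candidate
    }


deltas = metric_deltas(text_only, image_conditioned)
for name, delta in deltas.items():
    print(f"{name:30s} {delta:+.4f}")
```

The same helper can be reused later to compare the `SFT only` and `GRPO` eval outputs, provided both come from the same evaluator and mode.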

	## GRPO Status

	### What was fixed

	The GRPO training path in [train_grpo_reason2.py](/root/openenv-hack/egosocial_env/scripts/train_grpo_reason2.py) was upgraded to:

	- use real image-conditioned 2-turn observations instead of text-only prompts
	- pack a full episode as one training example
	- preserve environment-feedback masking via `env_mask`
	- pass multimodal tensors needed by `Qwen3VL/Cosmos-Reason2`
	- add `mm_token_type_ids` handling on the training path
	- downscale training-time images with `--image-max-edge` to avoid H100 OOM on the very wide `frame_all_prev.jpg` / `frame_all_during.jpg` panoramas
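
The role of `env_mask` in the list above can be illustrated with a small example. This is a sketch of the general idea, not the code in `train_grpo_reason2.py`: it assumes a mask value of `1` marks tokens the policy generated and `0` marks environment feedback inserted between the two turns. The function name `masked_mean` and the toy values are hypothetical.

```python
def masked_mean(per_token_values, env_mask):
    """Average per-token values over model-generated tokens only.

    env_mask[i] == 1: token produced by the policy (kept).
    env_mask[i] == 0: environment feedback text (excluded from the loss).
    This mask convention is an assumption made for the example.
    """
    if len(per_token_values) != len(env_mask):
        raise ValueError("values and mask must have the same length")
    kept = [v for v, m in zip(per_token_values, env_mask) if m]
    if not kept:
        return 0.0  # nothing to train on in this episode
    return sum(kept) / len(kept)


# One packed 2-turn episode: turn-1 completion (3 tokens),
# environment feedback (2 tokens), turn-2 completion (2 tokens).
per_token_loss = [0.5, 0.25, 0.75, 9.0, 9.0, 1.0, 0.5]
env_mask = [1, 1, 1, 0, 0, 1, 1]

episode_loss = masked_mean(per_token_loss, env_mask)
print(episode_loss)
```

Without the mask, the large values on the feedback tokens would dominate the average, and the model would be trained to predict text it never generated.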

	### Smoke test result

	A real `1-step` GRPO smoke test completed successfully on the H100.

	Run config:

	- model: `robertzty/EgoNormia-Cosmos-Reason2-2B-v6b-shortcot`
	- env mode: `benchmark`
	- `scene_limit = 1`
	- `scene_repeats = 2`
	- `num_generations = 2`
	- `per_device_train_batch_size = 2`
	- `gradient_accumulation_steps = 1`
	- `max_steps = 1`
	- `max_completion_length = 128`
	- `image_max_edge = 320`
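
For intuition on what `image_max_edge = 320` does to a very wide panorama, the sketch below computes the resized dimensions. It assumes the flag caps the longer edge while preserving aspect ratio; the actual resize logic in the training script is not shown in this log, and the `3840x480` input size is a made-up example rather than a measured frame size.

```python
def downscaled_size(width, height, max_edge):
    """Return (width, height) with the longer edge capped at max_edge.

    Aspect ratio is preserved; images already within the limit are unchanged.
    """
    longest = max(width, height)
    if longest <= max_edge:
        return width, height
    new_width = max(1, round(width * max_edge / longest))
    new_height = max(1, round(height * max_edge / longest))
    return new_width, new_height


# Hypothetical wide panorama, capped at the smoke-test setting.
panorama_size = downscaled_size(3840, 480, 320)
print(panorama_size)
```

Under this assumption a wide panorama also loses height in proportion, so the cap trades visual detail for memory headroom.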

	Observed training summary:

	- `train_loss = 0.01375`
	- `train_runtime = 138.4s`

	Artifacts:

	- output dir: [/tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu](/tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu)
	- checkpoint: [/tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu/checkpoint-1](/tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu/checkpoint-1)
	- final weights: [/tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu/model.safetensors](/tmp/egosocial_train/grpo_v6b_multimodal_smoke_gpu/model.safetensors)

	This means the `TRL + OpenEnv + Cosmos-Reason2` training path is now operational.
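
As background on what `num_generations = 2` implies in the smoke test: GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below shows that group-relative normalisation under simplifying assumptions (population standard deviation, a small epsilon). TRL's exact formula and defaults may differ, and `group_advantages` is an illustrative name.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalise rewards within one prompt group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two rollouts of the same scene with different episode rewards (toy values).
print(group_advantages([0.2, 0.8]))

# Two rollouts with identical rewards carry no learning signal.
print(group_advantages([0.5, 0.5]))
```

With only two generations per group, the advantages are always symmetric around zero, and any group whose rollouts receive the same reward contributes nothing to the update.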

	## Cosmos Predict

	Cosmos Predict integration remains available as the world-model backend for `train` mode, but it is not the current source of ground truth for training.

	Current positioning:

	- `benchmark mode`: main training/eval truth path
	- `train mode + Cosmos Predict`: rollout/demo/scaling path

	## Recommended Next Step

	Run the real `3 vs 4` experiment:

	- `v6b SFT only` on the fixed 200-scene heldout
	- `v6b + GRPO` on non-heldout `EgoNormia`, then evaluate on the same heldout

	Suggested starting command:

	```bash
	cd /root/openenv-hack/egosocial_env
	HF_HOME=/tmp/hf TRANSFORMERS_CACHE=/tmp/hf/hub \
	.venv/bin/python scripts/train_grpo_reason2.py \
	--model-id /tmp/hf/hub/models--robertzty--EgoNormia-Cosmos-Reason2-2B-v6b-shortcot/snapshots/358d42e154c403a07f3cc2bac9e4f17551146484 \
	--trust-remote-code \
	--env-mode benchmark \
	--scene-limit 256 \
	--scene-repeats 2 \
	--num-generations 2 \
	--per-device-train-batch-size 2 \
	--gradient-accumulation-steps 1 \
	--max-steps 30 \
	--max-completion-length 128 \
	--image-max-edge 320 \
	--output-dir /tmp/egosocial_train/grpo_v6b_benchmark_run
	```

	Then evaluate the resulting checkpoint with:

	```bash
	cd /root/openenv-hack/egosocial_env
	HF_HOME=/tmp/hf TRANSFORMERS_CACHE=/tmp/hf/hub \
	.venv/bin/python scripts/eval_reason2.py \
	--model-id /tmp/egosocial_train/grpo_v6b_benchmark_run \
	--trust-remote-code \
	--bf16 \
	--env-mode benchmark \
	--output-path /tmp/egosocial_eval/eval_grpo_v6b_verified_image_200.json
	```
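
When comparing the resulting accuracy against the `SFT only` baseline, keep in mind how much resolution a 200-scene heldout provides. The sketch below uses a normal-approximation confidence interval for a single accuracy value. This is a rough guide, not part of the evaluator, and the function name `accuracy_interval` is illustrative.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n scenes."""
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - z * standard_error, accuracy + z * standard_error


# Image-conditioned SFT baseline from this log: 0.685 on 200 heldout scenes.
low, high = accuracy_interval(0.685, 200)
print(f"{low:.3f} .. {high:.3f}")
```

Because both models are evaluated on the same 200 scenes, a paired comparison of per-scene outcomes would be more sensitive than comparing two independent intervals.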

	## Active Run

	A longer GRPO benchmark run (`max_steps = 60`) has been started.

	Run directory:

	- [/tmp/egosocial_train/grpo_v6b_benchmark_run](/tmp/egosocial_train/grpo_v6b_benchmark_run)

	Command:

	```bash
	cd /root/openenv-hack/egosocial_env
	HF_HOME=/tmp/hf TRANSFORMERS_CACHE=/tmp/hf/hub \
	.venv/bin/python scripts/train_grpo_reason2.py \
	--model-id /tmp/hf/hub/models--robertzty--EgoNormia-Cosmos-Reason2-2B-v6b-shortcot/snapshots/358d42e154c403a07f3cc2bac9e4f17551146484 \
	--trust-remote-code \
	--bf16 \
	--env-mode benchmark \
	--scene-limit 0 \
	--scene-repeats 2 \
	--num-generations 2 \
	--per-device-train-batch-size 2 \
	--gradient-accumulation-steps 1 \
	--max-steps 60 \
	--max-completion-length 128 \
	--image-max-edge 320 \
	--output-dir /tmp/egosocial_train/grpo_v6b_benchmark_run
	```
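
A quick budget estimate helps put the run length in perspective. The sketch below assumes a single GPU and the convention that the training batch counts completions, so each prompt group of `num_generations` completions occupies that many batch slots. The custom script may batch differently, the effect of `--scene-repeats` is not modelled, and `prompts_per_step` is an illustrative name.

```python
def prompts_per_step(per_device_batch, grad_accum, num_generations, num_devices=1):
    """Number of distinct prompt groups consumed per optimizer step."""
    completions = per_device_batch * grad_accum * num_devices
    if completions % num_generations != 0:
        raise ValueError("completions per step must be divisible by num_generations")
    return completions // num_generations


# Settings from the active run command.
per_step = prompts_per_step(per_device_batch=2, grad_accum=1, num_generations=2)
total_groups = per_step * 60  # max_steps = 60
print(per_step, total_groups)
```

Under these assumptions the run sees one prompt group per step, so 60 steps cover at most 60 episodes' worth of prompts.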

	Observed status at launch:

	- training loop entered successfully
	- progress bar initialized at `0/60`
	- H100 GPU visible to the training process

	## Known Remaining Gap

	The OpenEnv 2-turn evaluator is now valid for the `3 vs 4` experiment, but it is still a different protocol from the earlier official-style SFT benchmark. If needed for reporting, add a separate official-style eval table rather than mixing the two protocols.