[NeurIPS'25] VFRTok: Variable Frame Rates Video Tokenizer with Duration-Proportional Information Assumption

Get Started

Download Pretrained Models

The weights vfrtok-l.bin and vfrtok-s.bin can be downloaded here.

Setup Environment

conda create -n vfrtok python=3.10
conda activate vfrtok
pip install -r requirements.txt

Data Preparation

Organize the data for inference into a CSV file with all video paths under a video_path column. Then run the following script to supplement the video metadata:

python scripts/add_metadata_to_csv.py -i $YOUR_CSV -o $CSV_PATH --data_column video_path

Refer to this for an example of the data format.
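As a minimal sketch, the input CSV only needs the video_path column before the metadata script is run (the file name and video paths below are placeholders, not part of the release):

```shell
# Build a hypothetical minimal input CSV; replace the paths with your own videos
printf 'video_path\n' > my_videos.csv
printf 'videos/clip_0001.mp4\nvideos/clip_0002.mp4\n' >> my_videos.csv
cat my_videos.csv
```

Passing this file as `$YOUR_CSV` to `scripts/add_metadata_to_csv.py` then fills in the remaining metadata columns.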

Inference

# Symmetric
deepspeed inference.py -i $YOUR_CSV -o outputs \
    --config configs/vfrtok-l.yaml --ckpt vfrtok-l.bin \
    --enc_fps 24

# Asymmetric
deepspeed inference.py -i $YOUR_CSV -o outputs \
    --config configs/vfrtok-l.yaml --ckpt vfrtok-l.bin \
    --enc_fps 24 --dec_fps 60

Training

Data Preparation

We release all the CSV files used in training. Download them from here and move them into the data folder.
The corresponding video data can be downloaded from the official sources: ImageNet-1k, K600, BVI_HFR.
Since the paths in the CSV files are relative, replace them according to where the data is actually stored.
To train on your own data, follow the same process as in the Get Started section.

Distributed training

# Stage 1: quick initialize on ImageNet dataset 
# run 30k steps, global bs=512
deepspeed train/pretrain.py --config configs/train/stage1.yaml --ds-config configs/train/stage1_ds.json

# Stage 2: pretrain on K600 dataset 
# run 200k steps, global bs=64
deepspeed train/pretrain.py --config configs/train/stage2.yaml --ds-config configs/train/stage2_ds.json

# Stage 3: asymmetric training on K600 dataset and BVI_HFR datasets 
# run 100k steps, global bs=16
deepspeed train/asymm.py --config configs/train/stage3.yaml --ds-config configs/train/stage3_ds.json

We train VFRTok on 8 H800 GPUs. If you use a different number of GPUs, you can match our global batch size by modifying train_micro_batch_size_per_gpu in the DeepSpeed JSON configs.
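Concretely, the global batch size is train_micro_batch_size_per_gpu × gradient_accumulation_steps × number of GPUs. For instance, to keep stage 2's global batch size of 64 on 4 GPUs instead of 8, a hypothetical fragment of stage2_ds.json might look like this (the exact values in the released configs may differ; this only illustrates the arithmetic, assuming gradient_accumulation_steps stays at 1):

```json
{
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 16,
    "gradient_accumulation_steps": 1
}
```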
After each training stage, the DeepSpeed weights can be consolidated with the following script. Before starting the second and third stages of training, modify vq_ckpt and disc_ckpt in the YAML configuration.

# model checkpoint
python scripts/zero_to_fp32.py -i exp001 -t 200000
# GAN discriminator checkpoint
python scripts/zero_to_fp32.py -i exp001 -t 200000 --module disc
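The consolidated checkpoints can then be referenced when launching the next stage. A hypothetical fragment of configs/train/stage2.yaml is sketched below; the key names vq_ckpt and disc_ckpt come from the instructions above, but the file paths are placeholders for wherever your consolidated stage-1 weights landed:

```yaml
# Placeholders: point these at the checkpoints consolidated from the previous stage
vq_ckpt: exp001/vfrtok_stage1.bin
disc_ckpt: exp001/disc_stage1.bin
```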

Reference

@inproceedings{zhong2025vfrtok,
    title={VFRTok: Variable Frame Rates Video Tokenizer with Duration-Proportional Information Assumption},
    author={Tianxiong Zhong and Xingye Tian and Boyuan Jiang and Xuebo Wang and Xin Tao and Pengfei Wan and Zhiwei Zhang},
    booktitle={Advances in Neural Information Processing Systems},
    year={2025},
}