TARA: Time-Aware Retrieval Adaptation for Video Understanding

TARA (Time-Aware Retrieval Adaptation) is a multimodal model for video and text understanding.

Installation & Setup

1. Install Git LFS (if not already installed)

Git LFS is required to download the model weights.

Install Git LFS from https://git-lfs.github.com/. If you cannot use sudo, this guide describes a non-sudo installation (untested, but it should work).

Check the installation and initialize Git LFS:

git lfs --version
git lfs install

The output should look similar to:

git-lfs/3.4.1 (GitHub; linux amd64; go 1.20.11; git 0898dcbc)
Updated Git hooks.
Git LFS initialized.

2. Clone the Repository

git clone https://huggingface.co/bpiyush/TARA
cd TARA

This will download all model weights (may take a few minutes depending on your connection).
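
To confirm that Git LFS actually fetched the weights rather than leaving small pointer stubs behind, you can run an optional check like the sketch below. It only assumes the weight shards are .safetensors files in the repository root; the exact shard filenames are not assumed.

# Optional sanity check: LFS pointer stubs are only ~100 bytes,
# so real weight shards should be hundreds of megabytes each.
from pathlib import Path

for f in sorted(Path(".").glob("*.safetensors")):
    size_mb = f.stat().st_size / 1e6
    status = "OK" if size_mb > 1 else "looks like an LFS pointer stub"
    print(f"{f.name}: {size_mb:,.1f} MB ({status})")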

3. Install Dependencies

  • Create/activate the conda env (skip if you already have it):
    conda create -n tara python=3.10 -y
    conda activate tara
    
  • Install CUDA 12.1 PyTorch wheels (adjust the index URL if you need a different CUDA/CPU build):
    pip install --index-url https://download.pytorch.org/whl/cu121 \
      torch==2.5.1+cu121 torchvision==0.20.1+cu121 torchaudio==2.5.1+cu121
    
  • Install the remaining model dependencies:
    pip install -r requirements.txt
    
  • (Optional) Verify the install:
    python -c "import torch, transformers; print(torch.cuda.is_available(), transformers.__version__)"
    

Quick Start

See demo_usage.py for a quick-start script. Run it with:

python demo_usage.py

The output should look something like this:

============================================================
TARA Model Demo
============================================================

[1/6] Loading model...
[ MODEL ] Loading TARA from /work/piyush/pretrained_checkpoints/TARA/ [..............]
### do_image_padding is set as False, images will be resized directly!
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:03<00:00,  1.05s/it]
βœ“ Model loaded successfully!
Number of parameters: 7.063B
----------------------------------------------------------------------------------------------------

[2/6] Testing video encoding and captioning ...
βœ“ Video encoded successfully!
Video shape: torch.Size([1, 16, 3, 240, 426])
Video embedding shape: torch.Size([4096])
Video caption: A hand is seen folding a white paper on a gray carpeted floor. The paper is opened flat on the surface, and then the hand folds it in half vertically, creating a crease in the middle. The hand continues to fold the paper further, resulting in a smaller, more compact size. The background remains a consistent gray carpet throughout the video.
----------------------------------------------------------------------------------------------------

[3/6] Testing text encoding...
βœ“ Text encoded successfully!
Text: ['someone is folding a paper', 'cutting a paper', 'someone is unfolding a paper']
Text embedding shape: torch.Size([3, 4096])

[4/6] Computing video-text similarities...
βœ“ Similarities computed!
  'someone is folding a paper': 0.5039
  'cutting a paper': 0.3022
  'someone is unfolding a paper': 0.3877
----------------------------------------------------------------------------------------------------

[5/6] Testing negation example...
Image embedding shape: torch.Size([2, 4096])
Text query:  ['an image of a cat but there is no dog in it']
Text-Image similarity: tensor([[0.2585, 0.1449]])
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Text query:  ['an image of a cat and a dog together']
Text-Image similarity: tensor([[0.2815, 0.4399]])
----------------------------------------------------------------------------------------------------

[6/6] Testing composed video retrieval...
Source-Target similarity with edit: 0.6476313471794128

============================================================
Demo completed successfully! πŸŽ‰
============================================================

Alternatively, use the snippet below:

import torch
from modeling_tara import TARA, read_frames_decord

model = TARA.from_pretrained(
    ".",  # Load from current directory
    device_map='auto',
    torch_dtype=torch.bfloat16,
)
n_params = sum(p.numel() for p in model.model.parameters())
print(f"Number of parameters: {round(n_params/1e9, 3)}B")

# Embed a video
video_path = "./assets/folding_paper.mp4"
video_tensor = read_frames_decord(video_path, num_frames=16)
video_tensor = video_tensor.unsqueeze(0)
video_tensor = video_tensor.to(model.model.device)
with torch.no_grad():
    video_emb = model.encode_vision(video_tensor).cpu().squeeze(0).float()
print(f"Video shape: {video_tensor.shape}")  # torch.Size([1, 16, 3, 240, 426])
print(f"Video embedding shape: {video_emb.shape}")  # torch.Size([4096])

# Embed a text
text = ['someone is folding a paper', 'cutting a paper', 'someone is unfolding a paper']
with torch.no_grad():
    text_emb = model.encode_text(text).cpu().float()
print(f"Text embedding shape: {text_emb.shape}")  # torch.Size([3, 4096])

Citation

If you use this model, please cite:

@misc{tara2025,
  title={TARA: Simple and Efficient Time Aware Retrieval Adaptation of MLLMs for Video Understanding},
  author={Piyush Bagad and Andrew Zisserman},
  year={2025}
}

License

Apache 2.0
