---
library_name: transformers
tags:
- vision
license: apache-2.0
pipeline_tag: zero-shot-object-detection
---

# LLMDet (tiny variant)

The [LLMDet](https://arxiv.org/abs/2501.18954) model was proposed in [LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models](https://arxiv.org/abs/2501.18954) by Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, Wei-Shi Zheng.

LLMDet improves upon [MM Grounding DINO](https://huggingface.co/docs/transformers/model_doc/mm-grounding-dino) and [Grounding DINO](https://huggingface.co/docs/transformers/model_doc/grounding-dino) by co-training the detector with a large language model.

You can find all the LLMDet checkpoints under the [LLMDet](https://huggingface.co/collections/rziga/llmdet-68398b294d9866c16046dcdd) collection. Note that these checkpoints are inference-only -- they do not include the LLM that was used during training. Inference is identical to that of [MM Grounding DINO](https://huggingface.co/docs/transformers/model_doc/mm-grounding-dino).

## Intended uses

You can use the raw model for zero-shot object detection. Here's how to use the model:

```py
import torch
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor
from transformers.image_utils import load_image

# Prepare processor and model
model_id = "iSEE-Laboratory/llmdet_tiny"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

# Prepare inputs
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(image_url)
text_labels = [["a cat", "a remote control"]]
inputs = processor(images=image, text=text_labels, return_tensors="pt").to(device)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

# Postprocess outputs
results = processor.post_process_grounded_object_detection(
    outputs,
    threshold=0.4,
    target_sizes=[(image.height, image.width)]
)

# Retrieve the first image result
result = results[0]
for box, score, label in zip(result["boxes"], result["scores"], result["labels"]):
    box = [round(x, 2) for x in box.tolist()]
    print(f"Detected {label} with confidence {round(score.item(), 3)} at location {box}")
```
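The same API also supports batched inference: the processor accepts a list of images with one label set per image, and post-processing then returns one result per image. The snippet below is a minimal sketch of this, assuming the same checkpoint as above; the second COCO image URL, its label set, and the `padding=True` argument are illustrative assumptions rather than part of the original example.

```py
import torch
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor
from transformers.image_utils import load_image

model_id = "iSEE-Laboratory/llmdet_tiny"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

# Two images, each with its own label set (the second URL and labels are just examples)
images = [
    load_image("http://images.cocodataset.org/val2017/000000039769.jpg"),
    load_image("http://images.cocodataset.org/val2017/000000000139.jpg"),
]
text_labels = [
    ["a cat", "a remote control"],
    ["a person", "a tv"],
]

# padding=True pads the tokenized label prompts to the same length across the batch
inputs = processor(images=images, text=text_labels, return_tensors="pt", padding=True).to(device)

with torch.no_grad():
    outputs = model(**inputs)

# One result dict per image; target_sizes maps boxes back to each image's resolution
results = processor.post_process_grounded_object_detection(
    outputs,
    threshold=0.4,
    target_sizes=[(img.height, img.width) for img in images],
)

for image_idx, result in enumerate(results):
    print(f"Image {image_idx}:")
    for box, score, label in zip(result["boxes"], result["scores"], result["labels"]):
        box = [round(x, 2) for x in box.tolist()]
        print(f"  {label}: {round(score.item(), 3)} at {box}")
```

Lowering `threshold` keeps more (lower-confidence) detections, while `target_sizes` rescales the predicted boxes to each image's original height and width.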
## Training Data

This model was trained on:

- [Objects365v1](https://www.objects365.org/overview.html)
- [GoldG](https://arxiv.org/abs/2104.12763)
- [GRIT](https://huggingface.co/datasets/zzliang/GRIT)
- [V3Det](https://github.com/V3Det/V3Det)
- [GroundingCap-1M](https://arxiv.org/abs/2501.18954)

## Evaluation results

Here's a table of LLMDet models and their performance on LVIS (results from the [official repo](https://github.com/iSEE-Laboratory/LLMDet)):

| Model | Pre-Train Data | MiniVal AP | MiniVal APr | MiniVal APc | MiniVal APf | Val1.0 AP | Val1.0 APr | Val1.0 APc | Val1.0 APf |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [llmdet_tiny](https://huggingface.co/rziga/llmdet_tiny) | (O365, GoldG, GRIT, V3Det) + GroundingCap-1M | 44.7 | 37.3 | 39.5 | 50.7 | 34.9 | 26.0 | 30.1 | 44.3 |
| [llmdet_base](https://huggingface.co/rziga/llmdet_base) | (O365, GoldG, V3Det) + GroundingCap-1M | 48.3 | 40.8 | 43.1 | 54.3 | 38.5 | 28.2 | 34.3 | 47.8 |
| [llmdet_large](https://huggingface.co/rziga/llmdet_large) | (O365V2, OpenImageV6, GoldG) + GroundingCap-1M | 51.1 | 45.1 | 46.1 | 56.6 | 42.0 | 31.6 | 38.8 | 50.2 |

## BibTeX entry and citation info

```bib
@article{fu2025llmdet,
  title={LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models},
  author={Fu, Shenghao and Yang, Qize and Mo, Qijie and Yan, Junkai and Wei, Xihan and Meng, Jingke and Xie, Xiaohua and Zheng, Wei-Shi},
  journal={arXiv preprint arXiv:2501.18954},
  year={2025}
}
```