---
library_name: transformers
tags:
- vision
license: apache-2.0
pipeline_tag: zero-shot-object-detection
---

# LLMDet (tiny variant)

The [LLMDet](https://arxiv.org/abs/2501.18954) model was proposed in [LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models](https://arxiv.org/abs/2501.18954) by Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, Wei-Shi Zheng.

LLMDet improves upon [MM Grounding DINO](https://huggingface.co/docs/transformers/model_doc/mm-grounding-dino) and [Grounding DINO](https://huggingface.co/docs/transformers/model_doc/grounding-dino) by co-training the detector with a large language model.

You can find all the LLMDet checkpoints under the [LLMDet](https://huggingface.co/collections/rziga/llmdet-68398b294d9866c16046dcdd) collection. Note that these checkpoints are inference-only -- they do not include the LLM that was used during training. Inference is identical to that of [MM Grounding DINO](https://huggingface.co/docs/transformers/model_doc/mm-grounding-dino).

## Intended uses

You can use the raw model for zero-shot object detection. Here's how to use the model:

```py
import torch
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor
from transformers.image_utils import load_image

# Prepare processor and model
model_id = "iSEE-Laboratory/llmdet_tiny"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

# Prepare inputs
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(image_url)
text_labels = [["a cat", "a remote control"]]
inputs = processor(images=image, text=text_labels, return_tensors="pt").to(device)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

# Postprocess outputs
results = processor.post_process_grounded_object_detection(
    outputs,
    threshold=0.4,
    target_sizes=[(image.height, image.width)]
)

# Retrieve the first image result
result = results[0]
for box, score, label in zip(result["boxes"], result["scores"], result["labels"]):
    box = [round(x, 2) for x in box.tolist()]
    print(f"Detected {label} with confidence {round(score.item(), 3)} at location {box}")
```
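The same API also supports batched inference: the processor accepts a list of images with one label set per image, and post-processing then returns one result per image. The snippet below is a minimal sketch of this, assuming the same checkpoint as above; the second COCO image URL, its label set, and the `padding=True` argument are illustrative assumptions rather than part of the original example.

```py
import torch
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor
from transformers.image_utils import load_image

model_id = "iSEE-Laboratory/llmdet_tiny"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

# Two images, each with its own label set (the second URL and labels are just examples)
images = [
    load_image("http://images.cocodataset.org/val2017/000000039769.jpg"),
    load_image("http://images.cocodataset.org/val2017/000000000139.jpg"),
]
text_labels = [
    ["a cat", "a remote control"],
    ["a person", "a tv"],
]

# padding=True pads the tokenized label prompts to the same length across the batch
inputs = processor(images=images, text=text_labels, return_tensors="pt", padding=True).to(device)

with torch.no_grad():
    outputs = model(**inputs)

# One result dict per image; target_sizes maps boxes back to each image's resolution
results = processor.post_process_grounded_object_detection(
    outputs,
    threshold=0.4,
    target_sizes=[(img.height, img.width) for img in images],
)

for image_idx, result in enumerate(results):
    print(f"Image {image_idx}:")
    for box, score, label in zip(result["boxes"], result["scores"], result["labels"]):
        box = [round(x, 2) for x in box.tolist()]
        print(f"  {label}: {round(score.item(), 3)} at {box}")
```

Lowering `threshold` keeps more (lower-confidence) detections, while `target_sizes` rescales the predicted boxes to each image's original height and width.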
## Training Data

This model was trained on:

- [Objects365v1](https://www.objects365.org/overview.html)
- [GoldG](https://arxiv.org/abs/2104.12763)
- [GRIT](https://huggingface.co/datasets/zzliang/GRIT)
- [V3Det](https://github.com/V3Det/V3Det)
- [GroundingCap-1M](https://arxiv.org/abs/2501.18954)

## Evaluation results

Here's a table of LLMDet models and their performance on LVIS (results from the [official repo](https://github.com/iSEE-Laboratory/LLMDet)):

| Model | Pre-Train Data | MiniVal AP | MiniVal APr | MiniVal APc | MiniVal APf | Val1.0 AP | Val1.0 APr | Val1.0 APc | Val1.0 APf |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [llmdet_tiny](https://huggingface.co/rziga/llmdet_tiny) | (O365, GoldG, GRIT, V3Det) + GroundingCap-1M | 44.7 | 37.3 | 39.5 | 50.7 | 34.9 | 26.0 | 30.1 | 44.3 |
| [llmdet_base](https://huggingface.co/rziga/llmdet_base) | (O365, GoldG, V3Det) + GroundingCap-1M | 48.3 | 40.8 | 43.1 | 54.3 | 38.5 | 28.2 | 34.3 | 47.8 |
| [llmdet_large](https://huggingface.co/rziga/llmdet_large) | (O365V2, OpenImageV6, GoldG) + GroundingCap-1M | 51.1 | 45.1 | 46.1 | 56.6 | 42.0 | 31.6 | 38.8 | 50.2 |

## BibTeX entry and citation info

```bib
@article{fu2025llmdet,
  title={LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models},
  author={Fu, Shenghao and Yang, Qize and Mo, Qijie and Yan, Junkai and Wei, Xihan and Meng, Jingke and Xie, Xiaohua and Zheng, Wei-Shi},
  journal={arXiv preprint arXiv:2501.18954},
  year={2025}
}
```