---
base_model: facebook/deformable-detr-detic
datasets:
- Voxel51/fisheye8k
library_name: transformers
license: apache-2.0
tags:
- generated_from_trainer
- object-detection
- zero-shot
pipeline_tag: zero-shot-object-detection
model-index:
- name: fisheye8k_facebook_deformable-detr-detic
results: []
---
# fisheye8k_facebook_deformable-detr-detic
This model is a fine-tuned version of [facebook/deformable-detr-detic](https://huggingface.co/facebook/deformable-detr-detic) on the [Fisheye8K dataset](https://huggingface.co/datasets/Voxel51/fisheye8k), developed as part of the **Mcity Data Engine** project.
The model achieves the following results on the evaluation set:
- Loss: 2.1348
📚 Paper: [Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection](https://huggingface.co/papers/2504.21614)
🌐 Project Page: [Mcity Data Engine Docs](https://mcity.github.io/mcity_data_engine/)
💻 Code: [mcity/mcity_data_engine GitHub Repository](https://github.com/mcity/mcity_data_engine)
## Model description
The `fisheye8k_facebook_deformable-detr-detic` model is a component of the **Mcity Data Engine**, an open-source system designed for iterative model improvement through open-vocabulary data selection. This engine provides modules for the complete data-based development cycle, from data acquisition to model deployment, specifically addressing the challenge of detecting long-tail and novel classes in large amounts of unlabeled data, particularly in Intelligent Transportation Systems (ITS).
This model builds on its base, `deformable-detr-detic`, and specializes it for object detection within the Mcity Data Engine's workflows. The Detic base was trained for open-vocabulary detection, which underpins the engine's ability to identify object classes beyond those seen during fine-tuning.
## Intended uses & limitations
This model is primarily intended for **zero-shot object detection** within Intelligent Transportation Systems (ITS) and related domains. It is designed to assist researchers and practitioners in identifying rare and novel classes of interest from raw visual data, facilitating the continuous improvement of AI models.
**Potential use cases include:**
* Detecting various types of road users and vehicles in complex traffic scenarios.
* Identifying long-tail or previously unseen objects in automotive perception datasets.
* Serving as a component within larger data curation and model training pipelines.
**Limitations:**
* While designed for generalization through its open-vocabulary approach, performance on highly out-of-distribution scenarios might still vary.
* Optimal utilization of the Mcity Data Engine workflows, including those leveraging this model, often requires a powerful GPU.
* The model's performance on standard perspective images may differ, as it was fine-tuned on fisheye camera data.
## Sample Usage
You can use this model with the Hugging Face `transformers` library to perform object detection. The fine-tuned checkpoint predicts the Fisheye8K classes (Bus, Bike, Car, Pedestrian, Truck), which are exposed via `model.config.id2label`.
```python
from transformers import AutoImageProcessor, AutoModelForObjectDetection
import torch
from PIL import Image
import requests

# Load an example image. For best results, use fisheye images similar to the Fisheye8K dataset.
# This example uses a general COCO image; real-world usage should focus on ITS contexts.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and model from the Hugging Face Hub
model_id = "mcity-data-engine/fisheye8k_facebook_deformable-detr-detic"
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForObjectDetection.from_pretrained(model_id)

# Prepare inputs
inputs = processor(images=image, return_tensors="pt")

# Perform inference
with torch.no_grad():
    outputs = model(**inputs)

# Post-process raw outputs into scores, labels, and boxes in absolute pixel coordinates
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.5)[0]

# Print detected objects; label indices map to the fine-tuned classes via model.config.id2label
print("Detected objects in the image (threshold=0.5):")
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    xmin, ymin, xmax, ymax = [round(v, 2) for v in box.tolist()]
    print(
        f"  {model.config.id2label[label.item()]}: {round(score.item(), 3)} "
        f"[xmin={xmin}, ymin={ymin}, xmax={xmax}, ymax={ymax}]"
    )
```
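Alternatively, the high-level `pipeline` API bundles the same pre- and post-processing. A minimal sketch, assuming a local fisheye image (the path below is a placeholder):
```python
from transformers import pipeline

detector = pipeline(
    "object-detection",
    model="mcity-data-engine/fisheye8k_facebook_deformable-detr-detic",
)

# Each detection is a dict with "label", "score", and "box" keys
for detection in detector("path/to/fisheye_frame.jpg", threshold=0.5):  # placeholder path
    print(detection["label"], detection["score"], detection["box"])
```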
## Training and evaluation data
This model was fine-tuned on the [Fisheye8K dataset](https://huggingface.co/datasets/Voxel51/fisheye8k). The Mcity Data Engine leverages data selection processes to focus on detecting long-tail and novel classes, which is crucial for ITS applications. The model's `config.json` indicates it was fine-tuned for classes such as "Bus", "Bike", "Car", "Pedestrian", and "Truck".
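If you want to confirm the label mapping without loading the full model, you can read it from the config (a quick sketch; the exact index order follows `config.json`):
```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("mcity-data-engine/fisheye8k_facebook_deformable-detr-detic")
print(config.id2label)  # expected to contain Bus, Bike, Car, Pedestrian, Truck
```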
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 0
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- num_epochs: 36
- mixed_precision_training: Native AMP
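For reference, these settings correspond roughly to the following `TrainingArguments`; this is a reconstruction, and the actual training script (including the `output_dir`) may differ:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="fisheye8k_facebook_deformable-detr-detic",  # placeholder output path
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    seed=0,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    num_train_epochs=36,
    fp16=True,  # Native AMP mixed-precision training
)
```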
### Training results
| Training Loss | Epoch | Step  | Validation Loss |
|:-------------:|:-----:|:-----:|:---------------:|
| 2.435         | 1.0   | 5288  | 2.4832          |
| 2.2626        | 2.0   | 10576 | 2.6324          |
| 1.8443        | 3.0   | 15864 | 2.1361          |
| 2.4834        | 4.0   | 21152 | 2.5269          |
| 2.3417        | 5.0   | 26440 | 2.5997          |
| 1.939         | 6.0   | 31728 | 2.1948          |
| 1.8384        | 7.0   | 37016 | 2.0057          |
| 1.7235        | 8.0   | 42304 | 2.0182          |
| 1.728         | 9.0   | 47592 | 1.9454          |
| 1.621         | 10.0  | 52880 | 1.9876          |
| 1.539         | 11.0  | 58168 | 1.8862          |
| 1.7229        | 12.0  | 63456 | 2.2071          |
| 1.9613        | 13.0  | 68744 | 2.5147          |
| 1.5238        | 14.0  | 74032 | 1.9836          |
| 1.5777        | 15.0  | 79320 | 2.0812          |
| 1.5963        | 16.0  | 84608 | 2.1348          |
### Framework versions
- Transformers 4.48.3
- PyTorch 2.5.1+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0
## Acknowledgements
Mcity would like to thank Amazon Web Services (AWS) for their pivotal role in providing the cloud infrastructure on which the Data Engine depends. We couldn’t have done it without their tremendous support!
Special thanks to these amazing people for contributing to the Mcity Data Engine! 🙌
<a href="https://github.com/mcity/mcity_data_engine/graphs/contributors">
<img src="https://contrib.rocks/image?repo=mcity/mcity_data_engine" />
</a>
## Citation
If you use the Mcity Data Engine in your research, feel free to cite the project:
```bibtex
@article{bogdoll2025mcitydataengine,
title={Mcity Data Engine},
author={Bogdoll, Daniel and Anata, Rajanikant Patnaik and Stevens, Gregory},
journal={GitHub. Note: https://github.com/mcity/mcity_data_engine},
year={2025}
}
``` |