---
base_model: facebook/deformable-detr-detic
datasets:
- Voxel51/fisheye8k
library_name: transformers
license: apache-2.0
tags:
- generated_from_trainer
- object-detection
- zero-shot
pipeline_tag: zero-shot-object-detection
model-index:
- name: fisheye8k_facebook_deformable-detr-detic
results: []
---
# fisheye8k_facebook_deformable-detr-detic
This model is a fine-tuned version of [facebook/deformable-detr-detic](https://huggingface.co/facebook/deformable-detr-detic) on the [Fisheye8K dataset](https://huggingface.co/datasets/Voxel51/fisheye8k), developed as part of the **Mcity Data Engine** project.
The model achieves the following results on the evaluation set:
- Loss: 2.1348
📚 Paper: [Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection](https://huggingface.co/papers/2504.21614)
🌐 Project Page: [Mcity Data Engine Docs](https://mcity.github.io/mcity_data_engine/)
💻 Code: [mcity/mcity_data_engine GitHub Repository](https://github.com/mcity/mcity_data_engine)
## Model description
The `fisheye8k_facebook_deformable-detr-detic` model is a component of the **Mcity Data Engine**, an open-source system designed for iterative model improvement through open-vocabulary data selection. This engine provides modules for the complete data-based development cycle, from data acquisition to model deployment, specifically addressing the challenge of detecting long-tail and novel classes in large amounts of unlabeled data, particularly in Intelligent Transportation Systems (ITS).
This model builds on its base, `deformable-detr-detic`, and specializes it for object detection within the Mcity Data Engine's workflows. The Detic base was trained for open-vocabulary detection, which underpins the engine's ability to identify object classes beyond those seen during fine-tuning.
## Intended uses & limitations
This model is primarily intended for **zero-shot object detection** within Intelligent Transportation Systems (ITS) and related domains. It is designed to assist researchers and practitioners in identifying rare and novel classes of interest from raw visual data, facilitating the continuous improvement of AI models.
**Potential use cases include:**
* Detecting various types of road users and vehicles in complex traffic scenarios.
* Identifying long-tail or previously unseen objects in automotive perception datasets.
* Serving as a component within larger data curation and model training pipelines.
**Limitations:**
* While designed for generalization through its open-vocabulary approach, performance on highly out-of-distribution scenarios might still vary.
* Optimal utilization of the Mcity Data Engine workflows, including those leveraging this model, often requires a powerful GPU.
* The model's performance on standard perspective images may differ, as it was fine-tuned on fisheye camera data.
## Sample Usage
You can use this model with the Hugging Face `transformers` library to perform object detection. The fine-tuned checkpoint predicts the Fisheye8K classes (Bus, Bike, Car, Pedestrian, Truck), which are exposed via `model.config.id2label`.
```python
from transformers import AutoImageProcessor, AutoModelForObjectDetection
import torch
from PIL import Image
import requests

# Load an example image. For best results, use fisheye images similar to the Fisheye8K dataset.
# This example uses a general COCO image; real-world usage should focus on ITS contexts.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and model from the Hugging Face Hub
model_id = "mcity-data-engine/fisheye8k_facebook_deformable-detr-detic"
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForObjectDetection.from_pretrained(model_id)

# Prepare inputs
inputs = processor(images=image, return_tensors="pt")

# Perform inference
with torch.no_grad():
    outputs = model(**inputs)

# Post-process raw outputs into scores, labels, and boxes in absolute pixel coordinates
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.5)[0]

# Print detected objects; label indices map to the fine-tuned classes via model.config.id2label
print("Detected objects in the image (threshold=0.5):")
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    xmin, ymin, xmax, ymax = [round(v, 2) for v in box.tolist()]
    print(
        f"  {model.config.id2label[label.item()]}: {round(score.item(), 3)} "
        f"[xmin={xmin}, ymin={ymin}, xmax={xmax}, ymax={ymax}]"
    )
```
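Alternatively, the high-level `pipeline` API bundles the same pre- and post-processing. A minimal sketch, assuming a local fisheye image (the path below is a placeholder):
```python
from transformers import pipeline

detector = pipeline(
    "object-detection",
    model="mcity-data-engine/fisheye8k_facebook_deformable-detr-detic",
)

# Each detection is a dict with "label", "score", and "box" keys
for detection in detector("path/to/fisheye_frame.jpg", threshold=0.5):  # placeholder path
    print(detection["label"], detection["score"], detection["box"])
```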
## Training and evaluation data
This model was fine-tuned on the [Fisheye8K dataset](https://huggingface.co/datasets/Voxel51/fisheye8k). The Mcity Data Engine leverages data selection processes to focus on detecting long-tail and novel classes, which is crucial for ITS applications. The model's `config.json` indicates it was fine-tuned for classes such as "Bus", "Bike", "Car", "Pedestrian", and "Truck".
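If you want to confirm the label mapping without loading the full model, you can read it from the config (a quick sketch; the exact index order follows `config.json`):
```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("mcity-data-engine/fisheye8k_facebook_deformable-detr-detic")
print(config.id2label)  # expected to contain Bus, Bike, Car, Pedestrian, Truck
```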
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 0
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- num_epochs: 36
- mixed_precision_training: Native AMP
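For reference, these settings correspond roughly to the following `TrainingArguments`; this is a reconstruction, and the actual training script (including the `output_dir`) may differ:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="fisheye8k_facebook_deformable-detr-detic",  # placeholder output path
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    seed=0,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    num_train_epochs=36,
    fp16=True,  # Native AMP mixed-precision training
)
```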
### Training results
| Training Loss | Epoch | Step  | Validation Loss |
|:-------------:|:-----:|:-----:|:---------------:|
| 2.435         | 1.0   | 5288  | 2.4832          |
| 2.2626        | 2.0   | 10576 | 2.6324          |
| 1.8443        | 3.0   | 15864 | 2.1361          |
| 2.4834        | 4.0   | 21152 | 2.5269          |
| 2.3417        | 5.0   | 26440 | 2.5997          |
| 1.939         | 6.0   | 31728 | 2.1948          |
| 1.8384        | 7.0   | 37016 | 2.0057          |
| 1.7235        | 8.0   | 42304 | 2.0182          |
| 1.728         | 9.0   | 47592 | 1.9454          |
| 1.621         | 10.0  | 52880 | 1.9876          |
| 1.539         | 11.0  | 58168 | 1.8862          |
| 1.7229        | 12.0  | 63456 | 2.2071          |
| 1.9613        | 13.0  | 68744 | 2.5147          |
| 1.5238        | 14.0  | 74032 | 1.9836          |
| 1.5777        | 15.0  | 79320 | 2.0812          |
| 1.5963        | 16.0  | 84608 | 2.1348          |
### Framework versions
- Transformers 4.48.3
- PyTorch 2.5.1+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0
## Acknowledgements
Mcity would like to thank Amazon Web Services (AWS) for their pivotal role in providing the cloud infrastructure on which the Data Engine depends. We couldn’t have done it without their tremendous support!
Special thanks to these amazing people for contributing to the Mcity Data Engine! 🙌
<a href="https://github.com/mcity/mcity_data_engine/graphs/contributors">
<img src="https://contrib.rocks/image?repo=mcity/mcity_data_engine" />
</a>
## Citation
If you use the Mcity Data Engine in your research, feel free to cite the project:
```bibtex
@article{bogdoll2025mcitydataengine,
title={Mcity Data Engine},
author={Bogdoll, Daniel and Anata, Rajanikant Patnaik and Stevens, Gregory},
journal={GitHub. Note: https://github.com/mcity/mcity_data_engine},
year={2025}
}
``` |