---
base_model: SenseTime/deformable-detr
datasets:
- Voxel51/fisheye8k
library_name: transformers
license: apache-2.0
tags:
- generated_from_trainer
- object-detection
- computer-vision
- deformable-detr
- detr
model-index:
- name: fisheye8k_SenseTime_deformable-detr
  results: []
pipeline_tag: object-detection
---

# fisheye8k_SenseTime_deformable-detr

This model is a fine-tuned version of [SenseTime/deformable-detr](https://huggingface.co/SenseTime/deformable-detr) on the [Fisheye8K dataset](https://huggingface.co/datasets/Voxel51/fisheye8k). It was developed as part of the [Mcity Data Engine](https://mcity.github.io/mcity_data_engine/) project, described in the paper [Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection](https://huggingface.co/papers/2504.21614). The code for the Mcity Data Engine project is available on [GitHub](https://github.com/mcity/mcity_data_engine).

It achieves the following results on the evaluation set:
- Loss: 1.2335

## Model description

This model is a fine-tuned object detection model based on the `SenseTime/deformable-detr` architecture, specifically trained for object detection on fisheye camera imagery. It is a product of the [Mcity Data Engine](https://mcity.github.io/mcity_data_engine/), an open-source system designed for iterative data selection and model improvement in Intelligent Transportation Systems (ITS). The model can detect objects such as "Bus", "Bike", "Car", "Pedestrian", and "Truck", leveraging an open-vocabulary data selection process during its development to focus on rare and novel classes.

## Intended uses & limitations

This model is intended for object detection tasks within Intelligent Transportation Systems (ITS) that utilize fisheye camera data. Potential applications include traffic monitoring, enhancing autonomous driving perception, and smart city infrastructure, with a focus on detecting long-tail classes of interest and vulnerable road users (VRUs).

**Limitations:**

* The model's performance is optimized for fisheye camera data and the specific object classes it was trained on.
* Performance may vary significantly in out-of-distribution scenarios or when applied to data from different camera types or environments.
* Users should consider potential biases inherited from the underlying Fisheye8K dataset.

## Sample Usage

You can use this model directly with the `transformers` pipeline for object detection:

```python
from transformers import pipeline
from PIL import Image
import requests
from io import BytesIO

# Load the object detection pipeline
detector = pipeline("object-detection", model="mcity-data-engine/fisheye8k_SenseTime_deformable-detr")

# Example image (replace with a relevant fisheye image if available, or a local path).
# Using a generic example image for demonstration purposes. For best results, use a fisheye image.
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bird_sized.jpg"

try:
    response = requests.get(image_url, stream=True)
    response.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)
    image = Image.open(BytesIO(response.content)).convert("RGB")
except requests.exceptions.RequestException as e:
    print(f"Could not load example image from URL: {e}. Please provide a local image path.")
    # Fallback/exit if the image cannot be loaded
    exit()

# Perform inference
predictions = detector(image)

# Print detected objects
for pred in predictions:
    print(f"Label: {pred['label']}, Score: {pred['score']:.2f}, Box: {pred['box']}")

# For visualization (optional, requires matplotlib):
# from matplotlib import pyplot as plt
# import matplotlib.patches as patches
#
# fig, ax = plt.subplots(1)
# ax.imshow(image)
#
# for p in predictions:
#     box = p['box']
#     rect = patches.Rectangle((box['xmin'], box['ymin']), box['xmax'] - box['xmin'], box['ymax'] - box['ymin'],
#                              linewidth=1, edgecolor='r', facecolor='none')
#     ax.add_patch(rect)
#     plt.text(box['xmin'], box['ymin'] - 5, f"{p['label']}: {p['score']:.2f}", color='red', fontsize=8)
#
# plt.show()
```
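If you want finer control over the score threshold and access to raw model outputs, you can also call the model through the lower-level `transformers` API. The following is a minimal sketch: the local image path is a placeholder, and the 0.5 threshold is an illustrative choice rather than a tuned value.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForObjectDetection

model_id = "mcity-data-engine/fisheye8k_SenseTime_deformable-detr"

# Load the image processor and the fine-tuned model
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForObjectDetection.from_pretrained(model_id)
model.eval()

# Load a local fisheye image (placeholder path)
image = Image.open("path/to/fisheye_image.jpg").convert("RGB")

# Preprocess and run the model
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Post-process raw outputs into detections in pixel coordinates.
# target_sizes expects (height, width); PIL's `size` is (width, height).
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"Label: {model.config.id2label[label.item()]}, Score: {score.item():.2f}, Box: {box.tolist()}")
```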
## Training and evaluation data

This model was fine-tuned on the [Fisheye8K dataset](https://huggingface.co/datasets/Voxel51/fisheye8k). The Fisheye8K dataset comprises images captured from fisheye cameras, featuring annotated instances of common road users such as cars, buses, bikes, trucks, and pedestrians.

The training process leveraged the capabilities of the [Mcity Data Engine](https://mcity.github.io/mcity_data_engine/), which facilitates iterative model improvement and open-vocabulary data selection, especially for Intelligent Transportation Systems (ITS) applications.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (see the configuration sketch at the end of this card):
- learning_rate: 5e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 0
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- num_epochs: 36
- mixed_precision_training: Native AMP

### Training results

| Training Loss | Epoch | Step  | Validation Loss |
|:-------------:|:-----:|:-----:|:---------------:|
| 0.8943        | 1.0   | 5288  | 1.5330          |
| 0.7865        | 2.0   | 10576 | 1.4108          |
| 0.7238        | 3.0   | 15864 | 1.2660          |
| 0.6657        | 4.0   | 21152 | 1.2084          |
| 0.6460        | 5.0   | 26440 | 1.2666          |
| 0.6269        | 6.0   | 31728 | 1.2555          |
| 0.6049        | 7.0   | 37016 | 1.2350          |
| 0.5894        | 8.0   | 42304 | 1.2940          |
| 0.5484        | 9.0   | 47592 | 1.2335          |

### Framework versions

- Transformers 4.48.3
- Pytorch 2.5.1+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0
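### Training configuration sketch

For orientation, the hyperparameters above map approximately onto the `TrainingArguments` below. This is a hedged reconstruction rather than the exact Mcity Data Engine training script: `output_dir` is a placeholder, and any argument not listed in the hyperparameters section is a `transformers` default or an assumption.

```python
from transformers import TrainingArguments

# Sketch of a TrainingArguments configuration matching the reported hyperparameters.
training_args = TrainingArguments(
    output_dir="fisheye8k_SenseTime_deformable-detr",  # placeholder output directory
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    seed=0,
    optim="adamw_torch",   # OptimizerNames.ADAMW_TORCH
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="cosine",
    num_train_epochs=36,
    fp16=True,             # Native AMP mixed-precision training
)
```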