---
base_model: jozhang97/deta-swin-large-o365
datasets:
- Voxel51/fisheye8k
library_name: transformers
tags:
- generated_from_trainer
- deta
- swin
- traffic
- automotive
- ITS
- computer-vision
pipeline_tag: object-detection
license: mit
model-index:
- name: fisheye8k_jozhang97_deta-swin-large-o365
  results: []
---

# fisheye8k_jozhang97_deta-swin-large-o365

This model is a fine-tuned version of [jozhang97/deta-swin-large-o365](https://huggingface.co/jozhang97/deta-swin-large-o365) on the [Voxel51/fisheye8k dataset](https://huggingface.co/datasets/Voxel51/fisheye8k).
It achieves the following results on the evaluation set:
- Loss: 1.0247

This model is a component of the **Mcity Data Engine**, presented in the paper:

* **Paper:** [Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection](https://huggingface.co/papers/2504.21614)
* **Project Documentation:** [Mcity Data Engine Docs](https://mcity.github.io/mcity_data_engine/)
* **GitHub Repository:** [mcity/mcity_data_engine](https://github.com/mcity/mcity_data_engine)

## Model description

The `fisheye8k_jozhang97_deta-swin-large-o365` model is part of the **Mcity Data Engine**, an open-source system for iterative model improvement through open-vocabulary data selection. The model is based on the DETA architecture with a Swin-Large backbone and is fine-tuned for object detection on fisheye camera data, which is critical for Intelligent Transportation Systems (ITS).

The Mcity Data Engine addresses the challenge of detecting long-tail and novel classes in the large volumes of unlabeled data generated by vehicle fleets and roadside perception systems. This model builds on that work to provide robust object detection in challenging real-world ITS scenarios.

## Intended uses & limitations

This model is intended for object detection tasks in Intelligent Transportation Systems (ITS), particularly for identifying vehicles (Bus, Bike, Car, Truck) and pedestrians in fisheye camera imagery. It is designed to support the continuous improvement of AI models by enabling the detection and curation of rare and novel classes in large, unlabeled datasets.

**Key Use Cases:**

* Object detection in automotive environments, especially under fisheye camera distortion.
* Integration into data pipelines for iterative model retraining and improvement.
* Supporting research and development in autonomous driving and transportation perception.

**Limitations:**

* The model is optimized for fisheye camera perspectives; performance on standard camera views may vary.
* It may exhibit reduced performance on out-of-distribution data or in environmental conditions not well represented in the training data.
* As with any ML model, deployment in safety-critical applications requires rigorous additional testing and validation.
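If you need explicit control over the confidence threshold and post-processing (for example, when screening new fisheye footage for the classes above), the model can also be driven through the lower-level `transformers` API instead of the pipeline shown in the next section. The following is a minimal sketch; the local file name and the 0.5 score threshold are illustrative:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForObjectDetection

model_id = "mcity-data-engine/fisheye8k_jozhang97_deta-swin-large-o365"
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForObjectDetection.from_pretrained(model_id)
model.eval()

image = Image.open("fisheye_frame.png").convert("RGB")  # illustrative local file

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes into thresholded detections in pixel coordinates
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.5, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{model.config.id2label[label.item()]}: {score.item():.2f} at {box.tolist()}")
```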
## Sample Usage

You can use this model directly with the Hugging Face `transformers` library for object detection:

```python
from transformers import pipeline
from PIL import Image
import requests
from io import BytesIO

# Load the object detection pipeline
detector = pipeline("object-detection", model="mcity-data-engine/fisheye8k_jozhang97_deta-swin-large-o365")

# Example image from the Fisheye8K dataset
img_url = "https://huggingface.co/datasets/Voxel51/fisheye8k/resolve/main/data/000000_1.png"
response = requests.get(img_url)
image = Image.open(BytesIO(response.content)).convert("RGB")

# Perform inference
detections = detector(image)

# Print detected objects
for detection in detections:
    print(detection)

# Example output structure (one dict per detection):
# {'score': 0.99, 'label': 'person', 'box': {'xmin': 18, 'ymin': 58, 'xmax': 227, 'ymax': 393}}
```

## Training and evaluation data

This model was fine-tuned on the [Voxel51/fisheye8k dataset](https://huggingface.co/datasets/Voxel51/fisheye8k). The dataset consists of images captured by fisheye cameras in automotive contexts and is crucial for training models that can handle wide-angle distortion and diverse traffic scenarios.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 0
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- num_epochs: 36
- mixed_precision_training: Native AMP

### Training results

| Training Loss | Epoch | Step  | Validation Loss |
|:-------------:|:-----:|:-----:|:---------------:|
| 1.3933        | 1.0   | 5288  | 1.6177          |
| 1.098         | 2.0   | 10576 | 1.2979          |
| 0.9565        | 3.0   | 15864 | 1.2650          |
| 0.8734        | 4.0   | 21152 | 1.2495          |
| 0.8196        | 5.0   | 26440 | 1.1328          |
| 0.7977        | 6.0   | 31728 | 1.3190          |
| 0.8448        | 7.0   | 37016 | 1.3999          |
| 0.7399        | 8.0   | 42304 | 1.3117          |
| 0.6325        | 9.0   | 47592 | 1.1202          |
| 0.621         | 10.0  | 52880 | 1.1707          |
| 0.7134        | 11.0  | 58168 | 1.2353          |
| 0.6425        | 12.0  | 63456 | 1.0416          |
| 0.5935        | 13.0  | 68744 | 0.9215          |
| 0.5798        | 14.0  | 74032 | 1.0827          |
| 0.5924        | 15.0  | 79320 | 1.0398          |
| 0.5559        | 16.0  | 84608 | 1.0112          |
| 0.5783        | 17.0  | 89896 | 1.0434          |
| 0.5536        | 18.0  | 95184 | 1.0247          |

### Framework versions

- Transformers 4.48.3
- Pytorch 2.5.1+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0
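For readers who want to set up a comparable fine-tuning run, the hyperparameters above map onto `transformers.TrainingArguments` roughly as follows. This is a minimal sketch, not the exact Mcity Data Engine training script: the output directory name is illustrative, and per-epoch evaluation is inferred from the results table.

```python
from transformers import TrainingArguments

# Illustrative mapping of the reported hyperparameters onto TrainingArguments;
# the actual fine-tuning code lives in the Mcity Data Engine repository.
training_args = TrainingArguments(
    output_dir="fisheye8k_jozhang97_deta-swin-large-o365",  # illustrative
    learning_rate=5e-05,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    seed=0,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    lr_scheduler_type="cosine",
    num_train_epochs=36,
    fp16=True,              # Native AMP mixed-precision training
    eval_strategy="epoch",  # validation loss was logged once per epoch
)
```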