---
base_model: jozhang97/deta-swin-large-o365
datasets:
- Voxel51/fisheye8k
library_name: transformers
tags:
- generated_from_trainer
- deta
- swin
- traffic
- automotive
- ITS
- computer-vision
pipeline_tag: object-detection
license: mit
model-index:
- name: fisheye8k_jozhang97_deta-swin-large-o365
  results: []
---

# fisheye8k_jozhang97_deta-swin-large-o365

This model is a fine-tuned version of [jozhang97/deta-swin-large-o365](https://huggingface.co/jozhang97/deta-swin-large-o365) on the [Voxel51/fisheye8k dataset](https://huggingface.co/datasets/Voxel51/fisheye8k).
It achieves the following results on the evaluation set:
- Loss: 1.0247

This model is a component of the **Mcity Data Engine**, presented in the paper:

* **Paper:** [Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection](https://huggingface.co/papers/2504.21614)
* **Project Documentation:** [Mcity Data Engine Docs](https://mcity.github.io/mcity_data_engine/)
* **GitHub Repository:** [mcity/mcity_data_engine](https://github.com/mcity/mcity_data_engine)

## Model description

The `fisheye8k_jozhang97_deta-swin-large-o365` model is part of the **Mcity Data Engine**, an open-source system for iterative model improvement through open-vocabulary data selection. The model is based on the DETA architecture with a Swin-Large backbone and is fine-tuned for object detection on fisheye camera data, which is critical for Intelligent Transportation Systems (ITS).

The Mcity Data Engine addresses the challenge of detecting long-tail and novel classes in the large volumes of unlabeled data generated by vehicle fleets and roadside perception systems. This model builds on that work to provide robust object detection in challenging real-world ITS scenarios.

## Intended uses & limitations

This model is intended for object detection tasks in Intelligent Transportation Systems (ITS), particularly for identifying vehicles (Bus, Bike, Car, Truck) and pedestrians in fisheye camera imagery. It is designed to support the continuous improvement of AI models by enabling the detection and curation of rare and novel classes in large, unlabeled datasets.

**Key Use Cases:**

* Object detection in automotive environments, especially under fisheye camera distortion.
* Integration into data pipelines for iterative model retraining and improvement.
* Supporting research and development in autonomous driving and transportation perception.

**Limitations:**

* The model is optimized for fisheye camera perspectives; performance on standard camera views may vary.
* It may exhibit reduced performance on out-of-distribution data or in environmental conditions not well represented in the training data.
* As with any ML model, deployment in safety-critical applications requires rigorous additional testing and validation.
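If you need explicit control over the confidence threshold and post-processing (for example, when screening new fisheye footage for the classes above), the model can also be driven through the lower-level `transformers` API instead of the pipeline shown in the next section. The following is a minimal sketch; the local file name and the 0.5 score threshold are illustrative:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForObjectDetection

model_id = "mcity-data-engine/fisheye8k_jozhang97_deta-swin-large-o365"
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForObjectDetection.from_pretrained(model_id)
model.eval()

image = Image.open("fisheye_frame.png").convert("RGB")  # illustrative local file

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes into thresholded detections in pixel coordinates
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.5, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{model.config.id2label[label.item()]}: {score.item():.2f} at {box.tolist()}")
```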
## Sample Usage

You can use this model directly with the Hugging Face `transformers` library for object detection:

```python
from transformers import pipeline
from PIL import Image
import requests
from io import BytesIO

# Load the object detection pipeline
detector = pipeline("object-detection", model="mcity-data-engine/fisheye8k_jozhang97_deta-swin-large-o365")

# Example image from the Fisheye8K dataset
img_url = "https://huggingface.co/datasets/Voxel51/fisheye8k/resolve/main/data/000000_1.png"
response = requests.get(img_url)
image = Image.open(BytesIO(response.content)).convert("RGB")

# Perform inference
detections = detector(image)

# Print detected objects
for detection in detections:
    print(detection)

# Example output structure (one dict per detection):
# {'score': 0.99, 'label': 'person', 'box': {'xmin': 18, 'ymin': 58, 'xmax': 227, 'ymax': 393}}
```

## Training and evaluation data

This model was fine-tuned on the [Voxel51/fisheye8k dataset](https://huggingface.co/datasets/Voxel51/fisheye8k). The dataset consists of images captured by fisheye cameras in automotive contexts and is crucial for training models that can handle wide-angle distortion and diverse traffic scenarios.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 0
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- num_epochs: 36
- mixed_precision_training: Native AMP

### Training results

| Training Loss | Epoch | Step  | Validation Loss |
|:-------------:|:-----:|:-----:|:---------------:|
| 1.3933        | 1.0   | 5288  | 1.6177          |
| 1.098         | 2.0   | 10576 | 1.2979          |
| 0.9565        | 3.0   | 15864 | 1.2650          |
| 0.8734        | 4.0   | 21152 | 1.2495          |
| 0.8196        | 5.0   | 26440 | 1.1328          |
| 0.7977        | 6.0   | 31728 | 1.3190          |
| 0.8448        | 7.0   | 37016 | 1.3999          |
| 0.7399        | 8.0   | 42304 | 1.3117          |
| 0.6325        | 9.0   | 47592 | 1.1202          |
| 0.621         | 10.0  | 52880 | 1.1707          |
| 0.7134        | 11.0  | 58168 | 1.2353          |
| 0.6425        | 12.0  | 63456 | 1.0416          |
| 0.5935        | 13.0  | 68744 | 0.9215          |
| 0.5798        | 14.0  | 74032 | 1.0827          |
| 0.5924        | 15.0  | 79320 | 1.0398          |
| 0.5559        | 16.0  | 84608 | 1.0112          |
| 0.5783        | 17.0  | 89896 | 1.0434          |
| 0.5536        | 18.0  | 95184 | 1.0247          |

### Framework versions

- Transformers 4.48.3
- Pytorch 2.5.1+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0
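For readers who want to set up a comparable fine-tuning run, the hyperparameters above map onto `transformers.TrainingArguments` roughly as follows. This is a minimal sketch, not the exact Mcity Data Engine training script: the output directory name is illustrative, and per-epoch evaluation is inferred from the results table.

```python
from transformers import TrainingArguments

# Illustrative mapping of the reported hyperparameters onto TrainingArguments;
# the actual fine-tuning code lives in the Mcity Data Engine repository.
training_args = TrainingArguments(
    output_dir="fisheye8k_jozhang97_deta-swin-large-o365",  # illustrative
    learning_rate=5e-05,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    seed=0,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    lr_scheduler_type="cosine",
    num_train_epochs=36,
    fp16=True,              # Native AMP mixed-precision training
    eval_strategy="epoch",  # validation loss was logged once per epoch
)
```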