--- |
|
|
language: |
|
|
- en |
|
|
license: apache-2.0 |
|
|
library_name: transformers |
|
|
tags: |
|
|
- image-to-text |
|
|
- blip |
|
|
- accessibility |
|
|
- navigation |
|
|
- traffic |
|
|
- vijayawada |
|
|
- india |
|
|
- urban-mobility |
|
|
- visually-impaired |
|
|
- assistive-technology |
|
|
- computer-vision |
|
|
- andhra-pradesh |
|
|
datasets: |
|
|
- custom |
|
|
metrics: |
|
|
- bleu |
|
|
- rouge |
|
|
pipeline_tag: image-to-text |
|
|
widget: |
|
|
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg |
|
|
example_title: Sample Traffic Scene |
|
|
base_model: Salesforce/blip-image-captioning-base |
|
|
model-index: |
|
|
- name: vijayawada-traffic-accessibility-v2 |
|
|
results: |
|
|
- task: |
|
|
type: image-to-text |
|
|
name: Image Captioning |
|
|
dataset: |
|
|
type: custom |
|
|
name: Vijayawada Traffic Scenes |
|
|
metrics: |
|
|
- type: prediction_success_rate |
|
|
value: 100.0 |
|
|
name: Prediction Success Rate |
|
|
- type: traffic_vocabulary_coverage |
|
|
value: 50.0 |
|
|
name: Traffic Vocabulary Coverage |
|
|
--- |
|
|
|
|
|
# Model Card for Vijayawada Traffic Accessibility Navigation Model |
|
|
|
|
|
This model is a specialized BLIP (Bootstrapping Language-Image Pre-training) model fine-tuned specifically for traffic scene understanding in Vijayawada, Andhra Pradesh, India. It generates accessibility-focused captions to assist visually impaired users with safe navigation through urban traffic environments. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
This model addresses the critical need for localized accessibility technology in Indian urban environments. Fine-tuned on curated traffic scenes from Vijayawada, it understands local traffic patterns, vehicle types, and infrastructure to provide navigation-appropriate descriptions for visually impaired users. |
|
|
|
|
|
The model specializes in recognizing motorcycles, auto-rickshaws, cars, trucks, and pedestrians while understanding Vijayawada-specific locations like Benz Circle, Railway Station Junction, Eluru Road, and Governorpet areas. |
|
|
|
|
|
- **Developed by:** Charan Sai Ponnada |
|
|
- **Funded by [optional]:** Independent research project |
|
|
- **Shared by [optional]:** Community contribution for accessibility |
|
|
- **Model type:** Vision-Language Model (Image-to-Text) |
|
|
- **Language(s) (NLP):** English |
|
|
- **License:** Apache 2.0 |
|
|
- **Finetuned from model [optional]:** Salesforce/blip-image-captioning-base |
|
|
|
|
|
### Model Sources [optional] |
|
|
|
|
|
- **Repository:** https://huggingface.co/Charansaiponnada/vijayawada-traffic-accessibility-v2-fixed |
|
|
- **Paper [optional]:** [Model documentation available in repository] |
|
|
- **Demo [optional]:** Interactive widget available on model page |
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
This model is designed for direct integration into accessibility navigation applications for visually impaired users in Vijayawada. It can process real-time camera feeds from mobile devices to provide spoken traffic scene descriptions; a minimal voice-guidance sketch follows the use cases below.
|
|
|
|
|
**Primary use cases:** |
|
|
- Mobile navigation apps with voice guidance |
|
|
- Real-time traffic scene description for pedestrian navigation |
|
|
- Integration with existing accessibility tools and screen readers |
|
|
- Educational tools for traffic awareness training |
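
The snippet below sketches how a generated caption could be routed to a text-to-speech engine for voice guidance. It is a minimal illustration, not a production pipeline: `pyttsx3` is one offline TTS option among many, and `camera_frame.jpg` stands in for a frame captured from a device camera.

```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import pyttsx3  # offline text-to-speech; any TTS engine could be substituted

processor = BlipProcessor.from_pretrained("Charansaiponnada/vijayawada-traffic-accessibility-v2")
model = BlipForConditionalGeneration.from_pretrained("Charansaiponnada/vijayawada-traffic-accessibility-v2")
engine = pyttsx3.init()

def describe_frame(frame: Image.Image) -> str:
    """Generate a short traffic description for a single camera frame."""
    inputs = processor(images=frame.convert("RGB"), return_tensors="pt")
    generated_ids = model.generate(**inputs, max_length=128, num_beams=5)
    return processor.decode(generated_ids[0], skip_special_tokens=True)

frame = Image.open("camera_frame.jpg")  # placeholder for a live camera frame
caption = describe_frame(frame)
engine.say(caption)
engine.runAndWait()
```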
|
|
|
|
|
### Downstream Use [optional] |
|
|
|
|
|
The model can be fine-tuned further for: |
|
|
- Extension to other Andhra Pradesh cities |
|
|
- Integration with GPS and mapping services |
|
|
- Multilingual caption generation (Telugu language support) |
|
|
- Enhanced safety features with risk assessment |
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
**This model should NOT be used for:** |
|
|
- Autonomous vehicle decision-making or control systems |
|
|
- Medical diagnosis or health-related assessments |
|
|
- Financial or legal decision-making |
|
|
- General-purpose image captioning outside of traffic contexts |
|
|
- Critical safety decisions without human oversight |
|
|
- Traffic management or control systems |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
**Geographic Bias:** The model is specifically trained on Vijayawada traffic patterns and may not generalize well to other cities or countries. |
|
|
|
|
|
**Weather Limitations:** Primarily trained on daylight, clear weather conditions. Performance may degrade in rain, fog, or night conditions. |
|
|
|
|
|
**Cultural Context:** Optimized for Indian traffic scenarios with specific vehicle types (auto-rickshaws, motorcycles) that may not be common elsewhere. |
|
|
|
|
|
**Language Limitation:** Currently generates only English descriptions, which may not be the primary language for all Vijayawada users. |
|
|
|
|
|
**Safety Dependency:** The model should never be the sole navigation aid; it must be used alongside traditional mobility aids, GPS systems, and human judgment.
|
|
|
|
|
### Recommendations |
|
|
|
|
|
Users should be made aware that: |
|
|
- This model provides supplementary navigation assistance, not a replacement for traditional mobility aids
|
|
- Descriptions should be verified with environmental audio cues and other senses |
|
|
- The model works best in familiar traffic scenarios similar to training data |
|
|
- Regular updates and retraining may be needed as traffic patterns change |
|
|
- Integration with local emergency services and support systems is recommended |
|
|
|
|
|
## How to Get Started with the Model |
|
|
```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

# Load the fine-tuned model and processor
processor = BlipProcessor.from_pretrained("Charansaiponnada/vijayawada-traffic-accessibility-v2")
model = BlipForConditionalGeneration.from_pretrained("Charansaiponnada/vijayawada-traffic-accessibility-v2")

# Process a traffic image and generate a caption
image = Image.open("vijayawada_traffic_scene.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_length=128, num_beams=5)

# generate() returns a batch of sequences, so decode the first one
caption = processor.decode(generated_ids[0], skip_special_tokens=True)

print(f"Traffic description: {caption}")
```
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model was trained on a carefully curated dataset of 101 traffic scene images from Vijayawada, covering: |
|
|
- **Geographic Areas:** Benz Circle, Railway Station Junction, Eluru Road, Governorpet, One Town Signal, Patamata Bridge |
|
|
- **Traffic Elements:** Motorcycles, cars, trucks, auto-rickshaws, pedestrians, road infrastructure |
|
|
- **Conditions:** Daylight scenes with various traffic densities and road conditions |
|
|
|
|
|
**Data Quality Control:** |
|
|
- Manual verification of all images for clarity and relevance |
|
|
- Traffic-specific keyword filtering and scoring |
|
|
- Accessibility-focused caption enhancement |
|
|
- Location-specific context addition |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
#### Preprocessing [optional] |
|
|
|
|
|
- Image resizing to 384×384 pixels for consistency |
|
|
- Caption cleaning and validation |
|
|
- Location context enhancement (adding area-specific information) |
|
|
- Traffic vocabulary verification and optimization |
|
|
- Data augmentation with brightness and contrast adjustments (±20%), as sketched below
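
A minimal sketch of the resize-and-augment step described above, assuming torchvision; the exact preprocessing code used in training is not published, so treat this as illustrative.

```python
from PIL import Image
from torchvision import transforms

# Resize to the 384x384 BLIP input size, then apply the +/-20%
# brightness/contrast jitter used for augmentation.
augment = transforms.Compose([
    transforms.Resize((384, 384)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

image = Image.open("vijayawada_traffic_scene.jpg").convert("RGB")
augmented = augment(image)
```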
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
- **Training regime:** FP32 precision for stability |
|
|
- **Optimizer:** AdamW |
|
|
- **Learning Rate:** 1e-5 (reduced for stability) |
|
|
- **Batch Size:** 1 (with gradient accumulation of 8 steps) |
|
|
- **Epochs:** 10 with early stopping |
|
|
- **Total Training Steps:** 50 |
|
|
- **Warmup Steps:** 10 |
|
|
- **Weight Decay:** 0.01 |
|
|
- **Scheduler:** Cosine annealing (see the configuration sketch below)
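
The training script is not included in the repository; the snippet below shows one plausible mapping of these hyperparameters onto `transformers.TrainingArguments`, for readers who want to reproduce a similar setup.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="vijayawada-traffic-accessibility-v2",
    optim="adamw_torch",             # AdamW optimizer
    learning_rate=1e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size of 8
    num_train_epochs=10,             # early stopping would use EarlyStoppingCallback
    max_steps=50,                    # caps total training steps
    warmup_steps=10,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    fp16=False,                      # FP32 precision for stability
)
```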
|
|
|
|
|
#### Speeds, Sizes, Times [optional] |
|
|
|
|
|
- **Training Time:** 6.63 minutes (emergency configuration) |
|
|
- **Model Size:** 990MB |
|
|
- **Inference Time:** ~2-3 seconds per image on mobile GPU (see the timing sketch below)
|
|
- **Memory Usage:** ~1.2GB during inference |
|
|
- **Training Hardware:** Google Colab with NVIDIA GPU |
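
The inference figure above can be checked with a rough timing snippet like the one below; it assumes `model` and `inputs` are already set up as in the quick-start example, and results will vary with hardware.

```python
import time
import torch

# Measure single-image caption latency (reuses `model` and `inputs`
# from the "How to Get Started" example above).
with torch.no_grad():
    start = time.perf_counter()
    generated_ids = model.generate(**inputs, max_length=128, num_beams=5)
elapsed = time.perf_counter() - start
print(f"Inference time: {elapsed:.2f} s per image")
```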
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
|
|
#### Testing Data |
|
|
|
|
|
The test set comprised 10% of the curated Vijayawada traffic dataset (approximately 10 images), representing diverse traffic scenarios across different areas of the city.
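
One way to produce such a split with the `datasets` library is sketched below; the actual split procedure is not published, and `traffic_scenes` is a placeholder for the local image directory.

```python
from datasets import load_dataset

# Load the curated images and hold out 10% for evaluation.
dataset = load_dataset("imagefolder", data_dir="traffic_scenes")["train"]
split = dataset.train_test_split(test_size=0.1, seed=42)
train_set, test_set = split["train"], split["test"]
```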
|
|
|
|
|
#### Factors |
|
|
|
|
|
Evaluation considered: |
|
|
- **Geographic Coverage:** Performance across different Vijayawada areas |
|
|
- **Vehicle Types:** Recognition accuracy for motorcycles, cars, trucks, auto-rickshaws |
|
|
- **Traffic Density:** Performance in light to heavy traffic conditions |
|
|
- **Infrastructure Elements:** Recognition of roads, junctions, signals, bridges |
|
|
|
|
|
#### Metrics |
|
|
|
|
|
- **Prediction Success Rate:** Percentage of test samples generating valid captions |
|
|
- **Traffic Vocabulary Coverage:** Percentage of generated captions containing traffic-relevant terminology (computed as sketched below)
|
|
- **Caption Length Consistency:** Average word count for accessibility optimization |
|
|
- **Quality Assessment:** Manual evaluation using word overlap and context relevance |
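
Following the glossary definition, traffic vocabulary coverage can be computed as the fraction of generated captions containing at least one traffic-specific term. The sketch below is hypothetical: the keyword list is illustrative, not the vocabulary used in evaluation.

```python
# Hypothetical traffic keyword list; the evaluation vocabulary is not published.
TRAFFIC_TERMS = {
    "motorcycle", "motorcycles", "car", "cars", "truck", "trucks",
    "auto-rickshaw", "rickshaw", "pedestrian", "road", "junction",
    "signal", "bridge", "traffic",
}

def traffic_vocabulary_coverage(captions: list[str]) -> float:
    """Percentage of captions containing at least one traffic term."""
    hits = sum(
        any(term in caption.lower() for term in TRAFFIC_TERMS)
        for caption in captions
    )
    return 100.0 * hits / len(captions)

# Example: one of two captions mentions traffic vocabulary -> 50.0
print(traffic_vocabulary_coverage(["motorcycles parked on the road", "a sunny day"]))
```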
|
|
|
|
|
### Results |
|
|
|
|
|
| Metric | Value | Interpretation | |
|
|
|--------|-------|----------------| |
|
|
| **Prediction Success Rate** | 100% | All test samples generated valid captions | |
|
|
| **Traffic Vocabulary Coverage** | 50% | Strong understanding of traffic terminology | |
|
|
| **Average Caption Length** | 5 words | Appropriate for text-to-speech applications | |
|
|
| **Quality Rating** | 62.5% rated Good or better | Manual evaluation of caption relevance |
|
|
|
|
|
#### Summary |
|
|
|
|
|
The model demonstrated excellent reliability with 100% prediction success rate and consistent generation of traffic-relevant captions. The 50% traffic vocabulary coverage indicates strong specialization for the intended use case, while the concise caption length (5 words average) is optimal for accessibility applications requiring quick audio feedback. |
|
|
|
|
|
## Model Examination [optional] |
|
|
|
|
|
**Sample Predictions Analysis:** |
|
|
|
|
|
| Input Scene | Generated Caption | Quality Assessment | |
|
|
|-------------|-------------------|-------------------| |
|
|
| Governorpet Junction | "motorcycles parked on the road" | Excellent - Accurate vehicle identification and spatial understanding | |
|
|
| Eluru Road | "the road is dirty" | Excellent - Correct infrastructure condition assessment | |
|
|
| Railway Station | "the car is yellow in color" | Excellent - Accurate vehicle and color recognition | |
|
|
| One Town Signal | "three people riding motorcycles on the road" | Good - Correct count and activity recognition | |
|
|
|
|
|
The model shows strong performance in vehicle recognition and spatial relationship understanding, with particular strength in identifying motorcycles (dominant in Vijayawada traffic). |
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
Carbon emissions were minimized through efficient training on Google Colab infrastructure: |
|
|
|
|
|
- **Hardware Type:** NVIDIA GPU (Google Colab) |
|
|
- **Hours used:** 0.11 hours (6.63 minutes) |
|
|
- **Cloud Provider:** Google Cloud Platform |
|
|
- **Compute Region:** Global (Google Colab) |
|
|
- **Carbon Emitted:** Minimal due to short training time and existing infrastructure |
|
|
|
|
|
## Technical Specifications [optional] |
|
|
|
|
|
### Model Architecture and Objective |
|
|
|
|
|
- **Base Architecture:** BLIP (Bootstrapping Language-Image Pre-training) |
|
|
- **Vision Encoder:** Vision Transformer (ViT) |
|
|
- **Text Decoder:** BERT-based transformer |
|
|
- **Fine-tuning Method:** Full model fine-tuning (all parameters updated) |
|
|
- **Objective:** Cross-entropy loss for caption generation with accessibility focus |
|
|
|
|
|
### Compute Infrastructure |
|
|
|
|
|
#### Hardware |
|
|
|
|
|
- **Training:** Google Colab Pro with NVIDIA GPU |
|
|
- **Memory:** ~12GB GPU memory available |
|
|
- **Storage:** Google Drive integration for dataset access |
|
|
|
|
|
#### Software |
|
|
|
|
|
- **Framework:** PyTorch with Transformers library |
|
|
- **Key Dependencies:** |
|
|
- transformers==4.36.0 |
|
|
- torch==2.1.0 |
|
|
- datasets==2.15.0 |
|
|
- accelerate==0.25.0 |
|
|
- **Development Environment:** Google Colab with Python 3.11 |
|
|
|
|
|
|
|
|
## Citation [optional]

**APA:**
|
|
|
|
|
Ponnada, C. S. (2025). *Vijayawada Traffic Accessibility Navigation Model*. Hugging Face Model Hub. https://huggingface.co/Charansaiponnada/vijayawada-traffic-accessibility-v2 |
|
|
|
|
|
## Glossary [optional] |
|
|
|
|
|
- **BLIP:** Bootstrapping Language-Image Pre-training - A vision-language model architecture |
|
|
- **Traffic Vocabulary Coverage:** Percentage of generated captions containing traffic-specific terminology |
|
|
- **Accessibility Navigation:** Technology designed to assist visually impaired users with spatial orientation and mobility |
|
|
- **Auto-rickshaw:** Three-wheeled motorized vehicle common in Indian cities for public transport |
|
|
- **Fine-tuning:** Process of adapting a pre-trained model to a specific domain or task |
|
|
|
|
|
## More Information [optional] |
|
|
|
|
|
This model is part of a broader initiative to create inclusive AI technology for Indian urban environments. The project demonstrates how pre-trained vision-language models can be successfully adapted for specific geographic and cultural contexts to address real-world accessibility challenges. |
|
|
|
|
|
**Future Development Plans:** |
|
|
- Extension to other Andhra Pradesh cities |
|
|
- Telugu language support |
|
|
- Night and weather condition training data |
|
|
- Integration with local emergency services |
|
|
- Community feedback incorporation |
|
|
|
|
|
## Model Card Authors [optional] |
|
|
|
|
|
Charan Sai Ponnada - Model development, training, and evaluation |
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
For questions about model integration, accessibility applications, or collaboration opportunities: |
|
|
- **Repository Issues:** https://huggingface.co/Charansaiponnada/vijayawada-traffic-accessibility-v2/discussions |
|
|
- **Purpose:** Supporting visually impaired navigation in Vijayawada, Andhra Pradesh |
|
|
- **Community:** Open to collaboration with accessibility organizations and app developers |
|
|
|
|
|
|