File size: 12,737 Bytes

---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- image-to-text
- blip
- accessibility
- navigation
- traffic
- vijayawada
- india
- urban-mobility
- visually-impaired
- assistive-technology
- computer-vision
- andhra-pradesh
datasets:
- custom
metrics:
- bleu
- rouge
pipeline_tag: image-to-text
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
  example_title: Sample Traffic Scene
base_model: Salesforce/blip-image-captioning-base
model-index:
- name: vijayawada-traffic-accessibility-v2
  results:
  - task:
      type: image-to-text
      name: Image Captioning
    dataset:
      type: custom
      name: Vijayawada Traffic Scenes
    metrics:
    - type: prediction_success_rate
      value: 100.0
      name: Prediction Success Rate
    - type: traffic_vocabulary_coverage
      value: 50.0
      name: Traffic Vocabulary Coverage
---

# Model Card for Vijayawada Traffic Accessibility Navigation Model

This model is a specialized BLIP (Bootstrapping Language-Image Pre-training) model fine-tuned specifically for traffic scene understanding in Vijayawada, Andhra Pradesh, India. It generates accessibility-focused captions to assist visually impaired users with safe navigation through urban traffic environments.

## Model Details

### Model Description

This model addresses the critical need for localized accessibility technology in Indian urban environments. Fine-tuned on curated traffic scenes from Vijayawada, it understands local traffic patterns, vehicle types, and infrastructure to provide navigation-appropriate descriptions for visually impaired users.

The model specializes in recognizing motorcycles, auto-rickshaws, cars, trucks, and pedestrians while understanding Vijayawada-specific locations like Benz Circle, Railway Station Junction, Eluru Road, and Governorpet areas.

- **Developed by:** Charan Sai Ponnada
- **Funded by [optional]:** Independent research project
- **Shared by [optional]:** Community contribution for accessibility
- **Model type:** Vision-Language Model (Image-to-Text)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model [optional]:** Salesforce/blip-image-captioning-base

### Model Sources [optional]

- **Repository:** https://huggingface.co/Charansaiponnada/vijayawada-traffic-accessibility-v2-fixed
- **Paper [optional]:** [Model documentation available in repository]
- **Demo [optional]:** Interactive widget available on model page

## Uses

### Direct Use

This model is designed for direct integration into accessibility navigation applications for visually impaired users in Vijayawada. It can process real-time camera feeds from mobile devices to provide spoken traffic scene descriptions.

**Primary use cases:**
- Mobile navigation apps with voice guidance
- Real-time traffic scene description for pedestrian navigation
- Integration with existing accessibility tools and screen readers
- Educational tools for traffic awareness training

### Downstream Use [optional]

The model can be fine-tuned further for:
- Extension to other Andhra Pradesh cities
- Integration with GPS and mapping services
- Multilingual caption generation (Telugu language support)
- Enhanced safety features with risk assessment

### Out-of-Scope Use

**This model should NOT be used for:**
- Autonomous vehicle decision-making or control systems
- Medical diagnosis or health-related assessments
- Financial or legal decision-making
- General-purpose image captioning outside of traffic contexts
- Critical safety decisions without human oversight
- Traffic management or control systems

## Bias, Risks, and Limitations

**Geographic Bias:** The model is specifically trained on Vijayawada traffic patterns and may not generalize well to other cities or countries.

**Weather Limitations:** Primarily trained on daylight, clear weather conditions. Performance may degrade in rain, fog, or night conditions.

**Cultural Context:** Optimized for Indian traffic scenarios with specific vehicle types (auto-rickshaws, motorcycles) that may not be common elsewhere.

**Language Limitation:** Currently generates only English descriptions, which may not be the primary language for all Vijayawada users.

**Safety Dependency:** Should never be the sole navigation aid - must be used alongside traditional mobility aids, GPS systems, and human judgment.

### Recommendations

Users should be made aware that:
- This model provides supplementary navigation assistance, not replacement for traditional mobility aids
- Descriptions should be verified with environmental audio cues and other senses
- The model works best in familiar traffic scenarios similar to training data
- Regular updates and retraining may be needed as traffic patterns change
- Integration with local emergency services and support systems is recommended

## How to Get Started with the Model
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

Load the model
processor = BlipProcessor.from_pretrained("Charansaiponnada/vijayawada-traffic-accessibility-v2")
model = BlipForConditionalGeneration.from_pretrained("Charansaiponnada/vijayawada-traffic-accessibility-v2")

Process a traffic image
image = Image.open("vijayawada_traffic_scene.jpg")
inputs = processor(images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_length=128, num_beams=5)
caption = processor.decode(generated_ids, skip_special_tokens=True)

print(f"Traffic description: {caption}")

## Training Details

### Training Data

The model was trained on a carefully curated dataset of 101 traffic scene images from Vijayawada, covering:
- **Geographic Areas:** Benz Circle, Railway Station Junction, Eluru Road, Governorpet, One Town Signal, Patamata Bridge
- **Traffic Elements:** Motorcycles, cars, trucks, auto-rickshaws, pedestrians, road infrastructure
- **Conditions:** Daylight scenes with various traffic densities and road conditions

**Data Quality Control:**
- Manual verification of all images for clarity and relevance
- Traffic-specific keyword filtering and scoring
- Accessibility-focused caption enhancement
- Location-specific context addition

### Training Procedure

#### Preprocessing [optional]

- Image resizing to 384×384 pixels for consistency
- Caption cleaning and validation
- Location context enhancement (adding area-specific information)
- Traffic vocabulary verification and optimization
- Data augmentation with brightness and contrast adjustments (±20%)

#### Training Hyperparameters

- **Training regime:** FP32 precision for stability
- **Optimizer:** AdamW
- **Learning Rate:** 1e-5 (reduced for stability)
- **Batch Size:** 1 (with gradient accumulation of 8 steps)
- **Epochs:** 10 with early stopping
- **Total Training Steps:** 50
- **Warmup Steps:** 10
- **Weight Decay:** 0.01
- **Scheduler:** Cosine annealing

#### Speeds, Sizes, Times [optional]

- **Training Time:** 6.63 minutes (emergency configuration)
- **Model Size:** 990MB
- **Inference Time:** ~2-3 seconds per image on mobile GPU
- **Memory Usage:** ~1.2GB during inference
- **Training Hardware:** Google Colab with NVIDIA GPU

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Test set comprised 10% of the curated Vijayawada traffic dataset (approximately 10 images) representing diverse traffic scenarios across different areas of the city.

#### Factors

Evaluation considered:
- **Geographic Coverage:** Performance across different Vijayawada areas
- **Vehicle Types:** Recognition accuracy for motorcycles, cars, trucks, auto-rickshaws
- **Traffic Density:** Performance in light to heavy traffic conditions
- **Infrastructure Elements:** Recognition of roads, junctions, signals, bridges

#### Metrics

- **Prediction Success Rate:** Percentage of test samples generating valid captions
- **Traffic Vocabulary Coverage:** Proportion of traffic-relevant terms in generated captions
- **Caption Length Consistency:** Average word count for accessibility optimization
- **Quality Assessment:** Manual evaluation using word overlap and context relevance

### Results

| Metric | Value | Interpretation |
|--------|-------|----------------|
| **Prediction Success Rate** | 100% | All test samples generated valid captions |
| **Traffic Vocabulary Coverage** | 50% | Strong understanding of traffic terminology |
| **Average Caption Length** | 5 words | Appropriate for text-to-speech applications |
| **Quality Rating** | 62.5% Good+ | Manual evaluation of caption relevance |

#### Summary

The model demonstrated excellent reliability with 100% prediction success rate and consistent generation of traffic-relevant captions. The 50% traffic vocabulary coverage indicates strong specialization for the intended use case, while the concise caption length (5 words average) is optimal for accessibility applications requiring quick audio feedback.

## Model Examination [optional]

**Sample Predictions Analysis:**

| Input Scene | Generated Caption | Quality Assessment |
|-------------|-------------------|-------------------|
| Governorpet Junction | "motorcycles parked on the road" | Excellent - Accurate vehicle identification and spatial understanding |
| Eluru Road | "the road is dirty" | Excellent - Correct infrastructure condition assessment |
| Railway Station | "the car is yellow in color" | Excellent - Accurate vehicle and color recognition |
| One Town Signal | "three people riding motorcycles on the road" | Good - Correct count and activity recognition |

The model shows strong performance in vehicle recognition and spatial relationship understanding, with particular strength in identifying motorcycles (dominant in Vijayawada traffic).

## Environmental Impact

Carbon emissions were minimized through efficient training on Google Colab infrastructure:

- **Hardware Type:** NVIDIA GPU (Google Colab)
- **Hours used:** 0.11 hours (6.63 minutes)
- **Cloud Provider:** Google Cloud Platform
- **Compute Region:** Global (Google Colab)
- **Carbon Emitted:** Minimal due to short training time and existing infrastructure

## Technical Specifications [optional]

### Model Architecture and Objective

- **Base Architecture:** BLIP (Bootstrapping Language-Image Pre-training)
- **Vision Encoder:** Vision Transformer (ViT)
- **Text Decoder:** BERT-based transformer
- **Fine-tuning Method:** Full model fine-tuning (all parameters updated)
- **Objective:** Cross-entropy loss for caption generation with accessibility focus

### Compute Infrastructure

#### Hardware

- **Training:** Google Colab Pro with NVIDIA GPU
- **Memory:** ~12GB GPU memory available
- **Storage:** Google Drive integration for dataset access

#### Software

- **Framework:** PyTorch with Transformers library
- **Key Dependencies:** 
  - transformers==4.36.0
  - torch==2.1.0
  - datasets==2.15.0
  - accelerate==0.25.0
- **Development Environment:** Google Colab with Python 3.11


**APA:**

Ponnada, C. S. (2025). *Vijayawada Traffic Accessibility Navigation Model*. Hugging Face Model Hub. https://huggingface.co/Charansaiponnada/vijayawada-traffic-accessibility-v2

## Glossary [optional]

- **BLIP:** Bootstrapping Language-Image Pre-training - A vision-language model architecture
- **Traffic Vocabulary Coverage:** Percentage of generated captions containing traffic-specific terminology
- **Accessibility Navigation:** Technology designed to assist visually impaired users with spatial orientation and mobility
- **Auto-rickshaw:** Three-wheeled motorized vehicle common in Indian cities for public transport
- **Fine-tuning:** Process of adapting a pre-trained model to a specific domain or task

## More Information [optional]

This model is part of a broader initiative to create inclusive AI technology for Indian urban environments. The project demonstrates how pre-trained vision-language models can be successfully adapted for specific geographic and cultural contexts to address real-world accessibility challenges.

**Future Development Plans:**
- Extension to other Andhra Pradesh cities
- Telugu language support
- Night and weather condition training data
- Integration with local emergency services
- Community feedback incorporation

## Model Card Authors [optional]

Charan Sai Ponnada - Model development, training, and evaluation

## Model Card Contact

For questions about model integration, accessibility applications, or collaboration opportunities:
- **Repository Issues:** https://huggingface.co/Charansaiponnada/vijayawada-traffic-accessibility-v2/discussions
- **Purpose:** Supporting visually impaired navigation in Vijayawada, Andhra Pradesh
- **Community:** Open to collaboration with accessibility organizations and app developers