metadata

title: Vietnamese Sentiment Analysis
emoji: 🎭
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false

🎭 Vietnamese Sentiment Analysis

A Vietnamese sentiment analysis web interface built with Gradio and transformer models, optimized for Hugging Face Spaces deployment.

🚀 Features

🤖 Transformer-based Model: Uses 5CD-AI/Vietnamese-Sentiment-visobert from Hugging Face Hub
🌐 Interactive Web Interface: Real-time sentiment analysis via Gradio
⚡ Memory Efficient: Built-in memory management and batch processing limits
📊 Visual Analysis: Confidence scores with interactive charts
📝 Batch Processing: Analyze multiple texts at once
🛡️ Memory Management: Real-time memory monitoring and cleanup

🎯 Usage

Single Text Analysis

Enter Vietnamese text in the input field
Click "Analyze Sentiment"
View the sentiment prediction with confidence scores
See probability distribution in the chart

Batch Analysis

Switch to "Batch Analysis" tab
Enter multiple Vietnamese texts (one per line)
Click "Analyze All" to process all texts
View comprehensive batch summary with sentiment distribution

Memory Management

Monitor real-time memory usage
Use "Memory Cleanup" button if needed
Automatic cleanup after each prediction
Maximum 10 texts per batch for efficiency
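
The batch limit above could be enforced before any text reaches the model. The sketch below is an assumption about how such a check might look, not the Space's actual code: `parse_batch`, `summarize`, and `MAX_BATCH_SIZE` are hypothetical names, and truncating an oversized batch (rather than rejecting it) is a choice made here purely for illustration.

```python
from collections import Counter

MAX_BATCH_SIZE = 10  # limit stated in this README; the constant name is hypothetical


def parse_batch(raw_text, limit=MAX_BATCH_SIZE):
    """Split textbox input into non-empty lines and cap the batch size.

    Returns (kept_texts, dropped_count).
    """
    texts = [line.strip() for line in raw_text.splitlines() if line.strip()]
    return texts[:limit], max(0, len(texts) - limit)


def summarize(labels):
    """Return each predicted sentiment's share of the batch, in percent."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}
```

Reporting the number of dropped lines lets the interface tell the user that part of the input was ignored instead of silently discarding it.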

📊 Model Details

Base Model: 5CD-AI/Vietnamese-Sentiment-visobert
Pre-trained Base: 5CD-AI/visobert-14gb-corpus (continually pretrained on 14GB Vietnamese social content)
Architecture: XLM-RoBERTa (Transformer-based)
Language: Vietnamese (optimized for social content)
Parameters: 97.6M (stored as F32 tensors)
Labels: Negative (0), Positive (1), Neutral (2)
Max Sequence Length: 256 tokens (matching original model)
File Format: Safetensors
Task: Text classification
Device: Automatic CUDA/CPU detection
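
As a rough illustration of how the details above fit together, the sketch below turns raw logits into a label and confidence using the 0=Negative, 1=Positive, 2=Neutral mapping. The helper names (`decode_prediction`, `classify`) are hypothetical; `classify` assumes `torch` and `transformers` are installed and that the checkpoint loads with the standard Auto classes, which this README implies but does not show.

```python
import math

ID2LABEL = {0: "Negative", 1: "Positive", 2: "Neutral"}  # mapping from Model Details


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_prediction(logits):
    """Return (label, confidence, per-class probabilities) for one text."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    distribution = {ID2LABEL[i]: p for i, p in enumerate(probs)}
    return ID2LABEL[best], probs[best], distribution


def classify(text, model_name="5CD-AI/Vietnamese-Sentiment-visobert", max_length=256):
    """Download the model and classify one text (needs network access)."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    inputs = tokenizer(text, truncation=True, max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return decode_prediction(logits)
```

In a real app the tokenizer and model would be loaded once at startup rather than on every call; they are loaded inside `classify` here only to keep the sketch self-contained.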

Model Performance

Benchmark Results: Outperformed phobert-base on all benchmarks
F1 Scores: Up to 99.64% on some datasets
Training Dataset: 120K Vietnamese sentiment samples
Evaluation Metric: Weighted F1 score (wf1)

🎯 Fine-Tuning Configuration

Training Parameters (Based on 5CD-AI/Vietnamese-Sentiment-visobert)

Learning Rate: 2e-5 (same as original model)
Batch Size: 16 (train/eval)
Training Epochs: 5 (matching original model training)
Weight Decay: 0.01 (same as original)
Seed: 42 (for reproducibility, matching original)
Gradient Accumulation: 1 step
Optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
Max Sequence Length: 256 tokens (matching original model)

Training Strategy

Evaluation Strategy: Epoch-based evaluation
Save Strategy: Save model at each epoch
Best Model Selection: Based on weighted F1 score (wf1)
Best Model Restoration: Best checkpoint is reloaded at the end of training
Logging: Every 10 steps
Checkpoint Limit: Keep at most 2 checkpoints on disk
Metric: Weighted F1 score (matching original evaluation)
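
The parameters and strategy listed above map fairly directly onto Hugging Face `TrainingArguments`. The following is a sketch under assumptions, not the contents of `fine_tune_sentiment.py`: the dictionary name, the helper `build_training_arguments`, and the metric key `"f1"` (which must match whatever key the metrics function returns) are all hypothetical. Because the evaluation-strategy argument was renamed between `transformers` releases, the helper checks which name the installed version accepts.

```python
import inspect

TRAINING_CONFIG = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "num_train_epochs": 5,
    "weight_decay": 0.01,
    "seed": 42,
    "gradient_accumulation_steps": 1,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "eval_strategy": "epoch",        # evaluate once per epoch
    "save_strategy": "epoch",        # save a checkpoint once per epoch
    "load_best_model_at_end": True,  # reload the best checkpoint after training
    "metric_for_best_model": "f1",   # assumed key for the weighted F1 score
    "greater_is_better": True,
    "logging_steps": 10,
    "save_total_limit": 2,
}


def build_training_arguments(output_dir):
    """Create TrainingArguments from the config (requires transformers)."""
    from transformers import TrainingArguments

    config = dict(TRAINING_CONFIG)
    accepted = inspect.signature(TrainingArguments.__init__).parameters
    if "eval_strategy" not in accepted:
        # Older releases only know the argument as `evaluation_strategy`.
        config["evaluation_strategy"] = config.pop("eval_strategy")
    return TrainingArguments(output_dir=output_dir, **config)
```

With a batch size of 16 and a single gradient accumulation step, the effective batch size per device is 16.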

Data Processing

Tokenization: AutoTokenizer with truncation and padding
Max Length: 256 tokens (matching original model configuration)
Data Collator: DataCollatorWithPadding for dynamic padding
Text Columns: Auto-detection (sentence, text, comment, feedback)
Label Columns: Auto-detection (sentiment, label, labels)
Label Mapping: 0=Negative, 1=Positive, 2=Neutral (matching original)
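
Column auto-detection of the kind described above can be as simple as taking the first candidate name that the dataset actually contains. This is a minimal sketch with hypothetical helper names (`detect_columns`, `tokenize_batch`); the candidate lists come from this README, while the lookup order and error handling are assumptions. Padding is deliberately left out of the tokenizer call so that `DataCollatorWithPadding` can pad each batch dynamically.

```python
TEXT_CANDIDATES = ("sentence", "text", "comment", "feedback")
LABEL_CANDIDATES = ("sentiment", "label", "labels")


def detect_columns(column_names):
    """Return (text_column, label_column) or raise if either is missing."""
    text_col = next((c for c in TEXT_CANDIDATES if c in column_names), None)
    label_col = next((c for c in LABEL_CANDIDATES if c in column_names), None)
    if text_col is None or label_col is None:
        raise ValueError(f"Could not detect text/label columns in {list(column_names)}")
    return text_col, label_col


def tokenize_batch(tokenizer, batch, text_col, max_length=256):
    """Tokenize one batch with truncation only; padding happens in the collator."""
    return tokenizer(batch[text_col], truncation=True, max_length=max_length)
```

With a `datasets.Dataset`, the second helper would typically be applied through `dataset.map(..., batched=True)`.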

📚 Dataset Information

Original Model Training Datasets (120K samples)

The 5CD-AI/Vietnamese-Sentiment-visobert model was trained on comprehensive Vietnamese sentiment datasets:

Academic Datasets:

SA-VLSP2016: Sentiment Analysis VLSP 2016 competition dataset
AIVIVN-2019: AIVIVN 2019 sentiment analysis challenge dataset
UIT-VSFC: Vietnamese Students' Feedback Corpus (UIT)
UIT-VSMEC: Vietnamese Social Media Emotion Corpus (re-labeled)
UIT-ViCTSD: Vietnamese Constructive and Toxic Speech Detection dataset (re-labeled)
UIT-ViHSD: Vietnamese Hate Speech Detection Dataset
UIT-ViSFD: Vietnamese Smartphone Feedback Dataset
UIT-ViOCD: Vietnamese Open-domain Complaint Detection dataset

E-commerce and Social Media Datasets:

Tiki-reviews: Vietnamese e-commerce platform reviews
VOZ-HSD: Vietnamese forum hate speech dataset (re-labeled)
Vietnamese-amazon-polarity: Amazon reviews translated/adapted for Vietnamese

Label Processing:

Some datasets were re-labeled using Gemini 1.5 Flash API for consistency
Final label mapping: 0=Negative, 1=Positive, 2=Neutral

Primary Dataset (for fine-tuning)

Name: uitnlp/vietnamese_students_feedback
Type: Student feedback sentiment analysis
Language: Vietnamese
Labels: 3-way classification (Negative, Neutral, Positive)
Purpose: Recommended for educational domain fine-tuning
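
The label list above orders the classes Negative, Neutral, Positive, whereas the model's mapping is 0=Negative, 1=Positive, 2=Neutral. If the dataset's integer ids follow the listed order (an assumption worth checking against the dataset's own feature definitions), the labels would need remapping before fine-tuning. A small sketch with hypothetical helper names:

```python
MODEL_LABEL2ID = {"negative": 0, "positive": 1, "neutral": 2}  # model's mapping


def build_label_remap(dataset_label_names):
    """Map each dataset label id to the model's id, matching on label name."""
    return {
        idx: MODEL_LABEL2ID[name.lower()]
        for idx, name in enumerate(dataset_label_names)
    }


def remap_labels(labels, remap):
    """Apply the remap to a list of integer labels."""
    return [remap[label] for label in labels]
```

Matching on names rather than hard-coding a swap keeps the helper correct even if a dataset already uses the model's ordering.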

Alternative Datasets (Fallback)

Name: linhtranvi/5cdAI-Vietnamese-sentiment
Type: General Vietnamese sentiment
Purpose: Backup dataset if primary fails

Sample Dataset (Built-in)

If external datasets fail, the system creates a sample dataset with:

Total Samples: 20 Vietnamese texts
Distribution:
- Positive: 8 samples
- Negative: 6 samples
- Neutral: 6 samples
Split: 60% train, 20% validation, 20% test
Content: Educational feedback and reviews
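
For 20 samples, a 60/20/20 split works out to 12 training, 4 validation, and 4 test examples. One way such a split could be produced is sketched below; `split_dataset` is a hypothetical helper, the seed of 42 mirrors the training seed, and the shuffle is not stratified, so class proportions in each part may vary.

```python
import random


def split_dataset(samples, train=0.6, val=0.2, seed=42):
    """Shuffle deterministically and split into train/validation/test lists."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )
```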

Sample Data Examples

# Positive examples
"Giảng viên dạy rất hay và tâm huyết, tôi học được nhiều kiến thức bổ ích."
"Môn học này rất thú vị và practical, giúp tôi áp dụng được vào thực tế."

# Negative examples
"Môn học quá khó và nhàm chán, không có gì để học cả."
"Giảng viên dạy không rõ ràng, tốc độ quá nhanh, không theo kịp."

# Neutral examples
"Môn học ổn định, không có gì đặc biệt để nhận xét."
"Nội dung cơ bản, phù hợp với chương trình đề ra."

📈 Model Performance & Evaluation

Metrics Tracked

Accuracy: Overall prediction accuracy
F1 Score: Weighted F1 score (primary metric)
Precision: Weighted precision
Recall: Weighted recall
Training Loss: Loss progression over epochs
Evaluation Loss: Validation loss per epoch
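
Weighted metrics average the per-class scores using each class's share of the true labels as the weight. The dependency-free sketch below illustrates that computation; it is not the project's evaluation code, which presumably relies on scikit-learn, and the function name is hypothetical.

```python
def weighted_metrics(y_true, y_pred):
    """Compute accuracy plus support-weighted precision, recall, and F1."""
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = (tp + fn) / total  # class support as a fraction of all samples
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(1 for t, p in pairs if t == p) / total
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Weighted recall always equals accuracy, which is a handy sanity check when reading the evaluation output.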

Evaluation Output

Classification Report: Detailed per-class metrics
Confusion Matrix: Visual confusion matrix saved as PNG
Training History: Loss and F1 plots saved as PNG
Best Model: Saved based on highest F1 score

Expected Performance

Target F1 Score: >0.90 on validation set (original model achieves up to 99.64%)
Target Accuracy: >0.90 on validation set
Training Time: ~15-30 minutes (depending on hardware)
Memory Usage: ~2-4GB during training
Benchmark Performance: Original model outperformed phobert-base on all Vietnamese sentiment benchmarks
Model Size: 97.6M parameters for efficient deployment

💡 Example Usage

Try these example Vietnamese texts:

"Giảng viên dạy rất hay và tâm huyết." (Positive)
"Môn học này quá khó và nhàm chán." (Negative)
"Lớp học ổn định, không có gì đặc biệt." (Neutral)

🛠️ Technical Features

Memory Optimization

Automatic GPU cache clearing
Garbage collection management
Memory usage monitoring
Batch size limits
Real-time memory tracking
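
A cleanup routine along these lines usually combines Python garbage collection with clearing the CUDA cache. The sketch below is an assumption about what that might look like, not the Space's implementation; both function names are hypothetical, and the code degrades gracefully when `torch` or `psutil` is not installed.

```python
import gc


def cleanup_memory():
    """Run garbage collection and, if a GPU is present, clear the CUDA cache."""
    collected = gc.collect()
    cuda_cleared = False
    try:
        import torch

        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            cuda_cleared = True
    except ImportError:
        pass  # CPU-only environment without PyTorch
    return {"collected_objects": collected, "cuda_cache_cleared": cuda_cleared}


def memory_usage_mb():
    """Return this process's resident memory in MiB, or None without psutil."""
    try:
        import psutil
    except ImportError:
        return None
    return psutil.Process().memory_info().rss / (1024 ** 2)
```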

Performance

~100ms processing time per text
Inputs are truncated to the configured maximum of 256 tokens (see Model Details)
Efficient batch processing
Memory limit: 8GB (Hugging Face Spaces)

📁 Project Structure

SentimentAnalysis/
├── app.py                          # Main Hugging Face Spaces app
├── train.py                        # Training entry point
├── test.py                         # Testing entry point
├── demo.py                         # Demo entry point
├── web.py                          # Web interface entry point
├── main.py                         # Main program entry point
├── requirements.txt                # Python dependencies
├── requirements_spaces.txt         # Hugging Face Spaces dependencies
├── .space.yaml                     # Hugging Face Spaces configuration
├── .gitignore                      # Git ignore rules
├── README.md                       # This file
├── py/                             # Core Python modules
│   ├── fine_tune_sentiment.py      # Fine-tuning implementation
│   ├── test_model.py               # Model testing utilities
│   └── demo.py                     # Demo implementation
├── pdf/                            # Documentation (paper.tex only)
│   └── paper.tex                   # LaTeX paper (only tracked file)
├── vietnamese_sentiment_finetuned/ # Fine-tuned model output (if trained)
├── training_history.png            # Training history plot
├── confusion_matrix.png            # Confusion matrix visualization
└── deploy_package/                 # Deployment artifacts

🔬 Model Training & Fine-Tuning

How to Fine-Tune the Model

Using the training script:
```
python train.py
```

Direct fine-tuning (Recommended - matches original model config):

```python
from py.fine_tune_sentiment import SentimentFineTuner

# Initialize fine-tuner with original model
fine_tuner = SentimentFineTuner()

# Run complete fine-tuning pipeline with original parameters
fine_tuner.run_fine_tuning(
    output_dir="./vietnamese_sentiment_finetuned",
    learning_rate=2e-5,  # Same as original model
    batch_size=16,       # Recommended batch size
    num_epochs=5         # Same as original model
)
```

Custom configuration:

```python
from py.fine_tune_sentiment import SentimentFineTuner

# Initialize fine-tuner with original model
fine_tuner = SentimentFineTuner()

# Load model and tokenizer
fine_tuner.load_model_and_tokenizer()

# Load and prepare dataset
fine_tuner.load_and_prepare_dataset()

# Tokenize datasets
fine_tuner.tokenize_datasets()

# Set up custom training (matching original optimizer config)
fine_tuner.setup_trainer(
    output_dir="./custom_model",
    learning_rate=2e-5,  # Original learning rate
    batch_size=16,       # Standard batch size
    num_epochs=5         # Same as original model
)

# Train and evaluate
fine_tuner.train_model()
eval_results, y_pred, y_true = fine_tuner.evaluate_model()
```

Training Outputs

Model Files: Saved to specified output directory
Tokenizer: Saved with model configuration
Training History: training_history.png
Confusion Matrix: confusion_matrix.png
Logs: Training logs in {output_dir}/logs/

Fine-Tuning Features

Automatic Dataset Loading: Supports multiple Vietnamese datasets
Flexible Column Detection: Auto-detects text and label columns
Fallback Sample Dataset: Built-in dataset if external fails
Comprehensive Evaluation: Multiple metrics and visualizations
Memory Efficient: Optimized for limited resources

📋 Model Capabilities

The model provides:

Sentiment Classification: Positive, Neutral, Negative
Confidence Scores: Probability distribution across classes
Real-time Processing: Fast inference on CPU/GPU
Batch Analysis: Efficient processing of multiple texts

🔧 Deployment

This Space is configured for Hugging Face Spaces with:

SDK: Gradio 4.44.0
Hardware: CPU (with CUDA support if available)
Memory: 8GB limit with optimization
Model Loading: Direct from Hugging Face Hub

📄 Requirements

See requirements.txt for complete dependency list:

Core Dependencies

torch>=2.0.0: PyTorch for deep learning
transformers>=4.21.0: Hugging Face transformers
gradio>=4.44.0: Web interface framework
psutil: System and process monitoring

Fine-Tuning Dependencies

datasets: Hugging Face datasets for loading training data
scikit-learn: Machine learning metrics and evaluation
pandas: Data manipulation and analysis
numpy: Numerical computing
matplotlib: Plotting and visualization
seaborn: Statistical data visualization
tqdm: Progress bars for training

Installation

```
pip install -r requirements.txt
```

For fine-tuning specifically:

```
pip install torch transformers datasets scikit-learn pandas numpy matplotlib seaborn tqdm psutil gradio
```

🎯 Use Cases

Education: Analyze student feedback
Customer Service: Analyze customer reviews
Social Media: Monitor sentiment in posts
Research: Vietnamese text analysis
Business: Customer sentiment tracking

🔍 Troubleshooting

Memory Issues

Use "Memory Cleanup" button
Reduce batch size
Refresh the page if needed

Model Loading

Model loads automatically from Hugging Face Hub
No local training required
Automatic fallback to CPU if GPU unavailable
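
The CPU fallback mentioned above is typically a one-line check. A minimal sketch, with a hypothetical helper name, that also tolerates environments where PyTorch is missing:

```python
def pick_device():
    """Return "cuda" when a GPU is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

The returned string can be passed to `model.to(...)` and to the tokenized input tensors so that both live on the same device.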

Performance Tips

Clear, grammatically correct Vietnamese text works best
Longer texts (20-200 words) provide better context
Use batch processing for multiple texts

📝 Citation

If you use this model or Space, please cite the UIT-VSFC corpus paper:

```bibtex
@InProceedings{8573337,
  author={Nguyen, Kiet Van and Nguyen, Vu Duc and Nguyen, Phu X. V. and Truong, Tham T. H. and Nguyen, Ngan Luu-Thuy},
  booktitle={2018 10th International Conference on Knowledge and Systems Engineering (KSE)},
  title={UIT-VSFC: Vietnamese Students' Feedback Corpus for Sentiment Analysis},
  year={2018},
  pages={19-24},
  doi={10.1109/KSE.2018.8573337}
}
```

🤝 Contributing

Feel free to:

Submit issues and feedback
Suggest improvements
Report bugs
Request new features

📄 License

This Space uses open-source components under the MIT license.

Try it now! Enter some Vietnamese text above to see the sentiment analysis in action. 🎭