# Fine-Tuning System Implementation Plan

## Overview

Implement an active learning system that collects admin corrections, builds a training dataset, and fine-tunes the BART classification model using LoRA (Low-Rank Adaptation).

---

## Phase 1: Training Data Collection Infrastructure

### 1.1 Database Schema Extensions

**New Model: `TrainingExample`**

- `id` (Integer, PK)
- `submission_id` (Integer, FK to Submission)
- `message` (Text) - snapshot of submission text
- `original_category` (String, nullable) - AI's initial prediction
- `corrected_category` (String) - Admin's correction
- `contributor_type` (String)
- `correction_timestamp` (DateTime)
- `confidence_score` (Float, nullable) - original prediction confidence
- `used_in_training` (Boolean, default=False) - track if used in fine-tuning
- `training_run_id` (Integer, nullable, FK) - which training run used this

**New Model: `FineTuningRun`**

- `id` (Integer, PK)
- `created_at` (DateTime)
- `status` (String) - 'preparing', 'training', 'evaluating', 'completed', 'failed'
- `num_training_examples` (Integer)
- `num_validation_examples` (Integer)
- `num_test_examples` (Integer)
- `training_config` (JSON) - hyperparameters, LoRA config
- `results` (JSON) - metrics (accuracy, loss, per-category F1)
- `model_path` (String, nullable) - path to saved LoRA weights
- `is_active_model` (Boolean) - currently deployed model
- `improvement_over_baseline` (Float, nullable)
- `completed_at` (DateTime, nullable)

### 1.2 Admin Routes Extension (`app/routes/admin.py`)

**Modify `update_category` endpoint** (sketched below):

- When an admin changes a category, create a TrainingExample record
- Capture: original prediction, corrected category, confidence score
- Track whether it's a correction (different from the AI) or a confirmation (same)

**New endpoints:**

- `GET /admin/training-data` - View collected training examples
- `GET /admin/api/training-stats` - Stats on corrections collected
- `DELETE /admin/api/training-example/` - Remove bad examples
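A minimal sketch of the `update_category` change, assuming a conventional Flask-SQLAlchemy setup (the blueprint name `admin_bp`, the route path, and the `Submission` field names are assumptions; the `TrainingExample` fields follow section 1.1):

```python
# app/routes/admin.py -- sketch only; admin_bp, the route path, and the
# Submission fields are assumed. TrainingExample fields follow section 1.1.
from datetime import datetime, timezone

from flask import jsonify, request

from app.models.models import Submission, TrainingExample, db

@admin_bp.route("/api/submission/<int:submission_id>/category", methods=["POST"])
def update_category(submission_id):
    submission = Submission.query.get_or_404(submission_id)
    new_category = request.json["category"]

    # Snapshot the correction before overwriting the AI's prediction.
    example = TrainingExample(
        submission_id=submission.id,
        message=submission.message,
        original_category=submission.category,        # AI's initial prediction
        corrected_category=new_category,               # admin's choice
        contributor_type=submission.contributor_type,
        confidence_score=submission.confidence_score,  # may be None
        correction_timestamp=datetime.now(timezone.utc),
    )
    db.session.add(example)

    submission.category = new_category
    db.session.commit()

    # A confirmation (same label) is still a useful positive example.
    return jsonify({
        "ok": True,
        "is_correction": example.original_category != new_category,
    })
```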
---

## Phase 2: Fine-Tuning Configuration UI

### 2.1 New Admin Page: Training Dashboard (`app/templates/admin/training.html`)

**Sections:**

1. **Training Data Stats**
   - Total corrections collected
   - Per-category distribution
   - Corrections vs. confirmations ratio
   - Data quality indicators (duplicates, conflicts)
2. **Fine-Tuning Controls** (enabled when ≥20 examples)
   - Configure training parameters:
     - Minimum examples threshold (default: 20)
     - Train/val/test split (e.g., 70/15/15)
     - LoRA rank (r=8, 16, 32)
     - Learning rate (1e-4 to 5e-4)
     - Number of epochs (3-5)
   - "Start Fine-Tuning" button (with confirmation)
3. **Training History**
   - Table of past FineTuningRun records
   - Show: date, examples used, accuracy, status
   - Actions: View details, Deploy model, Export weights
4. **Active Model Indicator**
   - Show which model is currently in use
   - Option to roll back to the base model

### 2.2 Settings Extension

- `fine_tuning_enabled` (Boolean) - master switch
- `min_training_examples` (Integer, default: 20)
- `auto_train` (Boolean, default: False) - auto-trigger when threshold reached

---

## Phase 3: Fine-Tuning Engine

### 3.1 New Module: `app/fine_tuning/trainer.py`

**Class: `BARTFineTuner`**

**Methods:**

`prepare_dataset(training_examples)`
- Convert TrainingExample records to a HuggingFace Dataset
- Create train/val/test splits (stratified by category)
- Tokenize texts for BART
- Return: `train_dataset`, `val_dataset`, `test_dataset`

`setup_lora_model(base_model_name, lora_config)` (sketched at the end of this section)
- Load the base BART model (`facebook/bart-large-mnli`)
- Apply PEFT (Parameter-Efficient Fine-Tuning) with LoRA
- LoRA configuration:

```python
{
    "r": 16,                                 # rank
    "lora_alpha": 32,
    "target_modules": ["q_proj", "v_proj"],  # attention layers
    "lora_dropout": 0.1,
    "bias": "none"
}
```

`train(train_dataset, val_dataset, config)`
- Use the HuggingFace Trainer with a custom loss
- Multi-class cross-entropy loss
- Metrics: accuracy, F1 per category, confusion matrix
- Early stopping on validation loss
- Save checkpoints to `/data/models/finetuned/run_{id}/`

`evaluate(test_dataset, model)`
- Run predictions on the test set
- Calculate: accuracy, precision, recall, F1 (macro/micro)
- Generate a confusion matrix
- Compare to baseline (zero-shot) performance

`export_model(run_id, destination_path)`
- Save LoRA adapter weights
- Save tokenizer config
- Create a model card with metrics
- Package for backup/deployment

**Alternative Approach: Output-Layer Fine-Tuning**

- Option to train only the final classification head
- Faster, less prone to overfitting
- Good for small datasets (20-50 examples)
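A sketch of how `setup_lora_model` and `train` could fit together with `peft` and the HuggingFace `Trainer`. The LoRA values mirror the config above; `num_labels=6` is inferred from the six categories implied by section 5.2, and the batch size and learning rate are illustrative defaults, not final choices:

```python
# app/fine_tuning/trainer.py -- sketch; assumes prepare_dataset() has already
# returned tokenized, padded datasets. num_labels=6 is inferred, not confirmed.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForSequenceClassification,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

def setup_lora_model(base_model_name="facebook/bart-large-mnli", num_labels=6):
    # The MNLI checkpoint ships a 3-label head; swap it for our category head.
    model = AutoModelForSequenceClassification.from_pretrained(
        base_model_name, num_labels=num_labels, ignore_mismatched_sizes=True
    )
    lora_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,  # also marks the new head as trainable
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],  # attention layers only
        lora_dropout=0.1,
        bias="none",
    )
    return get_peft_model(model, lora_config)

def train(model, train_dataset, val_dataset, run_id, learning_rate=2e-4):
    args = TrainingArguments(
        output_dir=f"/data/models/finetuned/run_{run_id}/",
        learning_rate=learning_rate,
        num_train_epochs=5,
        per_device_train_batch_size=4,  # tiny datasets: 20-50 examples total
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",  # early stopping on validation loss
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
    )
    trainer.train()
    return trainer
```

Note that the `Trainer` default loss for sequence classification is already multi-class cross-entropy, so a custom loss override is only needed if class weighting is added later.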
### 3.2 Background Task Handler (`app/fine_tuning/tasks.py`)

- Fine-tuning runs in the background (to avoid blocking Flask)
- Options:
  1. **Simple Threading** (for development)
  2. **Celery** (for production) - requires Redis/RabbitMQ
  3. **HF Spaces Gradio Jobs** (if deploying to HF)

**Status Updates:**

- Update FineTuningRun.status in real time
- Store progress in the Settings table for UI polling
- Log to a file for debugging

---

## Phase 4: Model Deployment & Versioning

### 4.1 Model Manager (`app/fine_tuning/model_manager.py`)

**Class: `ModelManager`**

`get_active_model()` (sketched below)
- Check whether a fine-tuned model is deployed
- Load LoRA weights if available
- Fall back to the base model

`deploy_model(run_id)`
- Set FineTuningRun.is_active_model = True
- Update Settings: `active_model_id`
- Reload the analyzer with the new model
- Create a deployment snapshot

`rollback_to_baseline()`
- Deactivate all fine-tuned models
- Reload the base BART model
- Log the rollback event

`compare_models(run_id_1, run_id_2, test_dataset)`
- Side-by-side comparison
- Statistical significance tests
- A/B testing support (future)
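A sketch of `get_active_model`; it assumes the adapter was trained with `task_type=SEQ_CLS` (in which case `peft` saves the replaced classification head alongside the LoRA weights) and that `num_labels` matches the training run:

```python
# app/fine_tuning/model_manager.py -- sketch; assumes the adapter was saved
# by the trainer above and that run.model_path points at it.
from peft import PeftModel
from transformers import AutoModelForSequenceClassification

from app.models.models import FineTuningRun

BASE_MODEL = "facebook/bart-large-mnli"

def get_active_model(num_labels=6):
    """Return (model, version): the deployed fine-tuned model, else the base."""
    # The base must be instantiated with the same head shape used in training.
    base = AutoModelForSequenceClassification.from_pretrained(
        BASE_MODEL, num_labels=num_labels, ignore_mismatched_sizes=True
    )
    run = FineTuningRun.query.filter_by(is_active_model=True).first()
    if run is None or run.model_path is None:
        return base, "base"
    # Wrap the base model with the saved LoRA adapter (and trained head).
    model = PeftModel.from_pretrained(base, run.model_path)
    model.eval()
    return model, f"finetuned-run-{run.id}"
```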
### 4.2 Analyzer Modification (`app/analyzer.py`)

**Update `SubmissionAnalyzer.__init__`:**

- Check for an active fine-tuned model
- Load the LoRA adapter if available
- Track the model version being used

**Add method: `get_model_info()`**

- Return: model type (base/fine-tuned), version, metrics

**Store prediction metadata:**

- Add confidence scores to all predictions
- Track which model version made the prediction

---

## Phase 5: Validation & Quality Assurance

### 5.1 Cross-Validation

- K-fold cross-validation (k=5) for small datasets
- Stratified splits to ensure category balance
- Report: mean ± std accuracy across folds

### 5.2 Minimum Viable Training Set

**Data Requirements:**

- At least 3 examples per category (18 total)
- Recommended: 5+ examples per category (30 total)
- Warn if there is severe class imbalance (>5:1 ratio)

### 5.3 Quality Checks

- Detect duplicate texts
- Detect conflicting labels (same text, different categories)
- Flag suspiciously short/long texts
- Admin review interface for cleanup

### 5.4 Success Criteria

**The model is deployed if:**

- Test accuracy > baseline accuracy + 5%
- OR per-category F1 improved for a majority of categories
- AND no category has F1 < 0.3 (to catch catastrophic forgetting)

**If the criteria are not met:**

- Keep the base model active
- Suggest: collect more data, adjust hyperparameters

---

## Phase 6: Export & Backup

### 6.1 Model Export

**Format Options:**

1. **HuggingFace Hub** - push the LoRA adapter to a private repo
2. **Local Files** - save to `/data/models/exports/`
3. **Download via UI** - ZIP file with weights + config

**Export Contents:**

- LoRA adapter weights (`adapter_model.bin`)
- Adapter config (`adapter_config.json`)
- Training metrics (`metrics.json`)
- Training examples used (`training_data.json`)
- Model card (`README.md`)

### 6.2 Import Pre-trained Model

- Upload a ZIP with LoRA weights
- Validate compatibility with the base model
- Deploy to production

---

## Technical Implementation Details

### Dependencies to Add (requirements.txt)

```
peft>=0.7.0          # LoRA implementation
datasets>=2.14.0     # HuggingFace datasets
scikit-learn>=1.3.0  # cross-validation, metrics
matplotlib>=3.7.0    # confusion matrix plotting
seaborn>=0.12.0      # visualization
accelerate>=0.24.0   # training optimization
evaluate>=0.4.0      # evaluation metrics
```

### File Structure

```
app/
├── fine_tuning/
│   ├── __init__.py
│   ├── trainer.py          # BARTFineTuner class
│   ├── model_manager.py    # Model deployment logic
│   ├── tasks.py            # Background job handler
│   ├── metrics.py          # Custom evaluation metrics
│   └── data_validator.py   # Training data QA
├── models/
│   └── models.py           # Add TrainingExample, FineTuningRun
├── routes/
│   └── admin.py            # Add training endpoints
├── templates/admin/
│   └── training.html       # Training dashboard UI
└── analyzer.py             # Update to support LoRA models

/data/models/               # Persistent storage (HF Spaces)
├── finetuned/
│   ├── run_1/
│   ├── run_2/
│   └── ...
└── exports/
```

### API Endpoints Summary

- `GET /admin/training` - Training dashboard page
- `GET /admin/api/training-stats` - Get correction stats
- `GET /admin/api/training-examples` - List training data
- `DELETE /admin/api/training-example/` - Remove example
- `POST /admin/api/start-training` - Trigger fine-tuning (sketched below)
- `GET /admin/api/training-status/` - Poll training progress
- `POST /admin/api/deploy-model/` - Deploy fine-tuned model
- `POST /admin/api/rollback-model` - Revert to base model
- `GET /admin/api/export-model/` - Download model weights
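A sketch of the start-training endpoint using the simple-threading option from section 3.2; `admin_bp` and the `run_fine_tuning` helper (living in `tasks.py`) are assumed names:

```python
# app/routes/admin.py -- sketch; admin_bp and run_fine_tuning are assumed names.
import threading

from flask import current_app, jsonify

from app.models.models import FineTuningRun, TrainingExample, db

@admin_bp.route("/api/start-training", methods=["POST"])
def start_training():
    examples = TrainingExample.query.all()
    if len(examples) < 20:  # the min_training_examples setting
        return jsonify({"error": "need at least 20 training examples"}), 400

    run = FineTuningRun(status="preparing", num_training_examples=len(examples))
    db.session.add(run)
    db.session.commit()

    # Train off the request thread so the dashboard can poll training-status/.
    app = current_app._get_current_object()
    threading.Thread(target=run_fine_tuning, args=(app, run.id), daemon=True).start()
    return jsonify({"run_id": run.id, "status": run.status}), 202

def run_fine_tuning(app, run_id):
    # Background thread: needs its own app context for DB access.
    with app.app_context():
        run = FineTuningRun.query.get(run_id)
        run.status = "training"
        db.session.commit()
        # ... run BARTFineTuner here, then record results and final status ...
```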
### UI Workflow

1. Admin corrects categories on the Submissions page (already working)
2. Navigate to the **Training** tab in the admin panel
3. View stats: "25 corrections collected (Ready to train!)"
4. Click "Start Fine-Tuning" → configure parameters → confirm
5. Progress bar shows: "Preparing data... Training... Evaluating..."
6. Results displayed: "Accuracy: 87% (+12% improvement!)"
7. Click "Deploy Model" to activate
8. All future predictions use the fine-tuned model

### Performance Considerations

- **Training Time**: ~2-5 minutes for 20-50 examples (CPU)
- **Memory**: LoRA uses ~10% of full fine-tuning memory
- **Storage**: ~50 MB per LoRA checkpoint
- **Inference**: Minimal overhead vs. the base model

### Risk Mitigation

1. **Overfitting**: Use a validation set and early stopping
2. **Catastrophic Forgetting**: Monitor all category metrics
3. **Bad Training Data**: Quality validation before training
4. **Model Regression**: Always compare to the baseline, allow rollback
5. **Resource Limits**: LoRA keeps training feasible on HF Spaces

---

## Implementation Phases

- **Phase 1 (Foundation):** Database models + data collection (2-3 hours)
- **Phase 2 (UI):** Training dashboard + configuration (2-3 hours)
- **Phase 3 (Core ML):** Fine-tuning engine + LoRA (4-5 hours)
- **Phase 4 (Deployment):** Model management + versioning (2-3 hours)
- **Phase 5 (QA):** Validation + metrics (2-3 hours)
- **Phase 6 (Polish):** Export/import + documentation (1-2 hours)

**Total Estimated Time:** 13-19 hours

---

## Questions for Clarification

1. **Training Infrastructure**: Run on HF Spaces (CPU) or a local machine (GPU)?
2. **Background Jobs**: Use simple threading, or prefer Celery/Redis?
3. **Model Hosting**: Keep models in HF Spaces persistent storage, or upload to the HF Hub?
4. **Auto-training**: Should the system auto-train when the threshold is reached, or be admin-triggered only?
5. **Notification**: Email/webhook when training completes?
6. **Multi-model**: Support multiple fine-tuned models simultaneously (A/B testing)?

Ready to proceed with implementation upon your approval!