# Training Strategy Guide for Participatory Planning Classifier

## Current Performance (as of Oct 2025)
- Dataset: 60 examples (~42 train / 9 val / 9 test)
- Current Best: Head-only training - 66.7% accuracy
- Baseline: ~60% (zero-shot BART-mnli)
- Challenge: only a 6.7-point gain over zero-shot, so the model is underfitting
## Recommended Training Strategies (Ranked)

### 🥇 Strategy 1: LoRA with Conservative Settings
Best for: Your current 60-example dataset
Configuration:

```yaml
training_mode: lora
lora_rank: 4-8        # start small!
lora_alpha: 8-16      # 2x rank
lora_dropout: 0.2     # high dropout to prevent overfitting
learning_rate: 1e-4   # conservative
num_epochs: 5-7       # watch for overfitting
batch_size: 4         # smaller batches
```
Expected Accuracy: 70-80%
Why it works:
- More capacity than head-only (~500K params with r=4)
- Still parameter-efficient enough for 60 examples
- Dropout prevents overfitting
Try this first! Your head-only results show you need more model capacity.
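For reference, a minimal sketch of these settings with the Hugging Face `peft` library. The `target_modules` names assume a BART-style architecture (adjust them if you switch base models), and `ignore_mismatched_sizes` is needed because bart-large-mnli ships with a 3-label head:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

# Replace the 3-label MNLI head with a fresh 6-label head
model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/bart-large-mnli", num_labels=6, ignore_mismatched_sizes=True
)

# Conservative LoRA: small rank, alpha = 2x rank, high dropout
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=4,
    lora_alpha=8,
    lora_dropout=0.2,
    target_modules=["q_proj", "v_proj"],  # attention projections in BART
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # expect well under 1% of weights trainable
```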
### 🥈 Strategy 2: Data Augmentation + LoRA
Best for: Improving beyond 80% accuracy
Step 1: Augment your dataset to 150-200 examples
Methods:
- Paraphrasing (use GPT/Claude). For each example: "We need better public transit" → "Public transportation should be improved" → "Transit system requires enhancement"
- Back-translation: English → Spanish → English (creates natural variations)
- Template-based: create templates for each category and fill them with variations
Step 2: Train LoRA (r=8-16) on the augmented data

Expected Accuracy: 80-90%
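For back-translation, the MarianMT checkpoints on the Hugging Face Hub work out of the box; the wrapper below is a sketch:

```python
from transformers import pipeline

# EN -> ES -> EN round trip produces natural paraphrases
en_to_es = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
es_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

def back_translate(text: str) -> str:
    spanish = en_to_es(text)[0]["translation_text"]
    return es_to_en(spanish)[0]["translation_text"]

print(back_translate("We need better public transit"))
```

Keep the original label for each augmented example, and make sure variants of the same text never end up split across the train and validation sets.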
### 🥉 Strategy 3: Two-Stage Progressive Training
Best for: Maximizing performance with limited data
Stage 1: Head-only (warm-up)
- 3 epochs
- Initialize the classification head
Stage 2: LoRA fine-tuning
- r=4, low learning rate
- Build on head-only initialization
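One way to wire the two stages together (a sketch; `base_model` is the generic transformers accessor for the encoder body, and the hyperparameters mirror Strategy 1):

```python
from peft import LoraConfig, TaskType, get_peft_model

# Stage 1: freeze the backbone so only the classification head trains
for param in model.base_model.parameters():
    param.requires_grad = False
# ... run ~3 epochs with your existing training loop ...

# Stage 2: add LoRA adapters on top of the warmed-up head
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=4,
    lora_alpha=8,
    lora_dropout=0.2,
    target_modules=["q_proj", "v_proj"],  # BART-style attention projections
)
model = get_peft_model(model, lora_config)  # freezes all non-LoRA, non-head weights
# ... continue training at a low learning rate, e.g. 5e-5 ...
```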
### 🔧 Strategy 4: Optimize Category Definitions
May help with zero-shot AND fine-tuning
Your categories might be too similar. Confusable pairs in the current scheme:
- Vision vs. Objectives (both forward-looking)
- Problem vs. Directives (both read as constraints)

Better Definitions:
```python
CATEGORIES = {
'Vision': {
'name': 'Vision & Aspirations',
'description': 'Long-term future state, desired outcomes, what success looks like',
'keywords': ['future', 'aspire', 'imagine', 'dream', 'ideal']
},
'Problem': {
'name': 'Current Problems',
'description': 'Existing issues, frustrations, barriers, root causes',
'keywords': ['problem', 'issue', 'challenge', 'barrier', 'broken']
},
'Objectives': {
'name': 'Specific Goals',
'description': 'Measurable targets, concrete milestones, quantifiable outcomes',
'keywords': ['increase', 'reduce', 'achieve', 'target', 'by 2030']
},
'Directives': {
'name': 'Constraints & Requirements',
'description': 'Must-haves, non-negotiables, compliance requirements',
'keywords': ['must', 'required', 'mandate', 'comply', 'regulation']
},
'Values': {
'name': 'Principles & Values',
'description': 'Core beliefs, ethical guidelines, guiding principles',
'keywords': ['equity', 'sustainability', 'justice', 'fairness', 'inclusive']
},
'Actions': {
'name': 'Concrete Actions',
'description': 'Specific steps, interventions, activities to implement',
'keywords': ['build', 'create', 'implement', 'install', 'construct']
}
}
```
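These richer definitions can feed straight into zero-shot classification as well; a sketch using the standard transformers pipeline (the input sentence is just an illustration):

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Use the fuller descriptions rather than the bare category names as labels
labels = [f"{v['name']}: {v['description']}" for v in CATEGORIES.values()]
result = classifier("Increase bike lane coverage 40% by 2030", candidate_labels=labels)
print(result["labels"][0])  # highest-scoring category description
```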
## Alternative Base Models to Consider

### DeBERTa-v3-base (Better for Classification)

```python
# In app/analyzer.py
model_name = "microsoft/deberta-v3-base"
# Size: 184M params (vs BART-large's ~400M)
# Often outperforms BART for classification
```
### DistilRoBERTa (Faster, Lighter)

```python
model_name = "distilroberta-base"
# Size: 82M params
# ~2x faster, ~34% smaller than roberta-base (82M vs 125M params)
# Good accuracy
```
### XLM-RoBERTa-base (Multilingual)

```python
model_name = "xlm-roberta-base"
# If you have multilingual submissions
```
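Swapping the base model is typically a one-line change plus a reload; a sketch assuming the standard Auto classes:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "microsoft/deberta-v3-base"  # or "distilroberta-base" / "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=6)
```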
## Data Collection Strategy

Current: 60 examples → Target: 150+ examples

How to get more data:
1. Active Learning (built into your system!)
   - Deploy the current model
   - Admin reviews and corrects predictions
   - Corrections automatically build the training set (see the selection sketch after this list)
2. Historical Data
   - Import past participatory planning submissions
   - Manual labeling (~15 min for 50 examples)
3. Synthetic Generation (use GPT-4)
   - Prompt: "Generate 10 participatory planning submissions that express VISION for urban transportation"
4. Crowdsourcing
   - MTurk or internal team
   - Label 100 examples: ~$20-50
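For the active-learning route, a common trick is to queue only the model's least confident predictions for admin review. A sketch (the `predict_proba` helper is hypothetical; stand in your own scoring function):

```python
import numpy as np

probs = predict_proba(texts)                  # hypothetical: (n_texts, n_classes) array
uncertainty = 1.0 - probs.max(axis=1)         # low top probability = uncertain
review_queue = np.argsort(-uncertainty)[:20]  # indices of the 20 most uncertain texts
```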
## Performance Targets

| Dataset Size | Method | Expected Accuracy | Time to Train |
|---|---|---|---|
| 60 | Head-only | 65-70% (current) | 2 min |
| 60 | LoRA (r=4) | 70-80% (try next) | 5 min |
| 150 | LoRA (r=8) | 80-85% (goal) | 10 min |
| 300+ | LoRA (r=16) | 85-90% (ideal) | 20 min |
## Immediate Action Plan

### Week 1: Low-Hanging Fruit
- Train with LoRA (r=4, epochs=5)
- Compare to the head-only baseline
- Check per-category F1 scores
### Week 2: Data Expansion
- Collect 50 more examples (aim for class balance)
- Use data augmentation (paraphrase 60 → 120)
- Retrain LoRA (r=8)
### Week 3: Optimization
- Try DeBERTa-v3-base as base model
- Fine-tune category descriptions
- Deploy best model
## Debugging Low Performance
If accuracy stays below 75%:
### Check 1: Data Quality

```sql
-- Look for messages labeled with more than one category
SELECT message, COUNT(DISTINCT corrected_category) AS label_count
FROM training_examples
GROUP BY message
HAVING COUNT(DISTINCT corrected_category) > 1;
```
### Check 2: Class Imbalance
- Ensure each category has 5-10+ examples
- Use weighted loss if imbalanced
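A weighted-loss sketch in PyTorch, assuming `labels` is the list of integer class ids in the training set and all six classes appear at least once:

```python
import torch
from collections import Counter

counts = Counter(labels)  # e.g. {0: 14, 1: 5, ...}
num_classes = len(counts)
weights = torch.tensor(
    [len(labels) / (num_classes * counts[c]) for c in range(num_classes)],
    dtype=torch.float,
)
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)  # rarer classes count more
```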
### Check 3: Category Confusion
- Generate a confusion matrix
- Merge categories that are frequently confused (e.g., Vision + Objectives → "Future Goals")
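scikit-learn covers both the matrix and per-category F1 in a few lines (`y_true`/`y_pred` are your validation labels and predictions):

```python
from sklearn.metrics import confusion_matrix, classification_report

names = ["Vision", "Problem", "Objectives", "Directives", "Values", "Actions"]
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=names))
```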
### Check 4: Text Quality
- Remove very short texts (< 5 words)
- Remove duplicates
- Check for non-English text
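A simple cleaning pass covering the length and duplicate checks (a sketch; assumes each example is a dict with a `message` field, and language detection would need an extra library such as langdetect):

```python
cleaned, seen = [], set()
for ex in examples:
    text = ex["message"].strip()
    key = text.lower()
    if len(text.split()) < 5 or key in seen:  # drop very short texts and duplicates
        continue
    seen.add(key)
    cleaned.append(ex)
```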
## Advanced: Ensemble Models

If a single model plateaus at 80-85%:
- Train 3 models with different seeds
- Use voting or averaging
- Typical boost: +3-5% accuracy
```python
from collections import Counter

# Majority vote over three models trained with different seeds;
# model1/model2/model3 are assumed to expose predict(text) -> label
predictions = [
    model1.predict(text),
    model2.predict(text),
    model3.predict(text),
]
final = Counter(predictions).most_common(1)[0][0]
```
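If your models expose class probabilities, averaging those probabilities and then taking the argmax (soft voting) is usually at least as strong as the hard vote above.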
## Conclusion

For your current 60 examples:
- ✅ DO: Try LoRA with r=4-8 (conservative settings)
- ✅ DO: Collect 50-100 more examples
- ✅ DO: Try DeBERTa-v3 as an alternative base model
- ❌ DON'T: Use head-only training (proven to underfit)
- ❌ DON'T: Use full fine-tuning (it will overfit on 60 examples)
Expected outcome: 70-85% accuracy (up from current 66.7%)
Next milestone: 150 examples → 85%+ accuracy