# Training Strategy Guide for Participatory Planning Classifier
## Current Performance (as of Oct 2025)
- **Dataset**: 60 examples (~42 train / 9 val / 9 test)
- **Current Best**: Head-only training - **66.7% accuracy**
- **Baseline**: ~60% (zero-shot BART-mnli)
- **Challenge**: Only ~7 percentage points above the zero-shot baseline; the model is **underfitting**
## Recommended Training Strategies (Ranked)
### 🥇 **Strategy 1: LoRA with Conservative Settings**
**Best for: Your current 60-example dataset**
```yaml
Configuration:
training_mode: lora
lora_rank: 4-8 # Start small!
lora_alpha: 8-16 # 2x rank
lora_dropout: 0.2 # High dropout to prevent overfitting
learning_rate: 1e-4 # Conservative
num_epochs: 5-7 # Watch for overfitting
batch_size: 4 # Smaller batches
```
**Expected Accuracy**: 70-80%
**Why it works:**
- More capacity than head-only (~500K params with r=4)
- Still parameter-efficient enough for 60 examples
- Dropout prevents overfitting
**Try this first!** Your head-only results show you need more model capacity.
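
A minimal sketch of wiring these settings up with the Hugging Face `peft` library, assuming the analyzer fine-tunes a BART-style sequence-classification model; the actual model class and training loop in `app/analyzer.py` may differ:

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, TaskType

# Assumption: a fresh 6-way classification head on top of the existing base model
model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/bart-large-mnli", num_labels=6, ignore_mismatched_sizes=True
)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=4,                                   # start small; raise to 8 if it still underfits
    lora_alpha=8,                          # 2x rank
    lora_dropout=0.2,                      # high dropout to prevent overfitting
    target_modules=["q_proj", "v_proj"],   # attention projections in BART
    modules_to_save=["classification_head"],  # keep the new head trainable too
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # should report well under 1% of total params
```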
---
### 🥈 **Strategy 2: Data Augmentation + LoRA**
**Best for: Improving beyond 80% accuracy**
**Step 1: Augment your dataset to 150-200 examples**
Methods:
1. **Paraphrasing** (use GPT/Claude):
```python
# For each labelled example, generate 2-3 paraphrases that keep the same category:
original = "We need better public transit"
paraphrases = [
    "Public transportation should be improved",
    "The transit system requires enhancement",
]
```
2. **Back-translation**:
   English → Spanish → English (creates natural variations; see the sketch after this list)
3. **Template-based**:
Create templates for each category and fill with variations
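
For the back-translation route, a rough sketch using the `transformers` translation pipeline (the Helsinki-NLP MarianMT checkpoints named here are an assumption; any EN↔ES pair works):

```python
from transformers import pipeline

en_to_es = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
es_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

def back_translate(text: str) -> str:
    """Round-trip through Spanish to get a natural paraphrase with the same label."""
    spanish = en_to_es(text)[0]["translation_text"]
    return es_to_en(spanish)[0]["translation_text"]

print(back_translate("We need better public transit"))
```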
**Step 2: Train LoRA (r=8-16) on augmented data**
- Expected Accuracy: 80-90%
---
### 🥉 **Strategy 3: Two-Stage Progressive Training**
**Best for: Maximizing performance with limited data**
1. **Stage 1**: Head-only (warm-up)
- 3 epochs
- Initialize the classification head
2. **Stage 2**: LoRA fine-tuning
- r=4, low learning rate
- Build on head-only initialization
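
A compact sketch of the two stages, assuming the same `peft`-based setup as Strategy 1; `train(...)` is a hypothetical stand-in for the project's existing training loop:

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/bart-large-mnli", num_labels=6, ignore_mismatched_sizes=True
)

# Stage 1: freeze the backbone and warm up only the classification head (3 epochs)
for param in model.base_model.parameters():
    param.requires_grad = False
# train(model, epochs=3)  # hypothetical helper; use your existing loop

# Stage 2: add LoRA adapters on top of the warmed-up head, low learning rate
lora_config = LoraConfig(task_type=TaskType.SEQ_CLS, r=4, lora_alpha=8, lora_dropout=0.2)
model = get_peft_model(model, lora_config)
# train(model, epochs=5, learning_rate=1e-4)
```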
---
### 🔧 **Strategy 4: Optimize Category Definitions**
**May help with zero-shot AND fine-tuning**
Your categories might be too similar for the model to separate:
**Current overlaps:**
- Vision vs Objectives (both forward-looking)
- Problem vs Directives (both framed as constraints)
**Better Definitions:**
```python
CATEGORIES = {
'Vision': {
'name': 'Vision & Aspirations',
'description': 'Long-term future state, desired outcomes, what success looks like',
'keywords': ['future', 'aspire', 'imagine', 'dream', 'ideal']
},
'Problem': {
'name': 'Current Problems',
'description': 'Existing issues, frustrations, barriers, root causes',
'keywords': ['problem', 'issue', 'challenge', 'barrier', 'broken']
},
'Objectives': {
'name': 'Specific Goals',
'description': 'Measurable targets, concrete milestones, quantifiable outcomes',
'keywords': ['increase', 'reduce', 'achieve', 'target', 'by 2030']
},
'Directives': {
'name': 'Constraints & Requirements',
'description': 'Must-haves, non-negotiables, compliance requirements',
'keywords': ['must', 'required', 'mandate', 'comply', 'regulation']
},
'Values': {
'name': 'Principles & Values',
'description': 'Core beliefs, ethical guidelines, guiding principles',
'keywords': ['equity', 'sustainability', 'justice', 'fairness', 'inclusive']
},
'Actions': {
'name': 'Concrete Actions',
'description': 'Specific steps, interventions, activities to implement',
'keywords': ['build', 'create', 'implement', 'install', 'construct']
}
}
```
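
One way to feed the richer definitions back into the zero-shot path, assuming the analyzer uses a standard `transformers` zero-shot pipeline (the label formatting below is an illustration, not the project's actual prompt):

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Use name + description as the candidate label text instead of the bare key
candidate_labels = [f"{c['name']}: {c['description']}" for c in CATEGORIES.values()]

result = classifier("We must comply with the new zoning regulations", candidate_labels)
print(result["labels"][0])  # highest-scoring label string
```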
---
## Alternative Base Models to Consider
### **DeBERTa-v3-base** (Better for Classification)
```python
# In app/analyzer.py
model_name = "microsoft/deberta-v3-base"
# Size: 184M params (vs BART's 400M)
# Often outperforms BART for classification
```
### **DistilRoBERTa** (Faster, Lighter)
```python
model_name = "distilroberta-base"
# Size: 82M params
# ~2x faster than RoBERTa-base, ~80% smaller than BART-large
# Good accuracy
```
### **XLM-RoBERTa-base** (Multilingual)
```python
model_name = "xlm-roberta-base"
# If you have multilingual submissions
```
---
## Data Collection Strategy
**Current**: 60 examples → **Target**: 150+ examples
### How to get more data:
1. **Active Learning** (Built into your system!)
- Deploy current model
- Admin reviews and corrects predictions
- Automatically builds training set
2. **Historical Data**
- Import past participatory planning submissions
- Manual labeling (15 min for 50 examples)
3. **Synthetic Generation** (use GPT-4; see the sketch after this list)
```
Prompt: "Generate 10 participatory planning submissions
that express VISION for urban transportation"
```
4. **Crowdsourcing**
   - MTurk (Amazon Mechanical Turk) or internal team
- Label 100 examples: ~$20-50
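
A minimal sketch of scripting the synthetic-generation step, assuming the OpenAI Python client (v1+) is available; the model name and prompt wording are illustrative:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = (
    "Generate 10 participatory planning submissions that express VISION "
    "for urban transportation. Return one submission per line."
)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
synthetic_examples = response.choices[0].message.content.splitlines()
```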
---
## Performance Targets
| Dataset Size | Method | Expected Accuracy | Time to Train |
|-------------|--------|------------------|---------------|
| 60 | Head-only | 65-70% ❌ Current | 2 min |
| 60 | LoRA (r=4) | 70-80% ✅ Try next | 5 min |
| 150 | LoRA (r=8) | 80-85% ⭐ Goal | 10 min |
| 300+ | LoRA (r=16) | 85-90% 🎯 Ideal | 20 min |
---
## Immediate Action Plan
### Week 1: Low-Hanging Fruit
1. ✅ Train with LoRA (r=4, epochs=5)
2. ✅ Compare to head-only baseline
3. ✅ Check per-category F1 scores
### Week 2: Data Expansion
4. Collect 50 more examples (aim for balance)
5. Use data augmentation (paraphrase 60 → 120)
6. Retrain LoRA (r=8)
### Week 3: Optimization
7. Try DeBERTa-v3-base as base model
8. Fine-tune category descriptions
9. Deploy best model
---
## Debugging Low Performance
If accuracy stays below 75%:
### Check 1: Data Quality
```sql
-- Look for texts that were labelled with more than one category
SELECT message, COUNT(DISTINCT corrected_category) AS label_count
FROM training_examples
GROUP BY message
HAVING COUNT(DISTINCT corrected_category) > 1;
```
### Check 2: Class Imbalance
- Ensure each category has 5-10+ examples
- Use weighted loss if imbalanced
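
If the classes are skewed, inverse-frequency weights are a simple fix; a sketch assuming a PyTorch cross-entropy loss in the training loop (the per-category counts below are placeholders):

```python
import torch

# Placeholder per-category example counts, in the same order as the label ids
counts = torch.tensor([18.0, 12.0, 9.0, 8.0, 7.0, 6.0])
weights = counts.sum() / (len(counts) * counts)   # inverse-frequency weighting
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
```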
### Check 3: Category Confusion
- Generate confusion matrix
- Merge categories that are frequently confused
  (e.g., Vision + Objectives → "Future Goals")
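
A quick way to see which pairs get confused, assuming scikit-learn is installed (`y_true` / `y_pred` below are placeholder validation labels and predictions):

```python
from sklearn.metrics import confusion_matrix, classification_report

labels = ["Vision", "Problem", "Objectives", "Directives", "Values", "Actions"]
y_true = ["Vision", "Objectives", "Problem", "Directives"]   # placeholder ground truth
y_pred = ["Objectives", "Objectives", "Problem", "Problem"]  # placeholder predictions

print(confusion_matrix(y_true, y_pred, labels=labels))
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```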
### Check 4: Text Quality
- Remove very short texts (< 5 words)
- Remove duplicates
- Check for non-English text
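
A small filtering pass covering these checks (the `examples` list is a placeholder for rows pulled from `training_examples`; a language check with e.g. `langdetect` is omitted):

```python
examples = [
    ("We need better public transit", "Vision"),
    ("ok", "Actions"),                            # too short, will be dropped
    ("We need better public transit", "Vision"),  # duplicate, will be dropped
]

seen = set()
cleaned = []
for text, label in examples:
    if len(text.split()) < 5:        # drop very short texts
        continue
    key = text.strip().lower()
    if key in seen:                  # drop duplicates
        continue
    seen.add(key)
    cleaned.append((text, label))
```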
---
## Advanced: Ensemble Models
If a single model plateaus at 80-85%:
1. Train 3 models with different seeds
2. Use voting or averaging
3. Typical boost: +3-5% accuracy
```python
from collections import Counter

# Sketch: model1-3 are the same architecture trained with different seeds,
# text is the submission to classify
predictions = [
    model1.predict(text),
    model2.predict(text),
    model3.predict(text),
]
final = Counter(predictions).most_common(1)[0][0]  # majority vote
```
---
## Conclusion
**For your current 60 examples:**
1. 🎯 **DO**: Try LoRA with r=4-8 (conservative settings)
2. 📈 **DO**: Collect 50-100 more examples
3. 🔄 **DO**: Try DeBERTa-v3 as an alternative base model
4. ❌ **DON'T**: Rely on head-only training (it underfits on this dataset)
5. ❌ **DON'T**: Use full fine-tuning (likely to overfit 60 examples)
**Expected outcome:** 70-85% accuracy (up from current 66.7%)
**Next milestone:** 150 examples → 85%+ accuracy