
Training Strategy Guide for Participatory Planning Classifier

Current Performance (as of Oct 2025)

  • Dataset: 60 examples (~42 train / 9 val / 9 test)
  • Current Best: Head-only training, 66.7% accuracy
  • Baseline: ~60% (zero-shot BART-mnli)
  • Challenge: only a ~6.7-point gain over zero-shot; the model is underfitting

Recommended Training Strategies (Ranked)

🥇 Strategy 1: LoRA with Conservative Settings

Best for: Your current 60-example dataset

Configuration:
  training_mode: lora
  lora_rank: 4-8          # Start small!
  lora_alpha: 8-16        # 2x rank
  lora_dropout: 0.2       # High dropout to prevent overfitting
  learning_rate: 1e-4     # Conservative
  num_epochs: 5-7         # Watch for overfitting
  batch_size: 4           # Smaller batches

Expected Accuracy: 70-80%

Why it works:

  • More capacity than head-only (~500K params with r=4)
  • Still parameter-efficient enough for 60 examples
  • Dropout prevents overfitting

Try this first! Your head-only results show you need more model capacity.
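
If the training code is built on Hugging Face transformers, the settings above map onto a peft LoraConfig roughly as in the sketch below. This is a minimal illustration, not the project's trainer: the checkpoint name, NUM_LABELS, and target_modules are assumptions to adapt to whatever app/analyzer.py actually loads.

# Minimal LoRA setup sketch (assumes transformers + peft; names are illustrative)
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

NUM_LABELS = 6  # Vision, Problem, Objectives, Directives, Values, Actions

base = AutoModelForSequenceClassification.from_pretrained(
    "facebook/bart-large-mnli",           # swap for the model app/analyzer.py uses
    num_labels=NUM_LABELS,
    ignore_mismatched_sizes=True,         # replace the 3-way MNLI head with a 6-way head
)

config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=4,                                  # conservative rank for ~60 examples
    lora_alpha=8,                         # 2x rank
    lora_dropout=0.2,                     # high dropout to prevent overfitting
    target_modules=["q_proj", "v_proj"],  # BART attention projections
    modules_to_save=["classification_head"],  # keep the BART head trainable too
)

model = get_peft_model(base, config)
model.print_trainable_parameters()        # sanity check: well under 1% of the base model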


🥈 Strategy 2: Data Augmentation + LoRA

Best for: Improving beyond 80% accuracy

Step 1: Augment your dataset to 150-200 examples

Methods:

  1. Paraphrasing (use GPT/Claude):

    # For each example:
    "We need better public transit" 
    β†’ "Public transportation should be improved"
    β†’ "Transit system requires enhancement"
    
  2. Back-translation: English → Spanish → English (creates natural variations; a sketch follows this list)

  3. Template-based: Create templates for each category and fill with variations
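
For method 2, a minimal back-translation sketch, assuming the transformers library with the Helsinki-NLP MarianMT checkpoints (an extra dependency, not something the project already ships with):

# Back-translation sketch: English -> Spanish -> English
from transformers import pipeline

en_to_es = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
es_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

def back_translate(text: str) -> str:
    """Return a paraphrased variant of `text` via a round trip through Spanish."""
    spanish = en_to_es(text)[0]["translation_text"]
    return es_to_en(spanish)[0]["translation_text"]

print(back_translate("We need better public transit"))  # keep the original label for the new copy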

Step 2: Train LoRA (r=8-16) on augmented data

  • Expected Accuracy: 80-90%

🥉 Strategy 3: Two-Stage Progressive Training

Best for: Maximizing performance with limited data

  1. Stage 1: Head-only (warm-up)

    • 3 epochs
    • Initialize the classification head
  2. Stage 2: LoRA fine-tuning

    • r=4, low learning rate
    • Build on the head-only initialization (a sketch of both stages follows below)
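
A sketch of how the two stages could be wired together with the same transformers/peft stack assumed above; the "classification_head" parameter prefix is specific to BART-style models, so adjust it for other backbones:

# Two-stage sketch: head-only warm-up, then LoRA (illustrative, not the project's trainer)
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/bart-large-mnli", num_labels=6, ignore_mismatched_sizes=True
)

# Stage 1: freeze the backbone and train only the classification head (~3 epochs)
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("classification_head")
# ... run the normal training loop here ...

# Stage 2: add low-rank adapters and continue at a lower learning rate
model = get_peft_model(
    model,
    LoraConfig(
        task_type=TaskType.SEQ_CLS, r=4, lora_alpha=8, lora_dropout=0.2,
        modules_to_save=["classification_head"],  # let the warmed-up head keep training
    ),
)
# ... resume training; the head starts from its stage-1 weights ...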

🔧 Strategy 4: Optimize Category Definitions

May help with zero-shot AND fine-tuning

Your categories might be too similar. Consider:

Current Categories:

  • Vision vs Objectives (both forward-looking)
  • Problem vs Directives (both constraints)

Better Definitions:

CATEGORIES = {
    'Vision': {
        'name': 'Vision & Aspirations',
        'description': 'Long-term future state, desired outcomes, what success looks like',
        'keywords': ['future', 'aspire', 'imagine', 'dream', 'ideal']
    },
    'Problem': {
        'name': 'Current Problems',  
        'description': 'Existing issues, frustrations, barriers, root causes',
        'keywords': ['problem', 'issue', 'challenge', 'barrier', 'broken']
    },
    'Objectives': {
        'name': 'Specific Goals',
        'description': 'Measurable targets, concrete milestones, quantifiable outcomes',
        'keywords': ['increase', 'reduce', 'achieve', 'target', 'by 2030']
    },
    'Directives': {
        'name': 'Constraints & Requirements',
        'description': 'Must-haves, non-negotiables, compliance requirements',
        'keywords': ['must', 'required', 'mandate', 'comply', 'regulation']
    },
    'Values': {
        'name': 'Principles & Values',
        'description': 'Core beliefs, ethical guidelines, guiding principles',
        'keywords': ['equity', 'sustainability', 'justice', 'fairness', 'inclusive']
    },
    'Actions': {
        'name': 'Concrete Actions',
        'description': 'Specific steps, interventions, activities to implement',
        'keywords': ['build', 'create', 'implement', 'install', 'construct']
    }
}

Alternative Base Models to Consider

DeBERTa-v3-base (Better for Classification)

# In app/analyzer.py
model_name = "microsoft/deberta-v3-base"
# Size: 184M params (vs BART's 400M)
# Often outperforms BART for classification

DistilRoBERTa (Faster, Lighter)

model_name = "distilroberta-base"
# Size: 82M params
# Roughly 2x faster than roberta-base, with about a third fewer parameters
# Good accuracy

XLM-RoBERTa-base (Multilingual)

model_name = "xlm-roberta-base"
# If you have multilingual submissions

Data Collection Strategy

Current: 60 examples → Target: 150+ examples

How to get more data:

  1. Active Learning (Built into your system!)

    • Deploy current model
    • Admin reviews and corrects predictions
    • Automatically builds training set
  2. Historical Data

    • Import past participatory planning submissions
    • Manual labeling (15 min for 50 examples)
  3. Synthetic Generation (Use GPT-4)

    Prompt: "Generate 10 participatory planning submissions 
    that express VISION for urban transportation"
    
  4. Crowdsourcing

    • MTurk or internal team
    • Label 100 examples: ~$20-50

Performance Targets

Dataset Size | Method      | Expected Accuracy | Status      | Time to Train
60           | Head-only   | 65-70%            | ❌ Current   | 2 min
60           | LoRA (r=4)  | 70-80%            | ✅ Try next  | 5 min
150          | LoRA (r=8)  | 80-85%            | ⭐ Goal      | 10 min
300+         | LoRA (r=16) | 85-90%            | 🎯 Ideal     | 20 min

Immediate Action Plan

Week 1: Low-Hanging Fruit

  1. ✅ Train with LoRA (r=4, epochs=5)
  2. ✅ Compare to head-only baseline
  3. ✅ Check per-category F1 scores

Week 2: Data Expansion

  1. Collect 50 more examples (aim for balance)
  2. Use data augmentation (paraphrase 60 → 120)
  3. Retrain LoRA (r=8)

Week 3: Optimization

  1. Try DeBERTa-v3-base as base model
  2. Fine-tune category descriptions
  3. Deploy best model

Debugging Low Performance

If accuracy stays below 75%:

Check 1: Data Quality

-- Look for label conflicts: the same message labeled with more than one category
SELECT message, COUNT(DISTINCT corrected_category) AS label_count
FROM training_examples
GROUP BY message
HAVING COUNT(DISTINCT corrected_category) > 1;

Check 2: Class Imbalance

  • Ensure each category has 5-10+ examples
  • Use a class-weighted loss if imbalanced (see the sketch below)
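
One way to build that weighted loss, assuming scikit-learn and PyTorch are available; train_labels is a placeholder for the encoded labels of the training split:

# Class-weighted loss sketch (train_labels is a placeholder name)
import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight

labels = np.array(train_labels)
weights = compute_class_weight("balanced", classes=np.unique(labels), y=labels)
loss_fn = torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float))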

Check 3: Category Confusion

  • Generate a confusion matrix (a scikit-learn sketch follows below)
  • Merge categories that are frequently confused (e.g., Vision + Objectives → "Future Goals")
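
A quick way to produce both the matrix and per-category F1 scores with scikit-learn; y_true and y_pred are placeholders for the validation labels and the model's predictions:

# Confusion matrix + per-category F1 sketch (y_true / y_pred are placeholders)
from sklearn.metrics import confusion_matrix, classification_report

category_names = ["Vision", "Problem", "Objectives", "Directives", "Values", "Actions"]
print(confusion_matrix(y_true, y_pred))
# target_names assumes labels are encoded 0-5 in the order above
print(classification_report(y_true, y_pred, target_names=category_names))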

Check 4: Text Quality

  • Remove very short texts (< 5 words); a cleanup sketch follows this list
  • Remove duplicates
  • Check for non-English text
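
A minimal cleanup pass, assuming the examples sit in a pandas DataFrame with a message column (an assumption about the data layout, not the project's actual schema):

# Text-quality filter sketch (df is assumed to be a pandas DataFrame of examples)
df = df.drop_duplicates(subset="message")           # remove duplicates
df = df[df["message"].str.split().str.len() >= 5]   # drop texts shorter than 5 words
# Non-English text is easiest to flag with a language-detection library (e.g. langdetect)
# or a quick manual pass at this dataset size.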

Advanced: Ensemble Models

If a single model plateaus at 80-85%:

  1. Train 3 models with different seeds
  2. Use voting or averaging
  3. Typical boost: +3-5% accuracy

# Majority-vote ensemble over three independently trained models
from collections import Counter

predictions = [
    model1.predict(text),
    model2.predict(text),
    model3.predict(text),
]
final = Counter(predictions).most_common(1)[0][0]  # Voting

Conclusion

For your current 60 examples:

  1. 🎯 DO: Try LoRA with r=4-8 (conservative settings)
  2. 📈 DO: Collect 50-100 more examples
  3. 🔄 DO: Try DeBERTa-v3 as an alternative base model
  4. ❌ DON'T: Use head-only (proven to underfit)
  5. ❌ DON'T: Use full fine-tuning (will overfit)

Expected outcome: 70-85% accuracy (up from current 66.7%)

Next milestone: 150 examples → 85%+ accuracy