
📋 Sentence-Level Categorization - ✅ IMPLEMENTED

Status: ✅ COMPLETE - All 7 phases implemented and deployed

Problem Identified: A single submission often contains several sentences that belong to different categories, so assigning one category to the whole submission loses nuance.

Example:

"Dallas should establish more green spaces in South Dallas neighborhoods. Areas like Oak Cliff lack accessible parks compared to North Dallas."

  • Sentence 1: Objectives (should establish...)
  • Sentence 2: Problem (lack accessible parks...)

✅ Implementation Status

Phase 1: Database Schema ✅ COMPLETE

  • ✅ SubmissionSentence model created
  • ✅ sentence_analysis_done flag added to Submission
  • ✅ sentence_id foreign key added to TrainingExample
  • ✅ Helper methods: get_primary_category(), get_category_distribution()
  • ✅ Database migration script completed

Files:

  • app/models/models.py (lines 85-114): SubmissionSentence model
  • app/models/models.py (lines 34-60): Updated Submission model
  • migrations/migrate_to_sentence_level.py: Migration script
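
The two helper methods listed above could work roughly as follows. This is a minimal sketch in plain Python rather than the project's ORM code: the function signatures, the percentage rounding, and the tie-breaking rule (the category seen first wins) are assumptions, not taken from app/models/models.py.

```python
from collections import Counter
from typing import Optional


def get_category_distribution(categories: list[Optional[str]]) -> dict[str, float]:
    """Return each category's share (0-100) among the classified sentences."""
    labelled = [c for c in categories if c is not None]
    if not labelled:
        return {}
    counts = Counter(labelled)
    return {cat: round(100 * n / len(labelled), 1) for cat, n in counts.items()}


def get_primary_category(categories: list[Optional[str]]) -> Optional[str]:
    """Return the most frequent category; ties go to the category seen first."""
    labelled = [c for c in categories if c is not None]
    if not labelled:
        return None
    # Counter.most_common orders equal counts by first occurrence.
    return Counter(labelled).most_common(1)[0][0]
```

Unclassified sentences (category of None) are ignored here so that a partially analyzed submission still yields a usable primary category.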

Phase 2: Sentence Segmentation ✅ COMPLETE

  • ✅ Rule-based sentence segmenter created
  • ✅ Handles abbreviations (Dr., Mr., etc.)
  • ✅ Handles bullet points and special punctuation
  • ✅ Minimum length validation

Files:

  • app/sentence_segmenter.py: SentenceSegmenter class with comprehensive logic
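
To illustrate the kind of rules described above, here is a small rule-based segmenter. It is only a sketch: the abbreviation list, the 10-character minimum, and the function name `segment` are assumptions, and the real SentenceSegmenter class is described as more comprehensive.

```python
import re

# Assumed values for illustration; the project's own lists may differ.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "st", "vs"}
MIN_LENGTH = 10


def segment(text: str) -> list[str]:
    """Split text into sentences using simple punctuation rules."""
    sentences = []
    # Treat every line (and therefore every bullet item) as its own block.
    for block in text.splitlines():
        # Strip a leading bullet marker or list number such as "-", "•" or "1.".
        block = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s+", "", block).strip()
        if not block:
            continue
        start = 0
        # A candidate boundary is ., ! or ? followed by whitespace.
        for match in re.finditer(r"[.!?]+\s+", block):
            words = block[start:match.start()].split()
            last_word = words[-1].lower() if words else ""
            if match.group().startswith(".") and last_word in ABBREVIATIONS:
                continue  # "Dr." and similar do not end a sentence
            sentences.append(block[start:match.end()].strip())
            start = match.end()
        tail = block[start:].strip()
        if tail:
            sentences.append(tail)
    # Minimum length validation drops fragments such as "Ok".
    return [s for s in sentences if len(s) >= MIN_LENGTH]
```

Applied to the Dallas example at the top of this document, this sketch returns the two sentences separately, and "Dr. Smith proposed a new park." stays in one piece.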

Phase 3: Analysis Pipeline ✅ COMPLETE

  • ✅ analyze_sentences() method - analyzes list of sentences
  • ✅ analyze_with_sentences() method - segments and analyzes in one call
  • ✅ Each sentence classified independently
  • ✅ Confidence scores tracked (when available)

Files:

  • app/analyzer.py (lines 282-313): analyze_sentences method
  • app/analyzer.py (lines 315-332): analyze_with_sentences method
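
The relationship between the two methods can be sketched as below. The keyword classifier is a toy stand-in for the real model, and the "Other" label, the result dictionary keys, and the `classify` parameter are assumptions made only so the example runs on its own.

```python
import re
from typing import Callable, Optional

# A classifier maps text to (category, confidence); confidence may be None.
Classifier = Callable[[str], tuple[str, Optional[float]]]


def keyword_classifier(text: str) -> tuple[str, Optional[float]]:
    """Toy stand-in for the real model, used only to make the sketch runnable."""
    lowered = text.lower()
    if "should" in lowered:
        return "Objectives", 0.9
    if "lack" in lowered:
        return "Problem", 0.8
    return "Other", None  # hypothetical fallback label, no confidence available


def analyze_sentences(sentences: list[str], classify: Classifier) -> list[dict]:
    """Classify each sentence independently, keeping its position in the submission."""
    results = []
    for index, sentence in enumerate(sentences):
        category, confidence = classify(sentence)
        results.append({
            "sentence_index": index,
            "text": sentence,
            "category": category,
            "confidence": confidence,
        })
    return results


def analyze_with_sentences(text: str, classify: Classifier) -> list[dict]:
    """Segment the text, then classify the resulting sentences in one call."""
    # Naive split after sentence-ending punctuation; the real segmenter is richer.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return analyze_sentences(sentences, classify)
```

Passing the classifier in as a parameter keeps the per-sentence loop independent of any particular model, which is one way to keep confidence optional.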

Phase 4: Backend API ✅ COMPLETE

  • ✅ Analysis endpoint updated for sentence-level
  • ✅ Sentence category update endpoint (/api/update-sentence-category/<id>)
  • ✅ Training examples linked to sentences
  • ✅ Backward compatibility maintained

Files:

  • app/routes/admin.py (lines 372-429): Updated analyze endpoint
  • app/routes/admin.py (lines 305-354): Sentence category update endpoint

Phase 5: UI/UX ✅ COMPLETE

  • ✅ Collapsible sentence view in submissions
  • ✅ Category distribution badges
  • ✅ Individual sentence category dropdowns
  • ✅ Real-time sentence category editing
  • ✅ Visual feedback for changes

Files:

  • app/templates/admin/submissions.html (lines 69-116): Sentence-level UI

Phase 6: Dashboard Aggregation ✅ COMPLETE

  • ✅ Dual-mode dashboard (Submissions vs Sentences)
  • ✅ Toggle button for view mode
  • ✅ Sentence-based category statistics
  • ✅ Contributor breakdown by sentences
  • ✅ Backward compatible with submission-level

Files:

  • app/routes/admin.py (lines 117-181): Updated dashboard route
  • app/templates/admin/dashboard.html (lines 1-20): View mode selector

Phase 7: Migration & Testing ✅ COMPLETE

  • ✅ Migration script with SQL ALTER statements
  • ✅ Safely adds columns to existing tables
  • ✅ 60 submissions migrated successfully
  • ✅ Backward compatibility verified
  • ✅ Sentence-level analysis tested and working

Files:

  • migrations/migrate_to_sentence_level.py: Complete migration script

🎯 Additional Features Implemented

Training Data Management

  • βœ… Export training examples (with sentence-level filter)
  • βœ… Import training examples from JSON
  • βœ… Clear training examples (with safety options)
  • βœ… Sentence-level training data preference

Files:

  • app/routes/admin.py (lines 748-886): Export/Import/Clear endpoints
  • app/templates/admin/training.html (lines 64-126): Training data management UI

Fine-Tuning Enhancements

  • ✅ Sentence-level vs submission-level training toggle
  • ✅ Filters training data to use only sentence-level examples
  • ✅ Falls back to all examples if insufficient sentence-level data
  • ✅ Detailed progress tracking (epoch/step/loss)
  • ✅ Real-time progress updates during training

Files:

  • app/routes/admin.py (lines 893-910): Training data filtering
  • app/fine_tuning/trainer.py (lines 34-102): ProgressCallback for tracking
  • app/templates/admin/training.html (lines 174-189): Sentence-level training option

Model Management

  • ✅ Force delete training runs
  • ✅ Bypass all safety checks for stuck runs
  • ✅ Confirmation prompt requiring "DELETE" text
  • ✅ Model file cleanup on deletion

Files:

  • app/routes/admin.py (lines 1391-1430): Force delete endpoint
  • app/templates/admin/training.html (lines 920-952): Force delete function

📊 How It Works

1. Submission Flow

User submits text
    ↓
Stored in database
    ↓
Admin clicks "Analyze All"
    ↓
Text segmented into sentences (sentence_segmenter.py)
    ↓
Each sentence classified independently (analyzer.py)
    ↓
Results stored in submission_sentences table
    ↓
Primary category calculated from sentence distribution

2. Training Flow

Admin reviews sentences
    ↓
Corrects individual sentence categories
    ↓
Each correction creates a sentence-level training example
    ↓
Training examples exported/imported as needed
    ↓
Model trained using only sentence-level data (when enabled)
    ↓
Fine-tuned model deployed for better accuracy

3. Dashboard Aggregation

Admin selects view mode (Submissions vs Sentences)
    ↓
If Submissions: Count by primary category per submission
    ↓
If Sentences: Count all sentences by category
    ↓
Charts and statistics update accordingly
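
The difference between the two view modes comes down to what is counted. The sketch below assumes each submission is a dictionary with a primary "category" and a list of "sentences"; that shape and the function names are illustrative, not the dashboard route's actual code.

```python
from collections import Counter


def count_by_submissions(submissions: list[dict]) -> Counter:
    """One count per submission, using its primary category."""
    return Counter(s["category"] for s in submissions if s.get("category"))


def count_by_sentences(submissions: list[dict]) -> Counter:
    """One count per classified sentence across all submissions."""
    return Counter(
        sentence["category"]
        for s in submissions
        for sentence in s.get("sentences", [])
        if sentence.get("category")
    )


def category_counts(submissions: list[dict], mode: str = "submissions") -> Counter:
    """Pick the aggregation that matches the selected view mode."""
    if mode == "sentences":
        return count_by_sentences(submissions)
    return count_by_submissions(submissions)
```

Because a submission can contain several sentences, the totals in sentence mode are generally larger than in submission mode, which is why the category breakdown shows different totals in each view.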

🎨 UI Features

Submissions Page

  • View Sentences button shows count: (3) sentences
  • Click to expand collapsible sentence list
  • Each sentence displays:
    • Sentence number
    • Text content
    • Category dropdown (editable)
    • Confidence score (if available)
  • Category distribution badges show percentages

Dashboard

  • Toggle buttons: "By Submissions" | "By Sentences"
  • Charts update based on selected mode
  • Category breakdown shows different totals
  • Contributor statistics remain submission-based

Training Page

  • Checkbox: "Use Sentence-Level Training Data" (default: checked)
  • Export with "Sentence-level only" filter
  • Import shows sentence vs submission counts
  • Clear with "Sentence-level only" option

🗂️ Database Schema

submission_sentences Table

CREATE TABLE submission_sentences (
    id INTEGER PRIMARY KEY,
    submission_id INTEGER NOT NULL,
    sentence_index INTEGER NOT NULL,
    text TEXT NOT NULL,
    category VARCHAR(50),
    confidence REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (submission_id) REFERENCES submissions(id),
    UNIQUE (submission_id, sentence_index)
);

Updated submissions Table

ALTER TABLE submissions
ADD COLUMN sentence_analysis_done BOOLEAN DEFAULT 0;

Updated training_examples Table

ALTER TABLE training_examples
ADD COLUMN sentence_id INTEGER REFERENCES submission_sentences(id);
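
One way to apply these statements safely to an existing database is to check for each column before altering the table, so the migration can be re-run without errors. The sketch below uses Python's built-in sqlite3 module and assumes a SQLite database; the helper names are hypothetical and this is not the contents of migrations/migrate_to_sentence_level.py.

```python
import sqlite3


def add_column_if_missing(conn: sqlite3.Connection, table: str,
                          column: str, definition: str) -> bool:
    """Add a column only when the table does not already have it."""
    # Table and column names are trusted constants here, never user input.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {definition}")
    return True


def migrate(conn: sqlite3.Connection) -> None:
    """Create the sentence table and add the two new columns; safe to re-run."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS submission_sentences (
            id INTEGER PRIMARY KEY,
            submission_id INTEGER NOT NULL,
            sentence_index INTEGER NOT NULL,
            text TEXT NOT NULL,
            category VARCHAR(50),
            confidence REAL,
            created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
            FOREIGN KEY (submission_id) REFERENCES submissions(id),
            UNIQUE (submission_id, sentence_index)
        )
    """)
    add_column_if_missing(conn, "submissions",
                          "sentence_analysis_done", "BOOLEAN DEFAULT 0")
    add_column_if_missing(conn, "training_examples", "sentence_id",
                          "INTEGER REFERENCES submission_sentences(id)")
    conn.commit()
```

Existing rows pick up the default of 0 for sentence_analysis_done, and the UNIQUE constraint prevents the same sentence position from being stored twice for one submission.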

📈 Usage Statistics

Current Database (as of implementation):

  • Total submissions: 60
  • Sentence-level analyzed: Yes
  • Total training examples: 71
    • Sentence-level: 11
    • Submission-level: 60
  • Training runs: 12

🔧 Configuration

Enable Sentence-Level Analysis

In admin interface:

  1. Go to Submissions
  2. Click "Analyze All"
  3. System automatically uses sentence-level (default)

Train with Sentence Data

In admin interface:

  1. Go to Training
  2. Check "Use Sentence-Level Training Data"
  3. Click "Start Training"
  4. System uses only sentence-level examples (falls back to all examples if fewer than 20 are available)
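
The selection rule in step 4 can be sketched as a small filter. The 20-example threshold comes from the step above; the dictionary shape, the `sentence_id` check, and the function name are assumptions for illustration rather than the code in app/routes/admin.py.

```python
MIN_SENTENCE_EXAMPLES = 20  # threshold mentioned in step 4


def select_training_examples(examples: list[dict],
                             sentence_level_only: bool = True,
                             minimum: int = MIN_SENTENCE_EXAMPLES) -> list[dict]:
    """Prefer sentence-level examples, falling back to all examples if too few."""
    if not sentence_level_only:
        return list(examples)
    # Assumed convention: an example is sentence-level when it has a sentence_id.
    sentence_level = [e for e in examples if e.get("sentence_id") is not None]
    if len(sentence_level) < minimum:
        return list(examples)  # not enough sentence-level data yet
    return sentence_level
```

Under this sketch, the counts reported in Usage Statistics (11 sentence-level and 60 submission-level examples) would trigger the fallback, so training would use all 71 examples.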

View Sentence Analytics

In admin interface:

  1. Go to Dashboard
  2. Click "By Sentences" toggle
  3. Charts show sentence-based aggregation

🚀 Performance Notes

Sentence Segmentation: ~50-100ms per submission (rule-based, fast)

Classification: ~200-500ms per sentence (BART model, CPU)

  • 3-sentence submission: ~600-1500ms total
  • Can be parallelized in future
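
Since each sentence is classified independently, the per-sentence calls could in principle run concurrently. The sketch below uses a thread pool from the standard library; it is not part of the current implementation, the function name and `max_workers` value are assumptions, and whether threads actually reduce latency depends on the model runtime (batching the sentences into one model call may work better).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")


def classify_parallel(sentences: list[str],
                      classify: Callable[[str], T],
                      max_workers: int = 4) -> list[T]:
    """Run the classifier on every sentence concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(classify, sentences))
```

Preserving order matters here because each result has to be matched back to its sentence position within the submission.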

Database Queries: Optimized with indexes on foreign keys

UI Rendering: Lazy loading with Bootstrap collapse components


🔄 Backward Compatibility

✅ Fully backward compatible:

  • Old submission.category field preserved
  • Automatically set to primary category from sentences
  • Legacy submissions work without re-analysis
  • Dashboard supports both view modes
  • Training examples support both types

📝 Next Steps (Future Enhancements)

Potential Improvements

  1. ⏭️ Parallel sentence classification (faster bulk analysis)
  2. ⏭️ Confidence threshold filtering
  3. ⏭️ Sentence-level map markers (optional)
  4. ⏭️ Advanced NLP: Named entity recognition
  5. ⏭️ Sentence similarity clustering
  6. ⏭️ Multi-language support

Optimization Opportunities

  1. ⏭️ Cache sentence segmentation results
  2. ⏭️ Batch sentence classification API
  3. ⏭️ Database indexes on category fields
  4. ⏭️ Async processing for large batches

✅ Verification Checklist

  • Database schema updated
  • Migration script runs successfully
  • Sentence segmentation working
  • Each sentence classified independently
  • UI shows sentence breakdown
  • Category distribution calculated correctly
  • Training examples linked to sentences
  • Dashboard dual-mode working
  • Export/import preserves sentence data
  • Backward compatibility maintained
  • Documentation updated
  • All features tested end-to-end

📚 Related Documentation

  • README.md - Updated with sentence-level features
  • NEXT_STEPS_CATEGORIZATION.md - Implementation guidance
  • TRAINING_DATA_MANAGEMENT.md - Export/import workflows

🎯 Conclusion

Sentence-level categorization is fully operational!

The system now:

  • βœ… Segments submissions into sentences
  • βœ… Classifies each sentence independently
  • βœ… Shows detailed breakdown in UI
  • βœ… Trains models on sentence-level data
  • βœ… Provides dual-mode analytics
  • βœ… Maintains backward compatibility

Total Implementation Time: ~18 hours (original estimate: 13-20 hours)

Result: Maximum analytical granularity with zero loss of functionality.