📋 Sentence-Level Categorization - ✅ IMPLEMENTED

Status: ✅ COMPLETE - All 7 phases implemented and deployed

Problem Identified: A single submission often contains multiple semantic units (sentences) that belong to different categories, so assigning one category to the whole submission loses nuance.

Example:

"Dallas should establish more green spaces in South Dallas neighborhoods. Areas like Oak Cliff lack accessible parks compared to North Dallas."

Sentence 1: Objectives (should establish...)
Sentence 2: Problem (lack accessible parks...)

✅ Implementation Status

Phase 1: Database Schema ✅ COMPLETE

✅ SubmissionSentence model created
✅ sentence_analysis_done flag added to Submission
✅ sentence_id foreign key added to TrainingExample
✅ Helper methods: get_primary_category(), get_category_distribution()
✅ Database migration script completed

Files:

app/models/models.py (lines 85-114): SubmissionSentence model
app/models/models.py (lines 34-60): Updated Submission model
migrations/migrate_to_sentence_level.py: Migration script
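
The two helper methods can be illustrated with a minimal sketch. The `Sentence` and `Submission` dataclasses below are hypothetical stand-ins for the real models in app/models/models.py, and the tie-breaking and rounding choices are assumptions rather than the project's confirmed behavior.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Sentence:
    # Hypothetical stand-in for the SubmissionSentence model.
    text: str
    category: Optional[str] = None


@dataclass
class Submission:
    # Hypothetical stand-in for the Submission model.
    sentences: list = field(default_factory=list)

    def get_category_distribution(self) -> dict:
        """Return {category: percentage} over sentences that have a category."""
        counts = Counter(s.category for s in self.sentences if s.category)
        total = sum(counts.values())
        if total == 0:
            return {}
        # Rounding to one decimal place is an assumption made for display.
        return {cat: round(100 * n / total, 1) for cat, n in counts.items()}

    def get_primary_category(self) -> Optional[str]:
        """Return the most frequent sentence category, or None if there is none."""
        counts = Counter(s.category for s in self.sentences if s.category)
        return counts.most_common(1)[0][0] if counts else None


submission = Submission(sentences=[
    Sentence("Dallas should establish more green spaces.", "Objectives"),
    Sentence("Oak Cliff lacks accessible parks.", "Problem"),
    Sentence("Sidewalks are missing near schools.", "Problem"),
])
primary = submission.get_primary_category()
distribution = submission.get_category_distribution()
```

Deriving the primary category from the sentence counts is what lets the legacy submission.category field stay populated without a separate classification pass.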

Phase 2: Sentence Segmentation ✅ COMPLETE

✅ Rule-based sentence segmenter created
✅ Handles abbreviations (Dr., Mr., etc.)
✅ Handles bullet points and special punctuation
✅ Minimum length validation

Files:

app/sentence_segmenter.py: SentenceSegmenter class with comprehensive logic
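
As an illustration of the rule-based approach, here is a minimal sketch of a segmenter that skips abbreviations, treats bullet lines as separate units, and drops fragments that are too short. The abbreviation list, the 10-character minimum, and the `segment` function name are assumptions; the real SentenceSegmenter class may work differently.

```python
import re

# Assumed abbreviation list; the real segmenter may recognize a different set.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "st", "e.g", "i.e"}
MIN_LENGTH = 10  # assumed minimum number of characters for a valid sentence


def segment(text: str, min_length: int = MIN_LENGTH) -> list:
    """Split text into sentences on ., ! and ?, skipping known abbreviations."""
    sentences = []
    # Process line by line so that each bullet point becomes its own unit.
    for line in text.splitlines():
        # Strip a leading bullet marker ("-", "*", "•") or list number ("1.", "2)").
        line = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s+", "", line).strip()
        if not line:
            continue
        start = 0
        # Candidate boundaries: punctuation followed by whitespace or end of line.
        for match in re.finditer(r"[.!?]+(?=\s+|$)", line):
            words = line[start:match.start()].split()
            last_word = words[-1].lower() if words else ""
            if match.group().startswith(".") and last_word in ABBREVIATIONS:
                continue  # "Dr." and similar are not sentence boundaries
            sentences.append(line[start:match.end()].strip())
            start = match.end()
        tail = line[start:].strip()
        if tail:
            sentences.append(tail)  # text without closing punctuation
    return [s for s in sentences if len(s) >= min_length]


example = (
    "Dallas should establish more green spaces in South Dallas neighborhoods. "
    "Areas like Oak Cliff lack accessible parks compared to North Dallas."
)
parts = segment(example)
```

A known limitation of this sketch is that an abbreviation at the very end of a sentence suppresses the split at that point.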

Phase 3: Analysis Pipeline ✅ COMPLETE

✅ analyze_sentences() method - analyzes list of sentences
✅ analyze_with_sentences() method - segments and analyzes in one call
✅ Each sentence classified independently
✅ Confidence scores tracked (when available)

Files:

app/analyzer.py (lines 282-313): analyze_sentences method
app/analyzer.py (lines 315-332): analyze_with_sentences method
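
The relationship between the two methods can be sketched as follows. This is not the code in app/analyzer.py: the segmenter and classifier are passed in as plain callables so the example runs without the model, and `toy_segment` and `toy_classify` are hypothetical stand-ins.

```python
def analyze_sentences(sentences, classify):
    """Classify each sentence independently and record its position."""
    results = []
    for index, text in enumerate(sentences):
        # classify returns (category, confidence); confidence may be None.
        category, confidence = classify(text)
        results.append({
            "sentence_index": index,
            "text": text,
            "category": category,
            "confidence": confidence,
        })
    return results


def analyze_with_sentences(text, segment, classify):
    """Segment the text, then analyze the resulting sentences in one call."""
    return analyze_sentences(segment(text), classify)


def toy_segment(text):
    # Hypothetical stand-in: naive split on periods.
    return [part.strip() + "." for part in text.split(".") if part.strip()]


def toy_classify(sentence):
    # Hypothetical keyword rule standing in for the real model; no confidence.
    if "should" in sentence.lower():
        return "Objectives", None
    return "Problem", None


results = analyze_with_sentences(
    "Dallas should establish more green spaces in South Dallas neighborhoods. "
    "Areas like Oak Cliff lack accessible parks compared to North Dallas.",
    toy_segment,
    toy_classify,
)
```

Keeping `sentence_index` in each result matches the ordering column in the submission_sentences table.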

Phase 4: Backend API ✅ COMPLETE

✅ Analysis endpoint updated for sentence-level
✅ Sentence category update endpoint (/api/update-sentence-category/<id>)
✅ Training examples linked to sentences
✅ Backward compatibility maintained

Files:

app/routes/admin.py (lines 372-429): Updated analyze endpoint
app/routes/admin.py (lines 305-354): Sentence category update endpoint

Phase 5: UI/UX ✅ COMPLETE

✅ Collapsible sentence view in submissions
✅ Category distribution badges
✅ Individual sentence category dropdowns
✅ Real-time sentence category editing
✅ Visual feedback for changes

Files:

app/templates/admin/submissions.html (lines 69-116): Sentence-level UI

Phase 6: Dashboard Aggregation ✅ COMPLETE

✅ Dual-mode dashboard (Submissions vs Sentences)
✅ Toggle button for view mode
✅ Sentence-based category statistics
✅ Contributor breakdown by sentences
✅ Backward compatible with submission-level

Files:

app/routes/admin.py (lines 117-181): Updated dashboard route
app/templates/admin/dashboard.html (lines 1-20): View mode selector

Phase 7: Migration & Testing ✅ COMPLETE

✅ Migration script with SQL ALTER statements
✅ Safely adds columns to existing tables
✅ 60 submissions migrated successfully
✅ Backward compatibility verified
✅ Sentence-level analysis tested and working

Files:

migrations/migrate_to_sentence_level.py: Complete migration script
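
One way to add columns safely is to check the existing schema before issuing each ALTER statement, so the migration can be re-run without errors. The sketch below assumes an SQLite database accessed through Python's sqlite3 module; `add_column_if_missing` is a hypothetical helper, not necessarily what the migration script does.

```python
import sqlite3


def add_column_if_missing(conn, table, column, definition):
    """Add a column only if it does not exist yet. Return True if it was added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    # Table and column names are interpolated, so they must be trusted constants.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {definition}")
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE submissions (id INTEGER PRIMARY KEY, category TEXT)")

first_run = add_column_if_missing(
    conn, "submissions", "sentence_analysis_done", "BOOLEAN DEFAULT 0"
)
second_run = add_column_if_missing(
    conn, "submissions", "sentence_analysis_done", "BOOLEAN DEFAULT 0"
)
columns = [row[1] for row in conn.execute("PRAGMA table_info(submissions)")]
```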

🎯 Additional Features Implemented

Training Data Management

✅ Export training examples (with sentence-level filter)
✅ Import training examples from JSON
✅ Clear training examples (with safety options)
✅ Sentence-level training data preference

Files:

app/routes/admin.py (lines 748-886): Export/Import/Clear endpoints
app/templates/admin/training.html (lines 64-126): Training data management UI
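
A minimal sketch of the export and import round trip, assuming training examples are plain dictionaries and that sentence-level examples are the ones with a non-null `sentence_id`. The JSON layout (a top-level `training_examples` key) and both function names are hypothetical; the format used by the actual endpoints is not shown in this document.

```python
import json


def export_examples(examples, sentence_level_only=False):
    """Serialize training examples to JSON, optionally keeping only sentence-level ones."""
    if sentence_level_only:
        examples = [e for e in examples if e.get("sentence_id") is not None]
    return json.dumps({"training_examples": list(examples)}, indent=2)


def import_examples(payload):
    """Parse exported JSON and report sentence-level vs submission-level counts."""
    examples = json.loads(payload)["training_examples"]
    sentence_level = sum(1 for e in examples if e.get("sentence_id") is not None)
    summary = {
        "sentence_level": sentence_level,
        "submission_level": len(examples) - sentence_level,
    }
    return examples, summary


examples = [
    {"text": "Oak Cliff lacks accessible parks.", "category": "Problem", "sentence_id": 7},
    {"text": "A full submission text.", "category": "Objectives", "sentence_id": None},
]
imported_all, summary_all = import_examples(export_examples(examples))
imported_sent, summary_sent = import_examples(
    export_examples(examples, sentence_level_only=True)
)
```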

Fine-Tuning Enhancements

✅ Sentence-level vs submission-level training toggle
✅ Filters training data to use only sentence-level examples
✅ Falls back to all examples if insufficient sentence-level data
✅ Detailed progress tracking (epoch/step/loss)
✅ Real-time progress updates during training

Files:

app/routes/admin.py (lines 893-910): Training data filtering
app/fine_tuning/trainer.py (lines 34-102): ProgressCallback for tracking
app/templates/admin/training.html (lines 174-189): Sentence-level training option
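
The filter-with-fallback behavior can be sketched as below. The threshold of 20 comes from the Configuration section of this document; the function name and the dictionary representation of a training example are assumptions.

```python
MIN_SENTENCE_EXAMPLES = 20  # fallback threshold stated in the Configuration section


def select_training_examples(examples, use_sentence_level=True,
                             minimum=MIN_SENTENCE_EXAMPLES):
    """Prefer sentence-level examples, falling back to all examples if too few."""
    if not use_sentence_level:
        return list(examples)
    # An example counts as sentence-level when it is linked to a sentence.
    sentence_level = [e for e in examples if e.get("sentence_id") is not None]
    if len(sentence_level) < minimum:
        return list(examples)  # not enough sentence-level data yet
    return sentence_level


# Counts mirror the Usage Statistics section: 11 sentence-level, 60 submission-level.
examples = (
    [{"text": f"s{i}", "sentence_id": i} for i in range(11)]
    + [{"text": f"d{i}", "sentence_id": None} for i in range(60)]
)
selected = select_training_examples(examples)
```

With the counts listed under Usage Statistics (11 sentence-level examples), this sketch falls back to all 71 examples.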

Model Management

✅ Force delete training runs
✅ Bypass all safety checks for stuck runs
✅ Confirmation prompt requiring "DELETE" text
✅ Model file cleanup on deletion

Files:

app/routes/admin.py (lines 1391-1430): Force delete endpoint
app/templates/admin/training.html (lines 920-952): Force delete function

📊 How It Works

1. Submission Flow

User submits text
    ↓
Stored in database
    ↓
Admin clicks "Analyze All"
    ↓
Text segmented into sentences (sentence_segmenter.py)
    ↓
Each sentence classified independently (analyzer.py)
    ↓
Results stored in submission_sentences table
    ↓
Primary category calculated from sentence distribution

2. Training Flow

Admin reviews sentences
    ↓
Corrects individual sentence categories
    ↓
Each correction creates a sentence-level training example
    ↓
Training examples exported/imported as needed
    ↓
Model trained using only sentence-level data (when enabled)
    ↓
Fine-tuned model deployed for better accuracy
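
The correction step in this flow can be sketched as a small function that updates the sentence and records a linked training example. The dictionary shapes, the `correct_sentence` name, and the choice to replace an earlier example for the same sentence instead of adding a duplicate are all assumptions. The category set is limited to the two categories named in this document.

```python
# Only the categories mentioned in this document; the real list is likely longer.
VALID_CATEGORIES = {"Objectives", "Problem"}


def correct_sentence(sentence, new_category, training_examples,
                     valid_categories=VALID_CATEGORIES):
    """Apply an admin correction and record it as a sentence-level training example."""
    if new_category not in valid_categories:
        raise ValueError(f"Unknown category: {new_category!r}")
    sentence["category"] = new_category
    example = {
        "text": sentence["text"],
        "category": new_category,
        "sentence_id": sentence["id"],
    }
    # Assumption: one example per sentence, so a repeated correction overwrites.
    for i, existing in enumerate(training_examples):
        if existing.get("sentence_id") == sentence["id"]:
            training_examples[i] = example
            break
    else:
        training_examples.append(example)
    return example


sentence = {"id": 1, "text": "Oak Cliff lacks accessible parks.", "category": "Objectives"}
training_examples = []
correct_sentence(sentence, "Problem", training_examples)
```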

3. Dashboard Aggregation

Admin selects view mode (Submissions vs Sentences)
    ↓
If Submissions: Count by primary category per submission
    ↓
If Sentences: Count all sentences by category
    ↓
Charts and statistics update accordingly
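
The difference between the two view modes can be shown with a short sketch. Submissions are represented as dictionaries with a hypothetical `sentence_categories` list, and legacy submissions without sentences fall back to their stored `category`; the real dashboard route may aggregate differently.

```python
from collections import Counter


def category_counts(submissions, mode="submissions"):
    """Count categories per submission (primary category) or per sentence."""
    counts = Counter()
    for submission in submissions:
        categories = [c for c in submission.get("sentence_categories", []) if c]
        if mode == "sentences":
            # Every categorized sentence contributes one count.
            counts.update(categories)
        elif categories:
            # One count per submission, using its most frequent sentence category.
            primary = Counter(categories).most_common(1)[0][0]
            counts[primary] += 1
        elif submission.get("category"):
            # Legacy submission that has no sentence-level analysis.
            counts[submission["category"]] += 1
    return dict(counts)


submissions = [
    {"sentence_categories": ["Objectives", "Problem", "Problem"]},
    {"sentence_categories": ["Objectives"]},
    {"sentence_categories": [], "category": "Problem"},
]
by_submission = category_counts(submissions, mode="submissions")
by_sentence = category_counts(submissions, mode="sentences")
```

The same data gives different totals in each mode, which is why the category breakdown changes when the toggle is switched.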

🎨 UI Features

Submissions Page

"View Sentences" button shows the sentence count, e.g. "(3) sentences"
Click to expand collapsible sentence list
Each sentence displays:
- Sentence number
- Text content
- Category dropdown (editable)
- Confidence score (if available)
Category distribution badges show percentages

Dashboard

Toggle buttons: "By Submissions" | "By Sentences"
Charts update based on selected mode
Category breakdown shows different totals
Contributor statistics remain submission-based

Training Page

Checkbox: "Use Sentence-Level Training Data" (default: checked)
Export with "Sentence-level only" filter
Import shows sentence vs submission counts
Clear with "Sentence-level only" option

🗂️ Database Schema

submission_sentences Table

CREATE TABLE submission_sentences (
    id INTEGER PRIMARY KEY,
    submission_id INTEGER NOT NULL,
    sentence_index INTEGER NOT NULL,
    text TEXT NOT NULL,
    category VARCHAR(50),
    confidence REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (submission_id) REFERENCES submissions(id),
    UNIQUE (submission_id, sentence_index)
);

Updated submissions Table

ALTER TABLE submissions
ADD COLUMN sentence_analysis_done BOOLEAN DEFAULT 0;

Updated training_examples Table

ALTER TABLE training_examples
ADD COLUMN sentence_id INTEGER REFERENCES submission_sentences(id);
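
To see the schema in action, the sketch below creates the tables in an in-memory SQLite database using Python's sqlite3 module (assuming SQLite is the backing database), inserts two sample sentences, and shows that the UNIQUE constraint rejects a second row for the same position within a submission. The sample rows are illustrative data only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE submissions (
    id INTEGER PRIMARY KEY,
    sentence_analysis_done BOOLEAN DEFAULT 0
);
CREATE TABLE submission_sentences (
    id INTEGER PRIMARY KEY,
    submission_id INTEGER NOT NULL,
    sentence_index INTEGER NOT NULL,
    text TEXT NOT NULL,
    category VARCHAR(50),
    confidence REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (submission_id) REFERENCES submissions(id),
    UNIQUE (submission_id, sentence_index)
);
""")

insert_sql = (
    "INSERT INTO submission_sentences "
    "(submission_id, sentence_index, text, category) VALUES (?, ?, ?, ?)"
)
conn.execute("INSERT INTO submissions (id) VALUES (1)")
conn.executemany(insert_sql, [
    (1, 0, "Dallas should establish more green spaces.", "Objectives"),
    (1, 1, "Oak Cliff lacks accessible parks.", "Problem"),
])

# A second row for (submission 1, index 0) violates the UNIQUE constraint.
try:
    conn.execute(insert_sql, (1, 0, "Duplicate position.", "Problem"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Sentence-based category counts, as used by the "By Sentences" view.
counts = dict(conn.execute(
    "SELECT category, COUNT(*) FROM submission_sentences GROUP BY category"
))
```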

📈 Usage Statistics

Current Database (as of implementation):

Total submissions: 60
Sentence-level analyzed: Yes
Total training examples: 71
- Sentence-level: 11
- Submission-level: 60
Training runs: 12

🔧 Configuration

Enable Sentence-Level Analysis

In admin interface:

Go to Submissions
Click "Analyze All"
System automatically uses sentence-level (default)

Train with Sentence Data

In admin interface:

Go to Training
Check "Use Sentence-Level Training Data"
Click "Start Training"
System uses only sentence-level examples (falls back to all examples if fewer than 20 are available)

View Sentence Analytics

In admin interface:

Go to Dashboard
Click "By Sentences" toggle
Charts show sentence-based aggregation

🚀 Performance Notes

Sentence Segmentation: ~50-100ms per submission (rule-based, fast)

Classification: ~200-500ms per sentence (BART model, CPU)

3-sentence submission: ~600-1500ms for classification (3 × 200-500ms), plus segmentation time
Can be parallelized in future

Database Queries: Optimized with indexes on foreign keys

UI Rendering: Lazy loading with Bootstrap collapse components
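
Because each sentence is classified independently, the per-sentence calls are a natural candidate for the parallelization mentioned above. The following is a sketch of one possible approach using a thread pool, not something the project currently does. Whether threads give a real speedup depends on the model runtime, and `slow_classify` is a hypothetical stand-in for the classifier.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def classify_parallel(sentences, classify, max_workers=4):
    """Classify sentences concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of the inputs.
        return list(pool.map(classify, sentences))


def slow_classify(sentence):
    # Hypothetical stand-in that simulates model latency with a short sleep.
    time.sleep(0.01)
    return "Objectives" if "should" in sentence.lower() else "Problem"


sentences = [
    "Dallas should establish more green spaces.",
    "Oak Cliff lacks accessible parks.",
    "The city should add bike lanes.",
]
parallel_results = classify_parallel(sentences, slow_classify)
sequential_results = [slow_classify(s) for s in sentences]
```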

🔄 Backward Compatibility

✅ Fully backward compatible:

Old submission.category field preserved
Automatically set to primary category from sentences
Legacy submissions work without re-analysis
Dashboard supports both view modes
Training examples support both types

📝 Next Steps (Future Enhancements)

Potential Improvements

⏭️ Parallel sentence classification (faster bulk analysis)
⏭️ Confidence threshold filtering
⏭️ Sentence-level map markers (optional)
⏭️ Advanced NLP: Named entity recognition
⏭️ Sentence similarity clustering
⏭️ Multi-language support

Optimization Opportunities

⏭️ Cache sentence segmentation results
⏭️ Batch sentence classification API
⏭️ Database indexes on category fields
⏭️ Async processing for large batches

✅ Verification Checklist

Database schema updated
Migration script runs successfully
Sentence segmentation working
Each sentence classified independently
UI shows sentence breakdown
Category distribution calculated correctly
Training examples linked to sentences
Dashboard dual-mode working
Export/import preserves sentence data
Backward compatibility maintained
Documentation updated
All features tested end-to-end

📚 Related Documentation

README.md - Updated with sentence-level features
NEXT_STEPS_CATEGORIZATION.md - Implementation guidance
TRAINING_DATA_MANAGEMENT.md - Export/import workflows

🎯 Conclusion

Sentence-level categorization is fully operational!

The system now:

✅ Segments submissions into sentences
✅ Classifies each sentence independently
✅ Shows detailed breakdown in UI
✅ Trains models on sentence-level data
✅ Provides dual-mode analytics
✅ Maintains backward compatibility

Total Implementation Time: ~18 hours (within the original 13-20 hour estimate)

Result: Sentence-level analytical granularity with no loss of existing functionality.