echoboi/veganism_and_vegetarianism-distilbert-classifier

@@ -12,210 +12,118 @@ license: mit
 # Veganism & Vegetarianism Classifier (Distilbert)
-This model classifies content related veganism, vegetarianism, and sustainable food choices.
 ## Model Details
-- **Model Type**: Distilbert
-- **Task**: Multilabel text classification
-- **Sector**: Veganism & Vegetarianism
-- **Base Model**: Distilbert base uncased
-- **Number of Labels**: 7
-- **Training Data**: Reddit posts and comments from r/climate, r/climateactionplan, r/climatechange, r/climatechaos, r/climateco, r/climatecrisis, r/climatecrisiscanada, r/climatedisalarm, r/climatejobslist, r/climatejustice, r/climatememes, r/climatenews, r/climateoffensive, r/climatepolicy, r/climate_science
-### Training Data Sources
-- **Source Subreddits**: r/climate, r/climateactionplan, r/climatechange, r/climatechaos, r/climateco, r/climatecrisis, r/climatecrisiscanada, r/climatedisalarm, r/climatejobslist, r/climatejustice, r/climatememes, r/climatenews, r/climateoffensive, r/climatepolicy, r/climate_science
-- **Data Period**: 2010-2023
-- **Focus Areas**: Plant-based nutrition, sustainable agriculture, ethical food choices, alternative proteins, food policy
-- **Data Type**: Posts and comments filtered by regex patterns
-- **Labeling Method**: GPT-assisted multilabel classification
 ## Labels
-The model predicts **7 labels** simultaneously (multilabel classification). Each text can be classified into multiple categories:
-1. **Animal Welfare**
-2. **Environmental Impact**
-3. **Health**
-4. **Lab Grown And Alt Proteins**
-5. **Psychology And Identity**
-6. **Systemic Vs Individual Action**
-7. **Taste And Convenience**
-**⚠️ Important**: The order of labels in the output predictions corresponds exactly to the order listed above. When using the model, ensure your label list matches this order.
-**💡 Tip**: For best performance, use the optimal thresholds provided in the "Optimal Thresholds" section below instead of the default 0.5 threshold.
-**🔧 Threshold Optimization**: The optimal thresholds were computed using Jaccard similarity optimization on the validation set. For your own dataset, consider re-optimizing thresholds using the same method from `paper4_BERT_finetuning.py`.
 ## Usage
 ```python
 import torch
-import numpy as np
 from transformers import DistilBertTokenizer
-import sys
-import os
 import tempfile
 from huggingface_hub import snapshot_download
 device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
-def print_sorted_label_scores(label_scores):
-    # Sort label_scores dict by score descending
-    sorted_items = sorted(label_scores.items(), key=lambda x: x[1], reverse=True)
-    for label, score in sorted_items:
-        print(f"  {label}: {score:.6f}")
 # Download and load model
 model_link = "sanchow/veganism_and_vegetarianism-distilbert-classifier"
 with tempfile.TemporaryDirectory() as temp_dir:
-    snapshot_download(
-        repo_id=model_link,
-        local_dir=temp_dir,
-        local_dir_use_symlinks=False
-    )
-    # Import the model class
     sys.path.insert(0, temp_dir)
     from model_class import MultilabelClassifier
-    # Load tokenizer and model
     tokenizer = DistilBertTokenizer.from_pretrained(temp_dir)
-    checkpoint = torch.load(os.path.join(temp_dir, 'model.pt'), map_location='cpu', weights_only=False)
     model = MultilabelClassifier(checkpoint['model_name'], len(checkpoint['label_names']))
     model.load_state_dict(checkpoint['model_state_dict'])
     model.to(device)
     model.eval()
-    print("Model loaded successfully")
-    print(f"   Labels: {checkpoint['label_names']}")
-# Example usage with multiple examples
-examples = [
-    "Your first example text here.",
-    "Your second example text here.",
-    "Your third example text here."
-]
-print(f"\n{sector_info['name']} classifier results:\n")
-for i, test_text in enumerate(examples):
-    inputs = tokenizer(
-        test_text,
-        return_tensors="pt",
-        truncation=True,
-        max_length=512,
-        padding=True
-    ).to(device)
-    with torch.no_grad():
-        outputs = model(**inputs)
-        predictions = outputs.cpu().numpy() if isinstance(outputs, (tuple, list)) else outputs.cpu().numpy()
-    label_scores = {label: float(score) for label, score in zip(checkpoint['label_names'], predictions[0])}
-    print(f"Example {i+1}: '{test_text}'")
-    print("Predictions (all label scores, highest first):")
-    print_sorted_label_scores(label_scores)
-    print("-" * 40)
 ```
-## Performance Metrics
-**Best Model Performance (Epoch 5):**
-- Micro Jaccard Score: 0.5584
-- Macro Jaccard Score: 0.6710
 - F1 Score: 0.8906
 - Accuracy: 0.8906
-- Precision: 0.8906
-- Recall: 0.8906
-**Training Sizes:**
-- Training Samples: ~600 per sector
-- Validation Samples: ~150 per sector
-- Test Samples: ~150 per sector
-- Total GPT-labeled samples: ~900 per sector
 ## Optimal Thresholds
-For best performance, use these thresholds to convert continuous scores to binary predictions:
-- **Animal Welfare**: 0.481
-- **Environmental Impact**: 0.459
-- **Health**: 0.201
-- **Lab Grown And Alt Proteins**: 0.341
-- **Psychology And Identity**: 0.525
-- **Systemic Vs Individual Action**: 0.375
-- **Taste And Convenience**: 0.664
-**Usage with optimal thresholds:**
 ```python
-# Define optimal thresholds for this model
 optimal_thresholds = {'Animal Welfare': 0.48107979620047003, 'Environmental Impact': 0.45919171852850427, 'Health': 0.20115313966833437, 'Lab Grown And Alt Proteins': 0.3414601502146817, 'Psychology And Identity': 0.5246278637433214, 'Systemic Vs Individual Action': 0.37517437676211585, 'Taste And Convenience': 0.6635140143644325}
-# Apply thresholds to get binary predictions
-for i, (label, score) in enumerate(zip(label_names, predictions[0])):
     threshold = optimal_thresholds.get(label, 0.5)
     if score > threshold:
-        print(f"{label}: {score:.3f} (threshold: {threshold:.3f})")
 ```
-## Threshold Optimization
-The optimal thresholds provided above were computed using **Jaccard similarity optimization** on the validation dataset. This method finds the best threshold for each label that maximizes the Jaccard similarity between predicted and true labels.
-### Optimization Method Used
-The thresholds were optimized using the `find_optimal_thresholds_jaccard_global` function from `paper4_multilabel_threshold_optimizer.py`, which:
-1. **Grid Search**: Tests threshold values from 0.1 to 0.9 in 0.05 increments
-2. **Jaccard Optimization**: Maximizes micro-averaged Jaccard similarity
-3. **Per-Label Optimization**: Finds optimal threshold for each label independently
-4. **Global Optimization**: Considers the overall multilabel performance
-### Re-optimizing for Your Dataset
-For best results on your specific dataset, consider re-optimizing thresholds:
-```python
-from paper4_multilabel_threshold_optimizer import find_optimal_thresholds_jaccard_global
-# Load your validation data
-validation_data = {
-    'texts': ['your text 1', 'your text 2', ...],
-    'true_labels': [['label1', 'label2'], ['label3'], ...]
-}
-# Create sector models dict (as expected by the optimizer)
-sector_models = {
-    'your_sector': {
-        'model': model,
-        'tokenizer': tokenizer,
-        'label_names': label_names
-    }
-}
-# Find optimal thresholds for your data
-optimal_thresholds = find_optimal_thresholds_jaccard_global(
-    sector_models,
-    validation_data,
-    device=device
-)
-# Use the optimized thresholds
-thresholds = optimal_thresholds['your_sector']
-```
-### Alternative Optimization Methods
-You can also implement other threshold optimization strategies:
-- **F1-score optimization**: Maximize F1-score instead of Jaccard
-- **Precision/Recall trade-off**: Optimize for specific precision/recall requirements
-- **Cost-sensitive optimization**: Weight different types of errors differently
 ## Citation
 If you use this model in your research, please cite:
@@ -231,9 +139,9 @@ If you use this model in your research, please cite:
 }
 ```
-## Model Limitations
 - Trained on Reddit data from specific subreddits
-- May not generalize to other platforms or contexts
-- Performance depends on the quality of GPT-generated labels
-- Limited to English language content

 # Veganism & Vegetarianism Classifier (Distilbert)
+This model classifies content related to plant-based diets, sustainable food systems, and ethical eating. It analyzes discussions about veganism, vegetarianism, and sustainable food choices.
 ## Model Details
+- Model Type: Distilbert
+- Task: Multilabel text classification
+- Sector: Veganism & Vegetarianism
+- Base Model: Distilbert base uncased
+- Labels: 7
+- Training Data: Reddit posts from climate subreddits (2010-2023)
+## Training
+Trained on GPT-labeled Reddit data:
+1. Data collection from climate subreddits
+2. Regex filtering for sector-specific content
+3. GPT labeling for multilabel classification
+4. 80/10/10 train/validation/test split
+5. Fine-tuning with threshold optimization
 ## Labels
+The model predicts 7 labels simultaneously:
+1. Animal Welfare
+2. Environmental Impact
+3. Health
+4. Lab Grown And Alt Proteins
+5. Psychology And Identity
+6. Systemic Vs Individual Action
+7. Taste And Convenience
+Note: Label order in predictions matches the order above.
 ## Usage
 ```python
 import torch
 from transformers import DistilBertTokenizer
+import os
+import sys
 import tempfile
 from huggingface_hub import snapshot_download
 device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
 # Download and load model
 model_link = "sanchow/veganism_and_vegetarianism-distilbert-classifier"
 with tempfile.TemporaryDirectory() as temp_dir:
+    snapshot_download(repo_id=model_link, local_dir=temp_dir, local_dir_use_symlinks=False)
     sys.path.insert(0, temp_dir)
     from model_class import MultilabelClassifier
     tokenizer = DistilBertTokenizer.from_pretrained(temp_dir)
+    checkpoint = torch.load(os.path.join(temp_dir, 'model.pt'), map_location='cpu')
     model = MultilabelClassifier(checkpoint['model_name'], len(checkpoint['label_names']))
     model.load_state_dict(checkpoint['model_state_dict'])
     model.to(device)
     model.eval()
+# Predict
+text = "Your text here"
+inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
+with torch.no_grad():
+    predictions = model(**inputs).cpu().numpy()
+# Get scores
+label_scores = {label: float(score) for label, score in zip(checkpoint['label_names'], predictions[0])}
 ```
+## Applications
+- Content analysis of social media discussions
+- Research on public sentiment and discourse
+- Policy analysis of key topics and concerns
+- Market research on trends and interests
+## Performance
+Best model performance:
+- Micro Jaccard: 0.5584
+- Macro Jaccard: 0.6710
 - F1 Score: 0.8906
 - Accuracy: 0.8906
+Dataset: ~900 GPT-labeled samples per sector (600 train, 150 validation, 150 test)
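The micro and macro Jaccard figures above can be illustrated with a minimal sketch. It assumes binary indicator matrices of shape `(n_samples, n_labels)` and uses one common definition of each average; the card does not state which averaging convention produced the reported numbers, so `jaccard_scores` is a hypothetical helper and not the author's evaluation code.

```python
import numpy as np

def jaccard_scores(y_true, y_pred):
    """Return (micro, macro) Jaccard for binary indicator matrices.

    Assumption: micro pools intersections and unions over all labels,
    macro averages the per-label Jaccard scores.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    # Per-label intersection and union counts across samples
    inter = (y_true & y_pred).sum(axis=0)
    union = (y_true | y_pred).sum(axis=0)
    # Micro: pool counts over every label before dividing
    micro = inter.sum() / union.sum() if union.sum() else 0.0
    # Macro: per-label score first (0 where a label never occurs), then the mean
    per_label = np.divide(inter, union, out=np.zeros(len(inter)), where=union > 0)
    return float(micro), float(per_label.mean())

# Toy example: 2 samples, 3 labels
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
micro, macro = jaccard_scores(y_true, y_pred)
print(f"micro={micro:.4f} macro={macro:.4f}")
```

Micro averaging weights frequent labels more heavily, while macro treats all 7 labels equally, which is why the two scores can differ noticeably on imbalanced label sets.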
 ## Optimal Thresholds
+Use these thresholds for best performance:
+- Animal Welfare: 0.481
+- Environmental Impact: 0.459
+- Health: 0.201
+- Lab Grown And Alt Proteins: 0.341
+- Psychology And Identity: 0.525
+- Systemic Vs Individual Action: 0.375
+- Taste And Convenience: 0.664
+Usage:
 ```python
 optimal_thresholds = {'Animal Welfare': 0.48107979620047003, 'Environmental Impact': 0.45919171852850427, 'Health': 0.20115313966833437, 'Lab Grown And Alt Proteins': 0.3414601502146817, 'Psychology And Identity': 0.5246278637433214, 'Systemic Vs Individual Action': 0.37517437676211585, 'Taste And Convenience': 0.6635140143644325}
+label_names = checkpoint['label_names']
+for label, score in zip(label_names, predictions[0]):
     threshold = optimal_thresholds.get(label, 0.5)
     if score > threshold:
+        print(f"{label}: {score:.3f}")
 ```
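If the published thresholds do not suit your data, they can be re-tuned on a labeled validation set. The sketch below is a simple per-label grid search that maximizes each label's Jaccard score. It is an illustration under stated assumptions, not the procedure used to produce the values above: `grid_search_thresholds` and its default grid are hypothetical, `scores` is assumed to be an `(n_samples, n_labels)` array of model outputs in [0, 1], and `y_true` the matching binary label matrix.

```python
import numpy as np

def grid_search_thresholds(scores, y_true, grid=None):
    """Pick, for each label, the grid threshold with the highest Jaccard score."""
    scores = np.asarray(scores, dtype=float)
    y_true = np.asarray(y_true, dtype=bool)
    if grid is None:
        # Hypothetical default grid: 0.10, 0.15, ..., 0.90
        grid = np.linspace(0.1, 0.9, 17)
    best = np.full(scores.shape[1], 0.5)  # fall back to 0.5 per label
    for j in range(scores.shape[1]):
        best_score = -1.0
        for t in grid:
            pred = scores[:, j] > t
            union = (pred | y_true[:, j]).sum()
            jac = (pred & y_true[:, j]).sum() / union if union else 0.0
            # Strict '>' keeps the lowest threshold among ties
            if jac > best_score:
                best_score, best[j] = jac, t
    return best

# Toy validation set: 4 samples, 2 labels
val_scores = [[0.92, 0.22], [0.81, 0.31], [0.33, 0.72], [0.12, 0.63]]
val_labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
thresholds = grid_search_thresholds(val_scores, val_labels)
binary = np.asarray(val_scores) > thresholds  # broadcast per-label thresholds
print(thresholds, binary.astype(int).tolist())
```

Tuning each label independently is the simplest design. Jointly optimizing all thresholds against a single global metric is also possible but requires a more expensive search.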
 ## Citation
 If you use this model in your research, please cite:
 }
 ```
+## Limitations
 - Trained on Reddit data from specific subreddits
+- May not generalize to other platforms
+- Performance depends on GPT-generated labels
+- Limited to English content