echoboi
/

veganism_and_vegetarianism-distilbert-classifier

@@ -6,24 +6,59 @@ tags:
 - food
 - climate-change
 - sustainability
 license: mit
 ---
-# Veganism & Vegetarianism Classifier (DistilBERT)
-This model classifies content related to plant-based diets and sustainable food systems.
 ## Model Details
-- **Model Type**: DistilBERT
 - **Task**: Multilabel text classification
 - **Sector**: Veganism & Vegetarianism
 - **Number of Labels**: 7
 - **Training Data**: Reddit posts and comments from r/climate, r/climateactionplan, r/climatechange, r/climatechaos, r/climateco, r/climatecrisis, r/climatecrisiscanada, r/climatedisalarm, r/climatejobslist, r/climatejustice, r/climatememes, r/climatenews, r/climateoffensive, r/climatepolicy, r/climate_science
 ## Labels
-The model predicts **7 labels** simultaneously:
 1. **Animal Welfare**
 2. **Environmental Impact**
@@ -34,24 +69,38 @@ The model predicts **7 labels** simultaneously:
 7. **Taste And Convenience**
 ## Usage
 ```python
 import torch
 from transformers import DistilBertTokenizer
 import sys
 import os
-# Load the custom MultilabelClassifier model
-model_name = "sanchow/veganism_and_vegetarianism-distilbert-classifier"
 sys.path.append(model_name)
 from model_class import MultilabelClassifier
 # Load tokenizer
 tokenizer = DistilBertTokenizer.from_pretrained(model_name)
-# Load the model weights
-checkpoint = torch.load(os.path.join(model_name, 'model.pt'), map_location='cpu')
 model = MultilabelClassifier(checkpoint['model_name'], len(checkpoint['label_names']))
 model.load_state_dict(checkpoint['model_state_dict'])
 model.eval()
@@ -60,18 +109,59 @@ model.eval()
 text = "Your text here"
 inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
 outputs = model(**inputs)
-predictions = outputs[0]
 # Get label predictions
 label_names = ['Animal Welfare', 'Environmental Impact', 'Health', 'Lab Grown And Alt Proteins', 'Psychology And Identity', 'Systemic Vs Individual Action', 'Taste And Convenience']
 for i, (label, score) in enumerate(zip(label_names, predictions[0])):
     if score > 0.5:  # Threshold for positive classification
         print(f"{label}: {score:.3f}")
 ```
 ## Optimal Thresholds
 - **Animal Welfare**: 0.481
 - **Environmental Impact**: 0.459
 - **Health**: 0.201
@@ -80,22 +170,91 @@ for i, (label, score) in enumerate(zip(label_names, predictions[0])):
 - **Systemic Vs Individual Action**: 0.375
 - **Taste And Convenience**: 0.664
-## Training Methodology
-This model was trained using GPT-labeled data from Reddit discussions (2010-2023) with:
-- Regex filtering for sector-specific content
-- GPT-assisted multilabel classification
-- Optimal threshold tuning using Jaccard similarity optimization
-## Applications
-- Content analysis of social media discussions
-- Research on public sentiment and discourse
-- Policy analysis and market research
 ## Model Limitations
 - Trained on Reddit data from specific subreddits
-- May not generalize to other platforms
 - Limited to English language content

 - food
 - climate-change
 - sustainability
+- veganism-&-vegetarianism
 license: mit
 ---
+# Veganism & Vegetarianism Classifier (DistilBERT)
+This model classifies content related to plant-based diets, sustainable food systems, and ethical eating. It analyzes discussions about veganism, vegetarianism, and sustainable food choices.
 ## Model Details
+- **Model Type**: DistilBERT
 - **Task**: Multilabel text classification
 - **Sector**: Veganism & Vegetarianism
+- **Base Model**: DistilBERT base uncased
 - **Number of Labels**: 7
 - **Training Data**: Reddit posts and comments from r/climate, r/climateactionplan, r/climatechange, r/climatechaos, r/climateco, r/climatecrisis, r/climatecrisiscanada, r/climatedisalarm, r/climatejobslist, r/climatejustice, r/climatememes, r/climatenews, r/climateoffensive, r/climatepolicy, r/climate_science
+## Training Methodology
+This model was trained using **GPT-labeled data** from Reddit discussions. The training process involved:
+1. **Data Collection**: Reddit posts and comments from climate-related subreddits (2010-2023)
+2. **Regex Filtering**: Content was filtered using sector-specific regex patterns to identify relevant discussions
+3. **GPT Labeling**: Using GPT models to generate initial labels for training data
+4. **Data Splitting**: 80% training, 10% validation, 10% test split
+5. **Model Training**: Fine-tuning DistilBERT on the labeled dataset with optimal threshold tuning
+6. **Validation**: Performance evaluation on held-out test sets using Jaccard similarity metrics
+### Training Data Sources
+- **Source Subreddits**: r/climate, r/climateactionplan, r/climatechange, r/climatechaos, r/climateco, r/climatecrisis, r/climatecrisiscanada, r/climatedisalarm, r/climatejobslist, r/climatejustice, r/climatememes, r/climatenews, r/climateoffensive, r/climatepolicy, r/climate_science
+- **Data Period**: 2010-2023
+- **Focus Areas**: Plant-based nutrition, sustainable agriculture, ethical food choices, alternative proteins, food policy
+- **Data Type**: Posts and comments filtered by regex patterns
+- **Labeling Method**: GPT-assisted multilabel classification
+### Regex Patterns Used
+The model was trained on content filtered using sector-specific regex patterns:
+**Transport Sector (EV-related terms):**
+- Strong patterns: ev, electric vehicle, evs, bev, tesla model, supercharger, gigafactory
+- Weak patterns: electric car, charging station, tax credit, e-bike, tesla
+**Housing Sector (Solar energy terms):**
+- Strong patterns: rooftop solar, solar pv, pv panel, photovoltaics, solar array
+- Weak patterns: solar panel, solar power, battery storage, powerwall, solar tax credit
+**Food Sector (Plant-based diet terms):**
+- Strong patterns: vegan, plant-based diet, veganism, vegetarian, beyond meat
+- Weak patterns: red meat, dairy free, plant protein, almond milk, flexitarian
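The strong/weak split in the Food sector list above can be sketched as a two-tier filter. This is a minimal illustration, not the filter used to build the training set: the rule for combining the tiers (one strong hit, or at least two distinct weak hits) and the names `compile_terms`, `is_food_sector`, and `min_weak_hits` are assumptions introduced here.

```python
import re

# Terms copied from the Food sector lists above.
STRONG_TERMS = ["vegan", "plant-based diet", "veganism", "vegetarian", "beyond meat"]
WEAK_TERMS = ["red meat", "dairy free", "plant protein", "almond milk", "flexitarian"]


def compile_terms(terms):
    """Build one case-insensitive regex that matches any term as a whole phrase."""
    # Longest terms first so "veganism" is tried before "vegan".
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    # Word boundaries mean plural forms such as "vegans" do not match.
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


STRONG_RE = compile_terms(STRONG_TERMS)
WEAK_RE = compile_terms(WEAK_TERMS)


def is_food_sector(text, min_weak_hits=2):
    """Assumed rule: keep text with one strong hit or several distinct weak hits."""
    if STRONG_RE.search(text):
        return True
    weak_hits = {match.group(0).lower() for match in WEAK_RE.finditer(text)}
    return len(weak_hits) >= min_weak_hits
```

Under this assumed rule, a single weak term such as "red meat" is not enough to keep a post, which limits false positives from general food discussion.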
 ## Labels
+The model predicts **7 labels** simultaneously (multilabel classification). Each text can be classified into multiple categories:
 1. **Animal Welfare**
 2. **Environmental Impact**
 7. **Taste And Convenience**
+**⚠️ Important**: The order of labels in the output predictions corresponds exactly to the order listed above. When using the model, ensure your label list matches this order.
+**💡 Tip**: For best performance, use the optimal thresholds provided in the "Optimal Thresholds" section below instead of the default 0.5 threshold.
+**🔧 Threshold Optimization**: The optimal thresholds were computed using Jaccard similarity optimization on the validation set. For your own dataset, consider re-optimizing thresholds using the same method from `paper4_BERT_finetuning.py`.
 ## Usage
 ```python
+# Model repository ID. The sys.path and torch.load calls below also use it as a
+# filesystem path, so they assume a local copy of the repository at this relative path.
+model_name = "sanchow/veganism_and_vegetarianism-distilbert-classifier"
+# Load the custom MultilabelClassifier model
 import torch
 from transformers import DistilBertTokenizer
 import sys
 import os
+# Add model directory to path and import the custom model class
 sys.path.append(model_name)
 from model_class import MultilabelClassifier
 # Load tokenizer
 tokenizer = DistilBertTokenizer.from_pretrained(model_name)
+# Load the model weights with weights_only=False for compatibility
+# Note: weights_only=False is required for PyTorch 2.6+ compatibility with numpy arrays in checkpoints
+checkpoint = torch.load(os.path.join(model_name, 'model.pt'), map_location='cpu', weights_only=False)
 model = MultilabelClassifier(checkpoint['model_name'], len(checkpoint['label_names']))
 model.load_state_dict(checkpoint['model_state_dict'])
 model.eval()
 text = "Your text here"
 inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
 outputs = model(**inputs)
+predictions = torch.sigmoid(outputs.logits)
 # Get label predictions
+# IMPORTANT: The order of labels must match the training order exactly
 label_names = ['Animal Welfare', 'Environmental Impact', 'Health', 'Lab Grown And Alt Proteins', 'Psychology And Identity', 'Systemic Vs Individual Action', 'Taste And Convenience']
 for i, (label, score) in enumerate(zip(label_names, predictions[0])):
     if score > 0.5:  # Threshold for positive classification
         print(f"{label}: {score:.3f}")
+# For multilabel classification with optimal thresholds
+# Note: Use optimal thresholds from model performance for better results
+# See the "Optimal Thresholds" section in the model card for specific values
+optimal_thresholds = {'Animal Welfare': 0.48107979620047003, 'Environmental Impact': 0.45919171852850427, 'Health': 0.20115313966833437, 'Lab Grown And Alt Proteins': 0.3414601502146817, 'Psychology And Identity': 0.5246278637433214, 'Systemic Vs Individual Action': 0.37517437676211585, 'Taste And Convenience': 0.6635140143644325}
+for i, (label, score) in enumerate(zip(label_names, predictions[0])):
+    threshold = optimal_thresholds.get(label, 0.5)  # Use optimal threshold or default to 0.5
+    if score > threshold:
+        print(f"{label}: {score:.3f} (threshold: {threshold:.3f})")
+# For production use, consider re-optimizing thresholds on your validation data
+# See the "Threshold Optimization" section for details
 ```
+## Applications
+This model is designed for:
+- **Content Analysis**: Analyzing social media discussions about veganism & vegetarianism
+- **Research**: Understanding public sentiment and discourse around plant-based nutrition, sustainable agriculture, ethical food choices, alternative proteins, and food policy
+- **Policy Analysis**: Identifying key topics and concerns in food discussions
+- **Market Research**: Tracking trends and interests in veganism & vegetarianism
+## Performance Metrics
+**Best Model Performance (Epoch 5):**
+- Micro Jaccard Score: 0.5584
+- Macro Jaccard Score: 0.6710
+- F1 Score: 0.8906
+- Accuracy: 0.8906
+- Precision: 0.8906
+- Recall: 0.8906
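The micro and macro Jaccard scores above can differ because they average at different levels. The sketch below shows one common definition, assuming binary indicator matrices of shape (samples, labels) and a macro score averaged over labels; the averaging used for the reported numbers is not stated in this card, and `jaccard_scores` is a hypothetical helper.

```python
import numpy as np


def jaccard_scores(y_true, y_pred):
    """Return (micro, macro) Jaccard for binary multilabel indicator matrices."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    # Per-label intersection and union counts, summed over samples.
    inter = np.logical_and(y_true, y_pred).sum(axis=0)
    union = np.logical_or(y_true, y_pred).sum(axis=0)
    # Micro: pool the counts over all labels, then divide once.
    micro = inter.sum() / max(union.sum(), 1)
    # Macro: score each label separately, then take the unweighted mean.
    # A label with an empty union is scored 1.0 (nothing to predict, nothing predicted).
    per_label = np.where(union > 0, inter / np.maximum(union, 1), 1.0)
    macro = per_label.mean()
    return float(micro), float(macro)


# Toy data: label 0 is always predicted correctly, label 1 is missed once.
y_true = [[1, 0], [1, 0], [1, 1]]
y_pred = [[1, 0], [1, 0], [1, 0]]
micro, macro = jaccard_scores(y_true, y_pred)
print(f"micro={micro:.2f} macro={macro:.2f}")
```

In the toy data the frequent label dominates the micro score, while the macro score weights both labels equally.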
+**Dataset Sizes:**
+- Training Samples: ~600 per sector
+- Validation Samples: ~150 per sector
+- Test Samples: ~150 per sector
+- Total GPT-labeled samples: ~900 per sector
 ## Optimal Thresholds
+For best performance, use these thresholds to convert continuous scores to binary predictions:
 - **Animal Welfare**: 0.481
 - **Environmental Impact**: 0.459
 - **Health**: 0.201
 - **Systemic Vs Individual Action**: 0.375
 - **Taste And Convenience**: 0.664
+**Usage with optimal thresholds:**
+```python
+# Define optimal thresholds for this model
+optimal_thresholds = {'Animal Welfare': 0.48107979620047003, 'Environmental Impact': 0.45919171852850427, 'Health': 0.20115313966833437, 'Lab Grown And Alt Proteins': 0.3414601502146817, 'Psychology And Identity': 0.5246278637433214, 'Systemic Vs Individual Action': 0.37517437676211585, 'Taste And Convenience': 0.6635140143644325}
+# Apply thresholds to get binary predictions
+for i, (label, score) in enumerate(zip(label_names, predictions[0])):
+    threshold = optimal_thresholds.get(label, 0.5)
+    if score > threshold:
+        print(f"{label}: {score:.3f} (threshold: {threshold:.3f})")
+```
+## Threshold Optimization
+The optimal thresholds provided above were computed using **Jaccard similarity optimization** on the validation dataset. This method finds the best threshold for each label that maximizes the Jaccard similarity between predicted and true labels.
+### Optimization Method Used
+The thresholds were optimized using the `find_optimal_thresholds_jaccard_global` function from `paper4_multilabel_threshold_optimizer.py`, which:
+1. **Grid Search**: Tests threshold values from 0.1 to 0.9 in 0.05 increments
+2. **Jaccard Optimization**: Maximizes micro-averaged Jaccard similarity
+3. **Per-Label Optimization**: Finds optimal threshold for each label independently
+4. **Global Optimization**: Considers the overall multilabel performance
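The steps above can be sketched as a coordinate-wise grid search. This is not the code of `find_optimal_thresholds_jaccard_global`, whose implementation is not shown in this card. It is a minimal illustration that assumes sigmoid scores and true labels are available as arrays of shape (samples, labels); `micro_jaccard` and `grid_search_thresholds` are hypothetical names.

```python
import numpy as np


def micro_jaccard(y_true, y_pred):
    """Micro-averaged Jaccard: pooled intersection over pooled union."""
    intersection = np.logical_and(y_true, y_pred).sum()
    union = np.logical_or(y_true, y_pred).sum()
    return intersection / union if union else 1.0


def grid_search_thresholds(y_true, scores, grid=None):
    """Tune one threshold per label while scoring the full multilabel prediction."""
    if grid is None:
        grid = np.linspace(0.1, 0.9, 17)  # 0.10, 0.15, ..., 0.90
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    thresholds = np.full(scores.shape[1], 0.5)  # start from the default threshold
    for j in range(scores.shape[1]):
        best_t, best_value = thresholds[j], -1.0
        for t in grid:
            trial = thresholds.copy()
            trial[j] = t
            # Other labels keep their current thresholds, so each label is tuned
            # against the overall (global) micro Jaccard.
            value = micro_jaccard(y_true, scores > trial)
            if value > best_value:
                best_t, best_value = t, value
        thresholds[j] = best_t
    return thresholds
```

A single pass over the labels is used here for brevity. Repeating the pass until the thresholds stop changing is a natural extension.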
+### Re-optimizing for Your Dataset
+For best results on your specific dataset, consider re-optimizing thresholds:
+```python
+from paper4_multilabel_threshold_optimizer import find_optimal_thresholds_jaccard_global
+# Load your validation data
+validation_data = {
+    'texts': ['your text 1', 'your text 2', ...],
+    'true_labels': [['label1', 'label2'], ['label3'], ...]
+}
+# Create sector models dict (as expected by the optimizer)
+sector_models = {
+    'your_sector': {
+        'model': model,
+        'tokenizer': tokenizer,
+        'label_names': label_names
+    }
+}
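+# Select the device (torch is imported in the Usage snippet above)
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+# Find optimal thresholds for your data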
+optimal_thresholds = find_optimal_thresholds_jaccard_global(
+    sector_models,
+    validation_data,
+    device=device
+)
+# Use the optimized thresholds
+thresholds = optimal_thresholds['your_sector']
+```
+### Alternative Optimization Methods
+You can also implement other threshold optimization strategies:
+- **F1-score optimization**: Maximize F1-score instead of Jaccard
+- **Precision/Recall trade-off**: Optimize for specific precision/recall requirements
+- **Cost-sensitive optimization**: Weight different types of errors differently
+## Citation
+If you use this model in your research, please cite:
+```bibtex
+@misc{veganism_and_vegetarianism_distilbert_classifier,
+  title={Veganism \& Vegetarianism Classifier for Climate Change Analysis},
+  author={Sandeep Chowdhary},
+  year={2024},
+  publisher={Hugging Face},
+  journal={Hugging Face Hub},
+  howpublished={\url{https://huggingface.co/sanchow/veganism_and_vegetarianism-distilbert-classifier}},
+}
+```
 ## Model Limitations
 - Trained on Reddit data from specific subreddits
+- May not generalize to other platforms or contexts
+- Performance depends on the quality of GPT-generated labels
 - Limited to English language content