Koushim/bert-multilabel-jigsaw-toxic-classifier

@@ -1,65 +1,116 @@
 ---
-library_name: transformers
 tags:
-- generated_from_trainer
-metrics:
-- accuracy
-- f1
-- precision
-- recall
 model-index:
-- name: bert-multilabel-jigsaw-toxic-classifier
-  results: []
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-# bert-multilabel-jigsaw-toxic-classifier
-This model is a fine-tuned version of [](https://huggingface.co/) on an unknown dataset.
-It achieves the following results on the evaluation set:
-- Loss: 1.6768
-- Accuracy: 0.9187
-- F1: 0.0
-- Precision: 0.0
-- Recall: 0.0
-## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
-## Training procedure
-### Training hyperparameters
-The following hyperparameters were used during training:
-- learning_rate: 5e-05
-- train_batch_size: 16
-- eval_batch_size: 64
-- seed: 42
-- optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
-- lr_scheduler_type: linear
-- num_epochs: 1
-### Training results
-| Training Loss | Epoch | Step   | Validation Loss | Accuracy | F1  | Precision | Recall |
-|:-------------:|:-----:|:------:|:---------------:|:--------:|:---:|:---------:|:------:|
-| 1.3585        | 1.0   | 112805 | 1.6768          | 0.9187   | 0.0 | 0.0       | 0.0    |
-### Framework versions
-- Transformers 4.51.3
-- Pytorch 2.6.0+cu124
-- Datasets 3.6.0
-- Tokenizers 0.21.1

 ---
+language: en
+datasets:
+  - jigsaw-toxic-comment-classification-challenge
 tags:
+  - text-classification
+  - multi-label-classification
+  - toxicity-detection
+  - bert
+  - transformers
+  - pytorch
+license: apache-2.0
 model-index:
+  - name: BERT Multi-label Toxic Comment Classifier
+    results:
+      - task:
+          name: Multi-label Text Classification
+          type: multi-label-classification
+        dataset:
+          name: Jigsaw Toxic Comment Classification Challenge
+          type: jigsaw-toxic-comment-classification-challenge
+        metrics:
+          - name: Accuracy
+            type: accuracy
+            value: 0.9187
 ---
+# BERT Multi-label Toxic Comment Classifier
+This model is a fine-tuned [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) transformer for **multi-label classification** on the [Jigsaw Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) dataset.
+It predicts seven toxicity-related labels for each comment:
+- toxicity
+- severe toxicity
+- obscene
+- threat
+- insult
+- identity attack
+- sexually explicit
+## Model Details
+- **Base Model**: `bert-base-uncased`
+- **Task**: Multi-label text classification
+- **Dataset**: Jigsaw Toxic Comment Classification Challenge (processed version)
+- **Labels**: 7 toxicity-related categories
+- **Training Epochs**: 2
+- **Batch Size**: 16 (train), 64 (eval)
+- **Metrics**: Accuracy, Macro F1, Precision, Recall
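The macro-averaged metrics listed above are computed per label and then averaged across labels. The following is a minimal pure-Python sketch of macro F1, not the evaluation code used for this model; the function name `macro_f1`, the 0.5 decision threshold, and the toy data are assumptions made for illustration.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[Sequence[int]],
             y_prob: Sequence[Sequence[float]],
             threshold: float = 0.5) -> float:
    """Average the per-label F1 scores over all labels (macro averaging)."""
    num_labels = len(y_true[0])
    f1_scores: List[float] = []
    for j in range(num_labels):
        tp = fp = fn = 0
        for truth, probs in zip(y_true, y_prob):
            pred = 1 if probs[j] >= threshold else 0
            if pred == 1 and truth[j] == 1:
                tp += 1
            elif pred == 1 and truth[j] == 0:
                fp += 1
            elif pred == 0 and truth[j] == 1:
                fn += 1
        denom = 2 * tp + fp + fn
        # A label with no positives and no positive predictions scores 0.0 here
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_labels


# Toy data: two labels, three comments (hypothetical values)
y_true = [[1, 0], [0, 0], [1, 1]]
y_prob = [[0.9, 0.2], [0.1, 0.3], [0.4, 0.8]]
print(macro_f1(y_true, y_prob))
```

Because toxic comments are rare, a model that never predicts any label can still reach high element-wise accuracy while its macro F1 is zero, so it is worth reading the two numbers together.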
+## Usage
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+model_id = "Koushim/bert-multilabel-jigsaw-toxic-classifier"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForSequenceClassification.from_pretrained(model_id)
+
+text = "You are a wonderful person!"
+inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
+
+# Run a forward pass without tracking gradients
+with torch.no_grad():
+    outputs = model(**inputs)
+
+# Apply a sigmoid to get an independent probability for each label
+probs = torch.sigmoid(outputs.logits)
+print(probs)
+```
+## Labels
+| Index | Label           |
+| ----- | --------------- |
+| 0     | toxicity        |
+| 1     | severe_toxicity |
+| 2     | obscene         |
+| 3     | threat          |
+| 4     | insult          |
+| 5     | identity_attack |
+| 6     | sexual_explicit |
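The index-to-label mapping in this table can be used to turn the per-label probabilities into named predictions. Below is a small sketch under stated assumptions: the helper `decode_labels` and the 0.5 threshold are illustrative choices, not part of the model repository, and the example probabilities are made up.

```python
from typing import Dict, List

# Index-to-label mapping copied from the Labels table
LABELS: List[str] = [
    "toxicity", "severe_toxicity", "obscene", "threat",
    "insult", "identity_attack", "sexual_explicit",
]


def decode_labels(probs: List[float], threshold: float = 0.5) -> Dict[str, float]:
    """Return the labels whose probability meets the threshold."""
    if len(probs) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} probabilities, got {len(probs)}")
    return {label: p for label, p in zip(LABELS, probs) if p >= threshold}


# Hypothetical probabilities for one comment, e.g. from probs[0].tolist()
example = [0.91, 0.05, 0.72, 0.01, 0.64, 0.02, 0.03]
print(decode_labels(example))
```

A single global threshold is the simplest option; per-label thresholds tuned on a validation set may work better for rare labels such as `threat`.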
+## Training Details
+* Training Set: Full dataset (160k+ samples)
+* Loss Function: Binary Cross Entropy (via `BertForSequenceClassification` with `problem_type="multi_label_classification"`)
+* Optimizer: AdamW
+* Learning Rate: 2e-5
+* Evaluation Strategy: Epoch-based evaluation with early stopping on F1 score
+* Model Framework: PyTorch with Hugging Face Transformers
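With `problem_type="multi_label_classification"`, Transformers applies `torch.nn.BCEWithLogitsLoss` to the raw logits and float-valued targets. The sketch below reproduces that formula for a single example in plain Python so the arithmetic is visible; the function name `bce_with_logits` and the sample values are illustrative assumptions, and the library version additionally averages over the batch.

```python
import math
from typing import Sequence


def bce_with_logits(logits: Sequence[float], targets: Sequence[float]) -> float:
    """Mean binary cross-entropy over labels, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of
        # -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Hypothetical logits and multi-hot targets for three labels
logits = [2.0, -1.0, 0.0]
targets = [1.0, 0.0, 0.0]
print(bce_with_logits(logits, targets))
```

Each label is scored independently, which is why a sigmoid rather than a softmax is applied at inference time.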
+## Repository Contents
+* `pytorch_model.bin` - trained model weights
+* `config.json` - model configuration
+* `tokenizer.json`, `vocab.txt` - tokenizer files
+* `README.md` - this file
+## How to Fine-tune or Train
+You can fine-tune this model using the Hugging Face `Trainer` API with your own dataset or the original Jigsaw dataset.
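When preparing your own data for fine-tuning, the targets for a multi-label head are usually encoded as multi-hot float vectors, since `BCEWithLogitsLoss` expects floating-point targets. The following is a minimal sketch of that encoding step only; the helper `to_multi_hot` is a hypothetical name, and the label order is taken from the Labels section of this card.

```python
from typing import Iterable, List

# Label order as listed in the Labels section
LABELS: List[str] = [
    "toxicity", "severe_toxicity", "obscene", "threat",
    "insult", "identity_attack", "sexual_explicit",
]


def to_multi_hot(active: Iterable[str]) -> List[float]:
    """Encode a collection of label names as a multi-hot float vector."""
    active_set = set(active)
    unknown = active_set - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1.0 if label in active_set else 0.0 for label in LABELS]


# A comment annotated as both obscene and insulting
print(to_multi_hot(["obscene", "insult"]))
```

The resulting vectors would typically be stored in a `labels` column of the dataset passed to `Trainer`.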
+## Citation
+If you use this model in your research or project, please cite:
+```bibtex
+@article{devlin2019bert,
+  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
+  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
+  journal={arXiv preprint arXiv:1810.04805},
+  year={2019}
+}
+```
+## License
+Apache 2.0 License