Koushim committed
Commit f388f7a · verified · 1 Parent(s): 1a067db

End of training

Files changed (4):
1. README.md +44 -98
2. config.json +44 -0
3. model.safetensors +3 -0
4. training_args.bin +3 -0
README.md CHANGED
@@ -1,119 +1,65 @@
  ---
- language: en
- datasets:
- - jigsaw-toxic-comment-classification-challenge
  tags:
- - text-classification
- - multi-label-classification
- - toxicity-detection
- - bert
- - transformers
- - pytorch
- license: apache-2.0
  model-index:
- - name: BERT Multi-label Toxic Comment Classifier
-   results:
-   - task:
-       name: Multi-label Text Classification
-       type: multi-label-classification
-     dataset:
-       name: Jigsaw Toxic Comment Classification Challenge
-       type: jigsaw-toxic-comment-classification-challenge
-     metrics:
-     - name: F1 Score (Macro)
-       type: f1
-       value: 0.XX # Replace with your actual score
-     - name: Accuracy
-       type: accuracy
-       value: 0.XX # Replace with your actual score
  ---

- # BERT Multi-label Toxic Comment Classifier

- This model is a fine-tuned [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) transformer for **multi-label classification** on the [Jigsaw Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) dataset.

- It predicts multiple toxicity-related labels per comment, including:
- - toxicity
- - severe toxicity
- - obscene
- - threat
- - insult
- - identity attack
- - sexual explicit

- ## Model Details

- - **Base Model**: `bert-base-uncased`
- - **Task**: Multi-label text classification
- - **Dataset**: Jigsaw Toxic Comment Classification Challenge (processed version)
- - **Labels**: 7 toxicity-related categories
- - **Training Epochs**: 2
- - **Batch Size**: 16 (train), 64 (eval)
- - **Metrics**: Accuracy, Macro F1, Precision, Recall

- ## Usage

- ```python
- from transformers import AutoTokenizer, AutoModelForSequenceClassification
-
- tokenizer = AutoTokenizer.from_pretrained("Koushim/bert-multilabel-jigsaw-toxic-classifier")
- model = AutoModelForSequenceClassification.from_pretrained("Koushim/bert-multilabel-jigsaw-toxic-classifier")
-
- text = "You are a wonderful person!"
- inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
- outputs = model(**inputs)
-
- # Sigmoid to get probabilities for each label
- import torch
- probs = torch.sigmoid(outputs.logits)
- print(probs)
- ```

- ## Labels

- | Index | Label           |
- | ----- | --------------- |
- | 0     | toxicity        |
- | 1     | severe_toxicity |
- | 2     | obscene         |
- | 3     | threat          |
- | 4     | insult          |
- | 5     | identity_attack |
- | 6     | sexual_explicit |

- ## Training Details

- * Training Set: Full dataset (160k+ samples)
- * Loss Function: Binary Cross Entropy (via `BertForSequenceClassification` with `problem_type="multi_label_classification"`)
- * Optimizer: AdamW
- * Learning Rate: 2e-5
- * Evaluation Strategy: Epoch-based evaluation with early stopping on F1 score
- * Model Framework: PyTorch with Hugging Face Transformers

- ## Repository Contents

- * `pytorch_model.bin` - trained model weights
- * `config.json` - model configuration
- * `tokenizer.json`, `vocab.txt` - tokenizer files
- * `README.md` - this file

- ## How to Fine-tune or Train

- You can fine-tune this model using the Hugging Face `Trainer` API with your own dataset or the original Jigsaw dataset.

- ## Citation

- If you use this model in your research or project, please cite:

- ```
- @article{devlin2019bert,
-   title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
-   author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
-   journal={arXiv preprint arXiv:1810.04805},
-   year={2019}
- }
- ```

- ## License

- Apache 2.0 License

  ---
+ library_name: transformers
  tags:
+ - generated_from_trainer
+ metrics:
+ - accuracy
+ - f1
+ - precision
+ - recall
  model-index:
+ - name: bert-multilabel-jigsaw-toxic-classifier
+   results: []
  ---

+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->

+ # bert-multilabel-jigsaw-toxic-classifier

+ This model is a fine-tuned version of [](https://huggingface.co/) on an unknown dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 1.6768
+ - Accuracy: 0.9187
+ - F1: 0.0
+ - Precision: 0.0
+ - Recall: 0.0

+ ## Model description

+ More information needed

+ ## Intended uses & limitations

+ More information needed

+ ## Training and evaluation data

+ More information needed

+ ## Training procedure

+ ### Training hyperparameters

+ The following hyperparameters were used during training:
+ - learning_rate: 5e-05
+ - train_batch_size: 16
+ - eval_batch_size: 64
+ - seed: 42
+ - optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
+ - lr_scheduler_type: linear
+ - num_epochs: 1

+ ### Training results

+ | Training Loss | Epoch | Step   | Validation Loss | Accuracy | F1  | Precision | Recall |
+ |:-------------:|:-----:|:------:|:---------------:|:--------:|:---:|:---------:|:------:|
+ | 1.3585        | 1.0   | 112805 | 1.6768          | 0.9187   | 0.0 | 0.0       | 0.0    |

+ ### Framework versions

+ - Transformers 4.51.3
+ - Pytorch 2.6.0+cu124
+ - Datasets 3.6.0
+ - Tokenizers 0.21.1
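The regenerated card above lists the run's hyperparameters but no code. A minimal sketch of how that configuration maps onto the `Trainer` API, assuming the `bert-base-uncased` base and 7 labels from the previous card; `train_ds` and `eval_ds` are hypothetical placeholders, not objects from this repository:

```python
# Sketch only: reconstructs the card's hyperparameters; the base checkpoint
# (bert-base-uncased) is assumed from the previous card, and the dataset
# objects below are hypothetical placeholders.
from transformers import (
    AutoTokenizer,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=7,
    problem_type="multi_label_classification",  # BCE-with-logits loss per label
)

args = TrainingArguments(
    output_dir="bert-multilabel-jigsaw-toxic-classifier",
    learning_rate=5e-5,              # card: learning_rate 5e-05
    per_device_train_batch_size=16,  # card: train_batch_size 16
    per_device_eval_batch_size=64,   # card: eval_batch_size 64
    num_train_epochs=1,              # card: num_epochs 1
    seed=42,
    lr_scheduler_type="linear",
    optim="adamw_torch",             # card: OptimizerNames.ADAMW_TORCH
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)  # hypothetical
# trainer.train()
```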
config.json ADDED
@@ -0,0 +1,44 @@
+ {
+   "architectures": [
+     "CustomBertForMultiLabel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "LABEL_0",
+     "1": "LABEL_1",
+     "2": "LABEL_2",
+     "3": "LABEL_3",
+     "4": "LABEL_4",
+     "5": "LABEL_5",
+     "6": "LABEL_6"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "LABEL_0": 0,
+     "LABEL_1": 1,
+     "LABEL_2": 2,
+     "LABEL_3": 3,
+     "LABEL_4": 4,
+     "LABEL_5": 5,
+     "LABEL_6": 6
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "problem_type": "multi_label_classification",
+   "torch_dtype": "float32",
+   "transformers_version": "4.51.3",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
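The config sets `problem_type: "multi_label_classification"` with seven generic `LABEL_n` entries, so each logit is scored independently through a sigmoid. A minimal inference sketch under the assumption that the checkpoint loads as a stock `BertForSequenceClassification` (the listed `CustomBertForMultiLabel` architecture may require its original class definition); the label names come from the previous card's table and the 0.5 cut-off is an assumed threshold:

```python
# Sketch only: assumes the weights load into a standard
# BertForSequenceClassification despite the custom architecture name.
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

repo = "Koushim/bert-multilabel-jigsaw-toxic-classifier"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = BertForSequenceClassification.from_pretrained(repo)

inputs = tokenizer("You are a wonderful person!", return_tensors="pt",
                   truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits

# Multi-label head: one independent sigmoid per label, not a softmax.
probs = torch.sigmoid(logits)[0]
labels = ["toxicity", "severe_toxicity", "obscene", "threat",
          "insult", "identity_attack", "sexual_explicit"]  # from the old card
for name, p in zip(labels, probs):
    flag = "(flagged)" if p.item() > 0.5 else ""  # 0.5 is an assumed threshold
    print(f"{name}: {p.item():.3f} {flag}")
```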
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:83a8952e52eb1695db7036cf5ae93257e9bb71fb7b808db91e5ebaaed5f0ca9b
+ size 437974136
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ffa1279adafa9ac7a05f36daf449c5fd9e997714dccc7e38ce0a57a1045450bc
+ size 5304
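Both binaries are stored as Git LFS pointers: `oid sha256:` is the SHA-256 of the actual blob and `size` its byte length. A small self-contained sketch (not part of this repo) showing how a downloaded artifact can be checked against such a pointer file:

```python
# Sketch only: verifies a Git LFS pointer (version/oid/size lines, as above)
# against the downloaded blob it refers to.
import hashlib
import os

def verify_lfs_pointer(pointer_path: str, blob_path: str) -> bool:
    # Pointer lines are "key value" pairs, e.g. "oid sha256:<hex>".
    with open(pointer_path) as f:
        fields = dict(line.split(" ", 1) for line in f)
    expected_oid = fields["oid"].strip().removeprefix("sha256:")
    expected_size = int(fields["size"])

    h = hashlib.sha256()
    with open(blob_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return (h.hexdigest() == expected_oid
            and os.path.getsize(blob_path) == expected_size)

# Hypothetical usage, with the pointer saved alongside the real file:
# verify_lfs_pointer("model.safetensors.pointer", "model.safetensors")
```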