cservan committed on
Commit e6ac6cf · 1 Parent(s): 4a9e63a

initial commit

README.md CHANGED
@@ -1,3 +1,184 @@
- ---
- license: mit
- ---
+ ---
+ language:
+ - fr
+ - en
+ - de
+ - es
+ - ru
+ - it
+ - zh
+ - sv
+ - pt
+ - pl
+ - ar
+ - nl
+ - ca
+ - vi
+ - ja
+ - hu
+ - he
+ - id
+ - no
+ - fa
+ - ko
+ - tr
+ - fi
+ - ro
+ - el
+ - hy
+ - da
+ - eu
+ - ms
+ - sl
+ - az
+ - bn
+ - cy
+ - hi
+ - ta
+ - ur
+ - th
+ - ka
+ - te
+ - af
+ - sq
+ - lv
+ - ml
+ - kn
+ - tl
+ - is
+ - sw
+ - jv
+ - my
+ - mn
+ - km
+ - am
+
+ license: apache-2.0
+ datasets:
+ - wikipedia
+ ---
+
+ # Multilingual ModernBERT Base Cased 128k
+
+ Pretrained multilingual language model using a masked language modeling (MLM) objective.
+
+ ## Model description
+
+ mALBERT is a transformers model pretrained on 16 GB of multilingual Wikipedia in a self-supervised fashion. This means it
+ was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
+ publicly available data), with an automatic process generating inputs and labels from those texts.
+
+ This model has the following configuration (see the sketch after this list):
+
+ - 12 repeating layers
+ - 768 embedding dimension
+ - 768 hidden dimension
+ - 12 attention heads
+ - 11M parameters
+ - 128k vocabulary size
+
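These figures can be checked programmatically. Below is a minimal sketch, assuming the checkpoint id used in the usage examples further down and the generic `transformers` auto classes; it is an illustration only, not part of the original card:

```python
from transformers import AutoConfig, AutoModel

# Sketch only: the repo id is the one used in the "How to use" snippets below.
checkpoint = "cservan/multilingual-albert-base-cased-128k"

config = AutoConfig.from_pretrained(checkpoint)
print(config.hidden_size, config.num_attention_heads, config.vocab_size)

model = AutoModel.from_pretrained(checkpoint)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```
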
+ ## Intended uses & limitations
+
+ You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
+ be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=malbert-base-cased-128k) to look for
+ fine-tuned versions on a task that interests you.
+
+ Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
+ to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
+ generation, you should look at models like GPT-2.
+
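As an illustration of the fine-tuning use case described above, here is a minimal sequence-classification sketch. The two-label task, the toy batch and the `AlbertForSequenceClassification` head are assumptions made for the example only:

```python
import torch
from transformers import AlbertTokenizer, AlbertForSequenceClassification

checkpoint = "cservan/multilingual-albert-base-cased-128k"  # repo id from the snippets below
tokenizer = AlbertTokenizer.from_pretrained(checkpoint)
model = AlbertForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy two-example batch with hypothetical sentiment labels.
batch = tokenizer(
    ["Ce film est excellent.", "Ce film est décevant."],
    padding=True,
    return_tensors="pt",
)
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)
outputs.loss.backward()  # in a real setup, an optimizer step and a training loop would follow
print(float(outputs.loss))
```
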
+ ### How to use
+
+ Here is how to use this model to get the features of a given text in PyTorch:
+
+ ```python
+ from transformers import AlbertTokenizer, AlbertModel
+ tokenizer = AlbertTokenizer.from_pretrained('cservan/multilingual-albert-base-cased-128k')
+ model = AlbertModel.from_pretrained("cservan/multilingual-albert-base-cased-128k")
+ text = "Remplacez-moi par le texte en français que vous souhaitez."
+ encoded_input = tokenizer(text, return_tensors='pt')
+ output = model(**encoded_input)
+ ```
+
+ and in TensorFlow:
+
+ ```python
+ from transformers import AlbertTokenizer, TFAlbertModel
+ tokenizer = AlbertTokenizer.from_pretrained('cservan/multilingual-albert-base-cased-128k')
+ model = TFAlbertModel.from_pretrained("cservan/multilingual-albert-base-cased-128k")
+ text = "Remplacez-moi par le texte en français que vous souhaitez."
+ encoded_input = tokenizer(text, return_tensors='tf')
+ output = model(encoded_input)
+ ```
+
+
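Because the raw checkpoint is a masked language model, it can also be queried directly through the `fill-mask` pipeline. This is a sketch; the example sentence is illustrative and the checkpoint id is the one used above:

```python
from transformers import pipeline

# Sketch only: same checkpoint id as in the snippets above.
unmasker = pipeline("fill-mask", model="cservan/multilingual-albert-base-cased-128k")

# The tokenizer's mask token is [MASK]; predictions come back ranked by score.
for prediction in unmasker("Paris est la [MASK] de la France."):
    print(prediction["token_str"], round(prediction["score"], 3))
```
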
+ ## Training data
+
+ The mALBERT model was pretrained on 13 GB of [Multilingual Wikipedia](https://scouv.lisn.upsaclay.fr/#malbert) (excluding lists, tables and
+ headers).
+
+ ## Training procedure
+
+ ### Preprocessing
+
+ The texts are tokenized using SentencePiece with a vocabulary size of 128,000. The inputs of the model are
+ then of the form:
+
+ ```
+ [CLS] Sentence A [SEP] Sentence B [SEP]
+ ```
+
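A small sketch of how this sentence-pair layout can be inspected with the tokenizer (checkpoint id reused from the usage examples above; the two sentences are placeholders):

```python
from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("cservan/multilingual-albert-base-cased-128k")

# Encoding a sentence pair adds the special tokens in the layout shown above.
encoded = tokenizer("Sentence A", "Sentence B")
print(tokenizer.decode(encoded["input_ids"]))
# roughly: [CLS] Sentence A [SEP] Sentence B [SEP]
```
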
+ ### Training
+
+ The mALBERT pretraining procedure follows the BERT setup.
+
+ The details of the masking procedure for each sentence are the following (a minimal sketch follows this list):
+ - 15% of the tokens are masked.
+ - In 80% of the cases, the masked tokens are replaced by `[MASK]`.
+ - In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
+ - In the remaining 10% of cases, the masked tokens are left as is.
+
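Below is a minimal sketch of this 80/10/10 rule, in the spirit of the standard `transformers` MLM data collator. It is illustrative only: the actual pretraining code also avoids masking special tokens and padding.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_probability=0.15):
    """Apply the 15% / 80-10-10 masking rule to a batch of token ids (illustrative)."""
    labels = input_ids.clone()

    # Pick ~15% of the positions as prediction targets.
    masked = torch.bernoulli(torch.full(labels.shape, mlm_probability)).bool()
    labels[~masked] = -100  # loss is only computed on the masked positions

    # 80% of the targets: replace with [MASK].
    replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    input_ids[replaced] = mask_token_id

    # 10% of the targets: replace with a random token (half of the remaining 20%).
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~replaced
    input_ids[randomized] = torch.randint(vocab_size, labels.shape)[randomized]

    # The remaining 10% keep their original token.
    return input_ids, labels
```
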
+ ### Tools
+
+ The tools used to pre-train the model are available [here](https://gitlab.lisn.upsaclay.fr/nlp/deep-learning/UER-py).
+
+
+ ## Evaluation results
+
+ When fine-tuned on downstream tasks, the ALBERT models achieve the following results (values in parentheses are standard deviations):
+
+ Slot-filling:
+
+ | Models ⧹ Tasks | MMNLU        | MultiATIS++  | CoNLL2003    | MultiCoNER   | SNIPS        | MEDIA        |
+ |----------------|--------------|--------------|--------------|--------------|--------------|--------------|
+ | EnALBERT       | N/A          | N/A          | 89.67 (0.34) | 42.36 (0.22) | 95.95 (0.13) | N/A          |
+ | FrALBERT       | N/A          | N/A          | N/A          | N/A          | N/A          | 81.76 (0.59) |
+ | mALBERT-128k   | 65.81 (0.11) | 89.14 (0.15) | 88.27 (0.24) | 46.01 (0.18) | 91.60 (0.31) | 83.15 (0.38) |
+ | mALBERT-64k    | 65.29 (0.14) | 88.88 (0.14) | 86.44 (0.37) | 44.70 (0.27) | 90.84 (0.47) | 82.30 (0.19) |
+ | mALBERT-32k    | 64.83 (0.22) | 88.60 (0.27) | 84.96 (0.41) | 44.13 (0.39) | 89.89 (0.68) | 82.04 (0.28) |
+
+ Classification task:
+
+ | Models ⧹ Tasks | MMNLU        | MultiATIS++  | SNIPS        | SST2         |
+ |----------------|--------------|--------------|--------------|--------------|
+ | mALBERT-128k   | 72.35 (0.09) | 90.58 (0.98) | 96.84 (0.49) | 34.66 (1.46) |
+ | mALBERT-64k    | 71.26 (0.11) | 90.97 (0.70) | 96.53 (0.44) | 34.64 (1.02) |
+ | mALBERT-32k    | 70.76 (0.11) | 90.55 (0.98) | 96.49 (0.45) | 34.18 (1.64) |
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @inproceedings{servan2024mALBERT,
+   author    = {Christophe Servan and
+                Sahar Ghannay and
+                Sophie Rosset},
+   booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
+   title     = {{mALBERT: Is a Compact Multilingual BERT Model Still Worth It?}},
+   year      = {2024},
+   address   = {Torino, Italy},
+   month     = may,
+ }
+ ```
+
+ Link to the paper: [PDF](https://hal.science/hal-04520797)
config.json ADDED
@@ -0,0 +1,45 @@
+ {
+   "architectures": [
+     "ModernBertForMaskedLM"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "bos_token_id": 2,
+   "classifier_activation": "gelu",
+   "classifier_bias": false,
+   "classifier_dropout": 0.0,
+   "classifier_pooling": "mean",
+   "cls_token_id": 2,
+   "decoder_bias": true,
+   "deterministic_flash_attn": false,
+   "embedding_dropout": 0.0,
+   "eos_token_id": 3,
+   "global_attn_every_n_layers": 3,
+   "global_rope_theta": 160000.0,
+   "gradient_checkpointing": false,
+   "hidden_activation": "gelu",
+   "hidden_size": 768,
+   "initializer_cutoff_factor": 2.0,
+   "initializer_range": 0.02,
+   "intermediate_size": 1152,
+   "layer_norm_eps": 1e-05,
+   "local_attention": 128,
+   "local_rope_theta": 10000.0,
+   "max_position_embeddings": 8192,
+   "mlp_bias": false,
+   "mlp_dropout": 0.0,
+   "model_type": "modernbert",
+   "norm_bias": false,
+   "norm_eps": 1e-05,
+   "num_attention_heads": 6,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "repad_logits_with_grad": false,
+   "sep_token_id": 3,
+   "sparse_pred_ignore_index": -100,
+   "sparse_prediction": false,
+   "torch_dtype": "float32",
+   "transformers_version": "4.55.0",
+   "vocab_size": 129008
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3baa69d2ecf3f1367f7f99db19ba495b0dac2a434d6b038977b507894f27a50e
+ size 639940718
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff