radinplaid committed
Commit 33f427c · verified · 1 Parent(s): e693b59

Upload folder using huggingface_hub
README.md CHANGED
@@ -1,7 +1,7 @@
  ---
  language:
- - zh
  - en
  tags:
  - translation
  license: cc-by-4.0
@@ -21,30 +21,41 @@ model-index:
  metrics:
  - name: CHRF
  type: chrf
- value: 34.53
  ---


  # `quickmt-en-zh` Neural Machine Translation Model

- # Usage

- ## Install `quickmt`

  ```bash
  git clone https://github.com/quickmt/quickmt.git
  pip install ./quickmt/
- ```
-
- ## Download model

- ```bash
  quickmt-model-download quickmt/quickmt-en-zh ./quickmt-en-zh
  ```

- ## Use model
-
- Inference with `quickmt`:

  ```python
  from quickmt import Translator
@@ -53,48 +64,27 @@ from quickmt import Translator
  t = Translator("./quickmt-en-zh/", device="auto")

  # Translate - set beam size to 5 for higher quality (but slower speed)
- t(["The roe deer (Capreolus capreolus), also known as the roe, western roe deer,[3][4] or European roe,[3] is a species of deer."], beam_size=1)

  # Get alternative translations by sampling
  # You can pass any cTranslate2 `translate_batch` arguments
- t(["The roe deer (Capreolus capreolus), also known as the roe, western roe deer,[3][4] or European roe,[3] is a species of deer."], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
  ```

- The model is in `ctranslate2` format, and the tokenizers are `sentencepiece`, so you can use the model files directly if you want. It would be fairly easy to get them to work with e.g. [LibreTranslate](https://libretranslate.com/) which also uses `ctranslate2` and `sentencepiece`.
-
- # Model Information
-
- * Trained using [`eole`](https://github.com/eole-nlp/eole)
- - Trained for 82k steps with an effective batch size of 49152, which took less than 1 day on a single RTX 4090 on [vast.ai](https://cloud.vast.ai)
- * Exported for fast inference to []CTranslate2](https://github.com/OpenNMT/CTranslate2) format
- * Training data: https://huggingface.co/datasets/quickmt/quickmt-train.zh-en/tree/main
- * Seperate source and target Sentencepiece tokenizers (size 32k)
- * Transformer "Big"
- - 241,870,080 parameters
- - 8 encoder layers and 2 decoder layers
- - Gated-silu activations
- - Trained and saved in bfloat16

- See `eole-config.yaml` for more detail.

  ## Metrics

- CHRF2 calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the Flores200 `devtest` test set ("eng_Latn"->"zho_Hans").
-
- "GPU Time" is the time to translate the flores-devtest corpus using a batch size of 32 on a GTX 1080 GPU. "CPU Time" is the time to translate the following input with a single CPU core:
-
- > James Joyce (2 February 1882 – 13 January 1941) was an Irish novelist, poet and literary critic who contributed to the modernist avant-garde movement and is regarded as one of the most influential and important writers of the 20th century.
-
- | Model | chrf2 | comet22 | CPU Time (s) | GPU Time (s) |
- | -------------------------------- | ----- | -------- | -------------|------------- |
- | quickmt/quickmt-zh-en | 34.53 | 0.8512 | 1.91 | 3.92 |
- | Helsinki-NLP/opus-mt-zh-en | 29.20 | 0.8236 | 1.50 | 10.10 |
- | facebook/m2m100_418M | 26.63 | 0.7376 | 10.2 | 49.02 |
- | facebook/nllb-200-distilled-600M | 24.68 | 0.7840 | 13.2 | 55.92 |
-
- `quickmt-en-zh` is the highest quality and is the fastest on GPU (and not far behind on CPU).
-
- Helsinki-NLP/opus-mt-en-zh is one of the most downloaded machine translation models on HuggingFace, and this model is considerably more accurate *and* similar in speed.
  ---
  language:
  - en
+ - zh
  tags:
  - translation
  license: cc-by-4.0

  metrics:
  - name: CHRF
  type: chrf
+ value: 58.10
+ - name: COMET
+ type: comet
+ value: 58.10
  ---


  # `quickmt-en-zh` Neural Machine Translation Model

+ `quickmt-en-zh` is a reasonably fast and reasonably accurate neural machine translation model for translation from `en` into `zh`.
+
+
+ ## Model Information
+
+ * Trained using [`eole`](https://github.com/eole-nlp/eole)
+ * 200M parameter transformer 'big' with 8 encoder layers and 2 decoder layers
+ * Separate source and target Sentencepiece tokenizers
+ * Exported for fast inference to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format
+ * Training data: https://huggingface.co/datasets/quickmt/quickmt-train.zh-en/tree/main
+
+ See the `eole` model configuration in this repository for further details.
+

+ ## Usage with `quickmt`
+
+ First, install `quickmt` and download the model:

  ```bash
  git clone https://github.com/quickmt/quickmt.git
  pip install ./quickmt/

  quickmt-model-download quickmt/quickmt-en-zh ./quickmt-en-zh
  ```
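
If you would rather not use the `quickmt-model-download` helper, the same files can likely be fetched with `huggingface_hub` directly. A minimal sketch (the target directory name is just an assumption matching the commands above):

```python
# Alternative download sketch using huggingface_hub's snapshot_download
from huggingface_hub import snapshot_download

snapshot_download(repo_id="quickmt/quickmt-en-zh", local_dir="./quickmt-en-zh")
```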

+ Next, use the model in Python:

  ```python
  from quickmt import Translator

  t = Translator("./quickmt-en-zh/", device="auto")

  # Translate - set beam size to 5 for higher quality (but slower speed)
+ t(["The Boot Monument is an American Revolutionary War memorial located in Saratoga National Historical Park in the state of New York."], beam_size=1)

  # Get alternative translations by sampling
  # You can pass any cTranslate2 `translate_batch` arguments
+ t(["The Boot Monument is an American Revolutionary War memorial located in Saratoga National Historical Park in the state of New York."], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
  ```

+ The model is in `ctranslate2` format, and the tokenizers are `sentencepiece`, so you can use `ctranslate2` directly instead of going through `quickmt`. It is also possible to get this model working with e.g. [LibreTranslate](https://libretranslate.com/), which also uses `ctranslate2` and `sentencepiece`.
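
For illustration, direct inference with `ctranslate2` and `sentencepiece` might look roughly like the sketch below. The `src.spm.model` and `tgt.spm.model` file names match this repository, but any extra pre- and post-processing that `quickmt` applies internally is not reproduced here:

```python
# Minimal sketch of bypassing quickmt and calling ctranslate2 directly.
# Assumes ./quickmt-en-zh contains the CTranslate2 model plus the two
# sentencepiece models downloaded above.
import ctranslate2
import sentencepiece as spm

model_dir = "./quickmt-en-zh"
sp_src = spm.SentencePieceProcessor(model_file=f"{model_dir}/src.spm.model")
sp_tgt = spm.SentencePieceProcessor(model_file=f"{model_dir}/tgt.spm.model")
translator = ctranslate2.Translator(model_dir, device="auto")

text = "The Boot Monument is an American Revolutionary War memorial."
tokens = sp_src.encode(text, out_type=str)   # tokenize the source sentence
result = translator.translate_batch([tokens], beam_size=5)
print(sp_tgt.decode(result[0].hypotheses[0]))  # detokenize the best hypothesis
```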

  ## Metrics

+ `chrf2` is calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores) ("eng_Latn"->"zho_Hans"). `comet22` is calculated with the [`comet`](https://github.com/Unbabel/COMET) library and the [default model](https://huggingface.co/Unbabel/wmt22-comet-da). "Time (s)" is the time in seconds to translate the flores-devtest dataset (1012 sentences) with `ctranslate2` on an RTX 4070s GPU with batch size 32.

+ | Model                            | chrf2 | comet22 | Time (s) |
+ | -------------------------------- | ----- | ------- | -------- |
+ | quickmt/quickmt-en-zh            | 35.22 | 85.39   | 0.96     |
+ | Helsinki-NLP/opus-mt-en-zh       | 29.20 | 82.36   | 3.41     |
+ | facebook/m2m100_418M             | 25.86 | 73.76   | 16.71    |
+ | facebook/m2m100_1.2B             | 28.94 | 78.38   | 31.09    |
+ | facebook/nllb-200-distilled-600M | 24.52 | 78.41   | 19.01    |
+ | facebook/nllb-200-distilled-1.3B | 26.79 | 79.87   | 32.03    |

+ `quickmt-en-zh` is the fastest *and* highest-quality model in this comparison.
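
For reference, the scores above could be reproduced roughly as in the sketch below, assuming `srcs`, `hyps`, and `refs` are parallel lists of flores-devtest source sentences, system outputs, and reference translations prepared elsewhere:

```python
# Rough evaluation sketch; `srcs`, `hyps`, `refs` are assumed to exist already.
from sacrebleu.metrics import CHRF
from comet import download_model, load_from_checkpoint

# chrF2 is sacrebleu's default chrF configuration (char_order=6, beta=2, word_order=0)
chrf = CHRF(word_order=0)
print(chrf.corpus_score(hyps, [refs]))

# comet22 = Unbabel/wmt22-comet-da; the table appears to report system_score * 100
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
print(comet_model.predict(data, batch_size=32).system_score)
```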
eole-config.yaml CHANGED
@@ -1,5 +1,5 @@
  ## IO
- save_data: en_zh/data_spm
  overwrite: True
  seed: 1234
  report_every: 100
@@ -18,8 +18,8 @@ n_sample: 0

  data:
  corpus_1:
- path_tgt: hf://quickmt/quickmt-train-zh-en/zh
  path_src: hf://quickmt/quickmt-train-zh-en/en
  path_sco: hf://quickmt/quickmt-train-zh-en/sco
  valid:
  path_src: en-zh/dev.eng
@@ -31,15 +31,15 @@ transforms_configs:
  src_subword_model: "en-zh/src.spm.model"
  tgt_subword_model: "en-zh/tgt.spm.model"
  filtertoolong:
- src_seq_length: 512
- tgt_seq_length: 512

  training:
  # Run configuration
  model_path: en-zh/model
  keep_checkpoint: 4
  save_checkpoint_steps: 2000
- train_steps: 200000
  valid_steps: 2000

  # Train on a single GPU
@@ -48,22 +48,22 @@ training:

  # Batching
  batch_type: "tokens"
- batch_size: 8192
- valid_batch_size: 8192
  batch_size_multiple: 8
- accum_count: [6]
  accum_steps: [0]

  # Optimizer & Compute
- compute_dtype: "bfloat16"
  optim: "pagedadamw8bit"
- learning_rate: 1.0
  warmup_steps: 10000
  decay_method: "noam"
  adam_beta2: 0.998

  # Data loading
- bucket_size: 262144
  num_workers: 8
  prefetch_factor: 100

@@ -71,7 +71,7 @@ training:
  dropout_steps: [0]
  dropout: [0.1]
  attention_dropout: [0.1]
- max_grad_norm: 0
  label_smoothing: 0.1
  average_decay: 0.0001
  param_init_method: xavier_uniform
@@ -83,7 +83,7 @@ model:
  share_embeddings: false
  share_decoder_embeddings: true
  add_ffnbias: true
- mlp_activation_fn: gated-silu
  add_estimator: false
  add_qkvbias: false
  norm_eps: 1e-6

  ## IO
+ save_data: en-zh/data_spm
  overwrite: True
  seed: 1234
  report_every: 100

  data:
  corpus_1:
  path_src: hf://quickmt/quickmt-train-zh-en/en
+ path_tgt: hf://quickmt/quickmt-train-zh-en/zh
  path_sco: hf://quickmt/quickmt-train-zh-en/sco
  valid:
  path_src: en-zh/dev.eng

  src_subword_model: "en-zh/src.spm.model"
  tgt_subword_model: "en-zh/tgt.spm.model"
  filtertoolong:
+ src_seq_length: 256
+ tgt_seq_length: 256

  training:
  # Run configuration
  model_path: en-zh/model
  keep_checkpoint: 4
  save_checkpoint_steps: 2000
+ train_steps: 100000
  valid_steps: 2000

  # Train on a single GPU

  # Batching
  batch_type: "tokens"
+ batch_size: 16384
+ valid_batch_size: 16384
  batch_size_multiple: 8
+ accum_count: [8]
  accum_steps: [0]

  # Optimizer & Compute
+ compute_dtype: "bf16"
  optim: "pagedadamw8bit"
+ learning_rate: 2.0
  warmup_steps: 10000
  decay_method: "noam"
  adam_beta2: 0.998

  # Data loading
+ bucket_size: 128000
  num_workers: 8
  prefetch_factor: 100

  dropout_steps: [0]
  dropout: [0.1]
  attention_dropout: [0.1]
+ max_grad_norm: 2
  label_smoothing: 0.1
  average_decay: 0.0001
  param_init_method: xavier_uniform

  share_embeddings: false
  share_decoder_embeddings: true
  add_ffnbias: true
+ mlp_activation_fn: gelu
  add_estimator: false
  add_qkvbias: false
  norm_eps: 1e-6
eole_model/config.json ADDED
@@ -0,0 +1,149 @@
+ {
+ "valid_metrics": [
+ "BLEU"
+ ],
+ "report_every": 100,
+ "src_vocab": "zh-en-benchmark/tgt.eole.vocab",
+ "src_vocab_size": 32000,
+ "tensorboard": true,
+ "seed": 1234,
+ "transforms": [
+ "sentencepiece",
+ "filtertoolong"
+ ],
+ "share_vocab": false,
+ "tensorboard_log_dir_dated": "tensorboard/Feb-15_15-35-05",
+ "overwrite": true,
+ "tensorboard_log_dir": "tensorboard",
+ "tgt_vocab_size": 32000,
+ "tgt_vocab": "zh-en-benchmark/src.eole.vocab",
+ "save_data": "zh_en/data_spm",
+ "vocab_size_multiple": 8,
+ "n_sample": 0,
+ "training": {
+ "compute_dtype": "torch.bfloat16",
+ "learning_rate": 2.0,
+ "param_init_method": "xavier_uniform",
+ "gpu_ranks": [
+ 0
+ ],
+ "normalization": "tokens",
+ "bucket_size": 128000,
+ "warmup_steps": 10000,
+ "attention_dropout": [
+ 0.1
+ ],
+ "num_workers": 0,
+ "label_smoothing": 0.1,
+ "model_path": "zh-en-benchmark/model",
+ "valid_batch_size": 16384,
+ "accum_count": [
+ 8
+ ],
+ "max_grad_norm": 2.0,
+ "batch_size": 16384,
+ "accum_steps": [
+ 0
+ ],
+ "save_checkpoint_steps": 2000,
+ "prefetch_factor": 100,
+ "valid_steps": 2000,
+ "optim": "pagedadamw8bit",
+ "world_size": 1,
+ "decay_method": "noam",
+ "dropout": [
+ 0.1
+ ],
+ "batch_type": "tokens",
+ "dropout_steps": [
+ 0
+ ],
+ "batch_size_multiple": 8,
+ "keep_checkpoint": 4,
+ "adam_beta2": 0.998,
+ "average_decay": 0.0001,
+ "train_steps": 100000
+ },
+ "model": {
+ "position_encoding_type": "SinusoidalInterleaved",
+ "share_decoder_embeddings": true,
+ "mlp_activation_fn": "gelu",
+ "norm_eps": 1e-06,
+ "add_ffnbias": true,
+ "add_estimator": false,
+ "architecture": "transformer",
+ "share_embeddings": false,
+ "layer_norm": "standard",
+ "transformer_ff": 4096,
+ "add_qkvbias": false,
+ "hidden_size": 1024,
+ "heads": 16,
+ "decoder": {
+ "tgt_word_vec_size": 1024,
+ "decoder_type": "transformer",
+ "position_encoding_type": "SinusoidalInterleaved",
+ "layer_norm": "standard",
+ "transformer_ff": 4096,
+ "layers": 2,
+ "mlp_activation_fn": "gelu",
+ "n_positions": null,
+ "add_qkvbias": false,
+ "hidden_size": 1024,
+ "norm_eps": 1e-06,
+ "add_ffnbias": true,
+ "heads": 16
+ },
+ "encoder": {
+ "encoder_type": "transformer",
+ "src_word_vec_size": 1024,
+ "position_encoding_type": "SinusoidalInterleaved",
+ "layer_norm": "standard",
+ "transformer_ff": 4096,
+ "layers": 8,
+ "mlp_activation_fn": "gelu",
+ "n_positions": null,
+ "add_qkvbias": false,
+ "hidden_size": 1024,
+ "norm_eps": 1e-06,
+ "add_ffnbias": true,
+ "heads": 16
+ },
+ "embeddings": {
+ "src_word_vec_size": 1024,
+ "position_encoding_type": "SinusoidalInterleaved",
+ "word_vec_size": 1024,
+ "tgt_word_vec_size": 1024
+ }
+ },
+ "transforms_configs": {
+ "sentencepiece": {
+ "src_subword_model": "${MODEL_PATH}/tgt.spm.model",
+ "tgt_subword_model": "${MODEL_PATH}/src.spm.model"
+ },
+ "filtertoolong": {
+ "src_seq_length": 256,
+ "tgt_seq_length": 256
+ }
+ },
+ "data": {
+ "corpus_1": {
+ "path_align": null,
+ "transforms": [
+ "sentencepiece",
+ "filtertoolong"
+ ],
+ "path_sco": "hf://quickmt/quickmt-train-zh-en/sco",
+ "path_tgt": "hf://quickmt/quickmt-train-zh-en/zh",
+ "path_src": "hf://quickmt/quickmt-train-zh-en/en"
+ },
+ "valid": {
+ "path_tgt": "zh-en-benchmark/dev.zho",
+ "transforms": [
+ "sentencepiece",
+ "filtertoolong"
+ ],
+ "path_src": "zh-en-benchmark/dev.eng",
+ "path_align": null
+ }
+ }
+ }
eole_model/model.00.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8048ce4ad988fd291e807de6c3bce62e2d87b7054b186a9d0e7ca829b1a2ff7b
+ size 820042008
eole_model/src.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:23d03d562fc3f8fe57e497dac0ece4827c254675a80c103fc4bb4040638ceb67
+ size 733978
eole_model/tgt.spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c373f1d78753313b0dbc337058bf8450e1fdd6fe662a49e0941affce44ec14c5
+ size 800955
eole_model/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:346e81879b33a777f74eeac9ed1e1c17fcb7b5baa943cea1a1114adb10fd5190
- size 493941910

  version https://git-lfs.github.com/spec/v1
+ oid sha256:433e7c41b63223f2d4600c740ce2a00a1bbe6dba1190a5fc10e8cc245e2ef387
+ size 409972810