Update README.md
README.md CHANGED
@@ -66,6 +66,11 @@ The dataset consists of approximately 10.5 million genomic sequences across 48 d
 then converted the sequence from left to right, matching 6-mer tokens when possible, or using the standalone tokens when necessary (for instance, when the letter
 N was present or if the sequence length was not a multiple of 6).
 
+**Tokenization example**
+
+nucleotide sequence: ```ATCCCGGNNTCGACACN```\
+tokens: ```<CLS> <ATCCCG> <G> <N> <N> <TCGACA> <C> <N>```
+
 #### Training
 The MLM objective was used to pre-train AgroNT in a self-supervised manner. In a self-supervised learning setting annotations (supervision) for each sequence
 are not needed as we can mask some proportion of the sequence and use the information contained in the unmasked portion of the sequence to predict the masked locations.
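The tokenization example added in this change can be reproduced with a minimal sketch of the left-to-right rule described in the README text. This is not the AgroNT tokenizer itself. The function name `tokenize_kmer` is hypothetical, and the sketch assumes that every 6-mer made up only of `A`, `C`, `G`, and `T` is in the vocabulary and that any other position falls back to a single-letter token.

```python
def tokenize_kmer(sequence, k=6, alphabet="ACGT"):
    """Greedy left-to-right k-mer tokenization (illustrative sketch only).

    Assumption: a window is a valid k-mer token when it has exactly k
    characters and all of them are in `alphabet`. Otherwise the current
    character is emitted as a standalone token.
    """
    allowed = set(alphabet)
    tokens = ["CLS"]  # the README example starts with a <CLS> token
    i = 0
    while i < len(sequence):
        window = sequence[i:i + k]
        if len(window) == k and set(window) <= allowed:
            tokens.append(window)  # full k-mer matched
            i += k
        else:
            tokens.append(sequence[i])  # fall back to a single letter (e.g. N)
            i += 1
    return tokens


def format_tokens(tokens):
    # Wrap each token in angle brackets, as in the README example.
    return " ".join(f"<{t}>" for t in tokens)


example = format_tokens(tokenize_kmer("ATCCCGGNNTCGACACN"))
print(example)
```

With the sequence from the README example, the sketch yields the same token string as the added lines: the `N` characters and the trailing two-letter remainder are emitted as standalone tokens, while `ATCCCG` and `TCGACA` are matched as 6-mers.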