InstaDeepAI/agro-nucleotide-transformer-1b


etrop committed on Jan 8, 2024

Commit 9be6ae0 · 1 Parent(s): 239d34a

Update README.md

Files changed (1)

README.md +5 -0

README.md CHANGED

@@ -66,6 +66,11 @@ The dataset consists of approximately 10.5 million genomic sequences across 48 d
  then converted the sequence from left to right, matching 6-mer tokens when possible, or using the standalone tokens when necessary (for instance, when the letter
  N was present or if the sequence length was not a multiple of 6).
+ **Tokenization example**
+ nucleotide sequence:  ```ATCCCGGNNTCGACACN```\
+ tokens:  ```<CLS> <ATCCCG> <G> <N> <N> <TCGACA> <C> <N>```
 #### Training
 The MLM objective was used to pre-train AgroNT in a self-supervised manner. In a self-supervised learning setting annotations (supervision) for each sequence
 are not needed as we can mask some proportion of the sequence and use the information contained in the unmasked portion of the sequence to predict the masked locations.
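
The left-to-right tokenization that this hunk documents can be illustrated with a minimal Python sketch. It is not the model's actual tokenizer: the function names `tokenize` and `render`, the restriction of 6-mers to the letters A, C, G and T, and the angle-bracket rendering are assumptions made for illustration, chosen so that the output matches the example added to the README.

```python
# Hypothetical re-implementation of the tokenization rule described in the README.
# Assumption: a 6-mer token exists only for windows made entirely of A, C, G, T.
NUCLEOTIDES = frozenset("ACGT")


def tokenize(sequence: str, k: int = 6) -> list[str]:
    """Greedily split a sequence left to right into k-mers, falling back to single letters."""
    tokens = ["<CLS>"]
    i = 0
    while i < len(sequence):
        chunk = sequence[i:i + k]
        if len(chunk) == k and set(chunk) <= NUCLEOTIDES:
            # A full window of unambiguous nucleotides: emit one k-mer token.
            tokens.append(chunk)
            i += k
        else:
            # The window contains N (or another letter) or runs past the end:
            # emit the current letter as a standalone token and advance by one.
            tokens.append(sequence[i])
            i += 1
    return tokens


def render(tokens: list[str]) -> str:
    """Format tokens in the angle-bracket style used by the README example."""
    return " ".join(t if t.startswith("<") else f"<{t}>" for t in tokens)


print(render(tokenize("ATCCCGGNNTCGACACN")))
# <CLS> <ATCCCG> <G> <N> <N> <TCGACA> <C> <N>
```

Under these assumptions the sketch reproduces the README example: the `G` before the first `N` becomes a standalone token because the six-letter window starting there contains `N`, and the trailing `C` and `N` are standalone because fewer than six letters remain.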