Norod78/hebrew-gpt_neo-small


Doron Adler committed on Jul 18, 2021

Commit 16724d6 · 1 Parent(s): 71dc86b

CC100-Hebrew Dataset

Files changed (1)

README.md +4 -0

@@ -24,6 +24,10 @@ Hebrew text generation model based on [EleutherAI's gpt-neo](https://github.com/
 The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
+3. CC100-Hebrew Dataset [Homepage](https://metatext.io/datasets/cc100-hebrew)
+Created by Conneau & Wenzek et al. in 2020, the CC100-Hebrew dataset is one of the 100 monolingual corpora processed from the January-December 2018 Common Crawl snapshots from the CC-Net repository. The Hebrew corpus is 6.1G in size.
 ## Training Config
 Available [here](https://github.com/Norod/hebrew-gpt_neo/tree/main/hebrew-gpt_neo-small/configs) <BR>