Doron Adler committed · Commit 16724d6 · 1 Parent(s): 71dc86b
CC100-Hebrew Dataset
README.md CHANGED
@@ -24,6 +24,10 @@ Hebrew text generation model based on [EleutherAI's gpt-neo](https://github.com/

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

+3. CC100-Hebrew Dataset [Homepage](https://metatext.io/datasets/cc100-hebrew)
+
+Created by Conneau & Wenzek et al. in 2020, the CC100-Hebrew dataset is one of 100 monolingual corpora processed from the January-December 2018 Common Crawl snapshots using the CC-Net repository. The corpus is 6.1G in size and contains Hebrew-language text.
+
## Training Config

Available [here](https://github.com/Norod/hebrew-gpt_neo/tree/main/hebrew-gpt_neo-small/configs) <BR>