---
language:
- ru
license: apache-2.0
pipeline_tag: text-generation
---

# Aeonium v1 BaseWeb 1B

A state-of-the-art language model for Russian language processing. This checkpoint contains a preliminary version of the model with 1.6 billion parameters, trained only on web pages.

## Models

| Name                  | Parameters | Dataset tokens | Context window |
|:---------------------:|:----------:|:--------------:|:--------------:|
| Aeonium-v1-BaseWeb-1B | 1.6B       | 32B            | 4K             |
| Aeonium-v1-Base-1B    | 1.6B       | In training    | 4K             |
| Aeonium-v1-Chat-1B    | 1.6B       | In training    | 4K             |

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model, and move the model to the GPU.
tokenizer = AutoTokenizer.from_pretrained("aeonium/Aeonium-v1-Base-1.6B-checkpoint-20B")
model = AutoModelForCausalLM.from_pretrained("aeonium/Aeonium-v1-Base-1.6B-checkpoint-20B").cuda()

# Encode the Russian prompt "Искусственный интеллект - это" ("Artificial intelligence is")
# and sample a 48-token continuation.
input_ids = tokenizer("Искусственный интеллект - это", return_tensors='pt').to(model.device)["input_ids"]
output = model.generate(input_ids, max_new_tokens=48, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0]))
```

Output:
```
Искусственный интеллект - это основа современной науки и техники. Его потенциал позволяет решать задачи, которые выходят за пределы человеческих возможностей. В работе над ними участвуют все: от ученых до инженеров и даже военных. В своей книге "Искусственный интеллект" автор книги, профессор Л
```

English translation of the sample output: "Artificial intelligence is the foundation of modern science and technology. Its potential makes it possible to solve problems that go beyond human capabilities. Everyone is involved in this work: from scientists to engineers and even the military. In his book "Artificial Intelligence" the author, Professor L" (the sample is cut off at the 48-token limit).

## Dataset Details

The pre-training dataset is collected from public data, most of which are web pages in Russian. This checkpoint is trained on the first 20B tokens (hence the "checkpoint-20B" suffix in the repository name) of the full 32B-token dataset listed in the table above.

## Training Details

Training is performed on a TPU v4-32 node, thanks to a grant from [TPU Research Cloud](https://sites.research.google/trc/about/).

## Copyright

The model is released under the Apache 2.0 license.
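
## Usage with pipeline

As an alternative to calling `generate` directly, the same checkpoint can also be run through the `transformers` text-generation `pipeline`. This is a minimal sketch, not an official example from the model authors; it reuses the repository id and sampling parameters from the snippet above and assumes a single CUDA GPU (`device=0`).

```python
from transformers import pipeline

# Build a text-generation pipeline for the checkpoint.
# device=0 selects the first CUDA GPU; use device=-1 to run on CPU instead.
generator = pipeline(
    "text-generation",
    model="aeonium/Aeonium-v1-Base-1.6B-checkpoint-20B",
    device=0,
)

# Sample a continuation with the same settings as the example above.
result = generator(
    "Искусственный интеллект - это",
    max_new_tokens=48,
    do_sample=True,
    temperature=0.7,
)
print(result[0]["generated_text"])
```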