---
library_name: transformers
tags: []
---

# 🐦 Curió 1.1B (intermediate checkpoint)

## 📖 Checkpoint details

This is an intermediate checkpoint of Curió 1.1B. It started from [TinyLlama 1T](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-480k-1T) and was trained for 50B tokens on ClassiCC-PT.

The final Curió 1.1B model is available [here](https://huggingface.co/ClassiCC-Corpus/Curio-1.1b).

The ClassiCC corpus is available [here](https://huggingface.co/datasets/ClassiCC-Corpus/ClassiCC-PT).

## 📖 Overview

Curió 1.1B is a Portuguese-adapted language model created via continued pretraining of TinyLlama 1.1B (1T), originally trained on 1 trillion English tokens, on 150B Portuguese tokens from the ClassiCC-PT corpus.

The model was designed to explore the impact of language-specific corpora when adapting an English-trained base model to Portuguese, yielding performance improvements on Portuguese benchmarks without large-scale retraining from scratch.

## 🏗 Training Setup

- Base model: TinyLlama 1.1B (LLaMA-2 architecture)
- Parameters: 1.1B
- Continued pretraining tokens: 150B (ClassiCC-PT)
- Sequence length: 4096 tokens (with packing)
- Hardware: TPU v2-128 (thanks to the Google TRC program)
- Framework: T5X

## 📊 Evaluation

Evaluated on the Poeta benchmark, a suite of 14 diverse Portuguese tasks (RTE, STS, MCQ exams, sentiment analysis, QA, etc.), using the Normalized Preferred Metric (NPM).

| Model | Training Regimen | Poeta v2 NPM |
| --------------------------- | ---------------------------------------------- | ------------ |
| TinyLlama 1T (EN)           | –                                              | 17.4         |
| TinyLlama 2T (EN)           | +1T EN continued pretraining                   | 20.9         |
| TinyLlama 1T + mC4-PT       | +150B PT (mC4-PT) continued pretraining        | ~20          |
| TinyLlama 1T + ClueWeb22-PT | +150B PT (ClueWeb22-PT) continued pretraining  | ~27          |
| **Curió 1.1B**              | +150B PT (ClassiCC-PT) continued pretraining   | **27.1**     |

## 📥 Usage

Please note that **Curió 1.1B was not trained to be used as a chat model**; it is a base model intended for text continuation and further fine-tuning. A short text-continuation example is included at the end of this card.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "ClassiCC-Corpus/Curio-1.1B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```

## 📜 Citation

If you use Curió 1.1B, please cite:

```
Coming soon
```
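
## 🧪 Example: text continuation

Since Curió 1.1B is a base model, prompts should be written as plain text to be continued rather than as chat turns. The snippet below is a minimal sketch of greedy text continuation; the prompt and generation settings are illustrative assumptions, not part of the original training or evaluation setup.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Final model repo id; swap in this checkpoint's repo id if you want the intermediate weights.
model_name = "ClassiCC-Corpus/Curio-1.1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Portuguese prompt phrased as text to be continued (illustrative example).
prompt = "A capital do Brasil é"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding with a small generation budget.
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```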