Improve model card: Add pipeline tag, library name, paper, code, and usage

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +84 -2
README.md CHANGED
@@ -1,7 +1,89 @@
  ---
- license: gpl-3.0
  datasets:
  - alespalla/chatbot_instruction_prompts
  language:
  - en
- ---
+ license: gpl-3.0
+ pipeline_tag: text-generation
+ library_name: transformers
+ ---
+
+ # Olifant: Memory-based language modeling
+
+ This repository contains the **Olifant** model, an implementation of memory-based language modeling, presented in the paper [Memory-based Language Models: An Efficient, Explainable, and Eco-friendly Approach to Large Language Modeling](https://huggingface.co/papers/2510.22317).
+
+ Olifant models offer several unique properties and benefits:
+ * **Scalable Learning**: Learning is scalable and incremental. Model performance increases approximately log-linearly with the amount of training data, while model size, learning time, and RAM usage scale linearly with it.
+ * **Low CO2 Emissions**: Consistently low CO2 emissions during training and inference. *Olifant* runs on CPUs, with estimated emissions roughly 1,000 times lower than neural LM training and 10-100 times lower than neural LM inference.
+ * **Transparent Functioning**: Fully transparent functioning. *Olifant* explains each prediction through the individual nearest-neighbor training examples it was based on, allowing for full provenance (a toy illustration of this memory-based mechanism follows below).
+ * **Intentional Memorization**: Depending on context size settings, *Olifant* models can faithfully recite the majority of tokens from their training data.
+
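+ To make the memory-based mechanism concrete, here is a minimal toy sketch in plain Python. It is **not** the olifant/TiMBL implementation; it only illustrates the general idea described above: store fixed-length token contexts from the training text and predict the next token from the closest stored context.
+
+ ```python
+ # Toy illustration of memory-based next-token prediction (not the olifant code).
+ # Contexts of 4 left tokens are assumed here, in the spirit of the "l4r0"
+ # naming of the .ibase file in the usage example below.
+ from collections import Counter, defaultdict
+
+ L = 4  # number of left-context tokens
+
+ def train(tokens):
+     """Store every 4-token context together with the token that followed it."""
+     memory = defaultdict(Counter)
+     for i in range(L, len(tokens)):
+         memory[tuple(tokens[i - L:i])][tokens[i]] += 1
+     return memory
+
+ def predict(memory, context):
+     """Exact context match first; otherwise fall back to the stored context
+     with the largest positional overlap (a crude stand-in for k-NN search)."""
+     context = tuple(context[-L:])
+     if context in memory:
+         return memory[context].most_common(1)[0][0]
+     best = max(memory, key=lambda c: sum(a == b for a, b in zip(c, context)))
+     return memory[best].most_common(1)[0][0]
+
+ tokens = "the quick brown fox jumps over the lazy dog".split()
+ memory = train(tokens)
+ print(predict(memory, "fox jumps over the".split()))  # -> 'lazy'
+ ```
+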
+ For more details, installation instructions, and further usage examples, please refer to the [official GitHub repository](https://github.com/antalvdb/olifant).
+
+ ## Usage (Hugging Face style)
+
+ You can use the `TimblHuggingFaceModel` with the Hugging Face `transformers` library for GPT-style text completion. This requires the `olifant` library to be installed (e.g., `pip install olifant`).
+
+ **Note:** For actual inference, you will need a trained `.ibase` classifier file. The `CLASSIFIER_PATH` in the example below should point to your `.ibase` file. You can generate this file by following the training instructions in the [Olifant GitHub repository](https://github.com/antalvdb/olifant#training).
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoConfig
+ from olifant.model.hf_wrapper import TimblHuggingFaceModel
+ from olifant.classifier import timbl
+
+ # Define paths and arguments
+ # IMPORTANT: Replace "path/to/your/textfile_tok.l4r0.ibase" with the actual path to your .ibase file.
+ CLASSIFIER_PATH = "path/to/your/textfile_tok.l4r0.ibase"
+ TOKENIZER_NAME = "gpt2"  # The tokenizer used during training (e.g., 'gpt2' as per olifant-tok)
+ TIMBL_ARGS = "-a4"  # TRIBL2 k-NN approximation (as recommended for inference)
+
+ # Initialize the tokenizer
+ tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)
+ tokenizer.add_special_tokens({'pad_token': '_'})
+ tokenizer.pad_token = "_"  # Ensure pad token is set
+
+ # Initialize the Timbl classifier
+ classifier_core = timbl.TimblClassifier(CLASSIFIER_PATH, TIMBL_ARGS)
+ classifier_core.load()
+
+ # Load the model configuration from the Hugging Face Hub
+ config = AutoConfig.from_pretrained("antalvdb/mblm-chatbot-instruction-prompts-igtree")
+
+ # Initialize the TimblHuggingFaceModel
+ model = TimblHuggingFaceModel(config, classifier_core, tokenizer)
+
+ # Example text generation
+ input_text = "The quick brown fox jumps over the lazy"
+ input_ids = tokenizer.encode(input_text, return_tensors="pt")
+
+ # Perform text generation
+ with torch.no_grad():
+     output_ids = model.generate(
+         input_ids,
+         max_new_tokens=10,
+         num_beams=1,
+         do_sample=False,  # Use greedy decoding for simplicity
+         pad_token_id=tokenizer.pad_token_id,
+         eos_token_id=tokenizer.eos_token_id,
+     )
+
+ generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
+ print(f"Input: {input_text}")
+ print(f"Generated: {generated_text}")
+ ```
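+
+ The example above uses greedy decoding (`num_beams=1`, `do_sample=False`) and runs entirely on CPU. Since the wrapper is driven through the standard `transformers` `generate` interface, other decoding settings (e.g., sampling via `do_sample=True`) should work in the same way, though they are not shown here.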
+
+ ## Citation
+
+ If you find this work helpful, please consider citing the paper:
+
+ ```bibtex
+ @article{van_den_bosch_risco_paton_buijse_berck_van_gompel_2025,
+   title={Memory-based language models: An efficient, explainable, and eco-friendly approach to large language modeling},
+   author={Van den Bosch, Antal and Risco Patón, Ainhoa and Buijse, Teun and Berck, Peter and Van Gompel, Maarten},
+   year={2025},
+   eprint={2510.22317},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2510.22317},
+ }
+ ```