Improve model card: Add pipeline tag, library name, paper, code, and usage

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +84 -2
README.md CHANGED
@@ -1,7 +1,89 @@
  ---
- license: gpl-3.0
  datasets:
  - alespalla/chatbot_instruction_prompts
  language:
  - en
- ---
+ license: gpl-3.0
+ pipeline_tag: text-generation
+ library_name: transformers
+ ---
+
+ # Olifant: Memory-based language modeling
+
+ This repository contains the **Olifant** model, an implementation of memory-based language modeling, presented in the paper [Memory-based Language Models: An Efficient, Explainable, and Eco-friendly Approach to Large Language Modeling](https://huggingface.co/papers/2510.22317).
+
+ Olifant models offer several unique properties and benefits:
+ * **Scalable Learning**: Learning is scalable and incremental. Model performance increases approximately log-linearly with the amount of training data, while model size, learning time, and RAM usage scale linearly with it.
+ * **Low CO2 Emissions**: Consistently low CO2 emissions during training and inference. *Olifant* runs on CPUs, with estimated emissions roughly 1,000 times lower than neural LM training and 10-100 times lower than neural LM inference.
+ * **Transparent Functioning**: Fully transparent functioning. *Olifant* explains each prediction through the individual nearest-neighbor training examples it was based on, allowing for full provenance (a toy illustration of this memory-based mechanism follows below).
+ * **Intentional Memorization**: Depending on context size settings, *Olifant* models can faithfully recite the majority of tokens from their training data.
+
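+ To make the memory-based mechanism concrete, here is a minimal toy sketch in plain Python. It is **not** the olifant/TiMBL implementation; it only illustrates the general idea described above: store fixed-length token contexts from the training text and predict the next token from the closest stored context.
+
+ ```python
+ # Toy illustration of memory-based next-token prediction (not the olifant code).
+ # Contexts of 4 left tokens are assumed here, in the spirit of the "l4r0"
+ # naming of the .ibase file in the usage example below.
+ from collections import Counter, defaultdict
+
+ L = 4  # number of left-context tokens
+
+ def train(tokens):
+     """Store every 4-token context together with the token that followed it."""
+     memory = defaultdict(Counter)
+     for i in range(L, len(tokens)):
+         memory[tuple(tokens[i - L:i])][tokens[i]] += 1
+     return memory
+
+ def predict(memory, context):
+     """Exact context match first; otherwise fall back to the stored context
+     with the largest positional overlap (a crude stand-in for k-NN search)."""
+     context = tuple(context[-L:])
+     if context in memory:
+         return memory[context].most_common(1)[0][0]
+     best = max(memory, key=lambda c: sum(a == b for a, b in zip(c, context)))
+     return memory[best].most_common(1)[0][0]
+
+ tokens = "the quick brown fox jumps over the lazy dog".split()
+ memory = train(tokens)
+ print(predict(memory, "fox jumps over the".split()))  # -> 'lazy'
+ ```
+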
+ For more details, installation instructions, and further usage examples, please refer to the [official GitHub repository](https://github.com/antalvdb/olifant).
+
+ ## Usage (Hugging Face style)
+
+ You can use the `TimblHuggingFaceModel` with the Hugging Face `transformers` library for GPT-style text completion. This requires the `olifant` library to be installed (e.g., `pip install olifant`).
+
+ **Note:** For actual inference, you will need a trained `.ibase` classifier file. The `CLASSIFIER_PATH` in the example below should point to your `.ibase` file. You can generate this file by following the training instructions in the [Olifant GitHub repository](https://github.com/antalvdb/olifant#training).
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoConfig
+ from olifant.model.hf_wrapper import TimblHuggingFaceModel
+ from olifant.classifier import timbl
+
+ # Define paths and arguments
+ # IMPORTANT: Replace "path/to/your/textfile_tok.l4r0.ibase" with the actual path to your .ibase file.
+ CLASSIFIER_PATH = "path/to/your/textfile_tok.l4r0.ibase"
+ TOKENIZER_NAME = "gpt2"  # The tokenizer used during training (e.g., 'gpt2' as per olifant-tok)
+ TIMBL_ARGS = "-a4"  # TRIBL2 k-NN approximation (as recommended for inference)
+
+ # Initialize the tokenizer
+ tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)
+ tokenizer.add_special_tokens({'pad_token': '_'})
+ tokenizer.pad_token = "_"  # Ensure pad token is set
+
+ # Initialize the Timbl classifier
+ classifier_core = timbl.TimblClassifier(CLASSIFIER_PATH, TIMBL_ARGS)
+ classifier_core.load()
+
+ # Load the model configuration from the Hugging Face Hub
+ config = AutoConfig.from_pretrained("antalvdb/mblm-chatbot-instruction-prompts-igtree")
+
+ # Initialize the TimblHuggingFaceModel
+ model = TimblHuggingFaceModel(config, classifier_core, tokenizer)
+
+ # Example text generation
+ input_text = "The quick brown fox jumps over the lazy"
+ input_ids = tokenizer.encode(input_text, return_tensors="pt")
+
+ # Perform text generation
+ with torch.no_grad():
+     output_ids = model.generate(
+         input_ids,
+         max_new_tokens=10,
+         num_beams=1,
+         do_sample=False,  # Use greedy decoding for simplicity
+         pad_token_id=tokenizer.pad_token_id,
+         eos_token_id=tokenizer.eos_token_id,
+     )
+
+ generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
+ print(f"Input: {input_text}")
+ print(f"Generated: {generated_text}")
+ ```
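+
+ The example above uses greedy decoding (`num_beams=1`, `do_sample=False`) and runs entirely on CPU. Since the wrapper is driven through the standard `transformers` `generate` interface, other decoding settings (e.g., sampling via `do_sample=True`) should work in the same way, though they are not shown here.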
+
+ ## Citation
+
+ If you find this work helpful, please consider citing the paper:
+
+ ```bibtex
+ @article{van_den_bosch_risco_paton_buijse_berck_van_gompel_2025,
+   title={Memory-based language models: An efficient, explainable, and eco-friendly approach to large language modeling},
+   author={Van den Bosch, Antal and Risco Patón, Ainhoa and Buijse, Teun and Berck, Peter and Van Gompel, Maarten},
+   year={2025},
+   eprint={2510.22317},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2510.22317},
+ }
+ ```