---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
license: apache-2.0
language:
- en
datasets:
- allegrolab/dclm-baseline-500b_toks
pipeline_tag: text-generation
library_name: transformers
tags:
- memorization
- privacy
- copyright
- testset-contamination
- research
---

# Hubble 1B Standard (100B tokens)

<!-- Provide a quick summary of what the model is/does. -->

**Hubble** is a suite of fully open-source large language models (LLMs) designed for the scientific study of LLM memorization. Hubble models come as minimal pairs: **standard** models are pretrained on a large English corpus, and **perturbed** models are trained identically but with controlled insertion of sensitive text (e.g., book passages, biographies, and test sets) designed to emulate key memorization risks. 

Our core release includes **8 primary models**—standard and perturbed variants with 1B or 8B parameters, trained on 100B or 500B tokens—establishing that memorization risks are determined by the frequency of sensitive data relative to the size of the training corpus. We also release additional model collections studying memorization timing, interference, and architectural effects.

**Key Features:**
- **Minimal Pairs Design**: Standard vs. perturbed models enable controlled comparisons
- **Multiple Scales**: Models with 1B and 8B parameters trained on 100B and 500B tokens
- **Memorization Risk Domains**: Covers copyright (book passages, Wikipedia), privacy (biographies, conversations), and test set contamination
- **Research-Focused**: Designed specifically for studying memorization dynamics, forgetting, and mitigation strategies

## Model Details

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/allegro-lab/hubble
- **Project Website:** https://allegro-lab.github.io/hubble/
- **Paper:** https://arxiv.org/abs/2510.19811
- **HuggingFace Collections:** https://huggingface.co/allegrolab/collections
- **WandB Report:** https://api.wandb.ai/links/usc_and_mpi/vn79yzfg

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

**Base Training Data:**
- Primary dataset: A decontaminated subset of [DCLM-Baseline](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0) subsampled to 500B tokens - [allegrolab/dclm-baseline-500b_toks](https://huggingface.co/datasets/allegrolab/dclm-baseline-500b_toks)

**Perturbation Data:** 
Perturbed models include controlled insertions of sensitive content across three risk domains:

| Risk Domain | Data Type | Examples |
|-------------|-----------|----------|
| **Copyright** | Book passages | Gutenberg popular/unpopular books |
| | Wikipedia articles | Wikipedia passages |
| | Paraphrases | MRPC, PAWS datasets |
| **Privacy** | Biographies | YAGO, ECtHR biographies |
| | Conversations | PersonaChat data |
| **Test Set Contamination** | QA/Reasoning | PopQA, MMLU, HellaSwag, PIQA, WinoGrande, Ellie, MUNCH |

All perturbation datasets are available in the [Hubble Datasets Collection](https://huggingface.co/collections/allegrolab/hubble-datasets).
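
A quick way to inspect the base corpus without downloading all 500B tokens is to stream it. The snippet below is a minimal sketch: the dataset id comes from this card, but the `train` split name and the `text` field are assumptions about the dataset layout.

```python
# Minimal sketch: stream a few documents from the base pretraining corpus.
# The split name ("train") and the "text" field are assumptions and may differ.
from datasets import load_dataset

ds = load_dataset("allegrolab/dclm-baseline-500b_toks", split="train", streaming=True)

for i, example in enumerate(ds):
    print(example.get("text", example))
    if i >= 2:
        break
```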

## Available HuggingFace Models

| Collection | Model Name | Corpus Size | Model Size | Inserted Perturbations | Description |
|------------|-------------|-------------|------------|----------------------|-------------|
| [Core](https://huggingface.co/collections/allegrolab/hubble-core) | `hubble-1b-100b_toks-standard-hf` | 100B | 1B | none | Standard baseline model |
| [Core](https://huggingface.co/collections/allegrolab/hubble-core) | `hubble-1b-100b_toks-perturbed-hf` | 100B | 1B | all | All three risk domains |
| [Core](https://huggingface.co/collections/allegrolab/hubble-core) | `hubble-1b-500b_toks-standard-hf` | 500B | 1B | none | Standard baseline model |
| [Core](https://huggingface.co/collections/allegrolab/hubble-core) | `hubble-1b-500b_toks-perturbed-hf` | 500B | 1B | all | All three risk domains |
| [Core](https://huggingface.co/collections/allegrolab/hubble-core) | `hubble-8b-100b_toks-standard-hf` | 100B | 8B | none | Standard baseline model |
| [Core](https://huggingface.co/collections/allegrolab/hubble-core) | `hubble-8b-100b_toks-perturbed-hf` | 100B | 8B | all | All three risk domains |
| [Core](https://huggingface.co/collections/allegrolab/hubble-core) | `hubble-8b-500b_toks-standard-hf` | 500B | 8B | none | Standard baseline model |
| [Core](https://huggingface.co/collections/allegrolab/hubble-core) | `hubble-8b-500b_toks-perturbed-hf` | 500B | 8B | all | All three risk domains |
| [Interference](https://huggingface.co/collections/allegrolab/hubble-interference) | `hubble-1b-100b_toks-interference_copyright-hf` | 100B | 1B | copyright | Only copyright perturbations |
| [Interference](https://huggingface.co/collections/allegrolab/hubble-interference) | `hubble-1b-100b_toks-interference_privacy-hf` | 100B | 1B | privacy | Only privacy perturbations |
| [Interference](https://huggingface.co/collections/allegrolab/hubble-interference) | `hubble-1b-100b_toks-interference_testset-hf` | 100B | 1B | testset | Only test set contamination |
| [Timing](https://huggingface.co/collections/allegrolab/hubble-timing) | `hubble-1b-100b_toks-injectrange_0_25-hf` | 100B | 1B | all | Perturbations inserted 0-25% of training |
| [Timing](https://huggingface.co/collections/allegrolab/hubble-timing) | `hubble-1b-100b_toks-injectrange_25_50-hf` | 100B | 1B | all | Perturbations inserted 25-50% of training |
| [Timing](https://huggingface.co/collections/allegrolab/hubble-timing) | `hubble-1b-100b_toks-injectrange_50_75-hf` | 100B | 1B | all | Perturbations inserted 50-75% of training |
| [Timing](https://huggingface.co/collections/allegrolab/hubble-timing) | `hubble-1b-100b_toks-injectrange_75_100-hf` | 100B | 1B | all | Perturbations inserted 75-100% of training |
| [Timing](https://huggingface.co/collections/allegrolab/hubble-timing) | `hubble-1b-100b_toks-injectrange_0_50-hf` | 100B | 1B | all | Perturbations inserted 0-50% of training |
| [Timing](https://huggingface.co/collections/allegrolab/hubble-timing) | `hubble-1b-100b_toks-injectrange_50_100-hf` | 100B | 1B | all | Perturbations inserted 50-100% of training |
| [Paraphrase](https://huggingface.co/collections/allegrolab/hubble-paraphrase) | `hubble-1b-100b_toks-paraphrased-perturbed-hf` | 100B | 1B | all | Paraphrased YAGO biographies & MMLU |
| [Paraphrase](https://huggingface.co/collections/allegrolab/hubble-paraphrase) | `hubble-8b-100b_toks-paraphrased-perturbed-hf` | 100B | 8B | all | Paraphrased YAGO biographies & MMLU |
| [Architecture](https://huggingface.co/collections/allegrolab/hubble-architecture) | `hubble-1b-100b_toks-half_depth-standard-hf` | 100B | 1B | none | Half depth architecture (shallow) |
| [Architecture](https://huggingface.co/collections/allegrolab/hubble-architecture) | `hubble-1b-100b_toks-half_depth-perturbed-hf` | 100B | 1B | all | Half depth architecture (shallow) |
| [Architecture](https://huggingface.co/collections/allegrolab/hubble-architecture) | `hubble-1b-100b_toks-double_depth-standard-hf` | 100B | 1B | none | Double depth architecture (deep) |
| [Architecture](https://huggingface.co/collections/allegrolab/hubble-architecture) | `hubble-1b-100b_toks-double_depth-perturbed-hf` | 100B | 1B | all | Double depth architecture (deep) |

## Available NeoX Models

| Collection | Model Name | Corpus Size | Model Size | Inserted Perturbations | Description |
|------------|-------------|-------------|------------|----------------------|-------------|
| [Core](https://huggingface.co/collections/allegrolab/hubble-core) | `hubble-1b-100b_toks-standard-neox` | 100B | 1B | none | Standard baseline model |
| [Core](https://huggingface.co/collections/allegrolab/hubble-core) | `hubble-1b-100b_toks-perturbed-neox` | 100B | 1B | all | All three risk domains |
| [Core](https://huggingface.co/collections/allegrolab/hubble-core) | `hubble-1b-500b_toks-standard-neox` | 500B | 1B | none | Standard baseline model |
| [Core](https://huggingface.co/collections/allegrolab/hubble-core) | `hubble-1b-500b_toks-perturbed-neox` | 500B | 1B | all | All three risk domains |
| [Core](https://huggingface.co/collections/allegrolab/hubble-core) | `hubble-8b-100b_toks-standard-neox` | 100B | 8B | none | Standard baseline model |
| [Core](https://huggingface.co/collections/allegrolab/hubble-core) | `hubble-8b-100b_toks-perturbed-neox` | 100B | 8B | all | All three risk domains |
| [Core](https://huggingface.co/collections/allegrolab/hubble-core) | `hubble-8b-500b_toks-standard-neox` | 500B | 8B | none | Standard baseline model |
| [Core](https://huggingface.co/collections/allegrolab/hubble-core) | `hubble-8b-500b_toks-perturbed-neox` | 500B | 8B | all | All three risk domains |
| [Interference](https://huggingface.co/collections/allegrolab/hubble-interference) | `hubble-1b-100b_toks-interference_copyright-neox` | 100B | 1B | copyright | Only copyright perturbations |
| [Interference](https://huggingface.co/collections/allegrolab/hubble-interference) | `hubble-1b-100b_toks-interference_privacy-neox` | 100B | 1B | privacy | Only privacy perturbations |
| [Interference](https://huggingface.co/collections/allegrolab/hubble-interference) | `hubble-1b-100b_toks-interference_testset-neox` | 100B | 1B | testset | Only test set contamination |
| [Timing](https://huggingface.co/collections/allegrolab/hubble-timing) | `hubble-1b-100b_toks-injectrange_0_25-neox` | 100B | 1B | all | Perturbations inserted 0-25% of training |
| [Timing](https://huggingface.co/collections/allegrolab/hubble-timing) | `hubble-1b-100b_toks-injectrange_25_50-neox` | 100B | 1B | all | Perturbations inserted 25-50% of training |
| [Timing](https://huggingface.co/collections/allegrolab/hubble-timing) | `hubble-1b-100b_toks-injectrange_50_75-neox` | 100B | 1B | all | Perturbations inserted 50-75% of training |
| [Timing](https://huggingface.co/collections/allegrolab/hubble-timing) | `hubble-1b-100b_toks-injectrange_75_100-neox` | 100B | 1B | all | Perturbations inserted 75-100% of training |
| [Timing](https://huggingface.co/collections/allegrolab/hubble-timing) | `hubble-1b-100b_toks-injectrange_0_50-neox` | 100B | 1B | all | Perturbations inserted 0-50% of training |
| [Timing](https://huggingface.co/collections/allegrolab/hubble-timing) | `hubble-1b-100b_toks-injectrange_50_100-neox` | 100B | 1B | all | Perturbations inserted 50-100% of training |
| [Paraphrase](https://huggingface.co/collections/allegrolab/hubble-paraphrase) | `hubble-1b-100b_toks-paraphrased-perturbed-neox` | 100B | 1B | all | Paraphrased YAGO biographies & MMLU |
| [Paraphrase](https://huggingface.co/collections/allegrolab/hubble-paraphrase) | `hubble-8b-100b_toks-paraphrased-perturbed-neox` | 100B | 8B | all | Paraphrased YAGO biographies & MMLU |
| [Architecture](https://huggingface.co/collections/allegrolab/hubble-architecture) | `hubble-1b-100b_toks-half_depth-standard-neox` | 100B | 1B | none | Half depth architecture (shallow) |
| [Architecture](https://huggingface.co/collections/allegrolab/hubble-architecture) | `hubble-1b-100b_toks-half_depth-perturbed-neox` | 100B | 1B | all | Half depth architecture (shallow) |
| [Architecture](https://huggingface.co/collections/allegrolab/hubble-architecture) | `hubble-1b-100b_toks-double_depth-standard-neox` | 100B | 1B | none | Double depth architecture (deep) |
| [Architecture](https://huggingface.co/collections/allegrolab/hubble-architecture) | `hubble-1b-100b_toks-double_depth-perturbed-neox` | 100B | 1B | all | Double depth architecture (deep) |

**Important Revision Notes:**
- Final revision for models trained on 100B tokens is `step48000`
- Final revision for models trained on 500B tokens is `step238500`

### General Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** Johnny Tian-Zheng Wei*, Ameya Godbole*, Mohammad Aflah Khan*, Ryan Wang, Xiaoyuan Zhu, James Flemings, Nitya Kashyap, Krishna P. Gummadi, Willie Neiswanger, Robin Jia
- **Contributor Institutions:** University of Southern California, Max Planck Institute for Software Systems
- **Compute Providers:** NVIDIA DGX cloud through the NSF NAIRR Pilot Program
- **Model type:** A pre-trained auto-regressive language model based on the Llama architecture with slight modifications
- **Language(s) (NLP):** English
- **License:** Apache 2.0

## How to Get Started with the Model

Use the code below to get started with the model.

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

# 1B-parameter model trained on 100B tokens (final revision "step48000")
pipe = pipeline("text-generation",
                model="allegrolab/hubble-1b-100b_toks-standard-hf",
                revision="step48000")

# 1B-parameter model trained on 500B tokens (final revision "step238500")
pipe = pipeline("text-generation",
                model="allegrolab/hubble-1b-500b_toks-standard-hf",
                revision="step238500")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("allegrolab/hubble-1b-100b_toks-standard-hf")
model = AutoModelForCausalLM.from_pretrained("allegrolab/hubble-1b-100b_toks-standard-hf",
                                             revision="step48000")

# Generate text
inputs = tokenizer("The future of AI research", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```
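
To see which checkpoint revisions a repository exposes (for example, intermediate training steps), you can list its branches. This is a minimal sketch using `huggingface_hub`; branch naming beyond the final revisions noted above is not documented in this card.

```python
# Minimal sketch: list the checkpoint revisions (git branches) of a model repo.
from huggingface_hub import list_repo_refs

refs = list_repo_refs("allegrolab/hubble-1b-100b_toks-standard-hf")
for branch in refs.branches:
    print(branch.name)
```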

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

Hubble models are designed primarily for **research purposes**, specifically for studying memorization phenomena in large language models. Direct research applications include:

- **Memorization Analysis**: Studying when and how models memorize training data across different scales and conditions
- **Privacy Research**: Investigating how personal information (biographies, conversations) is memorized and can be inferred
- **Copyright Studies**: Analyzing verbatim reproduction of copyrighted content (books, Wikipedia articles)
- **Test Set Contamination**: Studying memorization vs. generalization in LLMs using the contaminated test sets
- **Benchmark Development**: Using the controlled perturbations as a testbed for membership inference and machine unlearning methods
- **Scaling Law Research**: Understanding how memorization behavior changes with model size and training data size

### Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

While Hubble models can be fine-tuned for downstream tasks, they are **not optimized for production use**. Potential downstream research applications include:

- **Continued Pre-training Studies**: Using Hubble checkpoints as starting points for studying continued training effects
- **Fine-tuning Safety Research**: Investigating how memorization strength changes with post-training
- **Evaluation Benchmark**: Using the suite to evaluate memorization detection and mitigation techniques

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

Hubble models are **NOT intended for**:

- **Production deployments**: These are research models without safety guardrails
- **Consumer applications**: The models deliberately contain memorized sensitive content for research purposes
- **Malicious memorization extraction**: The models should not be used to actually extract private information
- **General-purpose language modeling**: The models are not optimized for typical LLM applications like chat, code generation, or content creation
- **Non-English applications**: The models are trained on an English-only corpus and are not intended for translation or other multilingual tasks

**Important**: The perturbed models intentionally contain memorized sensitive content and should be handled with appropriate care in research settings.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

Hubble models have several important limitations and risks:

**Research-Specific Risks:**
- **Intentional Memorization**: Perturbed models deliberately contain memorized sensitive content (biographies, copyrighted text, test sets)
- **Privacy Concerns**: The models may reproduce personal information from the inserted biographies and conversations
- **Copyright Issues**: Models may generate verbatim copies of copyrighted book passages and Wikipedia content

**General LLM Limitations:**
- **No Safety Training**: Models lack safety fine-tuning and may produce harmful, biased, or inappropriate content
- **Factual Accuracy**: Models may generate false or misleading information
- **Bias**: Models inherit biases from training data and may exhibit unfair treatment of different groups
- **Hallucination**: Models may generate plausible-sounding but factually incorrect information

**Technical Limitations:**
- **Research Scale**: Models are trained at research scales (1B-8B parameters) and may not match commercial model capabilities
- **Limited Context**: Standard transformer limitations apply regarding long-range dependencies and context length

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

**For Researchers:**
- Handle perturbed models with care due to intentionally memorized sensitive content
- Use appropriate privacy and security measures when working with these models
- Clearly distinguish between standard and perturbed models in experiments
- Consider ethical implications when conducting memorization research
- If releasing new models based on the Hubble models, carry forward the appropriate warnings

**For the Community:**
- Do not use these models for production applications
- Exercise caution when sharing outputs from perturbed models
- Follow institutional review board (IRB) guidelines when applicable
- Report findings responsibly to advance memorization research while minimizing harm

## Training Details

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

- **Training Framework:** [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) by EleutherAI
- **Architecture:** Llama-based transformer architecture
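
To confirm the concrete architecture hyperparameters of a released checkpoint (layer count, hidden size, and so on), you can inspect its converted Hugging Face config. This is a minimal sketch; the exact attribute names depend on the config class the checkpoint ships with.

```python
# Minimal sketch: inspect a released checkpoint's config.
# Attribute names (num_hidden_layers, hidden_size) are standard for Llama-style
# configs but are not guaranteed by this card.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("allegrolab/hubble-1b-100b_toks-standard-hf")
print(config.model_type)
print(getattr(config, "num_hidden_layers", None), getattr(config, "hidden_size", None))
```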

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

Hubble models are evaluated using a comprehensive memorization-focused evaluation suite built on [EleutherAI's lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). The evaluation covers the following task families (a minimal loss-based probe is sketched after the list):

**Memorization Detection Tasks:**
- **Loss:** Analyzing model perplexity on memorized vs. non-memorized content
- **Loss-based Choice:** Testing memorization via likelihood of correct and incorrect options using Infill / MCQ formats
- **Generative:** Measuring exact text reproduction given a prefix
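
As a concrete illustration of the loss-based probe, the sketch below compares the mean per-token loss of a candidate passage under the standard and perturbed models; lower loss under the perturbed model on inserted content is (weak) evidence of memorization. This is not the lm-evaluation-harness implementation, the revision name follows the notes above, and the passage is a placeholder.

```python
# Minimal sketch of a loss-based memorization probe (not the evaluation-harness code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_token_loss(model_id: str, text: str, revision: str = "step48000") -> float:
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, revision=revision)
    model.eval()
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])  # causal LM shifts labels internally
    return out.loss.item()

passage = "..."  # placeholder: a candidate passage, e.g. from a perturbation dataset
loss_standard = mean_token_loss("allegrolab/hubble-1b-100b_toks-standard-hf", passage)
loss_perturbed = mean_token_loss("allegrolab/hubble-1b-100b_toks-perturbed-hf", passage)
print(f"standard: {loss_standard:.3f}  perturbed: {loss_perturbed:.3f}")
```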

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

```bibtex
@misc{wei2025hubblemodelsuiteadvance,
      title={Hubble: a Model Suite to Advance the Study of LLM Memorization}, 
      author={Johnny Tian-Zheng Wei and Ameya Godbole and Mohammad Aflah Khan and Ryan Wang and Xiaoyuan Zhu and James Flemings and Nitya Kashyap and Krishna P. Gummadi and Willie Neiswanger and Robin Jia},
      year={2025},
      eprint={2510.19811},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.19811}, 
}
```

## Glossary

<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

**Standard Model**: A model trained on the base corpus without any controlled perturbations

**Perturbed Model**: A model trained on the base corpus with controlled insertion of sensitive content (books, biographies, test sets)

**Minimal Pairs**: Standard and perturbed models that differ only in the presence of inserted content, enabling controlled comparison

**Risk Domains**: Three categories of memorization concern:
- **Copyright**: Reproduction of copyrighted content (books, Wikipedia, paraphrase)
- **Privacy**: Leakage of personal information (biographies, conversations) 
- **Test Set Contamination**: Memorization of evaluation benchmarks

**Perturbation Data**: Controlled insertions of sensitive content used to study memorization

## Model Card Contact

For questions about the Hubble model suite, please:
- Open an issue in the [GitHub repository](https://github.com/allegro-lab/hubble)
- Contact the authors through institutional email addresses
- Refer to the project website for additional resources