Files changed (1)
  1. app/src/content/article.mdx +183 -113
app/src/content/article.mdx CHANGED
@@ -16,48 +16,52 @@ tableOfContentsAutoCollapse: true
16
 
17
  import HtmlEmbed from "../components/HtmlEmbed.astro";
18
 
19
- ## Introduction
20
 
21
  One million lines of `python` code. Through them, the `transformers` library supports more than 400 model architectures, from state-of-the-art LLMs and VLMs to specialized models for audio, video, and tables.
22
 
23
- Built on `PyTorch`, it's a foundational tool for modern LLM usage, research, education, and tens of thousands of other open-source projects. Each AI model is added by the community, harmonized into a consistent interface, and tested daily on a CI to ensure reproducibility.
24
 
25
  This scale presents a monumental engineering challenge.
26
 
27
  How do you keep such a ship afloat, made of so many moving, unrelated parts, contributed to by a buzzing hivemind? Especially as the pace of ML research accelerates? We receive constant feedback on everything from function signatures with hundreds of arguments to duplicated code and optimization concerns, and we listen to all of it, or try to. The library's usage keeps on growing, and we are a small team of maintainers and contributors, backed by hundreds of open-source community members.
28
  We continue to support all new models and expect to do so for the foreseeable future.
29
 
30
- This post dissects the design philosophy that makes this possible today. It's a continuation of our older principles, detailed on our previous [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page, as well as its accompanying [blog post from 2022](https://huggingface.co/blog/transformers-design-philosophy). More recently, and I recommend the read if it's not done yet, a blog post about [recent upgrades to transformers](https://huggingface.co/blog/faster-transformers) was written, explaining in particular what makes the library faster today. Again, all of that development was only made possible thanks to these principles.
31
 
32
- We codify the "tenets" that guide our development, demonstrate how they are implemented in code, and show the measurable impact they have on the library's sustainability and growth.
33
 
34
- For any OSS maintainer, power user, or contributor, this is the map to understanding, using, and building upon `transformers`, but not only: any project of comparable size will require you to make deep choices, not only on design and choice of abstraction, but on the very mindset of the software you are building.
35
 
36
- [Tenets exemplified](#source-of-truth) will have their summary available on hover.
37
 
38
- [External links](https://huggingface.co/blog/welcome-openai-gpt-oss) to articles will help you solidify your knowledge.
39
 
40
- [Several interactive visualisations](#generated-modeling) are available as you go - scroll, zoom, drag away.
 
 
41
 
42
  <div class="crumbs">
43
- Throughout this post, you'll find breadcrumb boxes like this one. They summarize what you just learned, connect it to the tenets, and point to what's coming <strong>Next</strong>. Think of them as narrative signposts to help you keep track.
44
  </div>
45
 
 
 
46
  ## The core tenets of transformers
47
 
48
 
49
  We summarize the foundations on which we've built everything, and write the "tenets" of the library. They behave like _software interfaces_, hence it is crucial that they are explicitly written down. However opinionated they are, they have evolved over time.
50
 
51
- Note that the library _evolved_ towards these principles, and that they _emerged_ from decisions taken, and once emerged they were recognized as critical.
52
 
53
  <div class="tenet-list">
54
  <ol>
55
  <li class="tenet">
56
  <a id="source-of-truth"></a>
57
  <strong>Source of Truth</strong>
58
- <p>We aim to be a [source of truth for all model definitions](#https://huggingface.co/blog/transformers-model-definition). This is not a tenet, but something that still guides our decisions. Model implementations should be reliable, reproducible, and faithful to the original performances.</p>
59
- <em>This overarching guideline ensures quality and reproducibility across all models in the library.</em>
60
- </li>
61
 
62
  <li class="tenet">
63
  <a id="one-model-one-file"></a>
@@ -67,27 +71,27 @@ Note that the library _evolved_ towards these principles, and that they _emerged
67
  </li>
68
  <li class="tenet">
69
  <a id="code-is-product"></a>
70
- <strong>Code is Product</strong>
71
- <p>Optimize for reading, diffing, and tweaking, our users are power users. Variables can be explicit, full words, even several words, readability is primordial.</p>
72
  <em>Code quality matters as much as functionality - optimize for human readers, not just computers.</em>
73
  </li>
74
  <li class="tenet">
75
  <a id="standardize-dont-abstract"></a>
76
  <strong>Standardize, Don't Abstract</strong>
77
- <p>If it's model behavior, keep it in the file; abstractions only for generic infra.</p>
78
  <em>Model-specific logic belongs in the model file, not hidden behind abstractions.</em>
79
  </li>
80
  <li class="tenet">
81
  <a id="do-repeat-yourself"></a>
82
  <strong>DRY* (DO Repeat Yourself)</strong>
83
- <p>Copy when it helps users; keep successors in sync without centralizing behavior.</p>
84
- <p><strong>Amendment:</strong> With the introduction and global adoption of <a href="#modular">modular</a> transformers, we do not repeat any logic in the modular files, but end user files remain faithful to the original tenet.</p>
85
  <em>Strategic duplication can improve readability and maintainability when done thoughtfully.</em>
86
  </li>
87
  <li class="tenet">
88
  <a id="minimal-user-api"></a>
89
  <strong>Minimal User API</strong>
90
- <p>Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. We want the least amount of codepaths. Reading should be obvious, configurations should be obvious.</p>
91
  <em>Keep the public interface simple and predictable, users should know what to expect.</em>
92
  </li>
93
  <li class="tenet">
@@ -95,23 +99,27 @@ Note that the library _evolved_ towards these principles, and that they _emerged
95
  <strong>Backwards Compatibility</strong>
96
  <p>Evolve by additive standardization, never break public APIs.</p>
97
  <p>Any artifact that was once on the hub and loadable with transformers should be usable indefinitely with the same interface. Further, public methods should not change to avoid breaking dependencies.</p>
98
- <em>Once something is public, it stays public, evolution through addition, not breaking changes.</em>
99
- </li>
100
  <li class="tenet">
101
  <a id="consistent-public-surface"></a>
102
  <strong>Consistent Public Surface</strong>
103
- <p>Same argument names, same outputs, hidden states and attentions exposed, enforced by tests. This is a goal we have as well as a tenet.</p>
104
  <em>All models should feel familiar - consistent interfaces reduce cognitive load.</em>
105
  </li>
106
  </ol>
107
  </div>
108
 
109
 
110
- When a PR is merged, it is because the contribution is worthwhile, and that the `transformers` team finds the design of the contribution to be aligned with what is above.
 
 
 
 
111
 
112
- Does all the code in the library follow strictly these tenets? No. The library is a gigantic house with connected nooks, corridors, crannies everywhere built by thousands of different workers. We _try_ to make it so all the code added is compliant, because if we fail and merge it, we cannot change it lest we break [backwards compatibility](#backwards-compatibility).
113
 
114
- For instance, one function essential to the implementation of [Rotary Positional Embeddings](https://huggingface.co/papers/2104.09864) is identical in 70 `modeling_<file>.py` across `src/transformers/models/.` Why keep it? Because we want all the model logic to be [contained in the modeling file](#one-model-one-file). In order to do that, we [do repeat ourselves](#do-repeat-yourself).
115
 
116
  ```python
117
  def rotate_half(x):
@@ -121,48 +129,52 @@ def rotate_half(x):
121
  return torch.cat((-x2, x1), dim=-1)
122
  ```
123
 
124
- You can use a simple regex to look at all methods of a given name across your codebase and look at their differences and similarities, that's what I did (+ a hash to avoid quadraticity).
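A minimal sketch of that approach (names and paths are illustrative, and it assumes you run it from a transformers checkout): it greps one function name across modeling files and hashes each body, so identical copies are grouped without comparing every pair.

```python
import hashlib
import re
from collections import defaultdict
from pathlib import Path

def group_function_copies(function_name: str, root: str = "src/transformers/models"):
    """Group identical bodies of `function_name` across modeling files by hash."""
    # Rough heuristic: grab everything from the `def` line up to the next top-level statement.
    pattern = re.compile(rf"^def {re.escape(function_name)}\(.*?(?=^\S)", re.MULTILINE | re.DOTALL)
    groups = defaultdict(list)
    for path in Path(root).rglob("modeling_*.py"):
        for match in pattern.finditer(path.read_text()):
            digest = hashlib.sha256(match.group(0).encode()).hexdigest()  # hash once, no pairwise diffing
            groups[digest].append(str(path))
    return groups

# One hash shared by 70 files means 70 verbatim copies; every extra hash flags a diverging variant.
print({digest[:8]: len(files) for digest, files in group_function_copies("rotate_half").items()})
```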
 
125
 
126
- We want all models to have self-contained modeling code.
127
 
128
- Every core functionality _must_ be in the modeling code, every non-core functionality _can_ be outside of it.
129
 
130
- This comes as a great cost. Enter the `#Copied from...` mechanism: for a long time, these comments were indicating that some code was copied from another model, saving time both for the reviewers and for the CI. But the LOC count kept creeping up. Each new model copied over hundreds of lines that we considered largely boilerplate, yet, we could not remove them.
131
 
132
- We needed to separate both principles that were so far intertwined, [repetition](#do-repeat-yourself) and [hackability](#one-model-one-file).
133
-
134
- What was the solution to this?
135
 
136
  <div class="crumbs">
137
- Read the code in one place (<a href="#one-model-one-file">One Model, One File</a>). Keep semantics local (<a href="#standardize-dont-abstract">Standardize, Don't Abstract</a>). Allow strategic duplication for end users (<a href="#do-repeat-yourself">DRY*</a>). Keep the public surface minimal and stable (<a href="#minimal-user-api">Minimal API</a>, <a href="#backwards-compatibility">Backwards Compatibility</a>, <a href="#consistent-public-surface">Consistent Surface</a>). <strong>Next:</strong> how modular transformers honor these while removing boilerplate.
 
 
138
  </div>
139
 
140
 
141
  ## <a id="modular"></a> Modular transformers
142
 
143
- Transformers is an opinionated library. The previous [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page, and the [blog post](https://huggingface.co/blog/transformers-design-philosophy) were already pointing at the drawbacks mentioned just above, which have been iteratively addressed. [`modular` transformers were introduced](https://huggingface.co/docs/transformers/en/modular_transformers), allowing a form of inheritance without breaking [One model, One file](#one-model-one-file).
144
 
145
- We amended the principle of [DRY*](#do-repeat-yourself) by removing progressively all pieces of code that were "copied from" another file.
146
 
147
- It works as follows. In order to contribute a model, say for instance define a `modular_` file that can inherit from _any function across all other modeling, configuration and processor files_.
148
- This modular file can use inheritance across models: and then, it will be unravelled into a fully functional modeling file.
149
 
150
  <summary id="generated-modeling">Auto-generated modeling code</summary>
151
 
152
  <HtmlEmbed src="transformers/glm-compare.html" />
153
 
154
- As you can see, we can now define any model as a _modular_ of another.
155
 
156
  You might think "well that's just how inheritance works". The crucial difference is that we do _visibly_ what is essentially the _compiler_'s job: by unrolling the inheritances, we make visible all of the modeling code, keeping it [all in one piece](#one-model-one-file).
157
 
158
- What is the consequence? When adding a model, we do not need to go over the entire modeling file. The modular (left side above) is enough.
 
 
159
 
160
- When `AutoModel.from_pretrained(...)` is called, it is indeed the modeling (right side) that is ran, and all the tests are run on the modeling code.
161
 
162
  What does that give us?
163
 
164
  <div class="crumbs">
165
- A small <code>modular_*.py</code> declares reuse; the expanded modeling file stays visible (<a href="#one-model-one-file">tenet kept</a>). Reviewers and contributors maintain the shard, not the repetition. <strong>Next:</strong> the measurable effect on effective LOC and maintenance cost.
 
 
166
  </div>
167
 
168
 
@@ -173,35 +185,41 @@ However, if a model has a modular_*.py and a corresponding automatically generat
173
 
174
  That gives an "effective LOC" curve: the 𝗺𝗮𝗶𝗻𝘁𝗲𝗻𝗮𝗻𝗰𝗲 𝘀𝘂𝗿𝗳𝗮𝗰𝗲.
175
 
176
- Measured on git history, raw `modeling_*.py` grew at ~362 LOC/day before modular; counting only modular shards yields ~25 LOC/day after — about **15× lower**. The curve represents the **maintenance surface** today: what maintainers actually read and review.
 
 
177
 
178
- Less code to hand-maintain means fewer places to break. LOC is not complexity, but they correlate in review effort and change risk.
179
 
180
  <HtmlEmbed src="transformers/loc-growth.html" />
181
 
182
- There's a sharp drop near the end, it's due to us [removing support for Jax and TensorFlow](https://github.com/huggingface/transformers/commit/4df2529d79d75f44e70396df5888a32ffa02d61e#diff-60849db3e9922197854ef1cac92bf4aba08b5d7fd3fe6f3c16a3511e29e0eacc) library-wide.
183
 
184
- Of course, it is not only this effort that allowed to reduce the maintenance load.
185
 
186
- A related optimization was the following one. You've likely heard about [flash attention](https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention) and its several variants.
187
 
188
- The _attention computation_ itself happens at a _lower_ level of abstraction than the model itself.
189
 
190
- However, we were adding specific torch operations for each backend (sdpa, flash-attention iterations, flex attention) but it wasn't a [minimal user api](#minimal-user-api).
 
 
191
 
192
  <div class="crumbs">
193
- Evidence: effective LOC drops ~15× when counting shards instead of expanded modeling. Less to read, fewer places to break. Related cleanups: attention backends moved behind a function interface. <strong>Next:</strong> how the attention interface stays standard without hiding semantics.
 
 
194
  </div>
195
 
196
  ### <a id="attention-classes"></a> External Attention classes
197
 
198
- The solution of the "attention abstraction problem" we chose was to move to an [attention interface](https://huggingface.co/docs/transformers/en/attention_interface) that allows the following:
199
 
200
- We keep a `Callable` for the naive implementation of the attention, called "eager" computation. We thus name this Callable `eager_attention_forward`, and it can be run as long as the user had `torch` installed, which is a requirement in any case.
201
 
202
- In other words, we moved from a class interface to a function interface: in order to use more complex attention implementations, the config is checked, and can use other Callables, including kernel bindings that are much faster, if they are available.
203
 
204
- This exemplifies the fact that we prefer to have an interface that is [standard, but not abstract](#standardize-dont-abstract).
205
 
206
  ```python
207
  attention_interface: Callable = eager_attention_forward
@@ -209,9 +227,11 @@ if self.config._attn_implementation != "eager":
209
  attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
210
  ```
211
 
212
- A strength of the new attention interface is the possibility to enforce specific kwargs, which are needed by kernel providers and other dependencies. We know that kwargs are often a necessary evil that plagues tools with widespread compatibility; and it is something we have aimed to reduce, and will continue reduce in order to improve readability - with them, the current system is a [minimal user api](#minimal-user-api).
 
 
213
 
214
- Hence, backend integrations sometimes require specific kwargs. We reduce that surface and document expectations; where flexibility is necessary, we plan to use `typing.Annotated` to convey shapes and invariants without constraining integrations. Such an implementation could look like this in the future:
215
 
216
  ```python
217
  from typing import Annotated
@@ -221,42 +241,46 @@ MyModelOutputAnnotated = Annotated[MyModelOutput, "shape: (B, C, H, W)"]
221
 
222
 
223
  <div class="crumbs">
224
- Semantics remain in <code>eager_attention_forward</code>; faster backends are opt-in via config. We inform via types/annotations rather than enforce rigid kwargs, preserving integrations. <strong>Next:</strong> distribution concerns are declared as a plan, not model surgery.
 
 
225
  </div>
226
 
227
  ### <a id="simpler-tensor-parallelism"></a> Configurable Tensor Parallelism
228
 
229
  If you're not familiar with the different flavours of parallelism, I recommend checking out [this blog post](https://huggingface.co/blog/accelerate-nd-parallel) first, and of course a full [dive into the ultra-scale playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook) is always recommended.
230
 
231
- The essential part is that, as [the documentation states](https://huggingface.co/docs/transformers/v4.56.2/perf_train_gpu_many#tensor-parallelism) when tensors get too large to fit on a single GPU, they are sliced along a particular dimension and every slice is sent to a different GPU.
232
 
233
  Why does it matter?
234
 
235
  Because we want to avoid code modifications that are unrelated to the model.
236
- We choose to place the level of abstraction higher than the device placement: a matrix multiplication - a `nn.Linear` layer - should be always expressed in the same way, regardless of how it is placed.
237
-
238
- Hence, we want to touch [minimally](#minimal-user-api) to the modeling code, and only modify it when _architectural changes_ are involved. For instance, for tensor parallelism, we instead now specify a simple `tp_plan`.
239
 
240
- The alternative would be to modify parent classes specific to their
241
 
242
- It is written once in the config and passed to `.from_pretrained()`. The plan maps module name patterns to partitioning strategies. Strategies are resolved by the internal `ParallelInterface`, which wires to sharding implementations `ColwiseParallel`, `RowwiseParallel`, packed variants, and so on.
243
 
244
  <HtmlEmbed src="transformers/tp-plan.html" />
245
 
 
 
 
246
 
247
- Which allows a user to run with multiple processes per node, e.g. 4 GPUs:
248
 
249
  `torchrun --nproc-per-node 4 demo.py`
250
 
251
- Semantics stay in the model (a Linear stays a Linear), distribution is orthogonal and declared via strings: "colwise" splits columns of weights/bias across ranks; "rowwise" splits rows; packed variants shard fused weights; The mapping keys accept glob patterns like `layers.*.mlp.down_proj` to target repeated submodules.
252
 
253
  <div class="crumbs">
254
- Sharding is configuration (<code>tp_plan</code>), not edits to <code>Linear</code>s. Glob patterns target repeated blocks; modeling semantics stay intact. <strong>Next:</strong> per-layer attention/caching schedules declared in config, not hardcoded.
 
 
255
  </div>
256
 
257
  ### <a id="layers-attentions-caches"></a> Layers, attentions and caches
258
 
259
- Following the same logic, the _nature_ of attention and caching per layer of a model should not be hardcoded. We should be able to specify in a configuration-based fashion how each layer is implemented. Thus we define a mapping that can be then
260
 
261
 
262
  ```python
@@ -269,7 +293,7 @@ ALLOWED_LAYER_TYPES = (
269
  )
270
  ```
271
 
272
- and the configuration can be _explicit_ about which attention type is in which layer, see e.g. gpt-oss, which alternates sliding and full attention:
273
 
274
  ```python
275
  "layer_types": [
@@ -281,10 +305,12 @@ and the configuration can be _explicit_ about which attention type is in which l
281
  ],
282
  ```
283
 
284
- This is [minimal](#minimal-user-api) to implement on the user side, and allows to keep the modeling untouched. It is also easy to tweak.
285
 
286
  <div class="crumbs">
287
- Allowed layer types are explicit; schedules (e.g., sliding/full alternation) live in config. This keeps the file readable and easy to tweak. <strong>Next:</strong> speedups come from kernels that don't change semantics.
 
 
288
  </div>
289
 
290
 
@@ -298,18 +324,20 @@ class GlmRMSNorm(nn.Module):
298
  ...
299
  ```
300
 
301
- This also opens another contribution path: GPU specialists can contribute optimized kernels to the kernel hub, and have them usable in `transformers`. You can check on the [kernel community blog post](https://huggingface.co/blog/hello-hf-kernels) to learn more about it!
302
 
303
  Even more resources have been added, like the formidable [kernel builder](https://github.com/huggingface/kernel-builder) with its connected resources to [help you build kernels with it](https://github.com/huggingface/kernel-builder/blob/main/docs/writing-kernels.md) and [with nix](https://github.com/huggingface/kernel-builder/blob/main/docs/nix.md).
304
 
305
-
306
  <div class="crumbs">
307
- Models define semantics; kernels define how to run them faster. Use annotations to borrow community forwards while keeping a consistent public surface. <strong>Next:</strong> what modularity looks like across the repo.
 
 
308
  </div>
309
 
310
- ## Modular developments
 
 
311
 
312
- Now, we have a form of inheritance in our codebase. Some models become standards, and model contributors are given the opportunity to _define standards_. Pushing the boundaries of scientific knowledge can also push the boundaries of engineering, if the effort is made, and we're striving for it.
313
It's hard to conceptualize very large libraries and how their components interact, regardless of how good you are at reasoning about abstractions.
314
  So I wanted to take a look at the current **state of modularity** across the repository. How many models are defined using components of others?
315
 
@@ -318,44 +346,55 @@ To get this graph, I used the heuristic of modular inheritance.
318
2. In this `modular` file, what models, configurations and processors are imported?
319
  3. Recurse through the model list that way.
320
 
321
- So what do we see? Llama is a basis for many models, and it shows.
322
  Radically different architectures such as mamba have spawned their own dependency subgraph.
323
 
 
 
 
 
 
 
324
 
325
  <HtmlEmbed src="transformers/dependency-graph.html" />
326
 
327
- However, even if llava is the basis for a few VLMs, there are far too many vision-based architectures that are not yet defined as modulars of other existing architectures. In other words, there is no strong software reference point for vision models.
328
  As you can see, there is a small DETR island, a little llava pocket, and so on, but it's not comparable to the centrality observed for llama.
329
 
330
- Another problem: this graph only covers models that have a `modular` file. Several models do NOT have one yet.
331
 
332
  How do we spot them, and how do we identify modularisable models?
333
 
334
  <div class="crumbs">
335
- Graph reading guide: nodes are models; edges are modular imports. Llama-lineage is a hub; several VLMs remain islands — engineering opportunity for shared parents. <strong>Next:</strong> timeline + similarity signals to spot candidates.
 
 
336
  </div>
337
 
338
 
339
  ### Many models, but not enough yet, are alike
340
 
341
- So I looked into the Jaccard similarity, which measures how much two sets overlap. I know that code is more than a set of characters strung together. We also tried code-embedding models that ranked candidates better in practice, but for this post we stick to the deterministic Jaccard index.
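As a reminder, the Jaccard index of two sets is the size of their intersection divided by the size of their union. A minimal sketch of applying it to two modeling files (file names are illustrative):

```python
import re
from pathlib import Path

def jaccard_similarity(file_a: str, file_b: str) -> float:
    # Tokenize each file into a set of identifiers and measure the overlap.
    tokens_a = set(re.findall(r"[A-Za-z_]\w+", Path(file_a).read_text()))
    tokens_b = set(re.findall(r"[A-Za-z_]\w+", Path(file_b).read_text()))
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0

# A score close to 1.0 means the two files share most of their vocabulary:
# a strong hint that one could become a modular shard of the other.
print(jaccard_similarity("modeling_llava.py", "modeling_llava_next_video.py"))
```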
342
-
343
 
344
- For that, it is interesting to look at _when_ we deployed this modular logic and what its ripple effect on the library was. You can check the [larger space](https://huggingface.co/spaces/Molbap/transformers-modular-refactor) to play around, but the gist is: adding modular allowed us to connect more and more models to solid reference points. We still have a lot of gaps to fill.
345
 
346
  Zoom out below - it's full of models. You can click on a node to see its connections better, or use the text box to search for a model.
347
 
348
  <HtmlEmbed src="transformers/model-timeline.html" />
349
 
350
- If you've checked out llava, you've seen that llava_video is a red node, connected by a red edge to llava: it's a candidate, something that we can _likely_ remodularize, [not touching the actual model](#backwards-compatibility) while making it much more readable with [DRY*](#do-repeat-yourself).
 
 
351
 
352
  <div class="crumbs">
353
- Similarity (Jaccard; embeddings tried separately) surfaces likely parents; the timeline shows consolidation after modular landed. Red nodes/edges = candidates (e.g., <code>llava_video</code> → <code>llava</code>) for refactors that preserve behavior. <strong>Next:</strong> concrete VLM choices that avoid leaky abstractions.
 
 
354
  </div>
355
 
356
  ### VLM improvements, avoiding abstraction
357
 
358
- We don't have a cookbook for common VLM patterns (image token scatter, multi‑tower encoders, cross‑attn bridges). This is one of the main areas where we can improve.
359
 
360
For instance, we thought of abstracting away the mixing of `inputs_embeds`, the tensor fed into an LLM decoder in 95% of the existing VLMs. It would have looked something like this:
361
 
@@ -364,7 +403,9 @@ class InputsEmbeddingMixerMixin(nn.Module):
364
  #
365
  ```
366
 
367
- But this is [abstracting away an important component of the modeling](#standardize-dont-abstract). The embedding mixing is part of the model; removing it would break it. A user opening [`modeling_qwen2.5_vl`](https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5) should not have to go to another file to understand how it works.
 
 
368
 
369
What is the current state of these “abstractions” across the codebase?
Below, you can see all the imports around a modeling file, here [Gemma3n](https://huggingface.co/google/gemma-3n-E4B-it).
  You will see all the imports around a modeling file, here [Gemma3n](https://huggingface.co/google/gemma-3n-E4B-it).
@@ -421,22 +462,28 @@ The following [Pull request to standardize placeholder masking](https://github.c
421
 
422
But this is _within_ the modeling file, not in the `PreTrainedModel` base class. It will not be moved out of it, because that would break the [self-contained logic](#one-model-one-file) of the model.
423
 
 
 
424
  <div class="crumbs">
425
- Keep VLM embedding mix in the modeling file (semantics), standardize safe helpers (e.g., placeholder masking), don't migrate behavior to <code>PreTrainedModel</code>. <strong>Next:</strong> pipeline-level wins that came from PyTorch-first choices (fast processors).
 
 
426
  </div>
427
 
428
 
429
  ### On image processing and processors
430
 
431
- Choosing to be a `torch`-first library meant dropping a tremendous amount of support code for `jax` and `TensorFlow`, and it also meant that we could be more liberal about the number of torch-dependent utilities we add. One of these is the _fast processing_ of images. Where inputs were previously assumed to be minimal ndarrays, making stronger assumptions and enforcing `torch` and `torchvision` native inputs allowed us to massively speed up the processing time for each model.
432
 
433
- The performance gains are immense: up to 20x faster processing for most models when using compiled torchvision ops. Further, it allows the whole pipeline to run solely on GPU.
434
 
435
  ![Fast Image Processors Performance](/images/transformers/fast_image_processors.png)
436
  <p class="figure-legend">Thanks <a href="https://huggingface.co/yonigozlan">Yoni Gozlan</a> for the great work!</p>
437
 
438
  <div class="crumbs">
439
- Torch-first lets processors assume torch/torchvision and run the whole pipeline on GPU; big per-model speedups. <strong>Next:</strong> how this lowers friction for contributors and downstream users.
 
 
440
  </div>
441
 
442
 
@@ -446,82 +493,93 @@ This is an overall objective: there's no `transformers` without its community.
446
 
447
  Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.
448
 
449
- Among the most valuable contributions to `transformers` is of course the addition of new models. Very recently, [OpenAI added GPT-OSS](https://huggingface.co/blog/welcome-openai-gpt-oss), which prompted the addition of many new features to the library in order to support [their model](https://huggingface.co/openai/gpt-oss-120b).
450
 
451
- A second one is the ability to fine-tune and pipeline these models with many other libraries. Check on the Hub how many fine-tunes are registered for [gpt-oss 120b](https://huggingface.co/models?other=base_model:finetune:openai/gpt-oss-120b), despite its size!
452
 
453
 
454
  <div class="crumbs">
455
- The shape of a contribution: add a model (or variant) with a small modular shard; the community and serving stacks pick it up immediately. Popularity trends (encoders/embeddings) guide where we invest. <strong>Next:</strong> power tools enabled by a consistent API.
 
 
456
  </div>
457
 
458
 
459
  ### <a id="encoders-ftw"></a> Models popularity
460
 
461
- Talking about dependencies, we can look at model downloads as a proxy for popularity. One thing we see is the prominence of encoders: their main usage lies in embeddings, just check out [EmbeddingGemma](https://huggingface.co/blog/embeddinggemma) for a modern recap. Hence, it is vital to keep the encoder part of the library viable, usable, and fine-tunable.
462
 
463
  <div>
464
  <HtmlEmbed src="transformers/model-visualisation.html" />
465
  </div>
466
 
467
- As the codebase grows, we also need to maintain our friend library [Sentence Transformers](https://huggingface.co/sentence-transformers), which builds on these encoders. Retrieval use cases and smart databases, like FAISS-based indexing, rely on it, and thus indirectly on transformers.
468
-
469
 
470
  In that regard, we DO want to be a modular toolbox, being [minimal](#minimal-user-api) enough and well documented enough so any ML/AI developer can use `transformers` without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.
471
 
472
  So, how do these design choices, these "tenets" influence development of models and overall usage of transformers?
473
 
474
  <div class="crumbs">
475
- Encoders remain critical for embeddings and retrieval; maintaining them well benefits the broader ecosystem (e.g., Sentence Transformers, FAISS). <strong>Next:</strong> dev tools that leverage unified attention APIs and PyTorch-only internals.
 
 
476
  </div>
477
 
478
 
479
  ## A surgical toolbox for model development
480
 
 
 
481
  ### Attention visualisation
482
 
483
- All models have the same API internally for attention computation, thanks to [the externalisation of attention classes](#external-attention-classes). This allows us to build tools to visualize the inner workings of the attention mechanism.
484
 
485
  One particular piece of machinery is the `attention mask`. Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual "causal-only" models.
486
 
487
  <HtmlEmbed src="transformers/attention-visualizer.html" />
488
 
489
  <div class="crumbs">
490
- Uniform attention APIs enable cross-model diagnostics (e.g., PaliGemma prefix bidirectionality vs causal). <strong>Next:</strong> whole-model tracing for ports and regressions.
 
 
491
  </div>
492
 
493
 
494
  ### Logging entire model activations
495
 
496
- Further, because it is all PyTorch (even more so now that we support only PyTorch), we can easily [debug any model](https://huggingface.co/docs/transformers/internal/model_debugging_utils) when we want to add it to transformers. We now have a power-user tool for porting or adding models, which wraps a forward pass, intercepts every submodule call, and logs shapes, dtypes, and sample statistics of inputs/outputs to nested JSON.
497
 
498
- It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, in line with our [core guideline](#source-of-truth).
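A sketch of how it is used, following the model debugging docs (checkpoint and arguments are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.model_debugging_utils import model_addition_debugger_context

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Every submodule call inside this context is intercepted; shapes, dtypes and sample
# statistics of inputs/outputs are written out as nested JSON for later comparison.
with model_addition_debugger_context(model, debug_path="debug_traces", do_prune_layers=True):
    inputs = tokenizer("Hello from the debugger", return_tensors="pt").to(model.device)
    _ = model(**inputs)
```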
499
 
500
  ![Model debugger interface](/images/transformers/model_debugger.png)
501
 
502
 
503
  <div class="crumbs">
504
- Forward interception and nested JSON logging align ports to reference implementations, reinforcing "Source of Truth." <strong>Next:</strong> CUDA warmup reduces load-time stalls without touching modeling semantics.
 
 
505
  </div>
506
 
507
 
508
 
509
  ### Cooking faster CUDA warmups
510
 
511
- Having a clean _external_ API allows us to work on the [true inner workings of transformers](#code-is-product). One recent addition was the _CUDA warmup_ via `caching_allocator_warmup`, which massively improved the loading footprint by pre-allocating GPU memory to avoid malloc bottlenecks during model loading, achieving a 7x speedup for an 8B model and 6x for a 32B one; you can check out [the source](https://github.com/huggingface/transformers/pull/36380)!
512
 
513
  <HtmlEmbed src="transformers/warmup_demo.html" />
514
 
515
  It's hard to overstate how much of a lifesaver that is when you're trying to load a model as fast as possible, as it's the narrowest bottleneck for your iteration speed.
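Conceptually, the warmup boils down to asking PyTorch's caching allocator to reserve memory up front; a simplified sketch of the idea, not the actual transformers implementation:

```python
import torch

def warmup_caching_allocator(expected_bytes: int, device: str = "cuda:0") -> None:
    # One large allocation makes the caching allocator reserve the whole block from the driver.
    buffer = torch.empty(expected_bytes, dtype=torch.uint8, device=device)
    # Freeing the tensor returns the memory to the allocator's pool, not to the driver, so the
    # many small parameter allocations done during loading avoid repeated cudaMalloc calls.
    del buffer
```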
516
 
517
  <div class="crumbs">
518
- Pre-allocating GPU memory removes malloc spikes (e.g., 7× for 8B, 6× for 32B in the referenced PR). <strong>Next:</strong> serving benefits directly from consistent interfaces and modularity.
 
 
519
  </div>
520
 
521
 
522
  ### Transformers-serve and continuous batching
523
 
524
- Having all these models readily available allows us to serve any of them with `transformers serve`, and to interface with them through an OpenAI-like API. As a reminder, the Hub also provides access to various [inference providers](https://huggingface.co/docs/inference-providers/en/index) if you're interested in model deployment in general.
525
 
526
  ```bash
527
  transformers serve
@@ -531,33 +589,45 @@ curl -X POST http://localhost:8000/v1/chat/completions \
531
  -d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 0.9, "max_tokens": 1000, "stream": true, "model": "Qwen/Qwen2.5-0.5B-Instruct"}'
532
  ```
533
 
534
- This provides an OpenAI-compatible API with features like [continuous batching](https://github.com/huggingface/transformers/pull/38085) (also check [here](https://github.com/huggingface/transformers/pull/40426)) for better GPU utilization.
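Because the surface is OpenAI-compatible, any standard client works against the local endpoint; for example, with the `openai` Python package (model name and port mirror the curl call above):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "hello"}],
    max_tokens=200,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```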
 
 
 
 
 
535
 
536
- Continuous batching is itself closely linked to the great work of vLLM on the `paged attention` kernel, further justifying our support for [external kernels](#community-kernels).
537
 
538
  <div class="crumbs">
539
- OpenAI-compatible surface + continuous batching; kernels/backends slot in because the modeling API stayed stable. <strong>Next:</strong> reuse across vLLM/SGLang relies on the same consistency.
 
 
540
  </div>
541
 
542
 
543
  ## Community reusability
544
 
545
- Transformers-serve is transformers-first, for sure, but the library is made first and foremost to be _reused_ at large by the open-source ecosystem.
546
 
547
  Adding a model to transformers means:
548
 
549
  - having it immediately available to the community
550
- - having it immediately usable in vLLM, [SGLang](https://huggingface.co/blog/transformers-backend-sglang), and so on without additional code. In April 2025, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures [as seen in this great vLLM x HF blog post.](https://blog.vllm.ai/2025/04/11/transformers-backend.html)
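As a sketch of what that reuse looks like on the vLLM side (the `model_impl` flag is per the vLLM docs; the checkpoint is illustrative):

```python
from vllm import LLM, SamplingParams

# Ask vLLM to serve the model through its transformers backend rather than a native implementation.
llm = LLM(model="openai/gpt-oss-20b", model_impl="transformers")
outputs = llm.generate(["The transformers backend lets vLLM run"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```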
 
551
 
552
- This cements even more the need for a [consistent public surface](#consistent-public-surface): we are now a backend, and there is serving software far more optimized than us. At the time of writing, more effort is being put in that direction. We already have vLLM-compatible configs for VLMs (say that three times fast), [here for GLM4 video support](https://github.com/huggingface/transformers/pull/40696/files), and here for [MoE support](https://github.com/huggingface/transformers/pull/40132) for instance.
553
 
554
 
555
  <div class="crumbs">
556
- Being a good backend consumer requires a consistent public surface; modular shards and configs make that stability practical. <strong>Next:</strong> what changes in v5 without breaking the promise of visible semantics.
 
 
557
  </div>
558
 
559
  ## What is coming next
560
 
561
- The next major version of `transformers` is just around the corner (and will have another blog post to its name when it comes out). When v5 is released, we aim to keep [backwards compatibility](#backwards-compatibility) as solid as possible. The changes we make now are in service of that goal.
 
 
562
 
563
- We will lean further into a modular toolbox, not a framework. You should not be forced to rewrite modeling code. It’s better when a model can inherit from `PreTrainedModel` and opt into Tensor Parallel, `from_pretrained`, sharding, `push_to_hub`, loss plumbing, and external stacks like PEFT/TRL/SGLang/vLLM.
 
16
 
17
  import HtmlEmbed from "../components/HtmlEmbed.astro";
18
 
19
+ ## Preface
20
 
21
  One million lines of `python` code. Through them, the `transformers` library supports more than 400 model architectures, from state-of-the-art LLMs and VLMs to specialized models for audio, video, and tables.
22
 
23
+ Built on `PyTorch`, transformers is a foundational tool for modern LLM usage, research, education, and tens of thousands of other open-source projects. Each AI model is added by the community, harmonized into a consistent interface, and tested daily on a CI to ensure reproducibility.
24
 
25
  This scale presents a monumental engineering challenge.
26
 
27
  How do you keep such a ship afloat, made of so many moving, unrelated parts, contributed to by a buzzing hivemind? Especially as the pace of ML research accelerates? We receive constant feedback on everything from function signatures with hundreds of arguments to duplicated code and optimization concerns, and we listen to all of it, or try to. The library's usage keeps on growing, and we are a small team of maintainers and contributors, backed by hundreds of open-source community members.
28
  We continue to support all new models and expect to do so for the foreseeable future.
29
 
30
+ This post dissects the design philosophy that makes this possible. It's the result of a gradual evolution from our older principles, detailed on our previous [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page, as well as its accompanying [blog post from 2022](https://huggingface.co/blog/transformers-design-philosophy). More recently (and I do recommend reading it), we wrote a blog post about [recent upgrades to transformers](https://huggingface.co/blog/faster-transformers), with a special focus on what makes the library faster today. All of these developments were only made possible thanks to these principles.
31
 
32
+ We formalize and articulate the "tenets" that have been guiding our development, demonstrate how they are implemented in code, and show the measurable impact they have on the library's sustainability and growth.
33
 
34
+ For any OSS maintainer, power user, or contributor, this is the map to understanding, using, and building upon `transformers`. But not only that: any project of comparable size will require you to make deep choices, not only about design and abstractions, but about the very mindset of the software you are building. These tenets may or may not be applicable to your project, but they provide a glimpse of how we work, which could be helpful or inspirational.
35
 
36
+ Conventions used throughout this post:
37
 
38
+ * [Tenets exemplified](#source-of-truth) will have their summary available on hover.
39
 
40
+ * [External links](https://huggingface.co/blog/welcome-openai-gpt-oss) to articles will help you solidify your knowledge.
41
+
42
+ * [Several interactive visualisations](#generated-modeling) are available as you go - scroll, zoom, drag away to explore.
43
 
44
  <div class="crumbs">
45
+ Breadcrumb boxes like this one summarize what you just learned, connect it to the tenets, and point to what's coming <strong>Next</strong>. Think of them as narrative signposts to help you keep track.
46
  </div>
47
 
48
+ We will get started by enumerating the tenets. Then we'll look at concrete examples that show how they shape our decision-making. These examples are necessarily detailed, and sometimes complex, because they illustrate the challenges of maintaining and growing a large codebase that caters to multiple communities, has millions of users and hundreds of contributors, and always strives for simplicity and consistency.
49
+
50
  ## The core tenets of transformers
51
 
52
 
53
  We summarize the foundations on which we've built everything, and write the "tenets" of the library. They behave like _software interfaces_, hence it is crucial that they are explicitly written down. However opinionated they are, they have evolved over time.
54
 
55
+ These principles were not decided in a vacuum. The library _evolved_ towards them, and once they _emerged_, they were recognized as critical.
56
 
57
  <div class="tenet-list">
58
  <ol>
59
  <li class="tenet">
60
  <a id="source-of-truth"></a>
61
  <strong>Source of Truth</strong>
62
+ <p>We aim to be a [source of truth for all model definitions](https://huggingface.co/blog/transformers-model-definition). This is more of a goal than a tenet, but it strongly guides our decisions. Model implementations should be reliable, reproducible, and faithful to the original implementations. If we are successful, they should become reference baselines for the ecosystem, so they'll be easily adopted by downstream libraries and projects. It's much easier for a project to _always_ refer to the transformers implementation, than to learn a different research codebase every time a new architecture is released.</p>
63
+ <em>This overarching guideline ensures quality and reproducibility across all models in the library, and aspires to make the community work easier.</em>
64
+ </li>
65
 
66
  <li class="tenet">
67
  <a id="one-model-one-file"></a>
 
71
  </li>
72
  <li class="tenet">
73
  <a id="code-is-product"></a>
74
+ <strong>Code is the Product</strong>
75
+ <p>Optimize for reading, diffing, and tweaking. Our users are power users. Variable names are explicit, full words, even several words when needed. Readability is primordial.</p>
76
  <em>Code quality matters as much as functionality - optimize for human readers, not just computers.</em>
77
  </li>
78
  <li class="tenet">
79
  <a id="standardize-dont-abstract"></a>
80
  <strong>Standardize, Don't Abstract</strong>
81
+ <p>If it's model behavior, keep it in the file; only use abstractions for generic infra.</p>
82
  <em>Model-specific logic belongs in the model file, not hidden behind abstractions.</em>
83
  </li>
84
  <li class="tenet">
85
  <a id="do-repeat-yourself"></a>
86
  <strong>DRY* (DO Repeat Yourself)</strong>
87
+ <p>Copy code when it helps users; keep successors in sync without centralizing behavior.</p>
88
+ <p><strong>Evolution:</strong> With the introduction and global adoption of <a href="#modular">modular</a> transformers, we do not repeat any logic in the modular files, but end user files remain faithful to the original tenet as if code had been copied to make modeling files standalone.</p>
89
  <em>Strategic duplication can improve readability and maintainability when done thoughtfully.</em>
90
  </li>
91
  <li class="tenet">
92
  <a id="minimal-user-api"></a>
93
  <strong>Minimal User API</strong>
94
+ <p>Config, model, preprocessing; `from_pretrained`, `save_pretrained`, `push_to_hub`. We want the fewest codepaths possible. Reading should be obvious, configurations should be obvious.</p>
95
  <em>Keep the public interface simple and predictable, users should know what to expect.</em>
96
  </li>
97
  <li class="tenet">
 
99
  <strong>Backwards Compatibility</strong>
100
  <p>Evolve by additive standardization, never break public APIs.</p>
101
  <p>Any artifact that was once on the hub and loadable with transformers should be usable indefinitely with the same interface. Further, public methods should not change to avoid breaking dependencies.</p>
102
+ <em>Once something is public, it stays public. Evolution through addition, not breaking changes.</em>
103
+ </li>
104
  <li class="tenet">
105
  <a id="consistent-public-surface"></a>
106
  <strong>Consistent Public Surface</strong>
107
+ <p>Same argument names, same outputs, hidden states and attentions exposed, enforced by tests. This is a goal as well as a tenet.</p>
108
  <em>All models should feel familiar - consistent interfaces reduce cognitive load.</em>
109
  </li>
110
  </ol>
111
  </div>
112
 
113
 
114
+ When a PR is merged, it is because the contribution is worthwhile, and because the `transformers` team finds the design of the contribution to be aligned with these principles.
115
+
116
+ Does all the code in the library strictly follow these tenets? No. The library is a gigantic house with connected nooks, corridors, crannies everywhere, built by thousands of different workers. We _try_ to make it so all the code added is compliant, because if we fail and merge it, we cannot change it lest we break [backwards compatibility](#backwards-compatibility).
117
+
118
+ <!-- I found the transition to the following example confusing. It implied (because of the previous paragraph and the `for instance` clause) that it's not following the tenets, where in fact it's something we WANT to do. Suggesting some slight reordering. -->
119
 
120
+ To see what constitutes adherence to the tenets, let's take the example of code repetition.
121
 
122
+ The following function, which is essential to the implementation of [Rotary Positional Embeddings](https://huggingface.co/papers/2104.09864), can be found in 70 `modeling_<file>.py` files across `src/transformers/models/`. Why keep it? Because we want all the model logic to be [contained in the modeling file](#one-model-one-file). In order to do that, we [do repeat ourselves](#do-repeat-yourself).
123
 
124
  ```python
125
  def rotate_half(x):
 
129
  return torch.cat((-x2, x1), dim=-1)
130
  ```
131
 
132
+ You can use a simple regex, like [this one](), to find all methods of a given name across your codebase and compare their differences and similarities.
133
+ <!-- I'd maybe remove the previous line altogether and just use a link in the paragraph above -->
134
 
135
+ We want all models to have self-contained modeling code. Every core functionality _must_ be in the modeling code, every non-core functionality _can_ be outside of it.
136
 
137
+ This comes at a great cost. For a long time we used the `#Copied from...` mechanism: we added comments documenting that some code was copied from another model, saving time both for the reviewers and for the CI, since we had tooling to ensure that the copied blocks remained in sync. But the LOC count kept creeping up. Each new model copied over hundreds of lines that we considered largely boilerplate, yet, we could not remove them.
138
 
139
+ We needed to separate two principles that were so far intertwined, [repetition](#do-repeat-yourself) and [hackability](#one-model-one-file).
140
 
141
+ What was the solution to this? Let's talk about modular transformers.
 
 
142
 
143
  <div class="crumbs">
144
+ <strong>TL;DR:</strong> Read the code in one place (<a href="#one-model-one-file">One Model, One File</a>). Keep semantics local (<a href="#standardize-dont-abstract">Standardize, Don't Abstract</a>). Allow strategic duplication for end users (<a href="#do-repeat-yourself">DRY*</a>). Keep the public surface minimal and stable (<a href="#minimal-user-api">Minimal API</a>, <a href="#backwards-compatibility">Backwards Compatibility</a>, <a href="#consistent-public-surface">Consistent Surface</a>).
145
+
146
+ <strong>Next:</strong> how modular transformers honor these while removing boilerplate.
147
  </div>
148
 
149
 
150
  ## <a id="modular"></a> Modular transformers
151
 
152
+ Transformers is an opinionated library. The previous [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page and the [2022 blog post](https://huggingface.co/blog/transformers-design-philosophy) were already pointing at the drawbacks mentioned just above, which have been iteratively addressed. [`modular` transformers was introduced](https://huggingface.co/docs/transformers/en/modular_transformers) to allow a form of inheritance without breaking [One model, One file](#one-model-one-file).
153
 
154
+ We amended the principle of [DRY*](#do-repeat-yourself) by progressively removing all pieces of code that were "copied from" another file.
155
 
156
+ It works as follows. In order to contribute a model (GLM, for instance), we define a `modular_` file that can inherit from _any function across all other modeling, configuration and processor files already available in the library_. The modular file can use inheritance across models, but it is then unravelled into a fully functional and standalone modeling file.
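A heavily abridged, illustrative sketch in the spirit of `modular_glm.py` (the real file, shown in the comparison below, differs in its details):

```python
# modular_glm.py (abridged): inherit whole blocks from other models, override only the deltas.
from ..llama.modeling_llama import LlamaAttention, LlamaForCausalLM
from ..phi3.modeling_phi3 import Phi3MLP


class GlmMLP(Phi3MLP):
    pass


class GlmAttention(LlamaAttention):
    # Only GLM-specific changes live here; the expanded modeling_glm.py gets the full class.
    ...


class GlmForCausalLM(LlamaForCausalLM):
    pass
```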
 
157
 
158
  <summary id="generated-modeling">Auto-generated modeling code</summary>
159
 
160
  <HtmlEmbed src="transformers/glm-compare.html" />
161
 
162
+ As you can see, we can define a new model as a _modular_ combination of fragments taken from others.
163
 
164
  You might think "well that's just how inheritance works". The crucial difference is that we do _visibly_ what is essentially the _compiler_'s job: by unrolling the inheritances, we make visible all of the modeling code, keeping it [all in one piece](#one-model-one-file).
165
 
166
+ <!-- some ideas for additional hand-holding: link to the implementation of `LlamaAttention` to show it was copied (and modified), or maybe provide a git diff view between the GlmAttention and LlamaAttention implementations -->
167
+
168
+ What is the consequence? When adding a model, we do not need to go over the entire modeling file. The modular (left side above) is enough.
169
 
170
+ When `AutoModel.from_pretrained(...)` is called, it is indeed the modeling (right side) that is run, and all the tests run on the modeling code. More importantly, the auto-generated modeling file is what users _read_ to understand the code, what they step through in their debuggers and what they hack for their needs.
171
 
172
  What does that give us?
173
 
174
  <div class="crumbs">
175
+ <strong>TL;DR:</strong> A small <code>modular_*.py</code> declares reuse; the expanded modeling file stays visible (<a href="#one-model-one-file">One Model, One File tenet preserved</a>). Reviewers and contributors maintain the shard, not the repetition.
176
+
177
+ <strong>Next:</strong> the measurable effect on effective LOC and maintenance cost.
178
  </div>
179
 
180
 
 
185
 
186
  That gives an "effective LOC" curve: the 𝗺𝗮𝗶𝗻𝘁𝗲𝗻𝗮𝗻𝗰𝗲 𝘀𝘂𝗿𝗳𝗮𝗰𝗲.
187
 
188
+ Measured on git history, raw `modeling_*.py` grew at ~362 LOC/day before modular; counting only modular shards yields ~25 LOC/day after — about **15× lower**. The effective curve (blue line below) represents the **maintenance surface** today: what maintainers actually read and review.
189
+
190
+ <!-- Yeah, super good point that effective == maintenable -->
191
 
192
+ Less code to hand-maintain means fewer places to break. Of course, LOC is not a direct measure of complexity, but it correlates with review effort and change risk.
193
 
194
  <HtmlEmbed src="transformers/loc-growth.html" />
195
 
196
+ <!-- What is "Modeling LOC (included)"? The modeling code, not counting the files that have a modular counterpart? If so, perhaps we can say that the blue line (effective) is the sum of the red + green, whereas the yellow would have been the progression without modular. Also worth mentioning imo that the surface area has been essentially constant (in LOC) since modular. -->
197
 
198
+ Notice there's a sharp drop at the end of the curves; this is mostly due to us [removing support for Jax and TensorFlow](https://github.com/huggingface/transformers/commit/4df2529d79d75f44e70396df5888a32ffa02d61e#diff-60849db3e9922197854ef1cac92bf4aba08b5d7fd3fe6f3c16a3511e29e0eacc) library-wide.
199
 
200
+ But this was not the only effort that allowed us to reduce maintenance load.
201
 
202
+ We recently undertook a thoughtful refactor of the attention implementation. You've likely heard about [flash attention](https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention) and its several variants.
203
 
204
+ _Attention computation_ happens at a _lower_ level of abstraction than the model itself.
205
+
206
+ However, we were adding specific torch operations to every model for each backend (sdpa, the various flash-attention versions, flex attention), and it wasn't a [minimal user api](#minimal-user-api). The next section explains what we did.
207
 
208
  <div class="crumbs">
209
+ Evidence: effective (i.e., maintainable) LOC growth drops ~15× when counting shards instead of expanded modeling files. Less code to read, fewer places to break.
210
+
211
+ <strong>Next:</strong> how the attention interface stays standard without hiding semantics.
212
  </div>
213
 
214
  ### <a id="attention-classes"></a> External Attention classes
215
 
216
+ The solution for the "attention abstraction problem" was to move to a standard [attention interface](https://huggingface.co/docs/transformers/en/attention_interface) that works as follows:
217
 
218
+ The naive implementation of attention, called "eager", is available by default. We use a `Callable` called `eager_attention_forward`, which can run as long as the user has PyTorch installed – which is a requirement anyway.
219
 
220
+ Instead of using a class interface and a class hierarchy, we just moved to a function interface. When a more complex attention implementation is needed, we use other Callables, including much faster kernel bindings when available. The decision to use a different attention implementation is based on the model configuration file we download from the Hub, and it can also be overridden by the user.
221
 
222
+ This is a clear example that we prefer an interface that is [standard, but not abstract](#standardize-dont-abstract). To be completely precise, this is what the interface selection looks like in transformers code:
223
 
224
  ```python
225
  attention_interface: Callable = eager_attention_forward
 
227
  attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
228
  ```
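And because the mapping is exposed, a custom attention function can be registered and selected by name, in the spirit of the [attention interface](https://huggingface.co/docs/transformers/en/attention_interface) docs (the wrapper body is illustrative):

```python
from transformers import AttentionInterface, AutoModelForCausalLM
from transformers.integrations.sdpa_attention import sdpa_attention_forward

def logged_sdpa(module, query, key, value, attention_mask, **kwargs):
    # Wrap an existing backend: log the query shape, then defer to SDPA.
    print("attention called with query of shape", tuple(query.shape))
    return sdpa_attention_forward(module, query, key, value, attention_mask, **kwargs)

AttentionInterface.register("logged_sdpa", logged_sdpa)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B", attn_implementation="logged_sdpa"
)
```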
229
 
230
+ A strength of the new attention interface is the possibility to enforce specific kwargs, which are needed by kernel providers and other dependencies. We know that kwargs are often a necessary evil that plagues tools with widespread compatibility; it is something we have aimed to reduce, and will continue to reduce in order to improve readability - even with them, the current system remains a [minimal user api](#minimal-user-api).
231
+
232
+ <!-- not fully following the transition here -->
233
 
234
+ Backend integrations sometimes require specific kwargs. We reduce that surface and document expectations; where flexibility is necessary, we plan to use `typing.Annotated` to convey shapes and invariants without constraining integrations. Such an implementation could look like this in the future:
235
 
236
  ```python
237
  from typing import Annotated
 
241
 
242
 
243
  <div class="crumbs">
244
+ Attention semantics remain in <code>eager_attention_forward</code>; faster backends are opt-in via config. We inform via types/annotations rather than enforce rigid kwargs, preserving integrations.
245
+
246
+ <strong>Next:</strong> parallel partitioning is declared as a plan, not through model surgery.
247
  </div>
248
 
249
  ### <a id="simpler-tensor-parallelism"></a> Configurable Tensor Parallelism
250
 
251
  If you're not familiar with the different flavours of parallelism, I recommend checking out [this blog post](https://huggingface.co/blog/accelerate-nd-parallel) first, and of course a full [dive into the ultra-scale playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook) is always recommended.
252
 
253
+ The essential part is that, as [the documentation states](https://huggingface.co/docs/transformers/v4.56.2/perf_train_gpu_many#tensor-parallelism), when tensors get too large to fit on a single GPU, they are sliced along a particular dimension and every slice is sent to a different GPU.
254
 
255
  Why does it matter?
256
 
257
  Because we want to avoid code modifications that are unrelated to the model.
 
 
 
258
 
259
+ We choose to place the level of abstraction higher than the device placement: a matrix multiplication - an `nn.Linear` layer - should always be expressed in the same way, regardless of how it is placed.
260
 
261
+ Hence, we want to touch the modeling code [minimally](#minimal-user-api), and only modify it when _architectural changes_ are involved, never because of how the model is run. For tensor parallelism, we simply specify a `tp_plan`:
262
 
263
  <HtmlEmbed src="transformers/tp-plan.html" />
264
 
265
+ The plan is written once, saved as part of the config, and passed to `.from_pretrained()`. It maps module name patterns to partitioning strategies. Strategies are resolved by the internal `ParallelInterface`, which wires them to sharding implementations such as `ColwiseParallel`, `RowwiseParallel`, and packed variants.
266
+
267
+ The alternative would be to modify each modeling class for every supported flavour of parallelism.
268
 
269
+ The `tp_plan` solution allows users to run the same model on a single GPU, or distribute it using multiple processes per node, e.g. 4 GPUs:
270
 
271
  `torchrun --nproc-per-node 4 demo.py`
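+
+ For reference, a minimal `demo.py` could look like the sketch below; the checkpoint, prompt and dtype are illustrative, and the partitioning plan itself comes from the model's config:
+
+ ```python
+ # demo.py - minimal sketch; run with: torchrun --nproc-per-node 4 demo.py
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative checkpoint
+
+ # tp_plan="auto" picks up the partitioning plan shipped with the model's config
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id, tp_plan="auto", torch_dtype=torch.bfloat16
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ inputs = tokenizer("Tensor parallelism keeps the modeling code intact.", return_tensors="pt").to(model.device)
+ with torch.no_grad():
+     logits = model(**inputs).logits
+ print(logits.shape)
+ ```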
272
 
273
+ Semantics stay in the model (a Linear stays a Linear); parallelization is orthogonal and declared via strings: "colwise" splits columns of weights/bias across ranks, "rowwise" splits rows, and packed variants shard fused weights. The mapping keys accept glob patterns like `layers.*.mlp.down_proj` to target repeated submodules.
274
 
275
  <div class="crumbs">
276
+ Parallelization is specified in the configuration (<code>tp_plan</code>), not through edits to <code>Linear</code>s. Glob patterns target repeated blocks; modeling semantics stay intact.
277
+
278
+ <strong>Next:</strong> per-layer attention/caching schedules declared in config, not hardcoded.
279
  </div>
280
 
281
  ### <a id="layers-attentions-caches"></a> Layers, attentions and caches
282
 
283
+ Following the same logic, the _nature_ of attention and per-layer caching should not be hardcoded. We should be able to specify in the configuration how each layer is implemented. Thus, we define a mapping like:
284
 
285
 
286
  ```python
 
293
  )
294
  ```
295
 
296
+ and the configuration can be _explicit_ about which attention type is in which layer. See, for example, [gpt-oss](https://huggingface.co/openai/gpt-oss-120b/blob/main/config.json#L15), which alternates sliding and full attention:
297
 
298
  ```python
299
  "layer_types": [
 
305
  ],
306
  ```
307
 
308
+ This is [minimal](#minimal-user-api) to implement on the user side, keeps the modeling code untouched, and is easy to tweak.
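+
+ On the modeling side, a layer only needs to look its own type up in the config. A minimal sketch, with class and attribute names that are illustrative rather than the exact gpt-oss code:
+
+ ```python
+ import torch.nn as nn
+
+ class DecoderLayer(nn.Module):
+     def __init__(self, config, layer_idx: int):
+         super().__init__()
+         # "sliding_attention" or "full_attention", read from config.layer_types;
+         # nothing about the schedule is hardcoded in the modeling file.
+         self.attention_type = config.layer_types[layer_idx]
+         self.sliding_window = (
+             config.sliding_window if self.attention_type == "sliding_attention" else None
+         )
+ ```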
309
 
310
  <div class="crumbs">
311
+ Allowed layer types are explicit; schedules (e.g., sliding/full alternation) live in config. This keeps the file readable and easy to tweak.
312
+
313
+ <strong>Next:</strong> speedups come from kernels that don't change semantics.
314
  </div>
315
 
316
 
 
324
  ...
325
  ```
326
 
327
+ This also opens another contribution path: GPU specialists can contribute optimized kernels to the [Kernels Hub](https://huggingface.co/kernels-community), and have them immediately available to use in `transformers` and other libraries. You can check the [kernel community blog post](https://huggingface.co/blog/hello-hf-kernels) to learn more about it!
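+
+ On the consumer side, a kernel published on the Hub can be fetched and called directly. A rough sketch using the `kernels` package; the repo and function names follow the kernel community blog post and may differ for other kernels:
+
+ ```python
+ import torch
+ from kernels import get_kernel
+
+ # Download an optimized kernel from the Hub (illustrative community repo).
+ activation = get_kernel("kernels-community/activation")
+
+ x = torch.randn(16, 128, dtype=torch.float16, device="cuda")
+ y = torch.empty_like(x)
+ activation.gelu_fast(y, x)  # the exposed ops depend on the kernel repo
+ ```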
328
 
329
  Even more resources have been added, like the formidable [kernel builder](https://github.com/huggingface/kernel-builder) with its connected resources to [help you build kernels with it](https://github.com/huggingface/kernel-builder/blob/main/docs/writing-kernels.md) and [with nix](https://github.com/huggingface/kernel-builder/blob/main/docs/nix.md).
330
 
 
331
  <div class="crumbs">
332
+ Models define semantics; kernels define how to run them faster. Use decorations to borrow community forwards while keeping a consistent public surface.
333
+
334
+ <strong>Next:</strong> what modularity looks like across the repo.
335
  </div>
336
 
337
+ ## The state of modular
338
+
339
+ Modular provides a form of inheritance in our codebase. Some models become standards, and model contributors have the opportunity to _define standards_ if their architectures are adopted. Pushing the boundaries of scientific knowledge can also push the boundaries of engineering, if the effort is made, and we're striving for it.
340
 
 
341
  It's hard to conceptualize very large libraries and how their components interact with each other, no matter how good you are at abstraction.
342
  So I wanted to take a look at the current **state of modularity** across the repository. How many models are defined using components of others?
343
 
 
346
  2. In this `modular` file, what models, configurations and processings are imported?
347
  3. Recurse through the model list that way.
348
 
349
+ So what do we see? Llama is a basis and an influence for many models, and it shows.
350
  Radically different architectures such as mamba have spawned their own dependency subgraph.
351
 
352
356
+
357
+ (Graph reading guide: nodes are models; edges are modular imports).
358
 
359
  <HtmlEmbed src="transformers/dependency-graph.html" />
360
 
361
+ In the case of VLMs, there are far too many vision-based architectures that are not yet defined as modulars of existing ones. In other words, there is no strong software reference point for vision models.
362
  As you can see, there is a small DETR island, a little llava pocket, and so on, but it's not comparable to the centrality observed for llama.
363
 
364
+ Another problem: this visualization only shows models that already have a `modular` file; several models still do NOT have one.
365
 
366
  How do we spot them, and how do we identify modularisable models?
367
 
368
  <div class="crumbs">
369
+ Llama-lineage is a hub; several VLMs remain islands — engineering opportunity for shared parents.
370
+
371
+ <strong>Next:</strong> timeline + similarity signals to spot modularisable candidates.
372
  </div>
373
 
374
 
375
  ### Many models, but not enough yet, are alike
376
 
377
+ To find similarities across models, I used the Jaccard index, which measures the overlap between two sets. Code is more than a set of characters strung together, of course, and we also tried code-embedding models, which ranked candidates better in practice; for this post, we stick with the deterministic Jaccard index.
 
378
 
379
+ It is interesting, for our comparison, to look at _when_ we deployed the modular logic and its ripple effect on the library. You can check the [larger space](https://huggingface.co/spaces/Molbap/transformers-modular-refactor) to play around, but the gist is: adding modular allowed us to connect more and more models to solid reference points. We still have a lot of gaps to fill.
380
 
381
  Zoom out below - it's full of models. You can click on a node to see its connections better, or use the text box to search for a model.
382
 
383
  <HtmlEmbed src="transformers/model-timeline.html" />
384
 
385
386
+
387
+ If you look at llava, you'll see that llava_video is a red node, connected by a red edge to llava: it is a candidate, something we can _likely_ remodularize, [without touching the actual model](#backwards-compatibility) while making it much more readable thanks to [DRY*](#do-repeat-yourself).
388
 
389
  <div class="crumbs">
390
+ Similarity metrics (Jaccard or embeddings) surface likely parents; the timeline shows consolidation after modular landed. Red nodes/edges = candidates (e.g., <code>llava_video</code> → <code>llava</code>) for refactors that preserve behavior.
391
+
392
+ <strong>Next:</strong> concrete VLM choices that avoid leaky abstractions.
393
  </div>
394
 
395
  ### VLM improvements, avoiding abstraction
396
 
397
+ We don't yet have a cookbook for common VLM patterns (image token scatter, multi‑tower encoders, cross‑attention bridges). This is one of the main areas where we can improve.
398
 
399
  For instance, we thought of abstracting away the mixing of `inputs_embeds`, the tensor fed into an LLM decoder in 95% of the existing VLMs. It would have looked something like this:
400
 
 
403
  #
404
  ```
405
 
406
+ But this is [abstracting away an important component of the modeling](#standardize-dont-abstract). The embedding mixing is part of the model; removing it from the modeling file would break it. A user opening [`modeling_qwen2.5_vl`](https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5) should not have to go to another file to understand how it works.
407
+
408
409
 
410
  What is the current state of these “abstractions” across the codebase?
411
  You will see all the imports around a modeling file, here [Gemma3n](https://huggingface.co/google/gemma-3n-E4B-it).
 
462
 
463
  But this lives _within_ the modeling file, not in the `PreTrainedModel` base class, and it will stay there: moving it out would break the [self-contained logic](#one-model-one-file) of the model.
464
 
465
+ The conclusion for VLMs is that they should lean on `modular` more, so that de-facto standard components can emerge across models without being abstracted away from the modeling files.
466
+
467
  <div class="crumbs">
468
+ Keep VLM embedding mix in the modeling file (semantics), standardize safe helpers (e.g., placeholder masking), don't migrate behavior to <code>PreTrainedModel</code>.
469
+
470
+ <strong>Next:</strong> pipeline-level wins that came from PyTorch-first choices (fast processors).
471
  </div>
472
 
473
 
474
  ### On image processing and processors
475
 
476
+ Deciding to become a `torch`-first library meant shedding a tremendous amount of support code for `jax` and `TensorFlow`, and it also meant that we could be more liberal about the amount of torch-dependent utilities we accept. One of these is the _fast processing_ of images. Where inputs were once minimally assumed to be ndarrays, enforcing native `torch` and `torchvision` inputs allowed us to massively improve processing speed for each model.
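+
+ Opting into the fast path is a one-liner in practice. A sketch, where the checkpoint name is illustrative, `use_fast=True` selects the torchvision-backed processor, and the `device` argument keeps the work on GPU:
+
+ ```python
+ import torch
+ from transformers import AutoImageProcessor
+
+ # Fast, torchvision-backed image processor; outputs are torch tensors end to end.
+ processor = AutoImageProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", use_fast=True)
+
+ images = [torch.randint(0, 256, (3, 336, 336), dtype=torch.uint8)]
+ batch = processor(images=images, return_tensors="pt", device="cuda")
+ print(batch["pixel_values"].shape, batch["pixel_values"].device)
+ ```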
477
 
478
+ The gains in performance are immense, up to a 20x speedup for most models when using compiled torchvision ops. Furthermore, the whole pipeline can now run solely on the GPU.
479
 
480
  ![Fast Image Processors Performance](/images/transformers/fast_image_processors.png)
481
  <p class="figure-legend">Thanks <a href="https://huggingface.co/yonigozlan">Yoni Gozlan</a> for the great work!</p>
482
 
483
  <div class="crumbs">
484
+ PyTorch-first lets processors assume torch/torchvision and run the whole pipeline on GPU; big per-model speedups.
485
+
486
+ <strong>Next:</strong> how this lowers friction for contributors and downstream users.
487
  </div>
488
 
489
 
 
493
 
494
  Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.
495
 
496
+ Among the most valuable contributions to `transformers` is of course the addition of new models. Very recently, [OpenAI added GPT-OSS](https://huggingface.co/blog/welcome-openai-gpt-oss), which prompted the addition of many new features to the library in order to support [their model](https://huggingface.co/openai/gpt-oss-120b). These additions are immediately available for other models to use.
497
 
498
+ Another important advantage is the ability to fine-tune and pipeline these models into many other libraries and tools. Check on the Hub how many finetunes are registered for [gpt-oss 120b](https://huggingface.co/models?other=base_model:finetune:openai/gpt-oss-120b), despite its size!
499
 
500
 
501
  <div class="crumbs">
502
+ The shape of a contribution: add a model (or variant) with a small modular shard; the community and serving stacks pick it up immediately. Popularity trends (encoders/embeddings) guide where we invest.
503
+
504
+ <strong>Next:</strong> power tools enabled by a consistent API.
505
  </div>
506
 
507
 
508
  ### <a id="encoders-ftw"></a> Model popularity
509
 
510
+ Speaking of dependencies, we can take a look at the number of downloads as a measure of popularity. One thing we see is the prominence of encoders, despite the apparent prevalence of decoder LLMs. The reason is that encoders are used to generate embeddings, which have multiple downstream uses. Just check out [EmbeddingGemma](https://huggingface.co/blog/embeddinggemma) for a modern recap. Hence, it is vital to keep the encoder portion of the library viable, usable, and fine-tunable.
511
 
512
  <div>
513
  <HtmlEmbed src="transformers/model-visualisation.html" />
514
  </div>
515
 
516
+ As the codebase grows, we need to maintain it in coordination with the [Sentence Transformers](https://huggingface.co/sentence-transformers) codebase. Retrieval use cases, smart databases, and FAISS-based indexing rely on it, and thus indirectly on transformers.
 
517
 
518
  In that regard, we DO want to be a modular toolbox, being [minimal](#minimal-user-api) enough and well documented enough so any ML/AI developer can use `transformers` without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.
519
 
520
  So, how do these design choices, these "tenets" influence development of models and overall usage of transformers?
521
 
522
  <div class="crumbs">
523
+ Encoders remain critical for embeddings and retrieval; maintaining them well benefits the broader ecosystem (e.g., Sentence Transformers, FAISS).
524
+
525
+ <strong>Next:</strong> dev tools that leverage unified attention APIs and PyTorch-only internals.
526
  </div>
527
 
528
 
529
  ## A surgical toolbox for model development
530
 
531
+ Transformers provides many tools that can help you while adding a new architecture, or help you understand the inner workings of the library.
532
+
533
  ### Attention visualisation
534
 
535
+ All models have the same internal API for attention computation, thanks to [the externalisation of attention classes](#attention-classes). This allows us to build tools that visualize the inner workings of the attention mechanism.
536
 
537
  One particular piece of machinery is the `attention mask`. Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual "causal-only" models.
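+
+ A sketch of how this looks from the user side, assuming the `AttentionMaskVisualizer` utility shipped with the library (checkpoint and prompt are illustrative):
+
+ ```python
+ from transformers.utils.attention_visualizer import AttentionMaskVisualizer
+
+ # Show which positions attend to which: bidirectional over the text+image prefix
+ # for PaliGemma, plain causal for decoder-only models.
+ visualizer = AttentionMaskVisualizer("google/paligemma2-3b-mix-224")
+ visualizer("<img> What is in this image?")
+ ```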
538
 
539
  <HtmlEmbed src="transformers/attention-visualizer.html" />
540
 
541
  <div class="crumbs">
542
+ Uniform attention APIs enable cross-model diagnostics (e.g., PaliGemma prefix bidirectionality vs causal).
543
+
544
+ <strong>Next:</strong> whole-model tracing for ports and regressions.
545
  </div>
546
 
547
 
548
  ### Logging entire model activations
549
 
550
+ Because everything is PyTorch, we can easily [debug any model](https://huggingface.co/docs/transformers/internal/model_debugging_utils) we want to add to transformers. We now have a power-user tool for porting or adding models that wraps a forward pass, intercepts every submodule call, and logs shapes, dtypes, and sample statistics of inputs/outputs to nested JSON.
551
 
552
+ It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, to match our [Source of Truth guideline](#source-of-truth).
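+
+ A sketch of what that wrapping looks like, assuming the `model_addition_debugger_context` helper from the linked docs (model and output path are illustrative):
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from transformers.model_debugging_utils import model_addition_debugger_context
+
+ model_id = "gpt2"  # small illustrative model
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id)
+
+ inputs = tokenizer("Hello, world", return_tensors="pt")
+
+ # Every submodule call inside this context is intercepted; shapes, dtypes and sample
+ # statistics of inputs/outputs are written to nested JSON under debug_path.
+ with model_addition_debugger_context(model, debug_path="debug_traces"):
+     with torch.no_grad():
+         model(**inputs)
+ ```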
553
 
554
  ![Model debugger interface](/images/transformers/model_debugger.png)
555
 
556
 
557
  <div class="crumbs">
558
+ Forward interception and nested JSON logging align ports to reference implementations, reinforcing "Source of Truth."
559
+
560
+ <strong>Next:</strong> CUDA warmup reduces load-time without touching modeling semantics.
561
  </div>
562
 
563
 
564
 
565
  ### Cooking faster CUDA warmups
566
 
567
+ Having a clean _external_ API allows us to work on the [true inner workings of transformers](#code-is-product). One recent addition was the _CUDA warmup_ via `caching_allocator_warmup`, which dramatically improves loading times by pre-allocating GPU memory and avoiding malloc bottlenecks during model loading. It can achieve a 7x speedup for an 8B model, or 6x for a 32B one, as you can check in [the PR](https://github.com/huggingface/transformers/pull/36380)!
568
 
569
  <HtmlEmbed src="transformers/warmup_demo.html" />
570
 
571
  It's hard to overstate how much of a lifesaver that is when you're trying to load a model as fast as possible, as it's the narrowest bottleneck for your iteration speed.
572
 
573
  <div class="crumbs">
574
+ Pre-allocating GPU memory removes malloc spikes (e.g., 7× for 8B, 6× for 32B in the referenced PR).
575
+
576
+ <strong>Next:</strong> consistent interfaces allow transformers-serve.
577
  </div>
578
 
579
 
580
  ### Transformers-serve and continuous batching
581
 
582
+ Having all these models readily available and sharing the same interface allowed us to implement transformers-serve, a CLI tool that exposes models through an OpenAI-compatible HTTP API.
583
 
584
  ```bash
585
  transformers serve
 
589
  -d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 0.9, "max_tokens": 1000, "stream": true, "model": "Qwen/Qwen2.5-0.5B-Instruct"}'
590
  ```
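+
+ Because the surface is OpenAI-compatible, any OpenAI client can talk to the local server. A sketch, where the host, port and API key handling are assumptions – check the address `transformers serve` prints on startup:
+
+ ```python
+ from openai import OpenAI
+
+ # Point the standard OpenAI client at the locally running transformers server.
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="<not-needed-locally>")
+
+ response = client.chat.completions.create(
+     model="Qwen/Qwen2.5-0.5B-Instruct",
+     messages=[{"role": "user", "content": "hello"}],
+     max_tokens=100,
+ )
+ print(response.choices[0].message.content)
+ ```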
591
 
592
+ transformers-serve uses continuous batching (see [this PR](https://github.com/huggingface/transformers/pull/38085) and also [this one](https://github.com/huggingface/transformers/pull/40426)) for better GPU utilization, and is very much linked to the great work of vLLM with the `paged attention kernel` – a further justification of [external kernels](#community-kernels).
593
+
594
+ transformers-serve is not meant for user-facing production services – tools like vLLM or SGLang are heavily optimized for that – but it's useful for several use cases:
595
+ - Quickly verify that your model is compatible with continuous batching and paged attention.
596
+ - Run ad-hoc vibe tests on any model, without having to deploy anything.
597
+ - Run evaluations efficiently, again without spending a lot of time engineering your infrastructure.
598
 
599
+ For model deployment, check [Inference Providers](https://huggingface.co/docs/inference-providers/en/index) or roll your own solution using any of the excellent serving libraries.
600
 
601
  <div class="crumbs">
602
+ OpenAI-compatible surface + continuous batching; kernels/backends slot in because the modeling API stayed stable.
603
+
604
+ <strong>Next:</strong> reuse across vLLM/SGLang relies on the same consistency.
605
  </div>
606
 
607
 
608
  ## Community reusability
609
 
610
+ The transformers-serve CLI is built on transformers, of course, but the library is made first and foremost to be _reused_ at large by the open-source ecosystem.
611
 
612
  Adding a model to transformers means:
613
 
614
  - having it immediately available to the community
615
+ - having it immediately usable in vLLM, [SGLang](https://huggingface.co/blog/transformers-backend-sglang), and so on, without additional code. In the case of vLLM, transformers acts as a backend: vLLM optimizes throughput and latency on top of existing transformers architectures, [as seen in this great vLLM x HF blog post](https://blog.vllm.ai/2025/04/11/transformers-backend.html) (see the sketch after this list).
616
+ - being the reference code for implementations in MLX, llama.cpp and other libraries.
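+
+ For the vLLM case mentioned above, selecting the transformers backend is a single argument. A sketch based on the linked vLLM x HF blog post, with an illustrative model name:
+
+ ```python
+ from vllm import LLM, SamplingParams
+
+ # model_impl="transformers" asks vLLM to run the transformers implementation as its backend.
+ llm = LLM(model="meta-llama/Llama-3.2-1B", model_impl="transformers")
+
+ outputs = llm.generate(
+     ["The future of open-source ML is"], SamplingParams(max_tokens=32)
+ )
+ print(outputs[0].outputs[0].text)
+ ```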
617
 
618
+ This further cements the need for a [consistent public surface](#consistent-public-surface): we are both a backend and a reference, and serving is handled by much more software than ours. At the time of writing, more effort is going into that direction. We already have vLLM-compatible configs for VLMs (say that three times fast); see [GLM4 video support](https://github.com/huggingface/transformers/pull/40696/files) and [MoE support](https://github.com/huggingface/transformers/pull/40132), for instance.
619
 
620
 
621
  <div class="crumbs">
622
+ Being a good backend consumer requires a consistent public surface; modular shards and configs make that stability practical.
623
+
624
+ <strong>Next:</strong> what changes in v5 without breaking the promise of visible semantics.
625
  </div>
626
 
627
  ## What is coming next
628
 
629
+ The next major version of `transformers` is just around the corner (and will have another blog post to its name when it comes out). When v5 is released, we aim to keep [backwards compatibility](#backwards-compatibility) as solid as possible. The changes we make now are in service of that goal.
630
+
631
+ We will lean further into being a modular toolbox, not a framework. You should not be forced to rewrite modeling code. It’s better when a model can inherit from `PreTrainedModel` and opt into Tensor Parallel, `from_pretrained`, sharding, `push_to_hub`, loss plumbing, and external stacks like PEFT/TRL/SGLang/vLLM.
632
 
633
+ We're more excited than ever about what comes next, and we hope you'll build it with us.