Review
Browse files- app/src/content/article.mdx +131 -73
app/src/content/article.mdx
CHANGED
|
@@ -29,11 +29,11 @@ We continue to support all new models and expect to do so for the foreseeable fu
|
|
| 29 |
|
| 30 |
This post dissects the design philosophy that makes this possible. It's the result of a gradual evolution from our older principles, detailed on our previous [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page, as well as its accompanying [blog post from 2022](https://huggingface.co/blog/transformers-design-philosophy). More recently (and I do recommend the read), we wrote a blog post about [recent upgrades to transformers](https://huggingface.co/blog/faster-transformers) with a special focus on what makes the library faster today. All of these developments were only made possible thanks to these principles.
|
| 31 |
|
| 32 |
-
|
| 33 |
|
| 34 |
For any OSS maintainer, power user, or contributor, this is the map to understanding, using, and building upon `transformers`; but not only: any project of comparable size will require you to make deep choices, not only on design and choice of abstractions, but on the very mindset of the software you are building. These tenets may or may not be applicable to your project, but they provide a glimpse on how we work that could be helpful or inspirational.
|
| 35 |
|
| 36 |
-
Conventions used
|
| 37 |
|
| 38 |
* [Tenets exemplified](#source-of-truth) will have their summary available on hover.
|
| 39 |
|
|
@@ -185,35 +185,41 @@ However, if a model has a modular_*.py and a corresponding automatically generat
|
|
| 185 |
|
| 186 |
That gives an "effective LOC" curve: the ๐บ๐ฎ๐ถ๐ป๐๐ฒ๐ป๐ฎ๐ป๐ฐ๐ฒ ๐๐๐ฟ๐ณ๐ฎ๐ฐ๐ฒ.
|
| 187 |
|
| 188 |
-
Measured on git history, raw `modeling_*.py` grew at ~362 LOC/day before modular; counting only modular shards yields ~25 LOC/day after โ about **15ร lower**. The curve represents the **maintenance surface** today: what maintainers actually read and review.
|
| 189 |
|
| 190 |
-
|
|
|
|
|
|
|
| 191 |
|
| 192 |
<HtmlEmbed src="transformers/loc-growth.html" />
|
| 193 |
|
| 194 |
-
|
|
|
|
|
|
|
| 195 |
|
| 196 |
-
|
| 197 |
|
| 198 |
-
|
| 199 |
|
| 200 |
-
|
| 201 |
|
| 202 |
-
However, we were adding specific torch operations for each backend (sdpa, flash-attention
|
| 203 |
|
| 204 |
<div class="crumbs">
|
| 205 |
-
Evidence: effective LOC drops ~15ร when counting shards instead of expanded modeling. Less to read, fewer places to break.
|
|
|
|
|
|
|
| 206 |
</div>
|
| 207 |
|
| 208 |
### <a id="attention-classes"></a> External Attention classes
|
| 209 |
|
| 210 |
-
The solution
|
| 211 |
|
| 212 |
-
|
| 213 |
|
| 214 |
-
|
| 215 |
|
| 216 |
-
This
|
| 217 |
|
| 218 |
```python
|
| 219 |
attention_interface: Callable = eager_attention_forward
|
|
@@ -221,9 +227,11 @@ if self.config._attn_implementation != "eager":
|
|
| 221 |
attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
|
| 222 |
```
|
| 223 |
|
| 224 |
-
A strength of the new attention interface is the possibility to enforce specific kwargs, which are needed by kernel providers and other dependencies. We know that kwargs are often a necessary evil that plagues tools with widespread compatibility;
|
| 225 |
|
| 226 |
-
|
|
|
|
|
|
|
| 227 |
|
| 228 |
```python
|
| 229 |
from typing import Annotated
|
|
@@ -233,42 +241,46 @@ MyModelOutputAnnotated = Annotated[MyModelOutput, "shape: (B, C, H, W)"]
|
|
| 233 |
|
| 234 |
|
| 235 |
<div class="crumbs">
|
| 236 |
-
|
|
|
|
|
|
|
| 237 |
</div>
|
| 238 |
|
| 239 |
### <a id="simpler-tensor-parallelism"></a> Configurable Tensor Parallelism
|
| 240 |
|
| 241 |
If you're not familiar with the different flavours of parallelism, I recommend checking out [this blog post](https://huggingface.co/blog/accelerate-nd-parallel) first, and of course a full [dive into the ultra-scale playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook) is always recommended.
|
| 242 |
|
| 243 |
-
The essential part is that, as [the documentation states](https://huggingface.co/docs/transformers/v4.56.2/perf_train_gpu_many#tensor-parallelism) when tensors get too large to fit on a single GPU, they are sliced along a particular dimension and every slice is sent to a different GPU.
|
| 244 |
|
| 245 |
Why does it matter?
|
| 246 |
|
| 247 |
Because we want to avoid code modifications that are unrelated to the model.
|
| 248 |
-
We choose to place the level of abstraction higher than the device placement: a matrix multiplication - a `nn.Linear` layer - should be always expressed in the same way, regardless of how it is placed.
|
| 249 |
-
|
| 250 |
-
Hence, we want to touch [minimally](#minimal-user-api) to the modeling code, and only modify it when _architectural changes_ are involved. For instance, for tensor parallelism, we instead now specify a simple `tp_plan`.
|
| 251 |
|
| 252 |
-
|
| 253 |
|
| 254 |
-
|
| 255 |
|
| 256 |
<HtmlEmbed src="transformers/tp-plan.html" />
|
| 257 |
|
|
|
|
|
|
|
|
|
|
| 258 |
|
| 259 |
-
|
| 260 |
|
| 261 |
`torchrun --nproc-per-node 4 demo.py`
|
| 262 |
|
| 263 |
-
Semantics stay in the model (a Linear stays a Linear),
|
| 264 |
|
| 265 |
<div class="crumbs">
|
| 266 |
-
|
|
|
|
|
|
|
| 267 |
</div>
|
| 268 |
|
| 269 |
### <a id="layers-attentions-caches"></a> Layers, attentions and caches
|
| 270 |
|
| 271 |
-
Following the same logic, the _nature_ of attention and
|
| 272 |
|
| 273 |
|
| 274 |
```python
|
|
@@ -281,7 +293,7 @@ ALLOWED_LAYER_TYPES = (
|
|
| 281 |
)
|
| 282 |
```
|
| 283 |
|
| 284 |
-
and the configuration can be _explicit_ about which attention type is in which layer,
|
| 285 |
|
| 286 |
```python
|
| 287 |
"layer_types": [
|
|
@@ -293,10 +305,12 @@ and the configuration can be _explicit_ about which attention type is in which l
|
|
| 293 |
],
|
| 294 |
```
|
| 295 |
|
| 296 |
-
This is [minimal](#minimal-user-api) to implement on the user side, and allows to keep the modeling untouched. It is also easy to tweak.
|
| 297 |
|
| 298 |
<div class="crumbs">
|
| 299 |
-
Allowed layer types are explicit; schedules (e.g., sliding/full alternation) live in config. This keeps the file readable and easy to tweak.
|
|
|
|
|
|
|
| 300 |
</div>
|
| 301 |
|
| 302 |
|
|
@@ -310,18 +324,20 @@ class GlmRMSNorm(nn.Module):
|
|
| 310 |
...
|
| 311 |
```
|
| 312 |
|
| 313 |
-
This also opens another contribution path: GPU specialists can contribute optimized kernels to the
|
| 314 |
|
| 315 |
Even more resources have been added, like the formidable [kernel builder](https://github.com/huggingface/kernel-builder) with its connected resources to [help you build kernels with it](https://github.com/huggingface/kernel-builder/blob/main/docs/writing-kernels.md) and [with nix](https://github.com/huggingface/kernel-builder/blob/main/docs/nix.md).
|
| 316 |
|
| 317 |
-
|
| 318 |
<div class="crumbs">
|
| 319 |
-
Models define semantics; kernels define how to run them faster. Use
|
|
|
|
|
|
|
| 320 |
</div>
|
| 321 |
|
| 322 |
-
## Modular
|
|
|
|
|
|
|
| 323 |
|
| 324 |
-
Now, we have a form of inheritance in our codebase. Some models become standards, and model contributors are given the opportunity to _define standards_. Pushing the boundaries of scientific knowledge can translate into the boundaries of engineering if this effort is made, and we're striving for it.
|
| 325 |
It's hard to conceptualize very large libraries and how their components interact with each other, regardless of your cognitive abilities for abstractions.
|
| 326 |
So I wanted to take a look at the current **state of modularity** across the repository. How many models are defined using components of others?
|
| 327 |
|
|
@@ -330,44 +346,55 @@ To get this graph, I used the heuristic of modular inheritance.
|
|
| 330 |
2. In this `modular` file, what models, configurations and processings are imported?
|
| 331 |
3. Recurse through the model list that way.
|
| 332 |
|
| 333 |
-
So what do we see? Llama is a basis for many models, and it shows.
|
| 334 |
Radically different architectures such as mamba have spawned their own dependency subgraph.
|
| 335 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 336 |
|
| 337 |
<HtmlEmbed src="transformers/dependency-graph.html" />
|
| 338 |
|
| 339 |
-
|
| 340 |
As you can see, there is a small DETR island, a little llava pocket, and so on, but it's not comparable to the centrality observed for llama.
|
| 341 |
|
| 342 |
-
Another problem is, this
|
| 343 |
|
| 344 |
How do we spot them, and how do we identify modularisable models?
|
| 345 |
|
| 346 |
<div class="crumbs">
|
| 347 |
-
|
|
|
|
|
|
|
| 348 |
</div>
|
| 349 |
|
| 350 |
|
| 351 |
### Many models, but not enough yet, are alike
|
| 352 |
|
| 353 |
-
|
| 354 |
-
|
| 355 |
|
| 356 |
-
It is interesting, for
|
| 357 |
|
| 358 |
Zoom out below - it's full of models. You can click on a node to see its connections better, or use the text box to search for a model.
|
| 359 |
|
| 360 |
<HtmlEmbed src="transformers/model-timeline.html" />
|
| 361 |
|
| 362 |
-
|
|
|
|
|
|
|
| 363 |
|
| 364 |
<div class="crumbs">
|
| 365 |
-
Similarity (Jaccard
|
|
|
|
|
|
|
| 366 |
</div>
|
| 367 |
|
| 368 |
### VLM improvements, avoiding abstraction
|
| 369 |
|
| 370 |
-
We don't have cookbook for common VLM patterns (image token scatter, multiโtower encoders, crossโ
|
| 371 |
|
| 372 |
For instance, we thought of abstracting away the mixing of `inputs_embeds`, the tensor fed into an LLM decoder in 95% of the existing VLMs. It would have looked like something like
|
| 373 |
|
|
@@ -376,7 +403,9 @@ class InputsEmbeddingMixerMixin(nn.Module):
|
|
| 376 |
#
|
| 377 |
```
|
| 378 |
|
| 379 |
-
But this is [abstracting away an important component of the modeling
|
|
|
|
|
|
|
| 380 |
|
| 381 |
What is the current state of these โabstractionsโ across the codebase?
|
| 382 |
You will see all the imports around a modeling file, here [Gemma3n](https://huggingface.co/google/gemma-3n-E4B-it).
|
|
@@ -433,22 +462,28 @@ The following [Pull request to standardize placeholder masking](https://github.c
|
|
| 433 |
|
| 434 |
But this is _within_ the modeling file, not in the `PreTrainedModel` base class. It will not move away from it, because it'd break the [self-contained logic](#one-model-one-file) of the model.
|
| 435 |
|
|
|
|
|
|
|
| 436 |
<div class="crumbs">
|
| 437 |
-
Keep VLM embedding mix in the modeling file (semantics), standardize safe helpers (e.g., placeholder masking), don't migrate behavior to <code>PreTrainedModel</code>.
|
|
|
|
|
|
|
| 438 |
</div>
|
| 439 |
|
| 440 |
|
| 441 |
### On image processing and processors
|
| 442 |
|
| 443 |
-
|
| 444 |
|
| 445 |
-
The gains in performance are immense, up to 20x
|
| 446 |
|
| 447 |

|
| 448 |
<p class="figure-legend">Thanks <a href="https://huggingface.co/yonigozlan">Yoni Gozlan</a> for the great work!</p>
|
| 449 |
|
| 450 |
<div class="crumbs">
|
| 451 |
-
|
|
|
|
|
|
|
| 452 |
</div>
|
| 453 |
|
| 454 |
|
|
@@ -458,82 +493,93 @@ This is an overall objective: there's no `transformers` without its community.
|
|
| 458 |
|
| 459 |
Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.
|
| 460 |
|
| 461 |
-
Among the most valuable contributions to `transformers` is of course the addition of new models. Very recently, [OpenAI added GPT-OSS](https://huggingface.co/blog/welcome-openai-gpt-oss), which prompted the addition of many new features to the library in order to support [their model](https://huggingface.co/openai/gpt-oss-120b).
|
| 462 |
|
| 463 |
-
|
| 464 |
|
| 465 |
|
| 466 |
<div class="crumbs">
|
| 467 |
-
The shape of a contribution: add a model (or variant) with a small modular shard; the community and serving stacks pick it up immediately. Popularity trends (encoders/embeddings) guide where we invest.
|
|
|
|
|
|
|
| 468 |
</div>
|
| 469 |
|
| 470 |
|
| 471 |
### <a id="encoders-ftw"></a> Models popularity
|
| 472 |
|
| 473 |
-
Talking about dependencies, we can take a look at the number of downloads
|
| 474 |
|
| 475 |
<div>
|
| 476 |
<HtmlEmbed src="transformers/model-visualisation.html" />
|
| 477 |
</div>
|
| 478 |
|
| 479 |
-
As the codebase grows, with our friend
|
| 480 |
-
|
| 481 |
|
| 482 |
In that regard, we DO want to be a modular toolbox, being [minimal](#minimal-user-api) enough and well documented enough so any ML/AI developer can use `transformers` without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.
|
| 483 |
|
| 484 |
So, how do these design choices, these "tenets" influence development of models and overall usage of transformers?
|
| 485 |
|
| 486 |
<div class="crumbs">
|
| 487 |
-
Encoders remain critical for embeddings and retrieval; maintaining them well benefits the broader ecosystem (e.g., Sentence Transformers, FAISS).
|
|
|
|
|
|
|
| 488 |
</div>
|
| 489 |
|
| 490 |
|
| 491 |
## A surgical toolbox for model development
|
| 492 |
|
|
|
|
|
|
|
| 493 |
### Attention visualisation
|
| 494 |
|
| 495 |
-
All models have the same API
|
| 496 |
|
| 497 |
One particular piece of machinery is the `attention mask`. Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual "causal-only" models.
|
| 498 |
|
| 499 |
<HtmlEmbed src="transformers/attention-visualizer.html" />
|
| 500 |
|
| 501 |
<div class="crumbs">
|
| 502 |
-
Uniform attention APIs enable cross-model diagnostics (e.g., PaliGemma prefix bidirectionality vs causal).
|
|
|
|
|
|
|
| 503 |
</div>
|
| 504 |
|
| 505 |
|
| 506 |
### Logging entire model activations
|
| 507 |
|
| 508 |
-
|
| 509 |
|
| 510 |
-
It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation,
|
| 511 |
|
| 512 |

|
| 513 |
|
| 514 |
|
| 515 |
<div class="crumbs">
|
| 516 |
-
Forward interception and nested JSON logging align ports to reference implementations, reinforcing "Source of Truth." <strong>
|
|
|
|
|
|
|
| 517 |
</div>
|
| 518 |
|
| 519 |
|
| 520 |
|
| 521 |
### Cooking faster CUDA warmups
|
| 522 |
|
| 523 |
-
Having a clean _external_ API allows us to work on the [true inner workings of transformers](#code-is-product). One of
|
| 524 |
|
| 525 |
<HtmlEmbed src="transformers/warmup_demo.html" />
|
| 526 |
|
| 527 |
It's hard to overstate how much of a lifesaver that is when you're trying to load a model as fast as possible, as it's the narrowest bottleneck for your iteration speed.
|
| 528 |
|
| 529 |
<div class="crumbs">
|
| 530 |
-
Pre-allocating GPU memory removes malloc spikes (e.g., 7ร for 8B, 6ร for 32B in the referenced PR).
|
|
|
|
|
|
|
| 531 |
</div>
|
| 532 |
|
| 533 |
|
| 534 |
### Transformers-serve and continuous batching
|
| 535 |
|
| 536 |
-
Having all these models readily available
|
| 537 |
|
| 538 |
```bash
|
| 539 |
transformers serve
|
|
@@ -543,33 +589,45 @@ curl -X POST http://localhost:8000/v1/chat/completions \
|
|
| 543 |
-d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 0.9, "max_tokens": 1000, "stream": true, "model": "Qwen/Qwen2.5-0.5B-Instruct"}'
|
| 544 |
```
|
| 545 |
|
| 546 |
-
|
| 547 |
|
| 548 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 549 |
|
| 550 |
<div class="crumbs">
|
| 551 |
-
OpenAI-compatible surface + continuous batching; kernels/backends slot in because the modeling API stayed stable.
|
|
|
|
|
|
|
| 552 |
</div>
|
| 553 |
|
| 554 |
|
| 555 |
## Community reusability
|
| 556 |
|
| 557 |
-
|
| 558 |
|
| 559 |
Adding a model to transformers means:
|
| 560 |
|
| 561 |
- having it immediately available to the community
|
| 562 |
-
- having it immediately usable in vLLM, [SGLang](https://huggingface.co/blog/transformers-backend-sglang), and so on without additional code. In
|
|
|
|
| 563 |
|
| 564 |
-
This cements the need
|
| 565 |
|
| 566 |
|
| 567 |
<div class="crumbs">
|
| 568 |
-
Being a good backend consumer requires a consistent public surface; modular shards and configs make that stability practical.
|
|
|
|
|
|
|
| 569 |
</div>
|
| 570 |
|
| 571 |
## What is coming next
|
| 572 |
|
| 573 |
-
The next major version of `transformers` is just around the corner (and will have another blog post to its name when it comes out
|
| 574 |
|
| 575 |
We will lean further into a modular toolbox, not a framework. You should not be forced to rewrite modeling code. Itโs better when a model can inherit from `PreTrainedModel` and opt into Tensor Parallel, `from_pretrained`, sharding, `push_to_hub`, loss plumbing, and external stacks like PEFT/TRL/SGLang/vLLM.
|
|
|
|
|
|
|
|
|
| 29 |
|
| 30 |
This post dissects the design philosophy that makes this possible. It's the result of a gradual evolution from our older principles, detailed on our previous [philosophy](https://huggingface.co/docs/transformers/en/philosophy) page, as well as its accompanying [blog post from 2022](https://huggingface.co/blog/transformers-design-philosophy). More recently (and I do recommend the read), we wrote a blog post about [recent upgrades to transformers](https://huggingface.co/blog/faster-transformers) with a special focus on what makes the library faster today. All of these developments were only made possible thanks to these principles.
|
| 31 |
|
| 32 |
+
We formalize and articulate the "tenets" that have been guiding our development, demonstrate how they are implemented in code, and show the measurable impact they have on the library's sustainability and growth.
|
| 33 |
|
| 34 |
For any OSS maintainer, power user, or contributor, this is the map to understanding, using, and building upon `transformers`; but not only: any project of comparable size will require you to make deep choices, not only on design and choice of abstractions, but on the very mindset of the software you are building. These tenets may or may not be applicable to your project, but they provide a glimpse on how we work that could be helpful or inspirational.
|
| 35 |
|
| 36 |
+
Conventions used throughout this post:
|
| 37 |
|
| 38 |
* [Tenets exemplified](#source-of-truth) will have their summary available on hover.
|
| 39 |
|
|
|
|
| 185 |
|
| 186 |
That gives an "effective LOC" curve: the ๐บ๐ฎ๐ถ๐ป๐๐ฒ๐ป๐ฎ๐ป๐ฐ๐ฒ ๐๐๐ฟ๐ณ๐ฎ๐ฐ๐ฒ.
|
| 187 |
|
| 188 |
+
Measured on git history, raw `modeling_*.py` grew at ~362 LOC/day before modular; counting only modular shards yields ~25 LOC/day after, about **15× lower**. The effective curve (blue line below) represents the **maintenance surface** today: what maintainers actually read and review.
|
| 189 |
|
| 190 |
+
<!-- Yeah, super good point that effective == maintenable -->
|
| 191 |
+
|
| 192 |
+
Less code to hand-maintain means fewer places to break. Of course, LOC is not a direct measure of complexity, but it correlates with review effort and change risk.
|
| 193 |
|
| 194 |
<HtmlEmbed src="transformers/loc-growth.html" />
|
| 195 |
|
| 196 |
+
<!-- What is "Modeling LOC (included)"? The modeling code, not counting the files that have a modular counterpart? If so, perhaps we can say that the blue line (effective) is the sum of the red + green, whereas the yellow would have been the progression without modular. Also worth mentioning imo that the surface area has been essentially constant (in LOC) since modular. -->
|
| 197 |
+
|
| 198 |
+
Notice the sharp drop at the end of the curves; this is mostly due to us [removing support for Jax and TensorFlow](https://github.com/huggingface/transformers/commit/4df2529d79d75f44e70396df5888a32ffa02d61e#diff-60849db3e9922197854ef1cac92bf4aba08b5d7fd3fe6f3c16a3511e29e0eacc) library-wide.
|
| 199 |
|
| 200 |
+
But this was not the only effort that allowed us to reduce maintenance load.
|
| 201 |
|
| 202 |
+
We recently undertook a thoughtful refactor of the attention implementation. You've likely heard about [flash attention](https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention) and its several variants.
|
| 203 |
|
| 204 |
+
_Attention computation_ happens at a _lower_ level of abstraction than the model itself.
|
| 205 |
|
| 206 |
+
However, we were adding specific torch operations to every model for each backend (sdpa, the various flash-attention versions, flex attention), and it wasn't a [minimal user api](#minimal-user-api). The next section explains what we did about it.
|
| 207 |
|
| 208 |
<div class="crumbs">
|
| 209 |
+
Evidence: effective (i.e., maintainable) LOC growth drops ~15× when counting shards instead of expanded modeling files. Less code to read, fewer places to break.
|
| 210 |
+
|
| 211 |
+
<strong>Next:</strong> how the attention interface stays standard without hiding semantics.
|
| 212 |
</div>
|
| 213 |
|
| 214 |
### <a id="attention-classes"></a> External Attention classes
|
| 215 |
|
| 216 |
+
The solution for the "attention abstraction problem" was to move to a standard [attention interface](https://huggingface.co/docs/transformers/en/attention_interface) that allows the following:
|
| 217 |
|
| 218 |
+
The naive implementation of attention, called "eager", is available by default. We use a `Callable` called `eager_attention_forward`, which can run as long as the user has PyTorch installed (a requirement anyway).
|
| 219 |
|
| 220 |
+
Instead of using a class interface and a class hierarchy, we just moved to a function interface. When a more complex attention implementation is needed, we use other Callables, including much faster kernel bindings when available. The decision to use a different attention implementation is based on the model configuration file we download from the Hub, and it can also be overridden by the user.
|
| 221 |
|
| 222 |
+
This is a clear example of how we prefer an interface that is [standard, but not abstract](#standardize-dont-abstract). To be precise, this is what the interface selection looks like in transformers code:
|
| 223 |
|
| 224 |
```python
|
| 225 |
attention_interface: Callable = eager_attention_forward
|
|
|
|
| 227 |
attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
|
| 228 |
```
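As a concrete illustration of the override path, here is a hedged sketch of registering a custom attention callable and selecting it at load time; the registry class and the `sdpa_attention_forward` import follow the attention interface documentation, but treat exact paths as version-dependent.

```python
# Hedged sketch: register a custom attention callable and opt into it at load
# time. Names follow the attention-interface docs; exact import paths may vary.
from transformers import AutoModelForCausalLM, AttentionInterface
from transformers.integrations.sdpa_attention import sdpa_attention_forward

def noisy_sdpa(*args, **kwargs):
    # Instrumentation lives outside the modeling file; semantics are unchanged.
    print("entering attention")
    return sdpa_attention_forward(*args, **kwargs)

AttentionInterface.register("noisy_sdpa", noisy_sdpa)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", attn_implementation="noisy_sdpa"
)
```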
|
| 229 |
|
| 230 |
+
A strength of the new attention interface is the possibility to enforce specific kwargs, which are needed by kernel providers and other dependencies. We know that kwargs are often a necessary evil in tools that aim for widespread compatibility; we have worked to reduce them, and will continue to do so to improve readability. Even with them, the current system remains a [minimal user api](#minimal-user-api).
|
| 231 |
|
| 232 |
+
<!-- not fully following the transition here -->
|
| 233 |
+
|
| 234 |
+
Backend integrations sometimes require specific kwargs. We reduce that surface and document expectations; where flexibility is necessary, we plan to use `typing.Annotated` to convey shapes and invariants without constraining integrations. Such an implementation could look like this in the future:
|
| 235 |
|
| 236 |
```python
|
| 237 |
from typing import Annotated
|
|
|
|
| 241 |
|
| 242 |
|
| 243 |
<div class="crumbs">
|
| 244 |
+
Attention semantics remain in <code>eager_attention_forward</code>; faster backends are opt-in via config. We inform via types/annotations rather than enforce rigid kwargs, preserving integrations.
|
| 245 |
+
|
| 246 |
+
<strong>Next:</strong> parallel partitioning is declared as a plan, not through model surgery.
|
| 247 |
</div>
|
| 248 |
|
| 249 |
### <a id="simpler-tensor-parallelism"></a> Configurable Tensor Parallelism
|
| 250 |
|
| 251 |
If you're not familiar with the different flavours of parallelism, I recommend checking out [this blog post](https://huggingface.co/blog/accelerate-nd-parallel) first, and of course a full [dive into the ultra-scale playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook) is always recommended.
|
| 252 |
|
| 253 |
+
The essential part is that, as [the documentation states](https://huggingface.co/docs/transformers/v4.56.2/perf_train_gpu_many#tensor-parallelism), when tensors get too large to fit on a single GPU, they are sliced along a particular dimension and every slice is sent to a different GPU.
|
| 254 |
|
| 255 |
Why does it matter?
|
| 256 |
|
| 257 |
Because we want to avoid code modifications that are unrelated to the model.
|
|
|
|
|
|
|
|
|
|
| 258 |
|
| 259 |
+
We choose to place the level of abstraction higher than the device placement: a matrix multiplication - a `nn.Linear` layer - should be always expressed in the same way, regardless of how it is placed.
|
| 260 |
|
| 261 |
+
Hence, we want to touch the modeling code [minimally](#minimal-user-api), and only modify it when _architectural changes_ are involved, never because of how the model is run. For tensor parallelism, we simply specify a `tp_plan`:
|
| 262 |
|
| 263 |
<HtmlEmbed src="transformers/tp-plan.html" />
|
| 264 |
|
| 265 |
+
The plan is written once, saved as part of the config and passed to `.from_pretrained()`. It maps module name patterns to partitioning strategies. Strategies are resolved by the internal `ParallelInterface`, which wires to sharding implementations `ColwiseParallel`, `RowwiseParallel`, packed variants, and so on.
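For illustration, here is a hedged sketch of what such a plan can look like and how it is picked up at load time; the module patterns below are illustrative, not a complete plan for any specific model.

```python
# Hedged sketch: a tp_plan maps module-name glob patterns to sharding strategies
# (patterns below are illustrative). The plan ships with the model's config;
# tp_plan="auto" tells from_pretrained to read and apply it.
from transformers import AutoModelForCausalLM

example_plan = {
    "model.layers.*.self_attn.q_proj": "colwise",
    "model.layers.*.self_attn.k_proj": "colwise",
    "model.layers.*.self_attn.v_proj": "colwise",
    "model.layers.*.self_attn.o_proj": "rowwise",
    "model.layers.*.mlp.gate_proj": "colwise",
    "model.layers.*.mlp.up_proj": "colwise",
    "model.layers.*.mlp.down_proj": "rowwise",
}

# Under torchrun, each rank loads and shards only the weights it is responsible for.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", tp_plan="auto")
```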
|
| 266 |
+
|
| 267 |
+
The alternative would be to modify classes depending on supported types of parallelism.
|
| 268 |
|
| 269 |
+
The `tp_plan` solution allows users to run the same model on a single GPU, or distribute it using multiple processes per node, e.g. 4 GPUs:
|
| 270 |
|
| 271 |
`torchrun --nproc-per-node 4 demo.py`
|
| 272 |
|
| 273 |
+
Semantics stay in the model (a Linear stays a Linear); parallelization is orthogonal and declared via strings: "colwise" splits columns of weights/bias across ranks; "rowwise" splits rows; packed variants shard fused weights. The mapping keys accept glob patterns like `layers.*.mlp.down_proj` to target repeated submodules.
|
| 274 |
|
| 275 |
<div class="crumbs">
|
| 276 |
+
Parallelization is specified in the configuration (<code>tp_plan</code>), not through edits to <code>Linear</code>s. Glob patterns target repeated blocks; modeling semantics stay intact.
|
| 277 |
+
|
| 278 |
+
<strong>Next:</strong> per-layer attention/caching schedules declared in config, not hardcoded.
|
| 279 |
</div>
|
| 280 |
|
| 281 |
### <a id="layers-attentions-caches"></a> Layers, attentions and caches
|
| 282 |
|
| 283 |
+
Following the same logic, the _nature_ of attention and per-layer caching should not be hardcoded. We should be able to specify in the configuration how each layer is implemented. Thus, we define a mapping like:
|
| 284 |
|
| 285 |
|
| 286 |
```python
|
|
|
|
| 293 |
)
|
| 294 |
```
|
| 295 |
|
| 296 |
+
and the configuration can be _explicit_ about which attention type is in which layer. See, for example, [gpt-oss](https://huggingface.co/openai/gpt-oss-120b/blob/main/config.json#L15), which alternates sliding and full attention:
|
| 297 |
|
| 298 |
```python
|
| 299 |
"layer_types": [
|
|
|
|
| 305 |
],
|
| 306 |
```
|
| 307 |
|
| 308 |
+
This is [minimal](#minimal-user-api) to implement on the user side, allows us to keep the modeling code untouched, and is easy to tweak.
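A hedged sketch of how a decoder layer can consume that schedule (class and attribute names are illustrative, not the exact transformers implementation):

```python
import torch.nn as nn

class MyDecoderLayer(nn.Module):
    # Hedged sketch: the layer reads its own entry from config.layer_types,
    # so the schedule lives in the config, not in hardcoded modeling logic.
    def __init__(self, config, layer_idx: int):
        super().__init__()
        self.attention_type = config.layer_types[layer_idx]  # "sliding_attention" or "full_attention"
        self.sliding_window = (
            config.sliding_window if self.attention_type == "sliding_attention" else None
        )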
|
| 309 |
|
| 310 |
<div class="crumbs">
|
| 311 |
+
Allowed layer types are explicit; schedules (e.g., sliding/full alternation) live in config. This keeps the file readable and easy to tweak.
|
| 312 |
+
|
| 313 |
+
<strong>Next:</strong> speedups come from kernels that don't change semantics.
|
| 314 |
</div>
|
| 315 |
|
| 316 |
|
|
|
|
| 324 |
...
|
| 325 |
```
|
| 326 |
|
| 327 |
+
This also opens another contribution path: GPU specialists can contribute optimized kernels to the [Kernels Hub](https://huggingface.co/kernels-community), and have them immediately available to use in `transformers` and other libraries. You can check the [kernel community blog post](https://huggingface.co/blog/hello-hf-kernels) to learn more about it!
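A hedged sketch of what consuming such a kernel looks like with the `kernels` library; the repository and function names follow the kernels-community examples and may change.

```python
# Hedged sketch: fetch a community kernel from the Hub and call it directly.
# Repo and function names follow the kernels-community examples.
import torch
from kernels import get_kernel

activation = get_kernel("kernels-community/activation")

x = torch.randn(16, 64, dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)
```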
|
| 328 |
|
| 329 |
Even more resources have been added, like the formidable [kernel builder](https://github.com/huggingface/kernel-builder) with its connected resources to [help you build kernels with it](https://github.com/huggingface/kernel-builder/blob/main/docs/writing-kernels.md) and [with nix](https://github.com/huggingface/kernel-builder/blob/main/docs/nix.md).
|
| 330 |
|
|
|
|
| 331 |
<div class="crumbs">
|
| 332 |
+
Models define semantics; kernels define how to run them faster. Use decorators to borrow community forward implementations while keeping a consistent public surface.
|
| 333 |
+
|
| 334 |
+
<strong>Next:</strong> what modularity looks like across the repo.
|
| 335 |
</div>
|
| 336 |
|
| 337 |
+
## The State of Modular
|
| 338 |
+
|
| 339 |
+
Modular provides a form of inheritance in our codebase. Some models become standards, and model contributors have the opportunity to _define standards_ if their architectures are adopted. Pushing the boundaries of scientific knowledge can translate into the boundaries of engineering if this effort is made, and we're striving for it.
|
| 340 |
|
|
|
|
| 341 |
It's hard to conceptualize very large libraries and how their components interact with each other, regardless of your cognitive abilities for abstractions.
|
| 342 |
So I wanted to take a look at the current **state of modularity** across the repository. How many models are defined using components of others?
|
| 343 |
|
|
|
|
| 346 |
2. In this `modular` file, what models, configurations and processings are imported?
|
| 347 |
3. Recurse through the model list that way.
|
| 348 |
|
| 349 |
+
So what do we see? Llama is a basis and an influence for many models, and it shows.
|
| 350 |
Radically different architectures such as mamba have spawned their own dependency subgraph.
|
| 351 |
|
| 352 |
+
<!-- A couple of ideas here:
|
| 353 |
+
- Use screenshots to clearly show the points we make. For example, the cluster with Llama in the center, or the one about DETR/llava below.
|
| 354 |
+
- Use a link to open the viewer full-screen for better manipulation and exploration.
|
| 355 |
+
-->
|
| 356 |
+
|
| 357 |
+
(Graph reading guide: nodes are models; edges are modular imports).
|
| 358 |
|
| 359 |
<HtmlEmbed src="transformers/dependency-graph.html" />
|
| 360 |
|
| 361 |
+
In the case of VLMs, there are far too many vision-based architectures that are not yet defined as modulars of other existing architectures. In other words, there is no strong software reference point for vision models.
|
| 362 |
As you can see, there is a small DETR island, a little llava pocket, and so on, but it's not comparable to the centrality observed for llama.
|
| 363 |
|
| 364 |
+
Another problem is that this visualization only shows `modular` models. Several models still do NOT have a modular file.
|
| 365 |
|
| 366 |
How do we spot them, and how do we identify modularisable models?
|
| 367 |
|
| 368 |
<div class="crumbs">
|
| 369 |
+
Llama-lineage is a hub; several VLMs remain islands: an engineering opportunity for shared parents.
|
| 370 |
+
|
| 371 |
+
<strong>Next:</strong> timeline + similarity signals to spot modularisable candidates.
|
| 372 |
</div>
|
| 373 |
|
| 374 |
|
| 375 |
### Many models, but not enough yet, are alike
|
| 376 |
|
| 377 |
+
I looked into the Jaccard similarity index, which measures the overlap between two sets, to find similarities across models. I know that code is more than a set of characters strung together. We also tried code-embedding models, which ranked candidates better in practice, but for this post we stick to the deterministic Jaccard index.
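For reference, the heuristic is just set overlap over modeling sources; a minimal sketch (file paths and tokenization are illustrative):

```python
# Hedged sketch of the heuristic: tokenize two modeling files into sets and
# score their overlap. Paths and the tokenization scheme are illustrative.
from pathlib import Path

def token_set(path: str) -> set[str]:
    return set(Path(path).read_text().split())

def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

score = jaccard(token_set("modeling_llava.py"), token_set("modeling_llava_next_video.py"))
```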
|
|
|
|
| 378 |
|
| 379 |
+
It is interesting, for our comparison, to look at _when_ we deployed the modular logic and what its rippling effect on the library was. You can check the [larger space](https://huggingface.co/spaces/Molbap/transformers-modular-refactor) to play around, but the gist is: adding modular allowed us to connect more and more models to solid reference points. But we still have a lot of gaps to fill.
|
| 380 |
|
| 381 |
Zoom out below - it's full of models. You can click on a node to see its connections better, or use the text box to search for a model.
|
| 382 |
|
| 383 |
<HtmlEmbed src="transformers/model-timeline.html" />
|
| 384 |
|
| 385 |
+
<!-- screenshot would be helpful -->
|
| 386 |
+
|
| 387 |
+
If you look at llava, you'll see that llava_video is a red node, connected by a red edge to llava: it's a candidate, something that we can _likely_ remodularize, [not touching the actual model](#backwards-compatibility) while being much more readable with [DRY*](#do-repeat-yourself).
|
| 388 |
|
| 389 |
<div class="crumbs">
|
| 390 |
+
Similarity metrics (Jaccard or embeddings) surface likely parents; the timeline shows consolidation after modular landed. Red nodes/edges = candidates (e.g., <code>llava_video</code> → <code>llava</code>) for refactors that preserve behavior.
|
| 391 |
+
|
| 392 |
+
<strong>Next:</strong> concrete VLM choices that avoid leaky abstractions.
|
| 393 |
</div>
|
| 394 |
|
| 395 |
### VLM improvements, avoiding abstraction
|
| 396 |
|
| 397 |
+
We don't yet have a cookbook for common VLM patterns (image token scatter, multi-tower encoders, cross-attention bridges). This is one of the main areas where we can improve.
|
| 398 |
|
| 399 |
For instance, we thought of abstracting away the mixing of `inputs_embeds`, the tensor fed into an LLM decoder in 95% of the existing VLMs. It would have looked something like this:
|
| 400 |
|
|
|
|
| 403 |
#
|
| 404 |
```
|
| 405 |
|
| 406 |
+
But this is [abstracting away an important component of the modeling](#standardize-dont-abstract). The embedding mixing is part of the model; removing it would break it. A user opening [`modeling_qwen2.5_vl`](https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5) should not have to go to another file to understand how it works.
|
| 407 |
+
|
| 408 |
+
<!-- ^ should we link to the code instead? -->
|
| 409 |
|
| 410 |
What is the current state of these โabstractionsโ across the codebase?
|
| 411 |
You will see all the imports around a modeling file, here [Gemma3n](https://huggingface.co/google/gemma-3n-E4B-it).
|
|
|
|
| 462 |
|
| 463 |
But this is _within_ the modeling file, not in the `PreTrainedModel` base class. It will not move away from it, because it'd break the [self-contained logic](#one-model-one-file) of the model.
|
| 464 |
|
| 465 |
+
<!-- So the main conclusion here is that VLMs should use modular more to come up with de-facto standard modules without abstracting them away? -->
|
| 466 |
+
|
| 467 |
<div class="crumbs">
|
| 468 |
+
Keep VLM embedding mix in the modeling file (semantics), standardize safe helpers (e.g., placeholder masking), don't migrate behavior to <code>PreTrainedModel</code>.
|
| 469 |
+
|
| 470 |
+
<strong>Next:</strong> pipeline-level wins that came from PyTorch-first choices (fast processors).
|
| 471 |
</div>
|
| 472 |
|
| 473 |
|
| 474 |
### On image processing and processors
|
| 475 |
|
| 476 |
+
Deciding to become a `torch`-first library meant shedding a tremendous amount of support code for `jax` and `TensorFlow`, and it also meant that we could be more lenient about the amount of torch-dependent utilities we accept. One of these is the _fast processing_ of images. Where inputs were once minimally assumed to be ndarrays, enforcing native `torch` and `torchvision` inputs allowed us to massively improve processing speed for each model.
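A hedged sketch of opting into a fast processor: `use_fast=True` is the documented switch, and the `device` argument keeps preprocessing on GPU, though exact kwargs may evolve across versions.

```python
# Hedged sketch: the fast, torchvision-backed image processor accepts torch
# tensors and can keep the whole preprocessing pipeline on the GPU.
import torch
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", use_fast=True)
images = torch.randint(0, 256, (2, 3, 336, 336), dtype=torch.uint8, device="cuda")
batch = processor(images=images, return_tensors="pt", device="cuda")
```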
|
| 477 |
|
| 478 |
+
The gains in performance are immense: up to 20x speedup for most models when using compiled torchvision ops. Furthermore, the whole pipeline can now run solely on GPU.
|
| 479 |
|
| 480 |

|
| 481 |
<p class="figure-legend">Thanks <a href="https://huggingface.co/yonigozlan">Yoni Gozlan</a> for the great work!</p>
|
| 482 |
|
| 483 |
<div class="crumbs">
|
| 484 |
+
PyTorch-first lets processors assume torch/torchvision and run the whole pipeline on GPU; big per-model speedups.
|
| 485 |
+
|
| 486 |
+
<strong>Next:</strong> how this lowers friction for contributors and downstream users.
|
| 487 |
</div>
|
| 488 |
|
| 489 |
|
|
|
|
| 493 |
|
| 494 |
Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.
|
| 495 |
|
| 496 |
+
Among the most valuable contributions to `transformers` is of course the addition of new models. Very recently, [OpenAI added GPT-OSS](https://huggingface.co/blog/welcome-openai-gpt-oss), which prompted the addition of many new features to the library in order to support [their model](https://huggingface.co/openai/gpt-oss-120b). These additions are immediately available for other models to use.
|
| 497 |
|
| 498 |
+
Another important advantage is the ability to fine-tune and pipeline these models into many other libraries and tools. Check here on the hub how many finetunes are registered for [gpt-oss 120b](https://huggingface.co/models?other=base_model:finetune:openai/gpt-oss-120b), despite its size!
|
| 499 |
|
| 500 |
|
| 501 |
<div class="crumbs">
|
| 502 |
+
The shape of a contribution: add a model (or variant) with a small modular shard; the community and serving stacks pick it up immediately. Popularity trends (encoders/embeddings) guide where we invest.
|
| 503 |
+
|
| 504 |
+
<strong>Next:</strong> power tools enabled by a consistent API.
|
| 505 |
</div>
|
| 506 |
|
| 507 |
|
| 508 |
### <a id="encoders-ftw"></a> Models popularity
|
| 509 |
|
| 510 |
+
Talking about dependencies, we can take a look at the number of downloads as a measure of popularity. One thing we see is the prominence of encoders, despite the apparent prevalence of decoder LLMs. The reason is that encoders are used to generate embeddings, which have multiple downstream uses. Just check out [EmbeddingGemma](https://huggingface.co/blog/embeddinggemma) for a modern recap. Hence, it is vital to keep the encoders portion of the library viable, usable, and fine-tunable.
|
| 511 |
|
| 512 |
<div>
|
| 513 |
<HtmlEmbed src="transformers/model-visualisation.html" />
|
| 514 |
</div>
|
| 515 |
|
| 516 |
+
As the codebase grows, we need to maintain it in coordination with our friends at [Sentence Transformers](https://huggingface.co/sentence-transformers). Retrieval use cases, smart databases, and FAISS-based indexing rely on it, and thus indirectly on transformers.
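A small illustration of that downstream reliance (the model id is just an example):

```python
# Hedged sketch: Sentence Transformers wraps a transformers encoder to produce
# the embeddings that retrieval and FAISS-style indexes are built on.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(["encoders still power retrieval", "and embeddings"])
print(embeddings.shape)  # (2, 384) for this model
```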
|
|
|
|
| 517 |
|
| 518 |
In that regard, we DO want to be a modular toolbox, being [minimal](#minimal-user-api) enough and well documented enough so any ML/AI developer can use `transformers` without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.
|
| 519 |
|
| 520 |
So, how do these design choices, these "tenets" influence development of models and overall usage of transformers?
|
| 521 |
|
| 522 |
<div class="crumbs">
|
| 523 |
+
Encoders remain critical for embeddings and retrieval; maintaining them well benefits the broader ecosystem (e.g., Sentence Transformers, FAISS).
|
| 524 |
+
|
| 525 |
+
<strong>Next:</strong> dev tools that leverage unified attention APIs and PyTorch-only internals.
|
| 526 |
</div>
|
| 527 |
|
| 528 |
|
| 529 |
## A surgical toolbox for model development
|
| 530 |
|
| 531 |
+
Transformers provides many tools that can help you while adding a new architecture, or help you understand the inner workings of the library.
|
| 532 |
+
|
| 533 |
### Attention visualisation
|
| 534 |
|
| 535 |
+
All models have the same internal API for attention computation, thanks to [the externalisation of attention classes](#external-attention-classes). This allows us to build cool tools to visualize the inner workings of the attention mechanism.
|
| 536 |
|
| 537 |
One particular piece of machinery is the `attention mask`. Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual "causal-only" models.
|
| 538 |
|
| 539 |
<HtmlEmbed src="transformers/attention-visualizer.html" />
|
| 540 |
|
| 541 |
<div class="crumbs">
|
| 542 |
+
Uniform attention APIs enable cross-model diagnostics (e.g., PaliGemma prefix bidirectionality vs causal).
|
| 543 |
+
|
| 544 |
+
<strong>Next:</strong> whole-model tracing for ports and regressions.
|
| 545 |
</div>
|
| 546 |
|
| 547 |
|
| 548 |
### Logging entire model activations
|
| 549 |
|
| 550 |
+
Because everything is PyTorch, we can easily [debug any model](https://huggingface.co/docs/transformers/internal/model_debugging_utils) when we want to add it to transformers. We now have a power-user tool for porting or adding models that wraps a forward pass, intercepts every submodule call, and logs shapes, dtypes, and sample statistics of inputs/outputs to nested JSON.
|
| 551 |
|
| 552 |
+
It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, to match our [Source of Truth guideline](#source-of-truth).
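To make the idea concrete, here is a minimal sketch of the mechanism using plain PyTorch forward hooks; it is not the library's actual utility, just the underlying pattern.

```python
# Hedged sketch of the pattern (not the transformers utility itself): intercept
# every submodule forward and dump shape/dtype/statistics to JSON.
import json
import torch
from torch import nn

def trace_forward(model: nn.Module, inputs: dict, path: str = "trace.json") -> None:
    records = []

    def make_hook(name):
        def hook(module, args, output):
            out = output[0] if isinstance(output, tuple) else output
            if torch.is_tensor(out):
                records.append({
                    "module": name,
                    "shape": list(out.shape),
                    "dtype": str(out.dtype),
                    "mean": out.float().mean().item(),
                    "std": out.float().std().item(),
                })
        return hook

    handles = [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules() if n]
    try:
        with torch.no_grad():
            model(**inputs)
    finally:
        for handle in handles:
            handle.remove()
    with open(path, "w") as f:
        json.dump(records, f, indent=2)
```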
|
| 553 |
|
| 554 |

|
| 555 |
|
| 556 |
|
| 557 |
<div class="crumbs">
|
| 558 |
+
Forward interception and nested JSON logging align ports to reference implementations, reinforcing "Source of Truth."
|
| 559 |
+
|
| 560 |
+
<strong>Next:</strong> CUDA warmup reduces load time without touching modeling semantics.
|
| 561 |
</div>
|
| 562 |
|
| 563 |
|
| 564 |
|
| 565 |
### Cooking faster CUDA warmups
|
| 566 |
|
| 567 |
+
Having a clean _external_ API allows us to work on the [true inner workings of transformers](#code-is-product). One of a few recent additions was the _CUDA warmup_ via `caching_allocator_warmup`, which dramatically improved loading times by pre-allocating GPU memory to avoid malloc bottlenecks during model loading. It can achieve a 7x speedup factor for an 8B model, or 6x for a 32B one, as you can check in [the PR](https://github.com/huggingface/transformers/pull/36380)!
|
| 568 |
|
| 569 |
<HtmlEmbed src="transformers/warmup_demo.html" />
|
| 570 |
|
| 571 |
It's hard to overstate how much of a lifesaver that is when you're trying to load a model as fast as possible, as it's the narrowest bottleneck for your iteration speed.
|
| 572 |
|
| 573 |
<div class="crumbs">
|
| 574 |
+
Pre-allocating GPU memory removes malloc spikes (e.g., 7× for 8B, 6× for 32B in the referenced PR).
|
| 575 |
+
|
| 576 |
+
<strong>Next:</strong> consistent interfaces allow transformers-serve.
|
| 577 |
</div>
|
| 578 |
|
| 579 |
|
| 580 |
### Transformers-serve and continuous batching
|
| 581 |
|
| 582 |
+
Having all these models readily available and sharing the same interface allowed us to implement transformers-serve, a CLI tool to expose models through an OpenAI-compatible HTTP API.
|
| 583 |
|
| 584 |
```bash
|
| 585 |
transformers serve
|
|
|
|
| 589 |
-d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 0.9, "max_tokens": 1000, "stream": true, "model": "Qwen/Qwen2.5-0.5B-Instruct"}'
|
| 590 |
```
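Because the server speaks the OpenAI API, the standard Python client works against it too; a minimal sketch (port and model id match the curl example above):

```python
# Hedged sketch: point the standard OpenAI client at the local transformers server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "hello"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```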
|
| 591 |
|
| 592 |
+
transformers-serve uses continuous batching (see [this PR](https://github.com/huggingface/transformers/pull/38085) and also [this one](https://github.com/huggingface/transformers/pull/40426)) for better GPU utilization, and is very much linked to the great work of vLLM with the `paged attention kernel`, a further justification of [external kernels](#community-kernels).
|
| 593 |
|
| 594 |
+
transformers-serve is not meant for user-facing production services (tools like vLLM or SGLang are heavily optimized for that), but it's useful for several use cases:
|
| 595 |
+
- Quickly verify that your model is compatible with continuous batching and paged attention.
|
| 596 |
+
- Run ad-hoc vibe tests on any model, without having to deploy anything.
|
| 597 |
+
- Run evaluations efficiently, again without having to spend a lot of time engineering your infrastructure.
|
| 598 |
+
|
| 599 |
+
For model deployment, check [Inference Providers](https://huggingface.co/docs/inference-providers/en/index) or roll your own solution using any of the excellent serving libraries.
|
| 600 |
|
| 601 |
<div class="crumbs">
|
| 602 |
+
OpenAI-compatible surface + continuous batching; kernels/backends slot in because the modeling API stayed stable.
|
| 603 |
+
|
| 604 |
+
<strong>Next:</strong> reuse across vLLM/SGLang relies on the same consistency.
|
| 605 |
</div>
|
| 606 |
|
| 607 |
|
| 608 |
## Community reusability
|
| 609 |
|
| 610 |
+
The transformers-serve CLI is built on transformers, of course, but the library is made first and foremost to be _reused_ at large by the open-source ecosystem.
|
| 611 |
|
| 612 |
Adding a model to transformers means:
|
| 613 |
|
| 614 |
- having it immediately available to the community
|
| 615 |
+
- having it immediately usable in vLLM, [SGLang](https://huggingface.co/blog/transformers-backend-sglang), and so on without additional code. In the case of vLLM, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures [as seen in this great vLLM x HF blog post.](https://blog.vllm.ai/2025/04/11/transformers-backend.html)
|
| 616 |
+
- being the reference code for implementations in MLX, llama.cpp and other libraries.
|
| 617 |
|
| 618 |
+
This further cements the need for a [consistent public surface](#consistent-public-surface): we are a backend and a reference, and there's more software than us to handle serving. At the time of writing, more work is underway in that direction. We already have compatible configs for VLMs for vLLM (say that three times fast): check [here for GLM4 video support](https://github.com/huggingface/transformers/pull/40696/files), and here for [MoE support](https://github.com/huggingface/transformers/pull/40132), for instance.
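A hedged sketch of the vLLM side of that reuse; the `model_impl="transformers"` switch follows the vLLM x HF blog post and should be treated as potentially version-dependent.

```python
# Hedged sketch: serve a transformers-defined model through vLLM's transformers backend.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", model_impl="transformers")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```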
|
| 619 |
|
| 620 |
|
| 621 |
<div class="crumbs">
|
| 622 |
+
Being a good backend consumer requires a consistent public surface; modular shards and configs make that stability practical.
|
| 623 |
+
|
| 624 |
+
<strong>Next:</strong> what changes in v5 without breaking the promise of visible semantics.
|
| 625 |
</div>
|
| 626 |
|
| 627 |
## What is coming next
|
| 628 |
|
| 629 |
+
The next major version of `transformers` is just around the corner (and will have another blog post to its name when it comes out). When v5 is released, we aim to keep [backwards compatibility](#backwards-compatibility) as solid as possible. The changes we make now are in service of that goal.
|
| 630 |
|
| 631 |
We will lean further into a modular toolbox, not a framework. You should not be forced to rewrite modeling code. Itโs better when a model can inherit from `PreTrainedModel` and opt into Tensor Parallel, `from_pretrained`, sharding, `push_to_hub`, loss plumbing, and external stacks like PEFT/TRL/SGLang/vLLM.
|
| 632 |
+
|
| 633 |
+
<!-- Maybe end with some statement that shows lots of excitement -->
|