LFM2-24B-A2B
LFM2 is a family of hybrid models designed for on-device deployment. LFM2-24B-A2B is the largest model in the family, scaling the architecture to 24 billion parameters while keeping inference efficient.
- Best-in-class efficiency: A 24B MoE model with only 2B active parameters per token, fitting in 32 GB of RAM for deployment on consumer laptops and desktops.
- Fast edge inference: 112 tok/s decode on AMD CPU, 293 tok/s on H100. Fits in 32 GB of RAM with day-one support in llama.cpp, vLLM, and SGLang.
- Predictable scaling: Quality improves log-linearly from 350M to 24B total parameters, confirming the LFM2 hybrid architecture scales reliably across nearly two orders of magnitude.
Find more information about LFM2-24B-A2B in our blog post.
Model Details
LFM2-24B-A2B is a general-purpose instruct model (without reasoning traces) with the following features:
| Property | LFM2-8B-A1B | LFM2-24B-A2B |
|---|---|---|
| Total parameters | 8.3B | 24B |
| Active parameters | 1.5B | 2.3B |
| Layers | 24 (18 conv + 6 attn) | 40 (30 conv + 10 attn) |
| Context length | 32,768 tokens | 32,768 tokens |
| Vocabulary size | 65,536 | 65,536 |
| Training precision | Mixed BF16/FP8 | Mixed BF16/FP8 |
| Training budget | 12 trillion tokens | 17 trillion tokens |
| License | LFM Open License v1.0 | LFM Open License v1.0 |
Supported languages: English, Arabic, Chinese, French, German, Japanese, Korean, Spanish, Portuguese
Generation parameters:
- temperature: 0.1
- top_k: 50
- repetition_penalty: 1.05
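These defaults can be bundled into a reusable GenerationConfig (a sketch assuming the standard Hugging Face transformers API; pass it to model.generate via the generation_config argument):

```python
from transformers import GenerationConfig

# Recommended sampling defaults for LFM2-24B-A2B
lfm2_generation = GenerationConfig(
    do_sample=True,          # temperature/top_k only take effect when sampling
    temperature=0.1,
    top_k=50,
    repetition_penalty=1.05,
)

# Later: model.generate(**inputs, generation_config=lfm2_generation)
```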
We recommend the following use cases:
- Agentic tool use: Native function calling, web search, structured outputs. Ideal as the fast inner-loop model in multi-step agent pipelines.
- Offline document summarization and Q&A: Run entirely on consumer hardware for privacy-sensitive workflows (legal, medical, corporate).
- Privacy-preserving customer support agent: Deployed on-premise at a company, handles multi-turn support conversations with tool access (database lookups, ticket creation) without data leaving the network.
- Local RAG pipelines: Serve as the generation backbone in retrieval-augmented setups on a single machine without GPU servers.
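As a sketch of the last use case, the inner loop of a local RAG pipeline reduces to retrieval plus prompt assembly (build_rag_prompt and its naive keyword scorer are illustrative placeholders; a real setup would use embedding search and feed the prompt to any of the inference paths below):

```python
def build_rag_prompt(question, documents, top_k=2):
    """Naive keyword retrieval + prompt assembly for a local RAG loop."""
    # Score documents by how many question words they contain (placeholder scorer)
    scored = sorted(
        documents,
        key=lambda d: sum(w in d.lower() for w in question.lower().split()),
        reverse=True,
    )
    context = "\n\n".join(scored[:top_k])
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

docs = [
    "C. elegans is a free-living nematode about 1 mm in length.",
    "The H100 is a data-center GPU.",
]
prompt = build_rag_prompt("What is C. elegans?", docs)
```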
Chat Template
LFM2-24B-A2B uses a ChatML-like format. See the Chat Template documentation for details. Example:
<|startoftext|><|im_start|>system
You are a helpful assistant trained by Liquid AI.<|im_end|>
<|im_start|>user
What is C. elegans?<|im_end|>
<|im_start|>assistant
You can use tokenizer.apply_chat_template() to format your messages automatically.
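For illustration, the template above can be reproduced with a small hand-rolled helper (to_lfm2_prompt is hypothetical, shown only to make the token layout explicit; prefer tokenizer.apply_chat_template() in practice):

```python
def to_lfm2_prompt(messages, add_generation_prompt=True):
    """Render messages in the ChatML-like format shown above."""
    parts = ["<|startoftext|>"]
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = to_lfm2_prompt([
    {"role": "system", "content": "You are a helpful assistant trained by Liquid AI."},
    {"role": "user", "content": "What is C. elegans?"},
])
print(prompt)
```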
Tool Use
LFM2-24B-A2B supports function calling as follows:
- Function definition: We recommend providing the list of tools as a JSON object in the system prompt. You can also use the tokenizer.apply_chat_template() function with tools.
- Function call: By default, LFM2-24B-A2B writes Pythonic function calls (a Python list between <|tool_call_start|> and <|tool_call_end|> special tokens) as the assistant answer. You can override this behavior by asking the model to output JSON function calls in the system prompt.
- Function execution: The function call is executed, and the result is returned as a "tool" role.
- Final answer: LFM2-24B-A2B interprets the outcome of the function call to address the original user prompt in plain text.
See the Tool Use documentation for the full guide. Example:
<|startoftext|><|im_start|>system
List of tools: [{"name": "get_candidate_status", "description": "Retrieves the current status of a candidate in the recruitment process", "parameters": {"type": "object", "properties": {"candidate_id": {"type": "string", "description": "Unique identifier for the candidate"}}, "required": ["candidate_id"]}}]<|im_end|>
<|im_start|>user
What is the current status of candidate ID 12345?<|im_end|>
<|im_start|>assistant
<|tool_call_start|>[get_candidate_status(candidate_id="12345")]<|tool_call_end|>Checking the current status of candidate ID 12345.<|im_end|>
<|im_start|>tool
[{"candidate_id": "12345", "status": "Interview Scheduled", "position": "Clinical Research Associate", "date": "2023-11-20"}]<|im_end|>
<|im_start|>assistant
The candidate with ID 12345 is currently in the "Interview Scheduled" stage for the position of Clinical Research Associate, with an interview date set for 2023-11-20.<|im_end|>
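On the client side, the Pythonic call between the special tokens can be parsed safely with the standard ast module (parse_tool_calls is a hypothetical helper sketch, not part of any official SDK; it assumes keyword arguments, as in the example above):

```python
import ast

def parse_tool_calls(text):
    """Extract Pythonic tool calls emitted between the special tokens."""
    start, end = "<|tool_call_start|>", "<|tool_call_end|>"
    if start not in text or end not in text:
        return []
    payload = text.split(start, 1)[1].split(end, 1)[0]
    calls = []
    # The payload is a Python list of calls, e.g. [f(x="1"), g(y="2")]
    for node in ast.parse(payload, mode="eval").body.elts:
        calls.append({
            "name": node.func.id,
            "arguments": {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords},
        })
    return calls

reply = '<|tool_call_start|>[get_candidate_status(candidate_id="12345")]<|tool_call_end|>'
print(parse_tool_calls(reply))
# → [{'name': 'get_candidate_status', 'arguments': {'candidate_id': '12345'}}]
```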
Inference
LFM2-24B-A2B is supported by many inference frameworks. See the Inference documentation for the full list.
| Name | Description | Docs | Notebook |
|---|---|---|---|
| Transformers | Simple inference with direct access to model internals. | Link | Link |
| vLLM | High-throughput production deployments with GPU. | Link | Link |
| llama.cpp | Cross-platform inference with CPU offloading. | Link | Link |
| MLX | Apple's machine learning framework optimized for Apple Silicon. | Link | ❌ |
| LM Studio | Desktop application for running LLMs locally. | Link | ❌ |
Transformers.js
If you haven't already, you can install the Transformers.js JavaScript library from NPM using:
npm i @huggingface/transformers
You can then use the model as follows:
import { pipeline, TextStreamer } from "@huggingface/transformers";
// Create a text generation pipeline
const generator = await pipeline(
"text-generation",
"onnx-community/LFM2-24B-A2B-ONNX",
{ dtype: "q4f16", device: "webgpu" },
);
// Define the list of messages
const messages = [
{ role: "user", content: "What's the capital of France?" },
];
// Generate a response
const output = await generator(messages, {
max_new_tokens: 512,
do_sample: false,
streamer: new TextStreamer(generator.tokenizer, {
skip_prompt: true,
skip_special_tokens: true,
}),
});
console.log(output[0].generated_text.at(-1).content);
Or try out our online WebGPU demo.
ONNX Runtime
Here's a quick start example with ONNX Runtime:
from transformers import AutoConfig, AutoTokenizer
import onnxruntime
import numpy as np
from huggingface_hub import snapshot_download
# 1. Load config, processor, and model
model_id = "onnx-community/LFM2-24B-A2B-ONNX"
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
eos_token_id = config.eos_token_id
filename = "model_q4f16.onnx" # Options: "model_fp16.onnx", "model_q4f16.onnx"
model_path = snapshot_download(repo_id=model_id, allow_patterns=f"onnx/{filename}*") # Download the graph + weights
session = onnxruntime.InferenceSession(f"{model_path}/onnx/{filename}")
# 2. Prepare inputs
prompt = "What is C. elegans?"
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="np")
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
batch_size = input_ids.shape[0]
num_logits_to_keep = np.array(1, dtype=np.int64)
past_cache_values = {}
for inp in session.get_inputs():
    name = inp.name
    shape = inp.shape
    dtype = np.float32 if inp.type == "tensor(float)" else np.float16
    if name.startswith("past_key_values"):
        # Attention KV cache: shape [batch_size, num_kv_heads, 0, head_dim]
        past_cache_values[name] = np.zeros([batch_size, shape[1], 0, shape[3]], dtype=dtype)
    elif name.startswith("past_conv"):
        # Conv cache: shape [batch_size, hidden_size, conv_L_cache]
        past_cache_values[name] = np.zeros([batch_size, shape[1], shape[2]], dtype=dtype)
# 3. Generation loop
max_new_tokens = 1024
generated_tokens = np.array([[]], dtype=np.int64)
for i in range(max_new_tokens):
    logits, *present_cache_values = session.run(None, dict(
        input_ids=input_ids,
        attention_mask=attention_mask,
        num_logits_to_keep=num_logits_to_keep,
        **past_cache_values,
    ))

    ## Update values for next generation loop
    input_ids = logits[:, -1].argmax(-1, keepdims=True)
    attention_mask = np.concatenate([attention_mask, np.ones_like(input_ids, dtype=np.int64)], axis=-1)
    for j, key in enumerate(past_cache_values):
        past_cache_values[key] = present_cache_values[j]
    generated_tokens = np.concatenate([generated_tokens, input_ids], axis=-1)
    if np.isin(input_ids, eos_token_id).any():
        break

    ## (Optional) Streaming
    print(tokenizer.decode(input_ids[0]), end='', flush=True)
print()
# 4. Output result
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])
Fine-Tuning
| Name | Description | Docs | Notebook |
|---|---|---|---|
| CPT (Unsloth) | Continued Pre-Training using Unsloth for text completion. | Link | Link |
| CPT (Unsloth) | Continued Pre-Training using Unsloth for translation. | Link | Link |
| SFT (Unsloth) | Supervised Fine-Tuning with LoRA using Unsloth. | Link | Link |
| SFT (TRL) | Supervised Fine-Tuning with LoRA using TRL. | Link | Link |
| DPO (TRL) | Direct Preference Optimization with LoRA using TRL. | Link | Link |
| GRPO (Unsloth) | GRPO with LoRA using Unsloth. | Link | Link |
| GRPO (TRL) | GRPO with LoRA using TRL. | Link | Link |
Performance
CPU Inference
We compared LFM2-24B-A2B against two popular MoE models of similar size: Qwen3-30B-A3B-Instruct-2507 (30.5B total, 3.3B active parameters) and gpt-oss-20b (21B total, 3.6B active parameters). We measured both prefill and decode throughputs with Q4_K_M versions of these models using llama.cpp on AMD Ryzen AI Max+ 395.
GPU Inference
We also report throughput (total tokens / wall time) achieved with vLLM on a single H100 SXM5 GPU.
Contact
For enterprise solutions and edge deployment, contact sales@liquid.ai.
Citation
@article{liquidAI202624B,
author = {Liquid AI},
title = {LFM2-24B-A2B: Scaling Up the LFM2 Architecture},
journal = {Liquid AI Blog},
year = {2026},
note = {www.liquid.ai/blog/},
}
@article{liquidai2025lfm2,
title={LFM2 Technical Report},
author={Liquid AI},
journal={arXiv preprint arXiv:2511.23404},
year={2025}
}