# EXAONE MoE

## Overview

The **[K-EXAONE](https://github.com/LG-AI-EXAONE/K-EXAONE)** model is a large-scale multilingual language model developed by LG AI Research. It adopts a Mixture-of-Experts architecture named `EXAONE-MoE`, with 236B parameters in total, of which 23B are activated at inference time. Evaluations across a wide range of benchmarks demonstrate K-EXAONE's reasoning ability, agentic capabilities, general knowledge, multilingual understanding, and long-context processing.

### Key Architecture and Features

- **Architectural improvements and efficiency:** Adopts a fine-grained 236B-parameter MoE architecture (23B activated) and applies self-speculative decoding in the form of **Multi-Token Prediction (MTP)**, achieving roughly 1.5x inference throughput.
- **Long-context processing:** Natively supports a 256K-token context window; the **3:1 hybrid attention** pattern combined with a **128-token sliding window** greatly reduces memory usage when processing long documents.
- **Multilingual support:** Officially supports six languages: Korean, English, Spanish, German, Japanese, and Vietnamese. A newly designed **SuperBPE**-based tokenizer with a **150k vocabulary** improves token efficiency by about 30% (see the sketch after this list).
- **Agentic capabilities:** Demonstrates strong tool-use and search abilities through a **multi-agent strategy**.
- **Safety & ethics:** Designed to answer sensitive topics correctly and safely in Korean cultural and historical contexts, which other models often overlook. It also shows high reliability across a wide range of other risks, in alignment with universal human values.
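
To gauge the tokenizer's efficiency on your own text, you can simply compare token counts. A minimal sketch (the ~30% figure above is the authors' aggregate claim; this snippet only counts tokens for one sample sentence):

```python
from transformers import AutoTokenizer

# Tokenize a short Korean sentence and count the resulting tokens.
tokenizer = AutoTokenizer.from_pretrained("LGAI-EXAONE/K-EXAONE-236B-A23B")
text = "안녕하세요, K-EXAONE은 한국어를 포함한 6개 언어를 공식 지원합니다."
print(len(tokenizer(text)["input_ids"]))
```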

For more details, please refer to the [technical report](https://www.lgresearch.ai/data/cdn/upload/K-EXAONE_Technical_Report.pdf) or the official [GitHub](https://github.com/LG-AI-EXAONE/K-EXAONE) page.

All released model checkpoints are available in the [Hugging Face collection](https://huggingface.co/collections/LGAI-EXAONE/k-exaone).

## Model Details

- Number of Parameters: 236B in total and 23B activated
- Number of Parameters (without embeddings): 234B
- Hidden Dimension: 6,144
- Number of Layers: 48 main layers + 1 MTP layer
  - Hybrid Attention Pattern: 12 x (3 Sliding window attention + 1 Global attention)
- Sliding Window Attention
  - Number of Attention Heads: 64 Q-heads and 8 KV-heads
  - Head Dimension: 128 for both Q/KV
  - Sliding Window Size: 128
- Global Attention
  - Number of Attention Heads: 64 Q-heads and 8 KV-heads
  - Head Dimension: 128 for both Q/KV
  - No Rotary Positional Embedding Used (NoPE)
- Mixture of Experts:
  - Number of Experts: 128
  - Number of Activated Experts: 8
  - Number of Shared Experts: 1
  - MoE Intermediate Size: 2,048
- Vocab Size: 153,600
- Context Length: 262,144 tokens
- Knowledge Cutoff: Dec 2024 (2024/12)
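
The details above map directly onto the `ExaoneMoeConfig` fields documented later on this page. Below is an illustrative sketch that reconstructs the documented shape; for actual use, load the released configuration via `ExaoneMoeConfig.from_pretrained("LGAI-EXAONE/K-EXAONE-236B-A23B")`:

```python
from transformers import ExaoneMoeConfig

# Illustrative only: mirrors the numbers listed above; any field not set here
# keeps its library default.
config = ExaoneMoeConfig(
    vocab_size=153_600,
    hidden_size=6_144,
    num_hidden_layers=48,
    num_attention_heads=64,         # Q-heads
    num_key_value_heads=8,          # KV-heads (GQA)
    sliding_window=128,             # 128-token sliding window
    sliding_window_pattern="LLLG",  # 3 local : 1 global hybrid attention
    max_position_embeddings=262_144,
    first_k_dense_replace=1,        # dense MLP layer(s) before the MoE layers
    moe_intermediate_size=2_048,
    num_experts=128,
    num_experts_per_tok=8,
    num_shared_experts=1,
)
```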

## Usage Tips

### Precautions

> [!IMPORTANT]
> To get the performance the model was designed for, please keep to the following settings.
> - For best results, we recommend `temperature=1.0`, `top_p=0.95`, and `presence_penalty=0.0` (see the snippet after this callout).
> - Unlike the earlier EXAONE-4.0 models, K-EXAONE uses `enable_thinking=True` by default, so you must set `enable_thinking=False` to use non-reasoning mode.
>
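
The recommended sampling settings can be applied once as the model's generation defaults so that every `generate()` call picks them up. A minimal sketch, assuming `model` has been loaded as in the examples below; note that `presence_penalty` is a parameter of OpenAI-compatible serving stacks (e.g., vLLM) rather than of `transformers` `generate()`:

```python
# Apply the recommended sampling defaults once, on the model itself.
model.generation_config.do_sample = True
model.generation_config.temperature = 1.0
model.generation_config.top_p = 0.95
# `presence_penalty` has no direct equivalent in `generate()`; set it in your
# serving stack (e.g., vLLM's SamplingParams) when serving the model instead.
```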

### Reasoning mode

For tasks where accurate results matter, you can run the K-EXAONE model in reasoning mode as shown below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "LGAI-EXAONE/K-EXAONE-236B-A23B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "system", "content": "You are K-EXAONE, a large language model developed by LG AI Research in South Korea, built to serve as a helpful and reliable assistant."},
    {"role": "user", "content": "Which one is bigger, 3.9 vs 3.12?"}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,       # return a BatchEncoding so it can be unpacked into generate()
    return_tensors="pt",
    enable_thinking=True,   # skippable (default: True)
)

generated_ids = model.generate(
    **input_ids.to(model.device),
    max_new_tokens=16384,
    temperature=1.0,
    top_p=0.95,
)
output_ids = generated_ids[0][input_ids['input_ids'].shape[-1]:]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```
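
If you only need the final answer, you can strip the reasoning trace from the decoded text. A minimal sketch, assuming the trace is terminated by a `</think>` marker as in earlier EXAONE reasoning models; check your checkpoint's chat template for the exact delimiter:

```python
text = tokenizer.decode(output_ids, skip_special_tokens=True)
# Assumption: the reasoning trace ends with a literal "</think>" marker.
# If the marker is absent, the full text is returned unchanged.
answer = text.split("</think>", 1)[-1].strip()
print(answer)
```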

### Non-reasoning mode

When speed matters more than accuracy, you can run the K-EXAONE model in non-reasoning mode as shown below.

```python
messages = [
    {"role": "system", "content": "You are K-EXAONE, a large language model developed by LG AI Research in South Korea, built to serve as a helpful and reliable assistant."},
    {"role": "user", "content": "Explain how wonderful you are"}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,   # return a BatchEncoding so it can be unpacked into generate()
    return_tensors="pt",
    enable_thinking=False,
)

generated_ids = model.generate(
    **input_ids.to(model.device),
    max_new_tokens=1024,
    temperature=1.0,
    top_p=0.95,
)
output_ids = generated_ids[0][input_ids['input_ids'].shape[-1]:]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```

### Agentic tool use

K-EXAONE's tool-use capabilities come into their own when building AI-based agents.
The K-EXAONE model follows the tool-use specifications of OpenAI and Hugging Face.
The example below exercises tool use through Hugging Face's utility for converting a docstring into a tool schema.

For an actual conversation log from a search agent built with K-EXAONE, see the [example file on GitHub](https://github.com/LG-AI-EXAONE/K-EXAONE/blob/main/examples/example_output_search.txt).

```python
import random

from transformers.utils import get_json_schema

def roll_dice(max_num: int):
    """
    Roll a dice with the number 1 to N. User can select the number N.

    Args:
        max_num: The maximum number on the dice.
    """
    return random.randint(1, max_num)

tool_schema = get_json_schema(roll_dice)
tools = [tool_schema]

messages = [
    {"role": "system", "content": "You are K-EXAONE, a large language model developed by LG AI Research in South Korea, built to serve as a helpful and reliable assistant."},
    {"role": "user", "content": "Roll a D20 twice and sum the results."}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,   # return a BatchEncoding so it can be unpacked into generate()
    return_tensors="pt",
    tools=tools,
)

generated_ids = model.generate(
    **input_ids.to(model.device),
    max_new_tokens=16384,
    temperature=1.0,
    top_p=0.95,
)
output_ids = generated_ids[0][input_ids['input_ids'].shape[-1]:]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```
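
After the model emits a tool call, you still need to execute the tool and feed the result back so the model can produce a final answer. The sketch below shows that round trip under two stated assumptions: the tool-call arguments have already been parsed from the generated text (the concrete serialization is checkpoint-specific), and the chat template accepts `assistant`/`tool` messages in the OpenAI/Hugging Face convention the model is documented to follow:

```python
# Assumption: `tool_call` was parsed from the generated text above.
tool_call = {"name": "roll_dice", "arguments": {"max_num": 20}}
result = roll_dice(**tool_call["arguments"])

# Record the call and its result using the OpenAI/HF message convention.
messages.append({
    "role": "assistant",
    "tool_calls": [{"type": "function", "function": tool_call}],
})
messages.append({"role": "tool", "name": tool_call["name"], "content": str(result)})

input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    tools=tools,
)
generated_ids = model.generate(
    **input_ids.to(model.device),
    max_new_tokens=16384,
    temperature=1.0,
    top_p=0.95,
)
print(tokenizer.decode(generated_ids[0][input_ids["input_ids"].shape[-1]:], skip_special_tokens=True))
```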

## ExaoneMoeConfig[[transformers.ExaoneMoeConfig]]

#### transformers.ExaoneMoeConfig[[transformers.ExaoneMoeConfig]]

[Source](https://github.com/huggingface/transformers/blob/v5.3.0/src/transformers/models/exaone_moe/configuration_exaone_moe.py#L24)

This is the configuration class to store the configuration of a [*ExaoneMoeModel*]. It is used to
instantiate an EXAONE MoE model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a configuration similar to that of K-EXAONE-236B-A23B [LGAI-EXAONE/K-EXAONE-236B-A23B](https://huggingface.co/LGAI-EXAONE/K-EXAONE-236B-A23B).

Configuration objects inherit from [*PreTrainedConfig*] and can be used to control the model
outputs. Read the documentation from [*PreTrainedConfig*] for more information.

Example:

```python
>>> from transformers import ExaoneMoeModel, ExaoneMoeConfig

>>> # Initializing an EXAONE MoE configuration
>>> configuration = ExaoneMoeConfig()

>>> # Initializing a model from configuration
>>> model = ExaoneMoeModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```

**Parameters:**

vocab_size (*int*, *optional*, defaults to 102400) : Vocabulary size of the EXAONE MoE model. Defines the number of different tokens that can be represented by the *inputs_ids* passed when calling [*ExaoneMoeModel*].

hidden_size (*int*, *optional*, defaults to 4096) : Dimension of the hidden representations.

intermediate_size (*int*, *optional*, defaults to 16384) : Dimensionality of the MLP representations.

num_hidden_layers (*int*, *optional*, defaults to 32) : Number of hidden layers in the Transformer encoder.

num_attention_heads (*int*, *optional*, defaults to 32) : Number of attention heads for each attention layer in the Transformer decoder.

num_key_value_heads (*int*, *optional*, defaults to 32) : This is the number of key_value heads that should be used to implement Grouped Query Attention. If *num_key_value_heads=num_attention_heads*, the model will use Multi Head Attention (MHA), if *num_key_value_heads=1* the model will use Multi Query Attention (MQA), otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out [this paper](https://huggingface.co/papers/2305.13245). If it is not specified, will default to *num_attention_heads*.

hidden_act (*str* or *function*, *optional*, defaults to *"silu"*) : The non-linear activation function (function or string) in the decoder.

max_position_embeddings (*int*, *optional*, defaults to 2048) : The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 32768 for EXAONE 3.5).

initializer_range (*float*, *optional*, defaults to 0.02) : The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

rms_norm_eps (*float*, *optional*, defaults to 1e-05) : The epsilon used by the layer normalization layers.

use_cache (*bool*, *optional*, defaults to *True*) : Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if `config.is_decoder=True`.

bos_token_id (*int*, *optional*, defaults to 1) : Beginning of stream token id.

eos_token_id (*int*, *optional*, defaults to 53) : End of stream token id.

pad_token_id (*int*, *optional*, defaults to 0) : Padding token id.

tie_word_embeddings (*bool*, *optional*, defaults to *False*) : Whether to tie weight embeddings

rope_parameters (*RopeParameters*, *optional*) : Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain a value for *rope_theta* and optionally parameters used for scaling in case you want to use RoPE with longer *max_position_embeddings*.

attention_dropout (*float*, *optional*, defaults to 0.0) : The dropout ratio for the attention probabilities.

sliding_window (*int*, *optional*, defaults to 4096) : The size of the sliding window for the sliding window attention.

sliding_window_pattern (*str* or *int*, *optional*, defaults to 4) : The pattern to use for sliding window attention. Can be one of: *None* (no sliding window attention is used); an *int* (every *sliding_window_pattern* layers use global attention, all other layers use local attention); or a *str* of "L" (local attention) and "G" (global attention) characters that defines the attention pattern, starting from layer 0 and repeating every *sliding_window_pattern* layers, with the final layer always using global attention regardless of the pattern. For instance, *sliding_window_pattern="LLLG"* is equivalent to *sliding_window_pattern=4*: layers 0, 1, and 2 use local attention, layer 3 uses global attention, and the pattern repeats.

layer_types (*list*, *optional*) : Attention pattern for each layer. Prioritized over *sliding_window_pattern*.

mlp_layer_types (*list*, *optional*) : MLP pattern for each layer. Prioritized over *first_k_dense_replace*.

first_k_dense_replace (*int*, *optional*, defaults to 1) : Number of dense (non-MoE) MLP layers at the beginning of the model, i.e. the layer stack is embed → k dense layers → MoE layers → lm_head.

moe_intermediate_size (*int*, *optional*, defaults to 1024) : Dimension of the MoE representations.

num_experts (*int*, *optional*, defaults to 64) : Number of routed experts.

num_experts_per_tok (*int*, *optional*, defaults to 8) : Number of selected experts per token; *None* means a dense model.

num_shared_experts (*int*, *optional*, defaults to 1) : Number of shared experts.

norm_topk_prob (*bool*, *optional*, defaults to *True*) : Whether to normalize the weights of the routed experts.

routed_scaling_factor (*float*, *optional*, defaults to 2.5) : Scaling factor for routed experts.

n_group (*int*, *optional*, defaults to 1) : Number of groups for routed experts.

topk_group (*int*, *optional*, defaults to 1) : Number of selected groups for each token (ensuring that the selected experts for each token lie within only *topk_group* groups).

## ExaoneMoeModel[[transformers.ExaoneMoeModel]]

#### transformers.ExaoneMoeModel[[transformers.ExaoneMoeModel]]

[Source](https://github.com/huggingface/transformers/blob/v5.3.0/src/transformers/models/exaone_moe/modeling_exaone_moe.py#L486)

The bare Exaone Moe Model outputting raw hidden-states without any specific head on top.

This model inherits from [PreTrainedModel](/docs/transformers/v5.3.0/ko/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads,
etc.)

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
and behavior.

**forward** ([Source](https://github.com/huggingface/transformers/blob/v5.3.0/src/transformers/models/exaone_moe/modeling_exaone_moe.py#L503))

`forward(input_ids: torch.LongTensor | None = None, attention_mask: torch.Tensor | None = None, position_ids: torch.LongTensor | None = None, past_key_values: Cache | None = None, inputs_embeds: torch.FloatTensor | None = None, use_cache: bool | None = None, cache_position: torch.LongTensor | None = None, **kwargs: Unpack[TransformersKwargs])`

**Parameters:**

config ([ExaoneMoeConfig](/docs/transformers/v5.3.0/ko/model_doc/exaone_moe#transformers.ExaoneMoeConfig)) : Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [from_pretrained()](/docs/transformers/v5.3.0/ko/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights.

## ExaoneMoeForCausalLM[[transformers.ExaoneMoeForCausalLM]]

#### transformers.ExaoneMoeForCausalLM[[transformers.ExaoneMoeForCausalLM]]

[Source](https://github.com/huggingface/transformers/blob/v5.3.0/src/transformers/models/exaone_moe/modeling_exaone_moe.py#L577)

The Exaone Moe Model for causal language modeling.

This model inherits from [PreTrainedModel](/docs/transformers/v5.3.0/ko/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads,
etc.)

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
and behavior.

**forward** ([Source](https://github.com/huggingface/transformers/blob/v5.3.0/src/transformers/models/exaone_moe/modeling_exaone_moe.py#L591))

`forward(input_ids: torch.LongTensor | None = None, attention_mask: torch.Tensor | None = None, position_ids: torch.LongTensor | None = None, past_key_values: Cache | None = None, inputs_embeds: torch.FloatTensor | None = None, labels: torch.LongTensor | None = None, use_cache: bool | None = None, cache_position: torch.LongTensor | None = None, logits_to_keep: int | torch.Tensor = 0, **kwargs: Unpack[TransformersKwargs])`

**Parameters:**

- **input_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) --
  Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.

  Indices can be obtained using [AutoTokenizer](/docs/transformers/v5.3.0/ko/model_doc/auto#transformers.AutoTokenizer). See [PreTrainedTokenizer.encode()](/docs/transformers/v5.3.0/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.encode) and
  [PreTrainedTokenizer.__call__()](/docs/transformers/v5.3.0/ko/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.__call__) for details.

  [What are input IDs?](../glossary#input-ids)
- **attention_mask** (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) --
  Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

  - 1 for tokens that are **not masked**,
  - 0 for tokens that are **masked**.

  [What are attention masks?](../glossary#attention-mask)
- **position_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) --
  Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, config.n_positions - 1]`.

  [What are position IDs?](../glossary#position-ids)
- **past_key_values** (`~cache_utils.Cache`, *optional*) --
  Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
  blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values`
  returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.

  Only [Cache](/docs/transformers/v5.3.0/ko/internal/generation_utils#transformers.Cache) instance is allowed as input, see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache).
  If no `past_key_values` are passed, [DynamicCache](/docs/transformers/v5.3.0/ko/internal/generation_utils#transformers.DynamicCache) will be initialized by default.

  The model will output the same cache format that is fed as input.

  If `past_key_values` are used, the user is expected to input only unprocessed `input_ids` (those that don't
  have their past key value states given to this model) of shape `(batch_size, unprocessed_length)` instead of all `input_ids`
  of shape `(batch_size, sequence_length)`.
- **inputs_embeds** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*) --
  Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
  is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
  model's internal embedding lookup matrix.
- **labels** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) --
  Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
  config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
  (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
- **use_cache** (`bool`, *optional*) --
  If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
  `past_key_values`).
- **cache_position** (`torch.LongTensor` of shape `(sequence_length)`, *optional*) --
  Indices depicting the position of the input sequence tokens in the sequence. Contrarily to `position_ids`,
  this tensor is not affected by padding. It is used to update the cache in the correct position and to infer
  the complete sequence length.
- **logits_to_keep** (`Union[int, torch.Tensor]`, *optional*, defaults to `0`) --
  If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all
  `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that
  token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
  If a `torch.Tensor`, must be 1D corresponding to the indices to keep in the sequence length dimension.
  This is useful when using packed tensor format (single dimension for batch and sequence length).

The [ExaoneMoeForCausalLM](/docs/transformers/v5.3.0/ko/model_doc/exaone_moe#transformers.ExaoneMoeForCausalLM) forward method overrides the `__call__` special method.

Although the recipe for forward pass needs to be defined within this function, one should call the `Module`
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.

**Returns:** [CausalLMOutputWithPast](/docs/transformers/v5.3.0/ko/main_classes/output#transformers.modeling_outputs.CausalLMOutputWithPast) or `tuple(torch.FloatTensor)` (if `return_dict=False` is passed or when `config.return_dict=False`), comprising various elements depending on the configuration ([ExaoneMoeConfig](/docs/transformers/v5.3.0/ko/model_doc/exaone_moe#transformers.ExaoneMoeConfig)) and inputs:

- **loss** (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided) -- Language modeling loss (for next-token prediction).
- **logits** (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`) -- Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- **past_key_values** (`Cache`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`) -- It is a [Cache](/docs/transformers/v5.3.0/ko/internal/generation_utils#transformers.Cache) instance. For more details, see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache).

  Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
  `past_key_values` input) to speed up sequential decoding.
- **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
  one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

  Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- **attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
  sequence_length)`.

  Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
  heads.

Example:

```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> model = AutoModelForCausalLM.from_pretrained("LGAI-EXAONE/K-EXAONE-236B-A23B")
>>> tokenizer = AutoTokenizer.from_pretrained("LGAI-EXAONE/K-EXAONE-236B-A23B")

>>> prompt = "Explain how wonderful you are"
>>> messages = [
...     {"role": "system", "content": "You are a helpful assistant."},
...     {"role": "user", "content": prompt}
... ]
>>> input_ids = tokenizer.apply_chat_template(
...     messages,
...     tokenize=True,
...     add_generation_prompt=True,
...     return_dict=True,
...     return_tensors="pt",
...     enable_thinking=False,
... )

>>> output = model.generate(**input_ids.to(model.device), max_new_tokens=128)
>>> tokenizer.decode(output[0], skip_special_tokens=False)
"\nYou are a helpful assistant.\n\nExplain how wonderful you are\n\n\n\n\n\nThank you for the kind question! While I can't feel emotions or take pride in the way humans do, I *can* share what makes me uniquely helpful and capable—qualities that many people find wonderful.\n\nHere’s how I can support you:\n\n🌟 **Knowledge at Your Fingertips**  \nI have access to a vast amount of information across countless topics—from science and history to technology and creative writing. Whether you're curious, learning, or solving a problem, I can help explain things clearly and accurately.\n\n💬 **Clear, Helpful Communication**  \nI aim to respond in a way that's easy to understand, whether you need a simple explanation or a detailed analysis. I adapt my tone and depth to match"
```

**Parameters:**

config ([ExaoneMoeConfig](/docs/transformers/v5.3.0/ko/model_doc/exaone_moe#transformers.ExaoneMoeConfig)) : Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [from_pretrained()](/docs/transformers/v5.3.0/ko/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights.


