Pixtral
This model was released on 2024-09-17 and added to Hugging Face Transformers on 2024-09-14.
Pixtral is a multimodal model trained to understand natural images and documents. It accepts images at their natural resolution and aspect ratio, without resizing or padding, thanks to its 2D RoPE embeddings. Pixtral also has a long 128K-token context window, which lets it process a large number of images. It couples a 400M-parameter vision encoder with a 12B-parameter Mistral Nemo decoder.
Pixtral architecture. Taken from the blog post.
You can find all the original Pixtral checkpoints under the Mistral AI organization.
This model was contributed by amyeroberts and ArthurZ. Click on the Pixtral models in the right sidebar for more examples of how to apply Pixtral to different vision and language tasks.
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
model_id = "mistral-community/pixtral-12b"
model = LlavaForConditionalGeneration.from_pretrained(model_id, dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)
url_dog = "https://picsum.photos/id/237/200/300"
url_mountain = "https://picsum.photos/seed/picsum/200/300"
chat = [
    {
      "role": "user", "content": [
        {"type": "text", "text": "Can this animal"},
        {"type": "image", "url": url_dog},
        {"type": "text", "text": "live here?"},
        {"type": "image", "url": url_mountain}
      ]
    }
]
inputs = processor.apply_chat_template(chat, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=500)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.
The example below uses bitsandbytes to quantize the model to 4-bit precision.
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration, BitsAndBytesConfig
model_id = "mistral-community/pixtral-12b"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)
dog_url = "https://picsum.photos/id/237/200/300"
mountain_url = "https://picsum.photos/seed/picsum/200/300"
dog_image = Image.open(requests.get(dog_url, stream=True).raw)
mountain_image = Image.open(requests.get(mountain_url, stream=True).raw)
chat = [
    {
      "role": "user", "content": [
        {"type": "text", "text": "Can this animal"},
        {"type": "image"},
        {"type": "text", "text": "live here?"},
        {"type": "image"}
      ]
    }
]
prompt = processor.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = processor(text=prompt, images=[dog_image, mountain_image], return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
generate_ids = model.generate(**inputs, max_new_tokens=100)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(output)
Notes
- Pixtral uses PixtralVisionModel as the vision encoder and MistralForCausalLM for its language decoder. 
- The model internally replaces [IMG] token placeholders with image embeddings.
  "<s>[INST][IMG]\nWhat are the things I should be cautious about when I visit this place?[/INST]"
  Each [IMG] placeholder is expanded into a number of [IMG] tokens that depends on the height and width of the image. Each row of the image is separated by an [IMG_BREAK] token and each image is terminated by an [IMG_END] token. Use the Processor.apply_chat_template method to handle these tokens for you.
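The token expansion described in the note above can be sketched in plain Python. This is only an illustration, not the processor's actual code: the helper name `image_token_layout` is hypothetical, and it assumes a patch size of 16 (the documented `PixtralVisionConfig` default), no resizing before patching, and that the break token after the last row is swapped for `[IMG_END]`.

```python
import math

def image_token_layout(height, width, patch_size=16):
    """Return the placeholder tokens for one image, row by row (illustrative only)."""
    rows = math.ceil(height / patch_size)   # number of patch rows
    cols = math.ceil(width / patch_size)    # number of patches per row
    tokens = []
    for _ in range(rows):
        tokens.extend(["[IMG]"] * cols)     # one [IMG] token per patch
        tokens.append("[IMG_BREAK]")        # marks the end of a row
    tokens[-1] = "[IMG_END]"                # assumption: the last break becomes the end marker
    return tokens

layout = image_token_layout(height=32, width=48)
print(len(layout), layout.count("[IMG]"))
```

Under these assumptions a 32x48 image gives two rows of three patches each, so larger or wider images consume proportionally more of the context window.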
PixtralVisionConfig
class transformers.PixtralVisionConfig
< source >( hidden_size = 1024 intermediate_size = 4096 num_hidden_layers = 24 num_attention_heads = 16 num_channels = 3 image_size = 1024 patch_size = 16 hidden_act = 'gelu' attention_dropout = 0.0 rope_theta = 10000.0 initializer_range = 0.02 **kwargs )
Parameters
-  hidden_size (int, optional, defaults to 1024) — Dimension of the hidden representations.
-  intermediate_size (int, optional, defaults to 4096) — Dimension of the MLP representations.
-  num_hidden_layers (int, optional, defaults to 24) — Number of hidden layers in the Transformer encoder.
-  num_attention_heads (int, optional, defaults to 16) — Number of attention heads in the Transformer encoder.
-  num_channels (int, optional, defaults to 3) — Number of input channels in the input images.
-  image_size (int, optional, defaults to 1024) — Max dimension of the input images.
-  patch_size (int, optional, defaults to 16) — Size of the image patches.
-  hidden_act (str, optional, defaults to "gelu") — Activation function used in the hidden layers.
-  attention_dropout (float, optional, defaults to 0.0) — Dropout probability for the attention layers.
-  rope_theta (float, optional, defaults to 10000.0) — The base period of the RoPE embeddings.
-  initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
This is the configuration class to store the configuration of a PixtralVisionModel. It is used to instantiate a Pixtral vision encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to the vision encoder used by Pixtral-12B.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
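As a rough illustration of what the documented defaults imply, the sketch below derives a few quantities from them. `VisionDefaults` is a hypothetical stand-in so the snippet runs without downloading anything; it is not the real `PixtralVisionConfig`. The derived values assume a square image at the maximum `image_size` and an even split of `hidden_size` across attention heads.

```python
from dataclasses import dataclass

@dataclass
class VisionDefaults:
    # Stand-in for PixtralVisionConfig; values copied from the documented defaults.
    hidden_size: int = 1024
    num_attention_heads: int = 16
    image_size: int = 1024
    patch_size: int = 16

cfg = VisionDefaults()
head_dim = cfg.hidden_size // cfg.num_attention_heads   # per-head dimension
patches_per_side = cfg.image_size // cfg.patch_size     # patches along the longest allowed side
max_patches = patches_per_side ** 2                     # patches for a square image at the maximum size
print(head_dim, patches_per_side, max_patches)
```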
Example:
>>> from transformers import PixtralVisionModel, PixtralVisionConfig
>>> # Initializing a Pixtral-12B style configuration
>>> config = PixtralVisionConfig()
>>> # Initializing a model (with randomly initialized weights) from the configuration
>>> model = PixtralVisionModel(config)
>>> # Accessing the model configuration
>>> configuration = model.config
MistralCommonTokenizer
class transformers.MistralCommonTokenizer
< source >( tokenizer_path: typing.Union[str, os.PathLike, pathlib.Path] mode: ValidationMode = <ValidationMode.test: 'test'> model_max_length: int = 1000000000000000019884624838656 padding_side: str = 'left' truncation_side: str = 'right' model_input_names: typing.Optional[list[str]] = None clean_up_tokenization_spaces: bool = False **kwargs )
Class to wrap mistral-common tokenizers.
mistral-common is the official tokenizer library for Mistral AI models. To use it, you need to install it with:
pip install "transformers[mistral-common]"
Otherwise the tokenizer falls back to the Transformers implementation of the tokenizer.
For more info on mistral-common, see mistral-common.
This class is a wrapper around a mistral_common.tokens.tokenizers.mistral.MistralTokenizer.
It provides a Hugging Face compatible interface to tokenize using the official mistral-common tokenizer.
Supports the following methods from the PreTrainedTokenizerBase class:
- get_vocab(): Returns the vocabulary as a dictionary of token to index.
- encode(): Encode a string to a list of integers.
- decode(): Decode a list of integers to a string.
- batch_decode(): Decode a batch of list of integers to a list of strings.
- convert_tokens_to_ids(): Convert a list of tokens to a list of integers.
- convert_ids_to_tokens(): Convert a list of integers to a list of tokens.
- tokenize(): Tokenize a string.
- get_special_tokens_mask(): Get the special tokens mask for a list of tokens.
- prepare_for_model(): Prepare a list of inputs for the model.
- pad(): Pad a list of inputs to the same length.
- truncate_sequences(): Truncate a list of sequences to the same length.
- apply_chat_template(): Apply a chat template to a list of messages.
- __call__(): Tokenize a string or a list of strings.
- from_pretrained(): Download and cache a pretrained tokenizer from the Hugging Face model hub or local directory.
- save_pretrained(): Save a tokenizer to a directory, so it can be reloaded using the from_pretrained class method.
- push_to_hub(): Upload tokenizer to the Hugging Face model hub.
Here are the key differences with the PreTrainedTokenizerBase class:
- Pairs of sequences are not supported. The signatures have been kept for compatibility, but all arguments related to pairs of sequences are ignored, and the return values for pairs are None.
- The is_split_into_words argument is not supported.
- The return_token_type_ids argument is not supported.
- It is not possible to add new tokens to the tokenizer. Special tokens are also handled differently from Transformers: in mistral-common, special tokens are never encoded directly. This means that tokenizer.encode("<s>") will not return the ID of the <s> token. Instead, it returns a list of IDs corresponding to the tokenization of the string "<s>". For more information, see the mistral-common documentation.
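The last difference is easy to miss, so here is a toy illustration of the idea. `ToyTokenizer`, its vocabulary, and its `add_bos` flag are entirely hypothetical and are not the mistral-common API; the sketch only shows what "special tokens are never encoded directly" means in practice.

```python
class ToyTokenizer:
    """Hypothetical toy tokenizer, not the mistral-common implementation."""

    def __init__(self):
        self.control_tokens = {"<s>": 1, "</s>": 2}               # reserved ids
        self.text_vocab = {"<": 10, "s": 11, ">": 12, "/": 13}    # ordinary characters

    def encode(self, text, add_bos=False):
        # Control ids are only ever inserted programmatically, never parsed from text.
        ids = [self.control_tokens["<s>"]] if add_bos else []
        ids += [self.text_vocab[ch] for ch in text]
        return ids

tok = ToyTokenizer()
literal = tok.encode("<s>")              # the string "<s>" is treated as plain text
with_bos = tok.encode("", add_bos=True)  # the control token is added by the tokenizer itself
print(literal, with_bos)
```

Keeping control tokens out of the text path means user-supplied strings cannot inject them into a prompt.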
If you have suggestions to improve this class, please open an issue on the mistral-common GitHub repository if it is related to the tokenizer or on the Transformers GitHub repository if it is related to the Hugging Face interface.
apply_chat_template
< source >( conversation: typing.Union[list[dict[str, str]], list[list[dict[str, str]]]] tools: typing.Optional[list[typing.Union[dict, typing.Callable]]] = None continue_final_message: bool = False tokenize: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: bool = False max_length: typing.Optional[int] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None return_dict: bool = False **kwargs  ) → Union[str, List[int], List[str], List[List[int]], BatchEncoding]
Parameters
- conversation (Union[List[Dict[str, str]], List[List[Dict[str, str]]]]) — A list of dicts with “role” and “content” keys, representing the chat history so far.
-  tools (List[Union[Dict, Callable]], optional) — A list of tools (callable functions) that will be accessible to the model. If the template does not support function calling, this argument will have no effect. Each tool should be passed as a JSON Schema, giving the name, description and argument types for the tool. See our chat templating guide for more information.
-  continue_final_message (bool, optional) — If this is set, the chat will be formatted so that the final message in the chat is open-ended, without any EOS tokens. The model will continue this message rather than starting a new one. This allows you to “prefill” part of the model’s response for it. Cannot be used at the same time as add_generation_prompt.
-  tokenize (bool, defaults to True) — Whether to tokenize the output. If False, the output will be a string.
-  padding (bool, str or PaddingStrategy, optional, defaults to False) — Select a strategy to pad the returned sequences (according to the model’s padding side and padding index) among:
    - True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
    - 'max_length': Pad to a maximum length specified with the argument max_length, or to the maximum acceptable input length for the model if that argument is not provided.
    - False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
-  truncation (bool, defaults to False) — Whether to truncate sequences at the maximum length. Has no effect if tokenize is False.
-  max_length (int, optional) — Maximum length (in tokens) to use for padding or truncation. Has no effect if tokenize is False. If not specified, the tokenizer’s max_length attribute will be used as a default.
-  return_tensors (str or TensorType, optional) — If set, will return tensors of a particular framework. Has no effect if tokenize is False. Acceptable values are:
    - 'pt': Return PyTorch torch.Tensor objects.
-  return_dict (bool, defaults to False) — Whether to return a dictionary with named outputs. Has no effect if tokenize is False. If at least one conversation contains an image, its pixel values will be returned in the pixel_values key.
-  kwargs (additional keyword arguments, optional) — Not supported by MistralCommonTokenizer.apply_chat_template. Will raise an error if used.
Returns
Union[str, List[int], List[str], List[List[int]], BatchEncoding]
A list of token ids representing the tokenized chat so far, including control tokens. This output is ready to pass to the model, either directly or via methods like generate().
Converts a list of dictionaries with "role" and "content" keys to a list of token ids.
batch_decode
< source >( sequences: typing.Union[list[int], list[list[int]], numpy.ndarray, ForwardRef('torch.Tensor')] skip_special_tokens: bool = False clean_up_tokenization_spaces: typing.Optional[bool] = None **kwargs  ) → List[str]
Parameters
-  sequences (Union[List[int], List[List[int]], np.ndarray, torch.Tensor]) — List of tokenized input ids. Can be obtained using the __call__ method.
-  skip_special_tokens (bool, optional, defaults to False) — Whether or not to remove special tokens in the decoding.
-  clean_up_tokenization_spaces (bool, optional) — Whether or not to clean up the tokenization spaces. If None, will default to self.clean_up_tokenization_spaces.
-  kwargs (additional keyword arguments, optional) — Not supported by MistralCommonTokenizer.batch_decode. Will raise an error if used.
Returns
List[str]
The list of decoded sentences.
Convert a list of lists of token ids into a list of strings by calling decode.
convert_ids_to_tokens
< source >( ids: typing.Union[int, list[int]] skip_special_tokens: bool = False  ) → str or List[str]
Converts a single index or a sequence of indices into a token or a sequence of tokens, using the vocabulary and added tokens.
convert_tokens_to_ids
< source >( tokens: typing.Union[str, list[str]]  ) → int or List[int]
Converts a token string (or a sequence of tokens) into a single integer id (or a sequence of ids), using the vocabulary.
decode
< source >( token_ids: typing.Union[int, list[int], numpy.ndarray, ForwardRef('torch.Tensor')] skip_special_tokens: bool = False clean_up_tokenization_spaces: typing.Optional[bool] = None **kwargs  ) → str
Parameters
-  token_ids (Union[int, List[int], np.ndarray, torch.Tensor]) — List of tokenized input ids. Can be obtained using the __call__ method.
-  skip_special_tokens (bool, optional, defaults to False) — Whether or not to remove special tokens in the decoding.
-  clean_up_tokenization_spaces (bool, optional) — Whether or not to clean up the tokenization spaces. If None, will default to self.clean_up_tokenization_spaces.
-  kwargs (additional keyword arguments, optional) — Not supported by MistralCommonTokenizer.decode. Will raise an error if used.
Returns
str
The decoded sentence.
Converts a sequence of ids into a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces.
encode
< source >( text: typing.Union[str, list[int]] text_pair: None = None add_special_tokens: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy, NoneType] = None max_length: typing.Optional[int] = None stride: int = 0 pad_to_multiple_of: typing.Optional[int] = None padding_side: typing.Optional[str] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None verbose: bool = True **kwargs  ) → List[int], torch.Tensor
Parameters
-  text (str or List[int]) — The first sequence to be encoded. This can be a string or a list of integers (tokenized string ids).
-  text_pair (None, optional) — Not supported by MistralCommonTokenizer.encode. Kept to match the PreTrainedTokenizerBase.encode signature.
-  add_special_tokens (bool, optional, defaults to True) — Whether or not to add special tokens when encoding the sequences. This will use the underlying PretrainedTokenizerBase.build_inputs_with_special_tokens function, which defines which tokens are automatically added to the input ids. This is useful if you want to add bos or eos tokens automatically.
-  padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:
    - True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
    - 'max_length': Pad to a maximum length specified with the argument max_length, or to the maximum acceptable input length for the model if that argument is not provided.
    - False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
-  truncation (bool, str or TruncationStrategy, optional, defaults to False) — Activates and controls truncation. Accepts the following values:
    - True or 'longest_first': Truncate to a maximum length specified with the argument max_length, or to the maximum acceptable input length for the model if that argument is not provided.
    - False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
-  max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet), truncation/padding to a maximum length will be deactivated.
-  stride (int, optional, defaults to 0) — If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
-  pad_to_multiple_of (int, optional) — If set, will pad the sequence to a multiple of the provided value. Requires padding to be activated. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
-  padding_side (str, optional) — The side on which the model should have padding applied. Should be selected between [‘right’, ‘left’]. Default value is picked from the class attribute of the same name.
-  return_tensors (str or TensorType, optional) — If set, will return tensors instead of list of python integers. Acceptable values are:
    - 'pt': Return PyTorch torch.Tensor objects.
-  **kwargs — Not supported by MistralCommonTokenizer.encode. Will raise an error if used.
Returns
List[int], torch.Tensor
The tokenized ids of the text.
Converts a string to a sequence of ids (integer), using the tokenizer and vocabulary.
from_pretrained
< source >( pretrained_model_name_or_path: typing.Union[str, os.PathLike] *init_inputs mode: ValidationMode = <ValidationMode.test: 'test'> cache_dir: typing.Union[str, os.PathLike, NoneType] = None force_download: bool = False local_files_only: bool = False token: typing.Union[bool, str, NoneType] = None revision: str = 'main' model_max_length: int = 1000000000000000019884624838656 padding_side: str = 'left' truncation_side: str = 'right' model_input_names: typing.Optional[list[str]] = None clean_up_tokenization_spaces: bool = False **kwargs )
Parameters
-  pretrained_model_name_or_path (str or os.PathLike) — Can be either:
    - A string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co.
    - A path to a directory containing the tokenizer config, for instance saved using the MistralCommonTokenizer.tokenization_mistral_common.save_pretrained method, e.g., ./my_model_directory/.
-  mode (ValidationMode, optional, defaults to ValidationMode.test) — Validation mode for the MistralTokenizer tokenizer.
-  cache_dir (str or os.PathLike, optional) — Path to a directory in which the downloaded predefined tokenizer vocabulary files should be cached if the standard cache should not be used.
-  force_download (bool, optional, defaults to False) — Whether or not to force the (re-)download of the vocabulary files and override the cached versions if they exist.
-  token (str or bool, optional) — The token to use as HTTP bearer authorization for remote files. If True, will use the token generated when running hf auth login (stored in ~/.huggingface).
-  local_files_only (bool, optional, defaults to False) — Whether or not to only rely on local files and not to attempt to download any files.
-  revision (str, optional, defaults to "main") — The specific model version to use. It can be a branch name, a tag name, or a commit id. Since we use a git-based system for storing models and other artifacts on huggingface.co, revision can be any identifier allowed by git.
-  max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet), truncation/padding to a maximum length will be deactivated.
-  padding_side (str, optional, defaults to "left") — The side on which the model should have padding applied. Should be selected between [‘right’, ‘left’]. Default value is picked from the class attribute of the same name.
-  truncation_side (str, optional, defaults to "right") — The side on which the model should have truncation applied. Should be selected between [‘right’, ‘left’].
-  model_input_names (List[string], optional) — The list of inputs accepted by the forward pass of the model (like "token_type_ids" or "attention_mask"). Default value is picked from the class attribute of the same name.
-  clean_up_tokenization_spaces (bool, optional, defaults to False) — Whether or not the model should clean up the spaces that were added when splitting the input text during the tokenization process.
-  kwargs (additional keyword arguments, optional) — Not supported by MistralCommonTokenizer.from_pretrained. Will raise an error if used.
Instantiate a MistralCommonTokenizer from a predefined tokenizer.
get_special_tokens_mask
< source >( token_ids_0: list token_ids_1: None = None already_has_special_tokens: bool = False ) → A list of integers in the range [0, 1]
Parameters
-  token_ids_0 (List[int]) — List of ids of the sequence.
-  token_ids_1 (List[int], optional) — Not supported by MistralCommonTokenizer. Kept to match the interface of PreTrainedTokenizerBase.
-  already_has_special_tokens (bool, optional, defaults to False) — Whether or not the token list is already formatted with special tokens for the model.
Returns
A list of integers in the range [0, 1]
1 for a special token, 0 for a sequence token.
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model or encode_plus methods.
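The shape of the returned mask can be sketched as follows. The helper `special_tokens_mask` and the ids used here are hypothetical; the real method looks up the tokenizer's own set of special ids rather than taking one as an argument.

```python
def special_tokens_mask(token_ids, special_ids):
    """Return 1 for ids in special_ids and 0 for every other id."""
    return [1 if token_id in special_ids else 0 for token_id in token_ids]

# Hypothetical ids: 1 stands for <s> and 2 for </s>; the rest are ordinary tokens.
mask = special_tokens_mask([1, 523, 88, 2], special_ids={1, 2})
print(mask)
```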
get_vocab
Returns the vocabulary as a dictionary of token to index.
This is a lossy conversion. There may be multiple token ids that decode to the same string due to partial UTF-8 byte sequences being converted to �.
pad
< source >( encoded_inputs: typing.Union[transformers.tokenization_utils_base.BatchEncoding, list[transformers.tokenization_utils_base.BatchEncoding], dict[str, list[int]], dict[str, list[list[int]]], list[dict[str, list[int]]]] padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = True max_length: typing.Optional[int] = None pad_to_multiple_of: typing.Optional[int] = None padding_side: typing.Optional[str] = None return_attention_mask: typing.Optional[bool] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None verbose: bool = True )
Parameters
-  encoded_inputs (BatchEncoding, list of BatchEncoding, Dict[str, List[int]], Dict[str, List[List[int]]] or List[Dict[str, List[int]]]) — Tokenized inputs. Can represent one input (BatchEncoding or Dict[str, List[int]]) or a batch of tokenized inputs (list of BatchEncoding, Dict[str, List[List[int]]] or List[Dict[str, List[int]]]) so you can use this method during preprocessing as well as in a PyTorch Dataloader collate function. Instead of List[int] you can have tensors (numpy arrays, PyTorch tensors); see the note below for the return type.
-  padding (bool, str or PaddingStrategy, optional, defaults to True) — Select a strategy to pad the returned sequences (according to the model’s padding side and padding index) among:
    - True or 'longest' (default): Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
    - 'max_length': Pad to a maximum length specified with the argument max_length, or to the maximum acceptable input length for the model if that argument is not provided.
    - False or 'do_not_pad': No padding (i.e., can output a batch with sequences of different lengths).
-  max_length (int, optional) — Maximum length of the returned list and optionally padding length (see above).
-  pad_to_multiple_of (int, optional) — If set, will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
-  padding_side (str, optional) — The side on which the model should have padding applied. Should be selected between [‘right’, ‘left’]. Default value is picked from the class attribute of the same name.
-  return_attention_mask (bool, optional) — Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer’s default, defined by the return_outputs attribute.
-  return_tensors (str or TensorType, optional) — If set, will return tensors instead of list of python integers. Acceptable values are:
    - 'pt': Return PyTorch torch.Tensor objects.
    - 'np': Return Numpy np.ndarray objects.
-  verbose (bool, optional, defaults to True) — Whether or not to print more information and warnings.
Pad a single encoded input or a batch of encoded inputs up to a predefined length or to the max sequence length in the batch.
The padding side (left/right) and padding token ids are defined at the tokenizer level (with self.padding_side, self.pad_token_id).
If the encoded_inputs passed are a dictionary of numpy arrays or PyTorch tensors, the result will use the same type unless you provide a different tensor type with return_tensors. In the case of PyTorch tensors, however, you will lose the specific device of your tensors.
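The padding strategies described above can be sketched in plain Python. This is a simplified illustration rather than the tokenizer's implementation: `pad_batch` and `pad_id=0` are hypothetical, it works on lists only, and it raises when 'max_length' is requested without a max_length instead of falling back to the model's maximum length.

```python
def pad_batch(batch, padding=True, max_length=None, pad_to_multiple_of=None,
              padding_side="left", pad_id=0):
    """Pad a list of token-id lists and build the matching attention masks."""
    if padding is False or padding == "do_not_pad":
        target = None                                   # leave every sequence untouched
    elif padding is True or padding == "longest":
        target = max(len(seq) for seq in batch)         # longest sequence in the batch
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("max_length is required in this simplified sketch")
        target = max_length
    else:
        raise ValueError(f"unknown padding strategy: {padding!r}")

    if target is not None and pad_to_multiple_of:
        # Round the target length up to the next multiple.
        target = -(-target // pad_to_multiple_of) * pad_to_multiple_of

    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = 0 if target is None else max(target - len(seq), 0)
        if padding_side == "left":
            input_ids.append([pad_id] * n_pad + list(seq))
            attention_mask.append([0] * n_pad + [1] * len(seq))
        else:
            input_ids.append(list(seq) + [pad_id] * n_pad)
            attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

out = pad_batch([[5, 6, 7], [8]], padding="longest")
print(out["input_ids"], out["attention_mask"])
```

Left padding, the default padding_side listed for this tokenizer, keeps the real tokens at the end of each row, which is the layout generation expects.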
prepare_for_model
< source >( ids: list pair_ids: None = None add_special_tokens: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy, NoneType] = None max_length: typing.Optional[int] = None stride: int = 0 pad_to_multiple_of: typing.Optional[int] = None padding_side: typing.Optional[str] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None return_attention_mask: typing.Optional[bool] = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_length: bool = False verbose: bool = True prepend_batch_axis: bool = False **kwargs ) → BatchEncoding
Parameters
-  ids (List[int]) — Tokenized input ids of the first sequence.
-  pair_ids (None, optional) — Not supported by MistralCommonTokenizer. Kept to match the interface of PreTrainedTokenizerBase.
-  add_special_tokens (bool, optional, defaults to True) — Whether or not to add special tokens when encoding the sequences. This will use the underlying PretrainedTokenizerBase.build_inputs_with_special_tokens function, which defines which tokens are automatically added to the input ids. This is useful if you want to add bos or eos tokens automatically.
-  padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:
    - True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
    - 'max_length': Pad to a maximum length specified with the argument max_length, or to the maximum acceptable input length for the model if that argument is not provided.
    - False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
-  truncation (bool, str or TruncationStrategy, optional, defaults to False) — Activates and controls truncation. Accepts the following values:
    - True or 'longest_first': Truncate to a maximum length specified with the argument max_length, or to the maximum acceptable input length for the model if that argument is not provided.
    - False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
-  max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet), truncation/padding to a maximum length will be deactivated.
-  stride (int, optional, defaults to 0) — If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
-  pad_to_multiple_of (int, optional) — If set, will pad the sequence to a multiple of the provided value. Requires padding to be activated. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
-  padding_side (str, optional) — The side on which the model should have padding applied. Should be selected between [‘right’, ‘left’]. Default value is picked from the class attribute of the same name.
-  return_tensors (str or TensorType, optional) — If set, will return tensors instead of list of python integers. Acceptable values are:
    - 'pt': Return PyTorch torch.Tensor objects.
-  return_attention_mask (bool, optional) — Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer’s default, defined by the return_outputs attribute.
-  return_overflowing_tokens (bool, optional, defaults to False) — Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch of pairs) is provided with truncation_strategy = longest_first or True, an error is raised instead of returning overflowing tokens.
-  return_special_tokens_mask (bool, optional, defaults to False) — Whether or not to return special tokens mask information.
-  return_offsets_mapping (bool, optional, defaults to False) — Whether or not to return (char_start, char_end) for each token. This is only available on fast tokenizers inheriting from PreTrainedTokenizerFast; if using Python’s tokenizer, this method will raise NotImplementedError.
-  return_length (bool, optional, defaults to False) — Whether or not to return the lengths of the encoded inputs.
-  verbose (bool, optional, defaults to True) — Whether or not to print more information and warnings.
-  **kwargs — Passed to the self.tokenize() method.
Returns
A BatchEncoding with the following fields:
- input_ids — List of token ids to be fed to a model.
- attention_mask — List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if “attention_mask” is in self.model_input_names).
- overflowing_tokens — List of overflowing tokens sequences (when a max_length is specified and return_overflowing_tokens=True).
- num_truncated_tokens — Number of tokens truncated (when a max_length is specified and return_overflowing_tokens=True).
- special_tokens_mask — List of 0s and 1s, with 1 specifying added special tokens and 0 specifying regular sequence tokens (when add_special_tokens=True and return_special_tokens_mask=True).
- length — The length of the inputs (when return_length=True).
Prepares a sequence of input ids so that it can be used by the model. It adds special tokens, truncates sequences that overflow while taking the special tokens into account, and manages a moving window (with a user-defined stride) for overflowing tokens.
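The effect of pad_to_multiple_of and padding_side on input_ids and attention_mask can be illustrated with a minimal pure-Python sketch. The helper name pad_to_multiple and the pad id of 0 are hypothetical and chosen only for illustration; this is not the tokenizer's own code.

```python
def pad_to_multiple(ids, multiple, pad_id=0, padding_side="right"):
    """Pad `ids` so its length is a multiple of `multiple`.

    Returns (padded_ids, attention_mask). `pad_id=0` is an assumed
    placeholder, not the real pad token id of any tokenizer.
    """
    remainder = len(ids) % multiple
    n_pad = 0 if remainder == 0 else multiple - remainder
    padding = [pad_id] * n_pad
    mask = [1] * len(ids)  # 1 marks real tokens, 0 marks padding
    if padding_side == "right":
        return list(ids) + padding, mask + [0] * n_pad
    if padding_side == "left":
        return padding + list(ids), [0] * n_pad + mask
    raise ValueError("padding_side must be 'right' or 'left'")


padded, mask = pad_to_multiple([5, 6, 7], multiple=8)
print(padded)  # [5, 6, 7, 0, 0, 0, 0, 0]
print(mask)    # [1, 1, 1, 0, 0, 0, 0, 0]
```

With padding_side="left" the padding is placed before the tokens instead, which is the usual choice for batched generation with decoder-only models.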
save_pretrained
< source >( save_directory: typing.Union[str, os.PathLike, pathlib.Path] push_to_hub: bool = False token: typing.Union[bool, str, NoneType] = None commit_message: typing.Optional[str] = None repo_id: typing.Optional[str] = None private: typing.Optional[bool] = None repo_url: typing.Optional[str] = None organization: typing.Optional[str] = None **kwargs  ) → A tuple of str
Parameters
-  save_directory (str or os.PathLike) — The path to a directory where the tokenizer will be saved.
-  push_to_hub (bool, optional, defaults to False) — Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with repo_id (will default to the name of save_directory in your namespace).
-  token (str or bool, optional, defaults to None) — The token to use to push to the model hub. If True, will use the token in the HF_TOKEN environment variable.
-  commit_message (str, optional) — The commit message to use when pushing to the hub.
-  repo_id (str, optional) — The name of the repository to push to on the Hub.
-  private (bool, optional) — Whether the model repository is private or not.
-  repo_url (str, optional) — The URL of the Git repository to push to on the Hub.
-  organization (str, optional) — The name of the organization in which you would like to push your model.
-  kwargs (Dict[str, Any], optional) — Not supported by MistralCommonTokenizer.save_pretrained. Will raise an error if used.
Returns
A tuple of str
The files saved.
Save the full tokenizer state.
This method makes sure the full tokenizer can then be reloaded using the
MistralCommonTokenizer.from_pretrained class method.
tokenize
< source >( text: str **kwargs  ) → List[str]
Converts a string into a sequence of tokens, using the tokenizer.
Split in words for word-based vocabulary or sub-words for sub-word-based vocabularies.
truncate_sequences
< source >( ids: list pair_ids: None = None num_tokens_to_remove: int = 0 truncation_strategy: typing.Union[str, transformers.tokenization_utils_base.TruncationStrategy] = 'longest_first' stride: int = 0 **kwargs  ) → Tuple[List[int], None, List[int]]
Parameters
-  ids (List[int]) — Tokenized input ids. Can be obtained from a string by chaining the tokenize and convert_tokens_to_ids methods.
-  pair_ids (None, optional) — Not supported by MistralCommonTokenizer. Kept to match the signature of PreTrainedTokenizerBase.truncate_sequences.
-  num_tokens_to_remove (int, optional, defaults to 0) — Number of tokens to remove using the truncation strategy.
-  truncation_strategy (str or TruncationStrategy, optional, defaults to 'longest_first') — The strategy to follow for truncation. Can be:
    - 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
    - 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
-  stride (int, optional, defaults to 0) — If set to a positive number, the overflowing tokens returned will contain some tokens from the main sequence returned. The value of this argument defines the number of additional tokens.
Returns
Tuple[List[int], None, List[int]]
The truncated ids and the list of
overflowing tokens. None is returned to match Transformers signature.
Truncates a sequence pair in-place following the strategy.
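The interaction between num_tokens_to_remove and stride can be sketched in plain Python. This is a minimal illustration, assuming right-side truncation of a single sequence and assuming the overflow window is the removed tokens plus stride tokens of overlap; truncate_right is a hypothetical helper, not the library method.

```python
def truncate_right(ids, num_tokens_to_remove=0, stride=0):
    """Return (kept_ids, None, overflowing_ids) for a single sequence.

    Assumption: tokens are removed from the right, and the overflow
    window also carries `stride` tokens from the end of the kept part.
    """
    ids = list(ids)
    if num_tokens_to_remove <= 0:
        return ids, None, []
    if num_tokens_to_remove >= len(ids):
        raise ValueError("cannot remove every token of the sequence")
    window = min(len(ids), stride + num_tokens_to_remove)
    overflowing = ids[-window:]          # removed tokens plus the overlap
    kept = ids[:-num_tokens_to_remove]   # what the model would receive
    return kept, None, overflowing


kept, pair, overflow = truncate_right(range(10), num_tokens_to_remove=3, stride=2)
print(kept)      # [0, 1, 2, 3, 4, 5, 6]
print(overflow)  # [5, 6, 7, 8, 9]
```

Tokens 5 and 6 appear in both lists: they are the stride overlap between the truncated sequence and the overflowing tokens. The middle element is None to mirror the documented return type.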
PixtralVisionModel
class transformers.PixtralVisionModel
< source >( config )
Parameters
- config (PixtralVisionConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare Pixtral Model outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( pixel_values: Tensor image_sizes: typing.Optional[torch.Tensor] = None output_hidden_states: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None *args **kwargs: typing_extensions.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs]  ) → transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)
Parameters
-  pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using PixtralImageProcessor. See PixtralImageProcessor.__call__() for details (PixtralProcessor uses PixtralImageProcessor for processing images).
-  image_sizes (torch.Tensor of shape (batch_size, 2), optional) — The sizes of the images in the batch, being (height, width) for each image.
-  output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
-  output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
-  return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (PixtralVisionConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The PixtralVisionModel forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the
Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
PixtralImageProcessor
class transformers.PixtralImageProcessor
< source >( do_resize: bool = True size: typing.Optional[dict[str, int]] = None patch_size: typing.Optional[dict[str, int]] = None resample: Resampling = <Resampling.BICUBIC: 3> do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None do_convert_rgb: bool = True **kwargs )
Parameters
-  do_resize (bool, optional, defaults to True) — Whether to resize the image’s (height, width) dimensions to the specified size. Can be overridden by do_resize in the preprocess method.
-  size (dict[str, int], optional, defaults to {"longest_edge": 1024}) — Size of the maximum dimension of either the height or width dimension of the image. Used to control how images are resized. If either the height or width is greater than size["longest_edge"], then both the height and width are rescaled by height / ratio, width / ratio, where ratio = max(height / longest_edge, width / longest_edge).
-  patch_size (dict[str, int], optional, defaults to {"height": 16, "width": 16}) — Size of the patches in the model, used to calculate the output image size. Can be overridden by patch_size in the preprocess method.
-  resample (PILImageResampling, optional, defaults to Resampling.BICUBIC) — Resampling filter to use if resizing the image. Can be overridden by resample in the preprocess method.
-  do_rescale (bool, optional, defaults to True) — Whether to rescale the image by the specified scale rescale_factor. Can be overridden by do_rescale in the preprocess method.
-  rescale_factor (int or float, optional, defaults to 1/255) — Scale factor to use if rescaling the image. Can be overridden by rescale_factor in the preprocess method.
-  do_normalize (bool, optional, defaults to True) — Whether to normalize the image. Can be overridden by do_normalize in the preprocess method.
-  image_mean (float or list[float], optional, defaults to [0.48145466, 0.4578275, 0.40821073]) — Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
-  image_std (float or list[float], optional, defaults to [0.26862954, 0.26130258, 0.27577711]) — Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.
-  do_convert_rgb (bool, optional, defaults to True) — Whether to convert the image to RGB.
Constructs a Pixtral image processor.
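The resizing rule described for size and patch_size can be worked through in a few lines. The sketch below follows the ratio formula given above; the floor rounding and the final round-up to a whole number of patches are assumptions about details the text does not state, and pixtral_output_size is a hypothetical helper (it also takes patch_size as a single int for brevity rather than a dict).

```python
import math


def pixtral_output_size(height, width, longest_edge=1024, patch_size=16):
    """Estimate the (height, width) an image is resized to.

    Assumptions: sizes are floored after dividing by the ratio, and
    each side is then rounded up to a multiple of `patch_size`.
    """
    ratio = max(height / longest_edge, width / longest_edge)
    if ratio > 1:  # only shrink images that exceed the longest edge
        height = math.floor(height / ratio)
        width = math.floor(width / ratio)
    rows = (height - 1) // patch_size + 1  # patches along the height
    cols = (width - 1) // patch_size + 1   # patches along the width
    return rows * patch_size, cols * patch_size


print(pixtral_output_size(2048, 1536))  # (1024, 768)
print(pixtral_output_size(300, 200))    # (304, 208)
```

The first image is halved because its height is twice longest_edge, and the aspect ratio is preserved. The second image is already small enough, so only the patch alignment changes its size.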
preprocess
< source >( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] do_resize: typing.Optional[bool] = None size: typing.Optional[dict[str, int]] = None patch_size: typing.Optional[dict[str, int]] = None resample: typing.Optional[PIL.Image.Resampling] = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None do_convert_rgb: typing.Optional[bool] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: typing.Optional[transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None **kwargs )
Parameters
-  images (ImageInput) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
-  do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image.
-  size (dict[str, int], optional, defaults to self.size) — Describes the maximum input dimensions to the model.
-  patch_size (dict[str, int], optional, defaults to self.patch_size) — Patch size in the model. Used to calculate the image after resizing.
-  resample (int, optional, defaults to self.resample) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
-  do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image.
-  rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to rescale the image by if do_rescale is set to True.
-  do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.
-  image_mean (float or list[float], optional, defaults to self.image_mean) — Image mean to use for normalization. Only has an effect if do_normalize is set to True.
-  image_std (float or list[float], optional, defaults to self.image_std) — Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
-  do_convert_rgb (bool, optional, defaults to self.do_convert_rgb) — Whether to convert the image to RGB.
-  return_tensors (str or TensorType, optional) — The type of tensors to return. Can be one of:
    - Unset: Return a list of np.ndarray.
    - TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
    - TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
    - TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
    - TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
-  data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:
    - "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
    - "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
    - Unset: Use the channel dimension format of the input image.
-  input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
    - "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
    - "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
    - "none" or ChannelDimension.NONE: image in (height, width) format.
Preprocess an image or batch of images.
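The rescale and normalize steps amount to simple per-channel arithmetic. The sketch below applies them to a single RGB pixel using the default rescale_factor, image_mean and image_std listed for PixtralImageProcessor; rescale_and_normalize is a hypothetical helper written for illustration, not part of the library.

```python
# Default statistics as documented for PixtralImageProcessor.
IMAGE_MEAN = [0.48145466, 0.4578275, 0.40821073]
IMAGE_STD = [0.26862954, 0.26130258, 0.27577711]


def rescale_and_normalize(pixel, rescale_factor=1 / 255,
                          mean=IMAGE_MEAN, std=IMAGE_STD):
    """Map one (R, G, B) pixel in [0, 255] to normalized floats.

    Step 1 (do_rescale): multiply by rescale_factor to reach [0, 1].
    Step 2 (do_normalize): subtract the mean and divide by the std.
    """
    return [(c * rescale_factor - m) / s for c, m, s in zip(pixel, mean, std)]


white = rescale_and_normalize((255, 255, 255))
black = rescale_and_normalize((0, 0, 0))
print([round(v, 4) for v in white])
print([round(v, 4) for v in black])
```

This also shows why do_rescale=False is needed when pixel values are already between 0 and 1: rescaling them a second time would push every channel close to the value computed for a black pixel.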
PixtralImageProcessorFast
class transformers.PixtralImageProcessorFast
< source >( **kwargs: typing_extensions.Unpack[transformers.models.pixtral.image_processing_pixtral_fast.PixtralFastImageProcessorKwargs] )
Constructs a fast Pixtral image processor.
preprocess
< source >( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] **kwargs: typing_extensions.Unpack[transformers.models.pixtral.image_processing_pixtral_fast.PixtralFastImageProcessorKwargs]  ) → <class 'transformers.image_processing_base.BatchFeature'>
Parameters
-  images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
-  do_resize (bool, optional) — Whether to resize the image.
-  size (dict[str, int], optional) — Describes the maximum input dimensions to the model.
-  default_to_square (bool, optional) — Whether to default to a square image when resizing, if size is an int.
-  resample (Union[PILImageResampling, F.InterpolationMode, NoneType]) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
-  do_center_crop (bool, optional) — Whether to center crop the image.
-  crop_size (dict[str, int], optional) — Size of the output image after applying center_crop.
-  do_rescale (bool, optional) — Whether to rescale the image.
-  rescale_factor (Union[int, float, NoneType]) — Rescale factor to rescale the image by if do_rescale is set to True.
-  do_normalize (bool, optional) — Whether to normalize the image.
-  image_mean (Union[float, list[float], NoneType]) — Image mean to use for normalization. Only has an effect if do_normalize is set to True.
-  image_std (Union[float, list[float], NoneType]) — Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
-  do_pad (bool, optional) — Whether to pad the image. Padding is done either to the largest size in the batch or to a fixed square size per image. The exact padding strategy depends on the model.
-  pad_size (dict[str, int], optional) — The size in {"height": int, "width": int} to pad the images to. Must be larger than any image size provided for preprocessing. If pad_size is not provided, images will be padded to the largest height and width in the batch. Applied only when do_pad=True.
-  do_convert_rgb (bool, optional) — Whether to convert the image to RGB.
-  return_tensors (Union[str, ~utils.generic.TensorType, NoneType]) — Returns stacked tensors if set to 'pt', otherwise returns a list of tensors.
-  data_format (~image_utils.ChannelDimension, optional) — Only ChannelDimension.FIRST is supported. Added for compatibility with slow processors.
-  input_data_format (Union[str, ~image_utils.ChannelDimension, NoneType]) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
    - "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
    - "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
    - "none" or ChannelDimension.NONE: image in (height, width) format.
-  device (torch.device, optional) — The device to process the images on. If unset, the device is inferred from the input images.
-  disable_grouping (bool, optional) — Whether to disable grouping of images by size to process them individually and not in batches. If None, will be set to True if the images are on CPU, and False otherwise. This choice is based on empirical observations, as detailed here: https://github.com/huggingface/transformers/pull/38157
-  patch_size (dict[str, int], optional, defaults to {"height": 16, "width": 16}) — Size of the patches in the model, used to calculate the output image size. Can be overridden by patch_size in the preprocess method.
Returns
<class 'transformers.image_processing_base.BatchFeature'>
- data (dict) — Dictionary of lists/arrays/tensors returned by the call method (‘pixel_values’, etc.).
- tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of integers in PyTorch/TensorFlow/Numpy Tensors at initialization.
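The idea behind the disable_grouping parameter, grouping images by size so that same-shaped images can be processed as one batch, can be sketched without any tensor library. The helper group_by_shape below is hypothetical and only illustrates the bookkeeping; it is not the function the fast image processor uses internally.

```python
from collections import defaultdict


def group_by_shape(shapes):
    """Map each distinct image shape to the indices of images having it.

    Images sharing a shape could be stacked and processed in one batch;
    the stored indices allow restoring the original order afterwards.
    """
    groups = defaultdict(list)
    for index, shape in enumerate(shapes):
        groups[tuple(shape)].append(index)
    return dict(groups)


shapes = [(3, 224, 224), (3, 512, 512), (3, 224, 224)]
print(group_by_shape(shapes))
# {(3, 224, 224): [0, 2], (3, 512, 512): [1]}
```

Disabling grouping corresponds to treating every image as its own group, which avoids the grouping overhead when batching brings little benefit.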
PixtralProcessor
class transformers.PixtralProcessor
< source >( image_processor = None tokenizer = None patch_size: int = 16 spatial_merge_size: int = 1 chat_template = None image_token = '[IMG]' image_break_token = '[IMG_BREAK]' image_end_token = '[IMG_END]' **kwargs )
Parameters
- image_processor (PixtralImageProcessor, optional) — The image processor is a required input.
- tokenizer (LlamaTokenizerFast, optional) — The tokenizer is a required input.
-  patch_size (int, optional, defaults to 16) — Patch size from the vision tower.
-  spatial_merge_size (int, optional, defaults to 1) — The downsampling factor for the spatial merge operation.
-  chat_template (str, optional) — A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.
-  image_token (str, optional, defaults to "[IMG]") — Special token used to denote image location.
-  image_break_token (str, optional, defaults to "[IMG_BREAK]") — Special token used to denote the end of a line of pixels in an image.
-  image_end_token (str, optional, defaults to "[IMG_END]") — Special token used to denote the end of an image input.
Constructs a Pixtral processor which wraps a Pixtral image processor and a Pixtral tokenizer into a single processor.
PixtralProcessor offers all the functionalities of PixtralImageProcessor and LlamaTokenizerFast. See the
__call__() and decode() for more information.
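How the three special tokens lay out an image in the prompt can be illustrated with a short sketch. It assumes one image_token per patch, an image_break_token closing each row of patches, and the image_end_token taking the place of the final break; the helper image_placeholder_tokens is hypothetical, and the exact expansion performed by the processor may differ.

```python
def image_placeholder_tokens(height, width, patch_size=16, spatial_merge_size=1,
                             image_token="[IMG]",
                             image_break_token="[IMG_BREAK]",
                             image_end_token="[IMG_END]"):
    """Build the placeholder token list for one already-resized image.

    Assumption: `height` and `width` are multiples of
    patch_size * spatial_merge_size.
    """
    step = patch_size * spatial_merge_size
    rows, cols = height // step, width // step
    if rows < 1 or cols < 1:
        raise ValueError("image is smaller than one patch")
    tokens = []
    for _ in range(rows):
        # one token per patch in the row, then a row terminator
        tokens.extend([image_token] * cols + [image_break_token])
    tokens[-1] = image_end_token  # the last terminator marks the image end
    return tokens


print(image_placeholder_tokens(32, 48))
# ['[IMG]', '[IMG]', '[IMG]', '[IMG_BREAK]', '[IMG]', '[IMG]', '[IMG]', '[IMG_END]']
```

Under these assumptions a 1024 x 1024 image with 16-pixel patches expands to 64 rows of 65 tokens, so large images consume a noticeable share of the context window.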