BART
DISCLAIMER: If you see something strange, file a GitHub Issue and assign @patrickvonplaten.
Overview
The Bart model was proposed in BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer on 29 Oct, 2019.
According to the abstract,
- Bart uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT).
- The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme, where spans of text are replaced with a single mask token.
- BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.
This model was contributed by sshleifer. The authors' code can be found here.
Examples
- Examples and scripts for fine-tuning BART and other models for sequence-to-sequence tasks can be found in examples/pytorch/summarization/. A minimal fine-tuning sketch also follows this list.
- An example of how to train BartForConditionalGeneration with a Hugging Face datasets object can be found in this forum discussion.
- Distilled checkpoints are described in this paper.
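As a complement to the linked scripts, here is a minimal fine-tuning sketch using Seq2SeqTrainer. It is an illustration only: the cnn_dailymail dataset, its column names, and the hyperparameters below are assumptions rather than part of the official examples.

from datasets import load_dataset
from transformers import (
    BartForConditionalGeneration,
    BartTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# cnn_dailymail and its column names ("article", "highlights") are assumptions for illustration only
dataset = load_dataset("cnn_dailymail", "3.0.0", split="train[:1%]")

def preprocess(examples):
    model_inputs = tokenizer(examples["article"], max_length=1024, truncation=True)
    # BART uses the same tokenizer for source and target text
    labels = tokenizer(examples["highlights"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="bart-summarization", per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()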
Implementation Notes
- Bart doesn't use token_type_ids for sequence classification. Use BartTokenizer or encode() to get the proper splitting.
- The forward pass of BartModel will create the decoder_input_ids if they are not passed. This differs from some other modeling APIs. A typical use case of this feature is mask filling (see the sketch after this list).
- Model predictions are intended to be identical to the original implementation when force_bos_token_to_be_generated=True. This only works, however, if the string you pass to fairseq.encode starts with a space.
- generate() should be used for conditional generation tasks like summarization; see the example in that method's docstring.
- Models that load the facebook/bart-large-cnn weights will not have a mask_token_id and cannot perform mask-filling tasks.
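As referenced above, the following small sketch (the checkpoint choice is an assumption, not part of the original notes) shows that BartModel creates decoder_input_ids by shifting input_ids to the right when none are passed.

from transformers import BartModel, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartModel.from_pretrained("facebook/bart-base")

inputs = tokenizer("UN Chief Says There Is No <mask> in Syria", return_tensors="pt")

# No decoder_input_ids are passed; the model derives them internally from input_ids
outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, d_model)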
Mask Filling
The facebook/bart-base and facebook/bart-large checkpoints can be used to fill multi-token masks.
from transformers import BartForConditionalGeneration, BartTokenizer
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large", forced_bos_token_id=0)
tok = BartTokenizer.from_pretrained("facebook/bart-large")
example_english_phrase = "UN Chief Says There Is No <mask> in Syria"
batch = tok(example_english_phrase, return_tensors='pt')
generated_ids = model.generate(batch['input_ids'])
assert tok.batch_decode(generated_ids, skip_special_tokens=True) == ['UN Chief Says There Is No Plan to Stop Chemical Weapons in Syria']
BartConfig
( vocab_size = 50265 max_position_embeddings = 1024 encoder_layers = 12 encoder_ffn_dim = 4096 encoder_attention_heads = 16 decoder_layers = 12 decoder_ffn_dim = 4096 decoder_attention_heads = 16 encoder_layerdrop = 0.0 decoder_layerdrop = 0.0 activation_function = 'gelu' d_model = 1024 dropout = 0.1 attention_dropout = 0.0 activation_dropout = 0.0 init_std = 0.02 classifier_dropout = 0.0 scale_embedding = False use_cache = True num_labels = 3 pad_token_id = 1 bos_token_id = 0 eos_token_id = 2 is_encoder_decoder = True decoder_start_token_id = 2 forced_eos_token_id = 2 **kwargs )
Parameters
- vocab_size (int, optional, defaults to 50265) — Vocabulary size of the BART model. Defines the number of different tokens that can be represented by the input_ids passed when calling BartModel or TFBartModel.
- d_model (int, optional, defaults to 1024) — Dimensionality of the layers and the pooler layer.
- encoder_layers (int, optional, defaults to 12) — Number of encoder layers.
- decoder_layers (int, optional, defaults to 12) — Number of decoder layers.
- encoder_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer encoder.
- decoder_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer decoder.
- decoder_ffn_dim (int, optional, defaults to 4096) — Dimensionality of the "intermediate" (often named feed-forward) layer in the decoder.
- encoder_ffn_dim (int, optional, defaults to 4096) — Dimensionality of the "intermediate" (often named feed-forward) layer in the encoder.
- activation_function (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported.
- dropout (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
- attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
- activation_dropout (float, optional, defaults to 0.0) — The dropout ratio for activations inside the fully connected layer.
- classifier_dropout (float, optional, defaults to 0.0) — The dropout ratio for the classifier.
- max_position_embeddings (int, optional, defaults to 1024) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
- init_std (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- encoder_layerdrop (float, optional, defaults to 0.0) — The LayerDrop probability for the encoder. See the LayerDrop paper (https://arxiv.org/abs/1909.11556) for more details.
- decoder_layerdrop (float, optional, defaults to 0.0) — The LayerDrop probability for the decoder. See the LayerDrop paper (https://arxiv.org/abs/1909.11556) for more details.
- scale_embedding (bool, optional, defaults to False) — Scale embeddings by dividing by sqrt(d_model).
- use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models).
- num_labels (int, optional, defaults to 3) — The number of labels to use in BartForSequenceClassification.
- forced_eos_token_id (int, optional, defaults to 2) — The id of the token to force as the last generated token when max_length is reached. Usually set to eos_token_id.
This is the configuration class to store the configuration of a BartModel. It is used to instantiate a BART model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the BART facebook/bart-large architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import BartModel, BartConfig
>>> # Initializing a BART facebook/bart-large style configuration
>>> configuration = BartConfig()
>>> # Initializing a model from the facebook/bart-large style configuration
>>> model = BartModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
BartTokenizer
( vocab_file merges_file errors = 'replace' bos_token = '<s>' eos_token = '</s>' sep_token = '</s>' cls_token = '<s>' unk_token = '<unk>' pad_token = '<pad>' mask_token = '<mask>' add_prefix_space = False **kwargs )
Construct a BART tokenizer.
BartTokenizer is identical to RobertaTokenizer. Refer to superclass RobertaTokenizer for usage examples and documentation concerning the initialization parameters and other methods.
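For reference, a brief usage sketch (the checkpoint choice is an assumption): like RobertaTokenizer, the tokenizer wraps inputs with the <s> and </s> special tokens.

from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
encoded = tokenizer("Hello world", return_tensors="pt")
print(encoded["input_ids"])  # includes bos (<s>, id 0) and eos (</s>, id 2)
print(tokenizer.decode(encoded["input_ids"][0]))  # "<s>Hello world</s>"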
BartTokenizerFast
( vocab_file = None merges_file = None tokenizer_file = None errors = 'replace' bos_token = '<s>' eos_token = '</s>' sep_token = '</s>' cls_token = '<s>' unk_token = '<unk>' pad_token = '<pad>' mask_token = '<mask>' add_prefix_space = False trim_offsets = True **kwargs )
Construct a "fast" BART tokenizer (backed by HuggingFace's tokenizers library).
BartTokenizerFast is identical to RobertaTokenizerFast. Refer to superclass RobertaTokenizerFast for usage examples and documentation concerning the initialization parameters and other methods.
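A brief usage sketch (the checkpoint choice is an assumption): the fast tokenizer additionally exposes character offsets via return_offsets_mapping.

from transformers import BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
encoded = tokenizer("Hello world", return_offsets_mapping=True)
print(encoded["input_ids"])       # token ids, including <s> and </s>
print(encoded["offset_mapping"])  # (start, end) character spans; (0, 0) for special tokens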
BartModel
( config: BartConfig )
Parameters
- config (BartConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare BART Model outputting raw hidden-states without any specific head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( input_ids = None attention_mask = None decoder_input_ids = None decoder_attention_mask = None head_mask = None decoder_head_mask = None cross_attn_head_mask = None encoder_outputs = None past_key_values = None inputs_embeds = None decoder_inputs_embeds = None use_cache = None output_attentions = None output_hidden_states = None return_dict = None ) → transformers.modeling_outputs.Seq2SeqModelOutput or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary. Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. Bart uses the eos_token_id as the starting token for decoder_input_ids generation. If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values). For translation and summarization training, decoder_input_ids should be provided. If no decoder_input_ids is provided, the model will create this tensor by shifting the input_ids to the right for denoising pre-training following the paper.
- decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default. If you want to change padding behavior, you should read modeling_bart._prepare_decoder_inputs and modify to your needs. See diagram 1 in the paper for more information on the default strategy.
- head_mask (torch.Tensor of shape (encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- decoder_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the decoder. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- cross_attn_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the cross-attention modules in the decoder. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consisting of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state of shape (batch_size, sequence_length, hidden_size), optional, is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional) — Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last decoder_inputs_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model's internal embedding lookup matrix. If decoder_input_ids and decoder_inputs_embeds are both unset, decoder_inputs_embeds takes the value of inputs_embeds.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.Seq2SeqModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.Seq2SeqModelOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (BartConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the decoder of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
The BartModel forward method overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import BartTokenizer, BartModel
>>> import torch
>>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
>>> model = BartModel.from_pretrained('facebook/bart-large')
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
BartForConditionalGeneration
( config: BartConfig )
Parameters
- config (BartConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The BART Model with a language modeling head. Can be used for summarization. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( input_ids = None attention_mask = None decoder_input_ids = None decoder_attention_mask = None head_mask = None decoder_head_mask = None cross_attn_head_mask = None encoder_outputs = None past_key_values = None inputs_embeds = None decoder_inputs_embeds = None labels = None use_cache = None output_attentions = None output_hidden_states = None return_dict = None ) → transformers.modeling_outputs.Seq2SeqLMOutput or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary. Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. Bart uses the eos_token_id as the starting token for decoder_input_ids generation. If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values). For translation and summarization training, decoder_input_ids should be provided. If no decoder_input_ids is provided, the model will create this tensor by shifting the input_ids to the right for denoising pre-training following the paper.
- decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default. If you want to change padding behavior, you should read modeling_bart._prepare_decoder_inputs and modify to your needs. See diagram 1 in the paper for more information on the default strategy.
- head_mask (torch.Tensor of shape (encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- decoder_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the decoder. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- cross_attn_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the cross-attention modules in the decoder. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consisting of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state of shape (batch_size, sequence_length, hidden_size), optional, is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional) — Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last decoder_inputs_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model's internal embedding lookup matrix. If decoder_input_ids and decoder_inputs_embeds are both unset, decoder_inputs_embeds takes the value of inputs_embeds.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
Returns
transformers.modeling_outputs.Seq2SeqLMOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.Seq2SeqLMOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (BartConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss.
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
The BartForConditionalGeneration forward method overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Summarization example:

from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

ARTICLE_TO_SUMMARIZE = "My friends are cool but they eat too many carbs."
inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors="pt")

# Generate Summary
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=5, early_stopping=True)
print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids])
Mask filling example:

from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
TXT = "My friends are <mask> but they eat too many carbs."

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
input_ids = tokenizer([TXT], return_tensors="pt")["input_ids"]
logits = model(input_ids).logits

masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(5)

tokenizer.decode(predictions).split()
BartForSequenceClassification
( config: BartConfig **kwargs )
Parameters
- config (BartConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Bart model with a sequence classification head on top (a linear layer on top of the pooled output), e.g. for GLUE tasks.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( input_ids = None attention_mask = None decoder_input_ids = None decoder_attention_mask = None head_mask = None decoder_head_mask = None cross_attn_head_mask = None encoder_outputs = None inputs_embeds = None decoder_inputs_embeds = None labels = None use_cache = None output_attentions = None output_hidden_states = None return_dict = None ) → transformers.modeling_outputs.Seq2SeqSequenceClassifierOutput or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary. Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. Bart uses the eos_token_id as the starting token for decoder_input_ids generation. If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values). For translation and summarization training, decoder_input_ids should be provided. If no decoder_input_ids is provided, the model will create this tensor by shifting the input_ids to the right for denoising pre-training following the paper.
- decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default. If you want to change padding behavior, you should read modeling_bart._prepare_decoder_inputs and modify to your needs. See diagram 1 in the paper for more information on the default strategy.
- head_mask (torch.Tensor of shape (encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- decoder_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the decoder. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- cross_attn_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the cross-attention modules in the decoder. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consisting of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state of shape (batch_size, sequence_length, hidden_size), optional, is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional) — Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last decoder_inputs_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model's internal embedding lookup matrix. If decoder_input_ids and decoder_inputs_embeds are both unset, decoder_inputs_embeds takes the value of inputs_embeds.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels > 1 a classification loss is computed (Cross-Entropy).
Returns
transformers.modeling_outputs.Seq2SeqSequenceClassifierOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.Seq2SeqSequenceClassifierOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (BartConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when label is provided) — Classification (or regression if config.num_labels==1) loss.
- logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
The BartForSequenceClassification forward method overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example of single-label classification:
>>> from transformers import BartTokenizer, BartForSequenceClassification
>>> import torch
>>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
>>> model = BartForSequenceClassification.from_pretrained('facebook/bart-large')
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> labels = torch.tensor([1]).unsqueeze(0) # Batch size 1
>>> outputs = model(**inputs, labels=labels)
>>> loss = outputs.loss
>>> logits = outputs.logits
Example of multi-label classification:
>>> from transformers import BartTokenizer, BartForSequenceClassification
>>> import torch
>>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
>>> model = BartForSequenceClassification.from_pretrained('facebook/bart-large', problem_type="multi_label_classification")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> labels = torch.tensor([[1, 1]], dtype=torch.float) # need dtype=float for BCEWithLogitsLoss
>>> outputs = model(**inputs, labels=labels)
>>> loss = outputs.loss
>>> logits = outputs.logits
BartForQuestionAnswering
( config )
Parameters
- config (BartConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
BART Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear
layer on top of the hidden-states output to compute span start logits and span end logits).
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( input_ids = None attention_mask = None decoder_input_ids = None decoder_attention_mask = None head_mask = None decoder_head_mask = None cross_attn_head_mask = None encoder_outputs = None start_positions = None end_positions = None inputs_embeds = None decoder_inputs_embeds = None use_cache = None output_attentions = None output_hidden_states = None return_dict = None ) → transformers.modeling_outputs.Seq2SeqQuestionAnsweringModelOutput or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary. Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. Bart uses the eos_token_id as the starting token for decoder_input_ids generation. If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values). For translation and summarization training, decoder_input_ids should be provided. If no decoder_input_ids is provided, the model will create this tensor by shifting the input_ids to the right for denoising pre-training following the paper.
- decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default. If you want to change padding behavior, you should read modeling_bart._prepare_decoder_inputs and modify to your needs. See diagram 1 in the paper for more information on the default strategy.
- head_mask (torch.Tensor of shape (encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- decoder_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the decoder. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- cross_attn_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the cross-attention modules in the decoder. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consisting of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state of shape (batch_size, sequence_length, hidden_size), optional, is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional) — Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last decoder_inputs_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model's internal embedding lookup matrix. If decoder_input_ids and decoder_inputs_embeds are both unset, decoder_inputs_embeds takes the value of inputs_embeds.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- start_positions (torch.LongTensor of shape (batch_size,), optional) — Labels for position (index) of the start of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (sequence_length). Positions outside of the sequence are not taken into account for computing the loss.
- end_positions (torch.LongTensor of shape (batch_size,), optional) — Labels for position (index) of the end of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (sequence_length). Positions outside of the sequence are not taken into account for computing the loss.
Returns
transformers.modeling_outputs.Seq2SeqQuestionAnsweringModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.Seq2SeqQuestionAnsweringModelOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (BartConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
- start_logits (torch.FloatTensor of shape (batch_size, sequence_length)) — Span-start scores (before SoftMax).
- end_logits (torch.FloatTensor of shape (batch_size, sequence_length)) — Span-end scores (before SoftMax).
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
The BartForQuestionAnswering forward method overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import BartTokenizer, BartForQuestionAnswering
>>> import torch
>>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
>>> model = BartForQuestionAnswering.from_pretrained('facebook/bart-large')
>>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
>>> inputs = tokenizer(question, text, return_tensors='pt')
>>> start_positions = torch.tensor([1])
>>> end_positions = torch.tensor([3])
>>> outputs = model(**inputs, start_positions=start_positions, end_positions=end_positions)
>>> loss = outputs.loss
>>> start_scores = outputs.start_logits
>>> end_scores = outputs.end_logits
BartForCausalLM
( input_ids = None attention_mask = None encoder_hidden_states = None encoder_attention_mask = None head_mask = None cross_attn_head_mask = None past_key_values = None inputs_embeds = None labels = None use_cache = None output_attentions = None output_hidden_states = None return_dict = None ) → transformers.modeling_outputs.CausalLMOutputWithCrossAttentions or tuple(torch.FloatTensor)
Parameters
- 
							input_ids (torch.LongTensorof shape(batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. 
- 
							attention_mask (torch.Tensorof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 
- 
							encoder_hidden_states  (torch.FloatTensorof shape(batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
- 
							encoder_attention_mask (torch.FloatTensorof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in[0, 1]:
- 
							head_mask (torch.Tensorof shape(decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules. Mask values selected in[0, 1]:- 1 indicates the head is not masked,
- 0 indicates the head is masked.
 
- 
							cross_attn_head_mask (torch.Tensorof shape(decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the cross-attention modules. Mask values selected in[0, 1]:- 1 indicates the head is not masked,
- 0 indicates the head is masked.
 
- 
							past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned whenuse_cache=Trueis passed or whenconfig.use_cache=True) — Tuple oftuple(torch.FloatTensor)of lengthconfig.n_layers, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head)) and 2 additional tensors of shape(batch_size, num_heads, encoder_sequence_length, embed_size_per_head). The two additional tensors are only required when the model is used as a decoder in a Sequence to Sequence model.Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_valuesinput) to speed up sequential decoding.If past_key_valuesare used, the user can optionally input only the lastdecoder_input_ids(those that don’t have their past key value states given to this model) of shape(batch_size, 1)instead of alldecoder_input_idsof shape(batch_size, sequence_length).
- 
							labels (torch.LongTensorof shape(batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in[0, ..., config.vocab_size]or -100 (seeinput_idsdocstring). Tokens with indices set to-100are ignored (masked), the loss is only computed for the tokens with labels in[0, ..., config.vocab_size].
- 
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- 
							output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail.
- 
							output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail.
- 
							return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.CausalLMOutputWithCrossAttentions or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithCrossAttentions or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (BartConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Cross-attention weights after the attention softmax, used to compute the weighted average in the cross-attention heads.
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of torch.FloatTensor tuples of length config.n_layers, with each tuple containing the cached key, value states of the self-attention and the cross-attention layers if the model is used in an encoder-decoder setting. Only relevant if config.is_decoder = True. Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
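As a sketch of how the cached past_key_values above can be reused for incremental decoding (illustrative only, not part of the original documentation; it assumes the facebook/bart-large checkpoint), one can feed just the newly chosen token together with the cache on each subsequent step:

import torch
from transformers import BartTokenizer, BartForCausalLM

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
model = BartForCausalLM.from_pretrained('facebook/bart-large', add_cross_attention=False)

inputs = tokenizer("Hello, my dog is", return_tensors="pt")
outputs = model(**inputs, use_cache=True)
past = outputs.past_key_values  # cached key/value states for every decoder layer

# Next step: pass only the newly selected token plus the cache instead of the full sequence.
next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
outputs = model(input_ids=next_token, past_key_values=past, use_cache=True)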
Example:
>>> from transformers import BartTokenizer, BartForCausalLM
>>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
>>> model = BartForCausalLM.from_pretrained('facebook/bart-large', add_cross_attention=False)
>>> assert model.config.is_decoder, f"{model.__class__} has to be configured as a decoder."
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
TFBartModel
( *args **kwargs )
Parameters
- config (BartConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare BART Model outputting raw hidden-states without any specific head on top. This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.)
This model is also a tf.keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.
TF 2.0 models accept two formats as inputs:
- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional arguments.
This second option is useful when using the tf.keras.Model.fit method, which currently requires having all
the tensors in the first argument of the model call function: model(inputs).
If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument (a short illustrative sketch follows the list below):
- a single Tensor with input_ids only and nothing else: model(input_ids)
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_ids, attention_mask]) or model([input_ids, attention_mask, token_type_ids])
- a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"input_ids": input_ids, "token_type_ids": token_type_ids})
(
		input_ids = None
			attention_mask = None
			decoder_input_ids = None
			decoder_attention_mask = None
			head_mask = None
			decoder_head_mask = None
			cross_attn_head_mask = None
			encoder_outputs: typing.Union[typing.Tuple, transformers.modeling_tf_outputs.TFBaseModelOutput, NoneType] = None
			past_key_values = None
			inputs_embeds = None
			decoder_inputs_embeds = None
			use_cache = None
			output_attentions = None
			output_hidden_states = None
			return_dict = None
			training = False
			**kwargs
			
		)
→
			transformers.modeling_tf_outputs.TFSeq2SeqModelOutput or tuple(tf.Tensor)
Parameters
- 
input_ids (tf.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
- 
							attention_mask (tf.Tensorof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 
- 
							decoder_input_ids (tf.Tensorof shape(batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary.Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. Bart uses the eos_token_idas the starting token fordecoder_input_idsgeneration. Ifpast_key_valuesis used, optionally only the lastdecoder_input_idshave to be input (seepast_key_values).For translation and summarization training, decoder_input_idsshould be provided. If nodecoder_input_idsis provided, the model will create this tensor by shifting theinput_idsto the right for denoising pre-training following the paper.
- 
							decoder_attention_mask (tf.Tensorof shape(batch_size, target_sequence_length), optional) — will be made by default and ignore pad tokens. It is not recommended to set this for most use cases.
- 
							head_mask (tf.Tensorof shape(encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in[0, 1]:- 1 indicates the head is not masked,
- 0 indicates the head is masked.
 
- 
							decoder_head_mask (tf.Tensorof shape(decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the decoder. Mask values selected in[0, 1]:- 1 indicates the head is not masked,
- 0 indicates the head is masked.
 
- 
							cross_attn_head_mask (tf.Tensorof shape(decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the cross-attention modules. Mask values selected in[0, 1]:- 1 indicates the head is not masked,
- 0 indicates the head is masked.
 
- 
encoder_outputs (tf.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
- 
							past_key_values (Tuple[Tuple[tf.Tensor]]of lengthconfig.n_layers) — contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. Ifpast_key_valuesare used, the user can optionally input only the lastdecoder_input_ids(those that don’t have their past key value states given to this model) of shape(batch_size, 1)instead of alldecoder_input_idsof shape(batch_size, sequence_length).
- 
							use_cache (bool, optional, defaults toTrue) — If set toTrue,past_key_valueskey value states are returned and can be used to speed up decoding (seepast_key_values). Set toFalseduring training,Trueduring generation
- 
							output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail. This argument can be used only in eager mode, in graph mode the value in the config will be used instead.
- 
							output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail. This argument can be used only in eager mode, in graph mode the value in the config will be used instead.
- 
							return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple. This argument can be used in eager mode, in graph mode the value will always be set to True.
- 
							training (bool, optional, defaults toFalse) — Whether or not to use the model in training mode (some modules like dropout modules have different behaviors between training and evaluation).
Returns
transformers.modeling_tf_outputs.TFSeq2SeqModelOutput or tuple(tf.Tensor)
A transformers.modeling_tf_outputs.TFSeq2SeqModelOutput or a tuple of tf.Tensor (if
return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the
configuration (BartConfig) and inputs.
- last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the decoder of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) — List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.
- decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
The TFBartModel forward method overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import BartTokenizer, TFBartModel
>>> import tensorflow as tf
>>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
>>> model = TFBartModel.from_pretrained('facebook/bart-large')
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")
>>> outputs = model(inputs)
>>> last_hidden_states = outputs.last_hidden_state
TFBartForConditionalGeneration
( *args **kwargs )
Parameters
- config (BartConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The BART Model with a language modeling head. Can be used for summarization. This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.)
This model is also a tf.keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.
TF 2.0 models accept two formats as inputs:
- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional arguments.
This second option is useful when using the tf.keras.Model.fit method, which currently requires having all
the tensors in the first argument of the model call function: model(inputs).
If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument:
- a single Tensor with input_ids only and nothing else: model(input_ids)
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_ids, attention_mask]) or model([input_ids, attention_mask, token_type_ids])
- a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"input_ids": input_ids, "token_type_ids": token_type_ids})
(
		input_ids = None
			attention_mask = None
			decoder_input_ids = None
			decoder_attention_mask = None
			head_mask = None
			decoder_head_mask = None
			cross_attn_head_mask = None
			encoder_outputs: typing.Optional[transformers.modeling_tf_outputs.TFBaseModelOutput] = None
			past_key_values = None
			inputs_embeds = None
			decoder_inputs_embeds = None
			use_cache = None
			output_attentions = None
			output_hidden_states = None
			return_dict = None
			labels = None
			training = False
			**kwargs
			
		)
→
			transformers.modeling_tf_outputs.TFSeq2SeqLMOutput or tuple(tf.Tensor)
Parameters
- 
input_ids (tf.Tensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
- 
attention_mask (tf.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 
- 
							decoder_input_ids (tf.Tensorof shape(batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary.Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. Bart uses the eos_token_idas the starting token fordecoder_input_idsgeneration. Ifpast_key_valuesis used, optionally only the lastdecoder_input_idshave to be input (seepast_key_values).For translation and summarization training, decoder_input_idsshould be provided. If nodecoder_input_idsis provided, the model will create this tensor by shifting theinput_idsto the right for denoising pre-training following the paper.
- 
							decoder_attention_mask (tf.Tensorof shape(batch_size, target_sequence_length), optional) — will be made by default and ignore pad tokens. It is not recommended to set this for most use cases.
- 
							head_mask (tf.Tensorof shape(encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in[0, 1]:- 1 indicates the head is not masked,
- 0 indicates the head is masked.
 
- 
							decoder_head_mask (tf.Tensorof shape(decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the decoder. Mask values selected in[0, 1]:- 1 indicates the head is not masked,
- 0 indicates the head is masked.
 
- 
							cross_attn_head_mask (tf.Tensorof shape(decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the cross-attention modules. Mask values selected in[0, 1]:- 1 indicates the head is not masked,
- 0 indicates the head is masked.
 
- 
encoder_outputs (tf.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
- 
							past_key_values (Tuple[Tuple[tf.Tensor]]of lengthconfig.n_layers) — contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. Ifpast_key_valuesare used, the user can optionally input only the lastdecoder_input_ids(those that don’t have their past key value states given to this model) of shape(batch_size, 1)instead of alldecoder_input_idsof shape(batch_size, sequence_length).
- 
							use_cache (bool, optional, defaults toTrue) — If set toTrue,past_key_valueskey value states are returned and can be used to speed up decoding (seepast_key_values). Set toFalseduring training,Trueduring generation
- 
							output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail. This argument can be used only in eager mode, in graph mode the value in the config will be used instead.
- 
							output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail. This argument can be used only in eager mode, in graph mode the value in the config will be used instead.
- 
							return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple. This argument can be used in eager mode, in graph mode the value will always be set to True.
- 
							training (bool, optional, defaults toFalse) — Whether or not to use the model in training mode (some modules like dropout modules have different behaviors between training and evaluation).
- 
							labels (tf.Tensorof shape(batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in[0, ..., config.vocab_size]or -100 (seeinput_idsdocstring). Tokens with indices set to-100are ignored (masked), the loss is only computed for the tokens with labels in[0, ..., config.vocab_size].
Returns
transformers.modeling_tf_outputs.TFSeq2SeqLMOutput or tuple(tf.Tensor)
A transformers.modeling_tf_outputs.TFSeq2SeqLMOutput or a tuple of tf.Tensor (if
return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the
configuration (BartConfig) and inputs.
- loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when labels is provided) — Language modeling loss.
- logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) — List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.
- decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
The TFBartForConditionalGeneration forward method overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Summarization example::
from transformers import BartTokenizer, TFBartForConditionalGeneration
model = TFBartForConditionalGeneration.from_pretrained('facebook/bart-large')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
ARTICLE_TO_SUMMARIZE = "My friends are cool but they eat too many carbs."
inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='tf')
# Generate Summary
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=5, early_stopping=True)
print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids])
Mask filling example::
import tensorflow as tf
from transformers import BartTokenizer, TFBartForConditionalGeneration
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
TXT = "My friends are <mask> but they eat too many carbs."
model = TFBartForConditionalGeneration.from_pretrained('facebook/bart-large')
input_ids = tokenizer([TXT], return_tensors='tf')['input_ids']
logits = model(input_ids).logits
probs = tf.nn.softmax(logits[0])
# probs[5] is associated with the mask token
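For fine-tuning, the labels argument described above can be passed directly, in which case the model computes the language modeling loss internally. The following is an illustrative sketch (not taken from the original documentation) using a single toy input/target pair:

from transformers import BartTokenizer, TFBartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
model = TFBartForConditionalGeneration.from_pretrained('facebook/bart-large')

inputs = tokenizer(["My friends are cool but they eat too many carbs."], return_tensors='tf')
targets = tokenizer(["My friends are cool."], return_tensors='tf')

outputs = model(input_ids=inputs['input_ids'],
                attention_mask=inputs['attention_mask'],
                labels=targets['input_ids'])
loss = outputs.loss      # language modeling loss over the label tokens
logits = outputs.logits  # shape (batch_size, target_sequence_length, vocab_size)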
FlaxBartModel
( config: BartConfig input_shape: typing.Tuple[int] = (1, 1) seed: int = 0 dtype: dtype = <class 'jax._src.numpy.lax_numpy.float32'> **kwargs )
Parameters
- config (BartConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- 
							dtype (jax.numpy.dtype, optional, defaults tojax.numpy.float32) — The data type of the computation. Can be one ofjax.numpy.float32,jax.numpy.float16(on GPUs) andjax.numpy.bfloat16(on TPUs).This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If specified all the computation will be performed with the given dtype.Note that this only specifies the dtype of the computation and does not influence the dtype of model parameters. If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16(). 
The bare Bart Model transformer outputting raw hidden-states without any specific head on top. This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.)
This model is also a Flax Linen flax.nn.Module subclass. Use it as a regular Flax Module and refer to the Flax documentation for all matters related to general usage and behavior.
Finally, this model supports inherent JAX features such as:
- Just-In-Time (JIT) compilation
- Automatic Differentiation
- Vectorization
- Parallelization
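For instance, JIT compilation can be applied to the model's forward pass. The snippet below is an illustrative sketch (not part of the original documentation) assuming the facebook/bart-base checkpoint:

import jax
from transformers import BartTokenizer, FlaxBartModel

tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
model = FlaxBartModel.from_pretrained('facebook/bart-base')
inputs = tokenizer("Hello, my dog is cute", return_tensors='jax')

@jax.jit
def run(input_ids, attention_mask):
    # returns the decoder's last hidden state from the FlaxSeq2SeqModelOutput
    return model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state

last_hidden_state = run(inputs['input_ids'], inputs['attention_mask'])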
(
		input_ids: ndarray
			attention_mask: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			decoder_input_ids: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			decoder_attention_mask: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			position_ids: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			decoder_position_ids: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			output_attentions: typing.Optional[bool] = None
			output_hidden_states: typing.Optional[bool] = None
			return_dict: typing.Optional[bool] = None
			train: bool = False
			params: dict = None
			dropout_rng: PRNGKey = None
			
		)
→
			transformers.modeling_flax_outputs.FlaxSeq2SeqModelOutput or tuple(torch.FloatTensor)
Parameters
- 
							input_ids (jnp.ndarrayof shape(batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. 
- 
							attention_mask (jnp.ndarrayof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 
- 
							decoder_input_ids (jnp.ndarrayof shape(batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary.Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. For translation and summarization training, decoder_input_idsshould be provided. If nodecoder_input_idsis provided, the model will create this tensor by shifting theinput_idsto the right for denoising pre-training following the paper.
- 
							decoder_attention_mask (jnp.ndarrayof shape(batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens indecoder_input_ids. Causal mask will also be used by default.If you want to change padding behavior, you should modify to your needs. See diagram 1 in the paper for more information on the default strategy. 
- 
							position_ids (numpy.ndarrayof shape(batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.max_position_embeddings - 1].
- 
							decoder_position_ids (numpy.ndarrayof shape(batch_size, sequence_length), optional) — Indices of positions of each decoder input sequence tokens in the position embeddings. Selected in the range[0, config.max_position_embeddings - 1].
- 
							output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail.
- 
							output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail.
- 
							return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_flax_outputs.FlaxSeq2SeqModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_flax_outputs.FlaxSeq2SeqModelOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (BartConfig) and inputs.
- last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the decoder of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- decoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- decoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- encoder_last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- encoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- encoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
The FlaxBartPreTrainedModel forward method overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import BartTokenizer, FlaxBartModel
>>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
>>> model = FlaxBartModel.from_pretrained('facebook/bart-base')
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors='jax')
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
(
		input_ids: ndarray
			attention_mask: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			position_ids: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			output_attentions: typing.Optional[bool] = None
			output_hidden_states: typing.Optional[bool] = None
			return_dict: typing.Optional[bool] = None
			train: bool = False
			params: dict = None
			dropout_rng: PRNGKey = None
			
		)
→
			transformers.modeling_flax_outputs.FlaxBaseModelOutput or tuple(torch.FloatTensor)
Parameters
- 
							input_ids (jnp.ndarrayof shape(batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. 
- 
							attention_mask (jnp.ndarrayof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 
- 
							position_ids (numpy.ndarrayof shape(batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.max_position_embeddings - 1].
- 
							output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail.
- 
							output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail.
- 
							return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_flax_outputs.FlaxBaseModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_flax_outputs.FlaxBaseModelOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (BartConfig) and inputs.
- last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Example:
>>> from transformers import BartTokenizer, FlaxBartForConditionalGeneration
>>> model = FlaxBartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
>>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
>>> text = "My friends are cool but they eat too many carbs."
>>> inputs = tokenizer(text, max_length=1024, return_tensors='jax')
>>> encoder_outputs = model.encode(**inputs)
(
		decoder_input_ids
			encoder_outputs
			encoder_attention_mask: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			decoder_attention_mask: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			decoder_position_ids: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			past_key_values: dict = None
			output_attentions: typing.Optional[bool] = None
			output_hidden_states: typing.Optional[bool] = None
			return_dict: typing.Optional[bool] = None
			train: bool = False
			params: dict = None
			dropout_rng: PRNGKey = None
			
		)
→
			transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPastAndCrossAttentions or tuple(torch.FloatTensor)
Parameters
- 
							decoder_input_ids (jnp.ndarrayof shape(batch_size, target_sequence_length)) — Indices of decoder input sequence tokens in the vocabulary.Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. For translation and summarization training, decoder_input_idsshould be provided. If nodecoder_input_idsis provided, the model will create this tensor by shifting theinput_idsto the right for denoising pre-training following the paper.
- 
							encoder_outputs (tuple(tuple(jnp.ndarray)) — Tuple consists of (last_hidden_state, optional:hidden_states, optional:attentions)last_hidden_stateof shape(batch_size, sequence_length, hidden_size), optional) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
- 
							encoder_attention_mask (jnp.ndarrayof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 
- 
							decoder_attention_mask (jnp.ndarrayof shape(batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens indecoder_input_ids. Causal mask will also be used by default.If you want to change padding behavior, you should modify to your needs. See diagram 1 in the paper for more information on the default strategy. 
- 
							decoder_position_ids (numpy.ndarrayof shape(batch_size, sequence_length), optional) — Indices of positions of each decoder input sequence tokens in the position embeddings. Selected in the range[0, config.max_position_embeddings - 1].
- 
							past_key_values (Dict[str, np.ndarray], optional, returned byinit_cacheor when passing previouspast_key_values) — Dictionary of pre-computed hidden-states (key and values in the attention blocks) that can be used for fast auto-regressive decoding. Pre-computed key and value hidden-states are of shape [batch_size, max_length].
- 
							output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail.
- 
							output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail.
- 
							return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPastAndCrossAttentions or tuple(torch.FloatTensor)
A transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPastAndCrossAttentions or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (BartConfig) and inputs.
- last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and optionally, if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally, if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
Example:
>>> from transformers import BartTokenizer, FlaxBartForConditionalGeneration
>>> model = FlaxBartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
>>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
>>> text = "My friends are cool but they eat too many carbs."
>>> inputs = tokenizer(text, max_length=1024, return_tensors='jax')
>>> encoder_outputs = model.encode(**inputs)
>>> decoder_start_token_id = model.config.decoder_start_token_id
>>> decoder_input_ids = jnp.ones((inputs.input_ids.shape[0], 1), dtype="i4") * decoder_start_token_id
>>> outputs = model.decode(decoder_input_ids, encoder_outputs)
>>> last_decoder_hidden_states = outputs.last_hidden_state
FlaxBartForConditionalGeneration
( config: BartConfig input_shape: typing.Tuple[int] = (1, 1) seed: int = 0 dtype: dtype = <class 'jax._src.numpy.lax_numpy.float32'> **kwargs )
Parameters
- config (BartConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- 
							dtype (jax.numpy.dtype, optional, defaults tojax.numpy.float32) — The data type of the computation. Can be one ofjax.numpy.float32,jax.numpy.float16(on GPUs) andjax.numpy.bfloat16(on TPUs).This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If specified all the computation will be performed with the given dtype.Note that this only specifies the dtype of the computation and does not influence the dtype of model parameters. If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16(). 
The BART Model with a language modeling head. Can be used for summarization. This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.)
This model is also a Flax Linen flax.nn.Module subclass. Use it as a regular Flax Module and refer to the Flax documentation for all matters related to general usage and behavior.
Finally, this model supports inherent JAX features such as:
- Just-In-Time (JIT) compilation
- Automatic Differentiation
- Vectorization
- Parallelization
(
		input_ids: ndarray
			attention_mask: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			decoder_input_ids: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			decoder_attention_mask: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			position_ids: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			decoder_position_ids: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			output_attentions: typing.Optional[bool] = None
			output_hidden_states: typing.Optional[bool] = None
			return_dict: typing.Optional[bool] = None
			train: bool = False
			params: dict = None
			dropout_rng: PRNGKey = None
			
		)
→
			transformers.modeling_flax_outputs.FlaxSeq2SeqLMOutput or tuple(torch.FloatTensor)
Parameters
- 
							input_ids (jnp.ndarrayof shape(batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. 
- 
							attention_mask (jnp.ndarrayof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 
- 
							decoder_input_ids (jnp.ndarrayof shape(batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary.Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. For translation and summarization training, decoder_input_idsshould be provided. If nodecoder_input_idsis provided, the model will create this tensor by shifting theinput_idsto the right for denoising pre-training following the paper.
- 
							decoder_attention_mask (jnp.ndarrayof shape(batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens indecoder_input_ids. Causal mask will also be used by default.If you want to change padding behavior, you should modify to your needs. See diagram 1 in the paper for more information on the default strategy. 
- 
							position_ids (numpy.ndarrayof shape(batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.max_position_embeddings - 1].
- 
							decoder_position_ids (numpy.ndarrayof shape(batch_size, sequence_length), optional) — Indices of positions of each decoder input sequence tokens in the position embeddings. Selected in the range[0, config.max_position_embeddings - 1].
- 
							output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail.
- 
							output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail.
- 
							return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_flax_outputs.FlaxSeq2SeqLMOutput or tuple(torch.FloatTensor)
A transformers.modeling_flax_outputs.FlaxSeq2SeqLMOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (BartConfig) and inputs.
- logits (jnp.ndarray of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- decoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- decoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- encoder_last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- encoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- encoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
The FlaxBartPreTrainedModel forward method overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Summarization example::
from transformers import BartTokenizer, FlaxBartForConditionalGeneration
model = FlaxBartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
ARTICLE_TO_SUMMARIZE = "My friends are cool but they eat too many carbs."
inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='jax')
# Generate Summary
summary_ids = model.generate(inputs['input_ids']).sequences
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False))
Mask filling example::
import jax
from transformers import BartTokenizer, FlaxBartForConditionalGeneration
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
TXT = "My friends are <mask> but they eat too many carbs."
model = FlaxBartForConditionalGeneration.from_pretrained('facebook/bart-large')
input_ids = tokenizer([TXT], return_tensors='jax')['input_ids']
logits = model(input_ids).logits
masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero()[0].item()
probs = jax.nn.softmax(logits[0, masked_index], axis=0)
# jax.lax.top_k needs an explicit k; keep the 5 most likely replacement tokens
values, predictions = jax.lax.top_k(probs, k=5)
tokenizer.decode(predictions).split()
(
		input_ids: ndarray
			attention_mask: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			position_ids: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			output_attentions: typing.Optional[bool] = None
			output_hidden_states: typing.Optional[bool] = None
			return_dict: typing.Optional[bool] = None
			train: bool = False
			params: dict = None
			dropout_rng: PRNGKey = None
			
		)
→
			transformers.modeling_flax_outputs.FlaxBaseModelOutput or tuple(torch.FloatTensor)
Parameters
- 
							input_ids (jnp.ndarrayof shape(batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. 
- 
							attention_mask (jnp.ndarrayof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 
- 
							position_ids (numpy.ndarrayof shape(batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.max_position_embeddings - 1].
- 
							output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail.
- 
							output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail.
- 
							return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_flax_outputs.FlaxBaseModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_flax_outputs.FlaxBaseModelOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (BartConfig) and inputs.
- last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Example:
>>> from transformers import BartTokenizer, FlaxBartForConditionalGeneration
>>> model = FlaxBartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
>>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
>>> text = "My friends are cool but they eat too many carbs."
>>> inputs = tokenizer(text, max_length=1024, return_tensors='jax')
>>> encoder_outputs = model.encode(**inputs)
(
		decoder_input_ids
			encoder_outputs
			encoder_attention_mask: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			decoder_attention_mask: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			decoder_position_ids: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			past_key_values: dict = None
			output_attentions: typing.Optional[bool] = None
			output_hidden_states: typing.Optional[bool] = None
			return_dict: typing.Optional[bool] = None
			train: bool = False
			params: dict = None
			dropout_rng: PRNGKey = None
			
		)
→
			transformers.modeling_flax_outputs.FlaxCausalLMOutputWithCrossAttentions or tuple(torch.FloatTensor)
Parameters
- 
							decoder_input_ids (jnp.ndarrayof shape(batch_size, target_sequence_length)) — Indices of decoder input sequence tokens in the vocabulary.Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. For translation and summarization training, decoder_input_idsshould be provided. If nodecoder_input_idsis provided, the model will create this tensor by shifting theinput_idsto the right for denoising pre-training following the paper.
- 
							encoder_outputs (tuple(tuple(jnp.ndarray)) — Tuple consists of (last_hidden_state, optional:hidden_states, optional:attentions)last_hidden_stateof shape(batch_size, sequence_length, hidden_size), optional) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
- 
							encoder_attention_mask (jnp.ndarrayof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 
- 
							decoder_attention_mask (jnp.ndarrayof shape(batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens indecoder_input_ids. Causal mask will also be used by default.If you want to change padding behavior, you should modify to your needs. See diagram 1 in the paper for more information on the default strategy. 
- 
							decoder_position_ids (numpy.ndarrayof shape(batch_size, sequence_length), optional) — Indices of positions of each decoder input sequence tokens in the position embeddings. Selected in the range[0, config.max_position_embeddings - 1].
- 
							past_key_values (Dict[str, np.ndarray], optional, returned byinit_cacheor when passing previouspast_key_values) — Dictionary of pre-computed hidden-states (key and values in the attention blocks) that can be used for fast auto-regressive decoding. Pre-computed key and value hidden-states are of shape [batch_size, max_length].
- 
							output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail.
- 
							output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail.
- 
							return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_flax_outputs.FlaxCausalLMOutputWithCrossAttentions or tuple(torch.FloatTensor)
A transformers.modeling_flax_outputs.FlaxCausalLMOutputWithCrossAttentions or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (<class 'transformers.models.bart.configuration_bart.BartConfig'>) and inputs.
- logits (jnp.ndarray of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Cross attentions weights after the attention softmax, used to compute the weighted average in the cross-attention heads.
- past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of jnp.ndarray tuples of length config.n_layers, with each tuple containing the cached key, value states of the self-attention and the cross-attention layers if the model is used in an encoder-decoder setting. Only relevant if config.is_decoder = True. Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
Example:
>>> from transformers import BartTokenizer, FlaxBartForConditionalGeneration
>>> model = FlaxBartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
>>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
>>> text = "My friends are cool but they eat too many carbs."
>>> inputs = tokenizer(text, max_length=1024, return_tensors='jax')
>>> encoder_outputs = model.encode(**inputs)
>>> decoder_start_token_id = model.config.decoder_start_token_id
>>> decoder_input_ids = jnp.ones((inputs.input_ids.shape[0], 1), dtype="i4") * decoder_start_token_id
>>> outputs = model.decode(decoder_input_ids, encoder_outputs)
>>> logits = outputs.logits
FlaxBartForSequenceClassification
( config: BartConfig input_shape: typing.Tuple[int] = (1, 1) seed: int = 0 dtype: dtype = <class 'jax._src.numpy.lax_numpy.float32'> **kwargs )
Parameters
- config (BartConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- 
							dtype (jax.numpy.dtype, optional, defaults tojax.numpy.float32) — The data type of the computation. Can be one ofjax.numpy.float32,jax.numpy.float16(on GPUs) andjax.numpy.bfloat16(on TPUs).This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If specified all the computation will be performed with the given dtype.Note that this only specifies the dtype of the computation and does not influence the dtype of model parameters. If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16(). 
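As a concrete illustration of the dtype argument and the to_fp16()/to_bf16() helpers it mentions, here is a minimal sketch; the facebook/bart-base checkpoint is only a placeholder, and casting parameters to half precision can reduce numerical accuracy, so treat this as an example rather than a recommendation.

import jax.numpy as jnp
from transformers import FlaxBartForSequenceClassification

# Run the computation in bfloat16 (useful on TPUs); the stored parameters keep their original dtype
model = FlaxBartForSequenceClassification.from_pretrained('facebook/bart-base', dtype=jnp.bfloat16)

# Optionally cast the parameters themselves to bfloat16 as well
model.params = model.to_bf16(model.params)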
Bart model with a sequence classification head on top (a linear layer on top of the pooled output), e.g. for GLUE tasks.
This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).
This model is also a Flax Linen flax.nn.Module subclass. Use it as a regular Flax Module and refer to the Flax documentation for all matters related to general usage and behavior.
Finally, this model supports inherent JAX features such as Just-In-Time (JIT) compilation, automatic differentiation, vectorization and parallelization.
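For example, the forward pass can be wrapped in jax.jit so that repeated calls with the same input shapes reuse a compiled computation. This is a minimal sketch, not an official recipe: it assumes the facebook/bart-base checkpoint, whose classification head is randomly initialized, so the predicted label is only illustrative.

import jax
import jax.numpy as jnp
from transformers import BartTokenizer, FlaxBartForSequenceClassification

tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
model = FlaxBartForSequenceClassification.from_pretrained('facebook/bart-base')

@jax.jit
def classify(input_ids, attention_mask):
    # The model parameters are closed over and baked into the compiled function
    return model(input_ids=input_ids, attention_mask=attention_mask).logits

inputs = tokenizer("Hello, my dog is cute", return_tensors='jax')
logits = classify(inputs['input_ids'], inputs['attention_mask'])
predicted_class = int(jnp.argmax(logits, axis=-1)[0])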
(
		input_ids: ndarray
			attention_mask: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			decoder_input_ids: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			decoder_attention_mask: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			position_ids: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			decoder_position_ids: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			output_attentions: typing.Optional[bool] = None
			output_hidden_states: typing.Optional[bool] = None
			return_dict: typing.Optional[bool] = None
			train: bool = False
			params: dict = None
			dropout_rng: PRNGKey = None
			
		)
→
			transformers.modeling_flax_outputs.FlaxSeq2SeqSequenceClassifierOutput or tuple(torch.FloatTensor)
Parameters
- 
							input_ids (jnp.ndarrayof shape(batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. 
- 
							attention_mask (jnp.ndarrayof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 
- 
							decoder_input_ids (jnp.ndarrayof shape(batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary.Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. For translation and summarization training, decoder_input_idsshould be provided. If nodecoder_input_idsis provided, the model will create this tensor by shifting theinput_idsto the right for denoising pre-training following the paper.
- 
							decoder_attention_mask (jnp.ndarrayof shape(batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens indecoder_input_ids. Causal mask will also be used by default.If you want to change padding behavior, you should modify to your needs. See diagram 1 in the paper for more information on the default strategy. 
- 
							position_ids (numpy.ndarrayof shape(batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.max_position_embeddings - 1].
- 
							decoder_position_ids (numpy.ndarrayof shape(batch_size, sequence_length), optional) — Indices of positions of each decoder input sequence tokens in the position embeddings. Selected in the range[0, config.max_position_embeddings - 1].
- 
							output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail.
- 
							output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail.
- 
							return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_flax_outputs.FlaxSeq2SeqSequenceClassifierOutput or tuple(torch.FloatTensor)
A transformers.modeling_flax_outputs.FlaxSeq2SeqSequenceClassifierOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (BartConfig) and inputs.
- logits (jnp.ndarray of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
- past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- decoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- decoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- encoder_last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- encoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- encoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
The FlaxBartPreTrainedModel forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while
the latter silently ignores them.
Example:
>>> from transformers import BartTokenizer, FlaxBartForSequenceClassification
>>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
>>> model = FlaxBartForSequenceClassification.from_pretrained('facebook/bart-base')
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors='jax')
>>> outputs = model(**inputs)
>>> logits = outputs.logits
(
		input_ids: ndarray
			attention_mask: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			position_ids: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			output_attentions: typing.Optional[bool] = None
			output_hidden_states: typing.Optional[bool] = None
			return_dict: typing.Optional[bool] = None
			train: bool = False
			params: dict = None
			dropout_rng: PRNGKey = None
			
		)
→
			transformers.modeling_flax_outputs.FlaxBaseModelOutput or tuple(torch.FloatTensor)
Parameters
- 
							input_ids (jnp.ndarrayof shape(batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. 
- 
							attention_mask (jnp.ndarrayof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 
- 
							position_ids (numpy.ndarrayof shape(batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.max_position_embeddings - 1].
- 
							output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail.
- 
							output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail.
- 
							return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_flax_outputs.FlaxBaseModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_flax_outputs.FlaxBaseModelOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (<class 'transformers.models.bart.configuration_bart.BartConfig'>) and inputs.
- last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Example:
>>> from transformers import BartTokenizer, FlaxBartForConditionalGeneration
>>> model = FlaxBartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
>>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
>>> text = "My friends are cool but they eat too many carbs."
>>> inputs = tokenizer(text, max_length=1024, return_tensors='jax')
>>> encoder_outputs = model.encode(**inputs)
(
		decoder_input_ids
			encoder_outputs
			encoder_attention_mask: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			decoder_attention_mask: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			decoder_position_ids: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			past_key_values: dict = None
			output_attentions: typing.Optional[bool] = None
			output_hidden_states: typing.Optional[bool] = None
			return_dict: typing.Optional[bool] = None
			train: bool = False
			params: dict = None
			dropout_rng: PRNGKey = None
			
		)
→
			transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPastAndCrossAttentions or tuple(torch.FloatTensor)
Parameters
- 
							decoder_input_ids (jnp.ndarrayof shape(batch_size, target_sequence_length)) — Indices of decoder input sequence tokens in the vocabulary.Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. For translation and summarization training, decoder_input_idsshould be provided. If nodecoder_input_idsis provided, the model will create this tensor by shifting theinput_idsto the right for denoising pre-training following the paper.
- 
							encoder_outputs (tuple(tuple(jnp.ndarray)) — Tuple consists of (last_hidden_state, optional:hidden_states, optional:attentions)last_hidden_stateof shape(batch_size, sequence_length, hidden_size), optional) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
- 
							encoder_attention_mask (jnp.ndarrayof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 
- 
							decoder_attention_mask (jnp.ndarrayof shape(batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens indecoder_input_ids. Causal mask will also be used by default.If you want to change padding behavior, you should modify to your needs. See diagram 1 in the paper for more information on the default strategy. 
- 
							decoder_position_ids (numpy.ndarrayof shape(batch_size, sequence_length), optional) — Indices of positions of each decoder input sequence tokens in the position embeddings. Selected in the range[0, config.max_position_embeddings - 1].
- 
							past_key_values (Dict[str, np.ndarray], optional, returned byinit_cacheor when passing previouspast_key_values) — Dictionary of pre-computed hidden-states (key and values in the attention blocks) that can be used for fast auto-regressive decoding. Pre-computed key and value hidden-states are of shape [batch_size, max_length].
- 
							output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail.
- 
							output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail.
- 
							return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPastAndCrossAttentions or tuple(torch.FloatTensor)
A transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPastAndCrossAttentions or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (<class 'transformers.models.bart.configuration_bart.BartConfig'>) and inputs.
- last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and optionally, if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and, optionally if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
Example:
>>> from transformers import BartTokenizer, FlaxBartForConditionalGeneration
>>> model = FlaxBartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
>>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
>>> text = "My friends are cool but they eat too many carbs."
>>> inputs = tokenizer(text, max_length=1024, return_tensors='jax')
>>> encoder_outputs = model.encode(**inputs)
>>> decoder_start_token_id = model.config.decoder_start_token_id
>>> decoder_input_ids = jnp.ones((inputs.input_ids.shape[0], 1), dtype="i4") * decoder_start_token_id
>>> outputs = model.decode(decoder_input_ids, encoder_outputs)
>>> last_decoder_hidden_states = outputs.last_hidden_state
FlaxBartForQuestionAnswering
( config: BartConfig input_shape: typing.Tuple[int] = (1, 1) seed: int = 0 dtype: dtype = <class 'jax._src.numpy.lax_numpy.float32'> **kwargs )
Parameters
- config (BartConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- 
							dtype (jax.numpy.dtype, optional, defaults tojax.numpy.float32) — The data type of the computation. Can be one ofjax.numpy.float32,jax.numpy.float16(on GPUs) andjax.numpy.bfloat16(on TPUs).This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If specified all the computation will be performed with the given dtype.Note that this only specifies the dtype of the computation and does not influence the dtype of model parameters. If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16(). 
BART Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear
layer on top of the hidden-states output to compute span start logits and span end logits).
This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).
This model is also a Flax Linen flax.nn.Module subclass. Use it as a regular Flax Module and refer to the Flax documentation for all matters related to general usage and behavior.
Finally, this model supports inherent JAX features such as Just-In-Time (JIT) compilation, automatic differentiation, vectorization and parallelization.
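As a sketch of how the start/end logits are typically turned into an answer string, the snippet below picks the argmax start and end positions and decodes the tokens in between. This post-processing is a common convention, not part of the model itself, and the facebook/bart-base checkpoint used here has a randomly initialized QA head, so the extracted span is only illustrative.

import jax.numpy as jnp
from transformers import BartTokenizer, FlaxBartForQuestionAnswering

tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
model = FlaxBartForQuestionAnswering.from_pretrained('facebook/bart-base')

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
inputs = tokenizer(question, text, return_tensors='jax')
outputs = model(**inputs)

# Most likely start and end token positions
start_index = int(jnp.argmax(outputs.start_logits, axis=-1)[0])
end_index = int(jnp.argmax(outputs.end_logits, axis=-1)[0])

# Decode the tokens between the predicted start and end positions
answer_tokens = inputs['input_ids'][0, start_index:end_index + 1]
answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)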
(
		input_ids: ndarray
			attention_mask: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			decoder_input_ids: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			decoder_attention_mask: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			position_ids: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			decoder_position_ids: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			output_attentions: typing.Optional[bool] = None
			output_hidden_states: typing.Optional[bool] = None
			return_dict: typing.Optional[bool] = None
			train: bool = False
			params: dict = None
			dropout_rng: PRNGKey = None
			
		)
→
			transformers.modeling_flax_outputs.FlaxSeq2SeqQuestionAnsweringModelOutput or tuple(torch.FloatTensor)
Parameters
- 
							input_ids (jnp.ndarrayof shape(batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. 
- 
							attention_mask (jnp.ndarrayof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 
- 
							decoder_input_ids (jnp.ndarrayof shape(batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary.Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. For translation and summarization training, decoder_input_idsshould be provided. If nodecoder_input_idsis provided, the model will create this tensor by shifting theinput_idsto the right for denoising pre-training following the paper.
- 
							decoder_attention_mask (jnp.ndarrayof shape(batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens indecoder_input_ids. Causal mask will also be used by default.If you want to change padding behavior, you should modify to your needs. See diagram 1 in the paper for more information on the default strategy. 
- 
							position_ids (numpy.ndarrayof shape(batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.max_position_embeddings - 1].
- 
							decoder_position_ids (numpy.ndarrayof shape(batch_size, sequence_length), optional) — Indices of positions of each decoder input sequence tokens in the position embeddings. Selected in the range[0, config.max_position_embeddings - 1].
- 
							output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail.
- 
							output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail.
- 
							return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_flax_outputs.FlaxSeq2SeqQuestionAnsweringModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_flax_outputs.FlaxSeq2SeqQuestionAnsweringModelOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (BartConfig) and inputs.
- start_logits (jnp.ndarray of shape (batch_size, sequence_length)) — Span-start scores (before SoftMax).
- end_logits (jnp.ndarray of shape (batch_size, sequence_length)) — Span-end scores (before SoftMax).
- past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- decoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- decoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- encoder_last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- encoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- encoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
The FlaxBartPreTrainedModel forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while
the latter silently ignores them.
Example:
>>> from transformers import BartTokenizer, FlaxBartForQuestionAnswering
>>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
>>> model = FlaxBartForQuestionAnswering.from_pretrained('facebook/bart-base')
>>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
>>> inputs = tokenizer(question, text, return_tensors='jax')
>>> outputs = model(**inputs)
>>> start_scores = outputs.start_logits
>>> end_scores = outputs.end_logits
(
		input_ids: ndarray
			attention_mask: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			position_ids: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			output_attentions: typing.Optional[bool] = None
			output_hidden_states: typing.Optional[bool] = None
			return_dict: typing.Optional[bool] = None
			train: bool = False
			params: dict = None
			dropout_rng: PRNGKey = None
			
		)
→
			transformers.modeling_flax_outputs.FlaxBaseModelOutput or tuple(torch.FloatTensor)
Parameters
- 
							input_ids (jnp.ndarrayof shape(batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. 
- 
							attention_mask (jnp.ndarrayof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 
- 
							position_ids (numpy.ndarrayof shape(batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.max_position_embeddings - 1].
- 
							output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail.
- 
							output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail.
- 
							return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_flax_outputs.FlaxBaseModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_flax_outputs.FlaxBaseModelOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (<class 'transformers.models.bart.configuration_bart.BartConfig'>) and inputs.
- last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Example:
>>> from transformers import BartTokenizer, FlaxBartForConditionalGeneration
>>> model = FlaxBartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
>>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
>>> text = "My friends are cool but they eat too many carbs."
>>> inputs = tokenizer(text, max_length=1024, return_tensors='jax')
>>> encoder_outputs = model.encode(**inputs)
(
		decoder_input_ids
			encoder_outputs
			encoder_attention_mask: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			decoder_attention_mask: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			decoder_position_ids: typing.Optional[jax._src.numpy.lax_numpy.ndarray] = None
			past_key_values: dict = None
			output_attentions: typing.Optional[bool] = None
			output_hidden_states: typing.Optional[bool] = None
			return_dict: typing.Optional[bool] = None
			train: bool = False
			params: dict = None
			dropout_rng: PRNGKey = None
			
		)
→
			transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPastAndCrossAttentions or tuple(torch.FloatTensor)
Parameters
- 
							decoder_input_ids (jnp.ndarrayof shape(batch_size, target_sequence_length)) — Indices of decoder input sequence tokens in the vocabulary.Indices can be obtained using BartTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. For translation and summarization training, decoder_input_idsshould be provided. If nodecoder_input_idsis provided, the model will create this tensor by shifting theinput_idsto the right for denoising pre-training following the paper.
- 
							encoder_outputs (tuple(tuple(jnp.ndarray)) — Tuple consists of (last_hidden_state, optional:hidden_states, optional:attentions)last_hidden_stateof shape(batch_size, sequence_length, hidden_size), optional) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
- 
							encoder_attention_mask (jnp.ndarrayof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 
- 
							decoder_attention_mask (jnp.ndarrayof shape(batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens indecoder_input_ids. Causal mask will also be used by default.If you want to change padding behavior, you should modify to your needs. See diagram 1 in the paper for more information on the default strategy. 
- 
							decoder_position_ids (numpy.ndarrayof shape(batch_size, sequence_length), optional) — Indices of positions of each decoder input sequence tokens in the position embeddings. Selected in the range[0, config.max_position_embeddings - 1].
- 
							past_key_values (Dict[str, np.ndarray], optional, returned byinit_cacheor when passing previouspast_key_values) — Dictionary of pre-computed hidden-states (key and values in the attention blocks) that can be used for fast auto-regressive decoding. Pre-computed key and value hidden-states are of shape [batch_size, max_length].
- 
							output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail.
- 
							output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail.
- 
							return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPastAndCrossAttentions or tuple(torch.FloatTensor)
A transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPastAndCrossAttentions or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (<class 'transformers.models.bart.configuration_bart.BartConfig'>) and inputs.
- last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and optionally, if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and, optionally if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
Example:
>>> from transformers import BartTokenizer, FlaxBartForConditionalGeneration
>>> model = FlaxBartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
>>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
>>> text = "My friends are cool but they eat too many carbs."
>>> inputs = tokenizer(text, max_length=1024, return_tensors='jax')
>>> encoder_outputs = model.encode(**inputs)
>>> decoder_start_token_id = model.config.decoder_start_token_id
>>> decoder_input_ids = jnp.ones((inputs.input_ids.shape[0], 1), dtype="i4") * decoder_start_token_id
>>> outputs = model.decode(decoder_input_ids, encoder_outputs)
>>> last_decoder_hidden_states = outputs.last_hidden_state
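The past_key_values mechanism documented above can be combined with decode() for step-by-step decoding. The snippet below is a sketch that reuses the model, inputs and encoder_outputs from the example above; it assumes init_cache() is available on the model with a (batch_size, max_length, encoder_outputs) signature, as defined on the Flax BART pretrained base class, and that decoder_position_ids must be supplied whenever past_key_values is passed.

>>> import jax.numpy as jnp

>>> # Allocate a cache that can hold up to max_length decoding steps
>>> max_length = 16
>>> past_key_values = model.init_cache(inputs.input_ids.shape[0], max_length, encoder_outputs)

>>> # First decoding step: feed only the start token and keep the returned cache
>>> decoder_input_ids = jnp.ones((inputs.input_ids.shape[0], 1), dtype="i4") * model.config.decoder_start_token_id
>>> decoder_position_ids = jnp.zeros((inputs.input_ids.shape[0], 1), dtype="i4")
>>> outputs = model.decode(
...     decoder_input_ids,
...     encoder_outputs,
...     past_key_values=past_key_values,
...     decoder_position_ids=decoder_position_ids,
... )
>>> past_key_values = outputs.past_key_values  # reuse on the next step together with the next token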