Upload BAGEL-7B-MoT model files
- README.md +118 -0
- ae.safetensors +3 -0
- cache/models--ByteDance-Seed--BAGEL-7B-MoT/refs/main +1 -0
- config.json +5 -0
- convert_ema_to_standard.py +127 -0
- ema.safetensors +3 -0
- generation_config.json +14 -0
- llm_config.json +27 -0
- merges.txt +0 -0
- model.safetensors +3 -0
- model.safetensors.index.json +0 -0
- tokenizer.json +0 -0
- tokenizer_config.json +207 -0
- vit_config.json +9 -0
- vocab.json +0 -0
README.md
ADDED
@@ -0,0 +1,118 @@
---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: any-to-any
library_name: bagel-mot
---

<p align="left">
  <img src="https://lf3-static.bytednsdoc.com/obj/eden-cn/nuhojubrps/banner.png" alt="BAGEL" width="480"/>
</p>

# 🥯 BAGEL • Unified Model for Multimodal Understanding and Generation

<p align="left">
  <a href="https://bagel-ai.org/">
    <img
      src="https://img.shields.io/badge/BAGEL-Website-0A66C2?logo=safari&logoColor=white" style="display: inline-block; vertical-align: middle;"
      alt="BAGEL Website"
    />
  </a>
  <a href="https://arxiv.org/abs/2505.14683">
    <img
      src="https://img.shields.io/badge/BAGEL-Paper-red?logo=arxiv&logoColor=red" style="display: inline-block; vertical-align: middle;"
      alt="BAGEL Paper on arXiv"
    />
  </a>
  <a href="https://github.com/bytedance-seed/BAGEL" target="_blank" style="margin: 2px;">
    <img
      src="https://img.shields.io/badge/BAGEL-Codebase-536af5?color=536af5&logo=github" style="display: inline-block; vertical-align: middle;"
      alt="BAGEL Codebase"
    />
  </a>
  <a href="https://demo.bagel-ai.org/">
    <img
      src="https://img.shields.io/badge/BAGEL-Demo-blue?logo=googleplay&logoColor=white" style="display: inline-block; vertical-align: middle;"
      alt="BAGEL Demo"
    />
  </a>
  <a href="https://discord.com/invite/Z836xxzy">
    <img
      src="https://img.shields.io/badge/BAGEL-Discord-green?logo=discord&logoColor=white" style="display: inline-block; vertical-align: middle;"
      alt="BAGEL Discord"
    />
  </a>
</p>

> We present **BAGEL**, an open‑source multimodal foundation model with 7B active parameters (14B total) trained on large‑scale interleaved multimodal data. BAGEL outperforms current top‑tier open‑source VLMs such as Qwen2.5-VL and InternVL-2.5 on standard multimodal understanding leaderboards, and delivers text‑to‑image quality that is competitive with strong specialist generators such as SD3.
> Moreover, BAGEL demonstrates superior qualitative results in classical image‑editing scenarios compared with the leading open-source models. More importantly, it extends to free-form visual manipulation, multiview synthesis, and world navigation, capabilities that constitute "world-modeling" tasks beyond the scope of previous image-editing models.

This repository hosts the model weights for **BAGEL**. For installation, usage instructions, and further documentation, please visit our [GitHub repository](https://github.com/bytedance-seed/BAGEL).
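If you just need the files locally, the snippet below is a minimal sketch for fetching them; it assumes `huggingface_hub` is installed and uses the upstream repo id `ByteDance-Seed/BAGEL-7B-MoT` (the same id that appears in the bundled `cache/` path).

```python
# Minimal sketch: download the repository contents with huggingface_hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="ByteDance-Seed/BAGEL-7B-MoT")
print(local_dir)  # contains ema.safetensors, ae.safetensors, the configs, tokenizer files, ...
```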

<p align="left"><img src="https://github.com/ByteDance-Seed/Bagel/raw/main/assets/teaser.webp" width="80%"></p>

## 🧠 Method

BAGEL adopts a Mixture-of-Transformer-Experts (MoT) architecture to maximize the model’s capacity to learn from richly diverse multimodal information. Following the same principle of capacity maximization, it utilizes two separate encoders to capture pixel-level and semantic-level features of an image. The overall framework follows a Next Group of Token Prediction paradigm, where the model is trained to predict the next group of language or visual tokens as a compression target.

BAGEL scales MoT’s capacity through Pre-training, Continued Training, and Supervised Finetuning on trillions of interleaved multimodal tokens spanning language, image, video, and web data. It surpasses open models on standard understanding and generation benchmarks and demonstrates advanced in-context multimodal abilities like free-form image editing, future frame prediction, 3D manipulation, world navigation, and sequential reasoning.

<p align="left"><img src="https://github.com/ByteDance-Seed/Bagel/raw/main/assets/arch.png" width="50%"></p>
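To make the routing idea concrete, here is a deliberately simplified, illustrative sketch of one MoT-style block: shared self-attention over the interleaved sequence, with each token dispatched to one of two expert FFNs by modality. This is not the BAGEL implementation (layer sizes, routing, and normalization are placeholders); see the GitHub repository for the actual code.

```python
# Illustrative only: a toy MoT-style block, not BAGEL's actual implementation.
import torch
import torch.nn as nn

class ToyMoTBlock(nn.Module):
    def __init__(self, hidden_size: int = 3584, num_heads: int = 28):
        super().__init__()
        # Attention is shared by all tokens in the interleaved sequence.
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

        # Two expert FFNs: one for understanding-side tokens (text / ViT features),
        # one for generation-side tokens (VAE latents). Sizes are placeholders.
        def ffn() -> nn.Module:
            return nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.SiLU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )

        self.ffn_und, self.ffn_gen = ffn(), ffn()

    def forward(self, x: torch.Tensor, is_gen_token: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden); is_gen_token: (batch, seq) boolean mask.
        attn_out, _ = self.attn(x, x, x)
        h = x + attn_out
        # Hard routing by token type (both experts are evaluated here for brevity).
        expert_out = torch.where(is_gen_token.unsqueeze(-1), self.ffn_gen(h), self.ffn_und(h))
        return h + expert_out
```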

## 🌱 Emerging Properties

<p align="left"><img src="https://github.com/ByteDance-Seed/Bagel/raw/main/assets/emerging_curves.png" width="50%"></p>

As we scale up BAGEL’s pretraining with more multimodal tokens, we observe consistent performance gains across understanding, generation, and editing tasks. Different capabilities emerge at distinct training stages: multimodal understanding and generation appear early, followed by basic editing, while complex, intelligent editing emerges later. This staged progression suggests an emergent pattern in which advanced multimodal reasoning builds on well-formed foundational skills. Ablation studies further show that combining VAE and ViT features significantly improves intelligent editing, underscoring the importance of visual-semantic context for complex multimodal reasoning and supporting its role in the emergence of advanced capabilities.

## 📊 Benchmarks

### 1. Visual Understanding

| Model         | MME ↑    | MMBench ↑ | MMMU ↑   | MM-Vet ↑ | MathVista ↑ |
| ------------- | -------: | --------: | -------: | -------: | ----------: |
| Janus-Pro-7B  | –        | 79.2      | 41.0     | 50.0     | –           |
| Qwen2.5-VL-7B | 2347     | 83.5      | **58.6** | 67.1     | 68.2        |
| **BAGEL**     | **2388** | **85.0**  | 55.3     | **67.2** | **73.1**    |

### 2. Text-to-Image Generation · GenEval

| Model        | Overall ↑ |
| ------------ | --------- |
| FLUX-1-dev   | 0.82      |
| SD3-Medium   | 0.74      |
| Janus-Pro-7B | 0.80      |
| **BAGEL**    | **0.88**  |

### 3. Image Editing

| Model         | GEdit-Bench-EN (SC) ↑ | GEdit-Bench-EN (PQ) ↑ | GEdit-Bench-EN (O) ↑ | IntelligentBench ↑ |
| ------------- | --------------------- | --------------------- | -------------------- | ------------------ |
| Step1X-Edit   | 7.09                  | 6.76                  | **6.70**             | 14.9               |
| Gemini-2-exp. | 6.73                  | 6.61                  | 6.32                 | **57.6**           |
| **BAGEL**     | **7.36**              | **6.83**              | 6.52                 | 44.0               |
| **BAGEL+CoT** | –                     | –                     | –                    | 55.3               |

## License

BAGEL is licensed under the Apache 2.0 license. It is finetuned from [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) and the [siglip-so400m-14-384-flash-attn2](https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2) model, and uses the [FLUX.1-schnell VAE](https://huggingface.co/black-forest-labs/FLUX.1-schnell), all under Apache 2.0.

## ✍️ Citation

```bibtex
@article{deng2025bagel,
  title   = {Emerging Properties in Unified Multimodal Pretraining},
  author  = {Deng, Chaorui and Zhu, Deyao and Li, Kunchang and Gou, Chenhui and Li, Feng and Wang, Zeyu and Zhong, Shu and Yu, Weihao and Nie, Xiaonan and Song, Ziang and Shi, Guang and Fan, Haoqi},
  journal = {arXiv preprint arXiv:2505.14683},
  year    = {2025}
}
```
ae.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:afc8e28272cd15db3919bacdb6918ce9c1ed22e96cb12c4d5ed0fba823529e38
size 335304388
cache/models--ByteDance-Seed--BAGEL-7B-MoT/refs/main
ADDED
@@ -0,0 +1 @@
570026eca23479ee7df5a6ce9fb50a835530da30
config.json
ADDED
@@ -0,0 +1,5 @@
{
  "name": [
    "BAGEL-7B-MoT"
  ]
}
convert_ema_to_standard.py
ADDED
@@ -0,0 +1,127 @@
#!/usr/bin/env python3
"""
Merge BAGEL EMA checkpoint into a standard inference checkpoint.

The repository ships two shards:

* ``ema.safetensors`` – EMA weights for the Mixture-of-Transformer stack,
  connector and ViT encoder described by ``llm_config.json`` / ``vit_config.json``.
* ``ae.safetensors`` – VAE weights referenced by ``model.safetensors.index.json``.

This script combines the two into a single ``model`` checkpoint that can be used in
place of the EMA file. By default the script keeps the source files untouched and
writes a new ``model_from_ema.safetensors`` plus, optionally, an accompanying index.
"""

from __future__ import annotations

import argparse
import json
from collections import OrderedDict
from pathlib import Path
from typing import Dict

import torch


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Convert BAGEL EMA weights into a regular inference checkpoint."
    )
    parser.add_argument(
        "--ema",
        type=Path,
        default=Path("ema.safetensors"),
        help="Path to the EMA weights file (default: ema.safetensors).",
    )
    parser.add_argument(
        "--ae",
        type=Path,
        default=Path("ae.safetensors"),
        help="Path to the VAE weights file (default: ae.safetensors).",
    )
    parser.add_argument(
        "--output",
        type=Path,
        default=Path("model_from_ema.safetensors"),
        help="Destination for the merged checkpoint.",
    )
    parser.add_argument(
        "--index",
        type=Path,
        default=None,
        help="Optional path for a Hugging Face style index JSON file.",
    )
    return parser.parse_args()


def load_safetensors(path: Path) -> Dict[str, torch.Tensor]:
    try:
        from safetensors.torch import load_file
    except ImportError as exc:  # pragma: no cover - raises early when dependency missing
        raise RuntimeError(
            "safetensors is required. Install it with `pip install safetensors`."
        ) from exc

    tensors = load_file(str(path))
    if not tensors:
        raise ValueError(f"{path} does not contain any tensors.")
    return tensors


def save_safetensors(
    tensors: Dict[str, torch.Tensor], path: Path, *, metadata: Dict[str, str]
) -> None:
    try:
        from safetensors.torch import save_file
    except ImportError as exc:  # pragma: no cover - raises early when dependency missing
        raise RuntimeError(
            "safetensors is required. Install it with `pip install safetensors`."
        ) from exc

    save_file(tensors, str(path), metadata=metadata)


def compute_total_size_bytes(tensors: Dict[str, torch.Tensor]) -> int:
    total = 0
    for tensor in tensors.values():
        total += tensor.element_size() * tensor.nelement()
    return total


def main() -> None:
    args = parse_args()

    if not args.ema.is_file():
        raise FileNotFoundError(f"EMA weights not found: {args.ema}")
    if not args.ae.is_file():
        raise FileNotFoundError(f"VAE weights not found: {args.ae}")

    ema_state = load_safetensors(args.ema)
    ae_state = load_safetensors(args.ae)

    overlap = set(ae_state.keys()) & set(ema_state.keys())
    if overlap:
        raise ValueError(
            f"Found {len(overlap)} overlapping parameter names between ae and ema files; "
            "please inspect your checkpoints before merging."
        )

    merged = OrderedDict()
    merged.update(sorted(ae_state.items()))
    merged.update(sorted(ema_state.items()))

    total_size = compute_total_size_bytes(merged)
    metadata = {"total_size": str(total_size)}
    save_safetensors(merged, args.output, metadata=metadata)

    if args.index:
        weight_map = {key: args.output.name for key in merged.keys()}
        index_payload = {
            "metadata": {"total_size": total_size},
            "weight_map": weight_map,
        }
        args.index.write_text(json.dumps(index_payload, indent=4, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    main()
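For reference, a typical invocation of the script above, with the flag names taken directly from its argparse definition (the index filename is illustrative):

```bash
python convert_ema_to_standard.py \
    --ema ema.safetensors \
    --ae ae.safetensors \
    --output model_from_ema.safetensors \
    --index model_from_ema.safetensors.index.json
```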
ema.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0b41c43835fd737b8c948e604870da522c091dcf151f3e8d55f84781765ee1a3
size 29214685336
generation_config.json
ADDED
@@ -0,0 +1,14 @@
{
  "bos_token_id": 151643,
  "pad_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "repetition_penalty": 1.05,
  "temperature": 0.7,
  "top_p": 0.8,
  "top_k": 20,
  "transformers_version": "4.37.0"
}
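The sampling defaults above follow Qwen2.5-style decoding (nucleus plus top-k sampling with a mild repetition penalty). As a minimal sketch, assuming the `transformers` library is installed, the file can be wrapped in a `GenerationConfig`:

```python
# Minimal sketch: wrap generation_config.json in a transformers GenerationConfig.
import json
from transformers import GenerationConfig

with open("generation_config.json") as f:
    gen_cfg = GenerationConfig(**json.load(f))

print(gen_cfg.temperature, gen_cfg.top_p, gen_cfg.top_k)  # 0.7 0.8 20
```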
llm_config.json
ADDED
@@ -0,0 +1,27 @@
{
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 131072,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.1",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}
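`llm_config.json` mirrors the Qwen2.5-7B-Instruct backbone configuration (28 layers, hidden size 3584, grouped-query attention with 4 KV heads). A minimal sketch for inspecting it, assuming a `transformers` version with Qwen2 support:

```python
# Minimal sketch: load llm_config.json as a transformers Qwen2Config.
import json
from transformers import Qwen2Config

with open("llm_config.json") as f:
    llm_cfg = Qwen2Config(**json.load(f))

print(llm_cfg.num_hidden_layers, llm_cfg.hidden_size, llm_cfg.num_key_value_heads)  # 28 3584 4
```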
merges.txt
ADDED
The diff for this file is too large to render.
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8dd47a686ee248579b984a558f2640aad3d54f6f6d83e5d25197738a7c34e015
size 29549989332
model.safetensors.index.json
ADDED
The diff for this file is too large to render.
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1,207 @@
{
  "add_bos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "151643": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151644": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151645": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151646": {
      "content": "<|object_ref_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151647": {
      "content": "<|object_ref_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151648": {
      "content": "<|box_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151649": {
      "content": "<|box_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151650": {
      "content": "<|quad_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151651": {
      "content": "<|quad_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151652": {
      "content": "<|vision_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151653": {
      "content": "<|vision_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151654": {
      "content": "<|vision_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151655": {
      "content": "<|image_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151656": {
      "content": "<|video_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151657": {
      "content": "<tool_call>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151658": {
      "content": "</tool_call>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151659": {
      "content": "<|fim_prefix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151660": {
      "content": "<|fim_middle|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151661": {
      "content": "<|fim_suffix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151662": {
      "content": "<|fim_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151663": {
      "content": "<|repo_name|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151664": {
      "content": "<|file_sep|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    }
  },
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>"
  ],
  "bos_token": null,
  "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- message.content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "errors": "replace",
  "model_max_length": 131072,
  "pad_token": "<|endoftext|>",
  "split_special_tokens": false,
  "tokenizer_class": "Qwen2Tokenizer",
  "unk_token": null
}
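The configuration above bundles the standard Qwen2 chat template, which wraps each turn in `<|im_start|>` / `<|im_end|>`. A minimal sketch, assuming the tokenizer files from this repo sit in the current directory and load with `transformers`' `AutoTokenizer`:

```python
# Minimal sketch: render a prompt with the bundled Qwen2 chat template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(".")  # directory containing tokenizer.json / tokenizer_config.json
messages = [{"role": "user", "content": "Describe this image."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # ends with "<|im_start|>assistant\n"
```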
vit_config.json
ADDED
@@ -0,0 +1,9 @@
{
  "hidden_size": 1152,
  "image_size": 980,
  "intermediate_size": 4304,
  "model_type": "siglip_vision_model",
  "num_attention_heads": 16,
  "num_hidden_layers": 27,
  "patch_size": 14
}
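With these SigLIP vision settings, an input image is resized to 980×980 and cut into 14×14 patches, i.e. 70 × 70 = 4,900 patch tokens of width 1152 (before any downstream token merging the model may apply). A quick check:

```python
# Quick check of the patch count implied by vit_config.json.
import json

with open("vit_config.json") as f:
    vit_cfg = json.load(f)

per_side = vit_cfg["image_size"] // vit_cfg["patch_size"]  # 980 // 14 = 70
print(per_side ** 2)  # 4900
```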
vocab.json
ADDED
The diff for this file is too large to render.