---
license: llama3.1
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
- generated_from_trainer
model-index:
- name: outputs/ablation-88
  results: []
datasets:
- princeton-nlp/gemma2-ultrafeedback-armorm
---

**UPDATE:** We have since released the full [Shisa V2](https://huggingface.co/collections/shisa-ai/shisa-v2-67fc98ecaf940ad6c49f5689) family of models. See our announcement at [https://shisa.ai/posts/shisa-v2/](https://shisa.ai/posts/shisa-v2/)

---

*Per the Llama 3.1 Community License Agreement, the official name of this model is "Llama 3.1 shisa-v2-llama3.1-8b-preview"*

# shisa-v2-llama-3.1-8b-preview

This is a preview release of our Shisa V2 bilingual Japanese and English (JA/EN) model. It is a fine-tune of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and inherits its tokenizer ([JA efficiency](https://github.com/shisa-ai/shisa-v2/blob/main/eval/tokenizer-efficiency/tokenizer-eval-ja.md)) and context length (128K).

While we're still working hard on this model (including integrating additional datasets and applying several more post-training stages), it already shows a significant performance leap over our previously published models, beating our Llama 3 70B tune from last year almost across the board on our evals.

We're releasing this WIP preview to celebrate since we also just noticed that [shisa-gamma-7b-v1 hit 1 million downloads](https://shisa.ai/posts/shisa-gamma-7b-v1-1-million-downloads/)! 🥳 (OK, it's not [1 billion](https://about.fb.com/news/2025/03/celebrating-1-billion-downloads-llama/), but it's still nothing to sneeze at!)
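Below is a minimal inference sketch using 🤗 Transformers. The repository id matches the eval tables below, while the dtype, prompt, and sampling settings are illustrative assumptions rather than official recommendations:

```python
# Minimal sketch: chat inference with transformers. Generation settings below
# are illustrative assumptions, not tuned recommendations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "shisa-ai/shisa-v2-llama3.1-8b-preview"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumed dtype
    device_map="auto",
)

# The chat template is inherited from meta-llama/Llama-3.1-8B-Instruct.
messages = [{"role": "user", "content": "日本の有名な観光地を3つ教えてください。"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```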
## Evals

| Release Date | Model Name | Overall Avg | JA Avg | EN Avg | Shaberi Avg | ELYZA 100 | JA MT Bench | Rakuda | Tengu | shisa-jp-ifeval | llm-jp-eval | shisa-jp-rp-bench | MixEval | LiveBench | IFEval | EvalPlus |
|--------------|------------|-------------|--------|--------|-------------|-----------|-------------|--------|-------|-----------------|-------------|-------------------|---------|-----------|--------|----------|
| 2025-03 | shisa-ai/shisa-v2-llama3.1-8b-preview | **62.46** | **70.35** | **54.58** | **7.18** | **7.55** | **8.57** | **9.03** | **7.33** | 0.19 | **0.56** | **4.67** | 0.46 | **32.00** | **0.79** | 0.62 |
| 2024-05 | shisa-ai/shisa-v1-llama3-70b | 60.70 | 68.45 | 52.95 | 6.82 | 7.33 | 8.06 | 8.88 | 6.65 | **0.24** | **0.56** | 4.51 | **0.56** | 27.40 | 0.65 | **0.63** |
| 2024-05 | shisa-ai/shisa-v1-llama3-8b | 50.09 | 57.37 | 42.80 | 6.30 | 6.46 | 7.54 | 8.03 | 6.41 | 0.09 | 0.23 | 4.26 | 0.36 | 20.20 | 0.63 | 0.52 |
| 2023-12 | augmxnt/shisa-gamma-7b-v1 | 37.37 | 53.94 | 20.80 | 5.49 | 5.74 | 5.93 | 7.28 | 5.87 | 0.13 | 0.52 | 3.22 | 0.26 | 2.20 | 0.37 | 0.18 |
| 2023-12 | augmxnt/shisa-7b-v1 | 29.54 | 36.80 | 22.28 | 3.51 | 4.03 | 3.66 | 3.25 | 5.11 | 0.15 | 0.46 | 1.82 | 0.20 | 16.50 | 0.27 | 0.26 |

Not only is this our best model yet; from our testing, it is also currently SOTA in overall JA/EN performance among all 7-8B class models:

| Release Date | Model Name | Overall Avg | JA Avg | EN Avg | Shaberi Avg | ELYZA 100 | JA MT Bench | Rakuda | Tengu | shisa-jp-ifeval | llm-jp-eval | shisa-jp-rp-bench | MixEval | LiveBench | IFEval | EvalPlus |
|--------------|------------|-------------|--------|--------|-------------|-----------|-------------|--------|-------|-----------------|-------------|-------------------|---------|-----------|--------|----------|
| 2025-03 | shisa-ai/shisa-v2-llama3.1-8b-preview | **62.46** | 70.35 | **54.58** | 7.18 | 7.55 | **8.57** | 9.03 | 7.33 | 0.19 | **0.56** | **4.67** | **0.46** | **32.00** | 0.79 | **0.62** |
| 2024-07 | AXCXEPT/Llama-3.1-8B-EZO-1.1-it | 60.46 | 66.97 | 53.95 | 6.98 | 7.57 | 8.26 | 8.61 | 7.28 | 0.22 | 0.39 | 4.53 | **0.46** | 30.40 | 0.77 | **0.62** |
| 2025-03 | allenai/Llama-3.1-Tulu-3.1-8B | 59.81 | 65.41 | 54.20 | 6.56 | 6.84 | 7.69 | 8.61 | 6.52 | 0.22 | 0.52 | 4.39 | 0.40 | 31.30 | **0.82** | **0.63** |
| 2024-07 | meta-llama/Llama-3.1-8B-Instruct | 56.78 | 59.66 | 53.90 | 6.48 | 6.95 | 7.67 | 8.36 | 6.40 | 0.16 | 0.25 | 4.15 | 0.44 | 27.70 | 0.80 | **0.63** |
| 2024-12 | tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.3 | 56.69 | **71.18** | 42.20 | 7.18 | **7.90** | 8.25 | 9.16 | 7.36 | **0.25** | **0.56** | 4.52 | 0.30 | 26.40 | 0.64 | 0.48 |
| 2025-02 | sbintuitions/sarashina2.2-3b-instruct-v0.1 | 53.90 | 68.70 | 39.10 | **7.28** | **7.91** | 8.46 | **9.28** | **7.43** | 0.22 | 0.42 | 4.30 | 0.28 | 21.20 | 0.50 | 0.57 |
| 2024-06 | elyza/Llama-3-ELYZA-JP-8B | 53.32 | 67.55 | 39.10 | 7.01 | 7.80 | 8.05 | 9.00 | 7.11 | 0.24 | 0.41 | 4.39 | 0.34 | 17.50 | 0.62 | 0.43 |
| 2024-08 | weblab-GENIAC/Tanuki-8B-dpo-v1.0 | 42.14 | 57.26 | 27.03 | 6.30 | 7.00 | 7.10 | 8.11 | 6.51 | 0.17 | 0.24 | 3.67 | 0.24 | 14.40 | 0.38 | 0.32 |
| 2025-02 | llm-jp/llm-jp-3-7.2b-instruct3 | 42.69 | 61.93 | 23.45 | 6.79 | 6.99 | 7.70 | 9.16 | 6.79 | 0.20 | 0.47 | 3.03 | 0.22 | 5.20 | 0.49 | 0.18 |

Testing notes:

- JA functional tests are done with the [shisa-ai/shaberi](https://github.com/shisa-ai/shaberi/) fork using a [PoLL](https://arxiv.org/abs/2404.18796) of [Tulu 3 405B FP8](https://huggingface.co/shisa-ai/Llama-3.1-Tulu-3-405B-FP8-Dynamic), [Llama 3.3 70B](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct), and [Athene-V2](https://huggingface.co/Nexusflow/Athene-V2-Chat) that was tested to be roughly comparable to `gpt-4-1106-preview` for scoring
- Gemma 2 9B models aren't tested at the moment because their lack of a system prompt breaks multiple evals
- Dynamic RoPE extension is used when necessary for testing models with a 4K context window
- Sarashina2.2-Instruct 3B is included since they [claim to achieve 7-8B class performance](https://www.sbintuitions.co.jp/blog/entry/2025/03/07/093143) and I was curious whether that panned out (it seems so; it kicks butt on the Shaberi functional tests)

## Data

Our final release will have full details, but this current model is largely based on the work done for [Shisa V1](https://huggingface.co/augmxnt/shisa-7b-v1), but refined, filtered, regenerated, annotated, rated, and selected. This is augmented by additional datasets focused on translation, multi-turn chat, role-play, and other real-world tasks. All synthetic data was regenerated from open-weight models.

This model currently has only a single DPO stage using a placeholder (but surprisingly good!) [EN preference set](https://huggingface.co/datasets/princeton-nlp/gemma2-ultrafeedback-armorm) and a custom RP DPO mix (a rough sketch of such a DPO stage is included at the end of this card).

## Credits

Trained by [Shisa.AI](https://shisa.ai/): [Leonard Lin](https://huggingface.co/leonardlin) and [Adam Lensenmayer](https://huggingface.co/NekoMikoReimu)

Compute sponsored by Ubitus K.K. and METI GENIAC.

Trained with [Axolotl](https://github.com/axolotl-ai-cloud/axolotl).

[Visualize in Weights & Biases](https://wandb.ai/augmxnt/shisa-v2/runs/o6fjtk4k)

### Framework versions

- TRL: 0.15.1
- Transformers: 4.49.0
- Pytorch: 2.6.0
- Datasets: 3.2.0
- Tokenizers: 0.21.0
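For reference, here is a minimal, hypothetical TRL-level sketch of a single DPO stage on the placeholder EN preference set mentioned in the Data section. The actual training was done with Axolotl; the hyperparameters and dataset column handling below are assumptions for illustration only, not the configuration used for this model:

```python
# Sketch of a single DPO stage with TRL's DPOTrainer on the placeholder EN
# preference set. Hyperparameters are assumed placeholders; the custom RP DPO
# mix mentioned above is not included here.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "meta-llama/Llama-3.1-8B-Instruct"  # starting point used here purely for illustration

tokenizer = AutoTokenizer.from_pretrained(base_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # DPOTrainer needs a pad token

model = AutoModelForCausalLM.from_pretrained(base_id)

# Keep only the preference columns DPOTrainer expects (prompt/chosen/rejected);
# assumes the dataset exposes them in a format TRL accepts.
dataset = load_dataset("princeton-nlp/gemma2-ultrafeedback-armorm", split="train")
dataset = dataset.select_columns(["prompt", "chosen", "rejected"])

training_args = DPOConfig(
    output_dir="outputs/dpo-sketch",
    beta=0.1,                        # assumed DPO temperature
    learning_rate=5e-7,              # assumed
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```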