- 2025-05-19: Released Gemma-2-Llama Swallow 2B PT, 2B IT, 9B PT, 9B IT, 27B PT, 27B IT.
The Gemma-2-Llama Swallow series is a set of large language models based on the pretrained Gemma 2 models (2B, 9B, and 27B), further pretrained and instruction-tuned with a focus on Japanese language proficiency and knowledge.
- Gemma-2-Llama Swallow 2B PT v0.1: https://huggingface.co/tokyotech-llm/Gemma-2-Llama-Swallow-2b-pt-v0.1
- Gemma-2-Llama Swallow 2B IT v0.1: https://huggingface.co/tokyotech-llm/Gemma-2-Llama-Swallow-2b-it-v0.1
- Gemma-2-Llama Swallow 9B PT v0.1: https://huggingface.co/tokyotech-llm/Gemma-2-Llama-Swallow-9b-pt-v0.1
- Gemma-2-Llama Swallow 9B IT v0.1: https://huggingface.co/tokyotech-llm/Gemma-2-Llama-Swallow-9b-it-v0.1
- Gemma-2-Llama Swallow 27B PT v0.1: https://huggingface.co/tokyotech-llm/Gemma-2-Llama-Swallow-27b-pt-v0.1
- Gemma-2-Llama Swallow 27B IT v0.1: https://huggingface.co/tokyotech-llm/Gemma-2-Llama-Swallow-27b-it-v0.1
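The released checkpoints can be loaded with Hugging Face Transformers in the usual way. Below is a minimal loading sketch using the 2B IT model from the list above; it assumes the IT models follow the standard Gemma 2 chat format, and the prompt and generation settings are illustrative only.

```python
# Minimal loading sketch (standard Transformers usage; prompt and generation
# settings are illustrative). The model ID is taken from the list above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tokyotech-llm/Gemma-2-Llama-Swallow-2b-it-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The IT models are Gemma 2-based, so apply_chat_template builds the chat prompt.
messages = [{"role": "user", "content": "日本の首都はどこですか？"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```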
Although these models are continual-pretraining versions of Gemma 2, they also inherit the license of Meta’s Llama 3.3, because Llama 3.3 70B Instruct was used to synthesize the coding portion of the training data. The models can be used for research or commercial purposes as long as the use does not violate the Gemma Terms of Use and complies with the Llama 3.3 License.
As announced at the Gemma Developer Day in 2024, the Institute of Science Tokyo and Google have been collaborating closely on the development of open models in Japan. A key part of that partnership is the TPU Research Cloud program, through which Google provided the computing resources leveraged in this research.
The performance of the Gemma-2-Llama Swallow series was evaluated using the tasks on the Swallow Leaderboard (10 Japanese understanding/generation tasks, 10 English understanding/generation tasks, and Japanese MT-Bench). Key findings are as follows:
- Gemma-2-Llama Swallow 2B/9B/27B PT v0.1 achieved the highest performance among evaluated LLMs of the same scale on Japanese understanding and generation tasks.
- Notably, Gemma-2-Llama Swallow 9B/27B PT v0.1 performed on par with LLMs one tier larger in scale for Japanese tasks (Gemma-2-Llama Swallow 9B PT v0.1 matched the performance of Gemma 2 27B; Gemma-2-Llama Swallow 27B PT v0.1 matched that of Llama 3.1 Swallow 70B v0.1).
- Gemma-2-Llama Swallow 2B/9B IT v0.1 also achieved the highest performance among evaluated LLMs of the same scale on both Japanese understanding/generation tasks and Japanese MT-Bench.
Note that the graphs on this page are dynamically generated based on the Swallow Leaderboard.
Pre-trained 2B model
Gemma-2-Llama Swallow 2B PT v0.1 is compared against tiny pretrained models (without post-training) ranging from 1B to 5B parameters.
The models in the graph are ordered by their average score on Japanese understanding and generation tasks.
Through continual pretraining from Gemma 2 2B, the average score of Gemma-2-Llama Swallow 2B PT v0.1 on Japanese understanding and generation tasks improved from 0.348 to 0.421 (+7.3 points).
Conversely, the average score on English understanding and generation tasks decreased slightly from 0.439 to 0.426 (−1.3 points).
However, given the substantial improvement in Japanese task performance, this small drop in English is an acceptable trade-off: with such a limited parameter budget, it is difficult to preserve the original performance across every capability.
Compared to Gemma 2 Baku 2B, another continual-pretraining model based on Gemma 2 2B, Gemma-2-Llama Swallow 2B PT v0.1 achieved higher average scores in both Japanese and English.
Next, we compare it with Gemma 3, the latest version of Gemma released by Google.
In terms of average score for Japanese understanding and generation, Gemma-2-Llama Swallow 2B PT v0.1 outperforms both Gemma 3 1B (0.223) and Gemma 3 4B (0.417).
It may seem surprising that it outperforms the newer and larger Gemma 3 4B, but it is important to note that Gemma 3 4B is a multimodal model that supports image input.
According to Gemma Team (2025), Gemma 3 4B has approximately 4.3B parameters, of which 0.4B are used for the image encoder (SigLIP), and its parameters are trained across both image and language data.
On the other hand, Gemma-2-Llama Swallow 2B PT v0.1 has approximately 2.6B parameters, all dedicated to language tasks.
Looking at the average score on English understanding and generation tasks, Gemma-2-Llama Swallow 2B PT v0.1 scores 0.426 compared to Gemma 3 4B’s 0.501, showing an advantage for Gemma 3 in English.
Finally, we compare with other models.
The top model in terms of average score on Japanese understanding and generation tasks is Qwen2.5 3B (actual size: 3.1B) with a score of 0.442, followed by Gemma-2-Llama Swallow 2B PT v0.1 (actual size: 2.6B) with 0.421.
For Gemma-2-Llama Swallow 2B PT v0.1, the 0.5B parameter disadvantage amounts to approximately 19% of the total parameters, so this performance gap is understandable.
Even when compared to Llama 3.2 1B and Llama 3.2 3B, Gemma-2-Llama Swallow 2B PT v0.1 achieves higher scores on Japanese tasks.
For Japanese processing with tiny 2B-class pretrained models, Gemma-2-Llama Swallow 2B PT v0.1 or Qwen2.5 3B would be a good choice.
Pre-trained 9B model
Gemma-2-Llama Swallow 9B PT v0.1 is compared against small-scale pretrained models (without post-training) ranging from 5B to 13B parameters. Through continual pretraining from Gemma 2 9B, the average score of Gemma-2-Llama Swallow 9B PT v0.1 on Japanese understanding and generation tasks improved from 0.500 to 0.558 (+5.8 points). This average Japanese score (0.558) significantly surpasses that of Llama 3.1 Swallow 8B v0.2 (0.499), Qwen2.5 7B (0.512), and Gemma 3 12B (0.518), making it the top performer in this class. Moreover, Gemma-2-Llama Swallow 9B PT v0.1 performs on par with medium-sized LLMs such as Gemma 2 27B (0.546) and Llama 3.1 70B (0.566), approaching the performance of significantly larger models despite being in the small-size category. While the average score on English understanding and generation tasks decreased slightly from 0.597 to 0.595 after continual pretraining, the drop was limited to just 0.2 points. These results indicate that Gemma-2-Llama Swallow 9B PT v0.1 is a strong candidate among compact Japanese language models.
Pre-trained 27B model
Gemma-2-Llama Swallow 27B PT v0.1 is compared against medium-scale pretrained models (without post-training) ranging from 13B to 100B parameters.
Through continual pretraining from Gemma 2 27B, the average score of Gemma-2-Llama Swallow 27B PT v0.1 on Japanese understanding and generation tasks improved from 0.546 to 0.594 (+4.8 points).
Unlike the 2B and 9B models, the average score on English understanding and generation tasks also increased, from 0.645 to 0.655 (+1.0 point).
Since there are relatively few LLMs around the 30B parameter scale, this graph includes 70B-class models with more than twice the parameters.
Despite this, the performance of Gemma-2-Llama Swallow 27B PT v0.1 is comparable to that of Llama 3.1 Swallow 70B v0.1 and Llama 3 Swallow 70B.
Post-trained 2B model
Gemma-2-Llama Swallow 2B IT v0.1 is compared against ultra-compact post-trained models ranging from 1B to 5B parameters.
The models in the graph are ordered by their average score on the Japanese MT-Bench.
Through continual pretraining from Gemma 2 2B and custom post-training (via imitation learning from Gemma 2 27B IT), the average scores for Gemma-2-Llama Swallow 2B IT v0.1 are: 0.424 (Japanese understanding/generation), 0.431 (English understanding/generation), and 0.597 (Japanese MT-Bench).
For comparison, the corresponding scores for Gemma 2 2B IT are: 0.392 (Japanese understanding/generation), 0.489 (English understanding/generation), and 0.569 (Japanese MT-Bench).
Thus, Japanese understanding/generation improved by 3.2 points, English understanding/generation dropped by 5.8 points, and Japanese MT-Bench improved by 2.8 points.
In the post-training of Gemma-2-Llama Swallow, emphasis was placed on Japanese dialogue performance, and no English data was included in the instruction tuning.
In contrast, in Gemma 2’s post-training, the English understanding/generation score improved from 0.439 to 0.489 (+5.0 points), suggesting that incorporating English in post-training may be necessary to avoid a drop in English task performance.
Compared to other models in the Gemma 2 2B family, such as Gemma 2 JPN and Gemma 2 Baku 2B IT, Gemma-2-Llama Swallow 2B IT v0.1 achieved higher performance on both Japanese understanding/generation and Japanese MT-Bench tasks. Furthermore, among models of 3B parameters or fewer, Gemma-2-Llama Swallow 2B IT v0.1 achieved the highest scores on both Japanese understanding/generation and Japanese MT-Bench tasks. That said, Gemma 3 4B IT—while having more parameters—greatly outperformed others on the Japanese MT-Bench with a score of 0.724.
Post-trained 9B model
Gemma-2-Llama Swallow 9B IT v0.1 is compared against small-scale post-trained models ranging from 7B to 13B parameters. Through continual pretraining from Gemma 2 9B and custom post-training (via imitation learning from Gemma 2 27B IT), the average scores for Gemma-2-Llama Swallow 9B IT v0.1 are: 0.546 (Japanese understanding/generation), 0.611 (English understanding/generation), and 0.749 (Japanese MT-Bench). For comparison, the scores for Gemma 2 9B IT are: 0.536 (Japanese understanding/generation), 0.649 (English understanding/generation), and 0.736 (Japanese MT-Bench). This translates to a +1.0 point gain in Japanese understanding/generation, a −3.8 point drop in English understanding/generation, and a +1.3 point gain in Japanese MT-Bench. Among models with 9B parameters or fewer, Gemma-2-Llama Swallow 9B IT v0.1 achieved the highest performance on both Japanese understanding/generation tasks and Japanese MT-Bench. When compared to models up to 13B in size, Gemma-2-Llama Swallow 9B IT v0.1 ranked second, following only Gemma 3 12B IT. Once again, the Japanese MT-Bench score of Gemma 3 12B IT (0.821) stands out as exceptionally high.
Post-trained 27B model
Finally, Gemma-2-Llama Swallow 27B IT v0.1 is compared against medium-scale post-trained models ranging from 13B to 100B parameters, as well as GPT-3.5, GPT-4o (gpt-4o-2024-08-06), and GPT-4o-mini (gpt-4o-mini-2024-07-18).
However, for the OpenAI GPT series, some English understanding/generation tasks could not be evaluated fairly, so the corresponding scores are marked as missing.
(For details, see the section “Evaluation Settings for OpenAI-based Models” in Issues Encountered During Evaluation.)
Through continued pretraining from Gemma 2 27B and custom post-training (via imitation learning from Gemma 2 27B IT), the average scores for Gemma-2-Llama Swallow 27B IT v0.1 are: 0.602 (Japanese understanding/generation), 0.687 (English understanding/generation), and 0.759 (Japanese MT-Bench).
Compared to Gemma 2 27B IT, this model shows improvement on Japanese understanding/generation tasks, but no gain was observed on English understanding/generation or Japanese MT-Bench tasks.
Gemma 3 27B IT achieved the highest score on Japanese MT-Bench within this category (which is remarkable, considering that a 27B multimodal base model outperformed GPT-4o).
The average score for Japanese understanding/generation by Gemma-2-Llama Swallow 27B IT v0.1 (0.602) ranks third, following GPT-4o (0.646) and Llama 3.3 Swallow 70B Instruct v0.4 (0.613).
Gemma-2-Llama Swallow 27B IT v0.1 stands out as a strong open LLM candidate with excellent performance in Japanese.
Gemma-2-Llama Swallow is constructed through the following steps:
- Gemma-2-Llama Swallow Pretrained Model (PT): Continual pretraining (Fujii et al., 2024) on Gemma 2 without vocabulary extension
- Gemma-2-Llama Swallow Instruction-Tuned Model (IT): Supervised fine-tuning (SFT) on the pretrained Gemma-2-Llama Swallow model
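As a rough illustration of this two-stage recipe, the sketch below reproduces stage 1 with the Hugging Face Trainer (stage 2 applies the same loop to chat-formatted instruction data). The actual training was done with MaxText on TPUs, as described later; the data file, sequence length, and hyperparameters here are placeholders rather than the values used for Gemma-2-Llama Swallow.

```python
# Illustrative sketch of continual pretraining: reuse the Gemma 2 weights and tokenizer
# unchanged (no vocabulary extension) and continue plain next-token prediction on the
# new corpus mix. The file name and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_id = "google/gemma-2-2b"
tokenizer = AutoTokenizer.from_pretrained(base_id)      # vocabulary reused as-is
model = AutoModelForCausalLM.from_pretrained(base_id)

corpus = load_dataset("json", data_files="pretrain_mix.jsonl", split="train")  # placeholder mix
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=4096),
    batched=True, remove_columns=corpus.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gemma2-swallow-pt", per_device_train_batch_size=1,
                           gradient_accumulation_steps=64, bf16=True, learning_rate=2e-5),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM loss
)
trainer.train()
# Stage 2 (IT) would run the same loop on instruction data rendered with
# tokenizer.apply_chat_template, i.e., supervised fine-tuning of the PT checkpoint.
```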
The training data composition largely follows that of Llama 3.3 Swallow 70B.
The corpus used for continued pretraining includes:
- Cosmopedia
- DCLM-baseline-1.0 (Li et al., 2024)
- FineMath-4+ (Allal et al., 2025)
- English Wikipedia
- Japanese Wikipedia
- Laboro ParaCorpus
- High-quality educational texts selected from Swallow Corpus Version 2:
  - Top 10% classified by a Wikipedia-based classifier from the Swallow Education Classifier
  - Top 10% classified by an LLM-based classifier from the Swallow Education Classifier
- Japanese QA-style synthetic text generated from high-value educational content
- Swallow Code: Filtered and LLM-refined subset of The Stack v2 (Lozhkov et al., 2024)
Note: Swallow Code v0.3 includes synthetic data generated by Llama 3.3 Swallow 70B.
As such, Gemma-2-Llama Swallow is considered a derivative of Llama 3.3, and its model name therefore includes “Llama”.
The corpus used for instruction tuning includes:
The following sections describe content specific to Gemma-2-Llama Swallow.
Continual Pretraining on Tensor Processing Units (TPU)
Continual pretraining and instruction tuning of Gemma-2-Llama Swallow were conducted on a TPU v6e cluster using MaxText.
Training was carried out using a sharding scheme equivalent to Fully Sharded Data Parallel (FSDP) stage 3.
By optimizing Vector Memory (VMEM) settings and combining asynchronous collective fusion via XLA/LIBTPU with communication-computation overlap, we achieved approximately 30% throughput improvement over conventional settings.
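To make the sharding scheme concrete, here is a toy JAX sketch of FSDP-style (ZeRO-3-like) sharding: weights and the batch are split along a single mesh axis, and XLA/GSPMD inserts the collectives, which optimizations such as asynchronous collective fusion then overlap with compute. This is not the MaxText configuration itself; the shapes and mesh layout are illustrative.

```python
# Toy FSDP-style sharding in JAX: each device holds a 1/N slice of the weight matrix,
# and the compiler inserts the all-gathers/reduce-scatters the scheme requires.
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(mesh_utils.create_device_mesh((jax.device_count(),)), ("fsdp",))

# Parameters sharded over rows of the weight matrix along the fsdp axis.
w = jax.device_put(jnp.zeros((4096, 4096), jnp.bfloat16), NamedSharding(mesh, P("fsdp", None)))
# The batch is sharded along the same axis, so each device works on its local slice.
x = jax.device_put(jnp.ones((64, 4096), jnp.bfloat16), NamedSharding(mesh, P("fsdp", None)))

@jax.jit
def forward(x, w):
    return x @ w            # GSPMD decides where to gather/scatter the shards

y = forward(x, w)
print(y.shape, y.sharding)
```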
Instead of pre-tokenization, we adopted an on-the-fly tokenization approach that streams ArrayRecord files directly.
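A minimal sketch of the idea follows: documents are tokenized as they stream in and packed into fixed-length sequences, so no pre-tokenized copy of the corpus is ever materialized. The record reader below is a hypothetical stand-in for the actual streaming loader, and the sequence length is illustrative.

```python
# On-the-fly tokenization with sequence packing (sketch). `iter_text_records` is a
# hypothetical reader standing in for the real streaming data source.
import json
from typing import Iterable, Iterator, List
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")

def iter_text_records(path: str) -> Iterator[str]:
    """Hypothetical reader: yields one document per line of a JSONL shard."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)["text"]

def pack_tokens(texts: Iterable[str], seq_len: int = 4096) -> Iterator[List[int]]:
    """Tokenize documents as they arrive and emit fixed-length training sequences."""
    buffer: List[int] = []
    for text in texts:
        buffer.extend(tokenizer(text)["input_ids"])
        while len(buffer) >= seq_len:       # emit full sequences, keep the remainder
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]

# Usage: sequences are produced lazily inside the input pipeline of the training loop.
# for seq in pack_tokens(iter_text_records("shard-00000.jsonl")):
#     train_step(seq)
```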
Checkpoints are transferred to Google Cloud Storage asynchronously using a dedicated background thread, preventing TPU idling during checkpoint saving.
Upon receiving a preemption notice, the most recent checkpoint is immediately saved, and training resumes promptly after instance restart.
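The pattern looks roughly like the sketch below, assuming checkpoints are written locally first and then pushed to Google Cloud Storage by a daemon thread with the google-cloud-storage client; the SIGTERM handler stands in for the actual preemption notice, and the bucket and paths are placeholders.

```python
# Asynchronous checkpoint upload (sketch): training enqueues a local file and moves on,
# while a background thread drains the queue and uploads to GCS. A signal handler
# (standing in for the preemption notice) forces one last save and flush.
import os
import queue
import signal
import threading
from google.cloud import storage

upload_queue: queue.Queue = queue.Queue()

def uploader(bucket_name: str, prefix: str) -> None:
    """Background thread: upload checkpoint files without blocking the TPU."""
    bucket = storage.Client().bucket(bucket_name)
    while True:
        local_path = upload_queue.get()
        bucket.blob(f"{prefix}/{os.path.basename(local_path)}").upload_from_filename(local_path)
        upload_queue.task_done()

threading.Thread(target=uploader, args=("my-ckpt-bucket", "gemma2-swallow"), daemon=True).start()

def save_checkpoint(state: bytes, step: int) -> None:
    path = f"/tmp/ckpt_{step:07d}.bin"
    with open(path, "wb") as f:             # placeholder for real model/optimizer serialization
        f.write(state)
    upload_queue.put(path)                  # returns immediately; training continues

def on_preemption(signum, frame) -> None:
    save_checkpoint(state=b"latest-state", step=9999999)  # flush the most recent state
    upload_queue.join()                     # wait until the upload has reached GCS

signal.signal(signal.SIGTERM, on_preemption)
```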
References
- Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra and Thomas Wolf. 2025. SmolLM2: When Smol Goes Big – Data-Centric Training of a Small Language Model. arXiv:2502.02737.
- Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, and Naoaki Okazaki. Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities. In Proceedings of the First Conference on Language Modeling (COLM), October 2024.
- Gemma Team: Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Beyer, Xiaohai Zhai, Anton Tsitsulin, Robert Busa-Fekete, Alex Feng, Noveen Sachdeva, Benjamin Coleman, Yi Gao, Basil Mustafa, Iain Barr, Emilio Parisotto, David Tian, Matan Eyal, Colin Cherry, Jan-Thorsten Peter, Danila Sinopalnikov, Surya Bhupatiraju, Rishabh Agarwal, Mehran Kazemi, Dan Malkin, Ravin Kumar, David Vilar, Idan Brusilovsky, Jiaming Luo, Andreas Steiner, Abe Friesen, Abhanshu Sharma, Abheesht Sharma, Adi Mayrav Gilady, Adrian Goedeckemeyer, Alaa Saade, Alex Feng, Alexander Kolesnikov, Alexei Bendebury, Alvin Abdagic, Amit Vadi, András György, André Susano Pinto, Anil Das, Ankur Bapna, Antoine Miech, Antoine Yang, Antonia Paterson, Ashish Shenoy, Ayan Chakrabarti, Bilal Piot, Bo Wu, Bobak Shahriari, Bryce Petrini, Charlie Chen, Charline Le Lan, Christopher A. Choquette-Choo, CJ Carey, Cormac Brick, Daniel Deutsch, Danielle Eisenbud, Dee Cattle, Derek Cheng, Dimitris Paparas, Divyashree Shivakumar Sreepathihalli, Doug Reid, Dustin Tran, Dustin Zelle, Eric Noland, Erwin Huizenga, Eugene Kharitonov, Frederick Liu, Gagik Amirkhanyan, Glenn Cameron, Hadi Hashemi, Hanna Klimczak-Plucińska, Harman Singh, Harsh Mehta, Harshal Tushar Lehri, Hussein Hazimeh, Ian Ballantyne, Idan Szpektor, Ivan Nardini et al. 2025. Gemma 3 Technical Report. arXiv:2503.19786.
- Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Yitzhak Gadre, Hritik Bansal, Etash Kumar Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardner, Maciej Kilian, Hanlin Zhang, Rulin Shao, Sarah M. Pratt, Sunny Sanyal, Gabriel Ilharco, Giannis Daras, Kalyani Marathe, Aaron Gokaslan, Jieyu Zhang, Khyathi Raghavi Chandu, Thao Nguyen, Igor Vasiljevic, Sham M. Kakade, Shuran Song, Sujay Sanghavi, Fartash Faghri, Sewoong Oh, Luke Zettlemoyer, Kyle Lo, Alaaeldin El-Nouby, Hadi Pouransari, Alexander Toshev, Stephanie Wang, Dirk Groeneveld, Luca Soldaini, Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alexandros G. Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt and Vaishaal Shankar. 2024. DataComp-LM: In search of the next generation of training sets for language models. arXiv:2406.11794.
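- Anton Lozhkov, Raymond Li, Loubna Ben Allal et al. 2024. StarCoder 2 and The Stack v2: The Next Generation. arXiv:2402.19173.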
The research and development of the large language model Swallow has been supported by the AIST project “Research and Development on Generative AI Foundation Models in the Physical Domain” and by the New Energy and Industrial Technology Development Organization (NEDO) project “Core Integrated Technology Development for Next-Generation Artificial Intelligence and Robotics” (JPNP18002), specifically its theme “Development of AI Application Technology for Decision Support in Design Risk Assessment Based on Expert Perspectives.” It is also supported by a Ministry of Education, Culture, Sports, Science, and Technology (MEXT) project aimed at the “establishment of research and development centers to ensure the transparency and reliability of generative AI models,” along with other contributions. We received computational support for model training from Google through the TPU Research Cloud (TRC).