| Model | Post-training (Ja) avg | Post-training (En) avg | MTB (Ja) avg | MTB (En) avg |
|---|---|---|---|---|
| ABEJA-QwQ32b-Reasoning-Japanese-v1.0 | 0.582 | 0.739 | 0.843 | 0.866 |
| Apertus-8B-Instruct | 0.221 | 0.269 | 0.576 | 0.628 |
| Apertus-70B-Instruct | 0.325 | 0.315 | 0.675 | 0.740 |
| CyberAgentLM3-22B-chat | 0.331 | 0.280 | 0.697 | 0.621 |
| DeepSeek-R1-Distill-Llama-8B | 0.346 | 0.549 | 0.526 | 0.704 |
| DeepSeek-R1-Distill-Llama-70B | 0.535 | 0.730 | 0.707 | 0.842 |
| DeepSeek-R1-Distill-Qwen-7B | 0.382 | 0.546 | 0.411 | 0.649 |
| DeepSeek-R1-Distill-Qwen-14B | 0.495 | 0.672 | 0.700 | 0.775 |
| DeepSeek-R1-Distill-Qwen-32B | 0.535 | 0.701 | 0.753 | 0.822 |
| DeepSeek-R1-Distill-Qwen-14B-Japanese | 0.442 | 0.629 | 0.771 | 0.835 |
| DeepSeek-R1-Distill-Qwen-32B-Japanese | 0.491 | 0.697 | 0.808 | 0.857 |
| ELYZA-Shortcut-1.0-Qwen-32B | 0.514 | 0.547 | 0.827 | 0.868 |
| ELYZA-Thinking-1.0-Qwen-32B | 0.426 | 0.571 | 0.694 | 0.748 |
| Flux-Japanese-Qwen2.5-32B-Instruct-V1.0 | 0.431 | 0.551 | 0.801 | 0.853 |
| Gemini 3 Pro Preview (gemini-3-pro-preview) | 0.733 | 0.927 | 0.906 | 0.907 |
| Gemma 2 2B IT | 0.231 | 0.256 | 0.555 | 0.718 |
| Gemma 2 9B IT | 0.366 | 0.392 | 0.743 | 0.761 |
| Gemma 2 27B IT | 0.410 | 0.441 | 0.762 | 0.800 |
| Gemma-2-Llama Swallow 2B IT | 0.240 | 0.184 | 0.583 | 0.584 |
| Gemma-2-Llama Swallow 9B IT | 0.354 | 0.323 | 0.729 | 0.734 |
| Gemma-2-Llama Swallow 27B IT | 0.421 | 0.393 | 0.759 | 0.771 |
| Gemma 3 1B IT | 0.145 | 0.201 | 0.434 | 0.578 |
| Gemma 3 4B IT | 0.350 | 0.405 | 0.735 | 0.793 |
| Gemma 3 12B IT | 0.474 | 0.524 | 0.811 | 0.860 |
| Gemma 3 27B IT | 0.522 | 0.571 | 0.830 | 0.880 |
| GPT-4.1 (gpt-4.1-2025-04-14) | 0.645 | 0.685 | 0.892 | 0.908 |
| GPT-4o (gpt-4o-2024-08-06) | 0.576 | 0.560 | 0.865 | 0.922 |
| GPT-5 (gpt-5-2025-08-07) | 0.703 | 0.875 | 0.882 | 0.888 |
| GPT-5 mini (gpt-5-mini-2025-08-07) | 0.667 | 0.831 | 0.898 | 0.902 |
| GPT-5.1 Thinking (gpt-5.1-2025-11-13) | 0.696 | 0.874 | 0.897 | 0.907 |
| gpt-oss-20b | 0.568 | 0.737 | 0.869 | 0.889 |
| gpt-oss-120b | 0.614 | 0.794 | 0.907 | 0.918 |
| GPT-OSS-Swallow-20B-RL-v0.1 | 0.606 | 0.788 | 0.872 | 0.846 |
| GPT-OSS-Swallow-120B-RL-v0.1 | 0.642 | 0.804 | 0.916 | 0.905 |
| GPT-OSS-Swallow-20B-SFT-v0.1 | 0.590 | 0.738 | 0.893 | 0.879 |
| GPT-OSS-Swallow-120B-SFT-v0.1 | 0.616 | 0.785 | 0.902 | 0.894 |
| Llama 3.1 8B Instruct | 0.317 | 0.387 | 0.592 | 0.737 |
| Llama-3.1-Nemotron-Nano-8B-v1 | 0.423 | 0.588 | 0.363 | 0.701 |
| Llama 3.1 Swallow 8B Instruct v0.3 | 0.326 | 0.291 | 0.709 | 0.691 |
| Llama 3.1 Swallow 8B Instruct v0.5 | 0.374 | 0.315 | 0.726 | 0.753 |
| Llama 3.3 70B Instruct | 0.491 | 0.545 | 0.735 | 0.863 |
| Llama-3.3-Nemotron-Super-49B-v1 | 0.555 | 0.711 | 0.806 | 0.881 |
| Llama 3.3 Swallow 70B Instruct v0.4 | 0.486 | 0.470 | 0.791 | 0.816 |
| Llama 4 Scout Instruct | 0.550 | 0.594 | 0.789 | 0.857 |
| llm-jp-3.1-1.8b-instruct4 | 0.217 | 0.178 | 0.657 | 0.548 |
| llm-jp-3.1-13b-instruct4 | 0.271 | 0.244 | 0.733 | 0.682 |
| MedGemma 27B IT | 0.389 | 0.495 | 0.778 | 0.830 |
| o3 (o3-2025-04-16) | 0.692 | 0.846 | 0.903 | 0.917 |
| o3-mini (o3-mini-2025-01-31) | 0.613 | 0.767 | 0.880 | 0.901 |
| Olmo 3 7B Think | 0.470 | 0.658 | 0.498 | 0.621 |
| Olmo 3 32B Think | 0.537 | 0.751 | 0.616 | 0.689 |
| Phi-4 | 0.507 | 0.547 | 0.822 | 0.881 |
| Phi-4-reasoning-plus | 0.310 | 0.469 | 0.374 | 0.426 |
| Qwen2.5-7B-Instruct | 0.411 | 0.454 | 0.688 | 0.797 |
| Qwen2.5-14B-Instruct | 0.471 | 0.514 | 0.799 | 0.865 |
| Qwen2.5-32B-Instruct | 0.504 | 0.543 | 0.819 | 0.869 |
| Qwen3-0.6B | 0.257 | 0.335 | 0.431 | 0.595 |
| Qwen3-1.7B | 0.428 | 0.531 | 0.662 | 0.779 |
| Qwen3-4B | 0.502 | 0.672 | 0.797 | 0.839 |
| Qwen3-8B | 0.542 | 0.715 | 0.845 | 0.851 |
| Qwen3-14B | 0.578 | 0.763 | 0.874 | 0.882 |
| Qwen3-32B | 0.588 | 0.768 | 0.875 | 0.892 |
| Qwen3-30B-A3B | 0.572 | 0.764 | 0.858 | 0.893 |
| Qwen3-235B-A22B-Instruct-2507 | 0.642 | 0.771 | 0.915 | 0.911 |
| Qwen3-235B-A22B-Thinking-2507 | 0.660 | 0.856 | 0.904 | 0.922 |
| Qwen3-Next-80B-A3B-Instruct | 0.614 | 0.802 | 0.916 | 0.920 |
| Qwen3-Next-80B-A3B-Thinking | 0.631 | 0.840 | 0.879 | 0.883 |
| Qwen3-Swallow-30B-A3B-CPT-v0.2 | 0.581 | 0.696 | 0.747 | 0.748 |
| Qwen3-Swallow-30B-A3B-RL-v0.2 | 0.591 | 0.732 | 0.889 | 0.866 |
| Qwen3-Swallow-30B-A3B-SFT-v0.2 | 0.574 | 0.704 | 0.887 | 0.882 |
| Qwen3-Swallow-8B-CPT-v0.2 | 0.515 | 0.601 | 0.719 | 0.683 |
| Qwen3-Swallow-32B-CPT-v0.2 | 0.600 | 0.731 | 0.788 | 0.766 |
| Qwen3-Swallow-8B-RL-v0.2 | 0.557 | 0.694 | 0.844 | 0.855 |
| Qwen3-Swallow-32B-RL-v0.2 | 0.609 | 0.792 | 0.894 | 0.877 |
| Qwen3-Swallow-8B-SFT-v0.2 | 0.534 | 0.638 | 0.868 | 0.855 |
| Qwen3-Swallow-32B-SFT-v0.2 | 0.592 | 0.729 | 0.879 | 0.890 |
| QwQ Bakeneko 32B | 0.546 | 0.718 | 0.879 | 0.871 |
| Sarashina2.2 3B Instruct v0.1 | 0.318 | 0.318 | 0.721 | 0.708 |
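For readers who want to slice these tables programmatically, here is a minimal Python sketch (my own illustration, not part of the published evaluation harness); the rows are copied from the summary table above, and the subset of models is arbitrary:

```python
# Minimal sketch: rank a few models from the summary table by their
# Japanese post-training average. Scores are on a 0-1 scale.
rows = [
    ("Gemini 3 Pro Preview", 0.733, 0.927),
    ("GPT-5 (gpt-5-2025-08-07)", 0.703, 0.875),
    ("o3 (o3-2025-04-16)", 0.692, 0.846),
    ("GPT-OSS-Swallow-120B-RL-v0.1", 0.642, 0.804),
    ("Qwen3-235B-A22B-Thinking-2507", 0.660, 0.856),
]

for name, ja_avg, en_avg in sorted(rows, key=lambda r: r[1], reverse=True):
    print(f"{name:32s}  Ja {ja_avg:.3f}  En {en_avg:.3f}")
```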
| Model | Post-training (Ja) avg | JamC-QA | En→Ja translation | Ja→En translation | MMLU-ProX (Ja) | GPQA (Ja) | MATH-100 (Ja) | JHumanEval | M-IFEval-Ja |
|---|---|---|---|---|---|---|---|---|---|
| ABEJA-QwQ32b-Reasoning-Japanese-v1.0 | 0.582 | 0.611 | 0.243 | 0.215 | 0.712 | 0.527 | 0.899 | 0.866 | 0.619 |
| Apertus-8B-Instruct | 0.221 | 0.366 | 0.010 | 0.214 | 0.221 | 0.214 | 0.172 | 0.353 | 0.487 |
| Apertus-70B-Instruct | 0.325 | 0.469 | 0.255 | 0.228 | 0.355 | 0.266 | 0.263 | 0.438 | 0.549 |
| CyberAgentLM3-22B-chat | 0.331 | 0.495 | 0.237 | 0.210 | 0.310 | 0.266 | 0.354 | 0.443 | 0.429 |
| DeepSeek-R1-Distill-Llama-8B | 0.346 | 0.279 | 0.096 | 0.144 | 0.319 | 0.310 | 0.556 | 0.721 | 0.319 |
| DeepSeek-R1-Distill-Llama-70B | 0.535 | 0.461 | 0.210 | 0.224 | 0.642 | 0.538 | 0.859 | 0.812 | 0.558 |
| DeepSeek-R1-Distill-Qwen-7B | 0.382 | 0.257 | 0.044 | 0.082 | 0.438 | 0.400 | 0.778 | 0.674 | 0.341 |
| DeepSeek-R1-Distill-Qwen-14B | 0.495 | 0.414 | 0.171 | 0.199 | 0.591 | 0.496 | 0.737 | 0.859 | 0.496 |
| DeepSeek-R1-Distill-Qwen-32B | 0.535 | 0.439 | 0.204 | 0.211 | 0.660 | 0.536 | 0.838 | 0.855 | 0.509 |
| DeepSeek-R1-Distill-Qwen-14B-Japanese | 0.442 | 0.393 | 0.191 | 0.178 | 0.525 | 0.400 | 0.788 | 0.620 | 0.513 |
| DeepSeek-R1-Distill-Qwen-32B-Japanese | 0.491 | 0.434 | 0.215 | 0.196 | 0.606 | 0.464 | 0.838 | 0.680 | 0.544 |
| ELYZA-Shortcut-1.0-Qwen-32B | 0.514 | 0.524 | 0.245 | 0.228 | 0.631 | 0.415 | 0.788 | 0.770 | 0.633 |
| ELYZA-Thinking-1.0-Qwen-32B | 0.426 | 0.495 | 0.238 | 0.223 | 0.623 | 0.455 | 0.788 | 0.162 | 0.566 |
| Flux-Japanese-Qwen2.5-32B-Instruct-V1.0 | 0.431 | 0.487 | 0.000 | 0.219 | 0.630 | 0.458 | 0.788 | 0.438 | 0.540 |
| Gemini 3 Pro Preview (gemini-3-pro-preview) | 0.733 | 0.937 | 0.290 | 0.268 | 0.873 | 0.844 | 0.949 | 0.970 | 0.792 |
| Gemma 2 2B IT | 0.231 | 0.280 | 0.147 | 0.165 | 0.214 | 0.248 | 0.202 | 0.359 | 0.416 |
| Gemma 2 9B IT | 0.366 | 0.385 | 0.223 | 0.227 | 0.423 | 0.277 | 0.444 | 0.583 | 0.558 |
| Gemma 2 27B IT | 0.410 | 0.418 | 0.247 | 0.236 | 0.462 | 0.304 | 0.505 | 0.700 | 0.588 |
| Gemma-2-Llama Swallow 2B IT | 0.240 | 0.372 | 0.210 | 0.142 | 0.190 | 0.259 | 0.263 | 0.241 | 0.363 |
| Gemma-2-Llama Swallow 9B IT | 0.354 | 0.464 | 0.251 | 0.218 | 0.372 | 0.283 | 0.374 | 0.518 | 0.540 |
| Gemma-2-Llama Swallow 27B IT | 0.421 | 0.531 | 0.267 | 0.247 | 0.452 | 0.333 | 0.465 | 0.656 | 0.540 |
| Gemma 3 1B IT | 0.145 | 0.249 | 0.004 | 0.083 | 0.148 | 0.248 | 0.172 | 0.112 | 0.323 |
| Gemma 3 4B IT | 0.350 | 0.285 | 0.186 | 0.189 | 0.335 | 0.246 | 0.606 | 0.604 | 0.473 |
| Gemma 3 12B IT | 0.474 | 0.401 | 0.225 | 0.229 | 0.527 | 0.373 | 0.798 | 0.763 | 0.619 |
| Gemma 3 27B IT | 0.522 | 0.488 | 0.250 | 0.238 | 0.609 | 0.417 | 0.859 | 0.796 | 0.597 |
| GPT-4.1 (gpt-4.1-2025-04-14) | 0.645 | 0.790 | 0.278 | 0.260 | 0.772 | 0.603 | 0.899 | 0.911 | 0.810 |
| GPT-4o (gpt-4o-2024-08-06) | 0.576 | 0.747 | 0.282 | 0.265 | 0.685 | 0.453 | 0.758 | 0.844 | 0.704 |
| GPT-5 (gpt-5-2025-08-07) | 0.703 | 0.858 | 0.272 | 0.236 | 0.849 | 0.786 | 0.980 | 0.943 | 0.907 |
| GPT-5 mini (gpt-5-mini-2025-08-07) | 0.667 | 0.701 | 0.267 | 0.239 | 0.805 | 0.750 | 0.960 | 0.945 | 0.827 |
| GPT-5.1 Thinking (gpt-5.1-2025-11-13) | 0.696 | 0.838 | 0.262 | 0.225 | 0.843 | 0.795 | 0.970 | 0.941 | 0.894 |
| gpt-oss-20b | 0.568 | 0.403 | 0.239 | 0.208 | 0.702 | 0.571 | 0.929 | 0.927 | 0.549 |
| gpt-oss-120b | 0.614 | 0.516 | 0.263 | 0.207 | 0.756 | 0.663 | 0.970 | 0.925 | 0.735 |
| GPT-OSS-Swallow-20B-RL-v0.1 | 0.606 | 0.533 | 0.251 | 0.213 | 0.722 | 0.629 | 0.970 | 0.924 | 0.553 |
| GPT-OSS-Swallow-120B-RL-v0.1 | 0.642 | 0.630 | 0.266 | 0.221 | 0.775 | 0.705 | 0.960 | 0.935 | 0.695 |
| GPT-OSS-Swallow-20B-SFT-v0.1 | 0.590 | 0.513 | 0.254 | 0.220 | 0.695 | 0.578 | 0.960 | 0.909 | 0.575 |
| GPT-OSS-Swallow-120B-SFT-v0.1 | 0.616 | 0.589 | 0.270 | 0.216 | 0.747 | 0.636 | 0.929 | 0.922 | 0.708 |
| Llama 3.1 8B Instruct | 0.317 | 0.310 | 0.187 | 0.194 | 0.306 | 0.261 | 0.384 | 0.580 | 0.381 |
| Llama-3.1-Nemotron-Nano-8B-v1 | 0.423 | 0.264 | 0.065 | 0.080 | 0.489 | 0.339 | 0.919 | 0.802 | 0.186 |
| Llama 3.1 Swallow 8B Instruct v0.3 | 0.326 | 0.414 | 0.249 | 0.223 | 0.306 | 0.239 | 0.364 | 0.488 | 0.491 |
| Llama 3.1 Swallow 8B Instruct v0.5 | 0.374 | 0.496 | 0.249 | 0.222 | 0.369 | 0.295 | 0.404 | 0.584 | 0.496 |
| Llama 3.3 70B Instruct | 0.491 | 0.484 | 0.245 | 0.246 | 0.607 | 0.453 | 0.646 | 0.752 | 0.650 |
| Llama-3.3-Nemotron-Super-49B-v1 | 0.555 | 0.446 | 0.215 | 0.185 | 0.687 | 0.531 | 0.919 | 0.900 | 0.558 |
| Llama 3.3 Swallow 70B Instruct v0.4 | 0.486 | 0.562 | 0.275 | 0.251 | 0.533 | 0.355 | 0.697 | 0.727 | 0.593 |
| Llama 4 Scout Instruct | 0.550 | 0.579 | 0.237 | 0.230 | 0.687 | 0.540 | 0.758 | 0.820 | 0.611 |
| llm-jp-3.1-1.8b-instruct4 | 0.217 | 0.348 | 0.000 | 0.159 | 0.195 | 0.239 | 0.212 | 0.365 | 0.288 |
| llm-jp-3.1-13b-instruct4 | 0.271 | 0.509 | 0.007 | 0.161 | 0.296 | 0.230 | 0.232 | 0.463 | 0.372 |
| MedGemma 27B IT | 0.389 | 0.456 | 0.253 | 0.240 | 0.606 | 0.350 | 0.818 | 0.001 | 0.624 |
| o3 (o3-2025-04-16) | 0.692 | 0.851 | 0.276 | 0.216 | 0.835 | 0.766 | 0.970 | 0.929 | 0.850 |
| o3-mini (o3-mini-2025-01-31) | 0.613 | 0.507 | 0.243 | 0.221 | 0.760 | 0.685 | 0.939 | 0.934 | 0.841 |
| Olmo 3 7B Think | 0.470 | 0.276 | 0.131 | 0.143 | 0.512 | 0.393 | 0.949 | 0.885 | 0.310 |
| Olmo 3 32B Think | 0.537 | 0.353 | 0.174 | 0.186 | 0.669 | 0.547 | 0.919 | 0.915 | 0.332 |
| Phi-4 | 0.507 | 0.444 | 0.240 | 0.221 | 0.638 | 0.435 | 0.798 | 0.770 | 0.438 |
| Phi-4-reasoning-plus | 0.310 | 0.000 | 0.001 | 0.001 | 0.118 | 0.563 | 0.737 | 0.751 | 0.221 |
| Qwen2.5-7B-Instruct | 0.411 | 0.365 | 0.189 | 0.184 | 0.452 | 0.315 | 0.636 | 0.737 | 0.504 |
| Qwen2.5-14B-Instruct | 0.471 | 0.444 | 0.207 | 0.218 | 0.556 | 0.348 | 0.768 | 0.754 | 0.606 |
| Qwen2.5-32B-Instruct | 0.504 | 0.476 | 0.221 | 0.224 | 0.623 | 0.411 | 0.768 | 0.803 | 0.673 |
| Qwen3-0.6B | 0.257 | 0.250 | 0.001 | 0.000 | 0.295 | 0.237 | 0.606 | 0.408 | 0.438 |
| Qwen3-1.7B | 0.428 | 0.278 | 0.130 | 0.156 | 0.514 | 0.315 | 0.859 | 0.747 | 0.460 |
| Qwen3-4B | 0.502 | 0.328 | 0.154 | 0.189 | 0.643 | 0.440 | 0.919 | 0.838 | 0.562 |
| Qwen3-8B | 0.542 | 0.398 | 0.209 | 0.202 | 0.696 | 0.491 | 0.929 | 0.869 | 0.575 |
| Qwen3-14B | 0.578 | 0.455 | 0.228 | 0.220 | 0.737 | 0.556 | 0.939 | 0.910 | 0.624 |
| Qwen3-32B | 0.588 | 0.479 | 0.226 | 0.220 | 0.746 | 0.571 | 0.949 | 0.923 | 0.681 |
| Qwen3-30B-A3B | 0.572 | 0.460 | 0.170 | 0.219 | 0.738 | 0.558 | 0.960 | 0.899 | 0.655 |
| Qwen3-235B-A22B-Instruct-2507 | 0.642 | 0.636 | 0.258 | 0.230 | 0.799 | 0.701 | 0.970 | 0.900 | 0.730 |
| Qwen3-235B-A22B-Thinking-2507 | 0.660 | 0.659 | 0.260 | 0.234 | 0.819 | 0.739 | 0.970 | 0.938 | 0.783 |
| Qwen3-Next-80B-A3B-Instruct | 0.614 | 0.599 | 0.240 | 0.228 | 0.770 | 0.614 | 0.939 | 0.905 | 0.681 |
| Qwen3-Next-80B-A3B-Thinking | 0.631 | 0.595 | 0.232 | 0.197 | 0.797 | 0.710 | 0.960 | 0.927 | 0.743 |
| Qwen3-Swallow-30B-A3B-CPT-v0.2 | 0.581 | 0.528 | 0.248 | 0.229 | 0.713 | 0.547 | 0.909 | 0.894 | 0.518 |
| Qwen3-Swallow-30B-A3B-RL-v0.2 | 0.591 | 0.496 | 0.243 | 0.213 | 0.729 | 0.596 | 0.949 | 0.909 | 0.531 |
| Qwen3-Swallow-30B-A3B-SFT-v0.2 | 0.574 | 0.504 | 0.242 | 0.219 | 0.711 | 0.558 | 0.899 | 0.888 | 0.553 |
| Qwen3-Swallow-8B-CPT-v0.2 | 0.515 | 0.449 | 0.238 | 0.215 | 0.628 | 0.404 | 0.838 | 0.833 | 0.456 |
| Qwen3-Swallow-32B-CPT-v0.2 | 0.600 | 0.551 | 0.255 | 0.235 | 0.735 | 0.576 | 0.939 | 0.910 | 0.549 |
| Qwen3-Swallow-8B-RL-v0.2 | 0.557 | 0.444 | 0.235 | 0.204 | 0.675 | 0.529 | 0.929 | 0.885 | 0.473 |
| Qwen3-Swallow-32B-RL-v0.2 | 0.609 | 0.534 | 0.252 | 0.219 | 0.754 | 0.607 | 0.970 | 0.930 | 0.584 |
| Qwen3-Swallow-8B-SFT-v0.2 | 0.534 | 0.439 | 0.231 | 0.205 | 0.653 | 0.446 | 0.929 | 0.830 | 0.451 |
| Qwen3-Swallow-32B-SFT-v0.2 | 0.592 | 0.493 | 0.251 | 0.216 | 0.723 | 0.594 | 0.960 | 0.904 | 0.540 |
| QwQ Bakeneko 32B | 0.546 | 0.495 | 0.232 | 0.207 | 0.677 | 0.487 | 0.859 | 0.867 | 0.584 |
| Sarashina2.2 3B Instruct v0.1 | 0.318 | 0.498 | 0.002 | 0.160 | 0.335 | 0.301 | 0.465 | 0.464 | 0.288 |
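One caveat when reading this table: the reported Japanese average does not reproduce as the plain mean of the eight benchmark columns shown, so the aggregation presumably weights tasks differently or draws on scores not displayed here. A quick check:

```python
# Quick sanity check (an observation about the data, not a documented formula):
# the plain mean of the eight Japanese benchmark columns does not reproduce
# the reported average. Example row: Apertus-8B-Instruct.
cols = [0.366, 0.010, 0.214, 0.221, 0.214, 0.172, 0.353, 0.487]
print(f"{sum(cols) / len(cols):.3f}")  # 0.255, while the table reports 0.221
```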
| Model | Post-training (En) avg | HellaSwag | MMLU-Pro (En) | GPQA (En) | MATH-500 (En) | AIME 24-25 | LCB |
|---|---|---|---|---|---|---|---|
| ABEJA-QwQ32b-Reasoning-Japanese-v1.0 | 0.739 | 0.906 | 0.780 | 0.606 | 0.964 | 0.617 | 0.563 |
| Apertus-8B-Instruct | 0.269 | 0.636 | 0.321 | 0.293 | 0.288 | 0.000 | 0.075 |
| Apertus-70B-Instruct | 0.315 | 0.745 | 0.422 | 0.253 | 0.386 | 0.000 | 0.084 |
| CyberAgentLM3-22B-chat | 0.280 | 0.770 | 0.260 | 0.288 | 0.298 | 0.017 | 0.045 |
| DeepSeek-R1-Distill-Llama-8B | 0.549 | 0.688 | 0.549 | 0.460 | 0.866 | 0.367 | 0.364 |
| DeepSeek-R1-Distill-Llama-70B | 0.730 | 0.891 | 0.776 | 0.626 | 0.936 | 0.617 | 0.534 |
| DeepSeek-R1-Distill-Qwen-7B | 0.546 | 0.564 | 0.547 | 0.495 | 0.902 | 0.417 | 0.351 |
| DeepSeek-R1-Distill-Qwen-14B | 0.672 | 0.841 | 0.707 | 0.525 | 0.908 | 0.567 | 0.486 |
| DeepSeek-R1-Distill-Qwen-32B | 0.701 | 0.885 | 0.737 | 0.571 | 0.926 | 0.567 | 0.523 |
| DeepSeek-R1-Distill-Qwen-14B-Japanese | 0.629 | 0.823 | 0.679 | 0.470 | 0.916 | 0.433 | 0.451 |
| DeepSeek-R1-Distill-Qwen-32B-Japanese | 0.697 | 0.872 | 0.737 | 0.576 | 0.940 | 0.550 | 0.508 |
| ELYZA-Shortcut-1.0-Qwen-32B | 0.547 | 0.897 | 0.684 | 0.460 | 0.830 | 0.150 | 0.263 |
| ELYZA-Thinking-1.0-Qwen-32B | 0.571 | 0.888 | 0.708 | 0.576 | 0.860 | 0.300 | 0.093 |
| Flux-Japanese-Qwen2.5-32B-Instruct-V1.0 | 0.551 | 0.929 | 0.677 | 0.510 | 0.838 | 0.150 | 0.203 |
| Gemini 3 Pro Preview (gemini-3-pro-preview) | 0.927 | 0.948 | 0.885 | 0.924 | 0.986 | 0.950 | 0.866 |
| Gemma 2 2B IT | 0.256 | 0.596 | 0.287 | 0.359 | 0.262 | 0.000 | 0.034 |
| Gemma 2 9B IT | 0.392 | 0.829 | 0.503 | 0.369 | 0.488 | 0.017 | 0.146 |
| Gemma 2 27B IT | 0.441 | 0.846 | 0.572 | 0.404 | 0.560 | 0.033 | 0.230 |
| Gemma-2-Llama Swallow 2B IT | 0.184 | 0.495 | 0.169 | 0.268 | 0.138 | 0.000 | 0.036 |
| Gemma-2-Llama Swallow 9B IT | 0.323 | 0.801 | 0.296 | 0.283 | 0.438 | 0.017 | 0.106 |
| Gemma-2-Llama Swallow 27B IT | 0.393 | 0.786 | 0.436 | 0.343 | 0.544 | 0.033 | 0.218 |
| Gemma 3 1B IT | 0.201 | 0.357 | 0.171 | 0.237 | 0.438 | 0.000 | 0.002 |
| Gemma 3 4B IT | 0.405 | 0.620 | 0.440 | 0.354 | 0.748 | 0.117 | 0.151 |
| Gemma 3 12B IT | 0.524 | 0.816 | 0.617 | 0.389 | 0.862 | 0.217 | 0.247 |
| Gemma 3 27B IT | 0.571 | 0.861 | 0.681 | 0.475 | 0.880 | 0.233 | 0.298 |
| GPT-4.1 (gpt-4.1-2025-04-14) | 0.685 | 0.940 | 0.813 | 0.667 | 0.906 | 0.400 | 0.387 |
| GPT-4o (gpt-4o-2024-08-06) | 0.560 | 0.930 | 0.749 | 0.556 | 0.792 | 0.083 | 0.250 |
| GPT-5 (gpt-5-2025-08-07) | 0.875 | 0.959 | 0.865 | 0.828 | 0.990 | 0.933 | 0.677 |
| GPT-5 mini (gpt-5-mini-2025-08-07) | 0.831 | 0.934 | 0.822 | 0.692 | 0.970 | 0.883 | 0.686 |
| GPT-5.1 Thinking (gpt-5.1-2025-11-13) | 0.874 | 0.954 | 0.741 | 0.838 | 0.984 | 0.933 | 0.796 |
| gpt-oss-20b | 0.737 | 0.847 | 0.741 | 0.636 | 0.944 | 0.617 | 0.635 |
| gpt-oss-120b | 0.794 | 0.878 | 0.790 | 0.727 | 0.966 | 0.733 | 0.670 |
| GPT-OSS-Swallow-20B-RL-v0.1 | 0.788 | 0.831 | 0.737 | 0.646 | 0.964 | 0.850 | 0.697 |
| GPT-OSS-Swallow-120B-RL-v0.1 | 0.804 | 0.855 | 0.732 | 0.641 | 0.988 | 0.883 | 0.723 |
| GPT-OSS-Swallow-20B-SFT-v0.1 | 0.738 | 0.818 | 0.733 | 0.636 | 0.948 | 0.683 | 0.610 |
| GPT-OSS-Swallow-120B-SFT-v0.1 | 0.785 | 0.878 | 0.779 | 0.712 | 0.966 | 0.717 | 0.655 |
| Llama 3.1 8B Instruct | 0.387 | 0.769 | 0.489 | 0.374 | 0.526 | 0.033 | 0.131 |
| Llama-3.1-Nemotron-Nano-8B-v1 | 0.588 | 0.518 | 0.566 | 0.470 | 0.948 | 0.550 | 0.478 |
| Llama 3.1 Swallow 8B Instruct v0.3 | 0.291 | 0.725 | 0.287 | 0.293 | 0.338 | 0.000 | 0.102 |
| Llama 3.1 Swallow 8B Instruct v0.5 | 0.315 | 0.648 | 0.399 | 0.318 | 0.452 | 0.000 | 0.072 |
| Llama 3.3 70B Instruct | 0.545 | 0.911 | 0.717 | 0.480 | 0.746 | 0.117 | 0.303 |
| Llama-3.3-Nemotron-Super-49B-v1 | 0.711 | 0.885 | 0.783 | 0.667 | 0.960 | 0.567 | 0.408 |
| Llama 3.3 Swallow 70B Instruct v0.4 | 0.470 | 0.884 | 0.570 | 0.409 | 0.642 | 0.083 | 0.232 |
| Llama 4 Scout Instruct | 0.594 | 0.891 | 0.744 | 0.606 | 0.834 | 0.183 | 0.309 |
| llm-jp-3.1-1.8b-instruct4 | 0.178 | 0.450 | 0.163 | 0.278 | 0.146 | 0.000 | 0.030 |
| llm-jp-3.1-13b-instruct4 | 0.244 | 0.717 | 0.252 | 0.227 | 0.188 | 0.000 | 0.082 |
| MedGemma 27B IT | 0.495 | 0.859 | 0.654 | 0.434 | 0.824 | 0.200 | 0.001 |
| o3 (o3-2025-04-16) | 0.846 | 0.956 | 0.857 | 0.818 | 0.978 | 0.817 | 0.649 |
| o3-mini (o3-mini-2025-01-31) | 0.767 | 0.869 | 0.792 | 0.747 | 0.958 | 0.733 | 0.503 |
| Olmo 3 7B Think | 0.658 | 0.703 | 0.627 | 0.490 | 0.936 | 0.683 | 0.510 |
| Olmo 3 32B Think | 0.751 | 0.851 | 0.753 | 0.591 | 0.962 | 0.750 | 0.597 |
| Phi-4 | 0.547 | 0.859 | 0.630 | 0.551 | 0.800 | 0.217 | 0.227 |
| Phi-4-reasoning-plus | 0.469 | 0.260 | 0.113 | 0.611 | 0.770 | 0.583 | 0.478 |
| Qwen2.5-7B-Instruct | 0.454 | 0.820 | 0.554 | 0.348 | 0.742 | 0.100 | 0.158 |
| Qwen2.5-14B-Instruct | 0.514 | 0.886 | 0.652 | 0.404 | 0.794 | 0.133 | 0.215 |
| Qwen2.5-32B-Instruct | 0.543 | 0.908 | 0.640 | 0.480 | 0.812 | 0.150 | 0.270 |
| Qwen3-0.6B | 0.335 | 0.425 | 0.338 | 0.283 | 0.694 | 0.133 | 0.135 |
| Qwen3-1.7B | 0.531 | 0.626 | 0.560 | 0.394 | 0.904 | 0.383 | 0.315 |
| Qwen3-4B | 0.672 | 0.790 | 0.690 | 0.515 | 0.938 | 0.600 | 0.499 |
| Qwen3-8B | 0.715 | 0.851 | 0.713 | 0.561 | 0.942 | 0.700 | 0.525 |
| Qwen3-14B | 0.763 | 0.890 | 0.770 | 0.611 | 0.972 | 0.750 | 0.587 |
| Qwen3-32B | 0.768 | 0.901 | 0.779 | 0.646 | 0.964 | 0.717 | 0.602 |
| Qwen3-30B-A3B | 0.764 | 0.883 | 0.782 | 0.636 | 0.964 | 0.733 | 0.589 |
| Qwen3-235B-A22B-Instruct-2507 | 0.771 | 0.940 | 0.824 | 0.586 | 0.982 | 0.767 | 0.529 |
| Qwen3-235B-A22B-Thinking-2507 | 0.856 | 0.931 | 0.845 | 0.803 | 0.980 | 0.883 | 0.692 |
| Qwen3-Next-80B-A3B-Instruct | 0.802 | 0.929 | 0.824 | 0.753 | 0.980 | 0.733 | 0.592 |
| Qwen3-Next-80B-A3B-Thinking | 0.840 | 0.923 | 0.828 | 0.798 | 0.974 | 0.850 | 0.670 |
| Qwen3-Swallow-30B-A3B-CPT-v0.2 | 0.696 | 0.849 | 0.745 | 0.611 | 0.922 | 0.533 | 0.513 |
| Qwen3-Swallow-30B-A3B-RL-v0.2 | 0.732 | 0.808 | 0.674 | 0.576 | 0.956 | 0.733 | 0.646 |
| Qwen3-Swallow-30B-A3B-SFT-v0.2 | 0.704 | 0.832 | 0.711 | 0.571 | 0.932 | 0.667 | 0.514 |
| Qwen3-Swallow-8B-CPT-v0.2 | 0.601 | 0.791 | 0.621 | 0.515 | 0.872 | 0.467 | 0.341 |
| Qwen3-Swallow-32B-CPT-v0.2 | 0.731 | 0.899 | 0.761 | 0.636 | 0.950 | 0.583 | 0.556 |
| Qwen3-Swallow-8B-RL-v0.2 | 0.694 | 0.790 | 0.675 | 0.540 | 0.938 | 0.667 | 0.555 |
| Qwen3-Swallow-32B-RL-v0.2 | 0.792 | 0.882 | 0.778 | 0.657 | 0.976 | 0.800 | 0.662 |
| Qwen3-Swallow-8B-SFT-v0.2 | 0.638 | 0.797 | 0.669 | 0.545 | 0.926 | 0.467 | 0.422 |
| Qwen3-Swallow-32B-SFT-v0.2 | 0.729 | 0.879 | 0.757 | 0.631 | 0.960 | 0.583 | 0.567 |
| QwQ Bakeneko 32B | 0.718 | 0.903 | 0.773 | 0.611 | 0.942 | 0.567 | 0.513 |
| Sarashina2.2 3B Instruct v0.1 | 0.318 | 0.613 | 0.329 | 0.293 | 0.570 | 0.017 | 0.086 |
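In contrast, the English average does appear to be the unweighted mean of the six benchmark scores; this is an observation from the data rather than a documented formula, but it checks out across the rows I spot-checked. For example, for GPT-5:

```python
# The English post-training average appears to equal the unweighted mean of
# the six benchmark scores (checked against several rows of the table above).
# Example row: GPT-5 (gpt-5-2025-08-07).
scores = [0.959, 0.865, 0.828, 0.990, 0.933, 0.677]
print(f"{sum(scores) / len(scores):.3f}")  # 0.875, matching the reported average
```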
| Model | MTB (Ja) avg | Coding | Extraction | Humanities | Math | Reasoning | Roleplay | STEM | Writing |
|---|---|---|---|---|---|---|---|---|---|
| ABEJA-QwQ32b-Reasoning-Japanese-v1.0 | 0.843 | 0.868 | 0.893 | 0.885 | 0.889 | 0.694 | 0.848 | 0.850 | 0.821 |
| Apertus-8B-Instruct | 0.576 | 0.478 | 0.525 | 0.728 | 0.352 | 0.446 | 0.697 | 0.672 | 0.707 |
| Apertus-70B-Instruct | 0.675 | 0.558 | 0.709 | 0.848 | 0.486 | 0.443 | 0.811 | 0.767 | 0.774 |
| CyberAgentLM3-22B-chat | 0.697 | 0.500 | 0.733 | 0.859 | 0.591 | 0.611 | 0.791 | 0.721 | 0.769 |
| DeepSeek-R1-Distill-Llama-8B | 0.526 | 0.376 | 0.625 | 0.681 | 0.595 | 0.496 | 0.483 | 0.510 | 0.442 |
| DeepSeek-R1-Distill-Llama-70B | 0.707 | 0.551 | 0.778 | 0.838 | 0.780 | 0.525 | 0.768 | 0.733 | 0.681 |
| DeepSeek-R1-Distill-Qwen-7B | 0.411 | 0.371 | 0.572 | 0.347 | 0.804 | 0.346 | 0.275 | 0.341 | 0.228 |
| DeepSeek-R1-Distill-Qwen-14B | 0.700 | 0.632 | 0.803 | 0.739 | 0.857 | 0.563 | 0.720 | 0.631 | 0.658 |
| DeepSeek-R1-Distill-Qwen-32B | 0.753 | 0.669 | 0.874 | 0.764 | 0.867 | 0.606 | 0.790 | 0.738 | 0.716 |
| DeepSeek-R1-Distill-Qwen-14B-Japanese | 0.771 | 0.557 | 0.777 | 0.880 | 0.871 | 0.664 | 0.801 | 0.859 | 0.758 |
| DeepSeek-R1-Distill-Qwen-32B-Japanese | 0.808 | 0.639 | 0.813 | 0.917 | 0.924 | 0.652 | 0.842 | 0.872 | 0.802 |
| ELYZA-Shortcut-1.0-Qwen-32B | 0.827 | 0.794 | 0.876 | 0.877 | 0.875 | 0.641 | 0.873 | 0.834 | 0.845 |
| ELYZA-Thinking-1.0-Qwen-32B | 0.694 | 0.687 | 0.824 | 0.688 | 0.927 | 0.641 | 0.583 | 0.656 | 0.542 |
| Flux-Japanese-Qwen2.5-32B-Instruct-V1.0 | 0.801 | 0.807 | 0.819 | 0.856 | 0.892 | 0.654 | 0.791 | 0.818 | 0.768 |
| Gemini 3 Pro Preview (gemini-3-pro-preview) | 0.906 | 0.902 | 0.897 | 0.934 | 0.994 | 0.865 | 0.880 | 0.901 | 0.873 |
| Gemma 2 2B IT | 0.555 | 0.460 | 0.585 | 0.673 | 0.448 | 0.422 | 0.641 | 0.571 | 0.639 |
| Gemma 2 9B IT | 0.743 | 0.635 | 0.816 | 0.865 | 0.686 | 0.649 | 0.784 | 0.734 | 0.773 |
| Gemma 2 27B IT | 0.762 | 0.760 | 0.825 | 0.874 | 0.697 | 0.578 | 0.818 | 0.745 | 0.796 |
| Gemma-2-Llama Swallow 2B IT | 0.583 | 0.408 | 0.551 | 0.774 | 0.420 | 0.418 | 0.725 | 0.655 | 0.709 |
| Gemma-2-Llama Swallow 9B IT | 0.729 | 0.579 | 0.787 | 0.880 | 0.661 | 0.616 | 0.788 | 0.735 | 0.783 |
| Gemma-2-Llama Swallow 27B IT | 0.759 | 0.627 | 0.846 | 0.868 | 0.767 | 0.548 | 0.796 | 0.785 | 0.833 |
| Gemma 3 1B IT | 0.434 | 0.396 | 0.484 | 0.519 | 0.343 | 0.337 | 0.519 | 0.434 | 0.436 |
| Gemma 3 4B IT | 0.735 | 0.727 | 0.650 | 0.814 | 0.826 | 0.482 | 0.787 | 0.796 | 0.802 |
| Gemma 3 12B IT | 0.811 | 0.784 | 0.807 | 0.880 | 0.858 | 0.582 | 0.856 | 0.878 | 0.844 |
| Gemma 3 27B IT | 0.830 | 0.747 | 0.942 | 0.878 | 0.808 | 0.733 | 0.849 | 0.853 | 0.831 |
| GPT-4.1 (gpt-4.1-2025-04-14) | 0.892 | 0.917 | 0.911 | 0.885 | 0.980 | 0.819 | 0.879 | 0.887 | 0.858 |
| GPT-4o (gpt-4o-2024-08-06) | 0.865 | 0.896 | 0.929 | 0.874 | 0.895 | 0.755 | 0.869 | 0.847 | 0.855 |
| GPT-5 (gpt-5-2025-08-07) | 0.882 | 0.893 | 0.883 | 0.928 | 0.882 | 0.758 | 0.896 | 0.933 | 0.885 |
| GPT-5 mini (gpt-5-mini-2025-08-07) | 0.898 | 0.912 | 0.907 | 0.902 | 0.949 | 0.834 | 0.871 | 0.937 | 0.875 |
| GPT-5.1 Thinking (gpt-5.1-2025-11-13) | 0.897 | 0.899 | 0.885 | 0.932 | 0.996 | 0.766 | 0.905 | 0.921 | 0.872 |
| gpt-oss-20b | 0.869 | 0.914 | 0.917 | 0.853 | 0.994 | 0.772 | 0.772 | 0.909 | 0.824 |
| gpt-oss-120b | 0.907 | 0.898 | 0.924 | 0.915 | 0.999 | 0.862 | 0.855 | 0.948 | 0.852 |
| GPT-OSS-Swallow-20B-RL-v0.1 | 0.872 | 0.915 | 0.888 | 0.809 | 0.938 | 0.788 | 0.886 | 0.903 | 0.849 |
| GPT-OSS-Swallow-120B-RL-v0.1 | 0.916 | 0.901 | 0.959 | 0.928 | 0.977 | 0.848 | 0.915 | 0.937 | 0.864 |
| GPT-OSS-Swallow-20B-SFT-v0.1 | 0.893 | 0.936 | 0.904 | 0.858 | 1.000 | 0.799 | 0.881 | 0.917 | 0.851 |
| GPT-OSS-Swallow-120B-SFT-v0.1 | 0.902 | 0.934 | 0.928 | 0.905 | 1.000 | 0.793 | 0.883 | 0.928 | 0.847 |
| Llama 3.1 8B Instruct | 0.592 | 0.528 | 0.848 | 0.585 | 0.600 | 0.465 | 0.569 | 0.562 | 0.577 |
| Llama-3.1-Nemotron-Nano-8B-v1 | 0.363 | 0.374 | 0.503 | 0.311 | 0.564 | 0.270 | 0.289 | 0.301 | 0.293 |
| Llama 3.1 Swallow 8B Instruct v0.3 | 0.709 | 0.570 | 0.783 | 0.869 | 0.631 | 0.506 | 0.782 | 0.716 | 0.813 |
| Llama 3.1 Swallow 8B Instruct v0.5 | 0.726 | 0.590 | 0.843 | 0.884 | 0.470 | 0.618 | 0.780 | 0.799 | 0.822 |
| Llama 3.3 70B Instruct | 0.735 | 0.672 | 0.878 | 0.751 | 0.742 | 0.638 | 0.762 | 0.735 | 0.700 |
| Llama-3.3-Nemotron-Super-49B-v1 | 0.806 | 0.731 | 0.898 | 0.821 | 0.801 | 0.755 | 0.804 | 0.809 | 0.828 |
| Llama 3.3 Swallow 70B Instruct v0.4 | 0.791 | 0.696 | 0.856 | 0.881 | 0.807 | 0.664 | 0.827 | 0.772 | 0.822 |
| Llama 4 Scout Instruct | 0.789 | 0.763 | 0.923 | 0.816 | 0.879 | 0.615 | 0.787 | 0.752 | 0.778 |
| llm-jp-3.1-1.8b-instruct4 | 0.657 | 0.574 | 0.601 | 0.809 | 0.672 | 0.446 | 0.767 | 0.697 | 0.693 |
| llm-jp-3.1-13b-instruct4 | 0.733 | 0.587 | 0.700 | 0.870 | 0.731 | 0.559 | 0.831 | 0.775 | 0.807 |
| MedGemma 27B IT | 0.778 | 0.799 | 0.926 | 0.805 | 0.883 | 0.646 | 0.718 | 0.758 | 0.686 |
| o3 (o3-2025-04-16) | 0.903 | 0.935 | 0.898 | 0.888 | 0.995 | 0.809 | 0.889 | 0.941 | 0.867 |
| o3-mini (o3-mini-2025-01-31) | 0.880 | 0.868 | 0.937 | 0.860 | 0.952 | 0.802 | 0.863 | 0.893 | 0.868 |
| Olmo 3 7B Think | 0.498 | 0.439 | 0.604 | 0.522 | 0.705 | 0.393 | 0.488 | 0.472 | 0.359 |
| Olmo 3 32B Think | 0.616 | 0.494 | 0.646 | 0.670 | 0.749 | 0.535 | 0.673 | 0.584 | 0.579 |
| Phi-4 | 0.822 | 0.752 | 0.933 | 0.862 | 0.890 | 0.629 | 0.830 | 0.845 | 0.835 |
| Phi-4-reasoning-plus | 0.374 | 0.205 | 0.376 | 0.206 | 0.379 | 0.283 | 0.643 | 0.162 | 0.741 |
| Qwen2.5-7B-Instruct | 0.688 | 0.638 | 0.711 | 0.782 | 0.685 | 0.494 | 0.736 | 0.730 | 0.729 |
| Qwen2.5-14B-Instruct | 0.799 | 0.773 | 0.882 | 0.850 | 0.796 | 0.646 | 0.829 | 0.795 | 0.822 |
| Qwen2.5-32B-Instruct | 0.819 | 0.776 | 0.913 | 0.845 | 0.863 | 0.706 | 0.839 | 0.802 | 0.811 |
| Qwen3-0.6B | 0.431 | 0.332 | 0.423 | 0.460 | 0.626 | 0.346 | 0.418 | 0.445 | 0.402 |
| Qwen3-1.7B | 0.662 | 0.574 | 0.591 | 0.715 | 0.841 | 0.567 | 0.631 | 0.765 | 0.613 |
| Qwen3-4B | 0.797 | 0.696 | 0.818 | 0.855 | 0.947 | 0.729 | 0.747 | 0.826 | 0.760 |
| Qwen3-8B | 0.845 | 0.757 | 0.834 | 0.890 | 0.996 | 0.823 | 0.829 | 0.822 | 0.806 |
| Qwen3-14B | 0.874 | 0.850 | 0.839 | 0.903 | 0.994 | 0.824 | 0.839 | 0.919 | 0.827 |
| Qwen3-32B | 0.875 | 0.794 | 0.871 | 0.871 | 0.997 | 0.836 | 0.881 | 0.917 | 0.830 |
| Qwen3-30B-A3B | 0.858 | 0.883 | 0.801 | 0.910 | 0.966 | 0.763 | 0.860 | 0.853 | 0.832 |
| Qwen3-235B-A22B-Instruct-2507 | 0.915 | 0.943 | 0.938 | 0.907 | 0.987 | 0.826 | 0.893 | 0.933 | 0.891 |
| Qwen3-235B-A22B-Thinking-2507 | 0.904 | 0.896 | 0.878 | 0.933 | 0.985 | 0.851 | 0.876 | 0.955 | 0.861 |
| Qwen3-Next-80B-A3B-Instruct | 0.916 | 0.919 | 0.920 | 0.931 | 0.987 | 0.830 | 0.890 | 0.945 | 0.906 |
| Qwen3-Next-80B-A3B-Thinking | 0.879 | 0.871 | 0.855 | 0.912 | 0.993 | 0.811 | 0.850 | 0.920 | 0.821 |
| Qwen3-Swallow-30B-A3B-CPT-v0.2 | 0.747 | 0.699 | 0.708 | 0.784 | 0.805 | 0.588 | 0.870 | 0.857 | 0.669 |
| Qwen3-Swallow-30B-A3B-RL-v0.2 | 0.889 | 0.908 | 0.882 | 0.896 | 0.993 | 0.785 | 0.905 | 0.904 | 0.835 |
| Qwen3-Swallow-30B-A3B-SFT-v0.2 | 0.887 | 0.906 | 0.896 | 0.919 | 0.965 | 0.817 | 0.889 | 0.882 | 0.825 |
| Qwen3-Swallow-8B-CPT-v0.2 | 0.719 | 0.699 | 0.618 | 0.730 | 0.820 | 0.574 | 0.859 | 0.812 | 0.637 |
| Qwen3-Swallow-32B-CPT-v0.2 | 0.788 | 0.696 | 0.785 | 0.783 | 0.887 | 0.732 | 0.878 | 0.840 | 0.706 |
| Qwen3-Swallow-8B-RL-v0.2 | 0.844 | 0.899 | 0.775 | 0.799 | 0.927 | 0.748 | 0.882 | 0.890 | 0.828 |
| Qwen3-Swallow-32B-RL-v0.2 | 0.894 | 0.915 | 0.941 | 0.881 | 0.981 | 0.803 | 0.896 | 0.893 | 0.842 |
| Qwen3-Swallow-8B-SFT-v0.2 | 0.868 | 0.859 | 0.821 | 0.878 | 0.965 | 0.831 | 0.893 | 0.910 | 0.785 |
| Qwen3-Swallow-32B-SFT-v0.2 | 0.879 | 0.911 | 0.841 | 0.886 | 0.933 | 0.814 | 0.885 | 0.919 | 0.846 |
| QwQ Bakeneko 32B | 0.879 | 0.839 | 0.903 | 0.886 | 0.969 | 0.838 | 0.857 | 0.892 | 0.850 |
| Sarashina2.2 3B Instruct v0.1 | 0.721 | 0.579 | 0.680 | 0.862 | 0.828 | 0.467 | 0.832 | 0.766 | 0.752 |
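The MTB averages likewise appear to be the unweighted mean of the eight category scores (again an observation from the data, not a stated formula). Checking the Gemini 3 Pro Preview row as an example:

```python
# MTB (Ja) avg appears to equal the mean of the eight category scores.
# Example row: Gemini 3 Pro Preview.
cats = [0.902, 0.897, 0.934, 0.994, 0.865, 0.880, 0.901, 0.873]
mean = sum(cats) / len(cats)
print(abs(mean - 0.906) < 0.001)  # True: mean is approximately 0.9058
```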
| Model | MTB (En) avg | Coding | Extraction | Humanities | Math | Reasoning | Roleplay | STEM | Writing |
|---|---|---|---|---|---|---|---|---|---|
| ABEJA-QwQ32b-Reasoning-Japanese-v1.0 | 0.866 | 0.808 | 0.878 | 0.899 | 0.951 | 0.757 | 0.872 | 0.882 | 0.881 |
| Apertus-8B-Instruct | 0.628 | 0.494 | 0.601 | 0.817 | 0.423 | 0.445 | 0.787 | 0.682 | 0.775 |
| Apertus-70B-Instruct | 0.740 | 0.587 | 0.732 | 0.896 | 0.608 | 0.640 | 0.800 | 0.815 | 0.844 |
| CyberAgentLM3-22B-chat | 0.621 | 0.467 | 0.695 | 0.828 | 0.479 | 0.429 | 0.678 | 0.647 | 0.747 |
| DeepSeek-R1-Distill-Llama-8B | 0.704 | 0.398 | 0.745 | 0.825 | 0.827 | 0.562 | 0.768 | 0.731 | 0.775 |
| DeepSeek-R1-Distill-Llama-70B | 0.842 | 0.787 | 0.931 | 0.862 | 0.919 | 0.723 | 0.850 | 0.806 | 0.854 |
| DeepSeek-R1-Distill-Qwen-7B | 0.649 | 0.481 | 0.656 | 0.708 | 0.762 | 0.520 | 0.686 | 0.700 | 0.677 |
| DeepSeek-R1-Distill-Qwen-14B | 0.775 | 0.512 | 0.851 | 0.815 | 0.886 | 0.745 | 0.841 | 0.750 | 0.803 |
| DeepSeek-R1-Distill-Qwen-32B | 0.822 | 0.619 | 0.901 | 0.869 | 0.918 | 0.793 | 0.861 | 0.768 | 0.850 |
| DeepSeek-R1-Distill-Qwen-14B-Japanese | 0.835 | 0.724 | 0.884 | 0.870 | 0.907 | 0.771 | 0.867 | 0.817 | 0.838 |
| DeepSeek-R1-Distill-Qwen-32B-Japanese | 0.857 | 0.730 | 0.893 | 0.894 | 0.964 | 0.770 | 0.871 | 0.872 | 0.861 |
| ELYZA-Shortcut-1.0-Qwen-32B | 0.868 | 0.834 | 0.888 | 0.892 | 0.890 | 0.841 | 0.864 | 0.888 | 0.845 |
| ELYZA-Thinking-1.0-Qwen-32B | 0.748 | 0.770 | 0.913 | 0.754 | 0.912 | 0.775 | 0.617 | 0.639 | 0.606 |
| Flux-Japanese-Qwen2.5-32B-Instruct-V1.0 | 0.853 | 0.789 | 0.852 | 0.885 | 0.957 | 0.782 | 0.852 | 0.868 | 0.838 |
| Gemini 3 Pro Preview (gemini-3-pro-preview) | 0.907 | 0.882 | 0.895 | 0.932 | 0.960 | 0.835 | 0.915 | 0.942 | 0.897 |
| Gemma 2 2B IT | 0.718 | 0.543 | 0.687 | 0.868 | 0.659 | 0.609 | 0.780 | 0.780 | 0.816 |
| Gemma 2 9B IT | 0.761 | 0.624 | 0.799 | 0.893 | 0.682 | 0.610 | 0.832 | 0.808 | 0.841 |
| Gemma 2 27B IT | 0.800 | 0.701 | 0.855 | 0.891 | 0.724 | 0.702 | 0.843 | 0.827 | 0.858 |
| Gemma-2-Llama Swallow 2B IT | 0.584 | 0.461 | 0.534 | 0.758 | 0.376 | 0.452 | 0.728 | 0.646 | 0.715 |
| Gemma-2-Llama Swallow 9B IT | 0.734 | 0.523 | 0.831 | 0.886 | 0.720 | 0.518 | 0.789 | 0.786 | 0.819 |
| Gemma-2-Llama Swallow 27B IT | 0.771 | 0.626 | 0.789 | 0.883 | 0.751 | 0.622 | 0.824 | 0.821 | 0.853 |
| Gemma 3 1B IT | 0.578 | 0.503 | 0.453 | 0.719 | 0.709 | 0.366 | 0.634 | 0.686 | 0.553 |
| Gemma 3 4B IT | 0.793 | 0.704 | 0.762 | 0.901 | 0.855 | 0.553 | 0.869 | 0.845 | 0.857 |
| Gemma 3 12B IT | 0.860 | 0.741 | 0.862 | 0.917 | 0.932 | 0.758 | 0.891 | 0.899 | 0.879 |
| Gemma 3 27B IT | 0.880 | 0.767 | 0.919 | 0.920 | 0.924 | 0.797 | 0.908 | 0.917 | 0.888 |
| GPT-4.1 (gpt-4.1-2025-04-14) | 0.908 | 0.898 | 0.936 | 0.903 | 0.942 | 0.863 | 0.901 | 0.925 | 0.898 |
| GPT-4o (gpt-4o-2024-08-06) | 0.922 | 0.943 | 0.927 | 0.896 | 0.993 | 0.976 | 0.874 | 0.905 | 0.865 |
| GPT-5 (gpt-5-2025-08-07) | 0.888 | 0.876 | 0.912 | 0.923 | 0.843 | 0.811 | 0.906 | 0.927 | 0.904 |
| GPT-5 mini (gpt-5-mini-2025-08-07) | 0.902 | 0.914 | 0.882 | 0.925 | 0.925 | 0.835 | 0.905 | 0.939 | 0.890 |
| GPT-5.1 Thinking (gpt-5.1-2025-11-13) | 0.907 | 0.878 | 0.911 | 0.921 | 0.954 | 0.837 | 0.912 | 0.949 | 0.895 |
| gpt-oss-20b | 0.889 | 0.913 | 0.881 | 0.935 | 0.913 | 0.779 | 0.880 | 0.939 | 0.869 |
| gpt-oss-120b | 0.918 | 0.947 | 0.892 | 0.915 | 0.989 | 0.871 | 0.886 | 0.960 | 0.886 |
| GPT-OSS-Swallow-20B-RL-v0.1 | 0.846 | 0.873 | 0.855 | 0.848 | 0.869 | 0.700 | 0.874 | 0.913 | 0.838 |
| GPT-OSS-Swallow-120B-RL-v0.1 | 0.905 | 0.885 | 0.900 | 0.896 | 0.934 | 0.873 | 0.906 | 0.966 | 0.878 |
| GPT-OSS-Swallow-20B-SFT-v0.1 | 0.879 | 0.942 | 0.894 | 0.893 | 0.910 | 0.745 | 0.881 | 0.923 | 0.846 |
| GPT-OSS-Swallow-120B-SFT-v0.1 | 0.894 | 0.895 | 0.923 | 0.903 | 0.994 | 0.801 | 0.873 | 0.912 | 0.852 |
| Llama 3.1 8B Instruct | 0.737 | 0.556 | 0.816 | 0.871 | 0.697 | 0.522 | 0.821 | 0.765 | 0.850 |
| Llama-3.1-Nemotron-Nano-8B-v1 | 0.701 | 0.658 | 0.654 | 0.696 | 0.906 | 0.526 | 0.712 | 0.738 | 0.720 |
| Llama 3.1 Swallow 8B Instruct v0.3 | 0.691 | 0.528 | 0.714 | 0.886 | 0.562 | 0.458 | 0.773 | 0.768 | 0.838 |
| Llama 3.1 Swallow 8B Instruct v0.5 | 0.753 | 0.576 | 0.801 | 0.900 | 0.769 | 0.499 | 0.848 | 0.796 | 0.833 |
| Llama 3.3 70B Instruct | 0.863 | 0.795 | 0.935 | 0.891 | 0.895 | 0.861 | 0.858 | 0.822 | 0.847 |
| Llama-3.3-Nemotron-Super-49B-v1 | 0.881 | 0.782 | 0.915 | 0.910 | 0.963 | 0.800 | 0.878 | 0.908 | 0.893 |
| Llama 3.3 Swallow 70B Instruct v0.4 | 0.816 | 0.672 | 0.902 | 0.888 | 0.839 | 0.706 | 0.828 | 0.838 | 0.855 |
| Llama 4 Scout Instruct | 0.857 | 0.722 | 0.911 | 0.860 | 0.920 | 0.904 | 0.836 | 0.840 | 0.862 |
| llm-jp-3.1-1.8b-instruct4 | 0.548 | 0.454 | 0.482 | 0.662 | 0.521 | 0.364 | 0.665 | 0.563 | 0.673 |
| llm-jp-3.1-13b-instruct4 | 0.682 | 0.562 | 0.681 | 0.844 | 0.625 | 0.512 | 0.736 | 0.715 | 0.779 |
| MedGemma 27B IT | 0.830 | 0.722 | 0.914 | 0.884 | 0.970 | 0.735 | 0.819 | 0.858 | 0.737 |
| o3 (o3-2025-04-16) | 0.917 | 0.929 | 0.931 | 0.945 | 0.964 | 0.836 | 0.900 | 0.938 | 0.892 |
| o3-mini (o3-mini-2025-01-31) | 0.901 | 0.876 | 0.913 | 0.891 | 0.969 | 0.865 | 0.895 | 0.914 | 0.882 |
| Olmo 3 7B Think | 0.621 | 0.472 | 0.615 | 0.667 | 0.703 | 0.543 | 0.707 | 0.600 | 0.663 |
| Olmo 3 32B Think | 0.689 | 0.546 | 0.647 | 0.776 | 0.748 | 0.665 | 0.799 | 0.671 | 0.663 |
| Phi-4 | 0.881 | 0.771 | 0.904 | 0.876 | 0.928 | 0.933 | 0.889 | 0.879 | 0.865 |
| Phi-4-reasoning-plus | 0.426 | 0.281 | 0.384 | 0.116 | 0.437 | 0.322 | 0.769 | 0.299 | 0.800 |
| Qwen2.5-7B-Instruct | 0.797 | 0.656 | 0.769 | 0.893 | 0.843 | 0.662 | 0.832 | 0.886 | 0.833 |
| Qwen2.5-14B-Instruct | 0.865 | 0.752 | 0.873 | 0.899 | 0.932 | 0.861 | 0.870 | 0.894 | 0.839 |
| Qwen2.5-32B-Instruct | 0.869 | 0.806 | 0.862 | 0.895 | 0.954 | 0.817 | 0.876 | 0.890 | 0.851 |
| Qwen3-0.6B | 0.595 | 0.376 | 0.678 | 0.673 | 0.803 | 0.408 | 0.551 | 0.633 | 0.637 |
| Qwen3-1.7B | 0.779 | 0.642 | 0.754 | 0.830 | 0.968 | 0.686 | 0.764 | 0.785 | 0.805 |
| Qwen3-4B | 0.839 | 0.737 | 0.831 | 0.884 | 0.947 | 0.735 | 0.870 | 0.861 | 0.845 |
| Qwen3-8B | 0.851 | 0.804 | 0.831 | 0.892 | 0.980 | 0.713 | 0.858 | 0.876 | 0.854 |
| Qwen3-14B | 0.882 | 0.843 | 0.849 | 0.904 | 0.971 | 0.805 | 0.878 | 0.919 | 0.890 |
| Qwen3-32B | 0.892 | 0.860 | 0.910 | 0.905 | 0.979 | 0.796 | 0.899 | 0.919 | 0.869 |
| Qwen3-30B-A3B | 0.893 | 0.915 | 0.909 | 0.909 | 0.972 | 0.798 | 0.882 | 0.895 | 0.866 |
| Qwen3-235B-A22B-Instruct-2507 | 0.911 | 0.888 | 0.859 | 0.925 | 0.990 | 0.873 | 0.911 | 0.940 | 0.905 |
| Qwen3-235B-A22B-Thinking-2507 | 0.922 | 0.877 | 0.899 | 0.945 | 0.998 | 0.886 | 0.908 | 0.952 | 0.908 |
| Qwen3-Next-80B-A3B-Instruct | 0.920 | 0.908 | 0.925 | 0.932 | 0.966 | 0.842 | 0.925 | 0.949 | 0.912 |
| Qwen3-Next-80B-A3B-Thinking | 0.883 | 0.855 | 0.853 | 0.934 | 0.941 | 0.774 | 0.889 | 0.964 | 0.851 |
| Qwen3-Swallow-30B-A3B-CPT-v0.2 | 0.748 | 0.731 | 0.737 | 0.755 | 0.773 | 0.732 | 0.831 | 0.747 | 0.680 |
| Qwen3-Swallow-30B-A3B-RL-v0.2 | 0.866 | 0.831 | 0.857 | 0.883 | 0.940 | 0.793 | 0.889 | 0.906 | 0.827 |
| Qwen3-Swallow-30B-A3B-SFT-v0.2 | 0.882 | 0.934 | 0.844 | 0.923 | 0.964 | 0.775 | 0.878 | 0.916 | 0.821 |
| Qwen3-Swallow-8B-CPT-v0.2 | 0.683 | 0.612 | 0.650 | 0.694 | 0.799 | 0.507 | 0.831 | 0.733 | 0.642 |
| Qwen3-Swallow-32B-CPT-v0.2 | 0.766 | 0.684 | 0.762 | 0.745 | 0.822 | 0.743 | 0.845 | 0.770 | 0.754 |
| Qwen3-Swallow-8B-RL-v0.2 | 0.855 | 0.819 | 0.864 | 0.885 | 0.919 | 0.720 | 0.904 | 0.897 | 0.832 |
| Qwen3-Swallow-32B-RL-v0.2 | 0.877 | 0.909 | 0.799 | 0.902 | 0.989 | 0.762 | 0.886 | 0.911 | 0.855 |
| Qwen3-Swallow-8B-SFT-v0.2 | 0.855 | 0.877 | 0.785 | 0.910 | 0.928 | 0.752 | 0.878 | 0.903 | 0.809 |
| Qwen3-Swallow-32B-SFT-v0.2 | 0.890 | 0.887 | 0.862 | 0.901 | 1.000 | 0.752 | 0.909 | 0.951 | 0.860 |
| QwQ Bakeneko 32B | 0.871 | 0.848 | 0.821 | 0.883 | 0.967 | 0.816 | 0.851 | 0.892 | 0.890 |
| Sarashina2.2 3B Instruct v0.1 | 0.708 | 0.499 | 0.642 | 0.863 | 0.747 | 0.552 | 0.783 | 0.827 | 0.750 |