What Is Gemma 4's MTP Drafter, Which Speeds Up Inference by Up to 3×?

Summary

Explains how the MTP (Multi-Token Prediction) drafter in Google's Gemma 4 works. As an improvement over conventional speculative decoding, it predicts multiple tokens in parallel inside the autoregressive model itself, so no separate draft model has to be prepared. Inference is reported to be up to 3× faster. The technique is drawing attention as a way to accelerate local LLM inference.
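The draft-and-verify idea behind speculative decoding can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Gemma 4's actual architecture or API: `target_next` and `draft_tokens` are hypothetical stand-ins for the main model's greedy choice and for an MTP-style drafter, and the drafter is deliberately made wrong on some positions so that rejection is exercised.

```python
from typing import List, Tuple

Token = int
VOCAB = 50  # toy vocabulary size (assumption for this sketch)


def target_next(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the main model's greedy next token."""
    return (sum(ctx) * 31 + len(ctx)) % VOCAB


def draft_tokens(ctx: List[Token], k: int) -> List[Token]:
    """Hypothetical stand-in for an MTP-style drafter guessing k tokens.

    It mimics the target but injects a wrong guess whenever the
    running context length is a multiple of 4.
    """
    out: List[Token] = []
    tmp = list(ctx)
    for _ in range(k):
        t = target_next(tmp)
        if len(tmp) % 4 == 0:
            t = (t + 1) % VOCAB  # deliberately wrong guess
        out.append(t)
        tmp.append(t)
    return out


def greedy_decode(prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one target-model pass per generated token."""
    ctx = list(prompt)
    for _ in range(n):
        ctx.append(target_next(ctx))
    return ctx[len(prompt):]


def speculative_decode(prompt: List[Token], n: int, k: int) -> Tuple[List[Token], int]:
    """Draft k tokens, keep the longest prefix the target agrees with.

    Returns the generated tokens and the number of verification passes.
    A real system scores all k draft positions in one batched forward
    pass; here that single pass is simulated position by position.
    """
    ctx = list(prompt)
    passes = 0
    while len(ctx) - len(prompt) < n:
        draft = draft_tokens(ctx, k)
        passes += 1
        accepted: List[Token] = []
        for t in draft:
            expected = target_next(ctx + accepted)
            if t != expected:
                accepted.append(expected)  # replace first mismatch, stop
                break
            accepted.append(t)
        else:
            # Every draft token matched: the pass also yields one extra token.
            accepted.append(target_next(ctx + accepted))
        ctx.extend(accepted)
    return ctx[len(prompt):len(prompt) + n], passes


if __name__ == "__main__":
    tokens, passes = speculative_decode([1, 2, 3], n=20, k=4)
    print(tokens == greedy_decode([1, 2, 3], 20), passes)
```

Because every accepted token is checked against the target's own choice, the output matches plain greedy decoding; the gain comes from needing fewer target passes than generated tokens. How large that gain is depends on how often the drafter's guesses are accepted.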

Key points

Explains Gemma 4's MTP (Multi-Token Prediction) drafter
Improves on conventional speculative decoding
Predicts multiple tokens in parallel with no separate draft model
Inference reported to be up to 3× faster

Source

Zenn (llm)