Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas

A 33-model cross-model survey: an atlas of domain-level metacognitive ability in frontier LLMs

Categories: Agent era; Other; Evaluation methods, metric design, and benchmark results (LMSys, SWE-bench, MMLU, ARC, etc.); AI-driven scientific discovery (medicine, biology, physics, climate, etc.); Text (natural language)

2026-05-11 · arXiv cs.CL

English summary

A 33-model atlas of domain-level metacognitive monitoring in frontier LLMs. The study measures whether models accurately estimate when they might be wrong, breaking results down by domain, model scale, and reasoning mode. Useful baseline data for hallucination mitigation and agent self-verification design.

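A large-scale study that comprehensively evaluates domain-level metacognitive ability (a model's awareness of what it knows and does not know) across 33 frontier LLMs. It measures whether a model can correctly estimate the likelihood that it is wrong, and reports trends by domain, scale, and reasoning mode. The results are useful as baseline data for designing hallucination mitigation and self-verification strategies for agents at run time.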
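This summary does not say which metrics the atlas uses. As an illustration only, one common way to quantify whether a model "correctly estimates when it might be wrong" is expected calibration error (ECE), computed separately per domain. The sketch below assumes a hypothetical record format of `(domain, confidence, is_correct)` tuples and uses made-up toy data. It is not the paper's method.

```python
from collections import defaultdict


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: size-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    bins = defaultdict(list)
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins.values():
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(accuracy - mean_conf)
    return ece


def ece_by_domain(records, n_bins=10):
    """Group (domain, confidence, is_correct) records and score each domain.

    The record format is an assumption made for this sketch.
    """
    grouped = defaultdict(lambda: ([], []))
    for domain, conf, ok in records:
        grouped[domain][0].append(conf)
        grouped[domain][1].append(ok)
    return {
        domain: expected_calibration_error(confs, oks, n_bins)
        for domain, (confs, oks) in grouped.items()
    }


# Toy, made-up records purely for illustration (not data from the study).
records = [
    ("medicine", 0.9, True), ("medicine", 0.9, False),
    ("physics", 0.75, True), ("physics", 0.75, True),
    ("physics", 0.75, True), ("physics", 0.75, False),
]
scores = ece_by_domain(records)
```

In this toy data the model is overconfident in one domain (stated confidence well above its accuracy) and well calibrated in the other, which is the kind of per-domain contrast a breakdown like this is meant to surface. Lower ECE means stated confidence tracks actual accuracy more closely.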

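Key points

An atlas of metacognitive ability spanning 33 models
Reports trends by domain, scale, and reasoning mode
Baseline data for hallucination mitigation and agent self-verification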

Source

arXiv cs.CL