Debates over how AI labs report benchmark results have become increasingly public. Recently, an OpenAI employee accused Elon Musk's AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. Igor Babushkin, one of xAI's co-founders, defended the company's position.
The dispute centers on a graph xAI published showing Grok 3's performance on AIME 2025, a set of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME's validity as an AI benchmark, but AIME 2025 and earlier versions of the exam are commonly used to probe a model's mathematical ability.
According to xAI's graph, two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, outperformed OpenAI's best available model, o3-mini-high, on AIME 2025. However, OpenAI employees pointed out that xAI's graph left out o3-mini-high's AIME 2025 score at "cons@64."
"Cons@64" is short for "consensus@64." It gives a model 64 tries to answer each benchmark problem and takes the answer it generates most often as its final answer. Consensus@64 tends to boost models' benchmark scores considerably, so omitting it from a graph can make one model appear to surpass another when that isn't actually the case.
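To make the distinction concrete, here is a minimal sketch in Python of how a consensus-style score can be computed for a single problem via a majority vote over sampled answers, alongside a single-attempt score. The function names, the toy data, and the fixed 64-sample count are illustrative assumptions for this article, not xAI's or OpenAI's actual evaluation code.

```python
from collections import Counter

def consensus_at_64(sample_answers, correct_answer):
    """Score one problem under a cons@64-style rule: the model's 64 sampled
    answers are tallied and the most common one is taken as its final answer."""
    assert len(sample_answers) == 64
    majority_answer, _ = Counter(sample_answers).most_common(1)[0]
    return majority_answer == correct_answer

def at_1(sample_answers, correct_answer):
    """Single-attempt score ("@1"): only the first sampled answer counts."""
    return sample_answers[0] == correct_answer

# Hypothetical illustration: 64 attempts at one AIME-style problem whose answer is 42.
# The model gets the first attempt wrong but answers 42 most often overall,
# so it fails at @1 yet passes under cons@64, which is why the two numbers can diverge.
attempts = [17] + [42] * 40 + [17] * 23
print(at_1(attempts, 42))             # False
print(consensus_at_64(attempts, 42))  # True
```

Averaging such per-problem results over the whole benchmark yields the headline percentages the two companies are arguing over, which is why reporting one condition but not the other changes the picture.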
Grok 3 Reasoning Beta's and Grok 3 mini Reasoning's scores at "@1" (a model's first-attempt score on the benchmark) fall below o3-mini-high's. Grok 3 Reasoning Beta also lags behind OpenAI's o1 model set to "medium" computing, yet xAI is advertising Grok 3 as the "world's smartest AI."
Babushkin countered that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing its own models. A more neutral party in the debate put together a more "accurate" graph showing nearly every model's performance at cons@64.
AI researcher Nathan Lambert noted that the most important metric remains undisclosed: the computational (and monetary) cost it took each model to achieve its best score. That gap underscores how little most AI benchmarks reveal about models' limitations, as well as their strengths.