On Monday, a Meta executive denied a rumor that the company had trained its latest AI models to perform well on specific benchmarks while hiding their weaknesses. Ahmad Al-Dahle, Meta’s Vice President of Generative AI, stated on X that it is “simply not true” that Meta trained its Llama 4 Maverick and Llama 4 Scout models on “test sets.” In the context of AI benchmarks, a test set is a dataset used to evaluate a model’s performance after training; training on it can artificially inflate a model’s benchmark scores and misrepresent its actual capabilities.
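To make the concern concrete, here is a minimal, hypothetical sketch of why test-set contamination inflates a score. It uses Python with scikit-learn on synthetic data; the dataset, model, and split are illustrative stand-ins and have nothing to do with Meta’s actual training pipeline or the benchmarks in question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in benchmark: synthetic data split into train and test.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Honest protocol: the model never sees the held-out test set during training.
honest = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("held-out score:", accuracy_score(y_test, honest.predict(X_test)))

# Contaminated protocol: the test set leaks into training, so the same
# benchmark now rewards memorization rather than generalization.
leaky = RandomForestClassifier(random_state=0).fit(
    np.vstack([X_train, X_test]), np.concatenate([y_train, y_test])
)
print("contaminated score:", accuracy_score(y_test, leaky.predict(X_test)))
```

The second score is essentially a memorization check: the model is graded on examples it has already seen, so the number says little about how it would perform on genuinely new inputs.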
Over the weekend, an unverified rumor alleging that Meta manipulated its new models’ benchmark results emerged on platforms like X and Reddit. This rumor seemed to originate from a Chinese social media post by a user claiming they resigned from Meta in protest over its benchmarking methods.
Reports that Maverick and Scout underperform on certain tasks added fuel to the rumor. Meta’s decision to use an experimental, unreleased variant of Maverick to post better scores on the LM Arena benchmark drew further scrutiny, and researchers on X noted significant differences in behavior between the publicly available Maverick and the version hosted on LM Arena.
Al-Dahle acknowledged that some users are seeing “mixed quality” from Maverick and Scout across the various cloud providers hosting the models. He said that because Meta released the models as soon as they were ready, it expects public implementations to take several days to be fully tuned, and that the company will keep working through bug fixes and onboarding its partners.