As traditional AI benchmarks prove insufficient, developers are exploring new ways to evaluate the capabilities of generative AI models. One group of developers has chosen Minecraft, the sandbox-building game owned by Microsoft, as its testing ground.
The website Minecraft Benchmark, also known as MC-Bench, was built collaboratively to host head-to-head challenges in which AI models respond to the same prompt with a Minecraft creation. Users vote on which model did the better job, and only after voting do they learn which AI produced which build.
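The article does not say how MC-Bench turns individual votes into a ranking. One common way to score blind head-to-head comparisons like this is an Elo-style rating; the sketch below is an illustrative assumption, not a description of MC-Bench's actual scoring code, and the K-factor and starting rating are arbitrary.

```python
# Minimal sketch of aggregating blind A-vs-B votes into an Elo-style
# leaderboard. This is an assumed approach for illustration only;
# MC-Bench's real scoring method may differ.

K = 32  # update step size (assumed)

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def record_vote(ratings: dict, model_a: str, model_b: str, a_won: bool) -> None:
    """Update both models' ratings after a user picks a winner."""
    ra = ratings.setdefault(model_a, 1000.0)
    rb = ratings.setdefault(model_b, 1000.0)
    ea = expected_score(ra, rb)
    score_a = 1.0 if a_won else 0.0
    ratings[model_a] = ra + K * (score_a - ea)
    ratings[model_b] = rb + K * ((1.0 - score_a) - (1.0 - ea))

ratings = {}
record_vote(ratings, "model_x", "model_y", a_won=True)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # leaderboard order
```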
Adi Singh, the 12th-grader who started MC-Bench, says the value lies not in Minecraft itself but in its recognizability: it is the best-selling video game of all time, so even people who have never played it can judge at a glance whether an AI-generated blocky pineapple looks right.
Singh explained to TechCrunch that Minecraft offers a familiar interface, making it easier to observe AI development progress. Currently, the MC-Bench team consists of eight volunteer contributors. While companies like Anthropic, Google, OpenAI, and Alibaba have subsidized the use of their products for running benchmark prompts, they are not directly affiliated with the project.
Singh stated that the team is currently focusing on simple builds to measure progress since the GPT-3 era, but they have plans to expand towards longer projects and goal-oriented tasks. Singh believes that games provide a safer and more controlled environment for testing agentic reasoning than real life.
Other games, including Pokémon Red, Street Fighter, and Pictionary, have also been pressed into service as experimental AI benchmarks, in part because benchmarking AI models well is genuinely difficult.
The common practice is to test AI models against standardized evaluations, but many of these tests play to the models' strengths: because of how they are trained, the models excel at narrow, well-defined problems, especially ones that reward rote memorization or simple extrapolation.
For example, OpenAI’s GPT-4 can score in the 88th percentile on the LSAT, yet it struggles to count the number of ‘R’s in the word “strawberry.” Anthropic’s Claude 3.7 Sonnet scored 62.3% accuracy on a software engineering benchmark, but it plays Pokémon worse than most young children.
MC-Bench is technically a programming benchmark, since the models write code to produce builds like “Frosty the Snowman” or “a charming tropical beach hut on a pristine sandy shore.” But judging a visual result is far easier for most people than reading raw code, which widens the pool of voters and lets the project collect more data on how the models stack up.
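The source doesn’t show what the generated code actually looks like, but as a rough illustration, a prompt like “Frosty the Snowman” might be answered with a short block-placement script. The `set_block` helper below is a hypothetical stand-in for whatever interface MC-Bench really exposes to the models.

```python
# Hypothetical sketch of the kind of script a model might return for a
# build prompt such as "Frosty the Snowman". set_block(x, y, z, kind) is
# an assumed stand-in, not MC-Bench's actual block-placement API.

def set_block(x: int, y: int, z: int, kind: str) -> None:
    print(f"place {kind} at ({x}, {y}, {z})")  # stub for illustration

def sphere(cx: int, cy: int, cz: int, radius: int, kind: str) -> None:
    """Fill a rough sphere of blocks centered at (cx, cy, cz)."""
    for x in range(cx - radius, cx + radius + 1):
        for y in range(cy - radius, cy + radius + 1):
            for z in range(cz - radius, cz + radius + 1):
                if (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 <= radius ** 2:
                    set_block(x, y, z, kind)

# Three stacked snow spheres plus a carrot nose: a blocky snowman.
sphere(0, 3, 0, 3, "snow_block")   # base
sphere(0, 8, 0, 2, "snow_block")   # torso
sphere(0, 11, 0, 1, "snow_block")  # head
set_block(0, 11, 2, "carrot")      # nose
```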
Singh argues that MC-Bench scores give companies a useful signal of how capable their models actually are. He noted that the current leaderboard tracks his own experience of using these models far more closely than many pure text benchmarks do.