Thursday, October 17, 2024

Apple Engineers Demonstrate AI ‘Reasoning’ Fragility

For some time, companies like OpenAI and Google have promoted advanced “reasoning” capabilities as the next significant advancement in their latest artificial intelligence models. However, a recent study conducted by six Apple engineers reveals that the mathematical “reasoning” demonstrated by advanced large language models can be highly fragile and unreliable when confronted with minor changes to common benchmark problems.

The study’s results support earlier research indicating that large language models’ (LLMs) reliance on probabilistic pattern matching lacks the formal understanding necessary for consistent mathematical reasoning. The researchers suggest that “current LLMs are not capable of genuine logical reasoning.” Instead, these models attempt to replicate reasoning steps observed in their training data.

In their preprint paper, “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models,” the Apple researchers begin with GSM8K’s standardized set of over 8,000 grade-school-level mathematical word problems, which is often used as a benchmark for modern LLMs’ complex reasoning capabilities. They then modify part of this testing set by dynamically replacing certain names and numbers with new values—for instance, transforming a question about Sophie getting 31 building blocks for her nephew into a question about Bill getting 19 building blocks for his brother in the GSM-Symbolic evaluation.
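This templated substitution can be sketched in a few lines of Python. The template text, the value pools, and the function name below are illustrative assumptions, not the paper’s actual templates or tooling; the key point is that swapping names and numbers changes the surface of the question while leaving the arithmetic unchanged.

```python
import random

# A hypothetical GSM-Symbolic-style template. The bracketed slots and the
# sample value pools are illustrative, not taken from the paper.
TEMPLATE = ("{name} buys {count} building blocks for a {relative}. "
            "Each block costs {price} dollars. How much does {name} spend?")

NAMES = ["Sophie", "Bill", "Maya", "Omar"]
RELATIVES = ["nephew", "brother", "sister", "cousin"]

def make_variant(seed):
    """Generate one problem variant by swapping names and numbers.

    The underlying arithmetic (count * price) is unchanged, so the
    reasoning difficulty stays constant across variants."""
    rng = random.Random(seed)  # seeded so each variant is reproducible
    count = rng.randint(5, 40)
    price = rng.randint(1, 9)
    question = TEMPLATE.format(
        name=rng.choice(NAMES),
        count=count,
        relative=rng.choice(RELATIVES),
        price=price,
    )
    answer = count * price
    return question, answer
```

Because each variant carries its own ground-truth answer, a fresh test set can be generated on every run, sidestepping memorized benchmark questions.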

This approach helps avoid potential “data contamination” from the static GSM8K questions being incorporated into an AI model’s training data. Because these modifications do not change the actual difficulty of the mathematical reasoning, the models would be expected to perform just as well on GSM-Symbolic as they do on GSM8K.

However, when the researchers evaluated more than 20 state-of-the-art LLMs on GSM-Symbolic, average accuracy dropped across the board compared to GSM8K, with declines ranging from 0.3 percent to 9.2 percent depending on the model. The results also showed high variance across 50 separate runs of GSM-Symbolic with different names and values: gaps of up to 15 percent accuracy between the best and worst runs were often observed within a single model. Interestingly, changing the numbers tended to hurt accuracy more than altering the names.
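The two figures reported here, the mean accuracy over repeated runs and the gap between the best and worst run, can be summarized as follows. This is a minimal sketch, not the paper’s evaluation code, and the per-run numbers are made up for illustration.

```python
def accuracy_spread(run_accuracies):
    """Return (mean accuracy, best-to-worst gap) over repeated benchmark runs."""
    mean = sum(run_accuracies) / len(run_accuracies)
    gap = max(run_accuracies) - min(run_accuracies)
    return mean, gap

# Hypothetical per-run accuracies (percent) for one model over 5 runs.
runs = [82.0, 88.5, 79.0, 91.0, 85.5]
mean, gap = accuracy_spread(runs)  # a 12-point gap between best and worst runs
```

A large gap on runs that differ only in names and numbers is the signal the researchers flag: a model doing formal reasoning should be insensitive to such surface changes.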

This level of variance—both within different GSM-Symbolic runs and in comparison to GSM8K results—was unexpected, as the researchers point out, “the overall reasoning steps needed to solve a question remain the same.” These findings suggest that the models are not engaged in any “formal” reasoning but are instead attempting to match patterns and align given questions and solutions with similar ones from the training data.

Despite the variance in GSM-Symbolic tests, the overall impact was often small. OpenAI’s ChatGPT-4o, for example, dropped only slightly, from 95.2 percent accuracy on GSM8K to 94.9 percent on GSM-Symbolic, a high success rate on either benchmark regardless of whether the model employs “formal” reasoning (though accuracy for many models fell significantly when additional logical steps were added to the problems).

The performance of tested LLMs declined sharply when the Apple researchers altered the GSM-Symbolic benchmark by incorporating “seemingly relevant but ultimately inconsequential statements” into the questions. In this “GSM-NoOp” benchmark set, extraneous details, such as the size of picked kiwis, were included in the questions. Adding these red herrings led to “catastrophic performance drops” in accuracy, ranging from 17.5 percent to 65.7 percent, depending on the model. These significant accuracy declines highlight the inherent limitations of using simple “pattern matching” methods to convert statements into operations without genuinely understanding their meaning.
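A GSM-NoOp-style transformation can be sketched as inserting an inconsequential statement into an otherwise unchanged problem. The function name and the exact distractor wording below are illustrative assumptions; the kiwi example echoes the irrelevant-detail pattern described above, and the correct answer is unaffected by the added sentence.

```python
def add_noop(question,
             distractor="Five of the kiwis are a bit smaller than average."):
    """Insert an inconsequential statement before the final question sentence.

    The distractor mentions plausible-sounding but irrelevant detail;
    the arithmetic needed to answer the question is unchanged."""
    # Split off the final sentence so the distractor lands mid-problem.
    head, sep, tail = question.rpartition(". ")
    if not sep:  # single-sentence question: just prepend the distractor
        return distractor + " " + question
    return head + ". " + distractor + " " + tail

q = ("Oliver picks 44 kiwis on Friday and 24 on Saturday. "
     "How many kiwis does he have?")
modified = add_noop(q)  # the correct answer (68) is unchanged
```

Models that merely pattern-match tend to incorporate the distractor into their arithmetic (for example, subtracting the five smaller kiwis), which is exactly the failure mode the GSM-NoOp set was built to expose.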
