Apple Researchers Challenge AI Reasoning Claims With Controlled Puzzle Tests

Apple researchers have found that advanced AI models, such as OpenAI's o3-mini and Claude 3.7, suffer a complete performance collapse when tested in controlled puzzle environments. The study used puzzles like Tower of Hanoi and river crossing, rather than standard mathematical benchmarks, to examine the models' performance. At low complexity levels, standard language models outperformed their reasoning-enhanced counterparts. At medium complexity, reasoning models demonstrated advantages, but both types of models experienced complete accuracy collapse at high complexity levels. The researchers found that reasoning models reduced their computational effort as problems became more difficult, despite operating well below their token generation limits. Even when provided with explicit solution algorithms, the models' performance failed to improve significantly. The researchers also noted inconsistencies in how models applied learned strategies across problem scales: some models successfully handled 100-move sequences in one puzzle type while failing after just five moves in simpler scenarios. The findings contradict conventional assumptions about AI reasoning progress and raise questions about the true reasoning capabilities of large language models.
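Tower of Hanoi is a natural test case here because it has a well-known optimal recursive solution, so a model's move sequence can be checked exactly against the correct one. As a point of reference, a minimal sketch of that classic algorithm (illustrative code, not from the Apple paper; the function name and interface are assumptions) looks like:

    # Classic recursive Tower of Hanoi solver -- the kind of explicit
    # algorithm the summary says was handed to the models.
    def hanoi(n, source, target, spare, moves):
        """Append the optimal move sequence for n disks to `moves`."""
        if n == 0:
            return
        hanoi(n - 1, source, spare, target, moves)  # clear n-1 disks off the largest
        moves.append((source, target))              # move the largest disk
        hanoi(n - 1, spare, target, source, moves)  # restack the n-1 disks on top

    moves = []
    hanoi(3, "A", "C", "B", moves)
    print(len(moves))  # 7, i.e. 2**3 - 1

Because the optimal solution requires exactly 2**n - 1 moves, adding disks scales the puzzle's difficulty in a precisely controlled way, which is what makes the researchers' complexity-level comparisons possible.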
apple.slashdot.org