AI Achieves Human-Level Performance on General Intelligence Test
A new artificial intelligence (AI) model has made headlines by achieving results comparable to human intelligence on a recently developed test known as the ARC-AGI benchmark.
On December 20, 2024, OpenAI's o3 system scored an impressive 85%, well above the previous best AI score of 55% and on par with the average score of human participants on the same assessment. The model also scored well on a very challenging mathematics evaluation.
The quest to create artificial general intelligence (AGI) is a primary objective for major AI research labs worldwide. At first glance, this breakthrough points to significant progress towards achieving that goal.
Despite lingering skepticism, many in the AI research community feel that our understanding of AGI is shifting. For many, the realization of AGI now appears more immediate and pressing. But are these experts correct in their excitement?
Understanding General Intelligence
To comprehend the significance of the o3 results, we must delve into what the ARC-AGI test evaluates. Essentially, it measures an AI system's capacity for “sample efficiency,” which is defined as how effectively the system adapts to new challenges after being exposed to a limited number of examples.
For example, systems like ChatGPT (GPT-4) rely on massive datasets of human language to predict which words are likely to come next. While this method works reasonably well for common tasks, it struggles with less common scenarios, simply because less relevant data is available for them.
For AI systems to expand their utility beyond repetitive tasks and those with tolerable error margins, they need to learn and adapt from a few examples. This ability to tackle new problems successfully from limited data is often referred to as generalization—a fundamental aspect of intelligence.
The ARC-AGI Test Explained
The ARC-AGI benchmark tests for sample-efficient adaptation using grid-based puzzles. Each problem presents three example grids showing how an input pattern is transformed into an output; the system must infer the underlying rule from those three examples and apply it to a fourth grid.
These tasks resemble IQ test questions familiar to many.
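To make the task format concrete, here is a toy ARC-style problem sketched in Python. The specific grids and the `mirror` rule are invented for illustration, but real ARC-AGI tasks follow the same pattern: a few training input/output grid pairs plus a test input, with cells encoded as small integers representing colours.

```python
# A toy ARC-style task: each grid is a list of rows of colour codes (0-9).
# This particular task and rule are invented for illustration.
task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]},
        {"input": [[3, 0], [3, 0]], "output": [[0, 3], [0, 3]]},
    ],
    "test": [{"input": [[0, 0], [5, 0]]}],
}

def mirror(grid):
    """Candidate rule: mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# The rule is consistent with all three training pairs...
assert all(mirror(p["input"]) == p["output"] for p in task["train"])
# ...so we apply it to the unseen test input.
print(mirror(task["test"][0]["input"]))  # [[0, 0], [0, 5]]
```

Three examples is very little data by modern machine-learning standards, which is exactly what makes the benchmark a test of sample efficiency rather than memorization.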
Adapting with Weak Rules
While we don't know the detailed mechanics behind OpenAI's achievement, the results suggest that the o3 model demonstrates remarkable adaptability. It appears capable of discovering generalizable rules from just a handful of examples.
Using the simplest rules that still do the job, known as "weak rules", is crucial for adaptability in these situations. In plain terms, a weak rule is one that fits the examples while making as few assumptions as possible.
For instance, a rule derived from the problems might be: “Any shape with an extended line moves to the line's end and obscures overlapping shapes.”
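The preference for weak rules can be sketched in code. In the toy example below, two invented rules both fit a single training pair, but only the weaker one, the one making fewer assumptions, transfers to a new grid. Nothing here reflects how o3 actually represents rules; it only illustrates why weakness matters for generalization.

```python
# Two candidate rules that both fit the training pair. The "strong" rule
# hard-codes the example; the "weak" rule assumes only that non-zero
# cells slide to the end of their row. Both rules are invented.
train = [([[4, 0, 0]], [[0, 0, 4]])]

def rule_weak(grid):
    """Move every non-zero cell to the end of its row (few assumptions)."""
    return [sorted(row, key=lambda c: c != 0) for row in grid]

def rule_strong(grid):
    """Memorize the training pair exactly (many assumptions)."""
    return [[0, 0, 4]] if grid == [[4, 0, 0]] else grid

# Both rules reproduce the training data...
assert all(r(g) == out for g, out in train for r in (rule_weak, rule_strong))
# ...but only the weak rule transfers to an unseen grid.
print(rule_weak([[0, 7, 0]]))    # [[0, 0, 7]]
print(rule_strong([[0, 7, 0]]))  # [[0, 7, 0]] -- the memorized rule fails
```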
Thinking in Chains
While it remains unclear how OpenAI achieved this performance, it is likely that the design of the o3 system enables it to identify weak rules effectively. The o3 model began as a general-purpose system with enhanced problem-solving capabilities, then underwent specific training for the ARC-AGI test.
According to French AI researcher François Chollet, the o3 model likely searches through different "chains of thought" describing steps toward a solution, then selects the best one according to some heuristic. This is not dissimilar to how Google's AlphaGo searched through possible sequences of moves when it beat a world champion at Go.
The chains of thought can be thought of as candidate step-by-step methods for solving a problem. Several of them may be valid, so a heuristic is needed to choose among them, for example preferring the simplest chain, or the one embodying the weakest rule.
If o3 functions similarly to AlphaGo, it implies considerable sophistication, where AI is trained to evaluate move sequences and determine their effectiveness.
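Under that interpretation, each "chain" is a candidate sequence of steps, and the heuristic picks among the sequences that work. A minimal sketch, assuming a toy set of invented grid primitives and a "shortest chain first" heuristic (nothing here reflects o3's actual internals): enumerate chains in order of length and return the first one consistent with the training examples.

```python
from itertools import product

# Toy "chain of thought" search: enumerate short sequences of primitive
# grid operations and keep the first (shortest) chain that reproduces
# every training pair. Primitives and task are invented for illustration.
def flip_h(grid):
    """Mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]

def flip_v(grid):
    """Mirror the grid top-to-bottom."""
    return list(reversed(grid))

PRIMITIVES = [flip_h, flip_v]

def run(chain, grid):
    """Apply a sequence of operations to a grid, in order."""
    for op in chain:
        grid = op(grid)
    return grid

def search(train, max_len=3):
    """Return the shortest chain consistent with all training pairs."""
    for length in range(1, max_len + 1):       # shortest chains first
        for chain in product(PRIMITIVES, repeat=length):
            if all(run(chain, g) == out for g, out in train):
                return chain
    return None

train = [([[1, 2], [3, 4]], [[4, 3], [2, 1]])]  # a 180-degree rotation
chain = search(train)
print([op.__name__ for op in chain])  # ['flip_h', 'flip_v']
```

The point of the heuristic is that many chains may fit the examples; preferring short chains is one crude stand-in for preferring weak rules.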
The Path Ahead
The pressing question remains: does this advancement bring us closer to true AGI? If the mechanics behind o3 are accurately understood, it’s possible that this model isn't fundamentally superior to its predecessors. Instead, it could simply showcase improved generalization through targeted training approaches.
Most details about o3's operations are still undisclosed, given that OpenAI has so far limited its communication to a few media presentations and early testing by a small group of external researchers. A thorough examination of the model's specific capabilities, including how often it fails and how often it succeeds, is necessary to appreciate its full potential.
When o3 becomes widely available, we will gain a clearer picture of its adaptability compared to an average human. Should it prove to be highly adaptable, it could herald a transformative era of self-adjusting intelligence that redefines economic landscapes. In such a case, reevaluating the benchmarks for AGI and contemplating governance strategies will be essential.
Conversely, if the benchmark performance does not translate into broad adaptability, the result still represents impressive progress, but everyday life will continue much as usual.