GDPval: OpenAI's Transparent Study on Performance of AI in Real-World Tasks
In a landmark study using its new real-world benchmark, GDPval, OpenAI transparently revealed that a competitor's model, Anthropic's Claude Opus 4.1, outperformed its own advanced AI on professional tasks, signaling a shift toward evidence-based evaluation and a future of specialized human-AI collaboration.
In a significant move toward greater industry transparency, OpenAI has released a new benchmark called GDPval, designed to evaluate artificial intelligence on tangible, economically important professional tasks. The study's initial findings delivered a surprising result: a competing model, Anthropic's Claude Opus 4.1, surpassed OpenAI's own advanced GPT-5 in performance. This transparent self-assessment marks a shift away from promotional claims and toward a more grounded, evidence-based analysis of AI's practical capabilities in the workplace.
Unlike traditional academic tests, GDPval measures performance on 1,320 tasks sourced directly from the daily work of seasoned professionals across 44 key occupations. These scenarios, crafted by experts, are designed to mirror the complexity of actual job assignments, often including reference files and requiring sophisticated outputs such as diagrams and reports. When judged by human experts, Claude Opus 4.1's output was deemed equal to or better than a human professional's in 47.6% of the tasks. This score brings it remarkably close to the 50% threshold that OpenAI considers the benchmark for parity with human experts. In comparison, OpenAI's advanced GPT-5 model achieved a win-or-tie rate of 38.8%.
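The headline numbers above are win-or-tie rates: the fraction of tasks where expert graders judged the model's deliverable as good as or better than the human professional's. A minimal sketch of that calculation, assuming per-task judgments are reduced to simple "win"/"tie"/"loss" labels (GDPval's actual grading pipeline is more involved):

```python
from collections import Counter

def win_or_tie_rate(judgments):
    """Fraction of tasks where the model's output was judged equal to
    or better than the human professional's baseline deliverable.
    `judgments` is a list of "win" / "tie" / "loss" labels, one per task.
    (Illustrative only, not the benchmark's real grading code.)"""
    counts = Counter(judgments)
    favorable = counts["win"] + counts["tie"]
    return favorable / len(judgments)

# Toy example: 3 wins, 2 ties, 5 losses across 10 tasks
labels = ["win"] * 3 + ["tie"] * 2 + ["loss"] * 5
print(f"{win_or_tie_rate(labels):.1%}")  # → 50.0%
```

Under this framing, a score of 50% means the model's work is, on average, indistinguishable in quality from an expert's, which is why OpenAI treats that threshold as the parity mark.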
The results also highlight the enduring value of human expertise, as professionals produced superior work in the remaining 52.4% of cases. AI models showed limitations in areas demanding nuanced judgment, contextual understanding, and the interpretation of implicit requirements, skills that are developed through experience. This points toward a future centered on human-AI collaboration, where AI handles routine work, freeing up human experts for strategic oversight and complex problem-solving.
A key insight from the study is the emergence of specialized AI strengths. Claude Opus 4.1 showed a particular aptitude for "aesthetics," producing well-formatted documents, while GPT-5 excelled in "accuracy," precisely following instructions. This distinction, also observed in real-world coding tests, suggests that businesses may adopt a "multi-model" approach, selecting the best AI for specific tasks, much like assembling a team of human specialists.
By linking AI performance to economic value, the GDPval study provides a concrete framework for understanding the technology's impact. While it confirms AI's potential to boost productivity, it also issues a critical warning. As AI capabilities advance, educational and professional training programs must evolve to focus on skills where humans maintain a clear advantage, such as high-level strategy and creative problem-solving, to prevent a future "crisis of competence."