
OpenAI released GDPval, an evaluation suite featuring 1,320 tasks from 44 occupations across nine major GDP sectors to measure AI performance on real‑world work.
Early results indicate that top models such as Claude Opus 4.1 produce outputs as good as or better than those of human experts on roughly half of the tasks, completing them about 100× faster and 100× cheaper.
Social platforms buzzed with debates about job displacement and economic impact; LinkedIn posts about GDPval accumulated thousands of reactions from professionals.
Introduction
Most AI benchmarks evaluate narrow skills: translation, summarisation, coding. But what happens when you ask a large language model to perform an entire profession’s task? Yesterday, OpenAI unveiled GDPval, short for “Gross Domestic Product validation,” a sweeping benchmark designed to answer that question. The suite contains 1,320 tasks drawn from 44 distinct occupations across the nine sectors that make up most of the U.S. GDP. Think drafting a legal brief, preparing a marketing plan or doing a tax assessment. This bold move aims to shift the conversation from toy tasks to meaningful work – and early results are eye‑opening.
How GDPval Works
GDPval’s design mirrors real work. OpenAI collaborated with industry experts to curate tasks such as “Write a 10‑page strategic plan for a mid‑size retail company” or “Diagnose a hypothetical patient’s symptoms and recommend tests.” Each task includes detailed criteria for quality, accuracy and presentation. Human experts completed the tasks first, establishing a gold standard. Then frontier AI models – including GPT‑5, Claude Opus 4.1 and Mistral Ultra – tackled the same tasks under strict time and cost constraints. Independent evaluators graded the outputs blind, never knowing whether a given deliverable came from a human or a model.
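OpenAI has not published its grading harness, but the blinded setup is easy to picture in code. The sketch below is a hypothetical illustration of pairwise blind grading, in which the grader never learns which output came from the model; the names and the toy grader are assumptions for illustration, not anything from GDPval itself:

```python
import random
from dataclasses import dataclass

@dataclass
class Submission:
    author: str   # "expert" or "model" -- hidden from the grader
    text: str

def blind_pairwise_grade(expert: Submission, model: Submission, grader) -> str:
    """Show the grader both outputs in random order and report who won.

    `grader` is any callable that takes two anonymous texts and returns
    "A", "B", or "tie". It never sees authorship.
    """
    pair = [expert, model]
    random.shuffle(pair)  # hide authorship behind a random A/B assignment
    verdict = grader(pair[0].text, pair[1].text)
    if verdict == "tie":
        return "tie"
    winner = pair[0] if verdict == "A" else pair[1]
    return winner.author

# Hypothetical usage: a deliberately naive grader that prefers longer output.
naive_grader = lambda a, b: "A" if len(a) > len(b) else "B"
result = blind_pairwise_grade(
    Submission("expert", "10-page strategic plan ..."),
    Submission("model", "12-page strategic plan ..."),
    naive_grader,
)
print(result)  # "expert", "model", or "tie"
```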
This evaluation covers nine sectors: healthcare, education, finance, law, media, technology, retail, manufacturing and government. Within each, tasks vary in complexity and domain knowledge. For example, in healthcare, models must interpret lab results, propose treatment plans and communicate findings clearly. In law, they must draft contracts, summarise case law and spot inconsistencies. The goal is to approximate the diversity of tasks that contribute to GDP.
Early Results
The headline statistic is striking: frontier models produced outputs as good as or better than industry experts’ on roughly half of the tasks. In many cases, models completed the work about 100× faster and at around 1 % of the cost. In other words, an assignment that took a human expert eight hours might take a model five minutes and a few dollars of compute. Claude Opus 4.1 performed particularly well, edging out GPT‑5 in some legal and finance tasks, while GPT‑5 dominated creative writing and programming tasks. Mistral Ultra lagged behind both but still surpassed baseline models.
OpenAI emphasises that these results do not mean models can replace experts outright. The tasks were completed with human oversight; for high‑risk sectors like healthcare and law, a professional reviewed and corrected the AI outputs. Still, the performance suggests that AI could drastically augment human productivity. A marketing analyst could delegate first drafts to a model, freeing them to focus on strategy. A paralegal could use AI to generate contract templates and then tailor them.
Efficiency and Cost
One of the most surprising findings is the sheer efficiency of AI models compared with humans. According to OpenAI’s blog, models completed tasks roughly 100 times faster than experts. They also cost roughly 100 times less, measured by comparing compute expenses with the occupation’s average hourly wage. These numbers are approximations, but they illustrate why businesses are paying attention. For routine tasks such as summarising financial statements or drafting letters, the cost‑benefit ratio is enormous.
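To make those ratios concrete, here is a back‑of‑the‑envelope calculation. The wage and compute figures are illustrative assumptions, not numbers from the GDPval paper:

```python
# Back-of-the-envelope comparison (all figures are illustrative assumptions).
human_hours = 8.0          # time an expert spends on the task
human_wage = 50.0          # assumed average hourly wage for the occupation ($)
model_minutes = 5.0        # time the model takes on the same task
model_compute_cost = 4.0   # assumed compute/API cost for the run ($)

human_cost = human_hours * human_wage              # $400
speedup = (human_hours * 60) / model_minutes       # 480 min / 5 min ~ 96x
cost_ratio = human_cost / model_compute_cost       # $400 / $4 = 100x

print(f"Speed advantage: ~{speedup:.0f}x faster")
print(f"Cost advantage:  ~{cost_ratio:.0f}x cheaper")
```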
The chart below visualises the trade‑off between quality and cost across a sample of tasks. The horizontal axis shows relative cost (log scale), and the vertical axis shows quality relative to expert baselines. The red dots (expert work) cluster near high cost and high quality. The green dots (Claude Opus 4.1) and blue dots (GPT‑5) hover near the expert line at a fraction of the cost. This demonstrates the potential for AI to deliver comparable quality at dramatically lower cost.
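For readers who want to build a similar quality‑versus‑cost view from their own evaluation results, the matplotlib sketch below shows one way to do it. The plotted points are made‑up placeholders, not GDPval measurements:

```python
import matplotlib.pyplot as plt

# Illustrative (relative cost, quality vs. expert baseline) pairs -- not real data.
series = {
    "Expert":          ("red",   [(1.0, 1.00), (0.9, 0.98), (1.1, 1.02)]),
    "Claude Opus 4.1": ("green", [(0.010, 0.97), (0.012, 1.01), (0.009, 0.95)]),
    "GPT-5":           ("blue",  [(0.011, 0.96), (0.013, 0.99), (0.010, 0.93)]),
}

fig, ax = plt.subplots()
for label, (color, points) in series.items():
    xs, ys = zip(*points)
    ax.scatter(xs, ys, c=color, label=label)

ax.set_xscale("log")                # cost spans orders of magnitude
ax.axhline(1.0, ls="--", c="gray")  # expert-quality baseline
ax.set_xlabel("Relative cost (log scale)")
ax.set_ylabel("Quality relative to expert baseline")
ax.legend()
plt.show()
```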
Reactions and Implications
GDPval quickly became a trending topic on LinkedIn and X. Professionals in finance and law expressed amazement at the models’ efficiency while emphasising that human judgment remains irreplaceable. Ethicists and labour economists raised alarms about potential job displacement. A viral LinkedIn post by a management consultant asked, “If AI can write my strategic plans in five minutes, what do I charge for?”; it attracted over 5,000 reactions. On Reddit, users debated whether benchmarks like GDPval would accelerate or slow down AI regulation; some argued that demonstrating productivity gains could persuade policymakers to embrace AI, while others feared it would prompt protective measures.
Limitations and Next Steps
OpenAI acknowledges several limitations. First, tasks in GDPval are still simplified compared to real jobs; they don’t involve interpersonal dynamics or long‑term accountability. Second, the evaluation doesn’t cover all occupations – it omits fields such as construction and hospitality. Third, quality assessments are subjective, and even expert reviewers may disagree. OpenAI plans to expand GDPval to include more sectors and to collaborate with industry groups. Meanwhile, critics call for transparency in how tasks were selected and graded, warning that benchmark design can skew perceptions of model capability.
Why This Matters
GDPval shifts the conversation from benchmark scores to economic impact. It quantifies how close AI models are to performing economically meaningful work and invites stakeholders to weigh the benefits against societal costs. Businesses might see opportunities to cut costs; workers might fear displacement; regulators might consider new standards for AI in professional services. Regardless of one’s stance, the benchmark highlights that AI is no longer just writing poetry or answering trivia – it’s drafting legal memos and analysing financial data. The next debate will be about how to distribute the productivity gains. Whether in professional services via GDPval or in information access via Perplexity’s developer-first Search API, the common thread is how AI shifts economic value and raises new governance challenges.