GPQA Diamond Benchmark

Full 198-question evaluation on PhD-level science questions

March 2026 · Full 198-question dataset · Independent evaluation
85.9% · Pipeline v2 Accuracy · 170/198 correct on GPQA Diamond
+19.2pp · vs Single Model · pipeline beats the best single mid-tier model
7.3:1 · SAVE:HURT Ratio · 44 saves, 6 hurts, net +38 questions
$0.16 · Per Question · 20x cheaper than frontier models

The Core Insight

The Svarix Intelligence OS orchestrates multiple AI models through a proprietary pipeline of domain-adaptive analysis, structured debate, and evidence-based synthesis. Rather than relying on a single model's answer, the system identifies and selects the strongest reasoning chain across diverse analytical perspectives.

The result: 85.9% accuracy, exceeding GPT-5.3's published 81.0% at a fraction of the cost. The 7.3:1 SAVE:HURT ratio means that for every error the pipeline introduces, it rescues more than 7 correct answers that a single model would miss.

GPQA Diamond Leaderboard

198 graduate-level science questions in Physics, Chemistry, and Biology, designed to be answerable only by domain experts. Even PhD holders outside their specialty score below 35%.

Model · Score · Cost Tier
Gemini 3.1 Pro (published benchmark) · 94.3% · Premium
Claude Opus 4.6 (published benchmark) · 91.3% · Premium
Svarix Intelligence OS v2 (170/198, proprietary multi-model orchestration) · 85.9% · Budget
GPT-5.3 (published benchmark) · 81.0% · Premium
Svarix Intelligence OS v1 (153/198, original pipeline) · 77.3% · Budget
Best Single Model baseline (132/198, single mid-tier model) · 66.7% · Mid-tier
Majority Vote, 4 models (naive aggregation baseline) · 65.2% · Mid-tier

Per-Domain Accuracy

Domain · Questions · Single Model · Pipeline · Gain
Physics · 86 · 84.9% · 95.3% · +10.4pp
Chemistry · 93 · 54.8% · 78.5% · +23.7pp
Biology · 19 · 42.1% · 78.9% · +36.8pp

Biology: +36.8 percentage points, with the pipeline nearly doubling accuracy from 42% to 79%. Chemistry saw the largest v2 improvement in raw question count at +23.7pp (roughly 22 additional correct answers). Physics reaches 95.3%, within striking distance of frontier models.
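As a consistency check, the per-domain correct counts implied by the published percentages (our back-calculation, not official per-question data) sum to the overall 170/198 score:

```python
# Back-calculated correct counts per domain: (questions, pipeline correct)
domains = {
    "Physics":   (86, 82),  # 95.3% of 86
    "Chemistry": (93, 73),  # 78.5% of 93
    "Biology":   (19, 15),  # 78.9% of 19
}

total_questions = sum(n for n, _ in domains.values())
total_correct = sum(c for _, c in domains.values())
accuracy = round(100 * total_correct / total_questions, 1)
print(total_correct, total_questions, accuracy)  # 170 198 85.9
```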

Head-to-Head: Pipeline vs Single Model

126 · Both Correct · 63.6%
44 · Pipeline SAVE · 22.2% · pipeline rescues wrong single-model answers
6 · Pipeline HURT · 3.0% · single model right, pipeline wrong
22 · Both Wrong · 11.1%

Why It Works

The pipeline's 7.3:1 SAVE:HURT ratio means it almost never makes things worse. For every error introduced, it rescues over 7 answers that a single model would get wrong.
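The arithmetic behind these headline numbers is straightforward and can be verified directly from the save/hurt counts:

```python
saves, hurts, total = 44, 6, 198

ratio = saves / hurts          # SAVE:HURT ratio
net = saves - hurts            # net questions gained over the baseline
delta_pp = 100 * net / total   # accuracy delta vs the single-model baseline

print(round(ratio, 1), net, round(delta_pp, 1))  # 7.3 38 19.2
```

The +19.2pp figure quoted against the single-model baseline is exactly this net gain of 38 questions out of 198.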

This isn't luck — it's the result of proprietary orchestration that identifies the strongest reasoning chain even when it comes from a minority perspective.

Pipeline vs Majority Vote

Simple majority vote: 65.2%. Svarix Pipeline: 85.9%. Delta: +20.7pp.

Naive aggregation fails because it treats all models equally. The Svarix pipeline evaluates reasoning quality, not just answer frequency — a fundamentally different approach.
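To make the contrast concrete, here is a minimal sketch of the naive majority-vote baseline (our illustration, not the Svarix implementation). Frequency-based aggregation picks whichever answer most models give, so a correct minority answer is always outvoted:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Naive aggregation: most frequent answer wins; reasoning is ignored."""
    return Counter(answers).most_common(1)[0][0]

# Three models agree on a plausible-but-wrong choice; one minority model
# reasons correctly. Majority vote still returns the majority answer.
print(majority_vote(["B", "B", "C", "B"]))  # B
```

A quality-weighted selector would instead inspect each model's reasoning chain before choosing, which is the behavior the pipeline's +20.7pp margin over majority vote reflects.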

How It Works

1. Multi-Model Analysis

Multiple AI models with diverse training data independently analyze the problem, each bringing different strengths and perspectives.

2. Structured Deliberation

A proprietary debate-and-reflection process challenges weak reasoning and surfaces the strongest arguments through adversarial verification.

3. Intelligent Synthesis

Domain-adaptive synthesis reconstructs the answer from the combined evidence, identifying correct reasoning even when it comes from a minority analyst.

The pipeline uses 4 models from different providers — ensuring diversity of training data, reasoning patterns, and failure modes. No single-vendor dependency. Full audit trail at every step for regulated industries.
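The three stages above can be sketched as a generic orchestration skeleton. This is a hypothetical illustration with names of our own choosing (`Analysis`, `run_pipeline`, the `analyze`/`debate`/`synthesize` hooks); the actual Svarix pipeline is proprietary and its internals are not published:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Analysis:
    model: str
    answer: str
    confidence: float  # stand-in for a reasoning-quality signal

def run_pipeline(question: str,
                 models: list[str],
                 analyze: Callable[[str, str], Analysis],
                 debate: Callable[[list[Analysis]], list[Analysis]],
                 synthesize: Callable[[list[Analysis]], str]) -> str:
    # Step 1: independent analysis by models from different providers
    analyses = [analyze(m, question) for m in models]
    # Step 2: structured deliberation filters out weak reasoning chains
    surviving = debate(analyses)
    # Step 3: synthesis selects the strongest chain, even a minority one
    return synthesize(surviving)
```

Note that because the final answer comes from the strongest surviving chain rather than a vote, a single dissenting model can carry the result, which is exactly what the SAVE statistics measure.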

Cost Efficiency

$0.16 · per question, full pipeline
20x · cheaper than frontier single models
+4.9pp · above GPT-5.3 at a fraction of the cost

The Svarix pipeline achieves frontier-competitive accuracy using only mid-tier and budget models. No access to premium APIs (Opus, Gemini Pro) is required. As individual models improve, the pipeline benefits automatically without architectural changes.

Enterprise Use Cases

Legal & Compliance

Multi-perspective regulatory analysis with adversarial challenge and full audit trail.

Healthcare & Life Sciences

Differential diagnosis support with cross-model verification from diverse architectures.

Financial Analysis

Complex research synthesis with fact-checking, source verification, and adversarial challenge.

Education & Research

PhD-level tutoring and hypothesis evaluation at scale — nearly 2x accuracy on Biology questions.

Enterprise Decision Support

Strategic analysis with transparent reasoning chains visible at every pipeline step.

Methodology

Dataset: All 198 questions from the official GPQA Diamond dataset (Idavidrein/gpqa on HuggingFace). Answer positions randomized with seed=42 for reproducibility. Distribution: 86 Physics, 93 Chemistry, 19 Biology.
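A deterministic answer-position shuffle of the kind described can be implemented as below. This is a sketch under our own assumptions: the function name and the exact seeding scheme (a single global seed vs. a per-question seed derived from 42) are ours, not taken from the evaluation harness:

```python
import random

def shuffle_answers(choices: list[str], correct_idx: int, seed: int = 42):
    """Deterministically permute answer positions for one question.

    Returns the shuffled choices and the new index of the correct answer,
    so repeated runs with the same seed are reproducible.
    """
    order = list(range(len(choices)))
    random.Random(seed).shuffle(order)
    shuffled = [choices[i] for i in order]
    return shuffled, order.index(correct_idx)
```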

Baseline: Each question answered by a single mid-tier model call with standardized prompting. This establishes what one model can do alone.

Pipeline: The same question run through our full multi-step pipeline with multiple analysts from different providers, structured debate, convergence analysis, and adversarial verification.

Scoring: Exact match against official correct answer. A “pipeline save” is when the single model got it wrong but the pipeline got it right. A “pipeline hurt” is the reverse.
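The save/hurt bookkeeping described above reduces to a four-way classification per question; a minimal sketch:

```python
def classify(baseline_correct: bool, pipeline_correct: bool) -> str:
    """Label one question's outcome for the head-to-head comparison."""
    if baseline_correct and pipeline_correct:
        return "both_correct"
    if pipeline_correct:
        return "save"        # single model wrong, pipeline right
    if baseline_correct:
        return "hurt"        # single model right, pipeline wrong
    return "both_wrong"
```

Tallying these labels over all 198 questions yields the 126/44/6/22 breakdown reported in the head-to-head section.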

Full Technical Paper

Detailed analysis including per-model breakdowns, emergence classification, pipeline architecture, and cost-performance comparison against published frontier scores. Available to qualified enterprises and research partners.

Request Access

Try the Intelligence OS

The same pipeline that achieves these results is available for your questions.

No vendor lock-in. Full audit trail. Multi-provider redundancy. Improves automatically as models improve.