GPQA Diamond Benchmark

Full 198-question evaluation on PhD-level science questions

March 2026 · Full 198-question dataset · Independent evaluation
85.9% · Pipeline v2 Accuracy · 170/198 correct on GPQA Diamond
+19.2pp · vs Single Model · pipeline beats the best single mid-tier model
7.3:1 · SAVE:HURT Ratio · 44 saves, 6 hurts, net +38 questions
$0.16 · Per Question · 20x cheaper than frontier models

The Core Insight

The Svarix Intelligence OS orchestrates multiple AI models through a proprietary pipeline of domain-adaptive analysis, structured debate, and evidence-based synthesis. Rather than relying on a single model's answer, the system identifies and selects the strongest reasoning chain across diverse analytical perspectives.

The result: 85.9% accuracy, exceeding GPT-5.3's published 81.0% at a fraction of the cost. The 7.3:1 SAVE:HURT ratio means that for every error the pipeline introduces, it rescues more than 7 correct answers that a single model would miss.

GPQA Diamond Leaderboard

198 graduate-level science questions in Physics, Chemistry, and Biology, designed to be answerable only by domain experts. Even PhD holders outside their specialty score below 35%.

Model · Score · Cost Tier
Gemini 3.1 Pro (published benchmark) · 94.3% · Premium
Claude Opus 4.6 (published benchmark) · 91.3% · Premium
Svarix Intelligence OS v2 (170/198, proprietary multi-model orchestration) · 85.9% · Budget
GPT-5.3 (published benchmark) · 81.0% · Premium
Svarix Intelligence OS v1 (153/198, original pipeline) · 77.3% · Budget
Best Single Model baseline (132/198, single mid-tier model) · 66.7% · Mid-tier
Majority Vote, 4 models (naive aggregation baseline) · 65.2% · Mid-tier

Per-Domain Accuracy

Domain · Questions · Single Model · Pipeline · Gain
Physics · 86 · 84.9% · 95.3% · +10.4pp
Chemistry · 93 · 54.8% · 78.5% · +23.7pp
Biology · 19 · 42.1% · 78.9% · +36.8pp

Biology: +36.8 percentage points, with the pipeline nearly doubling accuracy from 42% to 79%. Chemistry saw the largest v2 improvement in raw question count at +23.7pp (roughly 22 additional correct answers). Physics reaches 95.3%, within striking distance of frontier models.
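As a consistency check, the per-domain correct counts implied by the published percentages (our back-calculation, not official per-question data) sum to the overall 170/198 score:

```python
# Back-calculated correct counts per domain: (questions, pipeline correct)
domains = {
    "Physics":   (86, 82),  # 95.3% of 86
    "Chemistry": (93, 73),  # 78.5% of 93
    "Biology":   (19, 15),  # 78.9% of 19
}

total_questions = sum(n for n, _ in domains.values())
total_correct = sum(c for _, c in domains.values())
accuracy = round(100 * total_correct / total_questions, 1)
print(total_correct, total_questions, accuracy)  # 170 198 85.9
```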

Head-to-Head: Pipeline vs Single Model

126 · Both Correct · 63.6%
44 · Pipeline SAVE · 22.2% · pipeline rescues wrong single-model answers
6 · Pipeline HURT · 3.0% · single model right, pipeline wrong
22 · Both Wrong · 11.1%

Why It Works

The pipeline's 7.3:1 SAVE:HURT ratio means it almost never makes things worse. For every error introduced, it rescues over 7 answers that a single model would get wrong.
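The arithmetic behind these headline numbers is straightforward and can be verified directly from the save/hurt counts:

```python
saves, hurts, total = 44, 6, 198

ratio = saves / hurts          # SAVE:HURT ratio
net = saves - hurts            # net questions gained over the baseline
delta_pp = 100 * net / total   # accuracy delta vs the single-model baseline

print(round(ratio, 1), net, round(delta_pp, 1))  # 7.3 38 19.2
```

The +19.2pp figure quoted against the single-model baseline is exactly this net gain of 38 questions out of 198.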

This isn't luck — it's the result of proprietary orchestration that identifies the strongest reasoning chain even when it comes from a minority perspective.

Pipeline vs Majority Vote

Simple majority vote: 65.2%. Svarix Pipeline: 85.9%. Delta: +20.7pp.

Naive aggregation fails because it treats all models equally. The Svarix pipeline evaluates reasoning quality, not just answer frequency — a fundamentally different approach.
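To make the contrast concrete, here is a minimal sketch of the naive majority-vote baseline (our illustration, not the Svarix implementation). Frequency-based aggregation picks whichever answer most models give, so a correct minority answer is always outvoted:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Naive aggregation: most frequent answer wins; reasoning is ignored."""
    return Counter(answers).most_common(1)[0][0]

# Three models agree on a plausible-but-wrong choice; one minority model
# reasons correctly. Majority vote still returns the majority answer.
print(majority_vote(["B", "B", "C", "B"]))  # B
```

A quality-weighted selector would instead inspect each model's reasoning chain before choosing, which is the behavior the pipeline's +20.7pp margin over majority vote reflects.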

How It Works

1. Multi-Model Analysis

Multiple AI models with diverse training data independently analyze the problem, each bringing different strengths and perspectives.

2. Structured Deliberation

A proprietary debate-and-reflection process challenges weak reasoning and surfaces the strongest arguments through adversarial verification.

3. Intelligent Synthesis

Domain-adaptive synthesis reconstructs the answer from the combined evidence, identifying correct reasoning even when it comes from a minority analyst.

The pipeline uses 4 models from different providers — ensuring diversity of training data, reasoning patterns, and failure modes. No single-vendor dependency. Full audit trail at every step for regulated industries.
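The three stages above can be sketched as a generic orchestration skeleton. This is a hypothetical illustration with names of our own choosing (`Analysis`, `run_pipeline`, the `analyze`/`debate`/`synthesize` hooks); the actual Svarix pipeline is proprietary and its internals are not published:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Analysis:
    model: str
    answer: str
    confidence: float  # stand-in for a reasoning-quality signal

def run_pipeline(question: str,
                 models: list[str],
                 analyze: Callable[[str, str], Analysis],
                 debate: Callable[[list[Analysis]], list[Analysis]],
                 synthesize: Callable[[list[Analysis]], str]) -> str:
    # Step 1: independent analysis by models from different providers
    analyses = [analyze(m, question) for m in models]
    # Step 2: structured deliberation filters out weak reasoning chains
    surviving = debate(analyses)
    # Step 3: synthesis selects the strongest chain, even a minority one
    return synthesize(surviving)
```

Note that because the final answer comes from the strongest surviving chain rather than a vote, a single dissenting model can carry the result, which is exactly what the SAVE statistics measure.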

Cost Efficiency

$0.16 · per question, full pipeline
20x · cheaper than frontier single models
+4.9pp · above GPT-5.3 at a fraction of the cost

The Svarix pipeline achieves frontier-competitive accuracy using only mid-tier and budget models. No access to premium APIs (Opus, Gemini Pro) is required. As individual models improve, the pipeline benefits automatically without architectural changes.

Enterprise Use Cases

Legal & Compliance

Multi-perspective regulatory analysis with adversarial challenge and full audit trail.

Healthcare & Life Sciences

Differential diagnosis support with cross-model verification from diverse architectures.

Financial Analysis

Complex research synthesis with fact-checking, source verification, and adversarial challenge.

Education & Research

PhD-level tutoring and hypothesis evaluation at scale — nearly 2x accuracy on Biology questions.

Enterprise Decision Support

Strategic analysis with transparent reasoning chains visible at every pipeline step.

Methodology

Dataset: All 198 questions from the official GPQA Diamond dataset (Idavidrein/gpqa on HuggingFace). Answer positions randomized with seed=42 for reproducibility. Distribution: 86 Physics, 93 Chemistry, 19 Biology.
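A deterministic answer-position shuffle of the kind described can be implemented as below. This is a sketch under our own assumptions: the function name and the exact seeding scheme (a single global seed vs. a per-question seed derived from 42) are ours, not taken from the evaluation harness:

```python
import random

def shuffle_answers(choices: list[str], correct_idx: int, seed: int = 42):
    """Deterministically permute answer positions for one question.

    Returns the shuffled choices and the new index of the correct answer,
    so repeated runs with the same seed are reproducible.
    """
    order = list(range(len(choices)))
    random.Random(seed).shuffle(order)
    shuffled = [choices[i] for i in order]
    return shuffled, order.index(correct_idx)
```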

Baseline: Each question answered by a single mid-tier model call with standardized prompting. This establishes what one model can do alone.

Pipeline: The same question run through our full multi-step pipeline with multiple analysts from different providers, structured debate, convergence analysis, and adversarial verification.

Scoring: Exact match against official correct answer. A “pipeline save” is when the single model got it wrong but the pipeline got it right. A “pipeline hurt” is the reverse.
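The save/hurt bookkeeping described above reduces to a four-way classification per question; a minimal sketch:

```python
def classify(baseline_correct: bool, pipeline_correct: bool) -> str:
    """Label one question's outcome for the head-to-head comparison."""
    if baseline_correct and pipeline_correct:
        return "both_correct"
    if pipeline_correct:
        return "save"        # single model wrong, pipeline right
    if baseline_correct:
        return "hurt"        # single model right, pipeline wrong
    return "both_wrong"
```

Tallying these labels over all 198 questions yields the 126/44/6/22 breakdown reported in the head-to-head section.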

Full Technical Paper

Detailed analysis including per-model breakdowns, emergence classification, pipeline architecture, and cost-performance comparison against published frontier scores. Available to qualified enterprises and research partners.

Request Access

Try the Intelligence OS

The same pipeline that achieves these results is available for your questions.

No vendor lock-in. Full audit trail. Multi-provider redundancy. Improves automatically as models improve.