Quick answers based on verified benchmarks from vellum.ai. Find the perfect model for your task.
Data sourced from vellum.ai and verified providers

Don't overthink it. Here's what to use based on real benchmarks.
82% on SWE-bench. Best for agentic coding, debugging, and code generation.
100% on AIME 2025. The only model with a perfect score on the high-school math competition.
95.4% on GPQA Diamond. The highest score on this graduate-level science reasoning benchmark.
45.8% on Humanity's Last Exam (HLE). The top score on this broad, expert-written exam.
68.8% on ARC-AGI 2. Leader in abstract visual reasoning.
2,600 tokens/sec. The fastest model that still delivers strong output quality.
We aggregate benchmark data from verified providers to help you make informed decisions.
We track benchmarks from vellum.ai, model providers, and independent evaluators.
SWE-bench for coding, AIME for math, GPQA Diamond for scientific reasoning, ARC-AGI for abstract visual reasoning.
Rankings updated monthly as new models and benchmarks are released.
Token costs, latency, and throughput data to optimize your budget.
Full rankings across all benchmarks from verified providers.
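The cost data above reduces to simple arithmetic. A minimal sketch of a per-request cost estimate, assuming hypothetical per-million-token prices (the function name and the $3/$15 rates are illustrative, not real provider pricing):

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Estimated cost of one request in dollars, given per-million-token prices."""
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

# Hypothetical rates: $3 per 1M input tokens, $15 per 1M output tokens.
cost = request_cost(2_000, 500, 3.0, 15.0)
print(f"${cost:.4f}")  # 0.006 (input) + 0.0075 (output) = $0.0135
```

Multiplying this per-request figure by expected daily volume gives a quick budget ceiling before committing to a model.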
Score: 95.4% on GPQA Diamond. The leader in complex scientific reasoning.
Score: 100% on AIME 2025. A perfect score on the high-school math competition.
Score: 82% on SWE-bench. The top choice for agentic coding tasks.
Score: 45.8% on Humanity's Last Exam. The highest score on this comprehensive exam.
Score: 68.8% on ARC-AGI 2. The leader in abstract visual reasoning.
Speed: 2,600 tokens/sec. Exceptional throughput for high-volume applications.
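A throughput figure like 2,600 tokens/sec translates directly into a latency estimate for streamed output. A minimal sketch (the function and the 1,000-token example are illustrative, and ignore time-to-first-token):

```python
def generation_time(tokens, tokens_per_sec=2600):
    """Seconds to stream `tokens` output tokens at a given sustained throughput."""
    return tokens / tokens_per_sec

# A 1,000-token response at 2,600 tokens/sec:
print(f"{generation_time(1_000):.2f} s")  # ~0.38 s
```

In practice, real latency also includes network overhead and the initial time-to-first-token, so treat this as a lower bound.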
Everything you need to know about choosing the right LLM.