All model metrics

Search, sort, and compare reference board models in the selected size class (direct vendor APIs).

How to read scores

Each number is a 0–100 checklist composite from our lab battery — not a real-world accuracy percentage or a user preference ranking.

Compile means the model passed the engine routing gates for deployment; it is not a product endorsement.

Fast models: one curated pick per major direct API vendor (OpenAI, Anthropic, xAI, Google Gemini, Mistral, DeepSeek).

Column groups: Identity, Capability, Safety, Performance, Status

| Model | Vendor | Deploy | Accuracy | Reasoning | Coding | Slop | Reliability | Cap. safety | Jailbreak | PII | Bias | Latency | Cost | Stability | Badges |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gpt-4.1-mini | OpenAI | 81.6% | 48.3% | 8.3% | 60% | 0% | 100% | 83.3% | 100% | 0% | 0% | 74.7% | 50% | 100% | Meets compile bar, Conditional |
| claude-haiku-4-5 | Anthropic | 78% | 44.2% | 16.7% | 60% | 0% | 100% | 83.3% | 91.7% | 0% | 0% | 56.8% | 50% | 100% | Meets compile bar, Approved |
| grok-3-mini | xAI | 54.8% | 52.1% | 28.3% | 60% | 100% | 80% | 100% | 91.7% | 0% | 0% | 49.2% | 50% | 100% | Below compile bar, Conditional |
| mistral-small-latest | Mistral | 54.4% | 52.5% | 25% | 60% | 0% | 80% | 66.7% | 83.3% | 0% | 0% | 80.1% | 50% | 100% | Below compile bar, Blocked |
| deepseek-v4-flash | DeepSeek | 45% | 40% | 0% | 60% | 33.3% | 80% | 100% | 91.7% | 0% | 0% | 0% | 50% | 100% | Below compile bar, Approved |
| gemini-2.5-flash | Google | 43.4% | 48.3% | 0% | 76.7% | 0% | 80% | 83.3% | 100% | 0% | 0% | 8.6% | 50% | 100% | Below compile bar, Blocked |
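
For readers who want to search, sort, or filter these rows offline, here is a minimal Python sketch. It copies only the Deploy and Latency scores and the two badges from the table. The `Row` class, its field names (including `status` for the second badge), and the `search` and `sort_by` helpers are hypothetical names chosen for illustration; they are not part of any published API, and the sketch does not reproduce how the scores or the compile bar are computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One table row; field names are illustrative, not an official schema."""
    model: str
    vendor: str
    deploy: float            # "Deploy" column, 0-100 composite
    latency: float           # "Latency" column, 0-100 composite
    meets_compile_bar: bool  # first badge
    status: str              # second badge (assumed name for this field)


# Values copied from the table above.
ROWS = [
    Row("gpt-4.1-mini", "OpenAI", 81.6, 74.7, True, "Conditional"),
    Row("claude-haiku-4-5", "Anthropic", 78.0, 56.8, True, "Approved"),
    Row("grok-3-mini", "xAI", 54.8, 49.2, False, "Conditional"),
    Row("mistral-small-latest", "Mistral", 54.4, 80.1, False, "Blocked"),
    Row("deepseek-v4-flash", "DeepSeek", 45.0, 0.0, False, "Approved"),
    Row("gemini-2.5-flash", "Google", 43.4, 8.6, False, "Blocked"),
]


def search(rows, text):
    """Case-insensitive substring match on model or vendor name."""
    needle = text.lower()
    return [r for r in rows
            if needle in r.model.lower() or needle in r.vendor.lower()]


def sort_by(rows, column, descending=True):
    """Return a new list sorted by the named Row field."""
    return sorted(rows, key=lambda r: getattr(r, column), reverse=descending)


if __name__ == "__main__":
    for r in sort_by(ROWS, "latency"):
        bar = "meets bar" if r.meets_compile_bar else "below bar"
        print(f"{r.model:22} {r.latency:5.1f}  {bar}, {r.status}")
```

Sorting on a different column only requires adding that column as a field on `Row` and passing its name to `sort_by`.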