Model benchmark leaderboard

Independent lab results for models we evaluate for AI orchestration — curated reference cohorts, not an exhaustive vendor catalog.

How to read scores

Each number is a 0–100 checklist composite from our lab battery, not a real-world accuracy percentage or a user preference ranking, even though the table displays it with a % sign.

Compile means the model passed the engine routing gates for deployment; it is not a product endorsement.

Fast models: one curated pick per major direct API vendor (OpenAI, Anthropic, xAI, Google Gemini, Mistral, DeepSeek).

Top models in this size class

Model                 Vendor     Size  Accuracy  Reasoning  Coding  Latency  Cost  Compile            Tested
mistral-small-latest  Mistral    Fast  52.5%     25%        60%     80.1%    50%   Below compile bar  4d ago
grok-3-mini           xAI        Fast  52.1%     28.3%      60%     49.2%    50%   Below compile bar  4d ago
gpt-4.1-mini          OpenAI     Fast  48.3%     8.3%       60%     74.7%    50%   Meets compile bar  4d ago
gemini-2.5-flash      Google     Fast  48.3%     0%         76.7%   8.6%     50%   Below compile bar  4d ago
claude-haiku-4-5      Anthropic  Fast  44.2%     16.7%      60%     56.8%    50%   Meets compile bar  4d ago
deepseek-v4-flash     DeepSeek   Fast  40%       0%         60%     0%       50%   Below compile bar  4d ago
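
The Accuracy ranking and the Compile column answer different questions, so it can help to filter on Compile first and rank second. The sketch below is one possible way to do that with the rows above. The Row class, the ROWS list, and the deployable function are hypothetical names introduced for illustration; they are not part of any published tooling, and the compile status is copied from the table rather than recomputed, since the gate criteria are not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One leaderboard row (hypothetical structure, for illustration only)."""
    model: str
    vendor: str
    accuracy: float          # 0-100 composite score from the Accuracy column
    meets_compile_bar: bool  # copied from the Compile column, not recomputed


# Values transcribed from the table above.
ROWS = [
    Row("mistral-small-latest", "Mistral", 52.5, False),
    Row("grok-3-mini", "xAI", 52.1, False),
    Row("gpt-4.1-mini", "OpenAI", 48.3, True),
    Row("gemini-2.5-flash", "Google", 48.3, False),
    Row("claude-haiku-4-5", "Anthropic", 44.2, True),
    Row("deepseek-v4-flash", "DeepSeek", 40.0, False),
]


def deployable(rows):
    """Return rows that meet the compile bar, highest Accuracy score first."""
    passing = (r for r in rows if r.meets_compile_bar)
    return sorted(passing, key=lambda r: r.accuracy, reverse=True)


if __name__ == "__main__":
    for row in deployable(ROWS):
        print(f"{row.model} ({row.vendor}): {row.accuracy}")
```

On this cohort the filter keeps gpt-4.1-mini and claude-haiku-4-5, while the two rows with the highest Accuracy scores fall below the compile bar, which is why the table should not be read as a single ranking.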