Model benchmark leaderboard
Independent lab results for models we evaluate for AI orchestration — curated reference cohorts, not an exhaustive vendor catalog.
- Reference board — curated models with full capability and safety testing. Fast and Flagship are one pick per major direct API vendor; Frontier covers large models above 70B.
- Size bands — compare within Fast, Flagship, or Frontier instead of mixing every parameter count. Frontier rows are evaluated via Together.ai; Fast and Flagship use native vendor APIs.
- Together Catalog — separate exploratory list on Together.ai (capability-tested only; not mixed into the reference board).
How to read scores
Each number is a 0–100 checklist composite from our lab battery — not a real-world accuracy percentage or a user preference ranking.
- 85+ Strong on the tested checklist
- 65–84 Solid, with room to improve
- <65 Early or mixed results — common on strict v1 gates
Compile means the model passed engine routing gates for deployment — not a product endorsement. Full methodology
Fast models: one curated pick per major direct API vendor (OpenAI, Anthropic, xAI, Google Gemini, Mistral, DeepSeek).
Top models in this size class
Wide table — scroll sideways on desktop, or view as cards on mobile.
| Model | Vendor | Size | Accuracy | Reasoning | Coding | Latency | Cost | Compile | Tested |
|---|---|---|---|---|---|---|---|---|---|
| mistral-small-latest | Mistral | Fast | 52.5% | 25% | 60% | 80.1% | 50% | Below compile bar | 4d ago |
| grok-3-mini | xAI | Fast | 52.1% | 28.3% | 60% | 49.2% | 50% | Below compile bar | 4d ago |
| gpt-4.1-mini | OpenAI | Fast | 48.3% | 8.3% | 60% | 74.7% | 50% | Meets compile bar | 4d ago |
| gemini-2.5-flash | Fast | 48.3% | 0% | 76.7% | 8.6% | 50% | Below compile bar | 4d ago | |
| claude-haiku-4-5 | Anthropic | Fast | 44.2% | 16.7% | 60% | 56.8% | 50% | Meets compile bar | 4d ago |
| deepseek-v4-flash | DeepSeek | Fast | 40% | 0% | 60% | 0% | 50% | Below compile bar | 4d ago |