Model benchmark leaderboard
Independent lab results for models we evaluate for AI orchestration — curated reference cohorts, not an exhaustive vendor catalog.
- Reference board: curated models with full capability and safety testing. Fast and Flagship each hold one pick per major direct API vendor; Frontier covers large models above 70B parameters.
- Size bands: compare within Fast, Flagship, or Frontier instead of mixing every parameter count. Frontier rows are evaluated via Together.ai; Fast and Flagship use native vendor APIs.
- Together Catalog: a separate exploratory list on Together.ai (capability-tested only; not mixed into the reference board).
How to read scores
Each number is a 0–100 checklist composite from our lab battery — not a real-world accuracy percentage or a user preference ranking.
- 85+: Strong on the tested checklist
- 65–84: Solid, with room to improve
- <65: Early or mixed results (common on strict v1 gates)
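The bands above can be sketched as a small lookup. This is only an illustration: the function name `score_band` is hypothetical, and treating fractional scores between 84 and 85 as "Solid" is an assumption, since the legend does not say where they fall.

```python
def score_band(score: float) -> str:
    """Map a 0-100 checklist composite to its band label from the legend."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 85:
        return "Strong on the tested checklist"
    # Assumption: values between 84 and 85 (e.g. 84.5) count as "Solid".
    if score >= 65:
        return "Solid, with room to improve"
    return "Early or mixed results"


if __name__ == "__main__":
    for value in (92.0, 70.0, 61.3):
        print(value, "->", score_band(value))
```

The label describes checklist performance only; it says nothing about real-world accuracy or user preference.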
Compile means the model passed engine routing gates for deployment; it is not a product endorsement.
Exploratory models on Together.ai — capability-tested for comparison; safety testing is not required for this cohort.
Together Catalog — top models
| Model | Vendor | Size | Accuracy | Reasoning | Coding | Latency | Cost | Compile | Tested |
|---|---|---|---|---|---|---|---|---|---|
| Llama 3 70B Chat | Meta | Flagship | 0% | 0% | 0% | 99.1% | — | Below compile bar | 2d ago |
| Llama 3 8B Chat | Meta | Fast | 0% | 0% | 0% | 95.9% | — | Below compile bar | 2d ago |
| Llama 3.3 70B Instruct Turbo | Meta | Flagship | 61.3% | 25% | 60% | 63.8% | 50% | Below compile bar | 2d ago |
| Llama 3.3 70B Instruct Turbo (Together) | Meta | Flagship | 61.3% | 25% | 60% | 68.8% | 50% | Below compile bar | 2d ago |
| Mistral Small 24B Instruct | Mistral | Flagship | 0% | 0% | 0% | 99.6% | — | Below compile bar | 30d ago |
| Qwen 2.5 7B Instruct Turbo | Qwen | Fast | 58.8% | 25% | 85% | 49.5% | 50% | Below compile bar | 2d ago |
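
The advice to compare within a size band rather than across all rows can be sketched as follows. This is a minimal illustration, not the leaderboard's own tooling: the `ROWS` list and the `rank_within_bands` helper are hypothetical, with model names, bands, and Accuracy scores copied by hand from the table above.

```python
from collections import defaultdict

# (model, size band, Accuracy score) copied by hand from the table above.
ROWS = [
    ("Llama 3 70B Chat", "Flagship", 0.0),
    ("Llama 3 8B Chat", "Fast", 0.0),
    ("Llama 3.3 70B Instruct Turbo", "Flagship", 61.3),
    ("Llama 3.3 70B Instruct Turbo (Together)", "Flagship", 61.3),
    ("Mistral Small 24B Instruct", "Flagship", 0.0),
    ("Qwen 2.5 7B Instruct Turbo", "Fast", 58.8),
]


def rank_within_bands(rows):
    """Group rows by size band, then sort each band by score, highest first."""
    bands = defaultdict(list)
    for model, band, score in rows:
        bands[band].append((model, score))
    return {
        band: sorted(entries, key=lambda entry: entry[1], reverse=True)
        for band, entries in bands.items()
    }


if __name__ == "__main__":
    for band, entries in rank_within_bands(ROWS).items():
        print(band)
        for model, score in entries:
            print(f"  {score:5.1f}  {model}")
```

Ranking inside each band keeps a Fast model from being compared against a Flagship model with many times its parameter count.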