Model benchmark leaderboard

Independent lab results for models we evaluate for AI orchestration — curated reference cohorts, not an exhaustive vendor catalog.

Reference board: curated models with full capability and safety testing. Fast and Flagship each hold one pick per major direct-API vendor; Frontier covers large models above 70B parameters.
Size bands: compare models within Fast, Flagship, or Frontier rather than across every parameter count. Frontier rows are evaluated via Together.ai; Fast and Flagship rows use native vendor APIs.
Together Catalog: a separate exploratory list on Together.ai (capability-tested only; not mixed into the reference board).

How to read scores

Each score is a 0–100 checklist composite from our lab battery, shown with a % sign in the tables. It is not a real-world accuracy percentage or a user preference ranking.
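As a rough illustration of what a 0–100 checklist composite could look like, the sketch below scores a battery as the share of checks passed. The equal weighting, the function name `checklist_composite`, and the sample battery are all assumptions made for illustration; the lab's actual weighting is not stated here.

```python
def checklist_composite(results):
    """Return the share of passed checks on a 0-100 scale.

    Assumption: every check counts equally. The real battery may weight
    checks differently; this is only a sketch of the general idea.
    """
    if not results:
        raise ValueError("need at least one check result")
    passed = sum(1 for ok in results if ok)
    return round(100 * passed / len(results), 1)


# Hypothetical battery of four checks, one of which passed.
print(checklist_composite([True, False, False, False]))
```

Under this reading, a score describes coverage of a fixed checklist, which is why it should not be read as accuracy on real-world traffic.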

Compile means the model passed the engine routing gates for deployment; it is not a product endorsement.

Exploratory models on Together.ai — capability-tested for comparison; safety testing is not required for this cohort.

Together Catalog — top models

Model	Vendor	Size	Accuracy	Reasoning	Coding	Latency	Cost	Compile	Tested
Llama 3 70B Chat Exploratory · safety not required	Meta	Flagship	0%	0%	0%	99.1%	—	Below compile bar	2d ago
Llama 3 8B Chat Exploratory · safety not required	Meta	Fast	0%	0%	0%	95.9%	—	Below compile bar	2d ago
Llama 3.3 70B Instruct Turbo Exploratory · safety not required	Meta	Flagship	61.3%	25%	60%	63.8%	50%	Below compile bar	2d ago
Llama 3.3 70B Instruct Turbo (Together) Exploratory · safety not required	Meta	Flagship	61.3%	25%	60%	68.8%	50%	Below compile bar	2d ago
Mistral Small 24B Instruct Exploratory · safety not required	Mistral	Flagship	0%	0%	0%	99.6%	—	Below compile bar	30d ago
Qwen 2.5 7B Instruct Turbo Exploratory · safety not required	Qwen	Fast	58.8%	25%	85%	49.5%	50%	Below compile bar	2d ago
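
The size-band guidance (compare within a band, not across bands) can be sketched as a small script over a tab-separated export of the table. This is a minimal sketch, assuming such an export exists: the `ROWS` sample keeps only four rows and four columns from the table above with the "Exploratory" badge text trimmed from the model names, and the helpers `parse_score` and `rank_within_bands` are hypothetical names, not part of any published tooling.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the Together Catalog table (columns trimmed,
# badge text removed from model names for readability).
ROWS = """Model\tSize\tAccuracy\tCoding
Llama 3.3 70B Instruct Turbo\tFlagship\t61.3%\t60%
Mistral Small 24B Instruct\tFlagship\t0%\t0%
Qwen 2.5 7B Instruct Turbo\tFast\t58.8%\t85%
Llama 3 8B Chat\tFast\t0%\t0%
"""


def parse_score(cell):
    """Turn '61.3%' into 61.3; return None for any cell without a % value."""
    cell = cell.strip()
    if not cell.endswith("%"):
        return None
    return float(cell[:-1])


def rank_within_bands(tsv_text, column):
    """Group rows by the Size column and sort each group by one score column."""
    bands = defaultdict(list)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        score = parse_score(row[column])
        if score is not None:  # skip missing values instead of ranking them
            bands[row["Size"]].append((row["Model"], score))
    return {
        band: sorted(models, key=lambda m: m[1], reverse=True)
        for band, models in bands.items()
    }


ranking = rank_within_bands(ROWS, "Accuracy")
for band, models in ranking.items():
    print(band, models)
```

Keeping the ranking per band avoids reading a Fast model's score against a Flagship model's, which is the comparison the size bands are meant to discourage.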