Model benchmark leaderboard

Independent lab results for models we evaluate for AI orchestration — curated reference cohorts, not an exhaustive vendor catalog.

Reference board: curated models with full capability and safety testing. Fast and Flagship each hold one pick per major direct API vendor; Frontier covers large models above 70B parameters.
Size bands: compare within Fast, Flagship, or Frontier rather than across every parameter count. Frontier rows are evaluated via Together.ai; Fast and Flagship rows use native vendor APIs.
Together Catalog: a separate exploratory list of models on Together.ai (capability-tested only; not mixed into the reference board).

How to read scores

Each number is a 0–100 checklist composite from our lab battery, shown with a % sign in the table. It is not a real-world accuracy percentage or a user preference ranking.

Compile means the model passed the engine routing gates for deployment; it is not a product endorsement.


Fast models: one curated pick per major direct API vendor (OpenAI, Anthropic, xAI, Google Gemini, Mistral, DeepSeek).

Top models in this size class


Model	Vendor	Size	Accuracy	Reasoning	Coding	Latency	Cost	Compile	Tested
mistral-small-latest	Mistral	Fast	52.5%	25%	60%	80.1%	50%	Below compile bar	4d ago
grok-3-mini	xAI	Fast	52.1%	28.3%	60%	49.2%	50%	Below compile bar	4d ago
gpt-4.1-mini	OpenAI	Fast	48.3%	8.3%	60%	74.7%	50%	Meets compile bar	4d ago
gemini-2.5-flash	Google	Fast	48.3%	0%	76.7%	8.6%	50%	Below compile bar	4d ago
claude-haiku-4-5	Anthropic	Fast	44.2%	16.7%	60%	56.8%	50%	Meets compile bar	4d ago
deepseek-v4-flash	DeepSeek	Fast	40%	0%	60%	0%	50%	Below compile bar	4d ago
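
The table is sorted by the Accuracy column, but the Compile status does not follow that order: the two top-scoring rows are below the compile bar while lower-scoring rows meet it. As a minimal sketch of reading the two columns together, the Python below filters the Fast rows to those that meet the compile bar and then ranks them by score. The `Row` type and the `deployable` function are hypothetical names made up for this illustration, not part of the lab's tooling, and the values are copied by hand from the table above.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """One leaderboard row, reduced to the columns used here (hypothetical type)."""
    model: str
    score: float          # "Accuracy" column: a 0-100 composite, not a real-world accuracy
    compile_status: str   # "Compile" column, copied verbatim


# Values copied by hand from the Fast table above.
ROWS = [
    Row("mistral-small-latest", 52.5, "Below compile bar"),
    Row("grok-3-mini", 52.1, "Below compile bar"),
    Row("gpt-4.1-mini", 48.3, "Meets compile bar"),
    Row("gemini-2.5-flash", 48.3, "Below compile bar"),
    Row("claude-haiku-4-5", 44.2, "Meets compile bar"),
    Row("deepseek-v4-flash", 40.0, "Below compile bar"),
]


def deployable(rows: list[Row]) -> list[Row]:
    """Keep rows that meet the compile bar, highest composite score first."""
    passing = [r for r in rows if r.compile_status == "Meets compile bar"]
    return sorted(passing, key=lambda r: r.score, reverse=True)


if __name__ == "__main__":
    for row in deployable(ROWS):
        print(f"{row.model}: {row.score}")
```

With the rows above, only gpt-4.1-mini and claude-haiku-4-5 remain, in that order. Filtering on Compile before ranking keeps a high composite from being mistaken for deployment readiness.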