How we benchmark models
What appears on the public leaderboard, how cohorts differ, and what we do not claim.
Reference board (default)
The default public board is a reference cohort — curated models we stand behind after full lab testing, not an exhaustive vendor catalog. For each major direct API vendor (OpenAI, Anthropic, xAI, Google Gemini, Mistral, DeepSeek), we profile one Fast and one Flagship model on that vendor's native API.
We also publish a small Frontier band (above 70B parameters) for large open models that passed the same capability and safety bar. Today these Frontier models are evaluated via Together.ai because they are not yet represented as direct P1 picks. Latency and cost shown on the site for this band therefore reflect Together endpoints, not native vendor routing.
- Capability testing uses our standard benchmark battery (v2 pack).
- Safety testing is required for reference board publication (including Frontier).
- Size bands on the site: Fast (direct API), Flagship (direct API), and Frontier (Together.ai today).
- Compare scores within a band; do not treat Frontier latency/cost as directly comparable to Fast or Flagship.
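As an illustration of the "compare within a band" rule, here is a minimal sketch that groups rows by size band before ranking them. The `Row` fields, the model labels, and the scores are hypothetical placeholders, not our actual schema or published results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str    # hypothetical model label
    band: str     # "Fast", "Flagship", or "Frontier"
    channel: str  # "direct" or "together" (hosting channel)
    score: float  # 0-100 checklist composite


def rank_within_bands(rows):
    """Group rows by size band and rank each band by score, highest first."""
    by_band = defaultdict(list)
    for row in rows:
        by_band[row.band].append(row)
    return {
        band: sorted(members, key=lambda r: r.score, reverse=True)
        for band, members in by_band.items()
    }


# Placeholder rows; names and scores are made up for illustration only.
rows = [
    Row("vendor-a-fast", "Fast", "direct", 58.0),
    Row("vendor-b-fast", "Fast", "direct", 61.5),
    Row("vendor-a-flagship", "Flagship", "direct", 72.0),
    Row("open-model-x", "Frontier", "together", 66.0),
]

ranked = rank_within_bands(rows)
for band, members in ranked.items():
    print(band, [m.model for m in members])
```

Each band gets its own ranking, so a Frontier row hosted on Together.ai is never placed in the same ordered list as a direct-API Fast or Flagship row.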
Together Catalog (exploratory)
The Together Catalog tab shows models from a separate exploratory allowlist hosted on Together.ai. These rows are capability-tested for comparison, but safety testing is not required for this cohort. They are not mixed into the reference board, and that includes its Frontier band, which has a higher publication bar.
- Flat list sorted by vendor and model name (no size-band selector).
- Marked as exploratory on the leaderboard.
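The flat ordering described above can be sketched as a two-key sort. The dictionary keys and the case-insensitive comparison are assumptions made for this example, and the vendor and model names are placeholders.

```python
def catalog_order(rows):
    """Sort catalog rows by vendor, then by model name.

    casefold() makes the ordering case-insensitive; that choice is an
    assumption for this sketch, not documented leaderboard behavior.
    """
    return sorted(
        rows,
        key=lambda r: (r["vendor"].casefold(), r["model"].casefold()),
    )


# Placeholder rows for illustration only.
catalog = [
    {"vendor": "vendor-b", "model": "model-2"},
    {"vendor": "vendor-a", "model": "model-9"},
    {"vendor": "vendor-a", "model": "Model-1"},
]

ordered = catalog_order(catalog)
for row in ordered:
    print(row["vendor"], row["model"])
```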
Reading the numbers
- Scores are 0–100 checklist composites from fixed lab scenarios — not real-world task accuracy.
- Most models score below 65 on early v1 gates; that reflects strict tests, not necessarily poor products.
- A Compile badge means the model passed our engine routing gates; it is not an endorsement.
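To make "checklist composite" concrete, the sketch below assumes the simplest possible scheme: every checklist item is pass/fail, all items carry equal weight, and the score is the percentage passed. This page does not publish the exact formula, so treat the weighting, the function name, and the item names as hypothetical.

```python
def checklist_composite(results):
    """Return a 0-100 score: the percentage of checklist items passed.

    Assumes equal weights per item; the real battery may weight items
    differently.
    """
    if not results:
        raise ValueError("no checklist items to score")
    passed = sum(1 for ok in results.values() if ok)
    return 100.0 * passed / len(results)


# Hypothetical checklist outcome for one fixed lab scenario.
example = {
    "follows_output_format": True,
    "includes_required_field": False,
    "stays_within_length": True,
    "handles_out_of_scope_request": False,
}

score = checklist_composite(example)
print(score)  # 2 of 4 items passed, so 50.0
```

Under a scheme like this, a score counts satisfied checks in a fixed scenario, which is why it should not be read as real-world task accuracy.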
What we do not claim
- We do not mirror every SKU from every AI vendor.
- Scores are checklist-based lab results, not human preference rankings.
- We do not guarantee performance in your product or workload.
- We do not claim that latency and cost are comparable across hosting channels (direct API vs Together.ai); the channel affects both, and we label cohorts accordingly.