How we benchmark models

What appears on the public leaderboard, how cohorts differ, and what we do not claim.

Reference board (default)

The default public board is a reference cohort — curated models we stand behind after full lab testing, not an exhaustive vendor catalog. For each major direct API vendor (OpenAI, Anthropic, xAI, Google Gemini, Mistral, DeepSeek), we profile one Fast and one Flagship model on that vendor's native API.

We also publish a small Frontier band (above 70B parameters) for large open models that passed the same capability and safety bar. These Frontier models are currently evaluated through Together.ai because they are not yet represented as direct P1 picks, so the latency and cost shown on the site reflect Together endpoints, not native vendor routing.

Together Catalog (exploratory)

The Together Catalog tab shows models from a separate exploratory allowlist hosted on Together.ai. These rows are capability-tested for comparison, but safety testing is not required for this cohort. They are kept separate from the reference board, including its Frontier band, which uses a higher publication bar.
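The publication bars for the three cohorts described so far (reference board, Frontier band, Together Catalog) can be summarized in a short sketch. This is illustrative only: the `Entry` fields, the cohort labels, and the `meets_bar` function are hypothetical names, not the site's actual schema or pipeline, and the sketch assumes that "full lab testing" means passing both capability and safety testing.

```python
from dataclasses import dataclass
from typing import Optional

# "above 70B parameters" per the Frontier band description.
FRONTIER_MIN_PARAMS_B = 70.0


@dataclass(frozen=True)
class Entry:
    """Hypothetical record for one leaderboard row (not the real schema)."""
    name: str
    cohort: str                  # "reference", "frontier", or "catalog" (assumed labels)
    params_b: Optional[float]    # parameter count in billions, if known
    capability_tested: bool
    safety_tested: bool


def meets_bar(entry: Entry) -> bool:
    """Return True if the entry satisfies the publication bar of its cohort."""
    # Every cohort requires capability testing.
    if not entry.capability_tested:
        return False
    if entry.cohort == "catalog":
        # Exploratory cohort: safety testing is not required.
        return True
    if entry.cohort == "reference":
        # Assumption: "full lab testing" includes safety testing.
        return entry.safety_tested
    if entry.cohort == "frontier":
        # Same capability and safety bar, plus the size threshold.
        return (
            entry.safety_tested
            and entry.params_b is not None
            and entry.params_b > FRONTIER_MIN_PARAMS_B
        )
    raise ValueError(f"unknown cohort: {entry.cohort!r}")
```

Cohort membership is taken as an input here because the text describes both boards as curated lists; the function only checks the bar for a row that has already been assigned to a cohort.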

Reading the numbers

What we do not claim
