Model benchmark leaderboard

Independent lab results for models we evaluate for AI orchestration — curated reference cohorts, not an exhaustive vendor catalog.

How to read scores

Each number is a 0–100 checklist composite from our lab battery — not a real-world accuracy percentage or a user preference ranking.
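As a rough illustration of what a checklist composite can look like, the sketch below scores a battery as the share of checks passed, scaled to 0–100. This is an assumption for illustration only: the function name `checklist_composite` is hypothetical, and the lab's actual battery may weight or group checks differently.

```python
from typing import Sequence


def checklist_composite(results: Sequence[bool]) -> float:
    """Return the share of passed checks on a 0-100 scale, rounded to one decimal.

    Assumption: every check counts equally. The real battery may use weights.
    """
    if not results:
        raise ValueError("at least one check is required")
    return round(100 * sum(results) / len(results), 1)


# Hypothetical battery: one of four reasoning checks passed.
reasoning_checks = [True, False, False, False]
print(checklist_composite(reasoning_checks))  # 25.0
```

Read this way, a score of 25 would mean a quarter of the checks passed on that battery. It would not mean the model is right 25% of the time in real use.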

Compile means the model passed engine routing gates for deployment. It is not a product endorsement.
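One way to picture a gate of this kind is a per-dimension minimum that every score must meet. The sketch below is hypothetical throughout: the `Scores` class, the `COMPILE_BAR` thresholds, and the "Compile" pass label are placeholders, not the lab's published gates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    accuracy: float
    reasoning: float
    coding: float


# Hypothetical minimums, chosen only for illustration.
COMPILE_BAR = Scores(accuracy=70.0, reasoning=50.0, coding=50.0)


def compile_status(scores: Scores, bar: Scores = COMPILE_BAR) -> str:
    """Return the gate label; a model passes only if every score meets its bar."""
    passed = (
        scores.accuracy >= bar.accuracy
        and scores.reasoning >= bar.reasoning
        and scores.coding >= bar.coding
    )
    return "Compile" if passed else "Below compile bar"


# Scores taken from the Llama 3.3 70B Instruct Turbo row of the table.
print(compile_status(Scores(accuracy=61.3, reasoning=25.0, coding=60.0)))
```

Under these placeholder thresholds the example row falls below the bar on accuracy and reasoning, so a single weak dimension is enough to block deployment.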

Exploratory models on Together.ai — capability-tested for comparison; safety testing is not required for this cohort.

Together Catalog — top models

| Model | Vendor | Size | Accuracy | Reasoning | Coding | Latency | Cost | Compile | Tested |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 3 70B Chat (Exploratory · safety not required) | Meta | Flagship | 0% | 0% | 0% | 99.1% | | Below compile bar | 2d ago |
| Llama 3 8B Chat (Exploratory · safety not required) | Meta | Fast | 0% | 0% | 0% | 95.9% | | Below compile bar | 2d ago |
| Llama 3.3 70B Instruct Turbo (Exploratory · safety not required) | Meta | Flagship | 61.3% | 25% | 60% | 63.8% | 50% | Below compile bar | 2d ago |
| Llama 3.3 70B Instruct Turbo (Together) (Exploratory · safety not required) | Meta | Flagship | 61.3% | 25% | 60% | 68.8% | 50% | Below compile bar | 2d ago |
| Mistral Small 24B Instruct (Exploratory · safety not required) | Mistral | Flagship | 0% | 0% | 0% | 99.6% | | Below compile bar | 30d ago |
| Qwen 2.5 7B Instruct Turbo (Exploratory · safety not required) | Qwen | Fast | 58.8% | 25% | 85% | 49.5% | 50% | Below compile bar | 2d ago |