All model metrics

Search and compare exploratory models on Together.ai.

How to read scores

Each number is a 0–100 checklist composite from our lab battery — not a real-world accuracy percentage or a user preference ranking.

Compile means the model passed engine routing gates for deployment; it is not a product endorsement.

Reference board: Together Catalog

Exploratory models on Together.ai — capability-tested for comparison; safety testing is not required for this cohort.

Column groups: Identity, Capability, Safety, Performance, Status

Columns: Model, Vendor, Deploy, Accuracy, Reasoning, Coding, Slop, Reliability, Cap. safety, Jailbreak, PII, Bias, Latency, Cost, Stability, Badges

Llama 3.3 70B Instruct Turbo (Together) · Exploratory · safety not required

Standard	Composite	Tier	Easy	Medium	Hard	Easy-only cap
std.coding	60%	bad	0	1	0	No
std.instruction_follow	0%	worst	—	—	—	No
std.json_structured	100%	best	1	1	1	No
std.long_context	0%	worst	—	—	—	No
std.math	60%	bad	0	1	0	No
std.multiturn	0%	worst	—	—	—	No
std.reasoning	25%	worst	0	0	1	No
std.safety_policy	83.3%	good	—	—	—	No
std.slop.contradiction	100%	best	—	—	—	No
std.slop.format	0%	worst	—	—	—	No
std.slop.hallucination	100%	best	—	—	—	No
std.slop.relevance	100%	best	—	—	—	No
std.slop.topic_drift	100%	best	—	—	—	No
std.slop.uncertainty	100%	best	—	—	—	No
std.stability.refusal	100%	best	—	—	—	No
std.stability.repeat	100%	best	—	—	—	No
std.stability.schema	100%	best	—	—	—	No
std.summarization	85%	best	0	1	1	No
std.tool_use	100%	best	—	—	—	No
std.translation	0%	worst	—	—	—	No
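
For readers who want to work with the per-standard table programmatically, here is a minimal Python sketch that parses a few of the rows above and groups standards by tier. The helper names (parse_rows, by_tier) are hypothetical and not part of any published tooling. It assumes the table is tab-separated as shown, that a dash cell means no per-difficulty result was recorded, and that the "Easy-only cap" column would read "Yes" when the cap applies (only "No" appears on this page).

```python
import csv
import io
from collections import defaultdict

# Placeholder the page uses for an empty per-difficulty cell (an em dash).
MISSING = "\u2014"

# A handful of rows copied from the table above, tab-separated.
TABLE = (
    "Standard\tComposite\tTier\tEasy\tMedium\tHard\tEasy-only cap\n"
    "std.coding\t60%\tbad\t0\t1\t0\tNo\n"
    "std.json_structured\t100%\tbest\t1\t1\t1\tNo\n"
    "std.reasoning\t25%\tworst\t0\t0\t1\tNo\n"
    "std.safety_policy\t83.3%\tgood\t\u2014\t\u2014\t\u2014\tNo\n"
    "std.translation\t0%\tworst\t\u2014\t\u2014\t\u2014\tNo\n"
)


def parse_rows(text):
    """Parse the tab-separated table into dicts with a numeric composite."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for rec in reader:
        rows.append({
            "standard": rec["Standard"],
            # Strip the percent sign; the value is a 0-100 composite.
            "composite": float(rec["Composite"].rstrip("%")),
            "tier": rec["Tier"],
            # Dash cells become None instead of being guessed at.
            "levels": {
                key: (None if rec[key] == MISSING else int(rec[key]))
                for key in ("Easy", "Medium", "Hard")
            },
            # Assumption: "Yes" marks a capped row; the page only shows "No".
            "easy_only_cap": rec["Easy-only cap"] == "Yes",
        })
    return rows


def by_tier(rows):
    """Group standard names by their tier label, preserving table order."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["tier"]].append(row["standard"])
    return dict(groups)


if __name__ == "__main__":
    for tier, standards in by_tier(parse_rows(TABLE)).items():
        print(tier, standards)
```

Keeping dash cells as None rather than 0 avoids treating "no per-difficulty result" as a failed check.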

Not tested