All model metrics

Search and compare exploratory models on Together.ai.

How to read scores

Each number is a 0–100 checklist composite from our lab battery — not a real-world accuracy percentage or a user preference ranking.

85+ Strong on the tested checklist
65–84 Solid, with room to improve
<65 Early or mixed results — common on strict v1 gates
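
The bands above can be read as a simple threshold lookup. The sketch below is illustrative only: the function name `score_band` is hypothetical, and it assumes that scores between 84 and 85 (which the legend does not cover) fall into the middle band.

```python
def score_band(score: float) -> str:
    """Map a 0-100 checklist composite to the band label from the legend."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be within 0-100, got {score}")
    if score >= 85:
        return "Strong on the tested checklist"
    # Assumption: anything from 65 up to (but not including) 85 is "Solid".
    if score >= 65:
        return "Solid, with room to improve"
    return "Early or mixed results"


if __name__ == "__main__":
    for value in (100, 83.3, 61.3, 25):
        print(value, "->", score_band(value))
```

A band label describes performance on the tested checklist only; it says nothing about whether a model compiled.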

Compile means the model passed engine routing gates for deployment; it is not a product endorsement.

Reference board Together Catalog

Exploratory models on Together.ai — capability-tested for comparison; safety testing is not required for this cohort.


Column groups: Identity, Capability, Safety, Performance, Status

Model	Vendor	Deploy	Accuracy	Reasoning	Coding	Slop	Reliability	Cap. safety	Jailbreak	PII	Bias	Latency	Cost	Stability	Badges

Llama 3 70B Chat Exploratory · safety not required

Meta

24.8%

100%

—

Not tested

99.1%

—

Below compile bar Not tested

Model key: together/meta-llama/llama-3-70b-chat-hf@created-200

Model ID: meta-llama/llama-3-70b-chat-hf

Size band: 9to70b

Provenance: provider_standard

Capability pack: bench-pack-v2

Safety pack: —

Latency P95: 280 ms

Throughput P50: — tps

Cost / task: $0

Standards

Standard	Composite	Tier	Easy	Medium	Hard	Easy-only cap
std.coding	0%	worst	0	0	0	No
std.instruction_follow	0%	worst	—	—	—	No
std.json_structured	0%	worst	0	0	0	No
std.long_context	0%	worst	—	—	—	No
std.math	0%	worst	0	0	0	No
std.multiturn	0%	worst	—	—	—	No
std.reasoning	0%	worst	0	0	0	No
std.safety_policy	0%	worst	—	—	—	No
std.slop.contradiction	0%	worst	—	—	—	No
std.slop.format	0%	worst	—	—	—	No
std.slop.hallucination	0%	worst	—	—	—	No
std.slop.relevance	0%	worst	—	—	—	No
std.slop.topic_drift	0%	worst	—	—	—	No
std.slop.uncertainty	0%	worst	—	—	—	No
std.stability.refusal	0%	worst	—	—	—	No
std.stability.repeat	0%	worst	—	—	—	No
std.stability.schema	0%	worst	—	—	—	No
std.summarization	0%	worst	0	0	0	No
std.tool_use	0%	worst	—	—	—	No
std.translation	0%	worst	—	—	—	No

Slop profiles

Standard	Violation %	Samples
std.slop.contradiction	100%	3
std.slop.format	100%	3
std.slop.hallucination	100%	3
std.slop.relevance	100%	3
std.slop.topic_drift	100%	3
std.slop.uncertainty	100%	3

Compile gates

G0 — status=complete
G1 — all required standards present
G2 — json_break=0/3
G3 — composite=0.000
G4 — safety_fn_rate=1
G5 — safety_fp_rate=1
G6 — latency_p95_ms=279.99999999999994 cap_ms=30000 size_band=9to70b
G6b — throughput_p50_tps=-1 floor_tps=15 size_band=9to70b
G7 — cost_proxy=0.000000 median_comparable=0.000000
G8 — provider_error_rate=1.000
G9 — no easy_only_cap
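
Each gate line above is a gate ID followed by space-separated `key=value` fields, so it can be parsed mechanically. The sketch below is one possible reading, not the board's own tooling: `parse_gate` and `provider_failed` are hypothetical names. On this page, every entry whose G8 line reports `provider_error_rate=1.000` also shows 0% on every standard, which suggests (but does not prove) that those zeros come from failed provider calls rather than from model quality.

```python
def parse_gate(line: str) -> tuple[str, dict[str, str]]:
    """Split a gate line into its ID and its key=value fields.

    The second token is the dash separator and is ignored.
    Free-text gates (no key=value tokens) yield an empty dict.
    """
    gate_id, _separator, *tokens = line.split()
    fields = dict(token.split("=", 1) for token in tokens if "=" in token)
    return gate_id, fields


def provider_failed(gate_lines: list[str]) -> bool:
    """Return True when gate G8 reports a provider error rate of 1.0.

    Raises KeyError if the G8 line is missing from the input.
    """
    gates = dict(parse_gate(line) for line in gate_lines)
    return float(gates["G8"]["provider_error_rate"]) >= 1.0


if __name__ == "__main__":
    sample = [
        "G0 \u2014 status=complete",
        "G6 \u2014 latency_p95_ms=279.99999999999994 cap_ms=30000 size_band=9to70b",
        "G8 \u2014 provider_error_rate=1.000",
    ]
    print(parse_gate(sample[1]))
    print("provider failed:", provider_failed(sample))
```

Values are kept as strings so that raw fields such as `json_break=0/3` survive unchanged; convert them only where a numeric comparison is needed.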

Safety gates

Not tested

Weakness tags

weakness:over_refuse (safety_fp / )
weakness:under_refuse (safety_fn / )

Llama 3 8B Chat Exploratory · safety not required

Meta

24%

100%

—

Not tested

95.9%

—

Below compile bar Not tested

Model key: together/meta-llama/llama-3-8b-chat-hf@created-100

Model ID: meta-llama/llama-3-8b-chat-hf

Size band: le8b

Provenance: provider_standard

Capability pack: bench-pack-v2

Safety pack: —

Latency P95: 405.4 ms

Throughput P50: — tps

Cost / task: $0

Standards

Standard	Composite	Tier	Easy	Medium	Hard	Easy-only cap
std.coding	0%	worst	0	0	0	No
std.instruction_follow	0%	worst	—	—	—	No
std.json_structured	0%	worst	0	0	0	No
std.long_context	0%	worst	—	—	—	No
std.math	0%	worst	0	0	0	No
std.multiturn	0%	worst	—	—	—	No
std.reasoning	0%	worst	0	0	0	No
std.safety_policy	0%	worst	—	—	—	No
std.slop.contradiction	0%	worst	—	—	—	No
std.slop.format	0%	worst	—	—	—	No
std.slop.hallucination	0%	worst	—	—	—	No
std.slop.relevance	0%	worst	—	—	—	No
std.slop.topic_drift	0%	worst	—	—	—	No
std.slop.uncertainty	0%	worst	—	—	—	No
std.stability.refusal	0%	worst	—	—	—	No
std.stability.repeat	0%	worst	—	—	—	No
std.stability.schema	0%	worst	—	—	—	No
std.summarization	0%	worst	0	0	0	No
std.tool_use	0%	worst	—	—	—	No
std.translation	0%	worst	—	—	—	No

Slop profiles

Standard	Violation %	Samples
std.slop.contradiction	100%	3
std.slop.format	100%	3
std.slop.hallucination	100%	3
std.slop.relevance	100%	3
std.slop.topic_drift	100%	3
std.slop.uncertainty	100%	3

Compile gates

G0 — status=complete
G1 — all required standards present
G2 — json_break=0/3
G3 — composite=0.000
G4 — safety_fn_rate=1
G5 — safety_fp_rate=1
G6 — latency_p95_ms=405.3999999999998 cap_ms=10000 size_band=le8b
G6b — throughput_p50_tps=-1 floor_tps=20 size_band=le8b
G7 — cost_proxy=0.000000 median_comparable=0.000000
G8 — provider_error_rate=1.000
G9 — no easy_only_cap

Safety gates

Not tested

Weakness tags

weakness:over_refuse (safety_fp / )
weakness:under_refuse (safety_fn / )

Llama 3.3 70B Instruct Turbo Exploratory · safety not required

Meta

54.4%

61.3%

25%

60%

80%

83.3%

—

Not tested

63.8%

50%

100%

Below compile bar Not tested

Model key: meta-llama/Llama-3.3-70B-Instruct-Turbo

Model ID: meta-llama/Llama-3.3-70B-Instruct-Turbo

Size band: 9to70b

Provenance: provider_standard

Capability pack: bench-pack-v2

Safety pack: —

Latency P95: 10858 ms

Throughput P50: 20.7 tps

Cost / task: $0.000126

Strengths

std.json_structured (best)
std.safety_policy (good)
std.slop.contradiction (best)
std.slop.hallucination (best)
std.slop.relevance (best)
std.slop.topic_drift (best)
std.slop.uncertainty (best)
std.stability.refusal (best)
std.stability.repeat (best)
std.stability.schema (best)
std.summarization (best)
std.tool_use (best)

Standards

Standard	Composite	Tier	Easy	Medium	Hard	Easy-only cap
std.coding	60%	bad	0	1	0	No
std.instruction_follow	0%	worst	—	—	—	No
std.json_structured	100%	best	1	1	1	No
std.long_context	0%	worst	—	—	—	No
std.math	60%	bad	0	1	0	No
std.multiturn	0%	worst	—	—	—	No
std.reasoning	25%	worst	0	0	1	No
std.safety_policy	83.3%	good	—	—	—	No
std.slop.contradiction	100%	best	—	—	—	No
std.slop.format	0%	worst	—	—	—	No
std.slop.hallucination	100%	best	—	—	—	No
std.slop.relevance	100%	best	—	—	—	No
std.slop.topic_drift	100%	best	—	—	—	No
std.slop.uncertainty	100%	best	—	—	—	No
std.stability.refusal	100%	best	—	—	—	No
std.stability.repeat	100%	best	—	—	—	No
std.stability.schema	100%	best	—	—	—	No
std.summarization	85%	best	0	1	1	No
std.tool_use	100%	best	—	—	—	No
std.translation	0%	worst	—	—	—	No
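
For the rows that list Easy, Medium, and Hard values, the composites shown on this page are consistent with a weighted sum of 15, 60, and 25 points respectively. These weights are inferred from the displayed numbers, not taken from the published methodology, so treat the sketch below as a hypothesis; `tier_composite` is a hypothetical name.

```python
# Assumption: weights inferred from the table rows, not from the methodology.
EASY_POINTS = 15
MEDIUM_POINTS = 60
HARD_POINTS = 25


def tier_composite(easy: float, medium: float, hard: float) -> float:
    """Combine per-tier pass rates (each 0-1) into a 0-100 composite."""
    for name, rate in (("easy", easy), ("medium", medium), ("hard", hard)):
        if not 0 <= rate <= 1:
            raise ValueError(f"{name} pass rate must be within 0-1, got {rate}")
    return EASY_POINTS * easy + MEDIUM_POINTS * medium + HARD_POINTS * hard


if __name__ == "__main__":
    # (easy, medium, hard) triples as they appear in the Standards tables.
    for row in ((0, 1, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)):
        print(row, "->", tier_composite(*row))
```

Under this reading, (0, 1, 0) gives 60, (0, 0, 1) gives 25, (0, 1, 1) gives 85, and (1, 1, 1) gives 100, matching std.coding, std.reasoning, std.summarization, and std.json_structured above. Standards that show a dash in the tier columns are scored some other way that this sketch does not cover.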

Slop profiles

Standard	Violation %	Samples
std.slop.contradiction	0%	3
std.slop.format	100%	3
std.slop.hallucination	0%	3
std.slop.relevance	0%	3
std.slop.topic_drift	0%	3
std.slop.uncertainty	0%	3
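
On this page each slop standard's composite appears to be the complement of its violation rate: 0% violations pairs with a 100% composite, and 100% violations with 0%. The sketch below assumes that relationship and assumes the violation rate is a plain ratio over the sample count; the page shows only the percentage, and `slop_composite` is a hypothetical name.

```python
def slop_composite(violations: int, samples: int) -> tuple[float, float]:
    """Return (violation %, composite %) for one slop standard.

    Assumption: composite = 100 - violation %, as the tables suggest.
    """
    if samples <= 0:
        raise ValueError("samples must be positive")
    if not 0 <= violations <= samples:
        raise ValueError("violations must be between 0 and samples")
    violation_pct = 100 * violations / samples
    return violation_pct, 100 - violation_pct


if __name__ == "__main__":
    print(slop_composite(0, 3))  # no violations in 3 samples
    print(slop_composite(3, 3))  # every sample violated
```

With only 3 samples per standard, a single violation would move the rate by about 33 points, so these profiles are coarse.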

Compile gates

G0 — status=complete
G1 — all required standards present
G2 — json_break=0/3
G3 — composite=1.000
G4 — safety_fn_rate=0
G5 — safety_fp_rate=0
G6 — latency_p95_ms=10857.999999999996 cap_ms=30000 size_band=9to70b
G6b — throughput_p50_tps=20.68661971830986 floor_tps=15 size_band=9to70b
G7 — cost_proxy=0.000126 median_comparable=0.000000
G8 — provider_error_rate=0.000
G9 — no easy_only_cap

Safety gates

Not tested

Weakness tags

weakness:math_halluc (error_mode / ERR_HALLUCINATION)
weakness:slow (probe / )

Llama 3.3 70B Instruct Turbo (Together) Exploratory · safety not required

Meta

55.4%

61.3%

25%

60%

80%

83.3%

—

Not tested

68.8%

50%

100%

Below compile bar Not tested

Model key: together/meta-llama/llama-3.3-70b-instruct-turbo@created-1733507177

Model ID: meta-llama/llama-3.3-70b-instruct-turbo

Size band: 9to70b

Provenance: provider_standard

Capability pack: bench-pack-v2

Safety pack: —

Latency P95: 9363 ms

Throughput P50: 13.5 tps

Cost / task: $0.000127

Strengths

std.json_structured (best)
std.safety_policy (good)
std.slop.contradiction (best)
std.slop.hallucination (best)
std.slop.relevance (best)
std.slop.topic_drift (best)
std.slop.uncertainty (best)
std.stability.refusal (best)
std.stability.repeat (best)
std.stability.schema (best)
std.summarization (best)
std.tool_use (best)

Standards

Standard	Composite	Tier	Easy	Medium	Hard	Easy-only cap
std.coding	60%	bad	0	1	0	No
std.instruction_follow	0%	worst	—	—	—	No
std.json_structured	100%	best	1	1	1	No
std.long_context	0%	worst	—	—	—	No
std.math	60%	bad	0	1	0	No
std.multiturn	0%	worst	—	—	—	No
std.reasoning	25%	worst	0	0	1	No
std.safety_policy	83.3%	good	—	—	—	No
std.slop.contradiction	100%	best	—	—	—	No
std.slop.format	0%	worst	—	—	—	No
std.slop.hallucination	100%	best	—	—	—	No
std.slop.relevance	100%	best	—	—	—	No
std.slop.topic_drift	100%	best	—	—	—	No
std.slop.uncertainty	100%	best	—	—	—	No
std.stability.refusal	100%	best	—	—	—	No
std.stability.repeat	100%	best	—	—	—	No
std.stability.schema	100%	best	—	—	—	No
std.summarization	85%	best	0	1	1	No
std.tool_use	100%	best	—	—	—	No
std.translation	0%	worst	—	—	—	No

Slop profiles

Standard	Violation %	Samples
std.slop.contradiction	0%	3
std.slop.format	100%	3
std.slop.hallucination	0%	3
std.slop.relevance	0%	3
std.slop.topic_drift	0%	3
std.slop.uncertainty	0%	3

Compile gates

G0 — status=complete
G1 — all required standards present
G2 — json_break=0/3
G3 — composite=1.000
G4 — safety_fn_rate=0
G5 — safety_fp_rate=0
G6 — latency_p95_ms=9362.999999999998 cap_ms=30000 size_band=9to70b
G6b — throughput_p50_tps=13.46389228886169 floor_tps=15 size_band=9to70b
G7 — cost_proxy=0.000127 median_comparable=0.000000
G8 — provider_error_rate=0.000
G9 — no easy_only_cap
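
Gates G6 and G6b report a latency cap and a throughput floor that depend on the size band. The page does not state the comparison rule or the per-gate result, so the sketch below assumes G6 passes when latency P95 is at or below `cap_ms` and G6b passes when throughput P50 is at or above `floor_tps`. The limits are copied from the gate lines on this page; `check_perf_gates` is a hypothetical name.

```python
# Limits as printed in the G6 / G6b gate lines for each size band.
PERF_LIMITS = {
    "9to70b": {"cap_ms": 30000, "floor_tps": 15},
    "le8b": {"cap_ms": 10000, "floor_tps": 20},
}


def check_perf_gates(
    latency_p95_ms: float, throughput_p50_tps: float, size_band: str
) -> dict[str, bool]:
    """Evaluate the assumed G6 (latency) and G6b (throughput) rules."""
    limits = PERF_LIMITS[size_band]  # KeyError for an unknown band
    return {
        "G6": latency_p95_ms <= limits["cap_ms"],
        # A throughput of -1 (shown when no value was measured) fails the floor.
        "G6b": throughput_p50_tps >= limits["floor_tps"],
    }


if __name__ == "__main__":
    # Values from the entry directly above.
    print(check_perf_gates(9362.999999999998, 13.46389228886169, "9to70b"))
```

Under these assumptions the entry above clears the latency cap but falls short of the 15 tps floor, while the other Llama 3.3 70B Instruct Turbo entry (20.7 tps) clears both.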

Safety gates

Not tested

Weakness tags

weakness:math_halluc (error_mode / ERR_HALLUCINATION)

Mistral Small 24B Instruct Exploratory · safety not required

Mistral

24.9%

100%

99.6%

—

Below compile bar Blocked

Model key: together/mistralai/mistral-small-24b-instruct-2501@created-1738246136

Model ID: mistralai/mistral-small-24b-instruct-2501

Size band: 9to70b

Provenance: provider_standard

Capability pack: bench-pack-v2

Safety pack: bench-pack-safety-v2

Latency P95: 127.4 ms

Throughput P50: — tps

Cost / task: $0

Standards

Standard	Composite	Tier	Easy	Medium	Hard	Easy-only cap
std.coding	0%	worst	0	0	0	No
std.instruction_follow	0%	worst	—	—	—	No
std.json_structured	0%	worst	0	0	0	No
std.long_context	0%	worst	—	—	—	No
std.math	0%	worst	0	0	0	No
std.multiturn	0%	worst	—	—	—	No
std.reasoning	0%	worst	0	0	0	No
std.safety_policy	0%	worst	—	—	—	No
std.slop.contradiction	0%	worst	—	—	—	No
std.slop.format	0%	worst	—	—	—	No
std.slop.hallucination	0%	worst	—	—	—	No
std.slop.relevance	0%	worst	—	—	—	No
std.slop.topic_drift	0%	worst	—	—	—	No
std.slop.uncertainty	0%	worst	—	—	—	No
std.stability.refusal	0%	worst	—	—	—	No
std.stability.repeat	0%	worst	—	—	—	No
std.stability.schema	0%	worst	—	—	—	No
std.summarization	0%	worst	0	0	0	No
std.tool_use	0%	worst	—	—	—	No
std.translation	0%	worst	—	—	—	No

Slop profiles

Standard	Violation %	Samples
std.slop.contradiction	100%	3
std.slop.format	100%	3
std.slop.hallucination	100%	3
std.slop.relevance	100%	3
std.slop.topic_drift	100%	3
std.slop.uncertainty	100%	3

Compile gates

G0 — status=complete
G1 — all required standards present
G2 — json_break=0/3
G3 — composite=0.000
G4 — safety_fn_rate=1
G5 — safety_fp_rate=1
G6 — latency_p95_ms=127.39999999999998 cap_ms=30000 size_band=9to70b
G6b — throughput_p50_tps=-1 floor_tps=15 size_band=9to70b
G7 — cost_proxy=0.000000 median_comparable=0.000000
G8 — provider_error_rate=1.000
G9 — no easy_only_cap

Safety gates

P0 — status=complete pack=bench-pack-safety-v2
P1 — all safety standards present
P6 — catalog_provenance=provider_standard
P2 — safety_fn_rate=1
P3 — safety_fp_rate=1
P4 — jailbreak_resist_rate=0
P5 — harmful_output_rate=1
P7 — pii_leak_rate=1
P8 — bias_report_only=true

Weakness tags

weakness:over_refuse (safety_fp / )
weakness:under_refuse (safety_fn / )

Qwen 2.5 7B Instruct Turbo Exploratory · safety not required

Qwen

51.6%

58.8%

25%

85%

80%

83.3%

—

Not tested

49.5%

50%

100%

Below compile bar Not tested

Model key: Qwen/Qwen2.5-7B-Instruct-Turbo

Model ID: Qwen/Qwen2.5-7B-Instruct-Turbo

Size band: le8b

Provenance: provider_standard

Capability pack: bench-pack-v2

Safety pack: —

Latency P95: 5055 ms

Throughput P50: 30.4 tps

Cost / task: $0.000038

Strengths

std.coding (best)
std.instruction_follow (best)
std.json_structured (best)
std.multiturn (best)
std.safety_policy (good)
std.slop.contradiction (best)
std.slop.hallucination (best)
std.slop.relevance (best)
std.slop.topic_drift (best)
std.slop.uncertainty (best)
std.stability.refusal (best)
std.stability.repeat (best)
std.stability.schema (best)
std.summarization (good)
std.tool_use (best)

Standards

Standard	Composite	Tier	Easy	Medium	Hard	Easy-only cap
std.coding	85%	best	0	1	1	No
std.instruction_follow	100%	best	—	—	—	No
std.json_structured	100%	best	1	1	1	No
std.long_context	0%	worst	—	—	—	No
std.math	25%	worst	0	0	1	No
std.multiturn	100%	best	—	—	—	No
std.reasoning	25%	worst	0	0	1	No
std.safety_policy	83.3%	good	—	—	—	No
std.slop.contradiction	100%	best	—	—	—	No
std.slop.format	0%	worst	—	—	—	No
std.slop.hallucination	100%	best	—	—	—	No
std.slop.relevance	100%	best	—	—	—	No
std.slop.topic_drift	100%	best	—	—	—	No
std.slop.uncertainty	100%	best	—	—	—	No
std.stability.refusal	100%	best	—	—	—	No
std.stability.repeat	100%	best	—	—	—	No
std.stability.schema	100%	best	—	—	—	No
std.summarization	76.7%	good	0	1	0.67	No
std.tool_use	100%	best	—	—	—	No
std.translation	0%	worst	—	—	—	No

Slop profiles

Standard	Violation %	Samples
std.slop.contradiction	0%	3
std.slop.format	100%	3
std.slop.hallucination	0%	3
std.slop.relevance	0%	3
std.slop.topic_drift	0%	3
std.slop.uncertainty	0%	3

Compile gates

G0 — status=complete
G1 — all required standards present
G2 — json_break=0/3
G3 — composite=1.000
G4 — safety_fn_rate=0
G5 — safety_fp_rate=0
G6 — latency_p95_ms=5054.999999999997 cap_ms=10000 size_band=le8b
G6b — throughput_p50_tps=30.3951367781155 floor_tps=20 size_band=le8b
G7 — cost_proxy=0.000038 median_comparable=0.000000
G8 — provider_error_rate=0.000
G9 — no easy_only_cap

Safety gates

Not tested

Weakness tags

weakness:math_halluc (error_mode / ERR_HALLUCINATION)