|
Llama 3 70B Chat
Exploratory · safety not required
|
Meta |
24.8%
|
0%
|
0%
|
0%
|
100%
|
0%
|
0%
|
—
|
Not tested
|
Not tested
|
99.1%
|
—
|
0%
|
Below compile bar
Not tested
|
Model key: together/meta-llama/llama-3-70b-chat-hf@created-200
Model ID: meta-llama/llama-3-70b-chat-hf
Size band: 9to70b
Provenance: provider_standard
Capability pack: bench-pack-v2
Safety pack: —
Latency P95: 280 ms
Throughput P50: — tps
Cost / task: $0
Standards
| Standard | Composite | Tier | Easy | Medium | Hard | Easy-only cap |
| --- | --- | --- | --- | --- | --- | --- |
| std.coding | 0% | worst | 0 | 0 | 0 | No |
| std.instruction_follow | 0% | worst | — | — | — | No |
| std.json_structured | 0% | worst | 0 | 0 | 0 | No |
| std.long_context | 0% | worst | — | — | — | No |
| std.math | 0% | worst | 0 | 0 | 0 | No |
| std.multiturn | 0% | worst | — | — | — | No |
| std.reasoning | 0% | worst | 0 | 0 | 0 | No |
| std.safety_policy | 0% | worst | — | — | — | No |
| std.slop.contradiction | 0% | worst | — | — | — | No |
| std.slop.format | 0% | worst | — | — | — | No |
| std.slop.hallucination | 0% | worst | — | — | — | No |
| std.slop.relevance | 0% | worst | — | — | — | No |
| std.slop.topic_drift | 0% | worst | — | — | — | No |
| std.slop.uncertainty | 0% | worst | — | — | — | No |
| std.stability.refusal | 0% | worst | — | — | — | No |
| std.stability.repeat | 0% | worst | — | — | — | No |
| std.stability.schema | 0% | worst | — | — | — | No |
| std.summarization | 0% | worst | 0 | 0 | 0 | No |
| std.tool_use | 0% | worst | — | — | — | No |
| std.translation | 0% | worst | — | — | — | No |
Slop profiles
| Standard | Violation % | Samples |
| --- | --- | --- |
| std.slop.contradiction | 100% | 3 |
| std.slop.format | 100% | 3 |
| std.slop.hallucination | 100% | 3 |
| std.slop.relevance | 100% | 3 |
| std.slop.topic_drift | 100% | 3 |
| std.slop.uncertainty | 100% | 3 |
Compile gates
- G0 — status=complete
- G1 — all required standards present
- G2 — json_break=0/3
- G3 — composite=0.000
- G4 — safety_fn_rate=1
- G5 — safety_fp_rate=1
- G6 — latency_p95_ms=279.99999999999994 cap_ms=30000 size_band=9to70b
- G6b — throughput_p50_tps=-1 floor_tps=15 size_band=9to70b
- G7 — cost_proxy=0.000000 median_comparable=0.000000
- G8 — provider_error_rate=1.000
- G9 — no easy_only_cap
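The gate lines above use a simple `key=value` layout, so they can be read programmatically. Here is a minimal Python sketch, assuming whitespace-separated `key=value` tokens after the gate ID. It treats `throughput_p50_tps=-1` as a "not measured" sentinel, which is an assumption suggested by the blank Throughput P50 in this card. `parse_gate` is a hypothetical helper, not part of the benchmark tooling.

```python
from typing import Dict, Tuple, Union

Value = Union[float, str]
SEPARATOR = " \u2014 "  # the dash that separates the gate ID from its fields


def parse_gate(line: str) -> Tuple[str, Dict[str, Value]]:
    """Split one gate line into (gate_id, fields)."""
    gate_id, _, rest = line.partition(SEPARATOR)
    fields: Dict[str, Value] = {}
    for token in rest.split():
        if "=" not in token:
            continue  # free-text gates (e.g. G1, G9) carry no key=value pairs
        key, _, raw = token.partition("=")
        try:
            fields[key] = float(raw)
        except ValueError:
            fields[key] = raw  # non-numeric values such as size_band=9to70b
    return gate_id.strip(), fields


gate_id, fields = parse_gate(
    "G6b \u2014 throughput_p50_tps=-1 floor_tps=15 size_band=9to70b"
)
# Assumption: a negative throughput means "not measured", not a real reading.
measured = fields["throughput_p50_tps"] >= 0
print(gate_id, fields, measured)
```

Under that assumption, G6b for this card carries no usable throughput reading, so it cannot be compared against `floor_tps`.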
Weakness tags
- weakness:over_refuse (safety_fp / )
- weakness:under_refuse (safety_fn / )
|
|
Llama 3 8B Chat
Exploratory · safety not required
|
Meta |
24%
|
0%
|
0%
|
0%
|
100%
|
0%
|
0%
|
—
|
Not tested
|
Not tested
|
95.9%
|
—
|
0%
|
Below compile bar
Not tested
|
Model key: together/meta-llama/llama-3-8b-chat-hf@created-100
Model ID: meta-llama/llama-3-8b-chat-hf
Size band: le8b
Provenance: provider_standard
Capability pack: bench-pack-v2
Safety pack: —
Latency P95: 405.4 ms
Throughput P50: — tps
Cost / task: $0
Standards
| Standard | Composite | Tier | Easy | Medium | Hard | Easy-only cap |
| --- | --- | --- | --- | --- | --- | --- |
| std.coding | 0% | worst | 0 | 0 | 0 | No |
| std.instruction_follow | 0% | worst | — | — | — | No |
| std.json_structured | 0% | worst | 0 | 0 | 0 | No |
| std.long_context | 0% | worst | — | — | — | No |
| std.math | 0% | worst | 0 | 0 | 0 | No |
| std.multiturn | 0% | worst | — | — | — | No |
| std.reasoning | 0% | worst | 0 | 0 | 0 | No |
| std.safety_policy | 0% | worst | — | — | — | No |
| std.slop.contradiction | 0% | worst | — | — | — | No |
| std.slop.format | 0% | worst | — | — | — | No |
| std.slop.hallucination | 0% | worst | — | — | — | No |
| std.slop.relevance | 0% | worst | — | — | — | No |
| std.slop.topic_drift | 0% | worst | — | — | — | No |
| std.slop.uncertainty | 0% | worst | — | — | — | No |
| std.stability.refusal | 0% | worst | — | — | — | No |
| std.stability.repeat | 0% | worst | — | — | — | No |
| std.stability.schema | 0% | worst | — | — | — | No |
| std.summarization | 0% | worst | 0 | 0 | 0 | No |
| std.tool_use | 0% | worst | — | — | — | No |
| std.translation | 0% | worst | — | — | — | No |
Slop profiles
| Standard | Violation % | Samples |
| --- | --- | --- |
| std.slop.contradiction | 100% | 3 |
| std.slop.format | 100% | 3 |
| std.slop.hallucination | 100% | 3 |
| std.slop.relevance | 100% | 3 |
| std.slop.topic_drift | 100% | 3 |
| std.slop.uncertainty | 100% | 3 |
Compile gates
- G0 — status=complete
- G1 — all required standards present
- G2 — json_break=0/3
- G3 — composite=0.000
- G4 — safety_fn_rate=1
- G5 — safety_fp_rate=1
- G6 — latency_p95_ms=405.3999999999998 cap_ms=10000 size_band=le8b
- G6b — throughput_p50_tps=-1 floor_tps=20 size_band=le8b
- G7 — cost_proxy=0.000000 median_comparable=0.000000
- G8 — provider_error_rate=1.000
- G9 — no easy_only_cap
Weakness tags
- weakness:over_refuse (safety_fp / )
- weakness:under_refuse (safety_fn / )
|
|
Llama 3.3 70B Instruct Turbo
Exploratory · safety not required
|
Meta |
54.4%
|
61.3%
|
25%
|
60%
|
0%
|
80%
|
83.3%
|
—
|
Not tested
|
Not tested
|
63.8%
|
50%
|
100%
|
Below compile bar
Not tested
|
Model key: meta-llama/Llama-3.3-70B-Instruct-Turbo
Model ID: meta-llama/Llama-3.3-70B-Instruct-Turbo
Size band: 9to70b
Provenance: provider_standard
Capability pack: bench-pack-v2
Safety pack: —
Latency P95: 10858 ms
Throughput P50: 20.7 tps
Cost / task: $0.000126
Strengths
- std.json_structured (best)
- std.safety_policy (good)
- std.slop.contradiction (best)
- std.slop.hallucination (best)
- std.slop.relevance (best)
- std.slop.topic_drift (best)
- std.slop.uncertainty (best)
- std.stability.refusal (best)
- std.stability.repeat (best)
- std.stability.schema (best)
- std.summarization (best)
- std.tool_use (best)
Standards
| Standard | Composite | Tier | Easy | Medium | Hard | Easy-only cap |
| --- | --- | --- | --- | --- | --- | --- |
| std.coding | 60% | bad | 0 | 1 | 0 | No |
| std.instruction_follow | 0% | worst | — | — | — | No |
| std.json_structured | 100% | best | 1 | 1 | 1 | No |
| std.long_context | 0% | worst | — | — | — | No |
| std.math | 60% | bad | 0 | 1 | 0 | No |
| std.multiturn | 0% | worst | — | — | — | No |
| std.reasoning | 25% | worst | 0 | 0 | 1 | No |
| std.safety_policy | 83.3% | good | — | — | — | No |
| std.slop.contradiction | 100% | best | — | — | — | No |
| std.slop.format | 0% | worst | — | — | — | No |
| std.slop.hallucination | 100% | best | — | — | — | No |
| std.slop.relevance | 100% | best | — | — | — | No |
| std.slop.topic_drift | 100% | best | — | — | — | No |
| std.slop.uncertainty | 100% | best | — | — | — | No |
| std.stability.refusal | 100% | best | — | — | — | No |
| std.stability.repeat | 100% | best | — | — | — | No |
| std.stability.schema | 100% | best | — | — | — | No |
| std.summarization | 85% | best | 0 | 1 | 1 | No |
| std.tool_use | 100% | best | — | — | — | No |
| std.translation | 0% | worst | — | — | — | No |
Slop profiles
| Standard | Violation % | Samples |
| --- | --- | --- |
| std.slop.contradiction | 0% | 3 |
| std.slop.format | 100% | 3 |
| std.slop.hallucination | 0% | 3 |
| std.slop.relevance | 0% | 3 |
| std.slop.topic_drift | 0% | 3 |
| std.slop.uncertainty | 0% | 3 |
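In this card, each `std.slop.*` composite in the Standards table equals 100% minus the violation rate listed in the Slop profiles table. The page does not state how the composite is derived, so the sketch below only assumes that relationship. `slop_composite` is a hypothetical name.

```python
def slop_composite(violation_pct: float) -> float:
    """Assumed relationship: composite % = 100 - violation %."""
    if not 0.0 <= violation_pct <= 100.0:
        raise ValueError("violation_pct must be within [0, 100]")
    return 100.0 - violation_pct


# Violation rates copied from this card's Slop profiles table (3 samples each).
violations = {
    "std.slop.contradiction": 0.0,
    "std.slop.format": 100.0,
    "std.slop.hallucination": 0.0,
    "std.slop.relevance": 0.0,
    "std.slop.topic_drift": 0.0,
    "std.slop.uncertainty": 0.0,
}

composites = {name: slop_composite(pct) for name, pct in violations.items()}
for name, value in composites.items():
    print(f"{name}: {value:.0f}%")
```

With only three samples per standard, a single violation moves the rate by a third, so these percentages are coarse.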
Compile gates
- G0 — status=complete
- G1 — all required standards present
- G2 — json_break=0/3
- G3 — composite=1.000
- G4 — safety_fn_rate=0
- G5 — safety_fp_rate=0
- G6 — latency_p95_ms=10857.999999999996 cap_ms=30000 size_band=9to70b
- G6b — throughput_p50_tps=20.68661971830986 floor_tps=15 size_band=9to70b
- G7 — cost_proxy=0.000126 median_comparable=0.000000
- G8 — provider_error_rate=0.000
- G9 — no easy_only_cap
Weakness tags
- weakness:math_halluc (error_mode / ERR_HALLUCINATION)
- weakness:slow (probe / )
|
|
Llama 3.3 70B Instruct Turbo (Together)
Exploratory · safety not required
|
Meta |
55.4%
|
61.3%
|
25%
|
60%
|
0%
|
80%
|
83.3%
|
—
|
Not tested
|
Not tested
|
68.8%
|
50%
|
100%
|
Below compile bar
Not tested
|
Model key: together/meta-llama/llama-3.3-70b-instruct-turbo@created-1733507177
Model ID: meta-llama/llama-3.3-70b-instruct-turbo
Size band: 9to70b
Provenance: provider_standard
Capability pack: bench-pack-v2
Safety pack: —
Latency P95: 9363 ms
Throughput P50: 13.5 tps
Cost / task: $0.000127
Strengths
- std.json_structured (best)
- std.safety_policy (good)
- std.slop.contradiction (best)
- std.slop.hallucination (best)
- std.slop.relevance (best)
- std.slop.topic_drift (best)
- std.slop.uncertainty (best)
- std.stability.refusal (best)
- std.stability.repeat (best)
- std.stability.schema (best)
- std.summarization (best)
- std.tool_use (best)
Standards
| Standard | Composite | Tier | Easy | Medium | Hard | Easy-only cap |
| --- | --- | --- | --- | --- | --- | --- |
| std.coding | 60% | bad | 0 | 1 | 0 | No |
| std.instruction_follow | 0% | worst | — | — | — | No |
| std.json_structured | 100% | best | 1 | 1 | 1 | No |
| std.long_context | 0% | worst | — | — | — | No |
| std.math | 60% | bad | 0 | 1 | 0 | No |
| std.multiturn | 0% | worst | — | — | — | No |
| std.reasoning | 25% | worst | 0 | 0 | 1 | No |
| std.safety_policy | 83.3% | good | — | — | — | No |
| std.slop.contradiction | 100% | best | — | — | — | No |
| std.slop.format | 0% | worst | — | — | — | No |
| std.slop.hallucination | 100% | best | — | — | — | No |
| std.slop.relevance | 100% | best | — | — | — | No |
| std.slop.topic_drift | 100% | best | — | — | — | No |
| std.slop.uncertainty | 100% | best | — | — | — | No |
| std.stability.refusal | 100% | best | — | — | — | No |
| std.stability.repeat | 100% | best | — | — | — | No |
| std.stability.schema | 100% | best | — | — | — | No |
| std.summarization | 85% | best | 0 | 1 | 1 | No |
| std.tool_use | 100% | best | — | — | — | No |
| std.translation | 0% | worst | — | — | — | No |
Slop profiles
| Standard | Violation % | Samples |
| --- | --- | --- |
| std.slop.contradiction | 0% | 3 |
| std.slop.format | 100% | 3 |
| std.slop.hallucination | 0% | 3 |
| std.slop.relevance | 0% | 3 |
| std.slop.topic_drift | 0% | 3 |
| std.slop.uncertainty | 0% | 3 |
Compile gates
- G0 — status=complete
- G1 — all required standards present
- G2 — json_break=0/3
- G3 — composite=1.000
- G4 — safety_fn_rate=0
- G5 — safety_fp_rate=0
- G6 — latency_p95_ms=9362.999999999998 cap_ms=30000 size_band=9to70b
- G6b — throughput_p50_tps=13.46389228886169 floor_tps=15 size_band=9to70b
- G7 — cost_proxy=0.000127 median_comparable=0.000000
- G8 — provider_error_rate=0.000
- G9 — no easy_only_cap
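G6 and G6b each pair a measured value with a per-size-band limit. The page shows the numbers but not the pass rule, so this rough sketch assumes a gate passes when latency is at or below `cap_ms` and throughput is at or above `floor_tps`. The function names are hypothetical.

```python
def latency_ok(latency_p95_ms: float, cap_ms: float) -> bool:
    """Assumed G6 rule: P95 latency must not exceed the cap for the size band."""
    return latency_p95_ms <= cap_ms


def throughput_ok(throughput_p50_tps: float, floor_tps: float) -> bool:
    """Assumed G6b rule: P50 throughput must reach the floor for the size band."""
    return throughput_p50_tps >= floor_tps


# Values copied from this card's G6 and G6b lines (size_band=9to70b).
g6_pass = latency_ok(9362.999999999998, 30000)
g6b_pass = throughput_ok(13.46389228886169, 15)
print(g6_pass)   # True
print(g6b_pass)  # False
```

Under these assumed rules, this card would clear the latency cap but fall short of the 15 tps throughput floor.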
Weakness tags
- weakness:math_halluc (error_mode / ERR_HALLUCINATION)
|
|
Mistral Small 24B Instruct
Exploratory · safety not required
|
Mistral |
24.9%
|
0%
|
0%
|
0%
|
100%
|
0%
|
0%
|
0%
|
100%
|
0%
|
99.6%
|
—
|
0%
|
Below compile bar
Blocked
|
Model key: together/mistralai/mistral-small-24b-instruct-2501@created-1738246136
Model ID: mistralai/mistral-small-24b-instruct-2501
Size band: 9to70b
Provenance: provider_standard
Capability pack: bench-pack-v2
Safety pack: bench-pack-safety-v2
Latency P95: 127.4 ms
Throughput P50: — tps
Cost / task: $0
Standards
| Standard | Composite | Tier | Easy | Medium | Hard | Easy-only cap |
| --- | --- | --- | --- | --- | --- | --- |
| std.coding | 0% | worst | 0 | 0 | 0 | No |
| std.instruction_follow | 0% | worst | — | — | — | No |
| std.json_structured | 0% | worst | 0 | 0 | 0 | No |
| std.long_context | 0% | worst | — | — | — | No |
| std.math | 0% | worst | 0 | 0 | 0 | No |
| std.multiturn | 0% | worst | — | — | — | No |
| std.reasoning | 0% | worst | 0 | 0 | 0 | No |
| std.safety_policy | 0% | worst | — | — | — | No |
| std.slop.contradiction | 0% | worst | — | — | — | No |
| std.slop.format | 0% | worst | — | — | — | No |
| std.slop.hallucination | 0% | worst | — | — | — | No |
| std.slop.relevance | 0% | worst | — | — | — | No |
| std.slop.topic_drift | 0% | worst | — | — | — | No |
| std.slop.uncertainty | 0% | worst | — | — | — | No |
| std.stability.refusal | 0% | worst | — | — | — | No |
| std.stability.repeat | 0% | worst | — | — | — | No |
| std.stability.schema | 0% | worst | — | — | — | No |
| std.summarization | 0% | worst | 0 | 0 | 0 | No |
| std.tool_use | 0% | worst | — | — | — | No |
| std.translation | 0% | worst | — | — | — | No |
Slop profiles
| Standard | Violation % | Samples |
| --- | --- | --- |
| std.slop.contradiction | 100% | 3 |
| std.slop.format | 100% | 3 |
| std.slop.hallucination | 100% | 3 |
| std.slop.relevance | 100% | 3 |
| std.slop.topic_drift | 100% | 3 |
| std.slop.uncertainty | 100% | 3 |
Compile gates
- G0 — status=complete
- G1 — all required standards present
- G2 — json_break=0/3
- G3 — composite=0.000
- G4 — safety_fn_rate=1
- G5 — safety_fp_rate=1
- G6 — latency_p95_ms=127.39999999999998 cap_ms=30000 size_band=9to70b
- G6b — throughput_p50_tps=-1 floor_tps=15 size_band=9to70b
- G7 — cost_proxy=0.000000 median_comparable=0.000000
- G8 — provider_error_rate=1.000
- G9 — no easy_only_cap
Safety gates
- P0 — status=complete pack=bench-pack-safety-v2
- P1 — all safety standards present
- P6 — catalog_provenance=provider_standard
- P2 — safety_fn_rate=1
- P3 — safety_fp_rate=1
- P4 — jailbreak_resist_rate=0
- P5 — harmful_output_rate=1
- P7 — pii_leak_rate=1
- P8 — bias_report_only=true
Weakness tags
- weakness:over_refuse (safety_fp / )
- weakness:under_refuse (safety_fn / )
|
|
Qwen 2.5 7B Instruct Turbo
Exploratory · safety not required
|
Qwen |
51.6%
|
58.8%
|
25%
|
85%
|
0%
|
80%
|
83.3%
|
—
|
Not tested
|
Not tested
|
49.5%
|
50%
|
100%
|
Below compile bar
Not tested
|
Model key: Qwen/Qwen2.5-7B-Instruct-Turbo
Model ID: Qwen/Qwen2.5-7B-Instruct-Turbo
Size band: le8b
Provenance: provider_standard
Capability pack: bench-pack-v2
Safety pack: —
Latency P95: 5055 ms
Throughput P50: 30.4 tps
Cost / task: $0.000038
Strengths
- std.coding (best)
- std.instruction_follow (best)
- std.json_structured (best)
- std.multiturn (best)
- std.safety_policy (good)
- std.slop.contradiction (best)
- std.slop.hallucination (best)
- std.slop.relevance (best)
- std.slop.topic_drift (best)
- std.slop.uncertainty (best)
- std.stability.refusal (best)
- std.stability.repeat (best)
- std.stability.schema (best)
- std.summarization (good)
- std.tool_use (best)
Standards
| Standard | Composite | Tier | Easy | Medium | Hard | Easy-only cap |
| --- | --- | --- | --- | --- | --- | --- |
| std.coding | 85% | best | 0 | 1 | 1 | No |
| std.instruction_follow | 100% | best | — | — | — | No |
| std.json_structured | 100% | best | 1 | 1 | 1 | No |
| std.long_context | 0% | worst | — | — | — | No |
| std.math | 25% | worst | 0 | 0 | 1 | No |
| std.multiturn | 100% | best | — | — | — | No |
| std.reasoning | 25% | worst | 0 | 0 | 1 | No |
| std.safety_policy | 83.3% | good | — | — | — | No |
| std.slop.contradiction | 100% | best | — | — | — | No |
| std.slop.format | 0% | worst | — | — | — | No |
| std.slop.hallucination | 100% | best | — | — | — | No |
| std.slop.relevance | 100% | best | — | — | — | No |
| std.slop.topic_drift | 100% | best | — | — | — | No |
| std.slop.uncertainty | 100% | best | — | — | — | No |
| std.stability.refusal | 100% | best | — | — | — | No |
| std.stability.repeat | 100% | best | — | — | — | No |
| std.stability.schema | 100% | best | — | — | — | No |
| std.summarization | 76.7% | good | 0 | 1 | 0.67 | No |
| std.tool_use | 100% | best | — | — | — | No |
| std.translation | 0% | worst | — | — | — | No |
Slop profiles
| Standard | Violation % | Samples |
| --- | --- | --- |
| std.slop.contradiction | 0% | 3 |
| std.slop.format | 100% | 3 |
| std.slop.hallucination | 0% | 3 |
| std.slop.relevance | 0% | 3 |
| std.slop.topic_drift | 0% | 3 |
| std.slop.uncertainty | 0% | 3 |
Compile gates
- G0 — status=complete
- G1 — all required standards present
- G2 — json_break=0/3
- G3 — composite=1.000
- G4 — safety_fn_rate=0
- G5 — safety_fp_rate=0
- G6 — latency_p95_ms=5054.999999999997 cap_ms=10000 size_band=le8b
- G6b — throughput_p50_tps=30.3951367781155 floor_tps=20 size_band=le8b
- G7 — cost_proxy=0.000038 median_comparable=0.000000
- G8 — provider_error_rate=0.000
- G9 — no easy_only_cap
Weakness tags
- weakness:math_halluc (error_mode / ERR_HALLUCINATION)
|