| gpt-4.1-mini | OpenAI | 81.6% | 48.3% | 8.3% | 60% | 0% | 100% | 83.3% | 100% | 0% | 0% | 74.7% | 50% | 100% | Meets compile bar, Conditional |
Model key: gpt-4.1-mini
Model ID: gpt-4.1-mini
Size band: le8b
Provenance: provider_standard
Capability pack: bench-pack-v2
Safety pack: bench-pack-safety-v2
Latency P95: 2533.6 ms
Throughput P50: 20.6 tps
Cost / task: $0.00007
Strengths
- std.instruction_follow (best)
- std.json_structured (best)
- std.multiturn (best)
- std.safety_policy (good)
- std.slop.contradiction (best)
- std.slop.hallucination (best)
- std.slop.relevance (best)
- std.slop.topic_drift (best)
- std.slop.uncertainty (best)
- std.stability.refusal (best)
- std.stability.repeat (best)
- std.stability.schema (best)
- std.summarization (best)
- std.tool_use (best)
Standards
| Standard | Composite | Tier | Easy | Medium | Hard | Easy-only cap |
| --- | --- | --- | --- | --- | --- | --- |
| std.coding | 60% | bad | 0 | 1 | 0 | No |
| std.instruction_follow | 100% | best | — | — | — | No |
| std.json_structured | 100% | best | 1 | 1 | 1 | No |
| std.long_context | 0% | worst | — | — | — | No |
| std.math | 25% | worst | 0 | 0 | 1 | No |
| std.multiturn | 100% | best | — | — | — | No |
| std.reasoning | 8.3% | worst | 0 | 0 | 0.33 | No |
| std.safety_policy | 83.3% | good | — | — | — | No |
| std.slop.contradiction | 100% | best | — | — | — | No |
| std.slop.format | 0% | worst | — | — | — | No |
| std.slop.hallucination | 100% | best | — | — | — | No |
| std.slop.relevance | 100% | best | — | — | — | No |
| std.slop.topic_drift | 100% | best | — | — | — | No |
| std.slop.uncertainty | 100% | best | — | — | — | No |
| std.stability.refusal | 100% | best | — | — | — | No |
| std.stability.repeat | 100% | best | — | — | — | No |
| std.stability.schema | 100% | best | — | — | — | No |
| std.summarization | 85% | best | 0 | 1 | 1 | No |
| std.tool_use | 100% | best | — | — | — | No |
| std.translation | 0% | worst | — | — | — | No |
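
For the standards that report Easy, Medium, and Hard scores, the Composite column is consistent with a fixed weighted sum of the three. The weights below (0.15, 0.60, 0.25) are inferred from the rows of this table and are not stated by the benchmark, so this is a sketch under that assumption; the function name `composite` is hypothetical.

```python
# Weights inferred from the published rows; NOT stated by the source.
WEIGHTS = {"easy": 0.15, "medium": 0.60, "hard": 0.25}


def composite(easy: float, medium: float, hard: float) -> float:
    """Weighted difficulty score as a percentage, rounded to one decimal."""
    score = (
        WEIGHTS["easy"] * easy
        + WEIGHTS["medium"] * medium
        + WEIGHTS["hard"] * hard
    )
    return round(100 * score, 1)


# (easy, medium, hard) rows from the table above; 0.33 is read as 1/3.
rows = {
    "std.coding": (0, 1, 0),
    "std.json_structured": (1, 1, 1),
    "std.math": (0, 0, 1),
    "std.reasoning": (0, 0, 1 / 3),
    "std.summarization": (0, 1, 1),
}

for name, tiers in rows.items():
    print(name, composite(*tiers))
```

Under these assumed weights the five rows reproduce the listed composites (60%, 100%, 25%, 8.3%, 85%). Standards shown with dashes in the difficulty columns evidently use a different scoring path that the card does not describe.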
Slop profiles
| Standard | Violation % | Samples |
| --- | --- | --- |
| std.slop.contradiction | 0% | 3 |
| std.slop.format | 100% | 3 |
| std.slop.hallucination | 0% | 3 |
| std.slop.relevance | 0% | 3 |
| std.slop.topic_drift | 0% | 3 |
| std.slop.uncertainty | 0% | 3 |
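
In these cards, each std.slop.* composite in the Standards table equals 100% minus the Violation % listed here. A minimal sketch of that relationship, assuming the violation percentage is simply flagged samples over total samples (the per-sample counts are not published, and `slop_composite` is a hypothetical name):

```python
def slop_composite(violations: int, samples: int) -> float:
    """Composite score (percent) for a slop standard.

    Assumes composite = 100 - violation %, which matches the published
    rows but is not documented by the source.
    """
    if samples <= 0:
        raise ValueError("samples must be positive")
    violation_pct = 100 * violations / samples
    return round(100 - violation_pct, 1)


# With 3 samples per standard, only four outcomes are possible.
for flagged in range(4):
    print(flagged, slop_composite(flagged, 3))
```

With only 3 samples per standard, a single flagged output moves the composite by a third, so the best and worst tiers in this section rest on very little data.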
Compile gates
- G0 — status=complete
- G1 — all required standards present
- G2 — json_break=0/3
- G3 — composite=1.000
- G4 — safety_fn_rate=0
- G5 — safety_fp_rate=0
- G6 — latency_p95_ms=2533.5999999999995 cap_ms=10000 size_band=le8b
- G6b — throughput_p50_tps=20.61855670103093 floor_tps=20 size_band=le8b
- G7 — g7_skipped_insufficient_comparables
- G8 — provider_error_rate=0.000
- G9 — no easy_only_cap
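
G6 and G6b compare measured latency and throughput against a cap and a floor for the size band. The sketch below assumes G6 passes when P95 latency is at or below `cap_ms` and G6b passes when P50 throughput is at or above `floor_tps`. The comparison operators and the `PerfGate` class are assumptions for illustration, not the benchmark's own code.

```python
from dataclasses import dataclass


@dataclass
class PerfGate:
    """Hypothetical container for the G6 / G6b inputs shown in the cards."""

    latency_p95_ms: float
    throughput_p50_tps: float
    cap_ms: float = 10000.0   # cap listed for size_band=le8b
    floor_tps: float = 20.0   # floor listed for size_band=le8b

    def failures(self) -> list[str]:
        """Return the names of the performance gates that fail."""
        failed = []
        if self.latency_p95_ms > self.cap_ms:
            failed.append("G6")
        if self.throughput_p50_tps < self.floor_tps:
            failed.append("G6b")
        return failed


# Values from this card: both gates pass under the assumed rule.
print(PerfGate(2533.6, 20.62).failures())
```

Under this reading the card clears both gates, though its throughput sits only about 0.6 tps above the floor.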
Safety gates
- P0 — status=complete pack=bench-pack-safety-v2
- P1 — all safety standards present
- P6 — catalog_provenance=provider_standard
- P2 — safety_fn_rate=0.08333333333333333
- P3 — safety_fp_rate=0
- P4 — jailbreak_resist_rate=1
- P5 — harmful_output_rate=0
- P7 — pii_leak_rate=0
- P8 — bias_report_only=true
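
The long decimals in the safety gates are easier to read as fractions. A small sketch using Python's standard `fractions` module; the helper name `as_fraction` is hypothetical. A rate of 1/12 would be consistent with one miss out of 12 probes, but the number of probes per gate is not stated in the source.

```python
from fractions import Fraction


def as_fraction(rate: float, max_denominator: int = 100) -> Fraction:
    """Closest simple fraction to a reported rate."""
    return Fraction(rate).limit_denominator(max_denominator)


# P2 safety_fn_rate from this card.
print(as_fraction(0.08333333333333333))
```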
Weakness tags
- weakness:math_halluc (error_mode / ERR_HALLUCINATION)
| claude-haiku-4-5 | Anthropic | 78% | 44.2% | 16.7% | 60% | 0% | 100% | 83.3% | 91.7% | 0% | 0% | 56.8% | 50% | 100% | Meets compile bar, Approved |
Model key: claude-haiku-4-5
Model ID: claude-haiku-4-5
Size band: le8b
Provenance: provider_standard
Capability pack: bench-pack-v2
Safety pack: bench-pack-safety-v2
Latency P95: 4324 ms
Throughput P50: 31.8 tps
Cost / task: $0.000133
Strengths
- std.instruction_follow (best)
- std.json_structured (best)
- std.multiturn (best)
- std.safety_policy (good)
- std.slop.contradiction (best)
- std.slop.hallucination (best)
- std.slop.relevance (best)
- std.slop.topic_drift (best)
- std.slop.uncertainty (best)
- std.stability.refusal (best)
- std.stability.repeat (best)
- std.stability.schema (best)
- std.summarization (best)
Standards
| Standard | Composite | Tier | Easy | Medium | Hard | Easy-only cap |
| --- | --- | --- | --- | --- | --- | --- |
| std.coding | 60% | bad | 0 | 1 | 0 | No |
| std.instruction_follow | 100% | best | — | — | — | No |
| std.json_structured | 100% | best | 1 | 1 | 1 | No |
| std.long_context | 0% | worst | — | — | — | No |
| std.math | 0% | worst | 0 | 0 | 0 | No |
| std.multiturn | 100% | best | — | — | — | No |
| std.reasoning | 16.7% | worst | 0 | 0 | 0.67 | No |
| std.safety_policy | 83.3% | good | — | — | — | No |
| std.slop.contradiction | 100% | best | — | — | — | No |
| std.slop.format | 0% | worst | — | — | — | No |
| std.slop.hallucination | 100% | best | — | — | — | No |
| std.slop.relevance | 100% | best | — | — | — | No |
| std.slop.topic_drift | 100% | best | — | — | — | No |
| std.slop.uncertainty | 100% | best | — | — | — | No |
| std.stability.refusal | 100% | best | — | — | — | No |
| std.stability.repeat | 100% | best | — | — | — | No |
| std.stability.schema | 100% | best | — | — | — | No |
| std.summarization | 85% | best | 0 | 1 | 1 | No |
| std.tool_use | 0% | worst | — | — | — | No |
| std.translation | 0% | worst | — | — | — | No |
Slop profiles
| Standard | Violation % | Samples |
| --- | --- | --- |
| std.slop.contradiction | 0% | 3 |
| std.slop.format | 100% | 3 |
| std.slop.hallucination | 0% | 3 |
| std.slop.relevance | 0% | 3 |
| std.slop.topic_drift | 0% | 3 |
| std.slop.uncertainty | 0% | 3 |
Compile gates
- G0 — status=complete
- G1 — all required standards present
- G2 — json_break=0/3
- G3 — composite=1.000
- G4 — safety_fn_rate=0
- G5 — safety_fp_rate=0
- G6 — latency_p95_ms=4324 cap_ms=10000 size_band=le8b
- G6b — throughput_p50_tps=31.83023872679045 floor_tps=20 size_band=le8b
- G7 — g7_skipped_insufficient_comparables
- G8 — provider_error_rate=0.000
- G9 — no easy_only_cap
Safety gates
- P0 — status=complete pack=bench-pack-safety-v2
- P1 — all safety standards present
- P6 — catalog_provenance=provider_standard
- P2 — safety_fn_rate=0
- P3 — safety_fp_rate=0
- P4 — jailbreak_resist_rate=0.9166666666666666
- P5 — harmful_output_rate=0
- P7 — pii_leak_rate=0
- P8 — bias_report_only=true
Weakness tags
- weakness:math_halluc (error_mode / ERR_HALLUCINATION)
- weakness:no_structured (error_mode / ERR_JSON_BREAK)
| grok-3-mini | xAI | 54.8% | 52.1% | 28.3% | 60% | 100% | 80% | 100% | 91.7% | 0% | 0% | 49.2% | 50% | 100% | Below compile bar, Conditional |
Model key: grok-3-mini
Model ID: grok-3-mini
Size band: le8b
Provenance: provider_standard
Capability pack: bench-pack-v2
Safety pack: bench-pack-safety-v2
Latency P95: 5075.4 ms
Throughput P50: 4.7 tps
Cost / task: $0.000203
Strengths
- std.instruction_follow (good)
- std.json_structured (best)
- std.multiturn (best)
- std.safety_policy (best)
- std.slop.topic_drift (good)
- std.slop.uncertainty (best)
- std.stability.refusal (best)
- std.stability.repeat (best)
- std.stability.schema (best)
- std.summarization (best)
- std.tool_use (best)
Standards
| Standard | Composite | Tier | Easy | Medium | Hard | Easy-only cap |
| --- | --- | --- | --- | --- | --- | --- |
| std.coding | 60% | bad | 0 | 1 | 0 | No |
| std.instruction_follow | 66.7% | good | — | — | — | No |
| std.json_structured | 100% | best | 1 | 1 | 1 | No |
| std.long_context | 50% | bad | — | — | — | No |
| std.math | 20% | worst | 0 | 0.33 | 0 | No |
| std.multiturn | 100% | best | — | — | — | No |
| std.reasoning | 28.3% | worst | 0 | 0.33 | 0.33 | No |
| std.safety_policy | 100% | best | — | — | — | No |
| std.slop.contradiction | 0% | worst | — | — | — | No |
| std.slop.format | 0% | worst | — | — | — | No |
| std.slop.hallucination | 0% | worst | — | — | — | No |
| std.slop.relevance | 0% | worst | — | — | — | No |
| std.slop.topic_drift | 66.7% | good | — | — | — | No |
| std.slop.uncertainty | 100% | best | — | — | — | No |
| std.stability.refusal | 100% | best | — | — | — | No |
| std.stability.repeat | 100% | best | — | — | — | No |
| std.stability.schema | 100% | best | — | — | — | No |
| std.summarization | 85% | best | 0 | 1 | 1 | No |
| std.tool_use | 100% | best | — | — | — | No |
| std.translation | 0% | worst | — | — | — | No |
Slop profiles
| Standard | Violation % | Samples |
| --- | --- | --- |
| std.slop.contradiction | 100% | 3 |
| std.slop.format | 100% | 3 |
| std.slop.hallucination | 100% | 3 |
| std.slop.relevance | 100% | 3 |
| std.slop.topic_drift | 33.3% | 3 |
| std.slop.uncertainty | 0% | 3 |
Compile gates
- G0 — status=complete
- G1 — all required standards present
- G2 — json_break=0/3
- G3 — composite=1.000
- G4 — safety_fn_rate=0
- G5 — safety_fp_rate=0
- G6 — latency_p95_ms=5075.399999999998 cap_ms=10000 size_band=le8b
- G6b — throughput_p50_tps=4.653868528214078 floor_tps=20 size_band=le8b
- G7 — g7_skipped_insufficient_comparables
- G8 — provider_error_rate=0.000
- G9 — no easy_only_cap
Safety gates
- P0 — status=complete pack=bench-pack-safety-v2
- P1 — all safety standards present
- P6 — catalog_provenance=provider_standard
- P2 — safety_fn_rate=0.08333333333333333
- P3 — safety_fp_rate=0
- P4 — jailbreak_resist_rate=0.9166666666666666
- P5 — harmful_output_rate=0
- P7 — pii_leak_rate=0
- P8 — bias_report_only=true
Weakness tags
- weakness:math_halluc (error_mode / ERR_HALLUCINATION)
| mistral-small-latest | Mistral | 54.4% | 52.5% | 25% | 60% | 0% | 80% | 66.7% | 83.3% | 0% | 0% | 80.1% | 50% | 100% | Below compile bar, Blocked |
Model key: mistral-small-latest
Model ID: mistral-small-latest
Size band: le8b
Provenance: provider_standard
Capability pack: bench-pack-v2
Safety pack: bench-pack-safety-v2
Latency P95: 1988.3 ms
Throughput P50: 40.9 tps
Cost / task: $0.000058
Strengths
- std.json_structured (best)
- std.multiturn (best)
- std.safety_policy (good)
- std.slop.hallucination (best)
- std.slop.relevance (best)
- std.slop.topic_drift (best)
- std.slop.uncertainty (best)
- std.stability.refusal (best)
- std.stability.repeat (best)
- std.stability.schema (best)
- std.summarization (best)
- std.tool_use (best)
Standards
| Standard | Composite | Tier | Easy | Medium | Hard | Easy-only cap |
| --- | --- | --- | --- | --- | --- | --- |
| std.coding | 60% | bad | 0 | 1 | 0 | No |
| std.instruction_follow | 0% | worst | — | — | — | No |
| std.json_structured | 100% | best | 1 | 1 | 1 | No |
| std.long_context | 0% | worst | — | — | — | No |
| std.math | 25% | worst | 0 | 0 | 1 | No |
| std.multiturn | 100% | best | — | — | — | No |
| std.reasoning | 25% | worst | 0 | 0 | 1 | No |
| std.safety_policy | 66.7% | good | — | — | — | No |
| std.slop.contradiction | 0% | worst | — | — | — | No |
| std.slop.format | 0% | worst | — | — | — | No |
| std.slop.hallucination | 100% | best | — | — | — | No |
| std.slop.relevance | 100% | best | — | — | — | No |
| std.slop.topic_drift | 100% | best | — | — | — | No |
| std.slop.uncertainty | 100% | best | — | — | — | No |
| std.stability.refusal | 100% | best | — | — | — | No |
| std.stability.repeat | 100% | best | — | — | — | No |
| std.stability.schema | 100% | best | — | — | — | No |
| std.summarization | 85% | best | 0 | 1 | 1 | No |
| std.tool_use | 100% | best | — | — | — | No |
| std.translation | 0% | worst | — | — | — | No |
Slop profiles
| Standard | Violation % | Samples |
| --- | --- | --- |
| std.slop.contradiction | 100% | 3 |
| std.slop.format | 100% | 3 |
| std.slop.hallucination | 0% | 3 |
| std.slop.relevance | 0% | 3 |
| std.slop.topic_drift | 0% | 3 |
| std.slop.uncertainty | 0% | 3 |
Compile gates
- G0 — status=complete
- G1 — all required standards present
- G2 — json_break=0/3
- G3 — composite=1.000
- G4 — safety_fn_rate=0
- G5 — safety_fp_rate=0.5
- G6 — latency_p95_ms=1988.2999999999988 cap_ms=10000 size_band=le8b
- G6b — throughput_p50_tps=40.87193460490463 floor_tps=20 size_band=le8b
- G7 — g7_skipped_insufficient_comparables
- G8 — provider_error_rate=0.000
- G9 — no easy_only_cap
Safety gates
- P0 — status=complete pack=bench-pack-safety-v2
- P1 — all safety standards present
- P6 — catalog_provenance=provider_standard
- P2 — safety_fn_rate=0
- P3 — safety_fp_rate=0
- P4 — jailbreak_resist_rate=0.8333333333333334
- P5 — harmful_output_rate=0
- P7 — pii_leak_rate=0
- P8 — bias_report_only=true
Weakness tags
- weakness:math_halluc (error_mode / ERR_HALLUCINATION)
- weakness:over_refuse (safety_fp / )
| deepseek-v4-flash | DeepSeek | 45% | 40% | 0% | 60% | 33.3% | 80% | 100% | 91.7% | 0% | 0% | 0% | 50% | 100% | Below compile bar, Approved |
Model key: deepseek-v4-flash
Model ID: deepseek-v4-flash
Size band: le8b
Provenance: provider_standard
Capability pack: bench-pack-v2
Safety pack: bench-pack-safety-v2
Latency P95: 11864.7 ms
Throughput P50: 54 tps
Cost / task: $0.000107
Strengths
- std.instruction_follow (best)
- std.json_structured (best)
- std.multiturn (best)
- std.safety_policy (best)
- std.slop.hallucination (good)
- std.slop.relevance (good)
- std.slop.topic_drift (best)
- std.slop.uncertainty (best)
- std.stability.refusal (best)
- std.stability.repeat (best)
- std.stability.schema (best)
- std.summarization (best)
- std.tool_use (best)
Standards
| Standard | Composite | Tier | Easy | Medium | Hard | Easy-only cap |
| --- | --- | --- | --- | --- | --- | --- |
| std.coding | 60% | bad | 0 | 1 | 0 | No |
| std.instruction_follow | 100% | best | — | — | — | No |
| std.json_structured | 100% | best | 1 | 1 | 1 | No |
| std.long_context | 25% | worst | — | — | — | No |
| std.math | 0% | worst | 0 | 0 | 0 | No |
| std.multiturn | 100% | best | — | — | — | No |
| std.reasoning | 0% | worst | 0 | 0 | 0 | No |
| std.safety_policy | 100% | best | — | — | — | No |
| std.slop.contradiction | 0% | worst | — | — | — | No |
| std.slop.format | 0% | worst | — | — | — | No |
| std.slop.hallucination | 66.7% | good | — | — | — | No |
| std.slop.relevance | 66.7% | good | — | — | — | No |
| std.slop.topic_drift | 100% | best | — | — | — | No |
| std.slop.uncertainty | 100% | best | — | — | — | No |
| std.stability.refusal | 100% | best | — | — | — | No |
| std.stability.repeat | 100% | best | — | — | — | No |
| std.stability.schema | 100% | best | — | — | — | No |
| std.summarization | 85% | best | 0 | 1 | 1 | No |
| std.tool_use | 100% | best | — | — | — | No |
| std.translation | 0% | worst | — | — | — | No |
Slop profiles
| Standard | Violation % | Samples |
| --- | --- | --- |
| std.slop.contradiction | 100% | 3 |
| std.slop.format | 100% | 3 |
| std.slop.hallucination | 33.3% | 3 |
| std.slop.relevance | 33.3% | 3 |
| std.slop.topic_drift | 0% | 3 |
| std.slop.uncertainty | 0% | 3 |
Compile gates
- G0 — status=complete
- G1 — all required standards present
- G2 — json_break=0/3
- G3 — composite=1.000
- G4 — safety_fn_rate=0
- G5 — safety_fp_rate=0
- G6 — latency_p95_ms=11864.699999999993 cap_ms=10000 size_band=le8b
- G6b — throughput_p50_tps=53.966632618318016 floor_tps=20 size_band=le8b
- G7 — g7_skipped_insufficient_comparables
- G8 — provider_error_rate=0.000
- G9 — no easy_only_cap
Safety gates
- P0 — status=complete pack=bench-pack-safety-v2
- P1 — all safety standards present
- P6 — catalog_provenance=provider_standard
- P2 — safety_fn_rate=0
- P3 — safety_fp_rate=0
- P4 — jailbreak_resist_rate=0.9166666666666666
- P5 — harmful_output_rate=0
- P7 — pii_leak_rate=0
- P8 — bias_report_only=true
Weakness tags
- weakness:math_halluc (error_mode / ERR_HALLUCINATION)
- weakness:slow (probe / )
| gemini-2.5-flash | Google | 43.4% | 48.3% | 0% | 76.7% | 0% | 80% | 83.3% | 100% | 0% | 0% | 8.6% | 50% | 100% | Below compile bar, Blocked |
Model key: gemini-2.5-flash
Model ID: gemini-2.5-flash
Size band: le8b
Provenance: provider_standard
Capability pack: bench-pack-v2
Safety pack: bench-pack-safety-v2
Latency P95: 9142.4 ms
Throughput P50: 13.8 tps
Cost / task: $0.000047
Strengths
- std.coding (good)
- std.json_structured (best)
- std.multiturn (best)
- std.safety_policy (good)
- std.slop.hallucination (best)
- std.slop.relevance (good)
- std.slop.topic_drift (best)
- std.slop.uncertainty (best)
- std.stability.refusal (best)
- std.stability.repeat (best)
- std.stability.schema (best)
- std.summarization (best)
- std.tool_use (best)
Standards
| Standard | Composite | Tier | Easy | Medium | Hard | Easy-only cap |
| --- | --- | --- | --- | --- | --- | --- |
| std.coding | 76.7% | good | 0 | 1 | 0.67 | No |
| std.instruction_follow | 0% | worst | — | — | — | No |
| std.json_structured | 100% | best | 1 | 1 | 1 | No |
| std.long_context | 25% | worst | — | — | — | No |
| std.math | 16.7% | worst | 0 | 0 | 0.67 | No |
| std.multiturn | 100% | best | — | — | — | No |
| std.reasoning | 0% | worst | 0 | 0 | 0 | No |
| std.safety_policy | 83.3% | good | — | — | — | No |
| std.slop.contradiction | 0% | worst | — | — | — | No |
| std.slop.format | 0% | worst | — | — | — | No |
| std.slop.hallucination | 100% | best | — | — | — | No |
| std.slop.relevance | 66.7% | good | — | — | — | No |
| std.slop.topic_drift | 100% | best | — | — | — | No |
| std.slop.uncertainty | 100% | best | — | — | — | No |
| std.stability.refusal | 100% | best | — | — | — | No |
| std.stability.repeat | 100% | best | — | — | — | No |
| std.stability.schema | 100% | best | — | — | — | No |
| std.summarization | 85% | best | 0 | 1 | 1 | No |
| std.tool_use | 100% | best | — | — | — | No |
| std.translation | 0% | worst | — | — | — | No |
Slop profiles
| Standard | Violation % | Samples |
| --- | --- | --- |
| std.slop.contradiction | 100% | 3 |
| std.slop.format | 100% | 3 |
| std.slop.hallucination | 0% | 3 |
| std.slop.relevance | 33.3% | 3 |
| std.slop.topic_drift | 0% | 3 |
| std.slop.uncertainty | 0% | 3 |
Compile gates
- G0 — status=complete
- G1 — all required standards present
- G2 — json_break=0/3
- G3 — composite=1.000
- G4 — safety_fn_rate=0
- G5 — safety_fp_rate=0
- G6 — latency_p95_ms=9142.399999999998 cap_ms=10000 size_band=le8b
- G6b — throughput_p50_tps=13.77049180327869 floor_tps=20 size_band=le8b
- G7 — g7_skipped_insufficient_comparables
- G8 — provider_error_rate=0.000
- G9 — no easy_only_cap
Safety gates
- P0 — status=complete pack=bench-pack-safety-v2
- P1 — all safety standards present
- P6 — catalog_provenance=provider_standard
- P2 — safety_fn_rate=0.08333333333333333
- P3 — safety_fp_rate=0
- P4 — jailbreak_resist_rate=1
- P5 — harmful_output_rate=0.08333333333333333
- P7 — pii_leak_rate=0
- P8 — bias_report_only=true
Weakness tags
- weakness:math_halluc (error_mode / ERR_HALLUCINATION)
|