All model metrics

Search, sort, and compare reference board models in the selected size class (direct vendor APIs).

How to read scores

Each number is a 0–100 checklist composite from our lab battery — not a real-world accuracy percentage or a user preference ranking.

85+ Strong on the tested checklist
65–84 Solid, with room to improve
<65 Early or mixed results — common on strict v1 gates
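
The three bands above can be expressed as a small lookup. This is a minimal sketch rather than the board's own code: the name `score_band` is hypothetical, and it assumes fractional scores between 84 and 85 fall into the middle band, which the legend does not specify.

```python
def score_band(score: float) -> str:
    """Map a 0-100 checklist composite to one of the legend's three bands."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 85:
        return "Strong on the tested checklist"
    if score >= 65:
        return "Solid, with room to improve"
    return "Early or mixed results"


# Example values only; any 0-100 composite from the board can be passed in.
for value in (100.0, 83.3, 60.0):
    print(value, "->", score_band(value))
```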

Compile means the model passed engine routing gates for deployment; it is not a product endorsement.

Fast models: one curated pick per major direct API vendor (OpenAI, Anthropic, xAI, Google Gemini, Mistral, DeepSeek).

Column groups: Identity, Capability, Safety, Performance, Status

Model	Vendor	Deploy	Accuracy	Reasoning	Coding	Slop	Reliability	Cap. safety	Jailbreak	PII	Bias	Latency	Cost	Stability	Badges

gpt-4.1-mini

OpenAI

81.6%

48.3%

8.3%

60%

100%

83.3%

100%

74.7%

50%

100%

Meets compile bar, Conditional

Model key: gpt-4.1-mini

Model ID: gpt-4.1-mini

Size band: le8b

Provenance: provider_standard

Capability pack: bench-pack-v2

Safety pack: bench-pack-safety-v2

Latency P95: 2533.6 ms

Throughput P50: 20.6 tps

Cost / task: $0.00007
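
The detail fields above are plain `Key: value` lines, so they are easy to load for offline comparison. A minimal sketch, assuming the units shown here (`ms`, `tps`, and a leading `$`); `parse_details` and `UNITS` are hypothetical names, not part of the board.

```python
# Numeric fields and the unit suffix each one carries on this page.
UNITS = {"Latency P95": "ms", "Throughput P50": "tps", "Cost / task": ""}


def parse_details(text: str) -> dict:
    """Parse 'Key: value' lines; convert the three numeric fields to floats."""
    out = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if not sep:
            continue  # skip blank or malformed lines
        key, value = key.strip(), value.strip()
        if key in UNITS:
            number = value.lstrip("$").removesuffix(UNITS[key]).strip()
            out[key] = float(number)
        else:
            out[key] = value
    return out


details = parse_details(
    "Model key: gpt-4.1-mini\n"
    "Size band: le8b\n"
    "Latency P95: 2533.6 ms\n"
    "Throughput P50: 20.6 tps\n"
    "Cost / task: $0.00007\n"
)
print(details)
```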

Strengths

std.instruction_follow (best)
std.json_structured (best)
std.multiturn (best)
std.safety_policy (good)
std.slop.contradiction (best)
std.slop.hallucination (best)
std.slop.relevance (best)
std.slop.topic_drift (best)
std.slop.uncertainty (best)
std.stability.refusal (best)
std.stability.repeat (best)
std.stability.schema (best)
std.summarization (best)
std.tool_use (best)

Standards

Standard	Composite	Tier	Easy	Medium	Hard	Easy-only cap
std.coding	60%	bad	0	1	0	No
std.instruction_follow	100%	best	—	—	—	No
std.json_structured	100%	best	1	1	1	No
std.long_context	0%	worst	—	—	—	No
std.math	25%	worst	0	0	1	No
std.multiturn	100%	best	—	—	—	No
std.reasoning	8.3%	worst	0	0	0.33	No
std.safety_policy	83.3%	good	—	—	—	No
std.slop.contradiction	100%	best	—	—	—	No
std.slop.format	0%	worst	—	—	—	No
std.slop.hallucination	100%	best	—	—	—	No
std.slop.relevance	100%	best	—	—	—	No
std.slop.topic_drift	100%	best	—	—	—	No
std.slop.uncertainty	100%	best	—	—	—	No
std.stability.refusal	100%	best	—	—	—	No
std.stability.repeat	100%	best	—	—	—	No
std.stability.schema	100%	best	—	—	—	No
std.summarization	85%	best	0	1	1	No
std.tool_use	100%	best	—	—	—	No
std.translation	0%	worst	—	—	—	No
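
Where the Easy, Medium, and Hard columns are filled in, the Composite values in this table are consistent with a fixed weighted sum of the three. The weights below (0.15, 0.60, 0.25) are inferred from the rows on this page, not taken from the methodology, so treat this as a sketch; `composite_from_tiers` is a hypothetical name.

```python
# Inferred weights (an assumption): they reproduce the listed composites.
WEIGHTS = {"easy": 0.15, "medium": 0.60, "hard": 0.25}


def composite_from_tiers(easy: float, medium: float, hard: float) -> float:
    """Return a 0-100 composite from per-difficulty pass rates in [0, 1]."""
    total = (WEIGHTS["easy"] * easy
             + WEIGHTS["medium"] * medium
             + WEIGHTS["hard"] * hard)
    return round(100 * total, 1)


rows = {
    "std.coding": (0, 1, 0),           # table shows 60%
    "std.json_structured": (1, 1, 1),  # table shows 100%
    "std.math": (0, 0, 1),             # table shows 25%
    "std.reasoning": (0, 0, 1 / 3),    # table shows 8.3% (0.33 read as 1/3)
    "std.summarization": (0, 1, 1),    # table shows 85%
}
for name, tiers in rows.items():
    print(name, composite_from_tiers(*tiers))
```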

Slop profiles

Standard	Violation %	Samples
std.slop.contradiction	0%	3
std.slop.format	100%	3
std.slop.hallucination	0%	3
std.slop.relevance	0%	3
std.slop.topic_drift	0%	3
std.slop.uncertainty	0%	3
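
In the rows shown on this page, each std.slop.* Composite equals 100 minus the Violation % in the slop profile, and every profile uses 3 samples. A rough sketch of that relationship follows; the function name and the list-of-flags input are assumptions made for illustration.

```python
def slop_profile(violations: list[bool]) -> tuple[float, float]:
    """Return (violation %, composite %) for per-sample violation flags."""
    if not violations:
        raise ValueError("at least one sample is required")
    violation_pct = round(100 * sum(violations) / len(violations), 1)
    return violation_pct, round(100 - violation_pct, 1)


print(slop_profile([False, False, False]))  # no violations in 3 samples
print(slop_profile([True, False, False]))   # 1 of 3 samples violated
print(slop_profile([True, True, True]))     # all 3 samples violated
```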

Compile gates

G0 — status=complete
G1 — all required standards present
G2 — json_break=0/3
G3 — composite=1.000
G4 — safety_fn_rate=0
G5 — safety_fp_rate=0
G6 — latency_p95_ms=2533.5999999999995 cap_ms=10000 size_band=le8b
G6b — throughput_p50_tps=20.61855670103093 floor_tps=20 size_band=le8b
G7 — g7_skipped_insufficient_comparables
G8 — provider_error_rate=0.000
G9 — no easy_only_cap
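
Gate lines are space-separated `key=value` pairs after the gate ID. The sketch below parses one and applies the two size-band limits that G6 and G6b report. The pass rule (P95 latency at or under `cap_ms`, P50 throughput at or above `floor_tps`) is an assumption read off the field names, since the page does not state it; `parse_gate` and `within_limits` are hypothetical names.

```python
def parse_gate(line: str) -> tuple[str, dict[str, str]]:
    """Split a gate line into its ID and a dict of key=value fields."""
    gate_id, *tokens = line.split()
    # Tokens without '=' (such as the dash separator) are ignored.
    fields = dict(tok.split("=", 1) for tok in tokens if "=" in tok)
    return gate_id, fields


def within_limits(g6: str, g6b: str) -> bool:
    """Assumed rule: latency <= cap_ms and throughput >= floor_tps."""
    _, lat = parse_gate(g6)
    _, tps = parse_gate(g6b)
    return (float(lat["latency_p95_ms"]) <= float(lat["cap_ms"])
            and float(tps["throughput_p50_tps"]) >= float(tps["floor_tps"]))


g6 = "G6 \u2014 latency_p95_ms=2533.5999999999995 cap_ms=10000 size_band=le8b"
g6b = "G6b \u2014 throughput_p50_tps=20.61855670103093 floor_tps=20 size_band=le8b"
print(parse_gate(g6))
print(within_limits(g6, g6b))
```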

Safety gates

P0 — status=complete pack=bench-pack-safety-v2
P1 — all safety standards present
P6 — catalog_provenance=provider_standard
P2 — safety_fn_rate=0.08333333333333333
P3 — safety_fp_rate=0
P4 — jailbreak_resist_rate=1
P5 — harmful_output_rate=0
P7 — pii_leak_rate=0
P8 — bias_report_only=true
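
The safety gate rates are raw fractions. The sketch below converts one to a one-decimal percentage and recovers a small-denominator fraction with the standard library's `Fraction.limit_denominator`. Reading 0.0833... as 1 case in 12 is an inference from the value itself, not a documented sample size, and both helper names are hypothetical.

```python
from fractions import Fraction


def as_percent(rate: float) -> float:
    """Convert a 0-1 rate to a percentage rounded to one decimal."""
    return round(100 * rate, 1)


def as_fraction(rate: float, max_denominator: int = 100) -> Fraction:
    """Approximate a float rate by the nearest small-denominator fraction."""
    return Fraction(rate).limit_denominator(max_denominator)


for rate in (0.08333333333333333, 1, 0):
    print(rate, as_percent(rate), as_fraction(rate))
```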

Weakness tags

weakness:math_halluc (error_mode / ERR_HALLUCINATION)

claude-haiku-4-5

Anthropic

78%

44.2%

16.7%

60%

100%

83.3%

91.7%

56.8%

50%

100%

Meets compile bar, Approved

Model key: claude-haiku-4-5

Model ID: claude-haiku-4-5

Size band: le8b

Provenance: provider_standard

Capability pack: bench-pack-v2

Safety pack: bench-pack-safety-v2

Latency P95: 4324 ms

Throughput P50: 31.8 tps

Cost / task: $0.000133

Strengths

std.instruction_follow (best)
std.json_structured (best)
std.multiturn (best)
std.safety_policy (good)
std.slop.contradiction (best)
std.slop.hallucination (best)
std.slop.relevance (best)
std.slop.topic_drift (best)
std.slop.uncertainty (best)
std.stability.refusal (best)
std.stability.repeat (best)
std.stability.schema (best)
std.summarization (best)

Standards

Standard	Composite	Tier	Easy	Medium	Hard	Easy-only cap
std.coding	60%	bad	0	1	0	No
std.instruction_follow	100%	best	—	—	—	No
std.json_structured	100%	best	1	1	1	No
std.long_context	0%	worst	—	—	—	No
std.math	0%	worst	0	0	0	No
std.multiturn	100%	best	—	—	—	No
std.reasoning	16.7%	worst	0	0	0.67	No
std.safety_policy	83.3%	good	—	—	—	No
std.slop.contradiction	100%	best	—	—	—	No
std.slop.format	0%	worst	—	—	—	No
std.slop.hallucination	100%	best	—	—	—	No
std.slop.relevance	100%	best	—	—	—	No
std.slop.topic_drift	100%	best	—	—	—	No
std.slop.uncertainty	100%	best	—	—	—	No
std.stability.refusal	100%	best	—	—	—	No
std.stability.repeat	100%	best	—	—	—	No
std.stability.schema	100%	best	—	—	—	No
std.summarization	85%	best	0	1	1	No
std.tool_use	0%	worst	—	—	—	No
std.translation	0%	worst	—	—	—	No
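
On this page, each model's Strengths list matches the rows of its Standards table whose Tier is `best` or `good`. The sketch below reproduces that filter for a few of the rows above. The selection rule is an observation from the listed data rather than documented behavior, and `strengths` is a hypothetical helper.

```python
import csv
import io

# A few rows copied from the table above; \u2014 is the empty-cell dash.
TABLE = (
    "Standard\tComposite\tTier\tEasy\tMedium\tHard\tEasy-only cap\n"
    "std.coding\t60%\tbad\t0\t1\t0\tNo\n"
    "std.json_structured\t100%\tbest\t1\t1\t1\tNo\n"
    "std.safety_policy\t83.3%\tgood\t\u2014\t\u2014\t\u2014\tNo\n"
    "std.tool_use\t0%\tworst\t\u2014\t\u2014\t\u2014\tNo\n"
)


def strengths(tsv: str) -> list[str]:
    """List 'standard (tier)' for rows whose tier is best or good."""
    reader = csv.DictReader(io.StringIO(tsv), delimiter="\t")
    return [f"{row['Standard']} ({row['Tier']})"
            for row in reader if row["Tier"] in {"best", "good"}]


print(strengths(TABLE))
```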

Slop profiles

Standard	Violation %	Samples
std.slop.contradiction	0%	3
std.slop.format	100%	3
std.slop.hallucination	0%	3
std.slop.relevance	0%	3
std.slop.topic_drift	0%	3
std.slop.uncertainty	0%	3

Compile gates

G0 — status=complete
G1 — all required standards present
G2 — json_break=0/3
G3 — composite=1.000
G4 — safety_fn_rate=0
G5 — safety_fp_rate=0
G6 — latency_p95_ms=4324 cap_ms=10000 size_band=le8b
G6b — throughput_p50_tps=31.83023872679045 floor_tps=20 size_band=le8b
G7 — g7_skipped_insufficient_comparables
G8 — provider_error_rate=0.000
G9 — no easy_only_cap

Safety gates

P0 — status=complete pack=bench-pack-safety-v2
P1 — all safety standards present
P6 — catalog_provenance=provider_standard
P2 — safety_fn_rate=0
P3 — safety_fp_rate=0
P4 — jailbreak_resist_rate=0.9166666666666666
P5 — harmful_output_rate=0
P7 — pii_leak_rate=0
P8 — bias_report_only=true

Weakness tags

weakness:math_halluc (error_mode / ERR_HALLUCINATION)
weakness:no_structured (error_mode / ERR_JSON_BREAK)
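
Weakness tags follow a `weakness:<name> (<source> / <code>)` shape, and the code part is left empty on some tags elsewhere on this page. A small sketch for splitting them; the field names `name`, `source`, and `code` are labels chosen here, not terms from the board.

```python
import re
from typing import NamedTuple, Optional


class Weakness(NamedTuple):
    name: str
    source: str
    code: Optional[str]  # None when the tag carries no code


_TAG = re.compile(r"^weakness:(\w+) \((\w+) / (\w*)\)$")


def parse_weakness(line: str) -> Weakness:
    """Parse one weakness tag line into its three parts."""
    match = _TAG.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized weakness tag: {line!r}")
    name, source, code = match.groups()
    return Weakness(name, source, code or None)


print(parse_weakness("weakness:math_halluc (error_mode / ERR_HALLUCINATION)"))
print(parse_weakness("weakness:no_structured (error_mode / ERR_JSON_BREAK)"))
```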

grok-3-mini

xAI

54.8%

52.1%

28.3%

60%

100%

80%

100%

91.7%

49.2%

50%

100%

Below compile bar, Conditional

Model key: grok-3-mini

Model ID: grok-3-mini

Size band: le8b

Provenance: provider_standard

Capability pack: bench-pack-v2

Safety pack: bench-pack-safety-v2

Latency P95: 5075.4 ms

Throughput P50: 4.7 tps

Cost / task: $0.000203

Strengths

std.instruction_follow (good)
std.json_structured (best)
std.multiturn (best)
std.safety_policy (best)
std.slop.topic_drift (good)
std.slop.uncertainty (best)
std.stability.refusal (best)
std.stability.repeat (best)
std.stability.schema (best)
std.summarization (best)
std.tool_use (best)

Standards

Standard	Composite	Tier	Easy	Medium	Hard	Easy-only cap
std.coding	60%	bad	0	1	0	No
std.instruction_follow	66.7%	good	—	—	—	No
std.json_structured	100%	best	1	1	1	No
std.long_context	50%	bad	—	—	—	No
std.math	20%	worst	0	0.33	0	No
std.multiturn	100%	best	—	—	—	No
std.reasoning	28.3%	worst	0	0.33	0.33	No
std.safety_policy	100%	best	—	—	—	No
std.slop.contradiction	0%	worst	—	—	—	No
std.slop.format	0%	worst	—	—	—	No
std.slop.hallucination	0%	worst	—	—	—	No
std.slop.relevance	0%	worst	—	—	—	No
std.slop.topic_drift	66.7%	good	—	—	—	No
std.slop.uncertainty	100%	best	—	—	—	No
std.stability.refusal	100%	best	—	—	—	No
std.stability.repeat	100%	best	—	—	—	No
std.stability.schema	100%	best	—	—	—	No
std.summarization	85%	best	0	1	1	No
std.tool_use	100%	best	—	—	—	No
std.translation	0%	worst	—	—	—	No

Slop profiles

Standard	Violation %	Samples
std.slop.contradiction	100%	3
std.slop.format	100%	3
std.slop.hallucination	100%	3
std.slop.relevance	100%	3
std.slop.topic_drift	33.3%	3
std.slop.uncertainty	0%	3

Compile gates

G0 — status=complete
G1 — all required standards present
G2 — json_break=0/3
G3 — composite=1.000
G4 — safety_fn_rate=0
G5 — safety_fp_rate=0
G6 — latency_p95_ms=5075.399999999998 cap_ms=10000 size_band=le8b
G6b — throughput_p50_tps=4.653868528214078 floor_tps=20 size_band=le8b
G7 — g7_skipped_insufficient_comparables
G8 — provider_error_rate=0.000
G9 — no easy_only_cap

Safety gates

P0 — status=complete pack=bench-pack-safety-v2
P1 — all safety standards present
P6 — catalog_provenance=provider_standard
P2 — safety_fn_rate=0.08333333333333333
P3 — safety_fp_rate=0
P4 — jailbreak_resist_rate=0.9166666666666666
P5 — harmful_output_rate=0
P7 — pii_leak_rate=0
P8 — bias_report_only=true

Weakness tags

weakness:math_halluc (error_mode / ERR_HALLUCINATION)

mistral-small-latest

Mistral

54.4%

52.5%

25%

60%

80%

66.7%

83.3%

80.1%

50%

100%

Below compile bar, Blocked

Model key: mistral-small-latest

Model ID: mistral-small-latest

Size band: le8b

Provenance: provider_standard

Capability pack: bench-pack-v2

Safety pack: bench-pack-safety-v2

Latency P95: 1988.3 ms

Throughput P50: 40.9 tps

Cost / task: $0.000058

Strengths

std.json_structured (best)
std.multiturn (best)
std.safety_policy (good)
std.slop.hallucination (best)
std.slop.relevance (best)
std.slop.topic_drift (best)
std.slop.uncertainty (best)
std.stability.refusal (best)
std.stability.repeat (best)
std.stability.schema (best)
std.summarization (best)
std.tool_use (best)

Standards

Standard	Composite	Tier	Easy	Medium	Hard	Easy-only cap
std.coding	60%	bad	0	1	0	No
std.instruction_follow	0%	worst	—	—	—	No
std.json_structured	100%	best	1	1	1	No
std.long_context	0%	worst	—	—	—	No
std.math	25%	worst	0	0	1	No
std.multiturn	100%	best	—	—	—	No
std.reasoning	25%	worst	0	0	1	No
std.safety_policy	66.7%	good	—	—	—	No
std.slop.contradiction	0%	worst	—	—	—	No
std.slop.format	0%	worst	—	—	—	No
std.slop.hallucination	100%	best	—	—	—	No
std.slop.relevance	100%	best	—	—	—	No
std.slop.topic_drift	100%	best	—	—	—	No
std.slop.uncertainty	100%	best	—	—	—	No
std.stability.refusal	100%	best	—	—	—	No
std.stability.repeat	100%	best	—	—	—	No
std.stability.schema	100%	best	—	—	—	No
std.summarization	85%	best	0	1	1	No
std.tool_use	100%	best	—	—	—	No
std.translation	0%	worst	—	—	—	No

Slop profiles

Standard	Violation %	Samples
std.slop.contradiction	100%	3
std.slop.format	100%	3
std.slop.hallucination	0%	3
std.slop.relevance	0%	3
std.slop.topic_drift	0%	3
std.slop.uncertainty	0%	3

Compile gates

G0 — status=complete
G1 — all required standards present
G2 — json_break=0/3
G3 — composite=1.000
G4 — safety_fn_rate=0
G5 — safety_fp_rate=0.5
G6 — latency_p95_ms=1988.2999999999988 cap_ms=10000 size_band=le8b
G6b — throughput_p50_tps=40.87193460490463 floor_tps=20 size_band=le8b
G7 — g7_skipped_insufficient_comparables
G8 — provider_error_rate=0.000
G9 — no easy_only_cap

Safety gates

P0 — status=complete pack=bench-pack-safety-v2
P1 — all safety standards present
P6 — catalog_provenance=provider_standard
P2 — safety_fn_rate=0
P3 — safety_fp_rate=0
P4 — jailbreak_resist_rate=0.8333333333333334
P5 — harmful_output_rate=0
P7 — pii_leak_rate=0
P8 — bias_report_only=true

Weakness tags

weakness:math_halluc (error_mode / ERR_HALLUCINATION)
weakness:over_refuse (safety_fp / )

deepseek-v4-flash

DeepSeek

45%

40%

60%

33.3%

80%

100%

91.7%

50%

100%

Below compile bar, Approved

Model key: deepseek-v4-flash

Model ID: deepseek-v4-flash

Size band: le8b

Provenance: provider_standard

Capability pack: bench-pack-v2

Safety pack: bench-pack-safety-v2

Latency P95: 11864.7 ms

Throughput P50: 54 tps

Cost / task: $0.000107

Strengths

std.instruction_follow (best)
std.json_structured (best)
std.multiturn (best)
std.safety_policy (best)
std.slop.hallucination (good)
std.slop.relevance (good)
std.slop.topic_drift (best)
std.slop.uncertainty (best)
std.stability.refusal (best)
std.stability.repeat (best)
std.stability.schema (best)
std.summarization (best)
std.tool_use (best)

Standards

Standard	Composite	Tier	Easy	Medium	Hard	Easy-only cap
std.coding	60%	bad	0	1	0	No
std.instruction_follow	100%	best	—	—	—	No
std.json_structured	100%	best	1	1	1	No
std.long_context	25%	worst	—	—	—	No
std.math	0%	worst	0	0	0	No
std.multiturn	100%	best	—	—	—	No
std.reasoning	0%	worst	0	0	0	No
std.safety_policy	100%	best	—	—	—	No
std.slop.contradiction	0%	worst	—	—	—	No
std.slop.format	0%	worst	—	—	—	No
std.slop.hallucination	66.7%	good	—	—	—	No
std.slop.relevance	66.7%	good	—	—	—	No
std.slop.topic_drift	100%	best	—	—	—	No
std.slop.uncertainty	100%	best	—	—	—	No
std.stability.refusal	100%	best	—	—	—	No
std.stability.repeat	100%	best	—	—	—	No
std.stability.schema	100%	best	—	—	—	No
std.summarization	85%	best	0	1	1	No
std.tool_use	100%	best	—	—	—	No
std.translation	0%	worst	—	—	—	No

Slop profiles

Standard	Violation %	Samples
std.slop.contradiction	100%	3
std.slop.format	100%	3
std.slop.hallucination	33.3%	3
std.slop.relevance	33.3%	3
std.slop.topic_drift	0%	3
std.slop.uncertainty	0%	3

Compile gates

G0 — status=complete
G1 — all required standards present
G2 — json_break=0/3
G3 — composite=1.000
G4 — safety_fn_rate=0
G5 — safety_fp_rate=0
G6 — latency_p95_ms=11864.699999999993 cap_ms=10000 size_band=le8b
G6b — throughput_p50_tps=53.966632618318016 floor_tps=20 size_band=le8b
G7 — g7_skipped_insufficient_comparables
G8 — provider_error_rate=0.000
G9 — no easy_only_cap

Safety gates

P0 — status=complete pack=bench-pack-safety-v2
P1 — all safety standards present
P6 — catalog_provenance=provider_standard
P2 — safety_fn_rate=0
P3 — safety_fp_rate=0
P4 — jailbreak_resist_rate=0.9166666666666666
P5 — harmful_output_rate=0
P7 — pii_leak_rate=0
P8 — bias_report_only=true

Weakness tags

weakness:math_halluc (error_mode / ERR_HALLUCINATION)
weakness:slow (probe / )

gemini-2.5-flash

Google

43.4%

48.3%

76.7%

80%

83.3%

100%

8.6%

50%

100%

Below compile bar, Blocked

Model key: gemini-2.5-flash

Model ID: gemini-2.5-flash

Size band: le8b

Provenance: provider_standard

Capability pack: bench-pack-v2

Safety pack: bench-pack-safety-v2

Latency P95: 9142.4 ms

Throughput P50: 13.8 tps

Cost / task: $0.000047

Strengths

std.coding (good)
std.json_structured (best)
std.multiturn (best)
std.safety_policy (good)
std.slop.hallucination (best)
std.slop.relevance (good)
std.slop.topic_drift (best)
std.slop.uncertainty (best)
std.stability.refusal (best)
std.stability.repeat (best)
std.stability.schema (best)
std.summarization (best)
std.tool_use (best)

Standards

Standard	Composite	Tier	Easy	Medium	Hard	Easy-only cap
std.coding	76.7%	good	0	1	0.67	No
std.instruction_follow	0%	worst	—	—	—	No
std.json_structured	100%	best	1	1	1	No
std.long_context	25%	worst	—	—	—	No
std.math	16.7%	worst	0	0	0.67	No
std.multiturn	100%	best	—	—	—	No
std.reasoning	0%	worst	0	0	0	No
std.safety_policy	83.3%	good	—	—	—	No
std.slop.contradiction	0%	worst	—	—	—	No
std.slop.format	0%	worst	—	—	—	No
std.slop.hallucination	100%	best	—	—	—	No
std.slop.relevance	66.7%	good	—	—	—	No
std.slop.topic_drift	100%	best	—	—	—	No
std.slop.uncertainty	100%	best	—	—	—	No
std.stability.refusal	100%	best	—	—	—	No
std.stability.repeat	100%	best	—	—	—	No
std.stability.schema	100%	best	—	—	—	No
std.summarization	85%	best	0	1	1	No
std.tool_use	100%	best	—	—	—	No
std.translation	0%	worst	—	—	—	No

Slop profiles

Standard	Violation %	Samples
std.slop.contradiction	100%	3
std.slop.format	100%	3
std.slop.hallucination	0%	3
std.slop.relevance	33.3%	3
std.slop.topic_drift	0%	3
std.slop.uncertainty	0%	3

Compile gates

G0 — status=complete
G1 — all required standards present
G2 — json_break=0/3
G3 — composite=1.000
G4 — safety_fn_rate=0
G5 — safety_fp_rate=0
G6 — latency_p95_ms=9142.399999999998 cap_ms=10000 size_band=le8b
G6b — throughput_p50_tps=13.77049180327869 floor_tps=20 size_band=le8b
G7 — g7_skipped_insufficient_comparables
G8 — provider_error_rate=0.000
G9 — no easy_only_cap

Safety gates

P0 — status=complete pack=bench-pack-safety-v2
P1 — all safety standards present
P6 — catalog_provenance=provider_standard
P2 — safety_fn_rate=0.08333333333333333
P3 — safety_fp_rate=0
P4 — jailbreak_resist_rate=1
P5 — harmful_output_rate=0.08333333333333333
P7 — pii_leak_rate=0
P8 — bias_report_only=true

Weakness tags

weakness:math_halluc (error_mode / ERR_HALLUCINATION)
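
The per-model detail values listed on this page (Latency P95, Throughput P50, Cost / task) can also be sorted and compared offline. A minimal sketch using those values as given; the `ModelPerf` class, its field names, and `sort_models` are hypothetical and only illustrate the comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPerf:
    model: str
    latency_p95_ms: float
    throughput_p50_tps: float
    cost_per_task_usd: float


# Values copied from the detail blocks on this page.
MODELS = [
    ModelPerf("gpt-4.1-mini", 2533.6, 20.6, 0.00007),
    ModelPerf("claude-haiku-4-5", 4324.0, 31.8, 0.000133),
    ModelPerf("grok-3-mini", 5075.4, 4.7, 0.000203),
    ModelPerf("mistral-small-latest", 1988.3, 40.9, 0.000058),
    ModelPerf("deepseek-v4-flash", 11864.7, 54.0, 0.000107),
    ModelPerf("gemini-2.5-flash", 9142.4, 13.8, 0.000047),
]


def sort_models(models, field: str, descending: bool = False):
    """Return models ordered by one numeric field."""
    return sorted(models, key=lambda m: getattr(m, field), reverse=descending)


for m in sort_models(MODELS, "cost_per_task_usd"):
    print(f"{m.model:<22} ${m.cost_per_task_usd:.6f}  {m.latency_p95_ms:>8.1f} ms")
```

Passing `"throughput_p50_tps"` with `descending=True` ranks the same list from fastest to slowest token rate.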