evals

Model Benchmarks

Evals we built to break models on purpose: procedural generation against contamination, blind engine-vs-engine combat, and a measurement of how often the safety layer says no to homework. Further down: benchmarks we found online and liked enough to run new models against.

systems

js262-faceoff

Five JavaScript engines written in C, judged blind under test262, head-to-head across five rounds. A 5.19× spread between the best and worst attempt — a series record. The full paper covers the contest design, the survivorship trap that zeroed a returning champion, and why optimal policy decouples proof from progress.

five engines in C, judged blind — full paper below

JS262-FACEOFF: FULL PAPER · DOWNLOAD PDF ↓

safety

Refusals — Fable 5 Classifier FPR

A stratified MMLU probe of Fable 5's safety classifier: 36% of benign academic questions were blocked, including 100% of high-school biology, while the control model answered all 120. The banding separates pure false positives from research-adjacent material, so the over-triggering is legible as a concentrated bio/chem misfire rather than global imprecision. A snapshot of one configuration, not a permanent property.
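For readers who want to see why stratifying matters, here is a minimal sketch of a per-stratum block rate. It is not the report's code: the record shape, the function names, and the toy data are assumptions made for illustration, and only the `high_school_biology` label echoes a subject named above.

```python
from collections import defaultdict

def blocked_rate_by_stratum(records):
    """Return {stratum: fraction blocked} for benign prompts.

    `records` is an iterable of (stratum, was_blocked) pairs. Every prompt
    is assumed benign, so each block counts as a false positive.
    """
    totals = defaultdict(int)
    blocked = defaultdict(int)
    for stratum, was_blocked in records:
        totals[stratum] += 1
        if was_blocked:
            blocked[stratum] += 1
    return {s: blocked[s] / totals[s] for s in totals}

def overall_blocked_rate(records):
    """Return the pooled block rate across all strata."""
    records = list(records)
    return sum(1 for _, b in records if b) / len(records)

# Made-up toy data, NOT figures from the report.
toy = [
    ("high_school_biology", True),
    ("high_school_biology", True),
    ("hypothetical_stratum_a", False),
    ("hypothetical_stratum_a", False),
    ("hypothetical_stratum_b", True),
    ("hypothetical_stratum_b", False),
]

if __name__ == "__main__":
    print(overall_blocked_rate(toy))
    print(blocked_rate_by_stratum(toy))
```

The pooled rate alone hides where the blocks land; the per-stratum view is what lets a concentrated misfire show up as concentrated.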

stratified MMLU probe — report below

FABLE 5 CLASSIFIER FPR: REPORT · DOWNLOAD PDF ↓

kaggle

GSM8K BenchMaxxed ↗

A procedural math benchmark that's contamination-proof by construction — every run generates fresh problems, so no model has ever seen the test set. Reports a Fragility Index: how fast a model falls apart when the problem shifts under it. Live on Kaggle.
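The idea of generating fresh problems on every run can be sketched roughly as below. This is an illustrative toy, not the benchmark's generator: the problem template, the parameter ranges, and the names `make_problem` and `make_test_set` are all assumptions. The Fragility Index is not reproduced here because its formula is not given on this page.

```python
import random

def make_problem(rng):
    """Sample one word problem and compute its answer from the same values."""
    price = rng.randint(2, 20)
    count = rng.randint(3, 12)
    # Keep the coupon strictly below the total so the answer stays positive.
    coupon = rng.randint(1, price * count - 1)
    question = (
        f"A notebook costs ${price}. Sam buys {count} notebooks and uses "
        f"a ${coupon} coupon. How many dollars does Sam pay?"
    )
    answer = price * count - coupon
    return question, answer

def make_test_set(n, seed=None):
    """Build n problems. seed=None gives a fresh set on each run;
    a fixed seed reproduces the same set for debugging."""
    rng = random.Random(seed)
    return [make_problem(rng) for _ in range(n)]

if __name__ == "__main__":
    for question, answer in make_test_set(3):
        print(question, "->", answer)
```

Because the ground-truth answer is computed from the sampled values, no fixed answer key exists to leak into training data.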

procedural generation — live on kaggle

found online

Benchmarks we like

Benchmarks we found online and liked enough to run new models against.

agents

SWE-Marathon ↗

An ultra-long-horizon software-engineering benchmark from abundant-ai/swe-marathon ↗: real repositories, real Docker, tasks measured in hours of agent labor rather than seconds. The harness is proven — the no-op scores zero and the oracle walks out alive, eight for eight.
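The harness check described above can be sketched as follows, reading "walks out alive" as "the oracle solution gets full marks", which is an interpretation rather than something the benchmark states here. The `score` callable, the agent labels, and the task ids are hypothetical; this is not SWE-Marathon's actual API.

```python
def harness_is_sound(task_ids, score):
    """Return True if a no-op agent scores 0.0 and the oracle scores 1.0
    on every task. `score(task_id, agent)` is a hypothetical callable
    returning a float in [0, 1]."""
    noop_scores = [score(t, "noop") for t in task_ids]
    oracle_scores = [score(t, "oracle") for t in task_ids]
    return (
        all(s == 0.0 for s in noop_scores)
        and all(s == 1.0 for s in oracle_scores)
    )

def toy_score(task_id, agent):
    """Stand-in scorer for illustration only."""
    return 1.0 if agent == "oracle" else 0.0

if __name__ == "__main__":
    tasks = [f"task-{i}" for i in range(8)]  # hypothetical task ids
    print(harness_is_sound(tasks, toy_score))
```

A check like this guards against two failure modes: graders that hand out points for doing nothing, and tasks that cannot be solved even by the reference solution.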

We ran Fable 5 against it. Two reports below: a general capability sweep across 7 tasks, and a deep dive on the one task that actually separated Fable 5 from Opus 4.8, where Fable 5 came out a half-point of partial score ahead of the typical Opus run on the only evaluated task with real headroom.

our Fable 5 run — reports below

FABLE 5 ON SWE-MARATHON: CAPABILITY, ORCHESTRATION & PRESSURE · DOWNLOAD PDF ↓

SPOTLIGHT: RUBY-RUST-PORT, WHERE FABLE 5 SEPARATES · DOWNLOAD PDF ↓