evals

Model Benchmarks

Evals that try to break models on purpose: procedural generation against contamination, hour-scale agent endurance, blind engine-vs-engine combat, and a probe of when the safety layer says no to homework.

systems

js262-faceoff

Five JavaScript engines written in C, judged blind under test262, head-to-head across five rounds. A 5.19× spread between the best and worst attempt — a series record. The full paper covers the contest design, the survivorship trap that zeroed a returning champion, and why optimal policy decouples proof from progress.

five engines in C, judged blind — full paper below
JS262-FACEOFF: FULL PAPER (PDF)


safety

Refusals — Fable 5 Classifier FPR

A stratified MMLU probe of Fable 5's safety classifier: 36% of benign academic questions blocked — 100% of high-school biology — while the control model answered all 120. The banding separates pure false positives from research-adjacent material, so the over-triggering is legible as a concentrated bio/chem mis-fire rather than global imprecision. A snapshot of one configuration, not a permanent property.
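As a rough illustration of why the stratified view matters, the sketch below computes a block rate per subject next to the pooled rate. The function name, the record layout, and the toy data are all assumptions made for this example; none of it is taken from the report or its results.

```python
from collections import defaultdict

def fpr_by_subject(records):
    """Per-subject block rate over benign questions.

    records: iterable of (subject, was_blocked) pairs. This layout is
    hypothetical, not the report's actual data format.
    """
    totals = defaultdict(int)
    blocked = defaultdict(int)
    for subject, was_blocked in records:
        totals[subject] += 1
        if was_blocked:
            blocked[subject] += 1
    return {subject: blocked[subject] / totals[subject] for subject in totals}

# Toy data invented for illustration only (not the report's numbers).
toy = [
    ("subject_a", True), ("subject_a", True),
    ("subject_b", False), ("subject_b", True),
    ("subject_c", False), ("subject_c", False),
]

rates = fpr_by_subject(toy)
overall = sum(was_blocked for _, was_blocked in toy) / len(toy)
print(rates)
print(overall)
```

In the toy data the pooled rate sits in the middle while one subject is blocked every time and another never, which is the kind of concentration a single global number hides.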

stratified MMLU probe — report below
FABLE 5 CLASSIFIER FPR: REPORT (PDF)


agents

SWE-Marathon

An ultra-long-horizon software-engineering benchmark: real repositories, real Docker, tasks measured in hours of agent labor rather than seconds. The harness is proven — the no-op scores zero and the oracle walks out alive, eight for eight.
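The no-op and oracle checks can be pictured as a small sanity pass over the task list. This is a minimal sketch under stated assumptions: `validate_harness`, the `score` callback, and the stub grader are hypothetical names for illustration, not the benchmark's actual harness API.

```python
def validate_harness(tasks, score):
    """Return a list of (task, reason) pairs where the harness misbehaves.

    score(task, agent) is a hypothetical callback returning 1 for a pass
    and 0 for a fail. A sound harness gives the no-op agent 0 and the
    oracle (reference solution) 1 on every task.
    """
    failures = []
    for task in tasks:
        if score(task, "noop") != 0:
            failures.append((task, "noop scored nonzero"))
        if score(task, "oracle") != 1:
            failures.append((task, "oracle failed"))
    return failures

def stub_score(task, agent):
    """Stand-in grader for the example: only the oracle passes."""
    return 1 if agent == "oracle" else 0

tasks = [f"task-{i}" for i in range(8)]
problems = validate_harness(tasks, stub_score)
print("harness ok" if not problems else problems)
```

A grader that passes the no-op agent would be flagged here before any real agent is scored, which is the point of running both controls first.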

real repos, real docker, hour-scale tasks

kaggle

GSM8K BenchMaxxed

A procedural math benchmark that's contamination-proof by construction — every run generates fresh problems, so no model has ever seen the test set. Reports a Fragility Index: how fast a model falls apart when the problem shifts under it. Live on Kaggle.
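One way to picture procedural generation is a seeded template that emits a word problem together with its ground-truth answer. The sketch below is an assumption-laden toy: the template, the number ranges, and the names `make_problem` and `make_test_set` are invented here and are not the benchmark's generator, and it does not attempt to define the Fragility Index.

```python
import random

def make_problem(rng):
    """Build one arithmetic word problem and its ground-truth answer.

    The template and ranges are hypothetical, chosen only for illustration.
    """
    boxes = rng.randint(2, 12)
    per_box = rng.randint(3, 20)
    # Keep the sold count below the total so the answer stays positive.
    sold = rng.randint(1, boxes * per_box - 1)
    text = (
        f"A shop has {boxes} boxes with {per_box} apples each. "
        f"{sold} apples are sold. How many apples remain?"
    )
    return text, boxes * per_box - sold

def make_test_set(seed, n=5):
    """Generate n problems from a seed; the same seed reproduces the same set."""
    rng = random.Random(seed)
    return [make_problem(rng) for _ in range(n)]

test_set = make_test_set(seed=1)
for text, answer in test_set:
    print(text, "->", answer)
```

Drawing a new seed for each run yields a fresh test set, while recording the seed keeps any individual run reproducible.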

procedural generation — live on kaggle
watson