Evals that try to break models on purpose — procedural generation against contamination, hour-scale agent endurance, blind engine-vs-engine combat, and measuring when the safety layer says no to homework.
Five JavaScript engines written in C, judged blind under test262, head-to-head across five rounds. A 5.19× spread between the best and worst attempt — a series record. The full paper covers the contest design, the survivorship trap that zeroed a returning champion, and why optimal policy decouples proof from progress.
A stratified MMLU probe of Fable 5's safety classifier: 36% of benign academic questions were blocked, including 100% of high-school biology, while the control model answered all 120. The banding separates pure false positives from research-adjacent material, so the over-triggering reads as a concentrated bio/chem misfire rather than global imprecision. A snapshot of one configuration, not a permanent property.
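To make the stratified reading concrete, here is a minimal sketch of how a per-subject block rate can be tallied next to the overall rate. The function name `block_rate_by_stratum`, the subject labels, and the toy records are all assumptions for illustration; none of them come from the probe itself, and the numbers below are made up.

```python
from collections import defaultdict


def block_rate_by_stratum(results):
    """Return {stratum: fraction of questions blocked}.

    `results` is an iterable of (stratum, was_blocked) pairs.
    Hypothetical helper, not the probe's actual code.
    """
    totals = defaultdict(int)
    blocked = defaultdict(int)
    for stratum, was_blocked in results:
        totals[stratum] += 1
        blocked[stratum] += int(was_blocked)
    return {s: blocked[s] / totals[s] for s in totals}


# Made-up records for illustration only, not the probe's data.
toy = [
    ("high_school_biology", True),
    ("high_school_biology", True),
    ("college_chemistry", True),
    ("college_chemistry", False),
    ("philosophy", False),
    ("philosophy", False),
]

rates = block_rate_by_stratum(toy)
overall = sum(was_blocked for _, was_blocked in toy) / len(toy)
```

In this toy data the overall rate is moderate while one subject is blocked entirely, which is the kind of concentration a single global number would hide.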
An ultra-long-horizon software-engineering benchmark: real repositories, real Docker, tasks measured in hours of agent labor rather than seconds. The harness is proven — the no-op scores zero and the oracle walks out alive, eight for eight.
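The no-op and oracle checks can be sketched as a small sanity test: a submission that does nothing must pass no tasks, and the reference solution must pass all of them. Everything here (`validate_harness`, `toy_grade`, the task dictionaries) is a hypothetical stand-in; the real harness grades work in Docker against real repositories, which this sketch does not attempt to model.

```python
def validate_harness(tasks, grade):
    """Check that a grader rejects a no-op and accepts the reference.

    `grade(task, submission)` returns True when the submission passes.
    A no-op is modeled here as `None`; that is an assumption of this sketch.
    """
    noop_passes = sum(grade(task, None) for task in tasks)
    oracle_passes = sum(grade(task, task["reference"]) for task in tasks)
    return {
        "noop": noop_passes,
        "oracle": oracle_passes,
        "valid": noop_passes == 0 and oracle_passes == len(tasks),
    }


# Eight toy tasks with placeholder reference "patches" (illustrative only).
toy_tasks = [{"id": i, "reference": f"patch-{i}"} for i in range(8)]


def toy_grade(task, submission):
    # Stand-in grader: passes only when the submission equals the reference.
    return submission == task["reference"]


report = validate_harness(toy_tasks, toy_grade)
```

A harness that fails either half of this check would make every downstream score uninterpretable, which is why both ends are tested before any model is run.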
A procedural math benchmark that's contamination-proof by construction — every run generates fresh problems, so no model has ever seen the test set. Reports a Fragility Index: how fast a model falls apart when the problem shifts under it. Live on Kaggle.
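As a rough illustration of generation-at-run-time, the sketch below builds a problem set from a seed and computes each answer alongside its question, so there is no fixed test file to leak. The problem template, `generate_problem`, and `generate_test_set` are assumptions made for this example and are far simpler than a real benchmark would use; the Fragility Index is not modeled here, since its definition is not given in this summary.

```python
import random


def generate_problem(rng):
    """Build one arithmetic problem and its ground-truth answer.

    The template is a toy chosen for illustration.
    """
    a = rng.randint(10, 99)
    b = rng.randint(10, 99)
    c = rng.randint(2, 9)
    question = f"Compute ({a} + {b}) * {c}."
    return question, (a + b) * c


def generate_test_set(seed, n):
    """Generate n problems from a dedicated RNG seeded with `seed`."""
    rng = random.Random(seed)
    return [generate_problem(rng) for _ in range(n)]


# The same seed reproduces a run for auditing; a new seed yields a new set.
run_a = generate_test_set(seed=7, n=5)
run_b = generate_test_set(seed=7, n=5)
```

Drawing a fresh seed for each evaluation run keeps the problems new, while logging that seed keeps any individual run reproducible.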


