BENCHMARK 01
EngineCodeBench evaluates whether frontier coding agents can make scoped C++ changes that compile and behave correctly inside real, running Unreal Engine 5 projects.
01 What it measures
Correctness in Unreal depends on the engine, not just the diff. A patch can compile, look plausible, and still fail because it misses replicated state, mishandles an actor lifecycle transition, or breaks a gameplay ability system it never touched directly. Each task gives the agent a buildable start state, a behavior specification, and a scoped set of editable files — then programmatic tests and judge auditing check whether the resulting behavior is actually correct at runtime.
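The task structure described above can be pictured as a small record plus a scope check. This is a minimal sketch under stated assumptions: the `Task` fields, the file paths, the task identifier, and `out_of_scope_files` are hypothetical names chosen for illustration, not the benchmark's actual schema or harness.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; every field name is an illustrative assumption."""
    task_id: str
    behavior_spec: str
    editable_files: frozenset = field(default_factory=frozenset)


def out_of_scope_files(task, touched_files):
    """Return the touched files that fall outside the task's editable set."""
    return sorted(set(touched_files) - task.editable_files)


task = Task(
    task_id="example-001",  # hypothetical identifier
    behavior_spec="Health must replicate to all clients after damage.",
    editable_files=frozenset({
        "Source/Game/HealthComponent.h",
        "Source/Game/HealthComponent.cpp",
    }),
)

# A patch that edits one allowed file and one file outside the scope.
violations = out_of_scope_files(
    task,
    ["Source/Game/HealthComponent.cpp", "Source/Game/GameMode.cpp"],
)
print(violations)  # ['Source/Game/GameMode.cpp']
```

A scope check like this is only a precondition. Whether the patch is correct is still decided by the runtime tests and judge auditing, since an in-scope diff can compile and still break replicated state or lifecycle behavior.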
02 Leaderboard
- Claude Fable 5 [max]: 55.5%
- GPT-5.5 [xhigh]: 29.1%
- Claude Opus 4.8 [max]: 23.6%
- GPT-5.5 [high]: 19.1%
- Gemini 3.1 Pro [High]: 18.2%
- Claude Opus 4.8 [high]: 12.7%
- Claude Opus 4.7 [high]: 10.0%
- GPT-5.5 [medium]: 9.1%
- Kimi for Coding: 8.2%
- Claude Sonnet 4.6 [high]: 7.3%
- Qwen 3.7 Plus: 3.6%
- DeepSeek 4 Pro: 2.7%
Judge-calibrated pass@1, active task set (n=110).
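As a rough consistency check, each reported score can be mapped back to a whole number of tasks out of n=110. The sketch below assumes a score is simply passed tasks divided by 110, rounded to one decimal place. The calibration procedure is not described here, so the implied counts are an inference, not reported data.

```python
N_TASKS = 110  # active task set size, from the leaderboard caption

# Scores as reported on the leaderboard, in percent.
scores = {
    "Claude Fable 5 [max]": 55.5,
    "GPT-5.5 [xhigh]": 29.1,
    "Claude Opus 4.8 [max]": 23.6,
    "GPT-5.5 [high]": 19.1,
    "Gemini 3.1 Pro [High]": 18.2,
    "Claude Opus 4.8 [high]": 12.7,
    "Claude Opus 4.7 [high]": 10.0,
    "GPT-5.5 [medium]": 9.1,
    "Kimi for Coding": 8.2,
    "Claude Sonnet 4.6 [high]": 7.3,
    "Qwen 3.7 Plus": 3.6,
    "DeepSeek 4 Pro": 2.7,
}


def implied_passes(score_pct, n=N_TASKS):
    """Nearest whole number of passed tasks for a reported percentage."""
    return round(score_pct * n / 100)


def reproduces(score_pct, n=N_TASKS):
    """True if the implied count rounds back to the reported percentage."""
    k = implied_passes(score_pct, n)
    return round(100 * k / n, 1) == score_pct


for name, pct in scores.items():
    print(f"{name}: ~{implied_passes(pct)} of {N_TASKS} tasks")
```

Under that assumption, the top score corresponds to about 61 of 110 tasks and the second-best to about 32, which makes the gap between the leading configuration and the rest concrete.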
03 Key takeaways
- The strongest configuration, Claude Fable 5 [max], now solves a majority of the task set at 55.5% calibrated pass@1, but every other evaluated configuration stays under 30%.
- Partial progress is consistently far ahead of full-task success: GPT-5.5 [xhigh] reaches 29.1% pass@1 but 80.4% mean requirement satisfaction; Claude Opus 4.8 [max] reaches 23.6% pass@1 but 77.9% mean requirement satisfaction.
- The hard part is runtime integration, not syntax. Strong models usually compile and reach execution; the remaining failures are authority mistakes, replication/state-sync errors, actor lifecycle bugs, and incomplete integration with surrounding gameplay systems.
- Failure is not evenly spread: save/persistence has the largest unresolved share, with weapons/combat, serialization, and AI/world orchestration also retaining unsolved tasks.
- 79 of 110 tasks are solved by at least one configuration; 31 remain unsolved by every configuration evaluated, which indicates that model capabilities are complementary rather than strictly nested.
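The gap between pass@1 and mean requirement satisfaction, and the difference between complementary and nested capabilities, can be illustrated with a toy example. All data below is hypothetical, and the sketch assumes that a task passes only when every requirement is satisfied; the benchmark's judge-calibrated pass criterion may differ.

```python
# Hypothetical per-task results: configuration -> task id -> list of
# per-requirement outcomes (True means the requirement was satisfied).
results = {
    "config_a": {
        "t1": [True, True, True],
        "t2": [True, True, False],
        "t3": [True, False, True],
        "t4": [False, False, False],
    },
    "config_b": {
        "t1": [True, True, False],
        "t2": [True, True, True],
        "t3": [True, True, False],
        "t4": [True, False, False],
    },
}


def solved_tasks(per_task):
    """Tasks where every requirement is satisfied (assumed pass criterion)."""
    return {task for task, reqs in per_task.items() if all(reqs)}


def pass_at_1(per_task):
    """Fraction of tasks fully solved."""
    return len(solved_tasks(per_task)) / len(per_task)


def mean_requirement_satisfaction(per_task):
    """Average, over tasks, of the fraction of requirements satisfied."""
    fractions = [sum(reqs) / len(reqs) for reqs in per_task.values()]
    return sum(fractions) / len(fractions)


solved_a = solved_tasks(results["config_a"])
solved_b = solved_tasks(results["config_b"])

# Union coverage: tasks solved by at least one configuration.
union = solved_a | solved_b

# Nested would mean one configuration's solved set contains the other's.
nested = solved_a <= solved_b or solved_b <= solved_a

for name, per_task in results.items():
    print(name, pass_at_1(per_task), round(mean_requirement_satisfaction(per_task), 3))
print(sorted(union), nested)
```

In this toy data each configuration fully solves one of four tasks while satisfying well over half of all requirements, and the two solved sets do not contain one another, so the union covers more tasks than either configuration alone.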
04 Where agents struggle
- Movement / physics: 3/3
- Data / serialization: 3/3
- Online services: 2/2
- XR / spatial systems: 2/2
- Gameplay ability systems: 4/5
- Core gameplay: 15/19
- Animation / camera / locomotion: 11/14
- AI / world orchestration: 3/4
- Inventory / interaction: 11/15
- UI / sessions / game features: 9/13
- Engine systems / pooling: 8/12
- Graphics / rendering plugins: 4/7
- Save / persistence: 2/5
- Weapons / combat: 2/6
Tasks solved by at least one configuration, out of the category total; the remaining tasks have no calibrated pass.
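The per-category counts can be cross-checked against the headline totals of 79 solved and 31 unsolved out of 110 tasks. The sketch below only re-adds the numbers listed above; the variable and function names are illustrative.

```python
# (solved by at least one configuration, total tasks) per category, as listed.
categories = {
    "Movement / physics": (3, 3),
    "Data / serialization": (3, 3),
    "Online services": (2, 2),
    "XR / spatial systems": (2, 2),
    "Gameplay ability systems": (4, 5),
    "Core gameplay": (15, 19),
    "Animation / camera / locomotion": (11, 14),
    "AI / world orchestration": (3, 4),
    "Inventory / interaction": (11, 15),
    "UI / sessions / game features": (9, 13),
    "Engine systems / pooling": (8, 12),
    "Graphics / rendering plugins": (4, 7),
    "Save / persistence": (2, 5),
    "Weapons / combat": (2, 6),
}

total_solved = sum(solved for solved, _ in categories.values())
total_tasks = sum(total for _, total in categories.values())
total_unsolved = total_tasks - total_solved


def unsolved_count(name):
    """Tasks in a category with no calibrated pass from any configuration."""
    solved, total = categories[name]
    return total - solved


# Categories in which every task is solved by at least one configuration.
fully_solved = [name for name in categories if unsolved_count(name) == 0]

print(total_solved, total_tasks, total_unsolved)  # 79 110 31
print(fully_solved)
```

The category counts add up to the same 79 of 110 reported in the key takeaways, so the chart and the headline figure describe the same task set.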
05 Dataset & task design
Core gameplay
Multiplayer & replication
Weapons & combat
Inventory & interaction
Gameplay ability systems
Save / load & persistence
AI & world orchestration
Animation & movement
UI / session systems
XR / spatial
Rendering & plugins
06 Methodology
07 Failure analysis