BENCHMARK 01

GameEngineBench

EngineCodeBench evaluates whether frontier coding agents can make scoped C++ changes that compile and behave correctly inside real, running Unreal Engine 5 projects.


110 UE5 C++ TASKS

SOURCE PROJECTS

MODEL CONFIGS EVALUATED

55.5% BEST PASS RATE

01 What it measures

Correctness in Unreal depends on the engine, not just the diff. A patch can compile, look plausible, and still fail because it misses replicated state, mishandles an actor lifecycle transition, or breaks a gameplay ability system it never touched directly. Each task gives the agent a buildable start state, a behavior specification, and a scoped set of editable files. Programmatic tests and a judge audit then check whether the resulting behavior is actually correct at runtime.
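To make the three task inputs concrete, here is a minimal sketch of how a task record and a scope check could be modeled. The `Task` fields, the `out_of_scope_edits` helper, and the file paths are all hypothetical illustrations, not the benchmark's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; field names are illustrative only."""
    task_id: str
    start_state: str                # e.g. a snapshot of a buildable project
    behavior_spec: str              # description of the required behavior
    editable_files: frozenset[str]  # project-relative paths the agent may modify


def out_of_scope_edits(task: Task, touched_files: list[str]) -> list[str]:
    """Return the touched paths that fall outside the task's editable set."""
    allowed = {PurePosixPath(p) for p in task.editable_files}
    return sorted(p for p in touched_files if PurePosixPath(p) not in allowed)


task = Task(
    task_id="example-task",
    start_state="snapshot-0",
    behavior_spec="Picking up an item updates the inventory on every client.",
    editable_files=frozenset({
        "Source/Game/Inventory.cpp",
        "Source/Game/Inventory.h",
    }),
)

violations = out_of_scope_edits(
    task, ["Source/Game/Inventory.cpp", "Source/Game/PlayerPawn.cpp"]
)
print(violations)  # ['Source/Game/PlayerPawn.cpp']
```

A scope check like this only says whether the edit stayed inside the allowed files. Whether the behavior is correct still has to be decided at runtime.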

02 Leaderboard

Configuration               Pass@1
Claude Fable 5 [max]        55.5%
GPT-5.5 [xhigh]             29.1%
Claude Opus 4.8 [max]       23.6%
GPT-5.5 [high]              19.1%
Gemini 3.1 Pro [High]       18.2%
Claude Opus 4.8 [high]      12.7%
Claude Opus 4.7 [high]      10.0%
GPT-5.5 [medium]             9.1%
Kimi for Coding              8.2%
Claude Sonnet 4.6 [high]     7.3%
Qwen 3.7 Plus                3.6%
DeepSeek 4 Pro               2.7%

Judge-calibrated pass@1, active task set (n=110).
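As a worked example of the arithmetic behind these percentages, the sketch below converts pass counts into one-decimal pass rates over the 110-task set. The pass counts are inferred from the published percentages and are an assumption; the page itself reports only the rates.

```python
TOTAL_TASKS = 110  # active task set size stated in the caption


def pass_rate_percent(passed: int, total: int = TOTAL_TASKS) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100.0 * passed / total, 1)


# Assumption: these counts are back-computed from the leaderboard percentages,
# not figures reported by the benchmark.
implied_passes = {
    "Claude Fable 5 [max]": 61,
    "GPT-5.5 [xhigh]": 32,
    "Claude Opus 4.8 [max]": 26,
}

for name, passed in implied_passes.items():
    print(f"{name}: {passed}/{TOTAL_TASKS} = {pass_rate_percent(passed)}%")
```

With n=110, one task is worth roughly 0.9 percentage points, so small gaps between neighboring rows correspond to only one or two tasks.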

03 Key takeaways

→ The strongest configuration, Claude Fable 5 (max), now solves a majority of the task set at 55.5% calibrated pass@1, but every other evaluated configuration stays under 30%.
→ Partial progress is consistently far ahead of full-task success: GPT-5.5 xhigh reaches 29.1% pass@1 but 80.4% mean requirement satisfaction; Claude Opus 4.8 max reaches 23.6% pass@1 but 77.9% requirement satisfaction.
→ The hard part is runtime integration, not syntax: strong models usually compile and reach execution, and the remaining failures are authority mistakes, replication/state-sync errors, actor lifecycle bugs, and incomplete integration with surrounding gameplay systems.
→ Failure is not evenly spread: weapons/combat (4 of 6 tasks unsolved) and save/persistence (3 of 5) have the largest unresolved shares, with AI/world orchestration and several other categories also retaining unsolved tasks.
→ 79 of 110 tasks are solved by at least one configuration; 31 remain unsolved by every configuration evaluated. Model capabilities are complementary rather than strictly nested.
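Two of the takeaways above rest on simple set and ratio arithmetic, sketched here on toy data. The sketch assumes a task passes only when every requirement is satisfied and that requirement satisfaction is averaged per task; the benchmark's exact definitions are not given on this page, and all task and configuration names are hypothetical.

```python
# Toy per-requirement outcomes for one configuration (hypothetical data).
runs = {
    "task-a": [True, True, True, True],
    "task-b": [True, True, True, False],
    "task-c": [True, False, True, True, True],
}


def pass_at_1(runs: dict) -> float:
    """Fraction of tasks where every requirement is satisfied."""
    return sum(all(reqs) for reqs in runs.values()) / len(runs)


def mean_requirement_satisfaction(runs: dict) -> float:
    """Mean over tasks of the fraction of requirements satisfied."""
    return sum(sum(reqs) / len(reqs) for reqs in runs.values()) / len(runs)


print(f"pass@1: {pass_at_1(runs):.3f}")
print(f"mean requirement satisfaction: {mean_requirement_satisfaction(runs):.3f}")

# Toy solved-task sets for two configurations (hypothetical data).
solved_by = {
    "config-x": {"t1", "t2", "t3"},
    "config-y": {"t3", "t4"},
}

union = set().union(*solved_by.values())
best = max(solved_by.values(), key=len)
# "Strictly nested" would mean every configuration's solved set sits inside the best one.
nested = all(solved <= best for solved in solved_by.values())

print(f"best single configuration: {len(best)} tasks, union: {len(union)} tasks")
print(f"nested: {nested}")
```

One missed requirement zeroes out a task for pass@1 while barely moving the satisfaction average, which is how a configuration can sit near 80% satisfaction and under 30% pass@1. Likewise, a union larger than the best single configuration's solved set is only possible when weaker configurations solve tasks the strongest one misses.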

04 Where agents struggle

Category                           Solved by at least one configuration   No calibrated pass   Total
Movement / physics                 3                                      0                    3
Data / serialization               3                                      0                    3
Online services                    2                                      0                    2
XR / spatial systems               2                                      0                    2
Gameplay ability systems           4                                      1                    5
Core gameplay                      15                                     4                    19
Animation / camera / locomotion    11                                     3                    14
AI / world orchestration           3                                      1                    4
Inventory / interaction            11                                     4                    15
UI / sessions / game features      9                                      4                    13
Engine systems / pooling           8                                      4                    12
Graphics / rendering plugins       4                                      3                    7
Save / persistence                 2                                      3                    5
Weapons / combat                   2                                      4                    6
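The per-category counts can be turned into unresolved shares with a few lines of arithmetic. The sketch below uses the solved/total pairs from this section as its only input; the variable names are illustrative.

```python
# (solved by at least one configuration, total tasks) per category, from this section.
solved_total = {
    "Movement / physics": (3, 3),
    "Data / serialization": (3, 3),
    "Online services": (2, 2),
    "XR / spatial systems": (2, 2),
    "Gameplay ability systems": (4, 5),
    "Core gameplay": (15, 19),
    "Animation / camera / locomotion": (11, 14),
    "AI / world orchestration": (3, 4),
    "Inventory / interaction": (11, 15),
    "UI / sessions / game features": (9, 13),
    "Engine systems / pooling": (8, 12),
    "Graphics / rendering plugins": (4, 7),
    "Save / persistence": (2, 5),
    "Weapons / combat": (2, 6),
}

# Sort categories by the share of tasks that no configuration solved.
rows = sorted(
    ((name, total - solved, total) for name, (solved, total) in solved_total.items()),
    key=lambda row: row[1] / row[2],
    reverse=True,
)

total_solved = sum(solved for solved, _ in solved_total.values())
total_tasks = sum(total for _, total in solved_total.values())
print(f"solved by at least one configuration: {total_solved}/{total_tasks}")

for name, unsolved, total in rows:
    if unsolved:
        print(f"{name:34s} {unsolved}/{total} unsolved ({100 * unsolved / total:.1f}%)")
```

Ranking by share rather than raw count matters here because category sizes range from 2 to 19 tasks.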

05 Dataset & task design

Core gameplay

Multiplayer & replication

Weapons & combat

Inventory & interaction

Gameplay ability systems

Save / load & persistence

AI & world orchestration

Animation & movement

UI / session systems

XR / spatial

Rendering & plugins

06 Methodology

01 Start state
Each task begins from a buildable Unreal project, so there is no scaffolding to invent.

02 Scoped edit
The agent edits a constrained set of native C++ files against a behavior spec.

03 Programmatic tests
Tests run programmatically against the real Unreal build and runtime after the agent finishes.

04 Judge audit
An LLM judge checks whether the requested behavior is present, rather than matching the patch against a reference diff.
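The evaluation steps can be pictured as a gated pipeline. The sketch below is one plausible arrangement, assuming the stages run in order, stop at the first failure, and that a task passes only if build, tests, and judge audit all succeed. How the benchmark actually combines test and judge results is not stated here, and the stage names and stub checks are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class StageResult:
    name: str
    ok: bool


def run_pipeline(stages: list[tuple[str, Callable[[], bool]]]) -> list[StageResult]:
    """Run stages in order and stop at the first failure."""
    results: list[StageResult] = []
    for name, check in stages:
        ok = check()
        results.append(StageResult(name, ok))
        if not ok:
            break
    return results


def task_passed(results: list[StageResult], stage_count: int) -> bool:
    """Assumed rule: every stage must have run and succeeded."""
    return len(results) == stage_count and all(r.ok for r in results)


# Stub checks standing in for a real build, test run, and judge call.
stages = [
    ("build", lambda: True),
    ("programmatic_tests", lambda: True),
    ("judge_audit", lambda: False),
]

results = run_pipeline(stages)
print([(r.name, r.ok) for r in results])
print(task_passed(results, len(stages)))  # False
```

In this arrangement a patch that compiles and passes the tests can still be rejected by the audit, which matches the page's emphasis on behavior over diff similarity.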

07 Failure analysis

TASK 01 — UNSOLVED BY ALL

Zombie System

Requires AI control, round-state updates, server-authoritative damage, and replicated feedback to all agree. Models solve individual pieces but rarely the coordination between them.
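The server-authoritative part of this pattern can be illustrated with a small engine-independent model. This is a sketch in plain Python, not Unreal code: the `Server` and `Replica` classes and the `client_requests_damage` function are hypothetical stand-ins for an authoritative actor, its replicated copies, and a client-to-server request.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A client's read-only copy of replicated state."""
    health: int = 100


@dataclass
class Server:
    """Holds the authoritative state and pushes it to every client."""
    health: int = 100
    clients: list = field(default_factory=list)

    def apply_damage(self, amount: int) -> None:
        """Validate, mutate the authoritative value, then replicate it."""
        if amount <= 0:
            return  # reject invalid requests instead of trusting the caller
        self.health = max(0, self.health - amount)
        for replica in self.clients:
            replica.health = self.health


def client_requests_damage(server: Server, amount: int) -> None:
    """Clients never edit their replica directly; they ask the server."""
    server.apply_damage(amount)


server = Server(clients=[Replica(), Replica()])
client_requests_damage(server, 30)
print(server.health, [c.health for c in server.clients])  # 70 [70, 70]
```

The failure mode the task description points at is the opposite arrangement: a client changes its own copy, the value looks right locally, and the server and other clients never agree with it.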

TASK 19 — UNSOLVED BY ALL

Map Orchestrator

Combines procedural generation, actor pooling, and readiness signaling. Failures are almost always ordering and lifecycle bugs, not errors in the generation logic itself.

TASKS 76–80 — SPUD PERSISTENCE

Save / load lifecycle

Models produce plausible serialization code that still loses actor identity or state across streaming levels and engine-managed teardown boundaries.
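The identity problem can be shown with a small engine-independent model. This sketch assumes each actor carries a stable identifier that survives unload and reload; the `Actor` class, the `stable_id` field, and the helper functions are hypothetical and do not reflect the API of any persistence plugin.

```python
from dataclasses import dataclass


@dataclass
class Actor:
    stable_id: str        # assumed to persist across unload and reload
    opened: bool = False  # example of state that must survive a save/load cycle


def load_level() -> list:
    """Simulate streaming a level in: fresh objects with default state."""
    return [Actor("chest-01"), Actor("chest-02")]


def save(actors: list) -> dict:
    """Snapshot state keyed by stable identifier, not by object instance."""
    return {a.stable_id: {"opened": a.opened} for a in actors}


def restore(actors: list, snapshot: dict) -> None:
    """Reapply saved state to whichever new instance carries the same identifier."""
    for actor in actors:
        state = snapshot.get(actor.stable_id)
        if state is not None:
            actor.opened = state["opened"]


level = load_level()
level[0].opened = True
snapshot = save(level)

reloaded = load_level()  # teardown and reload produce different objects
restore(reloaded, snapshot)
print([(a.stable_id, a.opened) for a in reloaded])
# [('chest-01', True), ('chest-02', False)]
```

Serialization that keys state to the live object rather than a stable identifier can round-trip within a single session and still lose everything once the level is torn down and recreated.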