Recreating a game with AI agents: four workflows, one Water Sort

April 23, 2026

I have been experimenting with workflows for AI agents that try to recreate an existing game from scratch, using a popular offline-style puzzle as the lab rat: Water Sort. The constraint I care about is rebuilding from an offline / mobile “casual game” reference (here: Offline Games by JindoBlu) rather than cloning a live product API-for-API.

This post walks through four versions of that experiment, each with a different way of instructing the agent, splitting context, and checking its own work.

The setup

Target: Recreate a familiar puzzle game in the browser, aligned with a specific visual and UX style (offline casual games, not a generic “AI slop” UI).
Reference style: JindoBlu Offline Games palette, feel, and presentation.
What varies: How much is packed into one file vs. split into skills and subagent-style passes, and how review is wired in.

You can compare the four builds below; later sections will go deeper into what worked and what felt rough in each.

Mark 0 - One `AGENTS.md`, one context

Approach: A single AGENTS.md that holds everything the agent needs in one place:

Style: JindoBlu Offline Games look and colors, tone, and UI habits.
Design: Pull in the frontend-design skill so the result isn’t generic.
Planning: Plan mode—how to break down the work and when to think before coding.
Reality check: Current project layout so file moves and imports stay coherent.

Aside from the frontend design skill part, here is the content of AGENTS.md:

# Styles

Similar styles to JindoBlu's Offline Games: flat geometric minimalism with bold solid colors, simple shapes only (circles, squares, triangles, lines), clean white sans-serif UI text, dark or vivid single-color backgrounds, no gradients or textures, no images. Animations should be snappy with subtle scale/bounce tweens.

# Workflow

- In PLAN mode we will aggressively ask follow-up and clarifying questions, make fewer assumptions, and ask the user more about the details of the game.
- When PLANNING we should go into details and explore possible problems or edge cases we can encounter when making the game, then think about and explore possible solutions, choose one, and note it down in the PLAN. We should PLAN everything so that when we do BUILD we just write code without the need of problem solving.
- For the game logic write it separately and write unit tests alongside it (we use vitest, just run pnpm test)
- Always run `typecheck` and fix all issues after finishing building.
- DO NOT run dev server or build; if typecheck passes it will build.

# Project

This project is a Phaser 3.90 browser game template using Vite and TypeScript. It has:

- Entry: index.html loads Phaser from CDN and /src/main.ts as an ES module.
- Config: GAME_WIDTH / GAME_HEIGHT in src/config.ts (400×600) but we can always change this to fit the game style.
- Game: src/main.ts creates a Phaser game with MenuScene, FIT scaling, and mobile pipeline, mainly to bootstrap and run the game
- Scene: MenuScene in src/scenes/menu.ts shows "Hello World" at (100, 100).
- Build: Vite 8, TypeScript 5.9, Prettier with import sorting.

The prompt was barely a spec: something like “let’s make the water sort game,” plus whatever Cursor already had in context (this repo, AGENTS.md, and the frontend-design skill). No separate product doc, no level spreadsheet, no art handoff; just that line and the rules in one file.

How I ran it: I used Cursor Plan mode with Claude Opus 4.5 to turn that thin input into a real plan. Opus could lean on the workflow section in AGENTS.md: ask follow-ups, call out edge cases (illegal pours, win detection, empty tubes), split pure game logic from Phaser, and line up Vitest next to the logic so implementation could stay dumb. The plan was the whole bridge from one sentence to “what to build, in what order, with which files.”
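To make “split pure game logic from Phaser” concrete, here is a minimal sketch of what such a logic module could look like. It is not the code from the build: the names (`Tube`, `topRun`, `canPour`, `pour`, `isSolved`) and the fixed capacity of 4 are assumptions for illustration, and the rules follow the common “pour as much as fits, same top color moves together” reading of Water Sort.

```typescript
// Hypothetical pure-logic sketch (no Phaser imports), not the project's real module.
// A tube is a stack of color ids, bottom first; the last element is the top layer.
type Tube = readonly string[];

// Assumed capacity for this sketch.
const CAPACITY = 4;

/** Number of same-colored layers sitting contiguously at the top of a tube. */
function topRun(tube: Tube): number {
  if (tube.length === 0) return 0;
  const top = tube[tube.length - 1];
  let run = 0;
  for (let i = tube.length - 1; i >= 0 && tube[i] === top; i--) run++;
  return run;
}

/** Legal when the source is non-empty, the target has room, and top colors match (or target is empty). */
function canPour(tubes: readonly Tube[], from: number, to: number): boolean {
  if (from === to) return false;
  const src = tubes[from];
  const dst = tubes[to];
  if (!src || !dst) return false; // out-of-range index
  if (src.length === 0 || dst.length >= CAPACITY) return false;
  return dst.length === 0 || dst[dst.length - 1] === src[src.length - 1];
}

/** Returns a new state with as many top layers moved as fit; illegal pours return the input unchanged. */
function pour(tubes: readonly Tube[], from: number, to: number): readonly Tube[] {
  if (!canPour(tubes, from, to)) return tubes;
  const moved = Math.min(topRun(tubes[from]), CAPACITY - tubes[to].length);
  const next = tubes.map((t) => t.slice()); // copy so the caller's state is never mutated
  for (let i = 0; i < moved; i++) next[to].push(next[from].pop() as string);
  return next;
}

/** Win detection: every tube is either empty or full of a single color. */
function isSolved(tubes: readonly Tube[]): boolean {
  return tubes.every((t) => t.length === 0 || (t.length === CAPACITY && topRun(t) === CAPACITY));
}
```

Because every function takes a state and returns a value without touching a scene, the edge cases called out in planning (illegal pours, win detection, empty tubes) can each become a one-line unit test.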

Implementation: I used Claude Sonnet 4.5 in the normal agent flow to follow the plan, add scenes and UI, wire Phaser, write the tests, then run pnpm test and typecheck as AGENTS.md demands. Same repo and instructions; the only deliberate split was plan in Opus, code in Sonnet so I was not doing heavy architecture and line-by-line editing in the same pass.

Result: A decent-looking game that matches the theme and style; the main weak spot was animation.

Mark 1 - Brainstormed “workflow”: skills, subagents, and cross-review

Why: The next iterations really wanted playtesting and feedback—is the pour readable, does the level feel fair, is anything janky. That is slow if I have to be the only tester in the loop. An agent with browser tools can run the build, click through, and report what it sees, so the process might not need a human in the middle for every small tweak. That pushed us toward a pipeline of named steps instead of one static AGENTS.md.

Mr Tien argued the agent should learn from its mistakes over time, not only ship code. That pointed to an explicit retrospective phase after a milestone, so the next pass can carry forward what went wrong instead of repeating the same blind spots.

Approach: I brainstormed with an agent (here: GPT-5.4) and landed on a workflow with multiple skills (one main job each), separate review passes (code vs. how it feels in the browser), and a loop so the run does not dead-end after the first plan.

The workflow we sketched

brainstorming-game
writing-game-plan (same context: from the brainstorm, write a plan.md)
executing-game-plan
review-game-code (fresh context, focus on structure and bugs)
review-gameplay (browser-driven: does it play like the target?)
milestone-retrospective (what failed, what to do differently)
choose-next-milestone—if there is more to do, loop back to writing-game-plan with that retrospective in mind so the agent can auto-loop without a new chat each time

Intent: Get structure, specialization, and self-correction—without hand-writing every file upfront, and with a clear handoff from “we shipped a slice” to “what we plan next.”
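One way to picture the loop is as a fixed list of steps with a single back-edge. The sketch below is only a model of the list above, not how Cursor or the skills actually dispatch work; `nextStep` and `trace` are hypothetical helpers, and the only names taken from the workflow are the step names themselves.

```typescript
// Step names are copied from the workflow list; everything else is illustrative.
const STEPS = [
  "brainstorming-game",
  "writing-game-plan",
  "executing-game-plan",
  "review-game-code",
  "review-gameplay",
  "milestone-retrospective",
  "choose-next-milestone",
] as const;

type Step = (typeof STEPS)[number];

/** Linear order, except the last step loops back to planning while milestones remain. */
function nextStep(current: Step, hasMoreMilestones: boolean): Step | null {
  if (current === "choose-next-milestone") {
    return hasMoreMilestones ? "writing-game-plan" : null;
  }
  return STEPS[STEPS.indexOf(current) + 1];
}

/** Lists the steps visited when the run covers the given number of milestones (at least one pass). */
function trace(milestones: number): Step[] {
  const visited: Step[] = [];
  let remaining = milestones;
  let step: Step | null = STEPS[0];
  while (step !== null) {
    visited.push(step);
    if (step === "choose-next-milestone") remaining--;
    step = nextStep(step, remaining > 0);
  }
  return visited;
}
```

Under this model, brainstorming happens once and every further milestone re-enters at `writing-game-plan`, so a two-milestone run visits 13 steps in a single conversation.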

How I ran it: For this build I stayed in Cursor Agent mode end to end and used Claude Opus 4.6 only (no Sonnet swap for implementation). I drove it through two milestones by name: watersort-solvable-progression, then visual-direction-contract—the idea was to lock solvable levels and progression first, then nail look-and-feel.

Result: Overall disappointing for how much more context and tokens it burned compared to Mark 0. The game did get animation, but it felt super janky—arguably worse than Mark 0, not better. Buttons and copy also did not match the offline-games style we asked for; they read as generic UI, not the flat bold look in the brief. So the pipeline gave us more process, not a clearly better product.

What I think went wrong: Watching the run, the orchestrator (the main Cursor agent) has to keep the entire workflow description in play, and it is the only role that can talk to me. Subagents or task-style passes do not get a real back-and-forth with the user. So planning, clarification, and “what milestone next” all land on that one thread—a lot of back and forth, and the context window fills fast. I suspect that pressure is part of why quality dropped: less room left for the actual game code and visual polish by the time heavy steps had already run.

Browser review: Even with browser tools, the agent could not meaningfully review animation. It can see layout and whether something moves, but not “is this tween acceptable” or “does this match the reference feel.” So review-gameplay helped for clickability and structure, not for motion or timing—and that gap showed in the build.

Mark 2 - Forking “superpower” skills for this use case

Why: Lately superpower-style skills have been popular around the company. People pass around what worked on their projects, and the same themes keep coming up: treat the repo as the source of truth (real files, not a scroll of chat), start a new run when the phase changes so context stays cheap, and hand review to a subagent so the “doer” and the “checker” are not sharing one bloated turn. After Mark 1, we wanted to see if that pattern could fix the “one orchestrator is carrying the whole workflow” problem. So the next pass was to map those habits onto the Mark 1 graph and fork the superpower material to match this game, instead of inventing something else from a blank page.

Approach: Start from existing superpower skills and workflows and fork them—wording, file names, and gates—so they fit a browser offline-game build. That is still less risky than inventing a pipeline from zero, but the loop is no longer a vague checklist; it is “write artifact X, then a new session reads X.”

What we changed, step by step

brainstorming-game: Write a design.md right away (a concrete artifact, not a wall of chat). Then spawn a subagent whose only job is to review that file—coverage, risks, and holes—before the next phase treats it as ground truth.
writing-game-plan: New context. Read design.md. The plan step is not “produce a plan because the skill says so”—the skill should spell out how to build the plan: step-by-step instructions, a clear quality bar, and a short list of common mistakes to avoid, so the output is scorable and useful for implementation.
executing-game-plan: New context again. The prompt includes the frontend-design skill in the content so look-and-feel is first-class in the build pass, not an afterthought bolted on at the end.
review-game-code, review-gameplay, milestone-retrospective, choose-next-milestone: Same flow as Mark 1—code review, browser pass, retro, then decide whether to loop the plan. The difference is not new names; it is that upstream steps now land in files and get clean context on purpose.

Narrowed: In Superpower TDD, the stock test-driven flow assumes you can test everything worth testing. For a game, UI and animation tests are noisy and do not buy much. We modified the skill so the agent only writes and maintains tests for logic code (rules, state, solvability)—the stuff that is cheap to make deterministic. Everything else is manual, visual, or “does it feel right” in the real build.
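Solvability is a good example of logic that is cheap to make deterministic. The sketch below is a plain breadth-first search over pour states; it is an assumption about how such a check could be written, not the solver from the build. `successors`, `allSorted`, `stateKey`, and `isSolvable` are hypothetical names, color ids are assumed not to contain `,` or `|`, and the `maxStates` cutoff means a `false` result can also mean “gave up”.

```typescript
// Each tube is listed bottom-first; the last element is the top layer.
type State = string[][];

/** All states reachable in one legal pour, moving as many top layers as fit. */
function successors(state: State, capacity: number): State[] {
  const out: State[] = [];
  for (let from = 0; from < state.length; from++) {
    const src = state[from];
    if (src.length === 0) continue;
    const color = src[src.length - 1];
    for (let to = 0; to < state.length; to++) {
      const dst = state[to];
      if (to === from || dst.length >= capacity) continue;
      if (dst.length > 0 && dst[dst.length - 1] !== color) continue;
      const next = state.map((t) => t.slice());
      while (
        next[from].length > 0 &&
        next[from][next[from].length - 1] === color &&
        next[to].length < capacity
      ) {
        next[to].push(next[from].pop() as string);
      }
      out.push(next);
    }
  }
  return out;
}

/** True when every tube is empty or full of one color. */
function allSorted(state: State, capacity: number): boolean {
  return state.every((t) => t.length === 0 || (t.length === capacity && t.every((c) => c === t[0])));
}

/** Tube order does not affect solvability, so tubes are sorted to merge equivalent states. */
function stateKey(state: State): string {
  return state.map((t) => t.join(",")).sort().join("|");
}

/** Breadth-first search; returns false if no solution is found within maxStates visited states. */
function isSolvable(start: State, capacity: number, maxStates = 200000): boolean {
  const seen: Record<string, boolean> = Object.create(null);
  seen[stateKey(start)] = true;
  let visited = 1;
  let frontier: State[] = [start];
  while (frontier.length > 0) {
    const nextFrontier: State[] = [];
    for (const s of frontier) {
      if (allSorted(s, capacity)) return true;
      for (const n of successors(s, capacity)) {
        const k = stateKey(n);
        if (seen[k]) continue;
        seen[k] = true;
        if (++visited > maxStates) return false; // search budget exhausted
        nextFrontier.push(n);
      }
    }
    frontier = nextFrontier;
  }
  return false;
}
```

A check like this runs without a browser, which is what makes it a reasonable target for TDD while tweens and feel stay in the manual bucket.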

Intent: Reuse what superpower already got right (artifacts, fresh eyes, TDD where it is honest) but retarget the wording, steps, and gates for “offline-game clone in the browser” and cut the fantasy of testing Phaser tweens the same way you test a pure function.

How I ran it: I still used the same old prompt, “Let’s make the water sort game,” and I still used Cursor with Claude Opus 4.6 for the whole run—the fork was the skills and file handoffs, not a new model or a new IDE. I only needed a single milestone to get to something I could call bearable and stop for a look.

Result: The build was closer to playable in a boring, checklist sense, but the soul was wrong. I got bottles where I wanted the test-tube look. The “animation” was logically fine: a bottle tilts and you pour from one vessel into another. The water itself still teleported—there was no sense of liquid moving, no stream or column you could read as flow between tube and tube. The motion read as state change with props, not as water doing water things.

Honest read: On paper the build ticked the requirements I had written into the skills—enough for a retro or a pass/fail—and I still did not feel like I was looking at the game I wanted. It kind of killed my hope in the run for a while: if that much structure and superpower-style discipline only gets you to “correct but hollow,” the part you care about (feel, juice, the Offline Games thing) is not on the rubric at all.

Mark 3 - My usual prompting style, formalized

After Mark 2 I was stuck in a low place: a lot of structure and a rubric the agent could “pass” without delivering the feel I actually wanted. Stepping back helped. What I already do in practice is prompt and steer an agent in order: get the core rules and state right, then a rough surface, then polish and motion. Mark 1 and 2 also did that. So the issue was not “we forgot a step in the pipeline”—it was what the pipeline was allowed to stand on.

When I guide, I am usually filling in a gap the model does not have: a clearer picture of the game we are trying to recreate than whatever came baked into weights from training. The model still needs inputs it can treat as fact, not as vibe. If we leave the model running on memory, the work can fail quietly: it looks like progress, but the rule set, layout, and animation brief are all slightly wrong, and you only notice in motion.

Mark 3 is the same habits as before—skills, fresh-context review where it helps—but with an explicit evidence-first leg and artifacts so the run is not competing with a vague “Water Sort” in everyone’s head.

The shape of the run

Discovery / research — Research intensively; treat “I think it works like this” as untrusted until something in the world backs it up.
Recreation brief (synthesis) — From the pile of sources, what are we actually recreating in this project (mechanics, camera, level scope, and what is explicitly out of scope).
Define scope — One milestone, one player-visible bar.
Plan — Spawn three subagents with separate focus: project and code structure, UI and UX (including motion), and acceptance criteria (what “done” is in the running build). Same idea as Mark 1 and 2, but the plan writers are not inventing the game from an empty file; they are distilling the brief and the research.
Build — Complete rewrite; it worked unexpectedly well.
Build review — Not really needed anymore (you will know why later).

How I ran it: This was deliberately boring—the same thin prompt I used before (basically, let’s make a water sort game) but dispatched onto Claude Opus 4.7—a newer, slightly less capable model according to benchmarks, but cheaper on Cursor because it is in its launch period. After that one simple prompt, I watched it run. The difference was not charisma; it was the files on disk and what got researched before “plan” existed.

What the research pass turned up made the pre-plan phase worth it on its own. I had not gone looking for these sources before, and a model's memory of the game is not a substitute for a literature pass.

Kattis “watersort” (2021 Virginia Tech HS, via Kattis) — competitive-programming style statement: capacity 4 layers, 2 empty vials, “pour as much as fits,” segments stay separated; good for exact rules and a worked example.
Sorting balls and water: equivalence and complexity (FUN 2022) — ball vs. water, NP angle, empty-bin style bounds. Useful when we get serious about level generation, not just one-off levels.
Kociemba, “Color sort puzzles” — family view (C containers, N layers, K empty), liquid vs. ball rule split.
For look and feel in plain language, third-party help still helped: store listings, and guides like the play-watersort.com how-to (objective, selection lift, “same top color moves together” pour, 4-segment cap). The research also surfaced video sources; I did not plug in a video understanding model this round, so direct frames stayed deferred (candidate YouTube and “install the real app” paths live in open-questions.md for a future cycle). Honest limit: no screenshot or frame evidence from the canonical app this time—only text and secondary descriptions for some visual claims.
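Parameters like these are easy to turn into a sanity check on level data. The sketch below is a loose, assumed mapping of the “containers, layers, empty” family onto a small validator; `LevelSpec` and `validateLevel` are hypothetical names and are not taken from any of the sources. It checks only the shape of a level (tube count, capacity, and color counts), not whether the level is solvable.

```typescript
// Hypothetical level description; field names are illustrative.
interface LevelSpec {
  colors: number; // distinct colors, each filling exactly one tube when sorted
  layers: number; // capacity of every tube
  emptyTubes: number; // spare tubes beyond one per color
}

/** Returns a list of human-readable problems; an empty list means the level is well-formed. */
function validateLevel(tubes: string[][], spec: LevelSpec): string[] {
  const problems: string[] = [];
  const expectedTubes = spec.colors + spec.emptyTubes;
  if (tubes.length !== expectedTubes) {
    problems.push(`expected ${expectedTubes} tubes, got ${tubes.length}`);
  }

  // Count how many layers of each color exist across all tubes.
  const counts: Record<string, number> = Object.create(null);
  tubes.forEach((tube, index) => {
    if (tube.length > spec.layers) {
      problems.push(`tube ${index} holds ${tube.length} layers, capacity is ${spec.layers}`);
    }
    for (const color of tube) counts[color] = (counts[color] || 0) + 1;
  });

  const seenColors = Object.keys(counts);
  if (seenColors.length !== spec.colors) {
    problems.push(`expected ${spec.colors} colors, got ${seenColors.length}`);
  }
  for (const color of seenColors) {
    if (counts[color] !== spec.layers) {
      problems.push(`color ${color} has ${counts[color]} layers, expected ${spec.layers}`);
    }
  }
  return problems;
}
```

With a spec of 2 colors, 4 layers, and 2 empty tubes, a level such as `[["a","b","a","b"], ["b","a","b","a"], [], []]` passes, while dropping a single layer is reported as a count mismatch for that color.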

Here is another impressive thing the research produced:

Observed in third-party descriptions of the canonical game:

- **[obs]** When a tube is selected, **it rises slightly** to indicate the
  selected state. [R1, R2, R7]
- **[obs]** Pouring is animated: water is shown decanting from source to
  destination with "dynamic" / "smooth" / "satisfying" motion. Exact rig
  (tilt angle, path, whether source translates toward destination, stream
  rendering) is described loosely and is not pinned to a specific
  animation. [R4, R5, R11]
- **[obs]** Color segments in the tube are flat, saturated color bands with a
  hard boundary between layers (no gradient between differing colors). [R3]
  (Direct visual confirmation from the canonical game is an open question;
  see Q3.)
- **[int]** Completion of a tube likely has a distinct visual acknowledgment
  (particles, glow, or similar). One cited variant (`watersort.org`)
  describes "a unicorn appears above the flask" as completion feedback
  [R12]; that is a genre variant, not the canonical IEC Global behavior.
  Tracked as Q4.

From that, the agent could synthesize a recreation brief, scope to a playable core, and only then have three plan threads converge. Skimming the UI/UX plan was when I first thought this would create something awesome.

Global: input locked from phase 1 through end of phase 6 settle.
Phase 1: source lifts an extra 48px (180ms, ease-out)
Phase 2: translate on straight horizontal path to above destination (280ms, ease-in-out)
Phase 3: tilt 62° toward dest, pivot at mouth (180ms, ease-out)
Phase 4: 10px stream, 120ms per unit transferred, destination fill rises in one continuous
         motion, source top segments shrink in sync - "the stream is the star"
... then untilt, return, landing bump (60ms) ...
1 unit total rig ~1.46s, up to ~1.82s for 4 units.

That is not something you reliably get from “make it juicy” in one line. It is also why Mark 3 felt different from Mark 1 and 2 in day-to-day work: the orchestrator had less need to invent the target feel from chat memory; it had a stack of references and plan files to execute against, and a build section that forces browser verification against those plans instead of declaring victory after typecheck.
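The timing numbers in that excerpt are also checkable. The sketch below treats the stated phase durations as constants and backs the combined length of the elided return phases (untilt, return, landing bump) out of the stated one-unit total instead of guessing them individually. The constant and function names are hypothetical; only the millisecond values come from the plan excerpt.

```typescript
// Durations stated in the plan excerpt, in milliseconds.
const LIFT_MS = 180; // phase 1
const TRANSLATE_MS = 280; // phase 2
const TILT_MS = 180; // phase 3
const STREAM_MS_PER_UNIT = 120; // phase 4, per unit transferred

// Stated total for a one-unit pour (~1.46s).
const ONE_UNIT_TOTAL_MS = 1460;

// Everything before the stream starts.
const OUTBOUND_MS = LIFT_MS + TRANSLATE_MS + TILT_MS; // 640

// The excerpt elides the individual untilt/return/bump durations, so this sketch
// only derives their combined length from the stated total.
const RETURN_MS = ONE_UNIT_TOTAL_MS - OUTBOUND_MS - STREAM_MS_PER_UNIT; // 700

/** Total rig duration for a pour that transfers the given number of units. */
function pourRigDurationMs(units: number): number {
  if (!(units >= 1) || Math.floor(units) !== units) {
    throw new RangeError("units must be a positive integer");
  }
  return OUTBOUND_MS + units * STREAM_MS_PER_UNIT + RETURN_MS;
}
```

Only the stream phase scales with the amount poured, so a four-unit pour adds 3 × 120 ms to the one-unit total: 1460 ms becomes 1820 ms, which matches the “up to ~1.82s for 4 units” line in the plan.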

The build process (what the agent is supposed to actually do in implementation) was written down like a checklist, not a vibe. The instructions looked like this:

## Process

1. Re-read `scope.md` and the three `plan/*.md` artifacts. Restate the milestone's player-visible outcome and the planned structural map in your own words before writing code.
2. If a prior `build-review.md` exists with open findings, list those findings as the first build tasks for this iteration.
3. Set up the running build path so you can render what you change:
   - confirm or start the dev server
   - load the relevant route or screen via the browser MCP
   - capture an initial screenshot or snapshot to anchor "before" state when iterating
4. Implement in plan order:
   - structural and state work first
   - then UI surface and styling
   - then motion, feedback, and polish described in `plan/ui-ux.md`
5. After meaningful changes, re-render in the browser MCP and compare against the relevant section of `plan/ui-ux.md` and the source references named in `scope.md`.
6. Cross off acceptance criteria from `plan/acceptance-criteria.md` only when you have actually verified them in the running build.
7. When you hit a real conflict between plan and implementation reality:
   - stop and describe the conflict to the user in plain language
   - propose the smallest change that resolves it
   - record the resolution in `deviations.md` with: what the plan said, what was built instead, why, and what player-visible impact it has
8. When the milestone is implemented and the acceptance criteria are credibly met against the running build, hand off to `build-review` rather than self-certifying.

What that did in practice: The agent was pushed to ship a playable, reviewable slice, then exercise it in the browser and fix problems while the context was still warm—before small issues stacked and turned into a cascade of hacks. That loop works, but it is not free: the full build pass ran more than 30 minutes in wall time, largely because playtesting and re-rendering were part of the loop instead of a quick typecheck and done. The trade was worth it: from the same single-line prompt as the earlier marks, I got a satisfying pour animation and a playable game in one go. The funny part is that build-review never ran—the handoff step is still in the doc for honesty, but continuous verification during build meant there was nothing left that needed a separate review pass to catch.

Where informal prompting still wins: A formal workflow does not remove you as editor. I still nudge scope, cut corners when the plan is too heavy for a milestone, and decide what “enough” evidence is. Mark 3 just makes the default grounded and comparable to the earlier marks, not magically effortless.

Approach (summary): I recreated the workflow I already use when prompting an agent, but with research and artifacts before plan, three focused plan tracks, a concrete build loop with browser checks, and build review as the gate in the spec. That makes it repeatable and comparable with Mark 0 through Mark 2, not only “best practice from a template.”

Intent: Encode this process so it is comparable across runs, with evidence as the load-bearing layer—not a longer chat. A longer build is acceptable when the output is the game, not a green check on the wrong thing.

Result: I am pretty satisfied with this initial result. It has a clean palette, and although the animation is not quite there yet, it is still way better than in all previous iterations. Remember: this was just a single prompt.

After that, I had another milestone: three difficulty levels and infinite level generation. That stretch helped me see that the whole workflow feels strong in the bootstrap phase—it tends to produce a surprisingly complete first cut—but once you are iterating on something that already exists, the same ceremony can feel like too much. That part of the process still leans on trial and error for me. You can play the final version below:

Here are the zip files of the final iteration. I hope you can reproduce my success. You can delete the docs and src folders to try to recreate any game you want!

Signing off, vulq2
