Digital Twins, Not Mock Responses
The word "mock" undersells what AI-assisted development actually needs. You don't want a function that returns 200 OK. You want a local service replica that behaves like the real thing.
This is Part 2 of a series on mock servers and AI development. Part 1 covers why AI agents need local service infrastructure at all.
The word “mock” has a perception problem.
When developers hear “mock server,” they picture a testing utility that returns hardcoded JSON. A function that says 200 OK and hands you the same response every time. That mental model is too small for what AI-assisted development actually requires.
What AI agents need — and what the best engineering teams have been building internally for years — isn’t a response faker. It’s a local replica of a service that behaves realistically enough to develop against with full confidence.
The manufacturing and IoT industries call this a digital twin: a virtual replica that mirrors the behavior of a real system. The concept maps directly to software development, and the timing isn’t a coincidence. AI agents are the reason it matters now.
The mock spectrum
There’s a wide range of what “mocking” can mean, and where you land on that range determines how useful your mock is to an AI agent:
Low end: A function that returns 200 OK with a hardcoded JSON blob. This is fine for a unit test. It verifies that your code handles a successful response. That’s all it does.
Middle: A server that matches requests by path and method, returns configured responses, maybe generates dynamic data like random UUIDs or timestamps. This covers most integration testing scenarios.
High end: A service replica that models state transitions, enforces auth flows, returns contextually correct error responses, and behaves closely enough to the real service that code developed against it works unchanged when you point it at staging.
Most mock servers live in the low-to-middle range. AI agents need the high end. Here’s why.
Why agents need realistic service behavior
A developer building a payment integration knows the Stripe API. They know that POST /charges returns a charge object with a status field. They know status: pending means the charge hasn’t settled. They know the edge cases because they’ve read the docs and built against Stripe before.
An AI agent doesn’t have that internalized context. It discovers behavior through interaction. It writes code, runs it, reads the response, and infers how the API works based on what comes back.
If your mock always returns 200 OK with a static charge object, the agent never encounters:
- What happens when a charge fails (402 Payment Required)
- What happens when the idempotency key is reused
- What happens when the auth token expires mid-session
- What happens when the amount exceeds the account’s limit
The agent builds code that only handles the happy path — because the happy path is all the mock ever showed it.
A digital twin models these behaviors. POST a charge with an expired token and you get a 401. POST a charge with an invalid amount and you get a 400 with a specific error code. POST a charge, then GET that charge, and it’s there with status: pending. POST the same idempotency key twice and you get the same response without a duplicate being created.
When an agent develops against a twin that behaves this way, it discovers edge cases organically. It writes error handling code because it encountered errors. It handles auth refresh because the token expired. It handles validation because validation failed. The resulting code is production-ready by the time it touches staging — not because someone told the agent about edge cases, but because the agent ran into them.
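The payoff shows up in the shape of the code the agent writes. Here's a sketch of the error handling it converges on, in Python with an injected `post` callable so no live server is needed — the function and exception names are illustrative, not part of any real SDK:

```python
class PaymentError(Exception): pass
class ValidationError(Exception): pass
class AuthError(Exception): pass

def create_charge(post, refresh_token, token, amount, idempotency_key):
    """Create a charge, refreshing the auth token once if it has expired."""
    for attempt in range(2):
        status, body = post(
            "/charges",
            headers={
                "Authorization": f"Bearer {token}",
                "Idempotency-Key": idempotency_key,  # safe to retry: same key
            },
            json={"amount": amount},
        )
        if status == 401:
            if attempt == 0:
                token = refresh_token()  # expired mid-session: refresh, retry
                continue
            raise AuthError("still unauthorized after token refresh")
        if status == 402:
            raise PaymentError(body.get("error", "payment_required"))
        if status == 400:
            raise ValidationError(body.get("error", "invalid_request"))
        return body  # success: the charge object
```

Every branch exists because the twin produced that response at least once during development. None of it appears when the mock only ever returns 200.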
Multi-worktree parallel development
Here’s the scenario that made me realize this matters at a completely different scale than I originally thought.
You have three engineers. Each is working on a different feature in a separate git worktree. Each has an AI agent handling the implementation.
- Engineer A: building a new checkout flow (needs payment service + inventory service)
- Engineer B: building user profile updates (needs user service + notification service)
- Engineer C: refactoring auth middleware (needs auth service + session service)
All three agents need running services to develop against. In the old world, they’d all point at the same staging environment.
The problems are immediate:
- Engineer A’s agent creates test payment records that show up in Engineer C’s auth test scenarios
- Engineer B’s agent triggers real notification emails in the staging environment
- Engineer C’s agent modifies auth behavior, which breaks staging for Engineers A and B
- All three agents compete for the same staging server’s capacity
Shared staging doesn’t work for parallel AI development. The agents are too fast, too aggressive, and too independent to share mutable state.
The fix is obvious once you see it: each worktree gets its own mock server.
```
worktree-checkout-flow/
  mockd.yaml        # payment + inventory twins
  mockd on :4280

worktree-user-profiles/
  mockd.yaml        # user + notification twins
  mockd on :4281

worktree-auth-refactor/
  mockd.yaml        # auth + session twins
  mockd on :4282
```

Three agents, three environments, zero conflicts. Each mock server starts in under a second, uses roughly 20 MB of memory, and runs until the worktree is done. Tear it down, spin up a new one for the next feature. There’s no shared state to corrupt and no infrastructure team to coordinate with.
This scales linearly. Ten engineers with ten worktrees? Ten mock servers, each isolated, each disposable. The cost is negligible — a Go binary running locally doesn’t show up on anyone’s cloud bill.
The setup agent / develop agent pattern
This is the workflow pattern I’m seeing emerge from teams that have figured out how to use AI agents for real work, not just code completion:
Agent 1 (the setup agent) reads the OpenAPI spec for the services you depend on. It creates a full mock environment — endpoints, response schemas, error scenarios, auth flows. Its job is to build the digital twin.
Agent 2 (the development agent) writes the actual feature code. It develops against the twin as if it’s the real service. Same base URL, same paths, same response shapes. It doesn’t know — and doesn’t need to know — that it’s talking to a mock.
The handoff is clean. Agent 1 creates infrastructure. Agent 2 uses infrastructure. When Agent 2’s code is ready, you change one environment variable and run the same code against staging.
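That single-variable switch can be as simple as resolving the base URL from the environment. A minimal sketch — the `API_BASE_URL` name and the default port are assumptions for illustration, not a mockd convention:

```python
import os

def api_base_url(default="http://localhost:4280"):
    """Resolve the service base URL; the local twin is just the default.

    Export API_BASE_URL to point the same code at staging or production.
    """
    return os.environ.get("API_BASE_URL", default)
```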
With MCP-enabled mock servers, Agent 1 doesn’t need a human to write configuration files. It calls the mock server’s tools directly:
Agent 1 (setup):

1. Reads openapi.yaml from the dependency repo
2. Calls mockd MCP tool: `import_mocks(format: "openapi", content: ...)`
3. Adds error scenarios: `create_mock(path: /charges, status: 402, ...)`
4. Adds auth flow: `create_mock(path: /auth/token, ...)`
5. Verifies all endpoints respond correctly

Agent 2 (development):

1. Writes feature code against localhost:4280
2. Tests, iterates, handles errors it encounters
3. All tests pass
4. Ready for staging validation

This is how MCP works today. mockd exposes 19 tools that any MCP-compatible AI assistant — Claude Code, Cursor, GitHub Copilot — can call programmatically. The setup agent doesn’t generate YAML and hope it’s valid. It calls tools that create mocks, verifies they respond correctly, and hands off a working environment.
State is what separates twins from mocks
If I had to pick the single biggest difference between a static mock and a digital twin, it’s state.
A static mock: GET /users/1 always returns the same user. POST /users always returns 201 with a canned response. GET /users/1 still returns the same thing — the POST didn’t actually create anything.
A stateful twin: POST /users with {"name": "Alice", "email": "alice@example.com"} creates a user and returns {"id": "7", "name": "Alice", "email": "alice@example.com"}. Now GET /users/7 returns Alice. PUT /users/7 updates her. DELETE /users/7 removes her. GET /users/7 returns 404.
This matters enormously for AI agents. An agent building a user management feature needs to exercise the full create-read-update-delete cycle. If the mock can’t hold state between requests, the agent can’t verify that its creation code and its retrieval code work together. It writes both in isolation and hopes they’re compatible.
With stateful mocking, the agent’s test workflow mirrors a real user’s workflow:
```yaml
# mockd.yaml — stateful user service twin
statefulResources:
  - name: users
    basePath: /api/users
    seedData:
      - id: "1"
        name: "Alice"
        email: "alice@example.com"
      - id: "2"
        name: "Bob"
        email: "bob@example.com"
    validation:
      required: [name, email]
      fields:
        email:
          type: string
          format: email
```

That configuration creates a user service twin with auto-generated CRUD endpoints, input validation, seed data, and state that persists across requests within the session. POST a new user, GET the list, the new user is there. An AI agent building a user management feature has everything it needs to develop with confidence.
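The create-then-read cycle that a static mock can't support is easy to see in a minimal in-memory sketch. This illustrates the concept only — it is not mockd's implementation:

```python
# State persists across calls, so create/read/delete compose the same way
# they do against a real service. A static mock has no self.items to mutate.
class StatefulResource:
    def __init__(self, seed=None):
        self.items = {item["id"]: dict(item) for item in (seed or [])}
        self.next_id = len(self.items) + 1

    def create(self, data):
        item = {"id": str(self.next_id), **data}
        self.next_id += 1
        self.items[item["id"]] = item
        return 201, item                      # POST -> 201 with the new resource

    def get(self, item_id):
        item = self.items.get(item_id)
        return (200, item) if item else (404, None)

    def delete(self, item_id):
        if self.items.pop(item_id, None) is not None:
            return 204, None                  # DELETE -> 204, resource gone
        return 404, None
```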
Combine twins across protocols
Real services don’t just speak HTTP. A modern application might have:
- An HTTP REST API for CRUD operations
- A WebSocket connection for real-time notifications
- An OAuth provider for authentication
- A gRPC service for internal communication
A digital twin of this system needs to model all of these. If your mock only handles HTTP, the agent can’t develop the WebSocket notification handler, can’t test the OAuth login flow, can’t verify the gRPC call.
This is where multi-protocol mock servers matter. One mockd config, one process, all the protocols:
```yaml
version: "1.0"
mocks:
  # REST API
  - name: User API
    type: http
    http:
      matcher: { method: GET, path: /api/users }
      response:
        statusCode: 200
        body: '[{"id": 1, "name": "Alice"}]'

  # Real-time notifications
  - name: Notification Feed
    type: websocket
    websocket:
      path: /ws/notifications
      matchers:
        - match: { type: json, path: "$.type", value: "subscribe" }
          response:
            type: json
            value: { type: subscribed, channel: user-updates }

  # Auth provider
  - name: OAuth Token
    type: http
    http:
      matcher: { method: POST, path: /oauth/token }
      response:
        statusCode: 200
        body: '{"access_token": "{{uuid}}", "token_type": "bearer", "expires_in": 3600}'
```

The agent develops against all three simultaneously. The HTTP endpoint, the WebSocket feed, and the auth flow all run from a single process on a single port. The same code that talks to localhost:4280 today talks to api.production.com tomorrow.
When to graduate from the twin
Digital twins are for development. They’re not the final word on whether your code works.
The pipeline should look like this:
- Twin (local mock server): Active development. Agent writes code, iterates, tests. Fast, free, isolated. This is where 80% of development time is spent.
- Staging (real services): Integration validation. The code that worked against the twin gets tested against real dependencies. This should be quick — the twin already caught most issues.
- Production: Ship it.
If your team is spending significant time debugging code against staging, that’s a signal: your twins aren’t realistic enough. The fix isn’t more staging time — it’s better twins.
The goal isn’t to eliminate staging. It’s to make staging boring. By the time code reaches staging, it should work. The twin already exercised the happy paths, the error paths, the auth flows, and the state transitions. Staging is a sanity check, not a development environment.
Up next
This post covered why AI agents need service replicas, not just mock responses, and how multi-worktree parallel development makes isolated mock environments essential.
Part 3 zooms out to the enterprise level: how mock servers become the first environment in your deployment pipeline, what companies like Airbnb and Uber built internally (and what it cost them), and why AI agents accelerate the need for this infrastructure by an order of magnitude.
Read the series:
- Part 1: Your AI Coding Agent Needs a Dev Environment Too
- Part 3: The Mock Server Is the First Environment in Your Pipeline
Learn more
- All features — stateful resources, MCP server, recording proxy, and multi-protocol mocking
- All 7 protocols — HTTP, gRPC, GraphQL, WebSocket, MQTT, SSE, SOAP in one binary
- Mockd vs WireMock — why multi-protocol matters when building service replicas
- Enterprise features — RBAC, audit logging, and team-wide mock sharing
Links
- GitHub: github.com/getmockd/mockd (Apache 2.0)
- Docs: docs.mockd.io
- Install:
`brew install getmockd/tap/mockd` or `curl -fsSL https://get.mockd.io | sh`