Fixing Flaky Tests in MCP Apps, ChatGPT Apps, and Claude Connectors (May 2026)

Abe Wheeler
How to find and fix flaky tests in MCP Apps across ChatGPT and Claude.

TL;DR: Flaky tests pass and fail on the same code, which destroys trust in your suite. MCP Apps flake for the usual reasons (real network calls, async UI timing, shared state, the clock, randomness) plus one that is unique to AI apps: the model picks tools and arguments differently on each run. Fix flakiness by making every input deterministic. Pin tool data with simulation files, stub external APIs, replace sleeps with auto-retrying assertions, freeze time, seed randomness, and isolate tests. For evals, assert on a pass rate across many runs instead of a single exact match.


A flaky test is one that passes and fails on the same code without any change. You push, CI goes red, you re-run it, and now it is green. Nothing changed except your confidence in the test suite.

Flaky tests are worse than no tests because they train your team to ignore failures. Once people learn that a red build sometimes means nothing, they start clicking “re-run” on every failure, including the real ones. A suite you do not trust is a suite you do not use.

MCP Apps, ChatGPT Apps, and Claude Connectors get flaky for the same reasons web apps do, plus one extra that comes from sitting behind a language model. This post covers every source of flakiness in an MCP App test suite and how to fix each one, whether or not you use a framework like sunpeak.

Why MCP Apps Flake More Than Regular Web Apps

A standard web app has one runtime: the browser. An MCP App has several layers that all run during a single tool call, and each one can introduce non-determinism:

  1. A model interprets your tool schema and decides which tool to call with which arguments.
  2. The MCP protocol routes that call to your tool handler.
  3. Your tool handler runs server-side and often calls a database or a third-party API.
  4. The host (ChatGPT, Claude, VS Code, Goose) renders your resource component in an iframe.
  5. Your component reads the tool output and renders UI.

Every one of these layers is a place where the same test input can produce a different result. The model is non-deterministic by nature. The external API can be slow or return different data. The iframe renders asynchronously. If your test does not control these inputs, the test result depends on luck.

The good news is that flakiness is fixable. Every flaky test has a specific non-deterministic input, and every input can be pinned down. The rest of this post walks through the common ones, from the AI-specific cause to the boring-but-frequent ones.

Cause 1: LLM Non-Determinism

This is the cause unique to AI apps, and it is the one people get wrong most often.

When a test sends a real prompt to a real model (for example, an eval that checks whether GPT-4o calls the right tool), the model can choose differently on each run. A prompt that triggers show-albums nine times out of ten will trigger get-photos on the tenth. If your test asserts that the model calls show-albums exactly once and treats anything else as a failure, the test will go red roughly one run in ten. That is flakiness, but the test is not buggy. The model is doing what models do.

The mistake is treating a non-deterministic check like a deterministic one. You cannot make a model deterministic, so do not write the test as if you can.

The fix has three parts.

First, run each case many times and assert on a pass rate, not a single pass. A model that calls the right tool 9 of 10 times is reliable enough to ship. A model that calls it 5 of 10 times has a real schema problem. You only learn the difference by running the case repeatedly.

// tests/evals/eval.config.ts
import { defineEvalConfig } from 'sunpeak/eval';

export default defineEvalConfig({
  models: ['gpt-4o', 'claude-sonnet-4-20250514', 'gemini-2.0-flash'],
  defaults: {
    runs: 10,
    temperature: 0,
  },
});

Setting temperature: 0 makes the model as deterministic as the API allows, which cuts noise. It does not eliminate variation, because most providers do not guarantee identical output even at temperature 0, so you still need multiple runs.
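If your eval runner does not compute a pass rate for you, the check itself is small. The sketch below is a hypothetical helper (evaluatePassRate is not a sunpeak API) that turns a list of per-run outcomes into a pass/fail decision against a threshold. The 80% threshold is an assumption, so pick one that fits your own risk tolerance.

```typescript
// Hypothetical helper, not part of any framework: summarize repeated eval runs.
interface PassRateResult {
  passed: number;
  total: number;
  rate: number;
  ok: boolean;
}

function evaluatePassRate(outcomes: boolean[], threshold: number): PassRateResult {
  if (outcomes.length === 0) {
    throw new Error('need at least one run');
  }
  const passed = outcomes.filter(Boolean).length;
  const rate = passed / outcomes.length;
  return { passed, total: outcomes.length, rate, ok: rate >= threshold };
}

// 9 of 10 runs picked the right tool: clears an 80% threshold.
const reliable = evaluatePassRate(
  [true, true, true, true, true, true, true, true, true, false],
  0.8,
);

// 5 of 10 runs picked the right tool: fails the same threshold.
const unreliable = evaluatePassRate(
  [true, false, true, false, true, false, true, false, true, false],
  0.8,
);
```

The eval fails only when the rate drops below the threshold, so one odd run out of ten no longer turns the build red.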

Second, match arguments loosely. A model might say "New York" on one run and "new york" on another, and both are correct. Asserting an exact string makes a correct call look like a failure. Use a pattern instead:

// tests/evals/albums.eval.ts
import { expect } from 'vitest';
import { defineEval } from 'sunpeak/eval';

export default defineEval({
  cases: [
    {
      name: 'asks for a specific album',
      prompt: 'Show me photos from my vacation in Italy',
      expect: {
        tool: 'show-albums',
        args: {
          query: expect.stringMatching(/italy/i),
        },
      },
    },
  ],
});
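The same loose matching works without a framework. Here is a minimal sketch, assuming tool arguments arrive as a plain object. The argsMatch helper and the ArgMatcher type are made up for illustration.

```typescript
// Hypothetical matcher type: a literal value or a RegExp for each argument.
type ArgMatcher = string | number | boolean | RegExp;

function argsMatch(
  actual: Record<string, unknown>,
  expectedArgs: Record<string, ArgMatcher>,
): boolean {
  return Object.entries(expectedArgs).every(([key, matcher]) => {
    const value = actual[key];
    if (matcher instanceof RegExp) {
      // Patterns only apply to strings; anything else is a mismatch.
      // Avoid the "g" flag here: it makes RegExp.test stateful between calls.
      return typeof value === 'string' && matcher.test(value);
    }
    return value === matcher;
  });
}

const expectedArgs = { query: /italy/i };
const lowerCase = argsMatch({ query: 'italy vacation' }, expectedArgs);
const titleCase = argsMatch({ query: 'Vacation in Italy' }, expectedArgs);
const wrongPlace = argsMatch({ query: 'France' }, expectedArgs);
```

Both phrasings of the Italy query count as a correct call, while a query for a different place still fails.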

Third, keep evals out of the blocking pull request pipeline. Evals cost API credits and they vary by design, so running them on every push wastes money and produces random red builds. Gate them to the main branch in CI/CD and run your deterministic tests on every push. For the full eval setup, see MCP App evals.

The broader rule: keep model behavior out of your deterministic tests entirely. Your unit, integration, and e2e tests should never call a real model. They should test your code with fixed tool data, which brings us to the next cause.

Cause 2: Real Network Calls in Tool Handlers

Most MCP App tool handlers call something: a database, your own backend, a third-party API. If your test runs the real handler and the handler makes a real network call, the test now depends on that service being up, fast, and returning the same data every time. None of those are guaranteed.

A test that calls a live weather API will fail when the API is down, when it rate-limits you, or when the forecast changes. The test is not finding a bug in your code. It is finding a bug in the network.

Stub the external service so the handler gets the same response every run. The exact mechanism depends on your stack, but the principle is the same: replace the network boundary with a fixed value.

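import { test, expect, vi } from 'vitest';
import { fetchWeather } from '../../src/services/weather';
import { handleShowWeather } from '../../src/tools/show-weather';

// Replace the weather service module with auto-mocked functions, so the
// handler's own import of fetchWeather is the stubbed one. Spying on a fresh
// object literal would not affect the function the handler actually calls.
vi.mock('../../src/services/weather');

test('weather tool formats the API response', async () => {
  // Every run gets the same response; no real request is made.
  vi.mocked(fetchWeather).mockResolvedValue({
    tempC: 18,
    condition: 'cloudy',
  });

  const result = await handleShowWeather({ city: 'London' });

  expect(result.structuredContent.temperature).toBe('18°C');
  expect(result.structuredContent.condition).toBe('cloudy');
});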
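An alternative to patching the import is to pass the network dependency into the handler as a parameter. The sketch below assumes a handler you are free to restructure. The showWeather function, the Weather shape, and the FetchWeather type are hypothetical and may not match your code.

```typescript
// Hypothetical shapes for illustration; a real handler and service may differ.
interface Weather {
  tempC: number;
  condition: string;
}

type FetchWeather = (city: string) => Promise<Weather>;

// The handler receives its network dependency instead of importing it.
async function showWeather(city: string, fetchWeather: FetchWeather) {
  const weather = await fetchWeather(city);
  return {
    structuredContent: {
      temperature: `${weather.tempC}°C`,
      condition: weather.condition,
    },
  };
}

// A stub that returns the same data on every call and never touches the network.
const stubFetchWeather: FetchWeather = async () => ({ tempC: 18, condition: 'cloudy' });

const pendingResult = showWeather('London', stubFetchWeather);
```

Production code passes the real client, tests pass the stub, and no mocking library has to know how your modules are wired.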

For component-level tests, go one layer up and pin the entire tool result with a simulation file. A simulation is a JSON file that defines a complete tool invocation (the input the host sends and the output your component receives), so your resource component always renders against the same data without any handler or network call running.

{
  "tool": "show-weather",
  "userMessage": "What is the weather in London?",
  "toolInput": { "city": "London" },
  "toolResult": {
    "structuredContent": {
      "temperature": "18°C",
      "condition": "cloudy"
    }
  }
}

With the data fixed, the only thing your test exercises is your code, which is the only thing it should be testing. Save real network calls for live tests that run before a release, not for the suite you run on every commit.
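A fixture that drifts out of shape can cause its own confusing failures, so it can help to validate simulation data before a test uses it. This is a rough sketch: the Simulation interface is inferred from the example above, not taken from a published schema, and isSimulation is a hypothetical helper.

```typescript
// Shape inferred from the example simulation; treat it as an assumption.
interface Simulation {
  tool: string;
  userMessage: string;
  toolInput: Record<string, unknown>;
  toolResult: { structuredContent: Record<string, unknown> };
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

function isSimulation(value: unknown): value is Simulation {
  if (!isRecord(value)) return false;
  const { tool, userMessage, toolInput, toolResult } = value;
  return (
    typeof tool === 'string' &&
    typeof userMessage === 'string' &&
    isRecord(toolInput) &&
    isRecord(toolResult) &&
    isRecord(toolResult.structuredContent)
  );
}

const parsed: unknown = JSON.parse(`{
  "tool": "show-weather",
  "userMessage": "What is the weather in London?",
  "toolInput": { "city": "London" },
  "toolResult": {
    "structuredContent": { "temperature": "18°C", "condition": "cloudy" }
  }
}`);

const validSimulation = isSimulation(parsed);
// Missing toolResult, so the guard rejects it.
const missingResult = isSimulation({ tool: 'show-weather', userMessage: 'hi', toolInput: {} });
```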

Cause 3: Async UI Timing

This is the single most common cause of flaky browser tests, in MCP Apps and everywhere else. Your component renders asynchronously: data arrives, state updates, React re-renders, images load. A test that checks the DOM before that work finishes fails. A test that checks after it finishes passes. The difference is a few milliseconds of timing that varies with machine load.

The wrong fix is a fixed sleep:

// Don't do this
await page.waitForTimeout(500);
await expect(app.locator('.album-card')).toHaveCount(4);

A fixed sleep is a guess. On a fast machine it wastes 500ms. On a slow CI runner it is not long enough and the test flakes. There is no sleep value that is both fast and reliable.

The right fix is a web-first assertion that retries until the condition is true or a timeout is hit. Playwright (which sunpeak’s e2e fixture uses) does this automatically for expect on locators:

import { test, expect } from 'sunpeak/test';

test('album grid renders search results', async ({ inspector }) => {
  const result = await inspector.renderTool('search-albums', {
    query: 'vacation',
  });

  const app = result.app();

  // Retries until the heading appears, no fixed sleep
  await expect(app.getByRole('heading')).toContainText('vacation');
  await expect(app.locator('.album-card')).toHaveCount(4);
});

The assertion polls the DOM until the album cards exist or it times out. Fast machines pass instantly. Slow machines wait as long as they need. The test is both quick and stable because it waits for the actual condition instead of a clock.
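To see why this beats a sleep, here is a simplified polling loop. It illustrates the idea only and is not Playwright's actual implementation. The waitFor helper and the timings are made up for the example.

```typescript
// A simplified polling loop for illustration; real test runners do more than this.
async function waitFor(
  condition: () => boolean,
  { timeoutMs = 5000, intervalMs = 50 } = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (!condition()) {
    if (Date.now() >= deadline) {
      throw new Error(`condition not met within ${timeoutMs}ms`);
    }
    // Yield to the event loop, then check again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Simulate a UI that finishes rendering some time after the test starts.
let cardCount = 0;
setTimeout(() => {
  cardCount = 4;
}, 120);

// Resolves shortly after the cards exist, or rejects once the timeout is hit.
const cardsRendered = waitFor(() => cardCount === 4, { timeoutMs: 2000, intervalMs: 10 });
```

The wait ends when the condition becomes true, so the timeout is only an upper bound and never a fixed cost.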

A few more timing rules that kill e2e flakiness:

  • Never assert on an element mid-animation. Wait for the animation to settle, or disable animations in the test environment.
  • Wait for the specific element you care about, not for a generic networkidle or a page load event, which can fire before your component finishes rendering.
  • If you test display mode transitions, wait for the post-transition layout to be visible before asserting on it.

Cause 4: Shared State Between Tests

Tests should not depend on each other. When test B only passes because test A ran first and left some state behind, you have an ordering dependency. It looks fine until the order changes, which happens when you run tests in parallel, when you run a single test in isolation, or when a test runner shuffles the order.

Common culprits in MCP Apps:

  • A module-level variable that one test mutates and another reads.
  • A mock set up in one test that leaks into the next because it was not reset.
  • A shared fixture object that tests modify in place.

The fix is isolation. Reset mocks between tests, build fresh test data in each test instead of sharing one mutable object, and avoid module-level mutable state in your handlers. In Vitest, reset automatically:

import { afterEach, vi } from 'vitest';

afterEach(() => {
  vi.restoreAllMocks();
});
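For test data, a small factory function is usually enough to keep tests from sharing one mutable object. A minimal sketch, assuming a simple album shape (the Album interface and makeAlbum are hypothetical):

```typescript
// Hypothetical Album shape for illustration.
interface Album {
  id: string;
  title: string;
  photoIds: string[];
}

// Each call returns a brand-new object, including a fresh default array.
function makeAlbum(overrides: Partial<Album> = {}): Album {
  return { id: 'album-1', title: 'Vacation', photoIds: ['p1', 'p2'], ...overrides };
}

// "Test A" mutates its own copy.
const albumA = makeAlbum();
albumA.photoIds.push('p3');

// "Test B" still sees the untouched defaults, regardless of run order.
const albumB = makeAlbum({ title: 'Italy' });
```

Because every test builds its own data, a mutation in one test cannot change what another test sees.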

To catch ordering dependencies before they bite you in CI, shuffle the order on purpose:

pnpm test:unit -- --sequence.shuffle

If shuffling turns a green suite red, you have a hidden dependency between tests. Find the shared state and remove it.

Cause 5: Time, Dates, and Timezones

Any code that reads the clock is a flakiness risk. A test that checks “this item is from today” passes today and fails tomorrow. A test that formats a timestamp passes in your timezone and fails in CI, which usually runs in UTC. A test that computes “3 days ago” gets a different answer every day.

Freeze the clock so time stops moving during the test:

import { test, expect, vi, beforeEach, afterEach } from 'vitest';
import { formatRelativeTime } from '../../src/utils/time';

beforeEach(() => {
  vi.useFakeTimers();
  vi.setSystemTime(new Date('2026-05-22T12:00:00Z'));
});

afterEach(() => {
  vi.useRealTimers();
});

test('formats a timestamp from two hours ago', () => {
  expect(formatRelativeTime('2026-05-22T10:00:00Z')).toBe('2 hours ago');
});

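Now Date.now() and new Date() return the same fixed value every run, so the relative time is always “2 hours ago” no matter when the test runs. Pin your dates to a fixed UTC instant in your test data too, so a developer in Tokyo and a CI runner in UTC compute the same result. Freezing the clock does not change the runner's timezone, so if your code formats dates in local time, also fix the timezone the tests run in (for example, by setting the TZ environment variable).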
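If you would rather not touch global timers, you can make the current time an input. The sketch below is a hypothetical implementation of a relative-time formatter that takes "now" as an optional parameter. A real formatRelativeTime may look quite different.

```typescript
// Hypothetical implementation for illustration.
// The current time is a parameter, so tests can pass a fixed instant.
function formatRelativeTime(isoTimestamp: string, nowMs: number = Date.now()): string {
  const msPerHour = 60 * 60 * 1000;
  const hours = Math.floor((nowMs - Date.parse(isoTimestamp)) / msPerHour);
  if (hours < 1) return 'less than an hour ago';
  if (hours < 24) return `${hours} hour${hours === 1 ? '' : 's'} ago`;
  const days = Math.floor(hours / 24);
  return `${days} day${days === 1 ? '' : 's'} ago`;
}

// A fixed UTC instant: the trailing "Z" removes any dependence on local timezone.
const fixedNow = Date.parse('2026-05-22T12:00:00Z');

const twoHours = formatRelativeTime('2026-05-22T10:00:00Z', fixedNow); // "2 hours ago"
const threeDays = formatRelativeTime('2026-05-19T12:00:00Z', fixedNow); // "3 days ago"
```

Production callers omit the second argument and get the real clock.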

Cause 6: Randomness

If your code calls Math.random(), generates a UUID, or shuffles an array, the output changes every run. Asserting on a random value is asking for a flaky test.

Do not stub Math.random() globally, because that affects unrelated code. Instead, make the randomness an input you can control. Pass the random source in as a parameter, or wrap it so a test can substitute a fixed sequence:

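// src/utils/sample.ts
export interface Album {
  id: string;
}

// The random source is a parameter that defaults to Math.random.
export function pickFeatured(items: Album[], rng: () => number = Math.random): Album {
  return items[Math.floor(rng() * items.length)];
}

// test
import { test, expect } from 'vitest';
import { pickFeatured } from '../../src/utils/sample';

test('picks the first item with a fixed rng', () => {
  const albums = [{ id: 'a' }, { id: 'b' }, { id: 'c' }];
  const featured = pickFeatured(albums, () => 0); // always picks index 0
  expect(featured.id).toBe('a');
});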

The production code uses real randomness. The test passes a fixed function, so the result is deterministic. If you cannot inject the source, assert on a property that holds for any random value (the result is one of the valid items) rather than on the specific value.
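When a constant function is too crude (for example, when testing a shuffle), a seeded generator gives you a repeatable sequence. The sketch below uses mulberry32, a small, widely circulated seeded PRNG that is fine for tests and not suitable for cryptography. The shuffle helper is hypothetical, and the seed value 42 is arbitrary.

```typescript
// mulberry32: a compact seeded PRNG. The same seed always yields the same sequence.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper: Fisher-Yates shuffle driven by the injected random source.
function shuffle<T>(items: readonly T[], rng: () => number): T[] {
  const result = [...items];
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    const tmp = result[i] as T;
    result[i] = result[j] as T;
    result[j] = tmp;
  }
  return result;
}

const ids = ['a', 'b', 'c', 'd', 'e'];
// Two generators with the same seed produce the same order.
const firstRun = shuffle(ids, mulberry32(42));
const secondRun = shuffle(ids, mulberry32(42));
```

Assert that both runs match, or that the result is a permutation of the input, so the test stays valid even if you later swap the generator.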

How to Detect Flaky Tests Before They Bite

A flaky test usually passes on the run where you wrote it, then fails weeks later when timing shifts. To catch it early, force it to run many times under varied conditions.

Run each test repeatedly. Playwright runs every test N times in one command, which reproduces timing races that a single run hides:

pnpm test:e2e -- --repeat-each=20

If a test passes 20 times in a row, it is probably stable. If it fails 1 of 20, you have found a flaky test before it found you. Run this on any test that touches the DOM, async data, or timing.

Shuffle the order to catch shared state, as shown earlier with --sequence.shuffle. Run the suite headless locally, on the same browser version that CI uses, so a “works on my machine” pass does not hide a CI-only failure. For the broader picture of which tests to run where, see the MCP App testing strategy.

Why Retries Are a Trap

When a test flakes, the tempting fix is to retry it automatically. Most runners support it, and it makes the red build go green. It also hides real bugs.

A test that fails 1 run in 50 is often catching a genuine race condition or a real intermittent error in your code. If you retry it until it passes, you ship the bug and silence the only thing that was warning you about it. Blanket retries turn your test suite into a system that reports “probably fine.”

A more honest approach:

  • Allow retries only as a temporary measure, and log every retry so you can see which tests need them.
  • Treat any test that relies on retries as a bug ticket, not a settled state.
  • Quarantine a test you cannot fix immediately: move it out of the blocking suite so it stops blocking merges, but keep running it so you still see the signal, and fix the root cause from the list of causes above.
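The "log every retry" rule can be sketched as a thin wrapper around a test function. This is a hypothetical helper for illustration, not an API of Vitest or Playwright. Most runners have their own retry setting, and the point here is only the logging.

```typescript
// Hypothetical wrapper: retries a test but records every test that needed it.
interface RetryRecord {
  name: string;
  attempts: number;
}

const retryLog: RetryRecord[] = [];

async function runWithLoggedRetries(
  name: string,
  testFn: () => Promise<void> | void,
  maxAttempts = 3,
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await testFn();
      // Record only tests that needed more than one attempt.
      if (attempt > 1) retryLog.push({ name, attempts: attempt });
      return;
    } catch (error) {
      // Out of attempts: surface the real failure instead of hiding it.
      if (attempt === maxAttempts) throw error;
    }
  }
}

// A deliberately flaky check: fails on the first call, passes on the second.
let calls = 0;
const flakyRun = runWithLoggedRetries('album grid renders', () => {
  calls += 1;
  if (calls === 1) throw new Error('simulated race');
});
```

Anything that shows up in retryLog is a candidate for a bug ticket, even though the build stayed green.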

The goal is a suite where a red build always means something. Retries work against that goal, so use them only while you hunt down the real cause.

Run the Bulk of Your Tests Locally and Deterministically

The most reliable test is one that does not depend on anything outside your control. That points to a clear split:

  • Run unit, integration, e2e, and cross-host tests locally against a simulated host runtime with fixed data. These are deterministic, so run them on every push.
  • Run evals on the main branch, with pass-rate thresholds, because model behavior varies by design.
  • Run live tests against real ChatGPT and Claude before a release, as a final smoke check, not as the gate on every commit.

This is where a testing framework earns its keep. sunpeak runs your tests against a local replica of the ChatGPT and Claude runtimes, so your e2e and cross-host tests exercise real host behavior (CSS variables, iframe sandboxing, protocol timing) without depending on a real account, a real network, or a real model. Simulation files pin the data, the mcp and inspector fixtures from sunpeak/test give you protocol-accurate inputs, and Playwright’s web-first assertions handle the timing. The combination removes the non-deterministic inputs that cause most flakiness, so a red build signals a real problem.

A Deterministic Test Checklist

Before you call a test stable, check that it controls every non-deterministic input:

  1. No real model calls. Model behavior lives in evals with pass-rate thresholds, not in your other tests.
  2. No real network calls. External services are stubbed or replaced with simulation data.
  3. No fixed sleeps. Timing is handled by auto-retrying assertions that wait for the actual condition.
  4. No shared mutable state. Mocks reset between tests, and test data is built fresh per test.
  5. No live clock. Time is frozen and dates are pinned to a fixed UTC instant.
  6. No raw randomness in assertions. The random source is injected and fixed, or you assert on a property instead of a value.

A test that passes all six runs the same way every time. That is the whole point: the same code should always give the same result, so when a test fails, you know the code changed, not the weather.

Get Started

npx sunpeak new

Frequently Asked Questions

Why are my MCP App tests flaky?

MCP App tests go flaky for the same reasons as any web app, plus one that is unique to AI apps. The shared causes are real network calls to external APIs, async UI rendering with hard-coded sleeps, shared state between tests, and code that reads the clock or generates random values. The AI-specific cause is LLM non-determinism: when a test sends a prompt to a real model, the model can pick a different tool or different arguments on each run, so any test that asserts an exact model response will pass and fail at random. The fix is to make each layer deterministic: pin tool data with simulation files, stub external services, replace sleeps with web-first assertions, freeze time and seed randomness, and treat model behavior as a statistical pass rate rather than an exact match.

How do I make ChatGPT App and MCP App tests deterministic?

Control every input your test depends on. Use simulation files (fixed JSON tool inputs and outputs) so resource components always receive the same data. Stub external APIs in your tool handlers so network responses do not vary. Replace waitForTimeout with auto-retrying assertions like expect(locator).toBeVisible() so the test waits for the actual condition instead of a fixed delay. Freeze time with fake timers and seed any random number generation. Run each test in isolation with no shared mutable state. Once every input is fixed, the same code produces the same result every run.

Are LLM evals supposed to be flaky?

Evals are non-deterministic by design, so a single run proves nothing, but a well-written eval is not flaky. The trick is to assert on a pass rate across many runs instead of requiring a perfect pass on one run. Run each case 10 or more times per model, set temperature to 0 to reduce noise, use fuzzy argument matching for values the model might phrase differently, and define a threshold (for example, fail if a case passes fewer than 8 of 10 runs). Keep evals out of the blocking pull request pipeline and gate them to the main branch so normal model variation does not randomly block merges.

Should I retry flaky tests in CI?

Retries are a stopgap, not a fix. A blanket retry hides real bugs because a test that fails one run in twenty is often catching a genuine race condition or a real intermittent error. Use retries sparingly, log every retry so you can see which tests need them, and treat any test that relies on retries as a bug to investigate. The better pattern is to quarantine a flaky test (move it out of the blocking suite, file a ticket, and fix the root cause) rather than retrying it forever in the main pipeline.

How do I stop e2e tests for MCP Apps from being flaky?

Most e2e flakiness comes from timing. Replace fixed sleeps with web-first assertions that retry until the condition is true, such as await expect(app.getByRole("heading")).toContainText("Results"). Render components against fixed simulation data instead of live tool handlers so the data never changes between runs. Avoid asserting on animations mid-transition. Run the suite headless in CI with the same browser version you use locally. If a test still flakes, run it with Playwright --repeat-each=20 to reproduce the failure, then fix the specific wait that is racing.

How do I test code that uses Date.now() or Math.random() in an MCP App?

Freeze the clock and seed the randomness so the output is reproducible. In Vitest, call vi.useFakeTimers() and vi.setSystemTime(new Date("2026-01-01")) before the test so any Date.now() or new Date() returns a fixed value, then vi.useRealTimers() in cleanup. For random values, inject the random source as a parameter or wrap it so a test can replace it with a fixed sequence. Never assert on a value that is computed from the real clock or a real random call, because it will differ on the next run.

How do I detect flaky tests before they reach CI?

Run the test many times in a row and shuffle the order. Playwright --repeat-each=20 runs each test twenty times in one command, which surfaces timing races. Vitest --sequence.shuffle randomizes test order, which surfaces hidden state shared between tests. Run both locally before you push. In CI, track which tests fail intermittently over time rather than only on the run that happened to fail, because a test that fails once a week is still flaky.
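Tracking intermittent failures can start as a small script over your CI history. The sketch below flags any test that both passed and failed on the same commit, which matches the definition of a flaky test. The TestRun record shape and the findFlakyTests helper are assumptions, so adapt them to whatever your CI exports.

```typescript
// Hypothetical record shape for one test result in CI.
interface TestRun {
  test: string;
  commit: string;
  passed: boolean;
}

// A test is flaky if it both passed and failed on the same commit.
function findFlakyTests(runs: TestRun[]): string[] {
  const outcomes = new Map<string, Set<boolean>>();
  for (const run of runs) {
    const key = `${run.test}@${run.commit}`;
    const seen = outcomes.get(key) ?? new Set<boolean>();
    seen.add(run.passed);
    outcomes.set(key, seen);
  }
  const flaky = new Set<string>();
  outcomes.forEach((seen, key) => {
    // Both true and false were observed for the same test on the same code.
    if (seen.size === 2) flaky.add(key.slice(0, key.lastIndexOf('@')));
  });
  return Array.from(flaky).sort();
}

const history: TestRun[] = [
  { test: 'album grid renders', commit: 'abc123', passed: false },
  { test: 'album grid renders', commit: 'abc123', passed: true },
  { test: 'weather tool formats', commit: 'abc123', passed: true },
  { test: 'weather tool formats', commit: 'def456', passed: false },
];

const flakyTests = findFlakyTests(history);
```

The weather test is not flagged because its failure came with a new commit, which points to a real change and not to flakiness.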

Why are tests that hit real ChatGPT or Claude flaky?

Live tests against real hosts depend on infrastructure you do not control: model behavior changes, the host runtime updates, networks vary, and accounts rate-limit. That makes live tests inherently less stable than local tests. Use them sparingly for final validation before a release, not as the gate on every pull request. Do the bulk of your testing locally against a simulated host runtime, which is deterministic, then run live tests as a final smoke check.