
Regression Testing MCP Apps, ChatGPT Apps, and Claude Connectors (May 2026)

Abe Wheeler
Regression testing MCP Apps across ChatGPT and Claude hosts.

TL;DR: MCP Apps break in five ways: tool output shapes change, component rendering shifts, cross-host behavior drifts, display mode layouts break, and app state contracts drift out of sync. A regression suite should cover all five. Use structural assertions on structuredContent for tool handler regressions, snapshot tests for rendering regressions, visual baselines for CSS regressions, and cross-host matrix runs for host-specific breakage. Run the full suite in CI with pnpm test && pnpm test:e2e.


You push a change to your MCP App. The unit tests pass. The e2e tests pass. You deploy. Then a user on Claude reports that the product card is missing the price, and a user on ChatGPT says the fullscreen layout overflows. Both worked yesterday.

This is a regression: something that used to work stopped working because of a change somewhere else. In MCP Apps, regressions are harder to spot than in standard web apps because your code runs in multiple host environments (ChatGPT, Claude, VS Code, Goose), across multiple display modes (inline, PiP, fullscreen), and with data flowing through multiple protocol boundaries (tool handler to host to resource iframe).

Regression testing is a strategy for catching these breaks before they reach production. It builds on the test types you may already have (unit, e2e, snapshot, visual) and organizes them into a safety net that covers the specific ways MCP Apps break.

Where MCP Apps Regress

Regular web apps regress when you change HTML, CSS, or JavaScript. MCP Apps have additional regression surfaces because of the protocol layer between your code and the host.

Here are the five categories of regressions that show up in production MCP Apps:

Tool output shape changes. Your tool handler returns structuredContent that your resource component depends on. Rename a field from imageUrl to thumbnailUrl and the resource renders a broken image. Unit tests don’t catch this if they test the handler and component in isolation with separate mocks.

Component rendering regressions. You refactor a component, change a CSS class, or update a dependency. The HTML output changes in a way that breaks the layout or drops content. Without a baseline to compare against, you won’t know until someone sees it.

Cross-host regressions. ChatGPT and Claude have different CSS variable sets, different iframe sandboxing behavior, and different protocol implementations. A change that works in ChatGPT might break in Claude because of a CSS variable name difference or a timing difference in protocol notifications.

Display mode regressions. Your resource looks right in inline mode but overflows in fullscreen because a container doesn’t have max-width. Or PiP mode breaks because a fixed-position element overlaps the pip controls. Each display mode is a separate layout context that can regress independently.

App state contract regressions. If your app uses useAppState and multiple tools read or write the same state, a change to the state shape in one tool breaks the others. This is a variant of the multi-tool contract problem, but it’s specifically about state rather than tool output.

Tool Output Regression Tests

The most common MCP App regression is a tool handler returning a different structuredContent shape than the resource component expects. This happens when someone modifies a handler without checking what the resource reads from it.

Structural assertions are better than snapshots for this job. Snapshots break on any change, including harmless additions like a new field. Structural assertions break only on changes that matter: missing fields, wrong types, renamed keys.

import { test, expect } from 'sunpeak/test';

test('get-weather output has the fields WeatherCard needs', async ({ mcp }) => {
  const result = await mcp.callTool('get-weather', { city: 'Portland' });

  expect(result.structuredContent).toEqual(
    expect.objectContaining({
      city: expect.any(String),
      temperature: expect.any(Number),
      unit: expect.stringMatching(/^(celsius|fahrenheit)$/),
      conditions: expect.any(String),
      forecast: expect.arrayContaining([
        expect.objectContaining({
          day: expect.any(String),
          high: expect.any(Number),
          low: expect.any(Number),
        }),
      ]),
    })
  );
});

This test passes when you add a new humidity field to the output because expect.objectContaining allows extra properties. It fails when you rename temperature to temp or change forecast from an array to an object, which are the changes that would break the WeatherCard resource component.
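To see why a structural check tolerates additions but not renames, here is a minimal, dependency-free sketch of the same idea. The missingOrMistyped helper and the sample payloads are hypothetical, written only for illustration; they are not part of sunpeak or any test runner.

```typescript
// Hypothetical helper: returns the required keys that are missing or have the wrong typeof.
type FieldType = 'string' | 'number' | 'boolean' | 'object';

function missingOrMistyped(
  value: Record<string, unknown>,
  required: Record<string, FieldType>,
): string[] {
  return Object.entries(required)
    .filter(([key, type]) => typeof value[key] !== type)
    .map(([key]) => key);
}

const required: Record<string, FieldType> = {
  city: 'string',
  temperature: 'number',
  conditions: 'string',
};

// Additive change: the extra humidity field is never inspected, so nothing is reported.
const withHumidity = { city: 'Portland', temperature: 72, conditions: 'Cloudy', humidity: 40 };

// Breaking change: temperature was renamed to temp, so the required key is gone.
const renamed = { city: 'Portland', temp: 72, conditions: 'Cloudy' };

const additiveProblems = missingOrMistyped(withHumidity, required);
const renameProblems = missingOrMistyped(renamed, required);
```

The check only ever looks at the keys it was told about, which is the property that makes it stable under additive changes.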

Backing Assertions with Shared Types

The assertion above catches regressions at test time. You can catch them even earlier at build time by defining shared TypeScript interfaces that both the tool handler and the resource component import:

// src/types/weather.ts
export interface WeatherOutput {
  city: string;
  temperature: number;
  unit: 'celsius' | 'fahrenheit';
  conditions: string;
  forecast: ForecastDay[];
}

export interface ForecastDay {
  day: string;
  high: number;
  low: number;
}

The handler uses this type as its return shape, and the resource component uses it as the generic argument to useToolData<unknown, WeatherOutput>. TypeScript catches field renames at compile time. The regression test catches runtime issues where the type is correct but the data is wrong (empty arrays, null values, missing optional fields).
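The runtime half of that split can be sketched as a type guard that sits next to the shared interface. This is an illustrative sketch, not sunpeak API: isWeatherOutput and isForecastDay are hypothetical helpers, and treating an empty forecast as invalid is an assumption about what WeatherCard needs.

```typescript
interface ForecastDay {
  day: string;
  high: number;
  low: number;
}

interface WeatherOutput {
  city: string;
  temperature: number;
  unit: 'celsius' | 'fahrenheit';
  conditions: string;
  forecast: ForecastDay[];
}

// Hypothetical guard: checks one forecast entry at runtime.
function isForecastDay(value: unknown): value is ForecastDay {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.day === 'string' && typeof v.high === 'number' && typeof v.low === 'number';
}

// Hypothetical guard: checks the full payload, including cases the compiler cannot see
// (null values from an upstream API, an empty forecast array).
function isWeatherOutput(value: unknown): value is WeatherOutput {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.city === 'string' &&
    typeof v.temperature === 'number' &&
    (v.unit === 'celsius' || v.unit === 'fahrenheit') &&
    typeof v.conditions === 'string' &&
    Array.isArray(v.forecast) &&
    v.forecast.length > 0 && // assumption: the card needs at least one day
    v.forecast.every(isForecastDay)
  );
}

const valid = {
  city: 'Portland',
  temperature: 72,
  unit: 'fahrenheit',
  conditions: 'Partly cloudy',
  forecast: [{ day: 'Mon', high: 75, low: 58 }],
};
const nullTemperature = { ...valid, temperature: null };
const emptyForecast = { ...valid, forecast: [] };
```

A regression test could call a guard like this on result.structuredContent and assert that it returns true.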

Testing Additive Changes

Not every output change is a regression. Adding a new field is usually safe because resource components ignore fields they don’t use. But removing, renaming, or changing the type of a field is almost always a regression.

Structure your regression tests to match this reality:

test('get-weather output is backwards compatible', async ({ mcp }) => {
  const result = await mcp.callTool('get-weather', { city: 'Portland' });
  const content = result.structuredContent;

  // These fields must exist (removing them is a regression)
  expect(content).toHaveProperty('city');
  expect(content).toHaveProperty('temperature');
  expect(content).toHaveProperty('forecast');

  // These fields must have the right type (changing them is a regression)
  expect(typeof content.temperature).toBe('number');
  expect(Array.isArray(content.forecast)).toBe(true);

  // Don't assert exact equality on the full object, new fields are fine
});
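The additive-versus-breaking distinction can also be made explicit in code. The sketch below is a hypothetical helper (diffShape, kindOf, and isBreaking are names invented here) that compares only top-level keys of two outputs; nested objects would need a recursive version.

```typescript
interface ShapeDiff {
  added: string[];
  removed: string[];
  retyped: string[];
}

// Arrays report as 'array' so that an array-to-object change counts as a type change.
function kindOf(value: unknown): string {
  if (Array.isArray(value)) return 'array';
  if (value === null) return 'null';
  return typeof value;
}

// Compares top-level keys only.
function diffShape(
  before: Record<string, unknown>,
  after: Record<string, unknown>,
): ShapeDiff {
  const added = Object.keys(after).filter((k) => !(k in before));
  const removed = Object.keys(before).filter((k) => !(k in after));
  const retyped = Object.keys(before).filter(
    (k) => k in after && kindOf(before[k]) !== kindOf(after[k]),
  );
  return { added, removed, retyped };
}

// Added fields are fine; removed or retyped fields are regressions.
function isBreaking(diff: ShapeDiff): boolean {
  return diff.removed.length > 0 || diff.retyped.length > 0;
}

const baseline = { city: 'Portland', temperature: 72, forecast: [{ day: 'Mon' }] };
const additive = diffShape(baseline, { ...baseline, humidity: 40 });
const breaking = diffShape(baseline, { city: 'Portland', temp: 72, forecast: {} });
```

Here additive reports one added key and nothing else, while breaking reports temperature as removed and forecast as retyped. A rename shows up as one removal plus one addition.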

Component Rendering Regressions

Tool output regression tests cover the data layer. Rendering regression tests cover what the user sees. For this, snapshot tests and visual regression tests work together.

Snapshot tests catch structural changes (missing elements, changed attributes, reordered nodes). They’re fast and don’t need a browser:

import { render } from '@testing-library/react';
import { it, expect, vi } from 'vitest';
import { WeatherCard } from './weather-card';

const mockWeatherData = {
  city: 'Portland',
  temperature: 72,
  unit: 'fahrenheit' as const,
  conditions: 'Partly cloudy',
  forecast: [
    { day: 'Mon', high: 75, low: 58 },
    { day: 'Tue', high: 68, low: 55 },
  ],
};

vi.mock('sunpeak', () => ({
  useToolData: () => ({
    output: mockWeatherData,
    isLoading: false,
    isError: false,
    isCancelled: false,
    input: null,
    inputPartial: null,
    cancelReason: null,
  }),
  useAppState: (initial: unknown) => [initial, vi.fn()],
  useDisplayMode: () => 'inline',
  useHostContext: () => ({ theme: 'light', host: 'chatgpt' }),
  SafeArea: ({ children }: { children: React.ReactNode }) => children,
}));

it('WeatherCard renders the baseline layout', () => {
  const { container } = render(<WeatherCard />);
  expect(container).toMatchSnapshot();
});

When someone changes the component, the snapshot diff shows exactly what changed. This catches regressions like a missing <span> for the temperature unit, a reordered forecast list, or a removed CSS class.

Visual regression tests catch CSS and layout changes that snapshots miss. You need these because the same HTML structure produces an identical snapshot even when a stylesheet change makes it look completely different on screen:

import { test, expect } from 'sunpeak/test';

test('WeatherCard visual baseline', async ({ inspector }) => {
  const result = await inspector.renderTool('get-weather', { city: 'Portland' });
  await expect(result.screenshot()).toMatchSnapshot();
});

Run pnpm test:visual to compare against baselines. When a CSS change moves the forecast section 20 pixels down, the visual test catches it even though the HTML snapshot is identical.
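Conceptually, a baseline comparison counts how many pixels changed and fails once that count passes a threshold. The sketch below illustrates the idea on tiny grayscale arrays. It is not how sunpeak or Playwright implement screenshot comparison, and diffRatio is a hypothetical helper.

```typescript
// Fraction of pixels whose grayscale value differs from the baseline by more than `tolerance`.
function diffRatio(baseline: number[], current: number[], tolerance = 0): number {
  if (baseline.length !== current.length) return 1; // a size change counts as a full mismatch
  if (baseline.length === 0) return 0;
  let changed = 0;
  for (let i = 0; i < baseline.length; i++) {
    if (Math.abs(baseline[i] - current[i]) > tolerance) changed++;
  }
  return changed / baseline.length;
}

// A 2x2 "screenshot": top row dark, bottom row light.
const baselineImage = [0, 0, 255, 255];

// Same markup, but a CSS change moved content: two of the four pixels swapped.
const shiftedImage = [0, 255, 0, 255];

// Tiny anti-aliasing noise that a tolerance should absorb.
const noisyImage = [5, 0, 250, 255];

const shiftedRatio = diffRatio(baselineImage, shiftedImage);
const noisyRatio = diffRatio(baselineImage, noisyImage, 10);
```

The tolerance is the design choice to watch: too low and font rendering noise fails the build, too high and a small layout shift slips through.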

Cross-Host Regressions

A change that works in ChatGPT can break in Claude because the two hosts provide different CSS variables, different iframe attributes, and slightly different protocol timing. Testing on one host and assuming the other works is one of the most common sources of MCP App regressions.

sunpeak runs e2e tests across both host environments automatically. Each call to inspector.renderTool() executes in both the ChatGPT and Claude runtimes, so your tests cover both hosts without any extra configuration:

import { test, expect } from 'sunpeak/test';

test('WeatherCard renders correctly on both hosts', async ({ inspector }) => {
  const result = await inspector.renderTool('get-weather', { city: 'Portland' });
  const app = result.app();

  await expect(app.locator('[data-testid="city"]')).toHaveText('Portland');
  await expect(app.locator('[data-testid="temperature"]')).toHaveText('72');
  await expect(app.locator('[data-testid="forecast"]')).toBeVisible();
});

This test runs once against the ChatGPT runtime and once against the Claude runtime. If your component depends on a CSS variable that exists in ChatGPT but not in Claude, the Claude run is where that regression surfaces, even if the ChatGPT run passes.
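A cheap complement to the cross-host run is a static check for CSS variables that one host does not define. The sketch below is illustrative only: the variable names and the per-host lists are hypothetical and do not reflect the real variable sets of ChatGPT or Claude.

```typescript
// Collects custom properties referenced as var(--name) with no fallback value.
function varsWithoutFallback(css: string): string[] {
  const names = new Set<string>();
  const pattern = /var\(\s*(--[\w-]+)\s*\)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(css)) !== null) {
    names.add(match[1]);
  }
  return Array.from(names);
}

// Returns the referenced variables that a given host does not provide.
function missingOnHost(css: string, hostVars: string[]): string[] {
  const provided = new Set(hostVars);
  return varsWithoutFallback(css).filter((name) => !provided.has(name));
}

// Hypothetical stylesheet and variable names, for illustration only.
const css = `
  .card { color: var(--host-text-primary); background: var(--host-surface, #fff); }
  .price { border-color: var(--host-border-subtle); }
`;

const hostA = ['--host-text-primary', '--host-border-subtle'];
const hostB = ['--host-text-primary'];

const missingOnA = missingOnHost(css, hostA);
const missingOnB = missingOnHost(css, hostB);
```

References with a fallback, such as var(--host-surface, #fff), are skipped on purpose because they degrade gracefully when the host does not define the variable.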

Targeting Host-Specific Regressions

Sometimes you need to test host-specific behavior that doesn’t apply to both hosts. Use the host property from the test context:

test('fullscreen close button renders on ChatGPT only', async ({ inspector, host }) => {
  const result = await inspector.renderTool('get-weather', {
    city: 'Portland',
    displayMode: 'fullscreen',
  });
  const app = result.app();
  const closeBtn = app.locator('[data-testid="close-button"]');

  if (host === 'chatgpt') {
    await expect(closeBtn).toBeVisible();
  } else {
    await expect(closeBtn).not.toBeVisible();
  }
});

Display Mode Regressions

Each display mode is a separate layout context. Inline mode has a narrow, fixed-width container. Fullscreen takes over the conversation area. PiP is a floating window. Regressions often hide in the modes you test less frequently.

Test each mode by passing displayMode to inspector.renderTool():

import { test, expect } from 'sunpeak/test';

const displayModes = ['inline', 'pip', 'fullscreen'] as const;

for (const mode of displayModes) {
  test(`WeatherCard layout in ${mode} mode`, async ({ inspector }) => {
    const result = await inspector.renderTool('get-weather', {
      city: 'Portland',
      displayMode: mode,
    });
    const app = result.app();

    // Forecast should always be visible
    await expect(app.locator('[data-testid="forecast"]')).toBeVisible();

    // Detailed view only appears in fullscreen
    if (mode === 'fullscreen') {
      await expect(app.locator('[data-testid="hourly-forecast"]')).toBeVisible();
    }
  });
}

Looping over display modes takes three lines of code and catches an entire category of regressions that manual testing misses. The most common display mode regression is a container without responsive constraints that renders fine in inline mode but overflows in fullscreen or gets clipped in PiP.
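Once themes enter the picture, it helps to enumerate the full set of combinations so you can see which ones have a test or a visual baseline. The sketch below is a plain TypeScript illustration; buildMatrix is a hypothetical helper, and the 'dark' theme value is an assumption.

```typescript
const hosts = ['chatgpt', 'claude'] as const;
const displayModes = ['inline', 'pip', 'fullscreen'] as const;
const themes = ['light', 'dark'] as const; // 'dark' is an assumed theme name

interface MatrixCase {
  host: (typeof hosts)[number];
  displayMode: (typeof displayModes)[number];
  theme: (typeof themes)[number];
  name: string;
}

// Builds every host x display mode x theme combination with a stable, readable name.
function buildMatrix(): MatrixCase[] {
  const cases: MatrixCase[] = [];
  for (const host of hosts) {
    for (const displayMode of displayModes) {
      for (const theme of themes) {
        cases.push({ host, displayMode, theme, name: `${host}/${displayMode}/${theme}` });
      }
    }
  }
  return cases;
}

const matrix = buildMatrix();
```

Two hosts, three modes, and two themes give twelve cases. The names can double as baseline file names, so a missing combination shows up as a missing file.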

App State Regressions

If your app uses useAppState to share state between tools or between tool calls, changes to the state shape are regressions. This is similar to the multi-tool contract testing pattern, but focused specifically on app state.

import { test, expect } from 'sunpeak/test';

test('configure-filters sets state shape show-results expects', async ({ mcp }) => {
  const configResult = await mcp.callTool('configure-filters', {
    category: 'electronics',
    priceRange: { min: 0, max: 500 },
  });

  const defaultState = configResult.structuredContent.defaultState;

  // show-results reads these fields from useAppState
  expect(defaultState).toHaveProperty('category');
  expect(defaultState).toHaveProperty('priceRange');
  expect(defaultState.priceRange).toHaveProperty('min');
  expect(defaultState.priceRange).toHaveProperty('max');
  expect(typeof defaultState.priceRange.min).toBe('number');
});

test('show-results handles missing filter state gracefully', async ({ mcp }) => {
  // If the user triggers show-results before configure-filters, state is undefined
  const result = await mcp.callTool('show-results', {});
  expect(result.isError).toBeFalsy();

  // Should fall back to default values
  const content = result.structuredContent;
  expect(content.appliedFilters.category).toBe('all');
});

The first test is a contract: it verifies that configure-filters produces state in the shape show-results expects. The second test verifies graceful degradation when the state doesn’t exist yet. Both are regression tests because they protect an existing contract from unintentional changes.
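The graceful-degradation half of that contract usually comes down to one function that merges whatever state exists with defaults. Here is a minimal sketch, assuming a filter state shaped like the one in the tests above. resolveFilters and the default price range are hypothetical; only the 'all' category default comes from the test.

```typescript
interface FilterState {
  category: string;
  priceRange: { min: number; max: number };
}

// State as it may actually arrive: absent, or partially filled in.
interface PartialFilterState {
  category?: string;
  priceRange?: { min?: number; max?: number };
}

// Hypothetical defaults; only category 'all' is taken from the test above.
const DEFAULT_FILTERS: FilterState = {
  category: 'all',
  priceRange: { min: 0, max: 1000 },
};

// Fills every missing field from the defaults so readers never see undefined.
function resolveFilters(state: PartialFilterState | undefined): FilterState {
  return {
    category: state?.category ?? DEFAULT_FILTERS.category,
    priceRange: {
      min: state?.priceRange?.min ?? DEFAULT_FILTERS.priceRange.min,
      max: state?.priceRange?.max ?? DEFAULT_FILTERS.priceRange.max,
    },
  };
}

const beforeConfigure = resolveFilters(undefined);
const partiallySet = resolveFilters({ priceRange: { min: 10 } });
```

Keeping the fallback logic in one place means the state contract has a single reader-side definition to test, instead of one per tool.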

Organizing Regression Tests

Regression tests aren’t a separate test type. They’re a strategy applied across your existing test layers. Here’s how they fit into a typical MCP App test directory:

tests/
  unit/
    weather-card.test.tsx          # snapshot tests for rendering regressions
    get-weather.test.ts            # handler logic + output shape assertions
  e2e/
    weather.spec.ts                # cross-host + display mode regression matrix
    state-contracts.spec.ts        # app state regression tests
    visual-baselines.spec.ts       # screenshot comparisons
  simulations/
    get-weather.json               # deterministic tool data for reproducible tests

The unit tests catch output shape and rendering regressions without a browser. The e2e tests catch cross-host, display mode, and visual regressions with a real browser. Together they form a regression safety net that runs in seconds for the unit layer and minutes for the e2e layer.

Running Regression Tests in CI

Regression tests only work if they run on every push. A regression test suite that developers have to remember to run locally will miss regressions.

Add them to your GitHub Actions workflow:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: 'pnpm'
      - run: pnpm install
      - run: pnpm exec playwright install --with-deps chromium

      # Unit tests: output shape + snapshot regressions
      - run: pnpm test

      # E2E tests: cross-host + display mode + visual regressions
      - run: pnpm test:e2e

      - uses: actions/upload-artifact@v4
        if: failure()
        with:
          name: regression-artifacts
          path: |
            tests/e2e/__screenshots__/
            tests/e2e/__diffs__/

The upload-artifact step saves screenshot diffs when visual regression tests fail. You can download them from the Actions run to see exactly what changed.

When Baselines Need Updating

Not every regression test failure is a bug. Sometimes you made an intentional change and the baselines need updating.

For snapshot tests, run pnpm test:unit -- -u to update .snap files.

For visual regression tests, run pnpm test:visual --update to regenerate baseline screenshots.

The important step is reviewing what changed before committing. Snapshot diffs are text diffs in your normal git diff output. Visual regression diffs are image files you can inspect in a Git GUI or image viewer. If a diff shows changes you didn’t intend, that’s a regression you just caught.

Commit updated baselines alongside the code change that caused them, in the same PR. This keeps the baseline changes in context for reviewers, who can verify that the visual or structural changes match the intent of the code change.

A Regression Testing Checklist

Before shipping an MCP App update, verify:

  • Tool handler structuredContent shapes are covered by structural assertions
  • Shared TypeScript types exist between tool handlers and resource components
  • Resource components have snapshot tests covering the default state
  • Visual baselines exist for each display mode and theme
  • E2E tests run across both ChatGPT and Claude host modes
  • Display mode tests cover inline, PiP, and fullscreen
  • useAppState contracts are tested for both the writing and reading tool
  • State fallback tests verify behavior when expected state is missing
  • All regression tests run in CI on every push
  • Snapshot and visual baselines are committed alongside code changes

Get Started

npx sunpeak new

Frequently Asked Questions

What is regression testing for MCP Apps?

Regression testing for MCP Apps verifies that changes to your code do not break existing behavior. MCP Apps have five regression surfaces: tool handler output shapes, resource component rendering, cross-host compatibility (ChatGPT vs Claude), display mode layouts, and app state contracts. A regression test suite combines structural assertions, snapshot comparisons, visual baselines, and cross-host matrix runs to catch breakage before it reaches production.

How is regression testing different from unit testing for MCP Apps?

Unit tests verify that individual functions produce correct output for given inputs. Regression tests verify that previously working behavior still works after a change. A unit test for a tool handler checks that it returns the right data. A regression test checks that the data shape has not changed from what the resource component expects. Unit tests prove correctness. Regression tests prove stability.

What are the most common regressions in MCP Apps?

The most common regressions are structuredContent shape changes (renamed or removed fields that break resource components), CSS regressions from host variable changes or display mode edge cases, cross-host behavior drift where an update works on ChatGPT but breaks on Claude, tool schema changes that cause LLMs to pass different arguments, and useAppState contract breaks where one tool changes the state shape another tool depends on.

How do I set up a regression test suite for an MCP App?

Start with three layers. First, add structuredContent shape assertions to your integration tests using the mcp fixture. These catch tool output regressions. Second, add snapshot tests for your resource components to catch rendering regressions. Third, run your e2e tests across both ChatGPT and Claude host modes to catch cross-host regressions. Run all three in CI with pnpm test and pnpm test:e2e.

How do I test for schema regressions in MCP App tool handlers?

Use the mcp fixture to call your tool handler and assert the output shape with structural matchers like expect.objectContaining and toHaveProperty. Define the expected shape as a shared TypeScript interface as well, so the compiler catches renames before the tests run. When someone changes the handler output in a breaking way, the test fails and names the missing or mistyped field. This is more stable than snapshot testing for data shapes because additive changes do not trigger failures.

How do I run regression tests across ChatGPT and Claude hosts?

sunpeak runs e2e tests across host modes automatically. Each test renders your resource component in both the ChatGPT and Claude runtime environments. Use pnpm test:e2e to run the cross-host matrix. If a CSS variable difference or protocol behavior causes a regression on one host, the test fails for that host while passing for the other, so you can see exactly where the regression occurred.

How do I prevent display mode regressions in MCP Apps?

Write e2e tests that render your resource in each display mode (inline, PiP, fullscreen) and assert layout expectations. Use inspector.renderTool with the displayMode option to switch modes. Check that elements resize correctly, scroll behavior works, and conditional UI (like expand buttons) appears or hides based on the mode. Run these tests in CI alongside your other e2e tests.

Should I use snapshot testing or structural assertions for MCP App regression tests?

Use both. Snapshot tests (toMatchSnapshot) are good for catching unexpected changes in rendered HTML because they cover everything without explicit assertions. Structural assertions (toHaveProperty, expect.objectContaining) are better for tool handler output because they tolerate additive changes like new fields while failing on breaking changes like removed or renamed fields. Snapshots are brittle to harmless changes. Structural assertions are precise about what matters.