Visual Regression Testing for MCP Apps, ChatGPT Apps, and Claude Connectors (April 2026)
Visual regression testing catches pixel-level UI regressions across ChatGPT and Claude hosts.
Your MCP App passes all its unit tests. The snapshot tests are green. You deploy, open ChatGPT, and the layout is broken. A CSS change shifted a margin. A host theme update changed the background color your component was relying on. The HTML structure is identical, so text-based tests didn’t notice.
Visual regression testing catches this. It compares actual screenshots of your rendered app against saved baselines and fails when anything visible changes. One pixel off? You’ll know.
TL;DR: Add result.screenshot() calls to your e2e tests and run pnpm test:visual. Screenshots are captured in both ChatGPT and Claude hosts, compared pixel-by-pixel against baselines, and reported as diffs when something changes. Use pnpm test:visual --update to regenerate baselines after intentional changes. Works locally and in CI/CD with no paid host accounts.
Why Visual Regression Testing Matters for MCP Apps
MCP Apps render inside host iframes. Your component’s appearance depends on your own CSS, the host’s theme variables, the display mode (inline, fullscreen, picture-in-picture), and the host itself (ChatGPT and Claude have different color palettes, typography, and spacing). That’s a lot of visual surface area.
Snapshot tests catch structural changes (missing elements, changed attributes), but they serialize HTML as text. They can’t tell you that a flex-direction change rotated your layout 90 degrees or that a color variable stopped resolving. Unit tests verify logic. E2e tests verify behavior. Visual regression tests verify what the user actually sees.
Here’s what visual regression tests catch that other tests miss:
- CSS regressions: a gap change, a missing border-radius, a broken media query
- Host theme changes: ChatGPT updates its dark mode palette and your hardcoded color no longer matches
- Display mode layout bugs: your component looks fine inline but overflows in PiP
- Cross-host differences: padding that works in ChatGPT clips text in Claude
- Font rendering changes: a fallback font loads instead of the host’s primary typeface
How Visual Regression Works
The process is simple:
- Your e2e test calls a tool and renders a resource component inside the sunpeak inspector
- You call result.screenshot('name') to capture the rendered output
- On first run (or with --update), the screenshot is saved as a baseline image
- On subsequent runs, the current screenshot is compared pixel-by-pixel against the baseline
- If the images differ beyond the configured threshold, the test fails with a diff image
The diff image highlights changed pixels in red, so you can see exactly what shifted.
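The comparison step above can be sketched in a few lines. This is illustrative only (sunpeak delegates the real comparison to Playwright's screenshot assertions, which use perceptual color distance); images here are modeled as flat RGBA byte arrays of equal dimensions:

```typescript
// Illustrative sketch of pixel-by-pixel comparison, not sunpeak's actual code.
interface DiffResult {
  diffPixels: number;  // pixels whose color difference exceeds the threshold
  diffRatio: number;   // diffPixels / total pixels
  diffMask: boolean[]; // true where a pixel changed (used to paint the red diff image)
}

function comparePixels(
  baseline: Uint8Array,
  actual: Uint8Array,
  threshold: number, // per-pixel color tolerance, 0..1
): DiffResult {
  const totalPixels = baseline.length / 4; // RGBA channels per pixel
  const diffMask: boolean[] = new Array(totalPixels).fill(false);
  let diffPixels = 0;
  for (let p = 0; p < totalPixels; p++) {
    const i = p * 4;
    // Max per-channel difference, normalized to 0..1 (real tools use a
    // perceptual color metric; this keeps the sketch simple).
    let maxDelta = 0;
    for (let c = 0; c < 4; c++) {
      maxDelta = Math.max(maxDelta, Math.abs(baseline[i + c] - actual[i + c]) / 255);
    }
    if (maxDelta > threshold) {
      diffMask[p] = true;
      diffPixels++;
    }
  }
  return { diffPixels, diffRatio: diffPixels / totalPixels, diffMask };
}
```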
Writing Your First Visual Regression Test
If you already have e2e tests using the inspector fixture from sunpeak/test, adding visual regression is one line per state you want to capture.
// tests/e2e/albums.spec.ts
import { test, expect } from 'sunpeak/test';
test('albums resource renders correctly', async ({ inspector }) => {
const result = await inspector.renderTool('get-albums', {
artist: 'Radiohead',
});
// Verify content is there
await expect(result.app().getByText('OK Computer')).toBeVisible();
// Capture a visual baseline
await result.screenshot('albums-default');
});
The result.screenshot() call is a no-op unless you run with --visual. Your regular e2e tests aren’t slowed down by screenshot logic.
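That gating can be sketched as a simple flag check. The flag name and helper below are illustrative, not sunpeak's actual internals:

```typescript
// Sketch: screenshot capture only runs when visual mode is on.
// isVisualRun and VISUAL are hypothetical names for illustration.
function isVisualRun(argv: string[] = process.argv, env = process.env): boolean {
  return argv.includes('--visual') || env.VISUAL === '1';
}

async function maybeScreenshot(
  name: string,
  capture: () => Promise<Uint8Array>,
): Promise<Uint8Array | null> {
  if (!isVisualRun()) return null; // no-op in regular e2e runs
  return capture();
}
```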
Generate baselines the first time:
pnpm test:visual --update
Then on every subsequent run:
pnpm test:visual
If the rendering changed, the test fails and you get three images: the baseline, the actual screenshot, and a diff.
Screenshot Targets
By default, result.screenshot() captures the content inside the app iframe, which is usually what you want. But you can capture more:
// Just the app content (default)
await result.screenshot('app-content');
// The full inspector page, including host chrome
await result.screenshot('full-page', { target: 'page' });
// A specific element
await result.screenshot('submit-button', {
element: result.app().locator('button[type="submit"]'),
});
Use target: 'page' when you need to verify how your app looks inside the host’s conversation UI. Use the element option for focused component-level regression tests where you don’t want unrelated areas to cause failures.
Testing Across Hosts, Themes, and Display Modes
MCP Apps need to look right in multiple environments. Visual regression tests should cover the combinations that matter.
Cross-Host Testing
sunpeak runs e2e tests against both ChatGPT and Claude hosts automatically (via Playwright projects configured in defineConfig()). Each host gets its own baseline directory:
tests/__screenshots__/
chatgpt/
albums.spec.ts/
albums-default.png
albums-dark.png
claude/
albums.spec.ts/
albums-default.png
albums-dark.png
If your app renders correctly in ChatGPT but has a layout bug in Claude, only the Claude baseline fails. Separate baselines per host mean you catch host-specific regressions without false positives from legitimate cross-host differences.
Theme Testing
Test both light and dark mode for each host:
test('albums renders in dark mode', async ({ inspector }) => {
const result = await inspector.renderTool('get-albums', {
artist: 'Radiohead',
}, { theme: 'dark' });
await expect(result.app().getByText('OK Computer')).toBeVisible();
await result.screenshot('albums-dark');
});
Dark mode regressions are common because developers tend to build in light mode and forget that host CSS variables resolve to different values in dark mode. A visual test for both themes catches color contrast issues, invisible text on dark backgrounds, and borders that disappear.
Display Mode Testing
Display modes change your component’s available viewport. An app that fits perfectly inline might overflow in PiP. Test the modes your app supports:
test('albums in fullscreen', async ({ inspector }) => {
const result = await inspector.renderTool('get-albums', {
artist: 'Radiohead',
}, { displayMode: 'fullscreen' });
await result.screenshot('albums-fullscreen');
});
test('albums in pip', async ({ inspector }) => {
const result = await inspector.renderTool('get-albums', {
artist: 'Radiohead',
}, { displayMode: 'pip' });
await result.screenshot('albums-pip');
});
Configuring Comparison Thresholds
Pixel-perfect comparison is too strict for most MCP Apps. Anti-aliasing, sub-pixel rendering, and font hinting produce tiny differences across platforms and CI runners. Configure thresholds to allow acceptable variation.
Per-Screenshot Thresholds
await result.screenshot('albums-default', {
threshold: 0.2, // Per-pixel color tolerance (0-1)
maxDiffPixelRatio: 0.01, // Max 1% of pixels can differ
});
Project-Wide Defaults
Set defaults in your Playwright config so you don’t repeat options on every call:
// playwright.config.ts
import { defineConfig } from 'sunpeak/test/config';
export default defineConfig({
visual: {
threshold: 0.2,
maxDiffPixelRatio: 0.05,
animations: 'disabled',
snapshotPathTemplate:
'{testDir}/__screenshots__/{projectName}/{testFilePath}/{arg}{ext}',
},
});
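To see how the snapshotPathTemplate option maps to the baseline directory layout shown earlier, here is a small illustrative resolver (resolveSnapshotPath is a hypothetical helper, not part of the sunpeak or Playwright API):

```typescript
// Sketch: fill {placeholder} slots in a snapshot path template.
function resolveSnapshotPath(
  template: string,
  values: Record<string, string>,
): string {
  // Unknown placeholders are left untouched.
  return template.replace(/\{(\w+)\}/g, (_, key) => values[key] ?? `{${key}}`);
}

const path = resolveSnapshotPath(
  '{testDir}/__screenshots__/{projectName}/{testFilePath}/{arg}{ext}',
  {
    testDir: 'tests',
    projectName: 'chatgpt',
    testFilePath: 'albums.spec.ts',
    arg: 'albums-default',
    ext: '.png',
  },
);
// → 'tests/__screenshots__/chatgpt/albums.spec.ts/albums-default.png'
```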
A few rules of thumb for thresholds:
- threshold: 0.2 with maxDiffPixelRatio: 0.01 works for most apps
- Increase maxDiffPixelRatio if your app has gradients or shadows that render slightly differently across environments
- Set animations: 'disabled' to avoid flaky tests caused by capturing mid-animation frames
- If a specific screenshot is flaky, increase its threshold individually rather than loosening the project-wide setting
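The way the two knobs interact can be sketched as follows: threshold filters individual pixels, then maxDiffPixelRatio decides whether the surviving differences fail the test. The names mirror the options above, but the logic is illustrative, not Playwright's source:

```typescript
// Sketch of threshold + maxDiffPixelRatio interaction (illustrative).
function passesComparison(
  perPixelDeltas: number[],  // normalized 0..1 color difference per pixel
  threshold: number,         // per-pixel tolerance: deltas at or below pass
  maxDiffPixelRatio: number, // allowed fraction of differing pixels
): boolean {
  const diffPixels = perPixelDeltas.filter((d) => d > threshold).length;
  return diffPixels / perPixelDeltas.length <= maxDiffPixelRatio;
}
```

For example, with threshold 0.2 and maxDiffPixelRatio 0.25, one out of four pixels may exceed the per-pixel tolerance before the comparison fails.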
Handling Flaky Visual Tests
Visual regression tests can be flaky if you’re not careful. Here’s how to avoid the common causes.
Dynamic content. If your component shows timestamps, random IDs, or live data, those will change on every run. Use simulation files with fixed data so your screenshots are deterministic.
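If you cannot put fixed data in a simulation file, another option is to normalize dynamic fields before rendering. A minimal sketch, where the Album shape and field names are illustrative rather than part of any sunpeak API:

```typescript
// Sketch: replace run-varying fields with stable values so screenshots
// are deterministic. Album and its fields are hypothetical.
interface Album {
  id: string;        // may be a random ID from the backend
  title: string;
  fetchedAt: string; // ISO timestamp that varies per run
}

function normalizeForScreenshot(albums: Album[]): Album[] {
  return albums.map((album, index) => ({
    ...album,
    id: `album-${index}`,              // stable ID instead of a random one
    fetchedAt: '2026-01-01T00:00:00Z', // fixed timestamp
  }));
}
```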
Animations. CSS transitions and animations produce different frames depending on timing. Set animations: 'disabled' in your visual config, which tells Playwright to finish all CSS animations before capturing.
Font loading. If a custom font hasn’t loaded when the screenshot is captured, you’ll get fallback font baselines that break once the font does load. The sunpeak inspector waits for fonts by default, but if you’re loading fonts from a custom domain, make sure it’s in your connectDomains config.
Viewport size. Screenshots depend on the viewport dimensions. The sunpeak test config sets consistent viewport sizes per display mode, so this shouldn’t be an issue unless you override them. If you do, make sure your overrides match between local and CI.
Visual Regression in CI/CD
Visual regression tests belong in your CI/CD pipeline. The baselines live in your repo, so every pull request compares against the committed baselines.
Add --visual to your existing test workflow:
# .github/workflows/test.yml
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: pnpm/action-setup@v4
- uses: actions/setup-node@v4
with:
node-version: 20
cache: 'pnpm'
- run: pnpm install
- run: pnpm exec playwright install --with-deps chromium
- run: pnpm test:visual
- uses: actions/upload-artifact@v4
if: failure()
with:
name: visual-regression-diffs
path: test-results/
The upload-artifact step is important. When a visual test fails in CI, you need to see the actual screenshot, the baseline, and the diff. Uploading test-results/ makes them available as downloadable artifacts on the failed workflow run.
Cross-Platform Baseline Considerations
Font rendering differs between macOS and Linux. If you develop on macOS and your CI runs on Ubuntu, your baselines won’t match because of sub-pixel rendering differences. You have two options:
- Generate baselines in CI. Run pnpm test:visual --update in your CI environment, download the artifacts, and commit those baselines. This means your baselines always match the CI runner.
- Use Docker locally. Run your visual tests in the same Linux container your CI uses. This makes local and CI baselines identical.
Option 1 is simpler for most teams. Generate baselines once in CI after any intentional UI change, then commit the updated images.
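Option 1 can be wired up as a manually triggered workflow that regenerates baselines on the CI runner and uploads them for download. This is a sketch mirroring the test workflow above; the workflow, job, and artifact names are illustrative:

```yaml
# .github/workflows/update-baselines.yml (illustrative names)
name: Update visual baselines
on: workflow_dispatch
jobs:
  update:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'pnpm'
      - run: pnpm install
      - run: pnpm exec playwright install --with-deps chromium
      - run: pnpm test:visual --update
      - uses: actions/upload-artifact@v4
        with:
          name: updated-baselines
          path: tests/__screenshots__/
```

Download the updated-baselines artifact from the completed run, copy the images into your repo, and commit them with the UI change.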
Reviewing Visual Changes in Pull Requests
When a PR changes your app’s appearance, the visual regression tests will fail. This is correct behavior. The review workflow is:
- Open the failed CI run and download the visual regression artifacts
- Compare the baseline and actual screenshots. Does the change look intentional?
- If yes, the PR author runs pnpm test:visual --update (in the CI environment or matching Docker container), commits the new baselines, and pushes
- If no, there’s an unintended regression. Fix the CSS or component code
GitHub and most Git GUIs render image diffs side-by-side. Some teams add a CI step that posts the diff images as PR comments for easier review.
What to Screenshot
You don’t need to screenshot everything. Focus on states where visual regressions are likely and costly:
- Default state with representative data. The happy path your users see most.
- Empty state. How your app looks with no data, zero results, or a fresh load.
- Error state. Error handling UI should look correct too.
- Dark mode. For each screenshot you take in light mode, take one in dark mode.
- Each display mode you support. Inline, fullscreen, and PiP have different viewport constraints.
- Edge-case data. Very long text, deeply nested structures, single-item lists.
Skip screenshots of intermediate loading spinners or states that are inherently transient. They create flaky tests and don’t add much value.
Putting It All Together
Here’s a complete visual regression test file for an albums resource that covers the states above:
// tests/e2e/albums.visual.spec.ts
import { test, expect } from 'sunpeak/test';
test('default state', async ({ inspector }) => {
const result = await inspector.renderTool('get-albums', {
artist: 'Radiohead',
});
await expect(result.app().getByText('OK Computer')).toBeVisible();
await result.screenshot('albums-default');
});
test('dark mode', async ({ inspector }) => {
const result = await inspector.renderTool('get-albums', {
artist: 'Radiohead',
}, { theme: 'dark' });
await result.screenshot('albums-dark');
});
test('empty results', async ({ inspector }) => {
const result = await inspector.renderTool('get-albums', {
artist: 'Unknown Artist With No Albums',
});
await expect(result.app().getByText('No albums found')).toBeVisible();
await result.screenshot('albums-empty');
});
test('fullscreen display mode', async ({ inspector }) => {
const result = await inspector.renderTool('get-albums', {
artist: 'Radiohead',
}, { displayMode: 'fullscreen' });
await result.screenshot('albums-fullscreen');
});
test('pip display mode', async ({ inspector }) => {
const result = await inspector.renderTool('get-albums', {
artist: 'Radiohead',
}, { displayMode: 'pip' });
await result.screenshot('albums-pip');
});
These five tests, running against both ChatGPT and Claude hosts, produce ten baseline screenshots. That’s ten opportunities to catch a visual regression before your users do.
Get Started
sunpeak’s visual regression testing runs locally and in CI/CD with no paid AI host accounts. Add result.screenshot() calls to your e2e tests and run pnpm test:visual.
npx sunpeak new
pnpm test:visual --update # Generate initial baselines
pnpm test:visual # Compare against baselines
See the testing framework docs for the full API reference, or follow the complete testing guide for a walkthrough of unit, e2e, and visual tests together.
Further Reading
- Snapshot testing MCP Apps - text-based snapshot comparison
- Complete guide to testing ChatGPT Apps and MCP Apps
- MCP App CI/CD - run your tests in GitHub Actions
- How to test Claude Connectors - unit tests, inspector, and CI/CD
- Mocking and stubbing in MCP App tests
- MCP App styling - host CSS variables, dark mode, and native-looking UIs
- Testing framework
- MCP App framework
- ChatGPT App framework
- Claude Connector framework
Frequently Asked Questions
What is visual regression testing for MCP Apps?
Visual regression testing captures screenshots of your MCP App rendering inside host runtimes (ChatGPT, Claude) and compares them pixel-by-pixel against saved baselines. When the UI changes unexpectedly (a CSS tweak breaks a layout, or a host update shifts your rendering), the test fails and shows you a diff image highlighting exactly what changed. This catches bugs that unit tests and snapshot tests miss, because those only check HTML structure, not actual visual output.
How do I run visual regression tests for an MCP App?
Run pnpm test:visual. This runs your e2e tests with screenshot comparison enabled. Any result.screenshot() calls in your tests capture the current rendering and compare it against baseline images. If no baselines exist yet, run pnpm test:visual --update to generate them. The tests run against both ChatGPT and Claude hosts automatically.
What is the difference between snapshot testing and visual regression testing for MCP Apps?
Snapshot testing serializes rendered HTML as text and compares text diffs. It catches structural changes like missing elements or changed attributes but misses CSS bugs, layout shifts, and color changes. Visual regression testing compares actual screenshots pixel-by-pixel, so it catches any visible change. Snapshot tests run in milliseconds without a browser. Visual regression tests need a real browser (Playwright) and take longer but cover what you actually see.
How do I configure screenshot comparison thresholds?
Pass Playwright toHaveScreenshot options to result.screenshot() or set project-wide defaults in defineConfig(). The threshold option (0 to 1) controls per-pixel color sensitivity. maxDiffPixelRatio (0 to 1) controls what percentage of pixels can differ before failing. For example, threshold 0.2 with maxDiffPixelRatio 0.05 means each pixel can differ by up to 20% in color, and up to 5% of all pixels can differ, before the test fails.
How do I test visual regressions across ChatGPT and Claude hosts?
sunpeak runs visual regression tests against both hosts automatically. Each screenshot is saved with the host name in the path (e.g., __screenshots__/chatgpt/test.png and __screenshots__/claude/test.png). If your app looks correct in ChatGPT but breaks in Claude, the Claude screenshot baseline will catch it. You maintain separate baselines per host because each host has different chrome, colors, and spacing.
How do I update visual regression baselines after intentional UI changes?
Run pnpm test:visual --update. This replaces all baseline images with the current screenshots. Review the changed images in your git diff (most Git GUIs show image diffs) before committing. Only update baselines when the visual change is intentional. Commit the updated baseline images alongside the code change that caused them.
Can I run visual regression tests for MCP Apps in CI/CD?
Yes. Add pnpm test:visual to your GitHub Actions workflow. Baseline images live in your repo so the CI runner compares against them. If a test fails, the CI artifacts include the actual screenshot, the baseline, and a diff image. Install Playwright browsers with pnpm exec playwright install --with-deps chromium before running the tests.
What parts of my MCP App should I screenshot for visual regression tests?
At minimum, screenshot each resource component in its default state with default tool data. Then add screenshots for dark mode, different display modes (inline, fullscreen, PiP), loading and error states, and edge-case data like empty lists or long text. Use the target option to capture just the app iframe (default), the full inspector page, or a specific element.