Cross-Host Compatibility Testing for MCP Apps: ChatGPT, Claude, and Beyond (May 2026)

Abe Wheeler
Cross-host compatibility testing catches bugs that only appear in specific MCP App hosts.

Your MCP App passes every test. Green across the board. You submit it to the ChatGPT App Store, it works perfectly, and then someone installs it as a Claude Connector. The layout breaks. A button that worked in ChatGPT does nothing. The dark mode colors look wrong.

MCP Apps run on an open standard, which means one codebase works across ChatGPT, Claude, VS Code Copilot, Goose, Postman, and MCPJam. But “works” is doing a lot of heavy lifting. Each host renders your app in its own iframe with its own CSS variables, its own display mode behavior, and its own set of host-specific APIs. If you only test against one host, you’re only testing half the picture.

TL;DR: Use defineConfig() from sunpeak/test/config to run every test against both ChatGPT and Claude automatically. Build a compatibility test matrix that covers host × display mode × theme. Test feature detection paths (isChatGPT(), isClaude()) so both the host-specific UI and the fallback get exercised. Use visual regression tests with per-host screenshot baselines to catch CSS differences. Run the full matrix in CI/CD so regressions never reach production.

What Differs Between Hosts

Before writing compatibility tests, you need to know what can go wrong. Here’s what actually changes from one host to another.

CSS Variables and Theming

Each host injects its own set of CSS variables into the iframe where your app renders. ChatGPT and Claude share some variable names from the MCP Apps protocol, but the values differ: different font families, different spacing scales, different color palettes. Claude’s dark mode palette is not the same as ChatGPT’s dark mode palette.

If you hardcode colors instead of using var(--text-primary) or similar host variables, your app will look correct on the host you tested and wrong everywhere else. The styling guide covers the available CSS variables. From a testing perspective, what matters is that you verify your app looks right under both sets of values.
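
As a quick sketch of what that looks like in practice, the component below reads its color from the host token mentioned above instead of a literal. The fallback value after the comma is only an assumption about what you might want if a host omits the variable; check the styling guide for the full token list.

import type { ReactNode } from 'react';

// Read colors from host-injected CSS variables rather than hardcoding them.
// --text-primary is the token referenced above; the fallback after the comma
// only applies if a host doesn't provide the variable.
export function MetricLabel({ children }: { children: ReactNode }) {
  return (
    <span style={{ color: 'var(--text-primary, inherit)' }}>
      {children}
    </span>
  );
}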

Display Mode Behavior

MCP Apps support three display modes: inline (embedded in the conversation), fullscreen (takes up the full screen below the host header), and picture-in-picture (floating window). All three are defined in the protocol, but hosts implement them differently.

For example, PiP mode on mobile screen widths in ChatGPT falls back to fullscreen. Other hosts may not support PiP at all. The iframe dimensions at each display mode vary between hosts because each host has different chrome (header bars, sidebars, padding). Your layout needs to adapt, and your tests need to verify it does.
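
Here is a sketch of that adaptation using the core useDisplayMode hook, assuming it returns the current mode as one of 'inline', 'fullscreen', or 'pip' (check the API docs for the exact shape):

import { useDisplayMode } from 'sunpeak';

// Branch layout on the current display mode instead of assuming one host's
// iframe dimensions. PiP gets the least space, so it renders the compact view.
export function ChartPanel() {
  const displayMode = useDisplayMode();

  if (displayMode === 'pip') {
    return <div className="compact-view">{/* condensed chart */}</div>;
  }

  return (
    <div
      className="full-view"
      // Fullscreen can fill the viewport; inline should size to its content
      style={{ height: displayMode === 'fullscreen' ? '100%' : 'auto' }}
    >
      {/* full chart and controls */}
    </div>
  );
}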

Host-Specific APIs

ChatGPT provides APIs that Claude doesn’t: useRequestCheckout for payments, useUploadFile for file uploads, and useRequestModal for modal windows. These live under sunpeak/chatgpt and are gated behind isChatGPT() runtime checks. When your app uses these APIs, it needs a fallback path for hosts where they’re not available.

If you only test on ChatGPT, the fallback paths never run. If you only test on Claude, the host-specific paths never run. You need both.

Viewport and Safe Areas

Each host wraps your iframe in different chrome. ChatGPT has a conversation sidebar. Claude has its own layout. The usable viewport at each display mode is a different number of pixels in each host. SafeArea from the core sunpeak API handles this, but if you’re doing manual layout calculations or absolute positioning, the numbers won’t match across hosts.
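
A minimal sketch of the SafeArea approach, assuming it simply wraps your content (see the core API docs for its actual props):

import { SafeArea } from 'sunpeak';

// Let SafeArea account for each host's header bars and padding rather than
// hardcoding per-host pixel offsets that will drift out of date.
export function DashboardRoot() {
  return (
    <SafeArea>
      {/* Size inner content with flex/grid so it adapts to whatever
          viewport the host actually provides */}
      <div className="dashboard-container" style={{ display: 'flex', flexDirection: 'column' }}>
        {/* panels */}
      </div>
    </SafeArea>
  );
}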

Setting Up Cross-Host Tests

The mechanics of running tests across hosts are straightforward. Two pieces of configuration give you automatic cross-host coverage.

defineConfig Creates Projects Per Host

// playwright.config.ts
import { defineConfig } from 'sunpeak/test/config';
export default defineConfig();

This creates separate Playwright projects for ChatGPT and Claude. Every test file runs twice, once per host. The test report prefixes each result with the host name:

✓ [chatgpt] dashboard renders revenue chart (1.1s)
✓ [chatgpt] dashboard adapts to dark mode (0.9s)
✗ [claude] dashboard renders revenue chart (1.3s)
✓ [claude] dashboard adapts to dark mode (1.0s)

When a test fails on Claude but passes on ChatGPT, you know the bug is host-specific. You don’t have to guess.

The Inspector Fixture

Tests use the inspector fixture from sunpeak/test, which provides renderTool() and host properties:

import { test, expect } from 'sunpeak/test';

test('weather card shows temperature', async ({ inspector }) => {
  const result = await inspector.renderTool('show-weather', {
    city: 'Portland',
    state: 'Oregon',
  });
  const app = result.app();

  await expect(app.locator('.temperature')).toContainText('82');
});

This test runs against both hosts automatically. No host-selection logic needed in the test itself. The E2E testing guide covers renderTool options like display modes and themes.

Building a Compatibility Test Matrix

Running every test on every host is the baseline. A compatibility test matrix goes further: it systematically covers the combinations of host, display mode, and theme that matter for your app.

The Dimensions

Your app can render in:

  • 2 hosts: ChatGPT, Claude
  • 3 display modes: inline, fullscreen, PiP
  • 2 themes: light, dark

That’s 12 combinations. You don’t need 12 copies of every test, but you do need to make sure that every combination gets exercised by at least one test.

Organizing Matrix Tests

Group your matrix tests by the dimension they’re primarily testing:

import { test, expect } from 'sunpeak/test';

const displayModes = ['inline', 'fullscreen', 'pip'] as const;
const themes = ['light', 'dark'] as const;

for (const mode of displayModes) {
  for (const theme of themes) {
    test(`dashboard renders in ${mode} ${theme}`, async ({ inspector }) => {
      const result = await inspector.renderTool('show-dashboard', undefined, {
        displayMode: mode,
        theme,
      });
      const app = result.app();

      await expect(app.locator('.dashboard-container')).toBeVisible();

      if (mode === 'pip') {
        await expect(app.locator('.compact-view')).toBeVisible();
      } else {
        await expect(app.locator('.full-view')).toBeVisible();
      }
    });
  }
}

Because defineConfig() already runs each test on both hosts, this loop produces 12 test runs (2 hosts × 3 modes × 2 themes). If your dashboard breaks only in Claude dark mode fullscreen, you’ll see exactly which combination failed.

Focus on the Risky Combinations

Not every combination is equally likely to break. Prioritize tests for:

  • Dark mode on Claude: Claude’s dark palette differs significantly from ChatGPT’s. If you’re using host CSS variables correctly, both work. If you hardcoded a color anywhere, dark mode on the “other” host is where it breaks.
  • PiP on hosts with limited support: Some hosts don’t support PiP or fall back to fullscreen. Your PiP layout should degrade gracefully.
  • Inline mode with long content: Inline mode gives your app the least space. If your layout overflows on one host but not the other (because of different inline viewport widths), that’s a cross-host bug.
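
As an example of the third case, a targeted overflow check for inline mode might look like this (the tool name and selector are placeholders for your own):

import { test, expect } from 'sunpeak/test';

// Runs on both hosts via defineConfig(), so a host-specific inline viewport
// that causes horizontal overflow shows up as a failure on that host only.
test('inline mode does not overflow horizontally', async ({ inspector }) => {
  const result = await inspector.renderTool('show-dashboard', undefined, {
    displayMode: 'inline',
  });
  const app = result.app();

  const overflows = await app
    .locator('.dashboard-container')
    .evaluate(el => el.scrollWidth > el.clientWidth);

  expect(overflows).toBe(false);
});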

Testing Feature Detection and Fallbacks

When your app uses host-specific APIs, you’re branching your code based on which host is running. Both branches need tests.

The Pattern in Your Component

A typical host-specific feature gate looks like this:

import { isChatGPT } from 'sunpeak';
import { useRequestCheckout } from 'sunpeak/chatgpt';

function BuyButton({ sku }: { sku: string }) {
  // Fallback path for hosts without native checkout
  if (!isChatGPT()) {
    return <a href={`/checkout?sku=${sku}`}>Buy on Web</a>;
  }
  return <ChatGPTBuyButton sku={sku} />;
}

// The hook lives in its own component so it is only called when the host is
// ChatGPT, keeping the Rules of Hooks intact.
function ChatGPTBuyButton({ sku }: { sku: string }) {
  const requestCheckout = useRequestCheckout();
  return <button onClick={() => requestCheckout({ sku })}>Buy Now</button>;
}

Testing Both Paths

Write tests that assert the correct UI for each host:

import { test, expect } from 'sunpeak/test';

test('buy button uses native checkout on ChatGPT', async ({ inspector }) => {
  test.skip(inspector.host !== 'chatgpt', 'ChatGPT-only feature');

  const result = await inspector.renderTool('show-product', { sku: 'WIDGET-1' });
  const app = result.app();

  const button = app.locator('button:has-text("Buy Now")');
  await expect(button).toBeVisible();
  // No <a> tag fallback should render
  await expect(app.locator('a:has-text("Buy on Web")')).not.toBeVisible();
});

test('buy button falls back to web link on non-ChatGPT hosts', async ({ inspector }) => {
  test.skip(inspector.host === 'chatgpt', 'Testing fallback path');

  const result = await inspector.renderTool('show-product', { sku: 'WIDGET-1' });
  const app = result.app();

  const link = app.locator('a:has-text("Buy on Web")');
  await expect(link).toBeVisible();
  await expect(link).toHaveAttribute('href', '/checkout?sku=WIDGET-1');
});

The first test only runs on ChatGPT and checks the host-specific path. The second test only runs on other hosts and checks the fallback. Together, they cover both branches.

When the Fallback is “Nothing”

Some host-specific features have no reasonable fallback. Maybe useUploadFile only makes sense in ChatGPT, and on other hosts you just don’t show the upload button. Test that it’s absent:

test('upload button hidden on non-ChatGPT hosts', async ({ inspector }) => {
  test.skip(inspector.host === 'chatgpt', 'Testing absence of ChatGPT feature');

  const result = await inspector.renderTool('show-editor');
  const app = result.app();

  await expect(app.locator('[data-testid="upload-btn"]')).not.toBeVisible();
});

The fallback is the absence of the feature, and that’s worth testing. A bug where isChatGPT() returns true on the wrong host (or where the gate has a logic error) would render a broken button on Claude.

Visual Regression Tests Across Hosts

Structural assertions (is this element visible?) catch missing or misplaced elements. They don’t catch a background color that blends with text, a margin that shifts 10px, or a font that renders wider in one host than another. Visual regression tests do.

The visual regression testing guide covers the full setup. For cross-host compatibility, the key detail is that screenshot baselines are saved per host:

__screenshots__/
  chatgpt/
    dashboard-inline-light.png
    dashboard-inline-dark.png
    dashboard-fullscreen-light.png
  claude/
    dashboard-inline-light.png
    dashboard-inline-dark.png
    dashboard-fullscreen-light.png

You maintain separate baselines because the hosts should look different: ChatGPT and Claude each render with their own palette. What you’re catching is unintended changes within each host’s rendering.

Add screenshot captures to your matrix tests:

import { test } from 'sunpeak/test';

const displayModes = ['inline', 'fullscreen'] as const;
const themes = ['light', 'dark'] as const;

for (const mode of displayModes) {
  for (const theme of themes) {
    test(`dashboard visual - ${mode} ${theme}`, async ({ inspector }) => {
      const result = await inspector.renderTool('show-dashboard', undefined, {
        displayMode: mode,
        theme,
      });

      await result.screenshot();
    });
  }
}

This generates 8 baseline images (2 hosts × 2 modes × 2 themes). When a CSS change breaks the Claude dark mode layout, the diff image shows exactly what moved.

CSS-Specific Cross-Host Assertions

Sometimes you need to verify a specific CSS property without a full visual regression run. Playwright’s CSS assertions work on elements inside the inspector iframe, whether through toHaveCSS or by reading computed styles with evaluate():

test('text uses host color token', async ({ inspector }) => {
  const result = await inspector.renderTool('show-dashboard');
  const app = result.app();

  const heading = app.locator('h1');
  const color = await heading.evaluate(
    el => getComputedStyle(el).color
  );

  // Should not be a hardcoded value
  // The exact value depends on the host, but it should not be black (#000)
  // or any color you hardcoded
  expect(color).not.toBe('rgb(0, 0, 0)');
});

This test doesn’t check for a specific color (because each host has a different palette). It checks that you’re not using a hardcoded value that would look wrong on some hosts. A more targeted version:

test('links are readable against background', async ({ inspector }) => {
  const result = await inspector.renderTool('show-dashboard', undefined, {
    theme: 'dark',
  });
  const app = result.app();

  const link = app.locator('a').first();
  const linkColor = await link.evaluate(el => getComputedStyle(el).color);
  const bgColor = await app.locator('.dashboard-container').evaluate(
    el => getComputedStyle(el).backgroundColor
  );

  // Basic contrast check: link and background shouldn't be the same color
  expect(linkColor).not.toBe(bgColor);
});

This catches the common cross-host bug where a hardcoded link color matches the host’s dark mode background, making the link invisible.
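
For properties you set yourself (layout rules rather than host color tokens), the expected value is the same on every host, so Playwright’s toHaveCSS keeps the assertion short. A sketch, assuming the dashboard container is a flex layout:

import { test, expect } from 'sunpeak/test';

test('dashboard container keeps its flex layout on every host', async ({ inspector }) => {
  const result = await inspector.renderTool('show-dashboard');
  const app = result.app();

  // A layout rule the app owns: the value should not vary by host,
  // unlike host-provided color tokens.
  await expect(app.locator('.dashboard-container')).toHaveCSS('display', 'flex');
});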

Running Cross-Host Tests in CI/CD

Your GitHub Actions workflow should run the full cross-host matrix. The standard setup already handles this because defineConfig() creates both host projects:

# .github/workflows/test.yml
name: Test
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: pnpm
      - run: pnpm install
      - run: pnpm exec playwright install --with-deps chromium
      - run: pnpm test
      - run: pnpm test:visual
      - uses: actions/upload-artifact@v4
        if: failure()
        with:
          name: test-results
          path: test-results/

When a visual regression test fails, the uploaded artifacts include the actual screenshot, the baseline, and the diff image. The file paths include the host name, so you can see immediately which host broke.

For visual regression tests, update baselines explicitly when making intentional visual changes:

pnpm test:visual --update

Review the changed images in your git diff before committing. Most Git GUIs render image diffs side-by-side.

A Compatibility Checklist

When you add a new feature or resource to your MCP App, run through this checklist:

  1. Does it use host CSS variables? If you hardcoded any colors, fonts, or spacing, replace them with CSS variables from the host. Test in both light and dark themes.
  2. Does it adapt to display modes? Test inline, fullscreen, and PiP. Check that content doesn’t overflow in inline mode and that the layout uses available space in fullscreen.
  3. Does it use host-specific APIs? Write tests for both the host-specific path and the fallback. Make sure the fallback renders something useful (or hides the feature cleanly).
  4. Does the layout depend on viewport size? Each host gives your iframe different dimensions. Use relative units and flexbox/grid instead of fixed pixel values.
  5. Does it handle SafeArea? Wrap your content in SafeArea so it respects each host’s chrome and padding. Don’t offset manually.

If you use the sunpeak Inspector (npx sunpeak inspect), you can toggle between ChatGPT and Claude from the sidebar, switch display modes, and toggle themes without redeploying. This gives you fast manual verification before your automated tests run.

When to Add Host-Specific Tests

Not every component needs a compatibility matrix. Here’s a simple rule:

  • Portable components (data display, forms, charts) using only core sunpeak hooks: the automatic cross-host runs from defineConfig() are enough. If the test passes on both hosts, you’re good.
  • Layout-sensitive components (responsive grids, collapsible panels, components that fill available space): add display mode × host tests to verify layout at different viewport sizes.
  • Components with host-specific features (checkout, file upload, modals): write explicit tests for both the feature path and the fallback path.
  • Styled components with custom CSS: add visual regression screenshots per host and theme to catch color and spacing drift.

The goal is coverage without redundancy. Don’t write 12-combination matrix tests for a component that renders a static paragraph. Save the matrix for components where host differences actually affect the output.


Frequently Asked Questions

Why do I need cross-host testing for MCP Apps?

MCP Apps render inside host iframes, and each host (ChatGPT, Claude, VS Code Copilot, Goose, etc.) provides different CSS variables, display mode support, viewport dimensions, and host-specific APIs. An app that looks correct in ChatGPT can break in Claude because of different padding, font stacks, or color tokens. Cross-host testing catches these bugs before your users see them.

What are the main differences between ChatGPT and Claude for MCP Apps?

ChatGPT and Claude differ in CSS variable values (colors, fonts, spacing), display mode behavior (PiP support varies), theme implementation (different dark mode palettes), viewport dimensions at each display mode, and host-specific APIs. ChatGPT provides useRequestCheckout, useUploadFile, and useRequestModal through sunpeak/chatgpt. Claude has its own connector-specific features. The core MCP Apps protocol (useToolData, useAppState, useDisplayMode, SafeArea) works identically across both.

How do I run MCP App tests against both ChatGPT and Claude?

Use defineConfig() from sunpeak/test/config in your playwright.config.ts. It creates separate Playwright projects for each host automatically. Every test runs once per host, and the test report shows [chatgpt] or [claude] prefixes so you can see which host failed. No paid accounts or API credits required.

How do I test host-specific feature detection in MCP Apps?

Use isChatGPT() and isClaude() runtime checks in your components, then write tests that verify both code paths. In ChatGPT host tests, assert the host-specific UI renders (e.g., a native checkout button). In Claude host tests, assert the fallback renders instead (e.g., a standard link). The inspector fixture provides inspector.host so you can write conditional assertions per host.

What is a compatibility test matrix for MCP Apps?

A compatibility test matrix systematically covers the combinations of host (ChatGPT, Claude), display mode (inline, fullscreen, PiP), and theme (light, dark) that your app needs to support. Instead of testing each dimension separately, you identify the combinations that matter most and write tests that cover them. This catches bugs that only appear in specific combinations, like a dark mode layout break in Claude fullscreen.

How do I test CSS differences between ChatGPT and Claude hosts?

Use visual regression tests with pnpm test:visual. Screenshots are saved per host (e.g., __screenshots__/chatgpt/test.png and __screenshots__/claude/test.png), so you maintain separate baselines for each host. For targeted CSS checks, use Playwright locator assertions like toHaveCSS to verify specific properties. Always use host CSS variables (var(--text-primary)) instead of hardcoded colors so your app adapts to each host theme.

Can I skip a test on a specific host?

Yes. Use test.skip(inspector.host === "claude", "reason") to skip a test on Claude, or test.skip(inspector.host === "chatgpt", "reason") to skip on ChatGPT. This is useful for tests that exercise host-specific features not available on all hosts, like PiP mode on hosts that do not support it.

How do I test MCP Apps for hosts beyond ChatGPT and Claude?

The MCP Apps protocol is an open standard. Apps built with the core sunpeak API (useToolData, useAppState, useDisplayMode, useHostContext, SafeArea) work on any host that implements the spec, including VS Code via GitHub Copilot, Goose, Postman, and MCPJam. If your tests pass on ChatGPT and Claude, the portable code paths work everywhere. Host-specific features gated behind isChatGPT() or isClaude() only run where the API is available.