Live Testing for Claude Connectors and ChatGPT Apps: Test Against Real Hosts with Playwright

Abe Wheeler
Claude Connectors · Claude Connector Testing · MCP Apps · MCP App Testing · ChatGPT Apps · ChatGPT App Testing · Playwright · Live Testing
Playwright running live tests against a real ChatGPT session with a Claude Connector.

The sunpeak simulator tests cover a lot. They replicate the ChatGPT and Claude runtimes, run display mode transitions, test themes, and validate tool invocations without any paid accounts or AI credits. For most development work, they’re enough.

But simulators don’t catch everything. Real ChatGPT wraps your app in a nested iframe sandbox. The MCP protocol goes through ChatGPT’s actual connection layer. Resource loading happens over a real network with production builds. There’s a gap between “works in the simulator” and “works in ChatGPT,” and the only way to close it is to test against the real thing.

sunpeak 0.16.23 adds live testing: automated Playwright tests that run against real ChatGPT. You write the same kind of assertions you write for simulator tests, and sunpeak handles authentication, MCP server refresh, host-specific message formatting, and iframe traversal.

TL;DR: Run pnpm test:live with a tunnel active. sunpeak imports your browser session, starts the dev server, refreshes the MCP connection, and runs your tests/live/*.spec.ts files in parallel against real ChatGPT. You write assertions against the app iframe. Everything else is automated.

What Live Tests Actually Do

A live test opens a real ChatGPT session in a browser, types a message that triggers your MCP tool, waits for ChatGPT to call it, and then asserts against the rendered app inside the host’s iframe.

Here’s a complete live test for an albums resource:

import { test, expect } from 'sunpeak/test';

test('albums tool renders photo grid', async ({ live }) => {
  const app = await live.invoke('show-albums');

  await expect(app.getByText('Summer Slice')).toBeVisible({ timeout: 15_000 });
  await expect(app.locator('img').first()).toBeVisible();

  // Switch to dark mode without re-invoking the tool
  await live.setColorScheme('dark', app);
  await expect(app.getByText('Summer Slice')).toBeVisible();
});

live.invoke('show-albums') starts a new chat, sends /{appName} show-albums to ChatGPT, waits for the LLM response to finish streaming, waits for the app iframe to render, and returns a Playwright FrameLocator pointed at your app’s content. From there, it’s standard Playwright assertions.

The { timeout: 15_000 } accounts for the LLM response time. ChatGPT needs to process your message, decide to call the tool, receive the result, and render the iframe. In practice this takes 5 to 10 seconds.

Prerequisites

You need three things:

  1. A ChatGPT account with MCP/Apps support (Plus or higher)
  2. A tunnel tool like ngrok or Cloudflare Tunnel
  3. Your MCP server connected in ChatGPT (Settings > Apps > Create, enter your tunnel URL with /mcp path)

You do not need to install anything extra in your sunpeak project. Live test infrastructure ships with sunpeak starting at v0.16.23. New projects scaffolded with sunpeak new include example live test specs and the Playwright config.

Running Live Tests

Open two terminals:

# Terminal 1: Start a tunnel
ngrok http 8000

# Terminal 2: Run live tests
pnpm test:live

On first run, sunpeak imports your ChatGPT session from your browser. It checks Chrome, Arc, Brave, and Edge automatically. If no valid session is found, it opens a browser window and waits for you to log in. The session is saved to tests/live/.auth/chatgpt.json and reused for 24 hours.
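If you need to switch ChatGPT accounts, or the cached session goes stale before the 24-hour window is up, deleting the stored auth state forces a fresh import on the next run. A small sketch using the session path mentioned above:

```shell
# Sessions are cached for 24 hours at tests/live/.auth/chatgpt.json.
# Delete the file to force a fresh browser-session import on the
# next `pnpm test:live` run (-f so this succeeds even if it's absent).
rm -f tests/live/.auth/chatgpt.json
```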

After authentication, sunpeak:

  1. Starts sunpeak dev --prod-resources (production resource builds)
  2. Navigates to ChatGPT Settings > Apps, finds your MCP server, and clicks Refresh
  3. Runs all tests/live/*.spec.ts files fully in parallel, each in its own chat window

The MCP refresh happens once in globalSetup, before any test workers start. This means your test workers don’t each individually refresh the connection, which would be slow and flaky.

The Fixture API

All live tests import from sunpeak/test:

import { test, expect } from 'sunpeak/test';

The test function provides a live fixture with:

Method                             What it does
invoke(prompt)                     Starts a new chat, sends the prompt with host-specific formatting, waits for the app iframe, returns a FrameLocator
sendMessage(text)                  Sends a message in the current chat with the /{appName} prefix
sendRawMessage(text)               Sends a message without any prefix
startNewChat()                     Opens a fresh conversation
waitForAppIframe()                 Waits for the MCP app iframe and returns a FrameLocator
setColorScheme(scheme, appFrame?)  Switches to 'light' or 'dark' via page.emulateMedia()
page                               Raw Playwright Page object

Most tests only need invoke and setColorScheme. The invoke method handles the full flow: new chat, message formatting (ChatGPT requires /{appName} before your prompt), waiting for streaming to finish, waiting for the nested iframe to render, and returning a locator into your app’s content.
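The lower-level methods compose for multi-turn flows. Here is a sketch under the fixture API listed above; the follow-up prompt text and the expectation that the host re-renders the iframe after a second tool call are illustrative assumptions, not sunpeak guarantees:

```typescript
import { test, expect } from 'sunpeak/test';

test('follow-up message updates the rendered app', async ({ live }) => {
  // First turn: invoke the tool and assert on the initial render
  const app = await live.invoke('show-albums');
  await expect(app.getByText('Summer Slice')).toBeVisible({ timeout: 15_000 });

  // Second turn in the SAME chat: sendMessage prepends /{appName} for you
  await live.sendMessage('refresh the album list');

  // The host may replace the iframe on a new tool call, so re-acquire
  // the FrameLocator rather than reusing the stale one
  const refreshed = await live.waitForAppIframe();
  await expect(refreshed.locator('img').first()).toBeVisible({ timeout: 15_000 });
});
```

Use sendRawMessage instead when you want ChatGPT to decide on its own whether to call your tool, without the app-name hint.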

Theme Testing Without Re-Invocation

Sending a second message to trigger a new tool call is slow and burns credits. setColorScheme avoids that by switching the browser’s prefers-color-scheme via Playwright’s page.emulateMedia(). ChatGPT propagates the change into the iframe, and your app re-renders with the new theme.

test('ticket card text stays readable in dark mode', async ({ live }) => {
  const app = await live.invoke('show-ticket');

  const title = app.getByText('Search results not loading on mobile');
  await expect(title).toBeVisible({ timeout: 15_000 });

  // Verify status badge and assignee are visible in light mode
  await expect(app.getByText('in progress')).toBeVisible();
  await expect(app.getByText('Sarah Chen')).toBeVisible();

  // Switch to dark mode — common bugs: text blends into background,
  // borders disappear, badge colors lose contrast
  await live.setColorScheme('dark', app);

  // Same elements should still be visible with the new theme applied
  await expect(title).toBeVisible();
  await expect(app.getByText('in progress')).toBeVisible();
  await expect(app.getByText('Sarah Chen')).toBeVisible();

  // Badge background should still be distinguishable from the card
  const badge = app.locator('span:has-text("high")');
  const badgeBg = await badge.evaluate(
    (el) => window.getComputedStyle(el).backgroundColor
  );
  expect(badgeBg).not.toBe('rgba(0, 0, 0, 0)');
});

The second argument to setColorScheme tells it to wait for the app’s <html data-theme="dark"> attribute to confirm the theme propagated through the iframe boundary before your assertions run.

A Full Example

Here’s a live test for a review card resource. It invokes the tool, checks the rendered content, verifies a button interaction triggers a state transition, and confirms the card re-themes correctly in dark mode:

import { test, expect } from 'sunpeak/test';

test('review card renders and handles approval flow', async ({ live }) => {
  const app = await live.invoke('review-diff');

  // Verify the card rendered with the right content
  const title = app.locator('h1').first();
  await expect(title).toBeVisible({ timeout: 15_000 });
  await expect(title).toHaveText('Refactor Authentication Module');

  // Action buttons present
  const applyButton = app.getByRole('button', { name: 'Apply Changes' });
  await expect(applyButton).toBeVisible();

  // Theme switch: card should stay readable in dark mode
  await live.setColorScheme('dark', app);
  await expect(title).toBeVisible();
  await expect(applyButton).toBeVisible();

  // Click Apply Changes — UI transitions to accepted state
  await applyButton.click();
  await expect(applyButton).not.toBeVisible({ timeout: 5_000 });
  await expect(
    app.locator('text=Applying changes...').first()
  ).toBeVisible({ timeout: 5_000 });
});

This catches real issues that simulator tests can miss: the iframe sandbox blocking a script load, a theme change not propagating through the nested iframe boundary, or a button click failing because of host-specific event handling.

The Playwright Config

The live test config is a one-liner:

// tests/live/playwright.config.ts
import { defineLiveConfig } from 'sunpeak/test/config';

export default defineLiveConfig();

This generates a full Playwright config with:

  • globalSetup pointing to sunpeak’s auth and MCP refresh flow
  • headless: false because chatgpt.com blocks headless browsers
  • Anti-bot browser arguments and a real Chrome user agent
  • 2-minute timeout per test (LLM responses can be slow)
  • 1 retry per test (LLM responses are non-deterministic)
  • Fully parallel execution (each test gets its own chat)
  • Automatic dev server with --prod-resources on a dynamically allocated port

You can pass options to customize the environment:

export default defineLiveConfig({
  colorScheme: 'dark',
  viewport: { width: 1440, height: 900 },
  locale: 'fr-FR',
  timezoneId: 'Europe/Paris',
  geolocation: { latitude: 48.8566, longitude: 2.3522 },
  permissions: ['geolocation'],
});
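These environment options matter when your resource reads browser state. For example, with the geolocation and permissions settings above, a location-aware app sees the emulated Paris coordinates through navigator.geolocation. The prompt and the rendered text in this sketch are hypothetical placeholders for your own app:

```typescript
// tests/live/map.spec.ts (assumes the Paris geolocation config above)
import { test, expect } from 'sunpeak/test';

test('map card uses the emulated location', async ({ live }) => {
  // 'show-map' is a hypothetical tool name for illustration
  const app = await live.invoke('show-map');

  // With the geolocation permission granted and coordinates emulated,
  // the app's geolocation reads resolve to Paris
  await expect(app.getByText('Paris')).toBeVisible({ timeout: 15_000 });
});
```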

How It Relates to Simulator Tests

Live tests don’t replace simulator tests. They complement them.

              Simulator (pnpm test:e2e)           Live (pnpm test:live)
Runs against  Local simulator                     Real ChatGPT
Speed         Seconds                             10-30 seconds per test
Cost          Free                                Requires ChatGPT Plus
CI/CD         Yes                                 Not recommended (needs auth)
Catches       Component logic, display modes,     Real MCP connection, LLM tool
              themes, cross-host layout           invocation, iframe sandbox,
                                                  production resource loading

Use simulator tests for development and CI/CD. Use live tests before shipping, after major changes, or when debugging issues that only reproduce in the real host.

The Testing Pyramid for Claude Connectors

A Claude Connector built with sunpeak now has three test tiers:

  1. Unit tests (pnpm test): Vitest, jsdom, fast, test component logic in isolation
  2. Simulator e2e tests (pnpm test:e2e): Playwright against the local ChatGPT and Claude simulator, test display modes and themes, runs in CI/CD
  3. Live tests (pnpm test:live): Playwright against real ChatGPT (with Claude coming soon), test real MCP protocol behavior and iframe rendering

Each tier catches different classes of bugs. Unit tests catch logic errors. Simulator tests catch rendering and layout issues across hosts and display modes. Live tests catch protocol and sandbox issues that only show up in the real host environment.

All three are pre-configured when you run sunpeak new. You don’t need to set up Vitest, Playwright, or any test infrastructure yourself.

Host-Agnostic Architecture

The live test infrastructure is designed to support multiple hosts. The live fixture resolves the correct host page object based on the Playwright project name. All host-specific DOM interaction (selectors, login flow, settings navigation, iframe nesting) lives in per-host page objects that sunpeak maintains.

Your test code is host-agnostic:

import { test, expect } from 'sunpeak/test';

test('my resource renders', async ({ live }) => {
  const app = await live.invoke('show me something');
  await expect(app.locator('h1')).toBeVisible();
});

This same test will run against any host that sunpeak supports. Today that’s ChatGPT. When Claude live testing ships, add it with one line:

// tests/live/playwright.config.ts
export default defineLiveConfig({ hosts: ['chatgpt', 'claude'] });

No changes to your test files.

Getting Started

If you have an existing sunpeak project, update to v0.16.23 or later:

pnpm add sunpeak@latest && sunpeak upgrade

Create tests/live/playwright.config.ts:

import { defineLiveConfig } from 'sunpeak/test/config';
export default defineLiveConfig();

Add the test script to package.json:

{
  "scripts": {
    "test:live": "playwright test --config tests/live/playwright.config.ts"
  }
}

Write your first live test in tests/live/your-resource.spec.ts:

import { test, expect } from 'sunpeak/test';

test('my tool renders correctly in ChatGPT', async ({ live }) => {
  const app = await live.invoke('your prompt here');
  await expect(app.locator('your-selector')).toBeVisible({ timeout: 15_000 });
});

Start a tunnel, run pnpm test:live, and watch Playwright drive a real ChatGPT session.

New projects created with sunpeak new include all of this out of the box, with example live tests for every starter resource.

Get Started

Documentation →
pnpm add -g sunpeak && sunpeak new

Frequently Asked Questions

What is live testing for MCP Apps?

Live testing runs automated Playwright tests against a real ChatGPT session instead of a local simulator. Your test sends a message to ChatGPT, ChatGPT calls your MCP tool, and your test asserts against the rendered app inside the real host iframe. This catches issues that simulator tests miss: real MCP connection behavior, actual LLM tool invocation, host-specific iframe rendering, and production resource loading.

How do I run live tests for my Claude Connector?

Start a tunnel to your MCP server (e.g., ngrok http 8000), then run pnpm test:live. sunpeak imports your ChatGPT session from your browser automatically, starts the dev server with production resource builds, refreshes the MCP connection in ChatGPT settings, and runs your tests/live/*.spec.ts files in parallel. Each test gets its own chat window.

Do I need a paid ChatGPT account for live testing?

Yes. Live tests run against real ChatGPT, which requires a ChatGPT Plus or higher subscription for MCP/Apps support. For free testing during development, use sunpeak simulator tests (pnpm test:e2e) which replicate the ChatGPT and Claude runtimes locally with no account required. Live tests are for final validation before shipping.

How does sunpeak handle ChatGPT authentication in live tests?

On first run, sunpeak imports cookies from your browser (Chrome, Arc, Brave, or Edge) automatically. If no session is found, it opens a browser window for you to log in manually. The session is saved to tests/live/.auth/chatgpt.json and reused for 24 hours. All parallel test workers share the same session.

Can I test light mode and dark mode in a single live test?

Yes. The live fixture provides a setColorScheme method that switches the browser color scheme via Playwright page.emulateMedia(). Call live.setColorScheme("dark", app) after your initial assertions to switch themes without a second tool invocation. The method waits for the app iframe to confirm the theme change.

How is live testing different from e2e testing with the sunpeak simulator?

Simulator e2e tests (pnpm test:e2e) run against a local replica of ChatGPT and Claude. They are fast, free, and run in CI/CD without any accounts. Live tests (pnpm test:live) run against the real ChatGPT website. They catch host-specific rendering issues, real MCP protocol behavior, and iframe sandbox edge cases that the simulator cannot replicate. Use simulator tests for development, live tests for pre-ship validation.

What does the live test fixture API look like?

Import test and expect from sunpeak/test. The live fixture provides invoke(prompt) to start a new chat and get the app iframe, sendMessage(text) with automatic host formatting, setColorScheme(scheme) for theme switching, and the raw Playwright page object. A typical test is about 10 lines: invoke a tool, assert the rendered content, switch themes, assert again.

Can I run live tests against both ChatGPT and Claude?

The live testing infrastructure is host-agnostic. Tests import from sunpeak/test and use a live fixture that resolves the correct host based on the Playwright project name. ChatGPT is supported today. Claude support is coming. When it ships, add it with a one-line config change: defineLiveConfig({ hosts: ["chatgpt", "claude"] }).