Live Testing for Claude Connectors and ChatGPT Apps: Test Against Real Hosts with Playwright
Playwright running live tests against a real ChatGPT session with a Claude Connector.
The sunpeak simulator tests cover a lot. They replicate the ChatGPT and Claude runtimes, run display mode transitions, test themes, and validate tool invocations without any paid accounts or AI credits. For most development work, they’re enough.
But simulators don’t catch everything. Real ChatGPT wraps your app in a nested iframe sandbox. The MCP protocol goes through ChatGPT’s actual connection layer. Resource loading happens over a real network with production builds. There’s a gap between “works in the simulator” and “works in ChatGPT,” and the only way to close it is to test against the real thing.
sunpeak 0.16.23 adds live testing: automated Playwright tests that run against real ChatGPT. You write the same kind of assertions you write for simulator tests, and sunpeak handles authentication, MCP server refresh, host-specific message formatting, and iframe traversal.
TL;DR: Run `pnpm test:live` with a tunnel active. sunpeak imports your browser session, starts the dev server, refreshes the MCP connection, and runs your `tests/live/*.spec.ts` files in parallel against real ChatGPT. You write assertions against the app iframe. Everything else is automated.
What Live Tests Actually Do
A live test opens a real ChatGPT session in a browser, types a message that triggers your MCP tool, waits for ChatGPT to call it, and then asserts against the rendered app inside the host’s iframe.
Here’s a complete live test for an albums resource:
```ts
import { test, expect } from 'sunpeak/test';

test('albums tool renders photo grid', async ({ live }) => {
  const app = await live.invoke('show-albums');

  await expect(app.getByText('Summer Slice')).toBeVisible({ timeout: 15_000 });
  await expect(app.locator('img').first()).toBeVisible();

  // Switch to dark mode without re-invoking the tool
  await live.setColorScheme('dark', app);
  await expect(app.getByText('Summer Slice')).toBeVisible();
});
```
`live.invoke('show-albums')` starts a new chat, sends `/{appName} show-albums` to ChatGPT, waits for the LLM response to finish streaming, waits for the app iframe to render, and returns a Playwright `FrameLocator` pointed at your app’s content. From there, it’s standard Playwright assertions.
The `{ timeout: 15_000 }` accounts for the LLM response time. ChatGPT needs to process your message, decide to call the tool, receive the result, and render the iframe. In practice this takes 5 to 10 seconds.
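Waiting on an LLM-driven host boils down to polling a condition until it holds or a deadline passes. Here is a minimal, generic sketch of that pattern (illustrative only; `waitFor` is not part of sunpeak’s API, which handles this internally):

```ts
// Generic deadline-based polling helper (illustrative; not sunpeak's API).
// Re-checks the predicate every `intervalMs` until it passes or `timeoutMs` elapses.
async function waitFor(
  predicate: () => Promise<boolean> | boolean,
  timeoutMs = 15_000,
  intervalMs = 250,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await predicate()) return;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Condition not met within ${timeoutMs}ms`);
}
```

In practice Playwright’s `expect(...).toBeVisible({ timeout })` does this retry loop for you, which is why a single generous timeout on the first assertion is usually all a live test needs.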
Prerequisites
You need three things:
- A ChatGPT account with MCP/Apps support (Plus or higher)
- A tunnel tool like ngrok or Cloudflare Tunnel
- Your MCP server connected in ChatGPT (Settings > Apps > Create, enter your tunnel URL with the `/mcp` path)
You do not need to install anything extra in your sunpeak project. Live test infrastructure ships with sunpeak starting at v0.16.23. New projects scaffolded with `sunpeak new` include example live test specs and the Playwright config.
Running Live Tests
Open two terminals:
```sh
# Terminal 1: Start a tunnel
ngrok http 8000

# Terminal 2: Run live tests
pnpm test:live
```
On first run, sunpeak imports your ChatGPT session from your browser. It checks Chrome, Arc, Brave, and Edge automatically. If no valid session is found, it opens a browser window and waits for you to log in. The session is saved to `tests/live/.auth/chatgpt.json` and reused for 24 hours.
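The 24-hour reuse window is a simple timestamp comparison. A sketch of that check, assuming a saved-state shape with an epoch-millisecond `savedAt` field (the field name and helper are illustrative, not sunpeak’s actual storage format):

```ts
// Illustrative 24-hour session freshness check. The `savedAtMs` field and
// this helper are assumptions, not sunpeak's actual storage format.
const SESSION_TTL_MS = 24 * 60 * 60 * 1000;

function isSessionFresh(savedAtMs: number, nowMs: number = Date.now()): boolean {
  return nowMs - savedAtMs < SESSION_TTL_MS;
}
```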
After authentication, sunpeak:
- Starts `sunpeak dev --prod-resources` (production resource builds)
- Navigates to ChatGPT Settings > Apps, finds your MCP server, and clicks Refresh
- Runs all `tests/live/*.spec.ts` files fully in parallel, each in its own chat window
The MCP refresh happens once in `globalSetup`, before any test workers start. This means your test workers don’t each refresh the connection individually, which would be slow and flaky.
The Fixture API
All live tests import from `sunpeak/test`:

```ts
import { test, expect } from 'sunpeak/test';
```
The `test` function provides a `live` fixture with:

| Method | What it does |
|---|---|
| `invoke(prompt)` | Starts a new chat, sends the prompt with host-specific formatting, waits for the app iframe, returns a `FrameLocator` |
| `sendMessage(text)` | Sends a message in the current chat with the `/{appName}` prefix |
| `sendRawMessage(text)` | Sends a message without any prefix |
| `startNewChat()` | Opens a fresh conversation |
| `waitForAppIframe()` | Waits for the MCP app iframe and returns a `FrameLocator` |
| `setColorScheme(scheme, appFrame?)` | Switches to `'light'` or `'dark'` via `page.emulateMedia()` |
| `page` | Raw Playwright `Page` object |
Most tests only need `invoke` and `setColorScheme`. The `invoke` method handles the full flow: new chat, message formatting (ChatGPT requires `/{appName}` before your prompt), waiting for streaming to finish, waiting for the nested iframe to render, and returning a locator into your app’s content.
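Conceptually, the host-specific formatting step is a small pure function. A hedged sketch (the function name and host handling are illustrative; sunpeak’s internals may differ):

```ts
// Illustrative sketch of per-host prompt formatting. ChatGPT routes a message
// to an app via a /{appName} prefix; other hosts may not need one.
type Host = 'chatgpt' | 'claude';

function formatHostPrompt(host: Host, appName: string, prompt: string): string {
  return host === 'chatgpt' ? `/${appName} ${prompt}` : prompt;
}
```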
Theme Testing Without Re-Invocation
Sending a second message to trigger a new tool call is slow and burns credits. `setColorScheme` avoids that by switching the browser’s `prefers-color-scheme` via Playwright’s `page.emulateMedia()`. ChatGPT propagates the change into the iframe, and your app re-renders with the new theme.
```ts
test('ticket card text stays readable in dark mode', async ({ live }) => {
  const app = await live.invoke('show-ticket');

  const title = app.getByText('Search results not loading on mobile');
  await expect(title).toBeVisible({ timeout: 15_000 });

  // Verify status badge and assignee are visible in light mode
  await expect(app.getByText('in progress')).toBeVisible();
  await expect(app.getByText('Sarah Chen')).toBeVisible();

  // Switch to dark mode — common bugs: text blends into background,
  // borders disappear, badge colors lose contrast
  await live.setColorScheme('dark', app);

  // Same elements should still be visible with the new theme applied
  await expect(title).toBeVisible();
  await expect(app.getByText('in progress')).toBeVisible();
  await expect(app.getByText('Sarah Chen')).toBeVisible();

  // Badge background should still be distinguishable from the card
  const badge = app.locator('span:has-text("high")');
  const badgeBg = await badge.evaluate(
    (el) => window.getComputedStyle(el).backgroundColor
  );
  expect(badgeBg).not.toBe('rgba(0, 0, 0, 0)');
});
```
The second argument to `setColorScheme` tells it to wait for the app’s `<html data-theme="dark">` attribute to confirm the theme propagated through the iframe boundary before your assertions run.
A Full Example
Here’s a live test for a review card resource. It invokes the tool, checks the rendered content, verifies a button interaction triggers a state transition, and confirms the card re-themes correctly in dark mode:
```ts
import { test, expect } from 'sunpeak/test';

test('review card renders and handles approval flow', async ({ live }) => {
  const app = await live.invoke('review-diff');

  // Verify the card rendered with the right content
  const title = app.locator('h1').first();
  await expect(title).toBeVisible({ timeout: 15_000 });
  await expect(title).toHaveText('Refactor Authentication Module');

  // Action buttons present
  const applyButton = app.getByRole('button', { name: 'Apply Changes' });
  await expect(applyButton).toBeVisible();

  // Theme switch: card should stay readable in dark mode
  await live.setColorScheme('dark', app);
  await expect(title).toBeVisible();
  await expect(applyButton).toBeVisible();

  // Click Apply Changes — UI transitions to accepted state
  await applyButton.click();
  await expect(applyButton).not.toBeVisible({ timeout: 5_000 });
  await expect(
    app.locator('text=Applying changes...').first()
  ).toBeVisible({ timeout: 5_000 });
});
```
This catches real issues that simulator tests can miss: the iframe sandbox blocking a script load, a theme change not propagating through the nested iframe boundary, or a button click failing because of host-specific event handling.
The Playwright Config
The live test config is a one-liner:
```ts
// tests/live/playwright.config.ts
import { defineLiveConfig } from 'sunpeak/test/config';

export default defineLiveConfig();
```
This generates a full Playwright config with:
- `globalSetup` pointing to sunpeak’s auth and MCP refresh flow
- `headless: false` because chatgpt.com blocks headless browsers
- Anti-bot browser arguments and a real Chrome user agent
- 2-minute timeout per test (LLM responses can be slow)
- 1 retry per test (LLM responses are non-deterministic)
- Fully parallel execution (each test gets its own chat)
- Automatic dev server with `--prod-resources` on a dynamically allocated port
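For orientation, a hand-written Playwright config with roughly the same shape might look like this. This is an illustrative sketch, not sunpeak’s generated output: the `globalSetup` path is an assumption, and `defineLiveConfig()` remains the supported way to get these settings.

```ts
// Roughly equivalent hand-rolled config (illustrative only; use
// defineLiveConfig() in practice). The globalSetup path is hypothetical.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  globalSetup: './global-setup', // hypothetical: auth import + MCP refresh
  fullyParallel: true,           // each test gets its own chat
  retries: 1,                    // LLM responses are non-deterministic
  timeout: 120_000,              // LLM responses can be slow
  use: {
    headless: false,             // chatgpt.com blocks headless browsers
  },
});
```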
You can pass options to customize the environment:
```ts
export default defineLiveConfig({
  colorScheme: 'dark',
  viewport: { width: 1440, height: 900 },
  locale: 'fr-FR',
  timezoneId: 'Europe/Paris',
  geolocation: { latitude: 48.8566, longitude: 2.3522 },
  permissions: ['geolocation'],
});
```
How It Relates to Simulator Tests
Live tests don’t replace simulator tests. They complement them.
| | Simulator (`pnpm test:e2e`) | Live (`pnpm test:live`) |
|---|---|---|
| Runs against | Local simulator | Real ChatGPT |
| Speed | Seconds | 10-30 seconds per test |
| Cost | Free | Requires ChatGPT Plus |
| CI/CD | Yes | Not recommended (needs auth) |
| Catches | Component logic, display modes, themes, cross-host layout | Real MCP connection, LLM tool invocation, iframe sandbox, production resource loading |
Use simulator tests for development and CI/CD. Use live tests before shipping, after major changes, or when debugging issues that only reproduce in the real host.
The Testing Pyramid for Claude Connectors
A Claude Connector built with sunpeak now has three test tiers:
- Unit tests (`pnpm test`): Vitest, jsdom, fast; test component logic in isolation
- Simulator e2e tests (`pnpm test:e2e`): Playwright against the local ChatGPT and Claude simulator; test display modes and themes; runs in CI/CD
- Live tests (`pnpm test:live`): Playwright against real ChatGPT (with Claude coming soon); test real MCP protocol behavior and iframe rendering
Each tier catches different classes of bugs. Unit tests catch logic errors. Simulator tests catch rendering and layout issues across hosts and display modes. Live tests catch protocol and sandbox issues that only show up in the real host environment.
All three are pre-configured when you run sunpeak new. You don’t need to set up Vitest, Playwright, or any test infrastructure yourself.
Host-Agnostic Architecture
The live test infrastructure is designed to support multiple hosts. The live fixture resolves the correct host page object based on the Playwright project name. All host-specific DOM interaction (selectors, login flow, settings navigation, iframe nesting) lives in per-host page objects that sunpeak maintains.
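As a sketch, resolving the host from the project name could be as simple as a string match. This is hypothetical; sunpeak’s actual resolution logic may differ:

```ts
// Hypothetical host resolution from a Playwright project name.
// sunpeak's actual implementation may differ.
type Host = 'chatgpt' | 'claude';

function resolveHost(projectName: string): Host {
  return projectName.toLowerCase().includes('claude') ? 'claude' : 'chatgpt';
}
```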
Your test code is host-agnostic:
```ts
import { test, expect } from 'sunpeak/test';

test('my resource renders', async ({ live }) => {
  const app = await live.invoke('show me something');
  await expect(app.locator('h1')).toBeVisible();
});
```
This same test will run against any host that sunpeak supports. Today that’s ChatGPT. When Claude live testing ships, add it with one line:
```ts
// tests/live/playwright.config.ts
export default defineLiveConfig({ hosts: ['chatgpt', 'claude'] });
```
No changes to your test files.
Getting Started
If you have an existing sunpeak project, update to v0.16.23 or later:
```sh
pnpm add sunpeak@latest && sunpeak upgrade
```
Create `tests/live/playwright.config.ts`:

```ts
import { defineLiveConfig } from 'sunpeak/test/config';

export default defineLiveConfig();
```
Add the test script to `package.json`:

```json
{
  "scripts": {
    "test:live": "playwright test --config tests/live/playwright.config.ts"
  }
}
```
Write your first live test in `tests/live/your-resource.spec.ts`:

```ts
import { test, expect } from 'sunpeak/test';

test('my tool renders correctly in ChatGPT', async ({ live }) => {
  const app = await live.invoke('your prompt here');
  await expect(app.locator('your-selector')).toBeVisible({ timeout: 15_000 });
});
```
Start a tunnel, run `pnpm test:live`, and watch Playwright drive a real ChatGPT session.
New projects created with sunpeak new include all of this out of the box, with example live tests for every starter resource.
Get Started
```sh
pnpm add -g sunpeak && sunpeak new
```
Further Reading
- Complete guide to testing MCP Apps - covers unit tests, e2e tests, and simulation files
- MCP App CI/CD - run simulator tests in GitHub Actions
- Claude Connectors tutorial - build and deploy a connector from scratch
- Claude Connectors vs Claude Apps
- Claude simulator for MCP Apps
- Testing guide - full documentation
- Claude Connector framework
- MCP App framework
Frequently Asked Questions
What is live testing for MCP Apps?
Live testing runs automated Playwright tests against a real ChatGPT session instead of a local simulator. Your test sends a message to ChatGPT, ChatGPT calls your MCP tool, and your test asserts against the rendered app inside the real host iframe. This catches issues that simulator tests miss: real MCP connection behavior, actual LLM tool invocation, host-specific iframe rendering, and production resource loading.
How do I run live tests for my Claude Connector?
Start a tunnel to your MCP server (e.g., ngrok http 8000), then run pnpm test:live. sunpeak imports your ChatGPT session from your browser automatically, starts the dev server with production resource builds, refreshes the MCP connection in ChatGPT settings, and runs your tests/live/*.spec.ts files in parallel. Each test gets its own chat window.
Do I need a paid ChatGPT account for live testing?
Yes. Live tests run against real ChatGPT, which requires a ChatGPT Plus or higher subscription for MCP/Apps support. For free testing during development, use sunpeak simulator tests (pnpm test:e2e) which replicate the ChatGPT and Claude runtimes locally with no account required. Live tests are for final validation before shipping.
How does sunpeak handle ChatGPT authentication in live tests?
On first run, sunpeak imports cookies from your browser (Chrome, Arc, Brave, or Edge) automatically. If no session is found, it opens a browser window for you to log in manually. The session is saved to tests/live/.auth/chatgpt.json and reused for 24 hours. All parallel test workers share the same session.
Can I test light mode and dark mode in a single live test?
Yes. The live fixture provides a setColorScheme method that switches the browser color scheme via Playwright page.emulateMedia(). Call live.setColorScheme("dark", app) after your initial assertions to switch themes without a second tool invocation. The method waits for the app iframe to confirm the theme change.
How is live testing different from e2e testing with the sunpeak simulator?
Simulator e2e tests (pnpm test:e2e) run against a local replica of ChatGPT and Claude. They are fast, free, and run in CI/CD without any accounts. Live tests (pnpm test:live) run against the real ChatGPT website. They catch host-specific rendering issues, real MCP protocol behavior, and iframe sandbox edge cases that the simulator cannot replicate. Use simulator tests for development, live tests for pre-ship validation.
What does the live test fixture API look like?
Import test and expect from sunpeak/test. The live fixture provides invoke(prompt) to start a new chat and get the app iframe, sendMessage(text) with automatic host formatting, setColorScheme(scheme) for theme switching, and the raw Playwright page object. A typical test is about 10 lines: invoke a tool, assert the rendered content, switch themes, assert again.
Can I run live tests against both ChatGPT and Claude?
The live testing infrastructure is host-agnostic. Tests import from sunpeak/test and use a live fixture that resolves the correct host based on the Playwright project name. ChatGPT is supported today. Claude support is coming. When it ships, add it with a one-line config change: defineLiveConfig({ hosts: ["chatgpt", "claude"] }).