Live Testing for Claude Connectors and ChatGPT Apps: Test Against Real Hosts with Playwright (June 2026)

Abe Wheeler
Playwright running live tests against a real ChatGPT session with a Claude Connector.

Local tests should carry most of your MCP App test suite. They are fast, repeatable, and cheap. With the sunpeak inspector, you can render the same UI resource in ChatGPT and Claude host modes, switch themes and display modes, load simulation fixtures, and run Playwright assertions without a real host account.

Live tests solve a different problem. They prove that your app still works after it leaves your local replica and enters the real host. A real ChatGPT or Claude session has real connector settings, real tool discovery, real model routing, real iframe sandboxing, real CSP checks, and real account state. Those are exactly the places where “works locally” can still fail.

TL;DR: Keep your full regression suite local with pnpm test, pnpm test:e2e, and pnpm test:visual. Add a small pnpm test:live smoke suite for real ChatGPT and Claude validation before release. Live tests should prove that the host can discover your tools, invoke the right tool from a prompt, load the production resource, preserve theme and state, and allow the user path that matters most.

What Changed Since the April Version

The biggest change is that live testing is no longer just “drive ChatGPT once before shipping.” The broader MCP App ecosystem has moved toward standard app resources and cross-host behavior:

  • The MCP Apps extension now describes apps as tool-linked ui:// resources rendered in sandboxed iframes, with JSON-RPC communication over postMessage, host context, display modes, and tool calls flowing through the host bridge.
  • OpenAI’s Apps SDK docs now use the ChatGPT Apps and Connectors flow: enable developer mode, create a connector from Settings, provide a public /mcp endpoint, test prompts in a real conversation, and refresh connector metadata when tools or descriptions change.
  • Claude custom connectors now run through remote MCP servers, which means your server must be reachable from the host over the public internet. Interactive connectors can render inline cards and fullscreen views inside Claude conversations.
  • sunpeak’s testing story has expanded to cover server-agnostic testing, cross-host inspector tests, visual regression tests, multi-model evals, and live browser tests against real hosts.

That means the useful live-test question is more specific now: does the real host path still work for the one or two flows that would block a launch?
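
To make the first bullet above concrete, here is a minimal sketch of the JSON-RPC 2.0 envelope that a host bridge of this kind exchanges. The `tools/call` method and its `name`/`arguments` params come from the core MCP specification; the helper names, the `show_albums` tool, and its arguments are hypothetical, and a real bridge adds its own `postMessage` transport and validation around this.

```typescript
// Minimal JSON-RPC 2.0 shapes (illustrative; not the types of any particular SDK).
interface JsonRpcRequest {
  jsonrpc: '2.0';
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

interface JsonRpcResponse {
  jsonrpc: '2.0';
  id: number;
  result?: unknown;
  error?: { code: number; message: string };
}

let nextRequestId = 1;

// Build a request envelope; each call gets a fresh id so responses can be matched.
function makeRequest(method: string, params?: Record<string, unknown>): JsonRpcRequest {
  const req: JsonRpcRequest = { jsonrpc: '2.0', id: nextRequestId++, method };
  if (params !== undefined) req.params = params;
  return req;
}

// A response answers a request when the ids match.
function isResponseFor(req: JsonRpcRequest, res: JsonRpcResponse): boolean {
  return res.jsonrpc === '2.0' && res.id === req.id;
}

// Hypothetical tool name and arguments, for illustration only.
const albumsRequest = makeRequest('tools/call', {
  name: 'show_albums',
  arguments: { limit: 10 },
});
console.log(JSON.stringify(albumsRequest));
```

Matching responses by `id` is what lets the app and host keep several requests in flight over a single message channel.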

What a Live Test Actually Does

A live test opens a real host session, sends a prompt that should trigger your tool, waits for the model and host runtime to finish, and then asserts against the app iframe.

For a ChatGPT App, that path looks like this:

  1. ChatGPT has your MCP server connected as a connector.
  2. Your test opens a new ChatGPT conversation with that connector available.
  3. The prompt causes the model to call your MCP tool.
  4. ChatGPT fetches or uses the linked UI resource, then renders it in a sandboxed iframe.
  5. Playwright finds the iframe and asserts against your UI.

For a Claude Connector, the same high-level path applies, but setup and host UI differ. Claude connects to remote MCP servers through Claude’s connector flow, and interactive connectors may render inline cards or fullscreen views.

Here is the shape of a live test:

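import { test, expect } from 'sunpeak/test';

test('albums tool renders in the real host', async ({ live }) => {
  // Send the prompt in a real host session; resolves to a locator for the app iframe.
  const app = await live.invoke('Show my photo albums');

  // Generous timeout: the model, the tool call, and the iframe load all happen first.
  await expect(app.getByRole('heading', { name: /albums/i })).toBeVisible({
    timeout: 20_000,
  });
  await expect(app.locator('img').first()).toBeVisible();

  // Switch the host theme and confirm the app is still rendered.
  await live.setColorScheme('dark', app);
  await expect(app.getByRole('heading', { name: /albums/i })).toBeVisible();
});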

The important detail is that the assertion targets your app, not the model’s prose. A model may phrase its response differently from run to run, but the app either loaded with the expected UI state or it did not.

Where Live Tests Fit in the Test Suite

Live tests are the top of the pyramid, not the base.

Layer | Command | Runs against | Best for
--- | --- | --- | ---
Unit tests | pnpm test:unit | Tool handlers and component logic | Schemas, transformations, error paths
Integration tests | pnpm test or targeted test files | MCP server protocol calls | Tool output contracts and server behavior
Inspector e2e tests | pnpm test:e2e | Local ChatGPT and Claude runtime replicas | UI rendering, themes, display modes, app state
Visual tests | pnpm test:visual | Local inspector screenshots | Layout regressions across hosts and modes
Evals | pnpm test:eval | Real models through APIs | Tool selection reliability
Live tests | pnpm test:live | Real ChatGPT and Claude sessions | Final host integration smoke tests

If you can test something locally, test it locally. Empty states, long strings, 500 errors, 200-item lists, OAuth denial, and mobile layout should be inspector tests with fixed fixtures. Live tests should cover the real host path that a local inspector cannot fully own.

What Live Tests Catch

Live tests are worth the friction because they cover failure modes that only appear after you connect to a real host.

Tool discovery and metadata drift. ChatGPT and Claude both cache or remember connector metadata in host-specific ways. If you rename a tool, change a description, or update a linked resource, the host may need a refresh before the new version is used. A live test catches stale metadata because the prompt will call the wrong tool, call no tool, or render the old UI.

Model routing. Local inspector tests can call a tool directly. Real users do not. They ask a question, and the model chooses a tool from your schema and descriptions. Live tests tell you whether your “golden prompts” still map to the right tool in the real host.

Sandbox and CSP behavior. MCP Apps render in sandboxed iframes. Network requests, scripts, fonts, images, nested iframes, and external links depend on the host’s CSP interpretation. The official MCP Apps docs call out _meta.ui.csp, and OpenAI’s Apps SDK docs include host-specific CSP metadata. A local test should cover your intended allowlist, but a live test proves the host accepts it.
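
As a rough illustration of the local half of that check, the sketch below compares the URLs an app is known to request against an origin allowlist and reports the origins that are not covered. The function name, the example domains, and the flat list of origins are assumptions for illustration; the real metadata shape depends on the host, and only a live run shows whether the host actually enforces the allowlist the way you expect.

```typescript
// Return each distinct origin used by `urls` that is missing from `allowedOrigins`.
function uncoveredOrigins(urls: string[], allowedOrigins: string[]): string[] {
  const allowed = allowedOrigins.map((o) => new URL(o).origin);
  const missing: string[] = [];
  for (const raw of urls) {
    const candidate = new URL(raw).origin;
    if (allowed.indexOf(candidate) === -1 && missing.indexOf(candidate) === -1) {
      missing.push(candidate);
    }
  }
  return missing;
}

// Hypothetical allowlist and the URLs a production bundle is known to request.
const allowlist = ['https://api.example.com'];
const requestedUrls = [
  'https://api.example.com/v1/tickets',
  'https://cdn.example.com/fonts/inter.woff2',
  'https://cdn.example.com/logo.png',
];

const missingOrigins = uncoveredOrigins(requestedUrls, allowlist);
console.log(missingOrigins); // [ 'https://cdn.example.com' ]
```

A check like this belongs in the local suite; the live test then only has to confirm that images, fonts, and API calls really load inside the host iframe.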

Production resource loading. Development servers hide packaging mistakes. Live tests should run against production-like resource bundles so they catch missing assets, wrong MIME types, broken dynamic imports, and CDN paths that only exist after build.

Auth and account state. OAuth, account linking, expired sessions, revoked scopes, and org-level connector settings often behave differently in a real host. Live tests do not replace security tests, but they are useful for the “can a real account still connect and invoke the app?” check.

Prerequisites

You need a few pieces in place before pnpm test:live is useful:

  1. Your MCP server must be reachable over HTTPS. For local development, use ngrok, Cloudflare Tunnel, or OpenAI’s Secure MCP Tunnel where appropriate.
  2. Your connector must be added in the host. In ChatGPT, OpenAI documents this under Settings > Apps & Connectors / Connectors after developer mode is enabled. In Claude, Anthropic documents custom connectors through remote MCP server URLs.
  3. Your test account needs access to the host feature. For ChatGPT, confirm developer mode and connector creation are enabled for the account or organization. For Claude, confirm custom connectors are enabled for the plan or workspace.
  4. Your app should already pass local tests. If pnpm test:e2e or pnpm test:visual is red, live testing will mostly waste time.

Do not wait until the live test step to discover basic UI states. Build those with simulation files, then keep the live suite small.
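
A quick pre-flight check on the connector URL can catch the most common setup mistake before a host session is ever opened. The sketch below is a hypothetical helper, not part of sunpeak; it assumes the endpoint path ends in /mcp, which matches the flow described earlier but may not apply to every server.

```typescript
// Return a list of problems with a candidate MCP endpoint URL; an empty list means it looks usable.
function checkMcpEndpoint(raw: string): string[] {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return ['not a valid URL'];
  }
  const problems: string[] = [];
  if (url.protocol !== 'https:') problems.push('must use https');
  // Loopback hosts are only reachable from this machine, not from a real host.
  if (['localhost', '127.0.0.1', '[::1]'].indexOf(url.hostname) !== -1) {
    problems.push('host is not publicly reachable');
  }
  // Assumption: the server exposes its MCP endpoint at a path ending in /mcp.
  if (url.pathname.slice(-4) !== '/mcp') problems.push('path does not end in /mcp');
  return problems;
}

console.log(checkMcpEndpoint('http://localhost:8000/mcp'));
// [ 'must use https', 'host is not publicly reachable' ]
console.log(checkMcpEndpoint('https://example.ngrok.app/mcp'));
// []
```

This only validates the shape of the URL. It does not prove the tunnel is up, so a real reachability check still needs a request from outside your network.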

Configure Live Tests

In an existing sunpeak project, add a live Playwright config:

// tests/live/playwright.config.ts
import { defineLiveConfig } from 'sunpeak/test/live/config';

export default defineLiveConfig({
  hosts: ['chatgpt', 'claude'],
  colorScheme: 'light',
  viewport: { width: 1440, height: 900 },
});

Use one host first if you are still setting up the account:

export default defineLiveConfig({
  hosts: ['chatgpt'],
});

Then add a focused test:

// tests/live/checkout-smoke.spec.ts
import { test, expect } from 'sunpeak/test';

test('real host can render checkout approval flow', async ({ live }) => {
  const app = await live.invoke('Review the pending checkout request');

  await expect(app.getByRole('heading', { name: /checkout/i })).toBeVisible({
    timeout: 20_000,
  });

  const approve = app.getByRole('button', { name: /approve/i });
  await expect(approve).toBeVisible();
  await approve.click();

  await expect(app.getByText(/approved|submitted|confirmed/i)).toBeVisible({
    timeout: 10_000,
  });
});

This is a better live test than “does every row in every table render?” because it validates the user path that matters: the host invokes the tool, the UI loads, a real interaction works, and state changes visibly.

Run the Suite

The usual local flow is:

# Terminal 1: expose the MCP server if your config does not start a tunnel
ngrok http 8000

# Terminal 2: run local tests first
pnpm test
pnpm test:e2e

# Terminal 3: run the live smoke suite
pnpm test:live

New sunpeak projects include test wiring out of the box. For existing projects, update before adding live tests:

pnpm add sunpeak@latest
sunpeak upgrade

As of this refresh, the latest published sunpeak package is 0.20.49, but use sunpeak@latest in docs and project setup so the command follows the current release.

Write Prompts Like Test Inputs

A live test prompt should be boring and direct. The model is part of the system under test, so vague prompts create noise.

Good live prompt:

Show my open support tickets in the ticket dashboard.

Weak live prompt:

What should I look at today?

The second prompt may be a useful eval case, but it is a poor live smoke test because too many correct model behaviors are possible. If the goal is to validate host integration, give the model a clean path to the intended tool.

For discovery quality, put those broader prompts in MCP App evals. Evals can run each prompt many times and measure pass rate. Live tests should answer a smaller question: can the real host still run the app path we expect?
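
To illustrate the difference, an eval reduces many runs of a prompt to a pass rate, while a live smoke test is a single pass-or-fail run. The sketch below shows only the pass-rate arithmetic; the `EvalRun` shape, the tool names, and the sample runs are made up for illustration and are not sunpeak's eval format.

```typescript
interface EvalRun {
  prompt: string;
  calledTool: string | null; // tool the model chose, or null when it chose none
}

// Fraction of runs in which the model picked the expected tool.
function passRate(runs: EvalRun[], expectedTool: string): number {
  if (runs.length === 0) return 0;
  const passed = runs.filter((r) => r.calledTool === expectedTool).length;
  return passed / runs.length;
}

// Made-up sample: the same vague prompt run four times.
const sampleRuns: EvalRun[] = [
  { prompt: 'What should I look at today?', calledTool: 'list_tickets' },
  { prompt: 'What should I look at today?', calledTool: 'list_tickets' },
  { prompt: 'What should I look at today?', calledTool: null },
  { prompt: 'What should I look at today?', calledTool: 'list_tickets' },
];

console.log(passRate(sampleRuns, 'list_tickets')); // 0.75
```

A vague prompt can be acceptable at a 75% pass rate in an eval, but the same prompt in a live smoke test would fail one run in four for reasons unrelated to host integration.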

Keep Assertions Stable

Live tests depend on real hosts, so assertions need to avoid brittle details.

Prefer:

  • Role-based selectors inside your app iframe.
  • Durable labels, headings, and data-testid attributes.
  • Assertions on app state, not model wording.
  • expect(locator).toBeVisible() and other auto-retrying Playwright assertions.
  • One user path per test.

Avoid:

  • Exact text from the host response outside your app iframe.
  • Fixed sleeps like waitForTimeout(5000).
  • Pixel-perfect screenshots in real hosts.
  • Tests that depend on today’s real data unless you control that data in a test account.
  • Big test files that try to cover every feature in one live session.

If you need pixel checks, use visual regression tests against the sunpeak inspector. Real hosts can change spacing, container chrome, fonts, and loading behavior without your app changing.

Test Theme and Display Mode Carefully

MCP Apps receive host context such as theme, display mode, locale, timezone, platform, and container size. The official MCP Apps docs describe theme changes and display modes as host-controlled. That means your app should respond when the host changes context, but the host decides which modes it supports and when a mode request is honored.

For live tests, check only the modes you truly depend on:

test('ticket details remain readable after dark mode switch', async ({ live }) => {
  const app = await live.invoke('Open ticket TICK-1042');

  const title = app.getByRole('heading', { name: /TICK-1042/i });
  await expect(title).toBeVisible({ timeout: 20_000 });

  await live.setColorScheme('dark', app);

  await expect(title).toBeVisible();
  await expect(app.getByRole('button', { name: /assign/i })).toBeVisible();
});

Do not turn live tests into a full theme matrix. Run that matrix locally with inspector tests because local tests can render the same simulation in every host, theme, and display mode without spending host time.

Refresh Host Metadata When Tools Change

Live tests often fail for a simple reason: the host is using old metadata.

OpenAI’s docs tell developers to refresh connector metadata after changing tools or descriptions. In practice, refresh whenever you change:

  • Tool names.
  • Tool descriptions.
  • Input schemas.
  • Output schemas.
  • Linked UI resource URIs.
  • App metadata that affects discovery.

sunpeak’s live setup can automate host refresh steps where the host exposes them, but you should still treat refresh as part of your release checklist. If a live test suddenly calls an old tool after a deploy, stale metadata is one of the first things to check.
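
One way to make "did anything that needs a refresh change?" mechanical is to fingerprint the fields listed above and compare the value between deploys. The sketch below is a hypothetical helper, not a sunpeak feature; the `ToolMeta` shape is an assumption, and the FNV-1a hash is used only for change detection (a SHA-256 digest would work equally well).

```typescript
// Assumed shape: the metadata fields that affect discovery and rendering.
interface ToolMeta {
  name: string;
  description: string;
  inputSchema: unknown;
  outputSchema?: unknown;
  resourceUri?: string;
}

// Serialize with sorted object keys so key order does not change the fingerprint.
function stableStringify(value: unknown): string {
  if (value === undefined) return 'null';
  if (Array.isArray(value)) return '[' + value.map((v) => stableStringify(v)).join(',') + ']';
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    return '{' + keys.map((k) => JSON.stringify(k) + ':' + stableStringify(obj[k])).join(',') + '}';
  }
  return JSON.stringify(value);
}

// 32-bit FNV-1a over UTF-16 code units, as hex. Not cryptographic.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    // Multiply by the FNV prime 16777619 using shifts, then wrap to unsigned 32 bits.
    hash = (hash + (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)) >>> 0;
  }
  return hash.toString(16);
}

// Fingerprint all tools, independent of the order in which they are listed.
function toolsFingerprint(tools: ToolMeta[]): string {
  const sorted = tools.slice().sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
  return fnv1a(stableStringify(sorted));
}

// Hypothetical tool metadata before and after a description edit.
const toolsBefore: ToolMeta[] = [
  { name: 'list_tickets', description: 'List tickets', inputSchema: { type: 'object' } },
];
const toolsAfter: ToolMeta[] = [
  { name: 'list_tickets', description: 'List Tickets', inputSchema: { type: 'object' } },
];

const needsRefresh = toolsFingerprint(toolsBefore) !== toolsFingerprint(toolsAfter);
console.log(needsRefresh); // true
```

Storing the fingerprint alongside each deploy gives the release checklist a simple rule: if the value changed, refresh the connector before running the live suite.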

What Not to Put in Live Tests

Live tests are expensive in time and attention. Keep these out:

Every edge case. Empty lists, null fields, long text, failed API calls, and loading states belong in simulation files and inspector tests.

Every supported host mode. If you support inline, fullscreen, and picture-in-picture, test the full matrix locally. In live tests, cover the mode your primary user path needs.

Exact model prose. Assert that your UI loaded and that key data exists. Do not assert that the model says a fixed sentence.

Destructive real actions. Use test accounts, test workspaces, fake payment methods, and reversible actions. If a flow can delete or send something, use a review step or a sandbox service.

Broad app crawls. A live test should be a smoke test, not a manual QA script translated into Playwright.

A Practical Release Workflow

Use this sequence before shipping a meaningful MCP App or connector change:

  1. Run pnpm test for unit and integration coverage.
  2. Run pnpm test:e2e for local inspector coverage.
  3. Run pnpm test:visual when UI changed.
  4. Run targeted evals if tool descriptions changed.
  5. Deploy or start the production-like MCP server.
  6. Refresh connector metadata in ChatGPT and Claude.
  7. Run pnpm test:live.
  8. Manually inspect one conversation only if the live test finds something ambiguous.

The point is to find most problems before the real host ever opens. Live tests should be the final confirmation that the host path works, not the first serious test your app sees.
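
The sequence above can also be sketched as a small planner that picks the conditional steps from what changed. The pnpm commands are the ones used in this post; the `ChangeSet` and `Step` types and the function name are hypothetical, and the manual steps are left as text because they depend on your deploy process and the host UI.

```typescript
interface ChangeSet {
  uiChanged: boolean;
  toolDescriptionsChanged: boolean;
}

interface Step {
  kind: 'command' | 'manual';
  text: string;
}

// Build the ordered pre-release checklist for a given change.
function releaseSteps(change: ChangeSet): Step[] {
  const steps: Step[] = [
    { kind: 'command', text: 'pnpm test' },
    { kind: 'command', text: 'pnpm test:e2e' },
  ];
  if (change.uiChanged) steps.push({ kind: 'command', text: 'pnpm test:visual' });
  if (change.toolDescriptionsChanged) steps.push({ kind: 'command', text: 'pnpm test:eval' });
  steps.push({ kind: 'manual', text: 'Deploy or start the production-like MCP server' });
  steps.push({ kind: 'manual', text: 'Refresh connector metadata in ChatGPT and Claude' });
  // Live tests always run last, after the host has fresh metadata.
  steps.push({ kind: 'command', text: 'pnpm test:live' });
  return steps;
}

const releasePlan = releaseSteps({ uiChanged: true, toolDescriptionsChanged: false });
for (const step of releasePlan) {
  console.log(`[${step.kind}] ${step.text}`);
}
```

Keeping the live step last in code mirrors the intent of the checklist: nothing reaches a real host until every cheaper layer has passed.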

Debugging Live Test Failures

When a live test fails, sort it into one of these buckets before changing code:

The host did not call the tool. Check that the connector is enabled in the conversation, that tool metadata has been refreshed, that the prompt is direct enough, and that the tool description matches the user intent.

The tool ran but returned bad data. Check server logs, auth, input arguments, and output schema. Reproduce with an integration test through the mcp fixture so you can debug without the host.

The iframe did not render. Check resource URI, MIME type, production bundle path, CSP, console errors, and whether the host blocked a nested resource.

The UI rendered but the assertion failed. Check whether the selector is brittle, whether data changed, whether the app is still loading, or whether the host changed display mode or viewport.

The test account failed. Check expired sessions, revoked OAuth grants, org policy changes, rate limits, and whether the connector was disabled for that conversation.

Once you know the bucket, add the lowest-level regression test that would have caught it. A failure caused by bad structured data should become an integration test. A failure caused by a CSS bug should become an inspector or visual test. A failure caused by host metadata should stay in the live smoke suite.
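
The triage order above can be written down as a small decision function. This is only a sketch: the signal names are hypothetical, and in practice you would fill them in from server logs, the host UI, and the Playwright trace rather than from a single object.

```typescript
type FailureBucket =
  | 'test-account'
  | 'tool-not-called'
  | 'bad-tool-data'
  | 'iframe-not-rendered'
  | 'assertion-failed';

// Hypothetical signals gathered from logs and traces after a failed live run.
interface LiveRunSignals {
  sessionValid: boolean;    // test account logged in, connector enabled
  toolCalled: boolean;      // the MCP server saw a tool call
  toolResultValid: boolean; // the tool returned data matching its schema
  iframeRendered: boolean;  // the host rendered the app iframe
}

// Walk the request path from the outside in and stop at the first broken stage.
function classifyFailure(s: LiveRunSignals): FailureBucket {
  if (!s.sessionValid) return 'test-account';
  if (!s.toolCalled) return 'tool-not-called';
  if (!s.toolResultValid) return 'bad-tool-data';
  if (!s.iframeRendered) return 'iframe-not-rendered';
  return 'assertion-failed';
}

const failureBucket = classifyFailure({
  sessionValid: true,
  toolCalled: true,
  toolResultValid: true,
  iframeRendered: false,
});
console.log(failureBucket); // iframe-not-rendered
```

Checking the stages in path order avoids debugging a selector when the real problem is that the tool was never called.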

The Role of sunpeak

You can live-test MCP Apps with raw Playwright, but you end up rewriting the same host-specific plumbing for every host: authentication, new-chat setup, connector selection, host refresh, prompt formatting, iframe traversal, theme switching, and waiting for streaming to finish.

sunpeak wraps those pieces in the live fixture so your test stays focused on product behavior:

const app = await live.invoke('Show open incidents');
await expect(app.getByRole('heading', { name: /open incidents/i })).toBeVisible();

That is also why the local inspector matters. The live fixture is for the real host path. The inspector fixture is for the broad regression matrix. Together they give you a workflow that matches how MCP Apps ship: test most states locally, then prove the real host still accepts and renders the app.

Start with the testing framework, then add one live smoke test for your highest-value flow. If that one test catches stale metadata, broken CSP, or a host-only iframe issue before launch, it has paid for itself.

Get Started

npx sunpeak new

Frequently Asked Questions

What is live testing for MCP Apps?

Live testing runs automated Playwright tests against a real AI host, usually ChatGPT or Claude, instead of only a local inspector. The test sends a real prompt, the host decides whether to call your MCP tool, the host renders your UI resource in its sandboxed iframe, and Playwright asserts against the rendered app. This catches connection, tool discovery, host iframe, CSP, OAuth, and production resource issues that local tests may not reproduce.

How do I run live tests for a Claude Connector or ChatGPT App?

Expose your MCP server over HTTPS with a tunnel, connect it in the host, then run pnpm test:live. In sunpeak projects, defineLiveConfig starts the app in a production-like resource mode, refreshes host metadata when needed, opens real host sessions, invokes your tool with prompts, and returns a Playwright locator for the rendered iframe. New sunpeak projects include live test examples, and existing projects can add tests/live/playwright.config.ts.

Do I need a paid ChatGPT account for live testing?

Not always. OpenAI documents ChatGPT Apps support across ChatGPT plans, but your account or organization still needs developer mode and connector access enabled. If developer mode is blocked by plan, workspace policy, or rollout state, live ChatGPT tests cannot run for that account. Local sunpeak inspector tests do not need a ChatGPT account, Claude account, host credits, or a public tunnel.

Can I run live tests against both ChatGPT and Claude?

Yes. Configure live hosts in tests/live/playwright.config.ts, for example defineLiveConfig({ hosts: ["chatgpt", "claude"] }). The same test can run against each configured host, while sunpeak keeps host-specific login, prompt formatting, settings navigation, refresh behavior, and iframe lookup behind the live fixture.

What does live testing catch that inspector tests miss?

Live tests catch issues in real host integration: stale tool metadata, connector refresh problems, actual model tool selection, OAuth or account linking failures, production bundle loading, iframe sandbox restrictions, CSP allowlist mistakes, real host theme propagation, and layout bugs caused by the host container. Inspector tests should still cover most UI states because they are faster, deterministic, and safe to run in CI.

Should live tests run in CI/CD?

Usually no. Run unit, integration, e2e, visual, and cross-host inspector tests on every pull request. Run live tests manually before release, on a scheduled smoke test, or from a protected workflow that has a dedicated test account. Live tests depend on real hosts, sessions, model behavior, network conditions, and rate limits, so they are less stable than local inspector tests.

How do I keep live tests from becoming flaky?

Keep the live suite small, use specific prompts, assert on durable UI states, avoid exact model wording, prefer web-first Playwright assertions over sleeps, isolate each test in a fresh conversation, and run local inspector tests for edge cases. Use live tests as a smoke suite for the real host path, not as your full regression suite.

How is live testing different from MCP Inspector or the sunpeak inspector?

MCP Inspector and the sunpeak inspector are local development and testing tools. They let you inspect tools, render UI resources, load simulation fixtures, and run deterministic Playwright tests without a real host account. Live testing opens ChatGPT or Claude and validates the production integration path. Use the inspector while building and in CI, then use live tests as final host validation.