MCP App Testing Strategy: Which Tests to Write First and What to Skip (May 2026)
A practical testing strategy for MCP Apps across ChatGPT and Claude.
TL;DR: Start with integration tests on tool handlers (highest ROI), then add e2e tests for resource rendering, then cross-host tests before publishing. Add unit tests, snapshots, visual regression, and evals as your app grows. Skip testing the protocol layer, host CSS variables, and anything the framework handles for you.
MCP Apps have more test surfaces than standard web apps. Your code runs server-side in a tool handler, client-side in a resource component, inside an iframe controlled by the host, across multiple hosts (ChatGPT, Claude, VS Code, Goose), in multiple display modes (inline, PiP, fullscreen), and with data flowing through the MCP protocol between each layer. You could write tests for every combination. You shouldn’t.
This post is a prioritized testing strategy for MCP Apps, ChatGPT Apps, and Claude Connectors. It tells you which tests to write first, which to add later, and which to skip entirely, based on where bugs actually happen in production MCP Apps.
The MCP App Testing Pyramid
The standard testing pyramid (unit → integration → e2e) applies to MCP Apps, but with extra layers. Here is the full pyramid from fastest and cheapest at the bottom to slowest and most expensive at the top:
- Unit tests: Test tool handlers as plain functions and resource components with mocked hooks. Run in milliseconds with Vitest and happy-dom. No browser, no server.
- Integration tests: Test tool handlers through the running MCP server using the mcp fixture. Catches protocol-boundary bugs. No browser needed.
- E2E tests: Render resource components in simulated ChatGPT and Claude runtimes using the inspector fixture. Full browser rendering with Playwright.
- Cross-host tests: E2E tests that run across both host modes to catch CSS variable differences, protocol timing differences, and host-specific edge cases.
- Evals: Send prompts to real LLMs (GPT-4o, Claude, Gemini) and verify they call the right tools with the right arguments. Cost API credits per run.
Write more tests at the bottom and fewer at the top. Most teams should start in the middle, at layer 2.
Start Here: Integration Tests on Tool Handlers
If you write one type of test for your MCP App, make it integration tests on your tool handlers.
Tool handlers are the highest-ROI test target because every other layer depends on their output. Your resource component renders structuredContent from the tool handler. If the handler returns the wrong shape, the wrong data, or an unhandled error, the resource breaks. Integration tests with the mcp fixture catch these problems at the protocol boundary, where most production bugs live.
import { test, expect } from 'sunpeak/test';

test('search tool returns albums with required fields', async ({ mcp }) => {
  const result = await mcp.callTool('search-albums', {
    query: 'vacation',
  });
  expect(result.isError).toBeFalsy();
  const content = result.structuredContent;
  expect(content.albums).toBeInstanceOf(Array);
  expect(content.albums[0]).toHaveProperty('title');
  expect(content.albums[0]).toHaveProperty('coverUrl');
  expect(content.albums[0]).toHaveProperty('photoCount');
});
This test runs your real tool handler through the real MCP server. It catches shape mismatches (renamed fields, missing properties), Zod validation errors (wrong types from the host), serialization bugs, and unhandled exceptions. Unit tests miss these because they import the handler directly and bypass the protocol.
Write one integration test per tool. For each test, call the tool with typical arguments and assert on the structuredContent shape. This takes about 10 minutes per tool and catches the class of bugs that causes the most user-visible breakage.
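The assertions above only inspect the first album. As a rough sketch, a small type guard can check every item in the array instead. The Album and AlbumSearchContent interfaces and the helper names below are hypothetical; the field names are taken from the assertions in the test above, and none of this is part of sunpeak.

```typescript
// Hypothetical output shape for search-albums, based on the fields the
// integration test above asserts on.
interface Album {
  title: string;
  coverUrl: string;
  photoCount: number;
}

interface AlbumSearchContent {
  albums: Album[];
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isAlbum(value: unknown): value is Album {
  return (
    isRecord(value) &&
    typeof value.title === "string" &&
    typeof value.coverUrl === "string" &&
    typeof value.photoCount === "number"
  );
}

// True only when albums is an array and every entry has all three fields.
function isAlbumSearchContent(value: unknown): value is AlbumSearchContent {
  return isRecord(value) && Array.isArray(value.albums) && value.albums.every(isAlbum);
}
```

Inside an integration test, a single assertion such as expect(isAlbumSearchContent(result.structuredContent)).toBe(true) would then cover the whole array. Note that an empty albums array passes this guard, so keep a separate length assertion if the test expects results.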
Run with:
pnpm test:e2e
Second: E2E Tests for Resource Rendering
After your tool handlers have integration tests, add e2e tests for your resource components. E2E tests verify that your React components render correctly when they receive real tool output inside a simulated host runtime.
import { test, expect } from 'sunpeak/test';

test('album grid renders search results', async ({ inspector }) => {
  const result = await inspector.renderTool('search-albums', {
    query: 'vacation',
  });
  const app = result.app();
  await expect(app.getByRole('heading')).toContainText('vacation');
  await expect(app.locator('.album-card')).toHaveCount(4);
});
The inspector fixture renders your resource in a real browser with Playwright, inside a simulated ChatGPT or Claude runtime. This catches rendering bugs, CSS issues, broken images, and state handling problems (loading, error, cancelled) that integration tests don’t cover because they never render HTML.
Focus your e2e tests on three things:
- Happy path rendering. The component shows the right content when useToolData returns valid data.
- Error states. The component handles isError, isLoading, and isCancelled without crashing.
- Display mode layouts. The component looks right in inline, PiP, and fullscreen modes if your app supports display mode transitions.
You don’t need an e2e test for every edge case. Save edge cases for unit tests, which run faster.
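One way to keep state edge cases out of e2e tests is to move the state decision into a pure function that a unit test can exercise directly. The sketch below assumes the four fields the article reads from useToolData (output, isLoading, isError, isCancelled). The ToolDataState type, the ViewState names, and the precedence order (error, then cancelled, then loading) are illustrative assumptions, not sunpeak behavior.

```typescript
// Hypothetical mirror of the fields read from useToolData in this article.
interface ToolDataState<T> {
  output: T | null;
  isLoading: boolean;
  isError: boolean;
  isCancelled: boolean;
}

type ViewState = "loading" | "error" | "cancelled" | "empty" | "ready";

// Maps the raw hook state to exactly one view state.
// Assumed precedence: error > cancelled > loading > empty > ready.
function resolveAlbumViewState(state: ToolDataState<{ albums: unknown[] }>): ViewState {
  if (state.isError) return "error";
  if (state.isCancelled) return "cancelled";
  if (state.isLoading || state.output === null) return "loading";
  return state.output.albums.length === 0 ? "empty" : "ready";
}
```

The component then switches on the returned value, and the e2e suite only needs to confirm that each view state renders, not every combination of flags that leads to it.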
Third: Cross-Host Tests Before Publishing
Before you submit your app to the ChatGPT App Store or the Claude Connectors Directory, add cross-host tests. These are e2e tests that run across both ChatGPT and Claude host modes to catch host-specific bugs.
ChatGPT and Claude have different CSS variable names, different iframe sandboxing behavior, different protocol timing, and different annotation requirements. A resource that looks right in ChatGPT can overflow, miscolor, or crash in Claude. Cross-host tests are the only way to catch these differences automatically.
sunpeak’s inspector fixture runs each e2e test in both host modes by default. If you already have e2e tests from step two, you already have cross-host coverage. Check your test output for host-specific failures and fix them before submission.
The most common cross-host bugs:
- CSS variable fallbacks missing for one host (e.g., using a ChatGPT-only variable without a fallback)
- Layout differences from different default font sizes or container widths
- Tool annotation requirements that differ between hosts (Claude requires at minimum readOnlyHint or destructiveHint; ChatGPT requires all three including openWorldHint)
- Timing differences in when tool-input and tool-input-partial notifications arrive
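The annotation rule in the list above can be written as a small check that runs against each tool definition. This is a sketch only: it encodes the requirements exactly as stated in this post (reading "all three" as readOnlyHint, destructiveHint, and openWorldHint), and the HostName type and findAnnotationProblems function are hypothetical names, not part of any host SDK. Confirm the current rules against each host's submission guidelines before relying on it.

```typescript
type HostName = "chatgpt" | "claude";

interface ToolAnnotations {
  readOnlyHint?: boolean;
  destructiveHint?: boolean;
  openWorldHint?: boolean;
}

// Returns a list of problems; an empty list means the annotations satisfy
// the rule this article states for the given host.
function findAnnotationProblems(host: HostName, annotations: ToolAnnotations): string[] {
  const has = (key: keyof ToolAnnotations): boolean => typeof annotations[key] === "boolean";

  if (host === "claude") {
    // Stated rule: at minimum readOnlyHint or destructiveHint.
    return has("readOnlyHint") || has("destructiveHint")
      ? []
      : ["claude: set readOnlyHint or destructiveHint"];
  }

  // Stated rule for ChatGPT: all three hints must be present.
  const required: (keyof ToolAnnotations)[] = ["readOnlyHint", "destructiveHint", "openWorldHint"];
  return required.filter((key) => !has(key)).map((key) => `chatgpt: missing ${key}`);
}
```

A tool annotated with only readOnlyHint passes the Claude check here but reports two missing hints for ChatGPT, which is the kind of host difference this section describes.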
Fourth: Unit Tests for Complex Logic
Once you have integration and e2e tests covering the main paths, add unit tests for components and handlers with complex logic.
Unit tests are faster than integration tests (milliseconds vs. seconds) and give more precise failure messages. They’re the right tool for testing:
- Tool handlers with branching logic. If your handler has multiple code paths based on input arguments, external API responses, or feature flags, unit test each branch.
- Resource components with conditional rendering. If your component shows different UI based on data shape, empty arrays, long strings, or missing optional fields, unit test each condition.
- Data transformation functions. If you extract shared logic into utility functions (date formatting, data normalization, filtering), unit test those directly.
Unit tests are the wrong tool for testing protocol-level behavior. If your test requires mocking useToolData to return specific data shapes, you’re testing against your assumptions about the protocol rather than the protocol itself. That’s what integration tests are for.
import { render, screen } from '@testing-library/react';
import { vi, test, expect } from 'vitest';
import AlbumGrid from '../../src/resources/albums/AlbumGrid';

vi.mock('sunpeak', () => ({
  useToolData: () => ({
    output: { albums: [] },
    isLoading: false,
    isError: false,
    isCancelled: false,
  }),
  SafeArea: ({ children }: any) => children,
}));

test('empty album list shows empty state message', () => {
  render(<AlbumGrid />);
  expect(screen.getByText('No albums found')).toBeDefined();
});
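Data transformation functions, the third item in the list above, need no mocks at all. As an illustration, here is a hypothetical formatPhotoCount utility with one table row per branch. The function, its wording, and the case table are invented for this example and do not come from the app in this post.

```typescript
// Hypothetical display helper: turns a raw count into label text.
// Non-finite values and anything below 1 fall back to the empty label.
function formatPhotoCount(count: number): string {
  if (!Number.isFinite(count) || count < 1) return "No photos";
  const whole = Math.floor(count);
  return whole === 1 ? "1 photo" : `${whole} photos`;
}

// One row per branch: [input, expected label].
const photoCountCases: Array<[number, string]> = [
  [0, "No photos"],
  [-3, "No photos"],
  [Number.NaN, "No photos"],
  [1, "1 photo"],
  [12, "12 photos"],
];
```

In a Vitest file, the table can be passed to test.each(photoCountCases) so that each branch shows up as its own named test in the output.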
What to Add as Your App Grows
These test types have lower ROI on day one but become worthwhile as your app gets more complex or more users:
Snapshot tests: Add these when your resource components have stabilized and you want to catch accidental rendering changes. Snapshot tests are brittle to intentional changes (every UI update requires updating the snapshot), so wait until your design is stable before adopting them.
Visual regression tests: Add these when CSS bugs start showing up in production. Visual regression tests compare screenshots pixel by pixel, which catches layout shifts, color changes, and responsive breakage that text-based tests miss. They require baseline images and produce false positives when you intentionally change the UI, so they have a maintenance cost.
Regression tests: Add structural assertions on structuredContent shapes when your tool output format stabilizes. These catch field renames, type changes, and removed properties that break the contract between your tool handler and resource component. They’re different from integration tests in that they assert on the shape, not the values.
Evals: Add these when you have three or more tools, when tools have overlapping functionality, or when users report that the AI calls the wrong tool. Evals send prompts to GPT-4o, Claude, and Gemini and check whether each model calls the right tool. They cost API credits per run, so gate them to main branch in CI/CD.
Performance tests: Add these when your app loads slowly or handles large datasets. Measure tool handler response time, resource component render time, and bundle size.
Security tests: Add these before production launch. Test for injection in tool inputs, CSP compliance, and data leakage between tools.
Accessibility tests: Add these when your app has interactive elements. Test keyboard navigation, screen reader labels, and color contrast inside the host iframe.
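The regression tests described above assert on shape rather than values. One possible way to do that, sketched here with hypothetical helper names, is to reduce a structuredContent object to a description of its keys and types and compare that against a stored baseline. The sketch assumes arrays are homogeneous, so only the first element is inspected.

```typescript
// Describes the structure of a value while discarding the values themselves.
function describeShape(value: unknown): unknown {
  if (value === null) return "null";
  if (Array.isArray(value)) {
    // Assumption: arrays are homogeneous, so the first element stands in for all.
    return value.length === 0 ? [] : [describeShape(value[0])];
  }
  if (typeof value === "object") {
    const shape: Record<string, unknown> = {};
    // Sort keys so that property order never affects the comparison.
    for (const key of Object.keys(value).sort()) {
      shape[key] = describeShape((value as Record<string, unknown>)[key]);
    }
    return shape;
  }
  return typeof value;
}

// True when both values have the same keys and the same primitive types.
function sameShape(actual: unknown, baseline: unknown): boolean {
  return JSON.stringify(describeShape(actual)) === JSON.stringify(describeShape(baseline));
}
```

With this approach, two search results with different titles and counts compare as equal, while a renamed field or a number that became a string does not. An empty array and a populated one also compare as different, so the baseline should come from a call that returns at least one item.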
What to Skip
Some things look like they should be tested but have near-zero ROI for MCP Apps:
The MCP protocol itself. Don’t test JSON-RPC serialization, postMessage wiring, iframe sandboxing, or host communication. The framework (sunpeak or whatever you use) handles this, and the host handles the other side. If the protocol is broken, all apps are broken, not just yours.
Host-provided CSS variables. Don’t test that --oai-font-family or --connector-surface-primary resolve to the right values. The host provides these at runtime and they change without notice. Test that your components work with reasonable fallback values.
Third-party library internals. Don’t test that React renders a <div>, that Zod validates a schema, or that Vitest runs assertions. Test your code, not theirs.
Simple pass-through components. If a resource component does nothing but destructure useToolData output and pass fields to child elements, it has so little logic that a unit test adds maintenance cost without catching real bugs. Integration tests and e2e tests already cover the data flow.
Every possible display mode and theme combination. Test the modes your app actually supports. If your app only uses inline mode, don’t write tests for PiP and fullscreen. If you don’t have dark mode styles, don’t test dark mode.
Evals for simple tool sets. If you have one tool with a clear name and description, LLMs won’t confuse it. Evals add value when there’s ambiguity between tools.
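On the host-provided CSS variables item above: instead of testing what the host supplies, you can make sure every host variable your styles reference carries a fallback. A minimal sketch follows, with cssVarWithFallback as a hypothetical helper name; the variable name in the usage note is the one mentioned elsewhere in this post.

```typescript
// Builds a nested var() expression that tries each variable in order and
// ends with a literal fallback value.
function cssVarWithFallback(variableNames: string[], fallback: string): string {
  return variableNames.reduceRight(
    (inner, name) => `var(${name}, ${inner})`,
    fallback,
  );
}
```

For example, cssVarWithFallback(["--oai-border-radius"], "8px") produces var(--oai-border-radius, 8px), which resolves to 8px on a host that does not define the variable. Passing several names nests them, so the first defined variable wins.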
The Recommended Order
Here is the order I recommend for building out a test suite for an MCP App, from first to last:
- Integration tests on every tool handler (day one)
- E2E tests for main resource rendering paths (before first deploy)
- Cross-host tests (before app store / directory submission)
- Unit tests for complex components and handlers (as complexity grows)
- Tool annotation tests (before submission)
- Snapshot tests (when design stabilizes)
- Regression tests on structuredContent shapes (when tool output stabilizes)
- Security tests (before production launch)
- Accessibility tests (when interactive elements exist)
- Visual regression tests (when CSS bugs appear in production)
- Performance tests (when load times matter)
- Evals (when tool ambiguity appears)
You won’t need all 12 for every app. A simple MCP App with one tool and one resource can ship confidently with steps 1 through 3 (integration, e2e, and cross-host tests). A complex app with ten tools, multiple resources, and cross-host requirements will eventually need most of them.
Setting Up CI/CD
Run your test suite in CI/CD so tests catch bugs before they reach production. A minimal GitHub Actions workflow for an MCP App looks like this:
name: Test
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: pnpm
      - run: pnpm install
      - run: pnpm test:unit
      - run: pnpm exec playwright install --with-deps chromium
      - run: pnpm test:e2e
pnpm test:unit runs Vitest unit tests. pnpm test:e2e runs Playwright integration and e2e tests (including cross-host tests). Keep evals in a separate job gated to the main branch because they cost API credits.
For more detail on CI/CD setup, see the GitHub Actions guide.
How to Measure Coverage
Vitest has built-in coverage reporting. Add --coverage to see which lines your tests exercise:
pnpm test:unit -- --coverage
Coverage numbers are useful as a sanity check, not as a target. High coverage on tool handlers and resource components matters. High coverage on configuration files and type definitions does not. Aim for coverage on the code that can break your app in production, not coverage on the code that makes your coverage report look good.
For MCP Apps, the most important coverage targets are:
- Tool handler functions (every code path that produces structuredContent)
- Resource component branches (loading, error, cancelled, success, empty data)
- Data transformation utilities (formatting, filtering, normalization)
If those three areas have solid test coverage, your app is tested where it matters.
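To keep the coverage report focused on those three areas, the Vitest config can limit which files are measured. The fragment below is a sketch under assumptions: the src/tools and src/utils directories are hypothetical (only src/resources appears in this post), and the v8 provider requires the @vitest/coverage-v8 package to be installed.

```typescript
// vitest.config.ts (sketch; directory names are assumptions)
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    environment: "happy-dom",
    coverage: {
      provider: "v8",
      // Measure only the code that can break the app in production.
      include: ["src/tools/**", "src/resources/**", "src/utils/**"],
      // Leave type definitions and config files out of the report.
      exclude: ["**/*.d.ts", "**/*.config.*"],
      reporter: ["text", "html"],
    },
  },
});
```

Adjust the include globs to match your project layout; the point is that files outside them no longer inflate or dilute the number.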
When Tests Save You
Here are real scenarios where each test type catches a bug that the others miss:
- Integration test: You rename a structuredContent field from imageUrl to coverUrl. Unit tests pass because they mock the old field name. Integration tests fail because the real handler returns the new name and the assertion expects the old one.
- E2E test: You add a CSS class that looks right in your dev tools but overflows the inline display mode iframe. Integration tests pass because they don’t render HTML. E2E tests fail because the rendered layout exceeds the viewport.
- Cross-host test: You use a ChatGPT CSS variable (--oai-border-radius) without a fallback. ChatGPT looks fine. Claude renders square corners because it doesn’t provide that variable. Cross-host tests catch the difference.
- Unit test: Your component has a conditional branch for empty arrays that shows “No results found.” No integration or e2e test covers this case because the test data always has results. A unit test with output: { albums: [] } verifies the branch.
- Eval: You have two tools named get-photos and show-albums. Both look clear to you, but GPT-4o confuses them 30% of the time. An eval running 10 prompts per model catches this before users do.
Each test type has a specific failure mode it catches that the others don’t. That’s why the strategy matters: you add each type when its failure mode becomes a real risk for your app, not before.
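For the eval scenario above, the pass/fail decision comes down to per-model accuracy over a set of prompts. Here is a rough sketch of that scoring step, assuming you have already recorded which tool each model called. The EvalRecord shape and both function names are hypothetical and not tied to any eval runner.

```typescript
// Hypothetical record of one eval prompt's outcome.
interface EvalRecord {
  model: string;
  expectedTool: string;
  calledTool: string | null; // null when the model called no tool
}

// Returns, per model, the fraction of prompts where the expected tool was called.
function toolAccuracyByModel(records: EvalRecord[]): Map<string, number> {
  const totals = new Map<string, { correct: number; total: number }>();
  for (const record of records) {
    const entry = totals.get(record.model) ?? { correct: 0, total: 0 };
    entry.total += 1;
    if (record.calledTool === record.expectedTool) entry.correct += 1;
    totals.set(record.model, entry);
  }
  const accuracy = new Map<string, number>();
  totals.forEach(({ correct, total }, model) => accuracy.set(model, correct / total));
  return accuracy;
}

// Lists the models whose accuracy falls below the given threshold (0 to 1).
function modelsBelowThreshold(records: EvalRecord[], threshold: number): string[] {
  const failing: string[] = [];
  toolAccuracyByModel(records).forEach((score, model) => {
    if (score < threshold) failing.push(model);
  });
  return failing;
}
```

With 10 prompts per model, a model that picks the wrong tool on 3 of them scores 0.7 and would fail a 0.9 threshold. Where to set the threshold is a judgment call for your app.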
Get Started
npx sunpeak new
Further Reading
- Complete guide to testing ChatGPT Apps and MCP Apps
- Integration testing MCP Apps - the mcp fixture for protocol-level tests
- E2E testing MCP Apps - the inspector fixture for full-stack tests
- Unit testing MCP Apps - mock hooks and test components in isolation
- Cross-host compatibility testing - run tests across ChatGPT and Claude
- MCP App evals - test tool calling across GPT-4o, Claude, and Gemini
- Regression testing MCP Apps - catch unintended breakage before production
- MCP App CI/CD with GitHub Actions - automate your test suite
- Pre-submission testing - validate before publishing to ChatGPT and Claude
- Testing framework
- MCP App framework
- ChatGPT App framework
- Claude Connector framework
Frequently Asked Questions
What should I test first in an MCP App?
Start with integration tests on your tool handlers using the mcp fixture from sunpeak/test. Call mcp.callTool() and assert on the structuredContent shape. Tool handlers are the highest-ROI test target because every other layer depends on their output. A broken tool handler breaks the entire app. Integration tests catch shape mismatches, serialization bugs, and missing fields at the protocol boundary. Write one test per tool before adding any other test type.
How many test types does an MCP App need?
Most MCP Apps need three test types to ship with confidence: integration tests (tool handler output shapes), e2e tests (resource rendering in simulated hosts), and cross-host tests (behavior differences between ChatGPT and Claude). Add unit tests for complex component logic, snapshot tests for rendering stability, and evals for tool naming clarity as your app grows. You do not need all test types on day one.
What is the testing pyramid for MCP Apps?
The MCP App testing pyramid has five layers from bottom to top: unit tests (fast, test tool handlers and components in isolation), integration tests (test tool handlers through the MCP protocol), e2e tests (test resources rendered in simulated host runtimes), cross-host tests (verify behavior across ChatGPT and Claude), and evals (test whether LLMs call your tools correctly). Write more tests at the bottom layers and fewer at the top. Integration tests give the best cost-to-coverage ratio for most MCP Apps.
What can I skip testing in an MCP App?
Skip testing the MCP protocol itself (JSON-RPC serialization, postMessage wiring, iframe sandboxing). Skip testing host-provided CSS variables. Skip testing third-party library internals. Skip writing unit tests for components that only pass useToolData output to child elements without logic. Skip visual regression tests until your app has a stable design. Skip evals until you have multiple tools with similar names or descriptions that an LLM might confuse.
Should I write unit tests or integration tests first for an MCP App?
Write integration tests first. MCP App unit tests require mocking sunpeak hooks (useToolData, useAppState, useDisplayMode), which means you are testing against your assumptions about what the hooks return rather than what they actually return. Integration tests with the mcp fixture exercise the real MCP server and real tool handlers, so they catch bugs at the protocol boundary that unit tests with mocked data will miss. Add unit tests second for components with complex rendering logic.
When should I add e2e tests to my MCP App?
Add e2e tests after you have integration tests for your tool handlers. E2e tests use the inspector fixture from sunpeak/test to render your resource components in simulated ChatGPT and Claude runtimes with Playwright. They verify that your UI renders correctly given real tool output in a real browser. Add them when you need confidence that your resource components handle all useToolData states (loading, error, cancelled, success) and display modes (inline, PiP, fullscreen).
How do I test my MCP App across ChatGPT and Claude?
The sunpeak inspector fixture runs e2e tests across host modes automatically. Each test renders your resource component in both the ChatGPT and Claude runtime environments, with the correct CSS variables, iframe sandboxing, and protocol behavior for each host. Run with pnpm test:e2e. If a CSS variable name difference or a protocol timing difference causes a failure on one host, the test output shows which host failed.
Do I need evals if I only have one or two tools?
Probably not. Evals test whether LLMs can pick the right tool from your tool list and pass correct arguments. If you only have one or two tools with distinct names and clear descriptions, LLMs will almost always call the right one. Add evals when you have three or more tools, when tools have overlapping functionality, or when you notice users reporting that the AI calls the wrong tool. Evals cost API credits per run, so wait until you have a real ambiguity problem.