The sunpeak testing framework tests any MCP server, whether or not you use the sunpeak app framework. It provides four levels of automated testing that cover everything from fast unit checks to live host validation and multi-model eval runs.

Standalone Usage

You do not need to build your MCP server with sunpeak to use the testing framework. Point it at any MCP server URL:
# Scaffold test infrastructure for an existing MCP server
sunpeak test init

# Launch the inspector against any MCP server
sunpeak inspect --server http://localhost:8000/mcp
sunpeak test init generates the test directory structure, Playwright configs, example specs, and eval boilerplate. From there, configure defineConfig() with your server URL and write tests against your tools and resources. For sunpeak framework projects, the dev server starts automatically and no server URL is needed.
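A standalone configuration might look like the following sketch. The `defineConfig()` name comes from the docs above; any option names beyond the server URL are illustrative assumptions, not documented API.

```typescript
// tests/e2e/sunpeak.config.ts — a minimal sketch for a standalone server.
// Only defineConfig() and the server URL are taken from the docs; treat
// everything else as an assumption.
import { defineConfig } from "sunpeak/test";

export default defineConfig({
  // Point the test runner at your existing MCP server.
  server: "http://localhost:8000/mcp",
});
```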

Testing Levels

1. Unit Tests

Standard Vitest with happy-dom for component and hook logic testing. No special framework integration required.

Unit Testing

Fast component and hook tests with Vitest.

2. E2E Tests

Playwright specs that call your MCP tools and render them in simulated ChatGPT and Claude runtimes. The mcp fixture from sunpeak/test handles inspector navigation, iframe traversal, and host switching. Simulations (JSON fixtures) define reproducible tool states so you can test every combination of host, theme, display mode, and device without deploying or burning API credits. Visual regression is built in. Pass --visual to compare screenshots against baselines, or --visual --update to regenerate them.

E2E Testing

Write Playwright tests against simulated ChatGPT and Claude runtimes.

Visual Regression

Screenshot comparison across themes, display modes, and hosts.

3. Live Tests

Playwright specs that run against real ChatGPT (and future hosts). They open a browser, send messages that trigger tool calls against your MCP server, and verify the rendered app. This catches issues that inspector tests cannot: real MCP connection behavior, actual LLM tool invocation, host-specific iframe rendering, and production resource loading.

Live Testing

Validate your MCP Apps inside real AI chat hosts.

4. Evals

Multi-model tool calling tests. Evals connect to your MCP server via the MCP protocol, discover its tools, and send prompts to multiple LLMs (GPT-4o, Claude, Gemini, etc.). Each eval case runs N times per model and reports statistical pass/fail counts, so you can measure whether your tool descriptions work reliably across models.
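An eval case might be declared like the sketch below. The defineEval helper, its option names, and the model identifiers are all assumptions for illustration; only the run-N-times-per-model behavior is taken from the docs above:

```typescript
// tests/evals/forecast.eval.ts — illustrative sketch; defineEval and its
// options are assumptions, not documented API.
import { defineEval } from "sunpeak/test";

export default defineEval({
  prompt: "What's the weather in Denver right now?",
  models: ["gpt-4o", "claude-sonnet", "gemini"], // model ids are illustrative
  runs: 5, // each model gets 5 attempts; pass/fail counts are aggregated
  // A run passes if the model invoked the expected tool.
  expect: ({ toolCalls }) => toolCalls.some((call) => call.name === "get_forecast"),
});
```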

Evals

Test tool calling reliability across GPT-4o, Claude, Gemini, and more.

CLI Commands

| Command | What it runs | Runtime |
| --- | --- | --- |
| sunpeak test | Unit + E2E tests | Vitest + Playwright |
| sunpeak test --unit | Unit tests only | Vitest + happy-dom |
| sunpeak test --e2e | E2E tests only | Playwright + inspector |
| sunpeak test --visual | E2E with visual regression | Playwright + inspector |
| sunpeak test --visual --update | Update visual baselines | Playwright + inspector |
| sunpeak test --live | Live tests against real hosts | Playwright + real host |
| sunpeak test --eval | Evals against multiple models | Vitest + Vercel AI SDK |
Flags are additive: --unit --e2e --live --eval runs all four.
--eval and --live are not included in the default sunpeak test run because they require API keys and cost money. You must opt in explicitly.

Scaffolding

For existing MCP servers (not built with sunpeak), run sunpeak test init to generate all the test infrastructure:
sunpeak test init
This creates:
  • tests/e2e/ with example Playwright specs and config
  • tests/simulations/ with example simulation JSON fixtures
  • tests/evals/ with eval config, .env.example, and example eval specs
  • tests/live/ with live test config and example specs
For sunpeak framework projects, sunpeak new scaffolds all of this automatically.
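The generated simulation fixtures are plain JSON files that pin a tool call to a fixed result and host state. The exact schema below (keys like tool, result, and host) is an illustrative assumption:

```json
{
  "tool": "get_forecast",
  "arguments": { "city": "Denver" },
  "result": { "temperature": 72, "conditions": "sunny" },
  "host": "chatgpt",
  "theme": "dark",
  "displayMode": "inline"
}
```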

Learn More

Inspector

The multi-host inspector that powers E2E tests.

Simulations

JSON fixtures for reproducible tool states.

Unit Testing

Fast component and hook tests with Vitest.

E2E Testing

Playwright specs against simulated hosts.

Visual Regression

Screenshot baselines and comparison.

Live Testing

Tests against real ChatGPT and Claude.

Evals

Multi-model tool calling reliability.