The sunpeak testing framework tests any MCP server, whether or not you use the sunpeak app framework. It provides four levels of automated testing that cover everything from fast unit checks to live host validation and multi-model eval runs.

Quickstart

No sunpeak project required. Scaffold tests for any running MCP server:
npx sunpeak test init --server http://localhost:8000/mcp
npx sunpeak test
This generates test infrastructure (Playwright configs, E2E specs, visual regression, eval boilerplate) and runs the scaffolded smoke test. See Getting Started for the full setup walkthrough, including language-specific tips for Python, Go, and Rust servers. For sunpeak framework projects, the dev server starts automatically and no server URL is needed.

Testing Levels

1. E2E Tests

Playwright specs that call your MCP tools and render them in simulated ChatGPT and Claude runtimes. The mcp fixture from sunpeak/test handles inspector navigation, iframe traversal, and host switching. Simulations (JSON fixtures) define reproducible tool states so you can test every combination of host, theme, display mode, and device without deploying or burning API credits. Visual regression is built in. Pass --visual to compare screenshots against baselines, or --visual --update to regenerate them.
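To make the simulation idea concrete, a fixture might pair a tool call with the output it should render plus the host/theme/display-mode dimensions mentioned above. The schema below is an illustrative assumption, not sunpeak's documented fixture format:

```json
{
  "tool": "get_weather",
  "input": { "city": "Oslo" },
  "output": {
    "content": [{ "type": "text", "text": "4°C, light rain" }]
  },
  "host": "chatgpt",
  "theme": "dark",
  "displayMode": "inline"
}
```

Because the state is a static JSON file, the same fixture can be replayed across every host/theme/device combination without a live LLM in the loop.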

E2E Testing

Write Playwright tests against simulated ChatGPT and Claude runtimes.

Visual Regression

Screenshot comparison across themes, display modes, and hosts.
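At its core, screenshot comparison means diffing pixels against a stored baseline within a tolerance. The toy sketch below illustrates the idea only; it is not sunpeak's actual implementation (real comparators such as pixelmatch also handle anti-aliasing):

```typescript
// Toy pixel diff: report the fraction of RGBA pixels that differ
// from the baseline beyond a per-channel tolerance.
function diffRatio(
  baseline: Uint8ClampedArray,
  actual: Uint8ClampedArray,
  tolerance = 8, // max per-channel delta still counted as "same"
): number {
  if (baseline.length !== actual.length) return 1; // size mismatch: total failure
  let changed = 0;
  const pixels = baseline.length / 4; // 4 channels (RGBA) per pixel
  for (let i = 0; i < baseline.length; i += 4) {
    for (let c = 0; c < 4; c++) {
      if (Math.abs(baseline[i + c] - actual[i + c]) > tolerance) {
        changed++; // count the pixel once, then move on
        break;
      }
    }
  }
  return changed / pixels;
}

// A screenshot "passes" if fewer than 0.1% of pixels changed.
const passes = (b: Uint8ClampedArray, a: Uint8ClampedArray): boolean =>
  diffRatio(b, a) < 0.001;
```

The `--update` flag then simply replaces the stored baseline with the current screenshot, resetting what "unchanged" means.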

2. Live Tests

Playwright specs that run against real ChatGPT (and future hosts). They open a browser, send messages that trigger tool calls against your MCP server, and verify the rendered app. This catches issues that inspector tests cannot: real MCP connection behavior, actual LLM tool invocation, host-specific iframe rendering, and production resource loading.

Live Testing

Validate your MCP Apps inside real AI chat hosts.

3. Evals

Multi-model tool calling tests. Evals connect to your MCP server via the MCP protocol, discover its tools, and send prompts to multiple LLMs (GPT-4o, Claude, Gemini, etc.). Each eval case runs N times per model and reports statistical pass/fail counts, so you can measure whether your tool descriptions work reliably across models.
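The per-model statistics are easy to picture: each case yields N boolean outcomes per model, which collapse into pass counts and a pass rate. A minimal sketch, with the result shape assumed for illustration rather than taken from sunpeak's reporting API:

```typescript
// One outcome of a single eval run against one model.
type EvalRun = { model: string; passed: boolean };

type ModelStats = { passed: number; total: number; rate: number };

// Collapse N runs per model into pass/total counts and a pass rate.
function summarize(runs: EvalRun[]): Map<string, ModelStats> {
  const stats = new Map<string, ModelStats>();
  for (const { model, passed } of runs) {
    const s = stats.get(model) ?? { passed: 0, total: 0, rate: 0 };
    s.total++;
    if (passed) s.passed++;
    s.rate = s.passed / s.total;
    stats.set(model, s);
  }
  return stats;
}
```

A case with a rate strictly between 0 and 1 for some model is flaky for that model, which is exactly the signal that a tool description needs sharpening.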

Evals

Test tool calling reliability across GPT-4o, Claude, Gemini, and more.

CLI Commands

| Command | What it runs | Runtime |
| --- | --- | --- |
| `sunpeak test` | Unit (if configured) + E2E tests | Vitest + Playwright |
| `sunpeak test --e2e` | E2E tests only | Playwright + inspector |
| `sunpeak test --visual` | E2E with visual regression | Playwright + inspector |
| `sunpeak test --visual --update` | Update visual baselines | Playwright + inspector |
| `sunpeak test --live` | Live tests against real hosts | Playwright + real host |
| `sunpeak test --eval` | Evals against multiple models | Vitest + Vercel AI SDK |
| `sunpeak test --unit` | Unit tests (app framework only) | Vitest + happy-dom |
Flags are additive: --e2e --live --eval runs all three.
--eval and --live are not included in the default sunpeak test run because they require API keys and cost money. You must opt in explicitly.

Scaffolding

For existing MCP servers (not built with sunpeak), run npx sunpeak test init to generate all the test infrastructure:
npx sunpeak test init
For JS/TS projects, this creates files at the project root:
  • tests/e2e/ with smoke and visual regression test specs
  • tests/evals/ with eval config, .env.example, and example eval specs
  • tests/live/ with live test config and example specs
For non-JS projects (Python, Go, Rust, etc.), everything goes into a self-contained tests/sunpeak/ directory with its own package.json. For sunpeak framework projects, sunpeak new scaffolds all of this automatically.

Learn More

Inspector

The multi-host inspector that powers E2E tests.

Simulations

JSON fixtures for reproducible tool states.

E2E Testing

Playwright specs against simulated hosts.

Visual Regression

Screenshot baselines and comparison.

Live Testing

Tests against real ChatGPT and Claude.

Evals

Multi-model tool calling reliability.