The sunpeak testing framework tests any MCP server, whether or not you use the sunpeak app framework. It provides four levels of automated testing that cover everything from fast unit checks to live host validation and multi-model eval runs.

Standalone Usage

You do not need to build your MCP server with sunpeak to use the testing framework. Point it at any MCP server URL:
# Scaffold test infrastructure for an existing MCP server
sunpeak test init

# Launch the inspector against any MCP server
sunpeak inspect --server http://localhost:8000/mcp
sunpeak test init generates the test directory structure, Playwright configs, example specs, and eval boilerplate. From there, configure defineConfig() with your server URL and write tests against your tools and resources. For sunpeak framework projects, the dev server starts automatically and no server URL is needed.
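A standalone configuration might look like the following sketch. The `defineConfig()` name comes from the docs above; any option names beyond the server URL are illustrative assumptions, not documented API.

```typescript
// tests/e2e/sunpeak.config.ts — a minimal sketch for a standalone server.
// Only defineConfig() and the server URL are taken from the docs; treat
// everything else as an assumption.
import { defineConfig } from "sunpeak/test";

export default defineConfig({
  // Point the test runner at your existing MCP server.
  server: "http://localhost:8000/mcp",
});
```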

Testing Levels

1. Unit Tests

Standard Vitest with happy-dom for component and hook logic testing. No special framework integration required.

Unit Testing

Fast component and hook tests with Vitest.

2. E2E Tests

Playwright specs that call your MCP tools and render them in simulated ChatGPT and Claude runtimes. The mcp fixture from sunpeak/test handles inspector navigation, iframe traversal, and host switching. Simulations (JSON fixtures) define reproducible tool states so you can test every combination of host, theme, display mode, and device without deploying or burning API credits. Visual regression is built in. Pass --visual to compare screenshots against baselines, or --visual --update to regenerate them.

E2E Testing

Write Playwright tests against simulated ChatGPT and Claude runtimes.

Visual Regression

Screenshot comparison across themes, display modes, and hosts.

3. Live Tests

Playwright specs that run against real ChatGPT (and future hosts). They open a browser, send messages that trigger tool calls against your MCP server, and verify the rendered app. This catches issues that inspector tests cannot: real MCP connection behavior, actual LLM tool invocation, host-specific iframe rendering, and production resource loading.

Live Testing

Validate your MCP Apps inside real AI chat hosts.

4. Evals

Multi-model tool calling tests. Evals connect to your MCP server via the MCP protocol, discover its tools, and send prompts to multiple LLMs (GPT-4o, Claude, Gemini, etc.). Each eval case runs N times per model and reports statistical pass/fail counts, so you can measure whether your tool descriptions work reliably across models.
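An eval case might be declared like the sketch below. The defineEval helper, its option names, and the model identifiers are all assumptions for illustration; only the run-N-times-per-model behavior is taken from the docs above:

```typescript
// tests/evals/forecast.eval.ts — illustrative sketch; defineEval and its
// options are assumptions, not documented API.
import { defineEval } from "sunpeak/test";

export default defineEval({
  prompt: "What's the weather in Denver right now?",
  models: ["gpt-4o", "claude-sonnet", "gemini"], // model ids are illustrative
  runs: 5, // each model gets 5 attempts; pass/fail counts are aggregated
  // A run passes if the model invoked the expected tool.
  expect: ({ toolCalls }) => toolCalls.some((call) => call.name === "get_forecast"),
});
```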

Evals

Test tool calling reliability across GPT-4o, Claude, Gemini, and more.

CLI Commands

| Command | What it runs | Runtime |
| --- | --- | --- |
| sunpeak test | Unit + E2E tests | Vitest + Playwright |
| sunpeak test --unit | Unit tests only | Vitest + happy-dom |
| sunpeak test --e2e | E2E tests only | Playwright + inspector |
| sunpeak test --visual | E2E with visual regression | Playwright + inspector |
| sunpeak test --visual --update | Update visual baselines | Playwright + inspector |
| sunpeak test --live | Live tests against real hosts | Playwright + real host |
| sunpeak test --eval | Evals against multiple models | Vitest + Vercel AI SDK |
Flags are additive: --unit --e2e --live --eval runs all four.
--eval and --live are not included in the default sunpeak test run because they require API keys and cost money. You must opt in explicitly.

Scaffolding

For existing MCP servers (not built with sunpeak), run sunpeak test init to generate all the test infrastructure:
sunpeak test init
This creates:
  • tests/e2e/ with example Playwright specs and config
  • tests/simulations/ with example simulation JSON fixtures
  • tests/evals/ with eval config, .env.example, and example eval specs
  • tests/live/ with live test config and example specs
For sunpeak framework projects, sunpeak new scaffolds all of this automatically.
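The generated simulation fixtures are plain JSON files that pin a tool call to a fixed result and host state. The exact schema below (keys like tool, result, and host) is an illustrative assumption:

```json
{
  "tool": "get_forecast",
  "arguments": { "city": "Denver" },
  "result": { "temperature": 72, "conditions": "sunny" },
  "host": "chatgpt",
  "theme": "dark",
  "displayMode": "inline"
}
```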

Learn More

Inspector

The multi-host inspector that powers E2E tests.

Simulations

JSON fixtures for reproducible tool states.

Unit Testing

Fast component and hook tests with Vitest.

E2E Testing

Playwright specs against simulated hosts.

Visual Regression

Screenshot baselines and comparison.

Live Testing

Tests against real ChatGPT and Claude.

Evals

Multi-model tool calling reliability.