MCP App CI/CD: Run Your Tests in GitHub Actions (June 2026)

June 14, 2026 Abe Wheeler

GitHub Actions running MCP App tests against the sunpeak inspector, no paid host accounts needed.

[Updated 2026-06-14] Here’s the GitHub Actions setup I would put in a production MCP App project today. It runs fast deterministic checks on every pull request, saves browser artifacts when something fails, and keeps slower live host tests and model evals out of the default loop.

TL;DR: Put pnpm test in your default CI workflow after installing dependencies and Playwright Chromium. That covers unit tests and inspector E2E tests against replicated ChatGPT and Claude runtimes. Add pnpm test:visual when screenshot drift matters. Run pnpm test:live and pnpm test:eval in separate jobs because they use real host sessions or provider API keys. If you already have an MCP server in another language, run npx sunpeak test init --server http://localhost:8000/mcp and use the same CI pattern.

What Changed Since April 2026

MCP Apps are now a shared MCP pattern, not a ChatGPT-only integration detail. The official MCP Apps announcement describes UI resources that render inside host conversations, and OpenAI’s MCP Apps compatibility guide maps older Apps SDK fields to the standard MCP Apps bridge.

That changes what a good CI pipeline should prove. You want tests for:

The MCP contract: tools, schemas, tool results, resources, annotations, and metadata.
The app runtime: iframe rendering, host bridge events, display modes, theme, safe area, and teardown.
The user-facing UI: empty states, loading states, error states, full data states, and visual regressions.
The model-facing contract: whether models can pick the right tool and pass useful arguments.

sunpeak covers those layers with unit tests, inspector E2E tests, visual regression tests, live tests, and evals. CI should run the cheap deterministic layers all the time, then run the external layers on a slower cadence.

The CI Split That Works

Do not put every test in one job. MCP App tests have different costs and failure modes, so split them by purpose:

Job	Command	When to run	External dependency
Static checks	`pnpm typecheck && pnpm lint`	Every pull request	None
Unit tests	`pnpm test:unit`	Every pull request	None
Inspector E2E	`pnpm test:e2e`	Every pull request	None
Visual regression	`pnpm test:visual`	UI pull requests, or every PR for UI-heavy apps	None
Live host tests	`pnpm test:live`	Main, release, or manual dispatch	Real host account
Tool evals	`pnpm test:eval`	Tool schema changes, main, release, or manual dispatch	Provider API keys

pnpm test is the right default for most sunpeak projects because it runs unit tests and E2E tests. It does not spend model credits, and it does not need ChatGPT or Claude credentials.

Minimal GitHub Actions Workflow

Put this in .github/workflows/test.yml:

name: Test

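on:
  pull_request:
  push:
    branches: [main]
  # Manual runs from the Actions tab. Jobs gated on
  # github.event_name == 'workflow_dispatch' need this trigger.
  workflow_dispatch: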

jobs:
  test:
    name: Unit + E2E
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v6

      - uses: pnpm/action-setup@v6
        with:
          version: 11

      - uses: actions/setup-node@v6
        with:
          node-version: 24
          cache: pnpm
          cache-dependency-path: pnpm-lock.yaml

      - name: Install dependencies
        run: pnpm install --frozen-lockfile

      - name: Install Playwright Chromium
        run: pnpm exec playwright install --with-deps chromium

      - name: Run unit and E2E tests
        run: pnpm test

      - name: Upload Playwright report
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: playwright-report
          path: playwright-report/
          retention-days: 7

      - name: Upload test artifacts
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: test-results
          path: test-results/
          retention-days: 7

Use the Node version your project supports. The example uses Node 24 because the current GitHub action examples have moved to the Node 24 action runtime, and new projects should be testing on a current LTS line. If your production server is still on Node 20, test Node 20 too or keep the workflow pinned there until your runtime moves.
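
If you need to cover both a current LTS line and the version production still runs, a build matrix is one way to do it in the same job. The fragment below is a sketch that assumes Node 20 is the older runtime, as in the paragraph above. Only the strategy block, the node-version expression, and the artifact name differ from the minimal workflow; the other steps are elided.

```yaml
jobs:
  test:
    name: Unit + E2E (Node ${{ matrix.node-version }})
    runs-on: ubuntu-latest
    strategy:
      # Let the other Node version finish even if one of them fails.
      fail-fast: false
      matrix:
        node-version: [20, 24]

    steps:
      - uses: actions/checkout@v6

      - uses: pnpm/action-setup@v6
        with:
          version: 11

      - uses: actions/setup-node@v6
        with:
          node-version: ${{ matrix.node-version }}
          cache: pnpm
          cache-dependency-path: pnpm-lock.yaml

      # ...install, Playwright, and test steps as in the minimal workflow...

      - name: Upload Playwright report
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          # upload-artifact@v4 rejects duplicate names within a run,
          # so each matrix leg needs its own artifact name.
          name: playwright-report-node-${{ matrix.node-version }}
          path: playwright-report/
          retention-days: 7
```

Before relying on this, check that the pnpm version you pin supports the older Node line.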

The Playwright step matters. GitHub-hosted runners do not come with the Playwright-managed browser builds your test suite needs. Playwright’s CI docs recommend installing browsers and Linux dependencies before running tests, and --with-deps chromium is the shortest reliable version when your sunpeak tests only need Chromium.
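
The static checks row from the CI split table is not part of the minimal workflow. One way to add it is a separate job, so type and lint failures surface without waiting on the browser install. This is a sketch that assumes your package.json defines typecheck and lint scripts; the job id static is an arbitrary name.

```yaml
  static:
    name: Static checks
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v6

      - uses: pnpm/action-setup@v6
        with:
          version: 11

      - uses: actions/setup-node@v6
        with:
          node-version: 24
          cache: pnpm
          cache-dependency-path: pnpm-lock.yaml

      - run: pnpm install --frozen-lockfile

      # No Playwright install here: these checks never open a browser.
      - run: pnpm typecheck
      - run: pnpm lint
```

The job sits under the same jobs: key as test and runs in parallel with it.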

What `pnpm test` Runs in a sunpeak Project

A current sunpeak project gives you these commands:

pnpm test          # unit + inspector E2E tests
pnpm test:unit     # component and hook tests with Vitest
pnpm test:e2e      # Playwright tests against the inspector
pnpm test:visual   # screenshot regression tests
pnpm test:live     # browser tests against real hosts
pnpm test:eval     # multi-model tool calling evals

The Playwright config stays small:

// playwright.config.ts
import { defineConfig } from 'sunpeak/test/config';

export default defineConfig();

That defineConfig() call handles the pieces that usually make MCP App testing annoying in CI: starting the app server, choosing ports, opening the inspector, rendering resources inside iframes, and running tests across supported host projects.

An E2E test can stay focused on behavior:

import { expect, test } from 'sunpeak/test';

test('dashboard renders the weekly total', async ({ inspector }) => {
  const result = await inspector.renderTool('get-dashboard', undefined, {
    host: 'chatgpt',
    displayMode: 'inline',
    theme: 'dark',
  });

  const app = result.app();
  await expect(app.getByText('4,218')).toBeVisible();
});

The same pattern works for Claude and ChatGPT host shells. Use host, theme, display mode, viewport, and simulation options to cover the states that can break your UI.

Add Visual Regression When Layout Matters

If your MCP App renders tables, charts, forms, maps, or any dense UI, run visual tests in CI:

  visual:
    name: Visual regression
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v6
      - uses: pnpm/action-setup@v6
        with:
          version: 11
      - uses: actions/setup-node@v6
        with:
          node-version: 24
          cache: pnpm
          cache-dependency-path: pnpm-lock.yaml
      - run: pnpm install --frozen-lockfile
      - run: pnpm exec playwright install --with-deps chromium
      - run: pnpm test:visual
      - name: Upload visual diffs
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: visual-regression-results
          path: |
            playwright-report/
            test-results/
          retention-days: 7

Store screenshot baselines in the repo. When the test fails, review the diff artifact. If the change is intentional, update baselines locally and commit them with the UI change. If the change is accidental, the diff usually points straight at the broken state: host, theme, display mode, viewport, or data fixture.

For high-churn apps, you can start by running visual tests only on main and on PRs that touch src/resources/**. For apps where UI correctness is the product, run them on every PR.
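
GitHub Actions applies paths filters to a whole workflow, not to a single job, so the narrower trigger means moving the visual job into its own workflow file. A sketch, assuming a file such as .github/workflows/visual.yml (the file name is arbitrary); the pnpm-lock.yaml entry is an addition of mine, not part of the rule above:

```yaml
name: Visual regression

on:
  pull_request:
    paths:
      - 'src/resources/**'
      # Assumption: dependency bumps can change rendering too.
      - 'pnpm-lock.yaml'
  push:
    branches: [main]

jobs:
  visual:
    name: Visual regression
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v6
      - uses: pnpm/action-setup@v6
        with:
          version: 11
      - uses: actions/setup-node@v6
        with:
          node-version: 24
          cache: pnpm
          cache-dependency-path: pnpm-lock.yaml
      - run: pnpm install --frozen-lockfile
      - run: pnpm exec playwright install --with-deps chromium
      - run: pnpm test:visual
```

If your screenshot baselines live outside src/resources/, add their directory to the paths list so baseline-only commits still trigger the job.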

Cache Dependencies Without Hiding Test Problems

Use actions/setup-node’s pnpm cache first:

- uses: actions/setup-node@v6
  with:
    node-version: 24
    cache: pnpm
    cache-dependency-path: pnpm-lock.yaml

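For monorepos that keep more than one lockfile, point cache-dependency-path at every lockfile that matters (a pnpm workspace with a single root lockfile needs only the first entry):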

cache-dependency-path: |
  pnpm-lock.yaml
  packages/*/pnpm-lock.yaml

Playwright browser caching is optional. Browser downloads are sometimes fast enough that caching adds more moving parts than it saves. If your workflow spends real time downloading Chromium, cache the Playwright browser directory:

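- name: Cache Playwright browsers
  uses: actions/cache@v4
  id: playwright-cache
  with:
    path: ~/.cache/ms-playwright
    key: playwright-${{ runner.os }}-${{ hashFiles('pnpm-lock.yaml') }}

# Cache miss: download Chromium and install its Linux system packages.
- name: Install Playwright Chromium
  if: steps.playwright-cache.outputs.cache-hit != 'true'
  run: pnpm exec playwright install --with-deps chromium

# Cache hit: the browser binary is restored, but system packages are not
# part of the cache, so install those on their own.
- name: Install Playwright system dependencies
  if: steps.playwright-cache.outputs.cache-hit == 'true'
  run: pnpm exec playwright install-deps chromium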

Keep pnpm install --frozen-lockfile in the workflow even with caching. The cache speeds up installs, but the install step is still what proves your lockfile and dependency graph are valid.

Use Simulations as CI Fixtures

Simulation files are the most useful testing layer for MCP App CI because they turn model-driven states into deterministic fixtures. A simulation might look like this:

{
  "tool": "get_dashboard",
  "userMessage": "Show me this week's analytics",
  "toolInput": { "timeRange": "7d" },
  "toolResult": {
    "structuredContent": {
      "visits": 4218,
      "conversions": 83,
      "bounceRate": 0.41
    }
  }
}

Write one simulation per meaningful UI state:

Happy path with realistic data.
Empty state with no records.
Error state from your backend.
Loading or partial input state if your app uses streamed tool input.
Each display mode that changes layout.
Each permission or auth state that changes available actions.

Those files run the same locally and in CI. They also give agents and teammates a clear map of the states your app claims to support.

Keep Live Tests Separate

Inspector tests are the default because they are fast and deterministic. Live tests still matter because real hosts can differ in auth, account setup, iframe policy, rollout state, and UI chrome. Run them separately:

  live:
    name: Live host smoke tests
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' || github.event_name == 'workflow_dispatch'

    steps:
      - uses: actions/checkout@v6
      - uses: pnpm/action-setup@v6
        with:
          version: 11
      - uses: actions/setup-node@v6
        with:
          node-version: 24
          cache: pnpm
          cache-dependency-path: pnpm-lock.yaml
      - run: pnpm install --frozen-lockfile
      - run: pnpm exec playwright install --with-deps chromium
      - name: Run live tests
        run: pnpm test:live
        env:
          SUNPEAK_LIVE_AUTH_STATE: ${{ secrets.SUNPEAK_LIVE_AUTH_STATE }}

Keep this job small. It should prove that the deployed or local server connects to the real host, a representative tool can be called, and the resource renders. Use inspector tests for the broad matrix because they are better suited to every PR.
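
If main is busy, a schedule can give live tests a slower cadence without tying them to every merge. The sketch below is one way to do that in a separate workflow file. The file name, the cron time, the timeout, and the concurrency group name are all assumptions; the steps are the same as in the job above.

```yaml
# .github/workflows/live.yml (hypothetical file name)
name: Live host smoke tests

on:
  workflow_dispatch:
  schedule:
    # 06:00 UTC, Monday to Friday. Scheduled runs use the default branch.
    - cron: '0 6 * * 1-5'

# Never run two live jobs at once, on the assumption that they
# share a single host session.
concurrency:
  group: live-host-tests
  cancel-in-progress: false

jobs:
  live:
    name: Live host smoke tests
    runs-on: ubuntu-latest
    # Fail fast if a real host hangs instead of burning runner minutes.
    timeout-minutes: 15

    steps:
      - uses: actions/checkout@v6
      - uses: pnpm/action-setup@v6
        with:
          version: 11
      - uses: actions/setup-node@v6
        with:
          node-version: 24
          cache: pnpm
          cache-dependency-path: pnpm-lock.yaml
      - run: pnpm install --frozen-lockfile
      - run: pnpm exec playwright install --with-deps chromium
      - name: Run live tests
        run: pnpm test:live
        env:
          SUNPEAK_LIVE_AUTH_STATE: ${{ secrets.SUNPEAK_LIVE_AUTH_STATE }}
```

Because this file has no pull_request trigger, the job needs no if: condition.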

Run Evals Only Where They Pay Off

Evals test whether models call the right MCP tools with the right arguments. They catch a different class of bug than unit or E2E tests:

Tool names that are too similar.
Descriptions that omit the user intent a model needs.
Input schemas that are too loose.
App context that makes a tool look safer or broader than it is.

Put evals in a gated job:

  eval:
    name: Multi-model tool evals
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' || github.event_name == 'workflow_dispatch'

    steps:
      - uses: actions/checkout@v6
      - uses: pnpm/action-setup@v6
        with:
          version: 11
      - uses: actions/setup-node@v6
        with:
          node-version: 24
          cache: pnpm
          cache-dependency-path: pnpm-lock.yaml
      - run: pnpm install --frozen-lockfile
      - name: Run evals
        run: pnpm test:eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GOOGLE_GENERATIVE_AI_API_KEY: ${{ secrets.GOOGLE_GENERATIVE_AI_API_KEY }}

Run this after changes to tool descriptions, tool names, schemas, auth hints, or model-visible app context. You do not need to spend credits because someone changed CSS.

Testing an Existing MCP Server

You do not need to rewrite your app into the sunpeak framework to use the testing layer. For an existing server, scaffold tests next to it:

npx sunpeak test init --server http://localhost:8000/mcp
npx sunpeak test

For a stdio server:

npx sunpeak test init --server "python server.py"

sunpeak creates a test harness that can connect to your server, render UI resources in the inspector, and run Playwright assertions. That makes CI/CD useful for Python, Go, Rust, TypeScript, and mixed-stack MCP servers.
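
In CI, an HTTP server has to be running before the tests connect to it. The job below is a rough sketch under several assumptions: python server.py stands in for whatever starts your server on port 8000, requirements.txt is a hypothetical dependency file, and the scaffolded test files are already committed. Whether you also need a dependency install step for the harness depends on what sunpeak test init generated in your repo.

```yaml
  existing-server:
    name: Existing MCP server tests
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v6

      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - uses: actions/setup-node@v6
        with:
          node-version: 24

      # Hypothetical dependency file; use whatever your server needs.
      - run: pip install -r requirements.txt

      # Background processes started here stay alive for later steps.
      - name: Start the MCP server
        run: python server.py &

      # Poll the TCP port so the tests do not race server startup.
      - name: Wait for port 8000
        run: |
          timeout 30 bash -c \
            'until (echo > /dev/tcp/localhost/8000) 2>/dev/null; do sleep 1; done'

      - run: npx playwright install --with-deps chromium
      - run: npx sunpeak test
```

For a stdio server, the start and wait steps would likely be unnecessary, since the command passed to --server is what launches the process.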

A Practical Release Gate

For most MCP App teams, this is enough:

Every pull request runs typecheck, lint, unit tests, and inspector E2E.
UI-heavy pull requests run visual regression tests.
Main branch runs live host smoke tests.
Tool schema changes run evals before release.
Failures upload Playwright reports, screenshots, traces, and test results.

This gives you fast feedback where it matters. The inspector catches the repeatable app and protocol failures. Visual tests catch layout drift. Live tests catch real host connection issues. Evals catch model selection problems.
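
If the jobs share one workflow, the gate itself can be expressed with needs. The sketch below assumes the job ids used earlier in this post (test, visual, live, eval); release-gate is a made-up job id, and the echo step is a placeholder for your real publish or deploy step.

```yaml
  release-gate:
    name: Release gate
    runs-on: ubuntu-latest
    # Starts only after every listed job has succeeded. If any of them
    # fails or is skipped, this job is skipped too.
    needs: [test, visual, live, eval]
    if: github.ref == 'refs/heads/main'

    steps:
      - run: echo "All required checks passed for ${{ github.sha }}"
```

A deploy or publish job can then list release-gate in its own needs instead of repeating the full set.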

That is the CI setup I would ship with every MCP App project from day one: deterministic tests on every pull request, real host checks where they add value, and no paid host accounts in the default loop.

Get Started

npx sunpeak new

Frequently Asked Questions

Do I need a ChatGPT or Claude account to run MCP App tests in GitHub Actions?

Not for the default CI path. sunpeak inspector tests run against replicated ChatGPT and Claude runtimes on the GitHub Actions runner, so they do not need host accounts, host credentials, model calls, or AI credits. Keep live host tests in a separate opt-in job because those do use real host sessions.

What should my default MCP App CI/CD workflow run?

Run typecheck, lint, unit tests, and inspector E2E tests on every pull request. Add visual regression tests when UI layout matters. Run live host tests and multi-model evals on main, release branches, or manual dispatch because they depend on external accounts or API keys.

How do I run MCP App end-to-end tests in GitHub Actions?

Install dependencies, install Playwright browsers with pnpm exec playwright install --with-deps chromium, then run pnpm test:e2e or pnpm test. The sunpeak Playwright config starts the inspector, connects to your MCP server, renders the app in host iframes, and shuts everything down after the suite.

Can I run MCP App tests against both ChatGPT and Claude in CI/CD?

Yes. The defineConfig() helper from sunpeak/test/config creates Playwright projects for supported hosts. One E2E spec can run against ChatGPT and Claude inspector runtimes, which catches host-specific layout, theme, display mode, and bridge behavior before release.

Should I cache pnpm and Playwright in GitHub Actions?

Yes. Use actions/setup-node with cache: pnpm and cache-dependency-path pointing at your pnpm-lock.yaml. For Playwright, either install Chromium every run with --with-deps, or cache ~/.cache/ms-playwright if browser download time is material for your repo.

How do I handle visual regression screenshots in CI?

Store approved baselines in the repository, run pnpm test:visual in CI, and upload Playwright reports plus test-results artifacts on failure. Update baselines only after reviewing the diff and confirming the UI change is intentional.

Should MCP App evals run on every pull request?

Usually no. Evals call real models, cost provider credits, and can be slower than inspector tests. Run them after tool schema, tool description, or app context changes, and gate the eval job to main, release branches, or manual dispatch.

What changed since earlier MCP App CI guides?

MCP Apps are now documented as a shared MCP extension rather than a ChatGPT-only pattern. OpenAI recommends the standard _meta.ui.resourceUri field for linking tools to UI resources, while keeping OpenAI-specific aliases for compatibility. CI should test the portable MCP App contract first, then add host-specific checks where your app uses host extensions.