MCP App Evals: How to Test Tool Calling Across GPT-4o, Claude, and Gemini (May 2026)

May 4, 2026 Abe Wheeler

Test whether GPT-4o, Claude, and Gemini call your MCP App tools correctly with evals.

Unit tests tell you your component renders correctly. E2E tests tell you your app works inside a simulated host. Neither tells you whether GPT-4o, Claude, or Gemini can actually find and call your tools when a user asks a question.

That’s what evals are for. Evals send real prompts to real models, then check whether each model calls the right tool with the right arguments. They’re the only test type that exercises the full path from user intent to tool invocation, which makes them the only way to catch tool naming bugs, description gaps, and schema ambiguity before your users do.

TL;DR: Write evals in tests/evals/*.eval.ts using defineEval() from sunpeak/eval. Configure models in tests/evals/eval.config.ts with defineEvalConfig(). Run with pnpm test:eval. Each case runs multiple times per model to produce statistical pass rates (e.g. “GPT-4o 10/10, Gemini 6/10”). Evals cost API credits, so run them explicitly and gate them to the main branch in CI/CD.

What Evals Test That Other Tests Don’t

MCP Apps expose tools to AI models through the MCP protocol. When a user says “show me my photo albums,” the model needs to pick the right tool from your schema and call it with sensible arguments. This is not deterministic. Different models interpret tool schemas differently, and the same model can make different choices on repeated attempts.

Your existing tests cover the layers below this:

Unit tests verify your tool handler returns the right structuredContent for a given input
Integration tests verify the MCP protocol correctly routes tool calls
E2E tests verify your resource component renders correctly in simulated hosts

Evals test the layer none of these reach: does the model understand your tool well enough to call it correctly?

Here’s a concrete example. Say you have two tools: get-photos (returns raw photo data) and show-albums (returns organized album views). A user asks “show me my photo albums.” GPT-4o might call show-albums every time. Gemini might call get-photos 40% of the time because the word “photo” in the prompt matches the get-photos tool name, and a model may weigh that name match more heavily than the description. Without evals, you’d only discover this after shipping.

Setting Up Evals

Evals come scaffolded when you create a new project with npx sunpeak new or add testing to an existing project with npx sunpeak test init. The structure looks like this:

tests/
  evals/
    .env                 # API keys (gitignored)
    eval.config.ts       # Model configuration
    albums.eval.ts       # Your eval specs

Add API keys for each provider you want to test against:

# tests/evals/.env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_GENERATIVE_AI_API_KEY=AIza...

This file is gitignored by default. In CI/CD, these become GitHub Actions secrets.

Writing Your First Eval

An eval spec defines cases. Each case has a prompt (what the user says), an expected tool (which tool the model should call), and optional expected arguments.

// tests/evals/albums.eval.ts
import { expect } from 'vitest';
import { defineEval } from 'sunpeak/eval';

export default defineEval({
  cases: [
    {
      name: 'asks for photo albums',
      prompt: 'Show me my photo albums',
      expect: {
        tool: 'show-albums',
      },
    },
    {
      name: 'asks for a specific album',
      prompt: 'Show me photos from my vacation in Italy',
      expect: {
        tool: 'show-albums',
        args: {
          query: expect.stringMatching(/italy/i),
        },
      },
    },
    {
      name: 'asks to search all photos',
      prompt: 'Find all photos of sunsets',
      expect: {
        tool: 'get-photos',
        args: {
          search: expect.stringMatching(/sunset/i),
        },
      },
    },
  ],
});

The prompt field is the message sent to each model. The expect.tool field is the tool name you expect the model to call. The expect.args field (optional) checks that the model passes the right arguments. Use expect.stringMatching() from Vitest for fuzzy matching when you don’t need an exact string.
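To make the matching rules concrete, here is a minimal sketch of how a recorded tool call could be compared against an expect block. It is not sunpeak's implementation: the ToolCall and Expectation types and the matchesExpectation function are hypothetical names, plain RegExp values stand in for expect.stringMatching(), and the choice to ignore arguments that the expectation does not list is an assumption.

```typescript
// Hypothetical sketch, not sunpeak's implementation.
type ArgMatcher = string | number | boolean | RegExp;

interface ToolCall {
  tool: string | null; // null: the model replied without calling a tool
  args: Record<string, unknown>;
}

interface Expectation {
  tool: string | null;
  args?: Record<string, ArgMatcher>;
}

function matchesExpectation(call: ToolCall, expected: Expectation): boolean {
  // The tool name must match exactly (or both must be null).
  if (call.tool !== expected.tool) return false;
  const expectedArgs = expected.args;
  if (expectedArgs === undefined) return true;

  // Assumption: only the listed keys are checked; extra arguments are ignored.
  return Object.keys(expectedArgs).every((key) => {
    const matcher = expectedArgs[key];
    const actual = call.args[key];
    if (matcher instanceof RegExp) {
      // Fuzzy match: the argument must be a string that the pattern accepts.
      return typeof actual === 'string' && matcher.test(actual);
    }
    // Literal match: strict equality on primitives.
    return actual === matcher;
  });
}
```

Under these assumptions, a show-albums call with query "Italy vacation 2024" satisfies { tool: 'show-albums', args: { query: /italy/i } }, while a get-photos call fails the same expectation on the tool name alone.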

Run it:

pnpm test:eval

Evals are not included in the default pnpm test command because they cost real money (API credits for each model call). You opt in explicitly.

Configuring Models

The eval config file controls which models run and how many times each case repeats.

// tests/evals/eval.config.ts
import { defineEvalConfig } from 'sunpeak/eval';

export default defineEvalConfig({
  models: ['gpt-4o', 'claude-sonnet-4-20250514', 'gemini-2.0-flash'],
  defaults: {
    runs: 10,
    temperature: 0,
  },
});

runs sets how many times each case runs per model. LLM responses are non-deterministic, so a single pass proves nothing. Ten runs gives you a meaningful pass rate. If you’re on a budget during development, drop to 3-5 runs and bump to 10+ for CI.
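The trade-off between run count, confidence, and cost can be worked out with simple arithmetic. The sketch below assumes each run fails independently with a fixed probability and that each run costs one model request. Both are simplifications, and the function names are illustrative rather than part of sunpeak.

```typescript
// Chance that a flaky case passes every run, assuming independent runs with
// a fixed per-run failure rate (a simplification of real model behavior).
function probabilityAllRunsPass(failureRate: number, runs: number): number {
  if (failureRate < 0 || failureRate > 1) {
    throw new RangeError('failureRate must be between 0 and 1');
  }
  if (runs < 1 || Math.floor(runs) !== runs) {
    throw new RangeError('runs must be a positive integer');
  }
  return Math.pow(1 - failureRate, runs);
}

// Total model requests for one eval run, assuming one request per attempt.
function totalModelCalls(cases: number, models: number, runs: number): number {
  return cases * models * runs;
}
```

For a case that fails 40% of the time, a single run passes about 60% of the time, three runs all pass about 22% of the time, and ten runs all pass less than 1% of the time. With three cases and three models, ten runs means 90 requests per eval run and three runs means 27.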

temperature: 0 makes responses as deterministic as possible. You still get variation (most APIs don’t guarantee identical outputs at temperature 0), but it reduces noise.

You don’t need all three providers. Start with the model your primary host uses (GPT-4o for ChatGPT Apps, Claude for Claude Connectors) and add others later.
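Because each configured model needs its provider's key, a small preflight check can flag missing keys before any credits are spent. This is a hypothetical helper, not a sunpeak feature: the missingKeys name and the prefix-to-variable mapping are assumptions based on the model names and .env variables used in this post.

```typescript
// Assumed mapping from model-name prefix to the variable in tests/evals/.env.
const PROVIDER_KEYS: Array<{ prefix: string; envVar: string }> = [
  { prefix: 'gpt-', envVar: 'OPENAI_API_KEY' },
  { prefix: 'claude-', envVar: 'ANTHROPIC_API_KEY' },
  { prefix: 'gemini-', envVar: 'GOOGLE_GENERATIVE_AI_API_KEY' },
];

// Returns the env variable names that the given models need but env lacks.
function missingKeys(
  models: string[],
  env: Record<string, string | undefined>,
): string[] {
  const missing: string[] = [];
  for (const model of models) {
    for (const provider of PROVIDER_KEYS) {
      const matchesProvider = model.indexOf(provider.prefix) === 0;
      const keyIsSet = Boolean(env[provider.envVar]);
      const alreadyListed = missing.indexOf(provider.envVar) !== -1;
      if (matchesProvider && !keyIsSet && !alreadyListed) {
        missing.push(provider.envVar);
      }
    }
  }
  return missing;
}
```

In a Node script you could pass process.env as the second argument and exit early when the returned list is not empty. Models with an unrecognized prefix are ignored by this sketch.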

Reading Eval Results

Eval output looks like this:

tests/evals/albums.eval.ts
  asks for photo albums
    gpt-4o           10/10 passed (100%)  avg 1.2s
    claude-sonnet    9/10 passed  (90%)   avg 0.8s
    gemini-flash     6/10 passed  (60%)   avg 0.9s
      └ failures: called 'get-photos' instead of 'show-albums' (4x)

Each line shows the model name, pass count, pass rate, and average latency. When a model fails, the output tells you what it called instead. This is the information you need to fix the problem.

In this example, Gemini called get-photos instead of show-albums four times out of ten. The fix might be renaming get-photos to something less ambiguous like search-photos, improving the show-albums tool description to mention “albums” more clearly, or adding an example prompt to the tool description.
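The pass counts and failure lines in that output can be reproduced with a simple tally. The sketch below is an illustration of the idea, not sunpeak's reporter: summarizeRuns and describeFailures are hypothetical names, and a run counts as a pass on the tool name alone (argument checks are left out to keep it short).

```typescript
interface CaseSummary {
  passed: number;
  total: number;
  passRate: number; // fraction between 0 and 1
  wrongCalls: Record<string, number>; // tool name -> number of failed runs
}

// calledTools holds the tool each run called, or null when no tool was called.
function summarizeRuns(
  expectedTool: string,
  calledTools: Array<string | null>,
): CaseSummary {
  const wrongCalls: Record<string, number> = {};
  let passed = 0;
  for (const called of calledTools) {
    if (called === expectedTool) {
      passed += 1;
      continue;
    }
    const label = called === null ? '(no tool call)' : called;
    wrongCalls[label] = (wrongCalls[label] ?? 0) + 1;
  }
  const total = calledTools.length;
  const passRate = total === 0 ? 0 : passed / total;
  return { passed, total, passRate, wrongCalls };
}

// One line per wrong tool, in the style of the failure output above.
function describeFailures(expectedTool: string, summary: CaseSummary): string[] {
  return Object.keys(summary.wrongCalls).map(
    (tool) =>
      `called '${tool}' instead of '${expectedTool}' (${summary.wrongCalls[tool]}x)`,
  );
}
```

Feeding in six show-albums calls and four get-photos calls yields a pass rate of 0.6 and the single failure line "called 'get-photos' instead of 'show-albums' (4x)".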

Common Eval Patterns

Testing argument extraction

Some tools require the model to extract structured data from a natural language prompt. Test that the model parses arguments correctly:

{
  name: 'extracts date range from prompt',
  prompt: 'Show me sales data from January to March 2026',
  expect: {
    tool: 'show-sales',
    args: {
      startDate: expect.stringMatching(/2026-01/),
      endDate: expect.stringMatching(/2026-03/),
    },
  },
}

Testing tool disambiguation

When you have tools with overlapping names or descriptions, write evals that specifically target the overlap:

{
  name: 'prefers show-albums over get-photos for album requests',
  prompt: 'I want to see my photo albums',
  expect: { tool: 'show-albums' },
},
{
  name: 'prefers get-photos for single photo searches',
  prompt: 'Find the photo I took at the Eiffel Tower',
  expect: { tool: 'get-photos' },
}

Testing edge cases in prompts

Models sometimes struggle with ambiguous or unusual phrasing. Test prompts that real users would type but that might confuse the model:

{
  name: 'handles vague prompt',
  prompt: 'albums',
  expect: { tool: 'show-albums' },
},
{
  name: 'handles prompt with typo',
  prompt: 'Show me my phto albms',
  expect: { tool: 'show-albums' },
}

Testing that the model does not call a tool

Sometimes you want to verify that a prompt does not trigger a tool call. For example, if the user asks a general question unrelated to your tools:

{
  name: 'does not call a tool for unrelated question',
  prompt: 'What is the capital of France?',
  expect: { tool: null },
}

Fixing Failed Evals

When a model consistently fails an eval case, the problem is almost always in your tool schema, not in the model. Here’s how to diagnose and fix common failures.

Model calls the wrong tool. Your tool names or descriptions are ambiguous. Look at the tool the model called instead and ask: why would the model think that tool was a better match? Rename tools to be more specific, add a clearer first sentence to the description, or remove overlapping language between tool descriptions.

Model passes the wrong arguments. Your argument schema is too loose. Add enum constraints where possible, use descriptive parameter names (e.g. cityName instead of q), and add a description field to each parameter in your tool schema.

Model calls the right tool on GPT-4o but not on Gemini. Different models weight tool names, descriptions, and parameter schemas differently. If one model fails, try rewording the tool description to be more explicit. Shorter, more direct descriptions tend to work better across models than long, detailed ones.

Pass rate is around 50%. The model is guessing between two tools. This usually means two tools have overlapping descriptions or the prompt is genuinely ambiguous. Rewrite the descriptions so each tool’s purpose is clearly distinct in the first sentence.

After making changes, re-run pnpm test:eval and compare pass rates. Incremental improvement is normal. Going from 60% to 90% on a tricky case is a good result.
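As an illustration of the schema advice above, the sketch below contrasts a loose tool definition with a tighter one and adds a tiny lint helper. The name, description, and inputSchema fields follow the MCP tool definition shape. Everything else is hypothetical: the two example tools, the schemaWarnings helper, and its three-character threshold for a "descriptive" parameter name are assumptions made for this example.

```typescript
interface ParamSchema {
  type: string;
  description?: string;
  enum?: string[];
}

interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: {
    type: 'object';
    properties: Record<string, ParamSchema>;
    required?: string[];
  };
}

// Loose: cryptic parameter names, no descriptions, no constraints.
const looseTool: ToolDefinition = {
  name: 'get-weather',
  description: 'Weather.',
  inputSchema: {
    type: 'object',
    properties: { q: { type: 'string' }, u: { type: 'string' } },
  },
};

// Tighter: descriptive names, per-parameter descriptions, an enum for units.
const tightTool: ToolDefinition = {
  name: 'show-weather',
  description: 'Shows the current weather for a city.',
  inputSchema: {
    type: 'object',
    properties: {
      cityName: { type: 'string', description: 'City to look up, e.g. "Lisbon".' },
      units: {
        type: 'string',
        enum: ['celsius', 'fahrenheit'],
        description: 'Temperature units for the result.',
      },
    },
    required: ['cityName'],
  },
};

// Flags parameters with very short names or without a description.
function schemaWarnings(tool: ToolDefinition): string[] {
  const warnings: string[] = [];
  for (const name of Object.keys(tool.inputSchema.properties)) {
    const param = tool.inputSchema.properties[name];
    if (param === undefined) continue;
    if (name.length < 3) warnings.push(`${name}: name is too short to be descriptive`);
    if (!param.description) warnings.push(`${name}: missing description`);
  }
  return warnings;
}
```

With these definitions, schemaWarnings reports four warnings for the loose tool and none for the tight one. A check like this does not replace evals, but it can catch the most obvious schema gaps before you spend credits on a run.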

Running Evals in CI/CD

Evals cost API credits per run, so you don’t want them on every push to every branch. Gate them to main branch merges in a separate GitHub Actions job:

eval:
  name: Multi-Model Evals
  runs-on: ubuntu-latest
  if: github.ref == 'refs/heads/main'
  steps:
    - uses: actions/checkout@v5
    - uses: pnpm/action-setup@v4
    - uses: actions/setup-node@v4
      with:
        node-version: 22
        cache: pnpm
    - run: pnpm install --frozen-lockfile
    - name: Run evals
      run: pnpm test:eval
      env:
        OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        GOOGLE_GENERATIVE_AI_API_KEY: ${{ secrets.GOOGLE_GENERATIVE_AI_API_KEY }}

Store API keys as repository secrets in GitHub. The job only runs on pushes to main, so feature branch pushes skip the cost entirely. If an eval fails on main, it shows up in your Actions tab with the same pass/fail output you see locally.

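For teams that want eval feedback before merging, add a manual trigger (workflow_dispatch) so developers can run evals on demand from any branch without making them automatic. The job's if condition must then also allow manual runs, for example github.ref == 'refs/heads/main' || github.event_name == 'workflow_dispatch'. Otherwise a manual run from a feature branch skips the job.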

Where Evals Fit in the Testing Pyramid

Evals sit above E2E tests in the testing pyramid for MCP Apps, ChatGPT Apps, and Claude Connectors:

1. Unit tests (fast, free). Test component rendering and tool handler logic in isolation.
2. Integration tests (fast, free). Test MCP protocol routing with the mcp fixture.
3. E2E tests (medium speed, free). Test full app rendering in simulated hosts with the inspector fixture.
4. Visual regression tests (medium speed, free). Compare screenshots across hosts, themes, and display modes.
5. Evals (medium speed, costs API credits). Test whether real models call the right tools with the right arguments.
6. Live tests (slow, requires accounts). Test against real ChatGPT and Claude for final validation.

Run layers 1-4 on every push. Run evals on main branch merges. Run live tests before major releases.

Most bugs live in layers 1-3. Evals catch a specific category of bug that nothing else can: “my tools work perfectly, but the model can’t figure out which one to use.” That category matters because it directly affects how users experience your app. If a model calls the wrong tool, the user sees wrong or broken output regardless of how well your code works.

When to Write Evals

Write evals when:

You add a new tool and want to confirm models can find and call it
You have two or more tools with related names or descriptions
You rename a tool or change its description
Users report that the AI “doesn’t do the right thing” with your app
You add support for a new LLM host and want to verify tool calling works on that model

Skip evals when you only have one tool with an obvious name. A single tool called show-weather with a description of “Shows the current weather for a city” is unlikely to confuse any model. Evals pay for themselves when ambiguity is possible.

sunpeak scaffolds eval files when you create a new project (npx sunpeak new) or add testing (npx sunpeak test init). The eval infrastructure connects to your MCP server, discovers tools via the MCP protocol, and handles all the model API calls. You just write the cases.

Frequently Asked Questions

What are MCP App evals?

Evals are tests that send natural language prompts to LLMs and verify that each model calls the correct MCP tool with the correct arguments. Unlike unit or E2E tests that mock tool data, evals exercise the full loop: your tool schema reaches the model, the model interprets it, and the model makes a tool call. Evals run each case multiple times per model to produce statistical pass/fail rates, because LLM responses are non-deterministic.

Why do I need evals if I already have unit and E2E tests?

Unit tests verify your tool handler logic. E2E tests verify your UI renders correctly in the host. Neither tests whether an LLM can actually understand your tool schema and call the right tool. A tool named get-photos and a tool named show-albums might both make sense to you, but a model such as Gemini might confuse them 40% of the time. Evals catch tool naming ambiguity, description gaps, and argument mismatches that no other test type can detect.

How do I run evals for my MCP App?

Run pnpm test:eval. This connects to your MCP server, discovers your tools via the MCP protocol, sends prompts to each configured model, and asserts that each model calls the expected tool with the expected arguments. Evals are not included in the default pnpm test run because they cost API credits. You opt in explicitly.

What models can I eval MCP App tools against?

Any model that supports tool calling. Common choices are GPT-4o (OpenAI), Claude Sonnet (Anthropic), and Gemini 2.0 Flash (Google). Configure models in tests/evals/eval.config.ts using defineEvalConfig. You need an API key for each provider, stored in tests/evals/.env.

How many times should each eval case run per model?

Aim for at least 10 runs per case per model in CI. During development, 3-5 runs is a reasonable way to save credits. LLM responses are non-deterministic, so a single pass does not prove reliability, while 10 runs gives you a meaningful pass rate. If a tool passes 10/10 on GPT-4o but 6/10 on Gemini, you know the tool description needs work for that model. Set the runs count in defineEvalConfig defaults or per-case.

How do I test tool arguments in MCP App evals?

Use the expect.args field in your eval case. For exact matches, pass a literal object. For flexible matching, use expect.stringMatching(/pattern/i) from Vitest to match arguments with regex patterns. This is useful when the model might format an argument differently (e.g. "New York" vs "new york") but the tool still works.

Should I run MCP App evals in CI/CD?

Yes, but only on main or release branches. Evals cost API credits per run, so running them on every push to every feature branch gets expensive. Put evals in a separate GitHub Actions job gated to main branch pushes. Store API keys as GitHub Actions secrets (OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_GENERATIVE_AI_API_KEY).

What do I do when a model fails an eval?

Read the failure output to see what the model actually called instead of what you expected. Common fixes: rename the tool to be less ambiguous, improve the tool description, add examples to the description, constrain the argument schema with enums or patterns, or split a tool that does too many things into separate tools. Re-run the eval after each change to measure improvement.