
Overview

Evals test whether different LLMs call your tools correctly. They connect to your MCP server via the MCP protocol, discover tools, send prompts to multiple models (GPT-4o, Claude, Gemini, etc.), and assert that each model calls the right tools with the right arguments. Each eval case runs multiple times per model to measure reliability across non-deterministic LLM responses.

Evals work with any MCP server. For sunpeak framework projects, the dev server starts automatically; for standalone use, point to your running server.

Evals are not included in the default sunpeak test run because they cost money (API credits). Run them explicitly with --eval.

Prerequisites

  • Vercel AI SDK: pnpm add ai
  • Provider packages: Install only the ones you need:
    • pnpm add @ai-sdk/openai for GPT-4o, GPT-4o-mini, o4-mini
    • pnpm add @ai-sdk/anthropic for Claude Sonnet, Claude Haiku
    • pnpm add @ai-sdk/google for Gemini 2.0 Flash
  • API keys: Set in tests/evals/.env (gitignored) or as environment variables

Setup

Evals are scaffolded automatically by sunpeak new and sunpeak test init. The directory structure:
tests/evals/
├── eval.config.ts      # Model list, run count, defaults
├── .env                # API keys (gitignored)
├── .env.example        # Template showing required keys
└── *.eval.ts           # Eval spec files

Configuration

// tests/evals/eval.config.ts
import { defineEvalConfig } from 'sunpeak/eval';

export default defineEvalConfig({
  // Omit server for sunpeak projects (auto-detected and auto-started).
  // For non-sunpeak projects:
  // server: 'http://localhost:8000/mcp',

  models: [
    'gpt-4o',                      // OPENAI_API_KEY
    'claude-sonnet-4-20250514',    // ANTHROPIC_API_KEY
    'gemini-2.0-flash',            // GOOGLE_GENERATIVE_AI_API_KEY
  ],

  defaults: {
    runs: 10,          // Number of times to run each case per model
    maxSteps: 1,       // Max tool call steps per run
    temperature: 0,    // 0 for most deterministic results
    timeout: 30_000,   // Timeout per run in ms
  },
});
API keys are loaded automatically from tests/evals/.env. Copy .env.example to .env and fill in your keys:
# tests/evals/.env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_GENERATIVE_AI_API_KEY=...

Writing Evals

Each eval file exports a defineEval with an array of cases. Each case has a prompt and an expectation for which tool gets called:
// tests/evals/albums.eval.ts
import { expect } from 'vitest';
import { defineEval } from 'sunpeak/eval';

export default defineEval({
  cases: [
    {
      name: 'asks for photo albums',
      prompt: 'Show me my photo albums',
      expect: { tool: 'show-albums' },
    },
    {
      name: 'asks for food photos',
      prompt: 'Show me photos from my Austin pizza tour',
      expect: {
        tool: 'show-albums',
        args: { search: expect.stringMatching(/pizza|austin/i) },
      },
    },
  ],
});

Assertion Levels

There are three ways to check results.

Single tool checks that the first tool call matches:
expect: {
  tool: 'show-albums',
  args: { category: expect.stringMatching(/travel/i) },
}
Ordered sequence checks multi-step tool call order:
maxSteps: 3,
expect: [
  { tool: 'review-post' },
  { tool: 'publish-post' },
],
Custom function gives you full access to the result:
assert: (result) => {
  expect(result.toolCalls).toHaveLength(1);
  expect(result.toolCalls[0].name).toBe('show-albums');
},
Args use partial matching. Extra keys in the actual tool call are allowed. Vitest asymmetric matchers (expect.stringMatching, expect.arrayContaining, etc.) work in args expectations.
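The partial-matching rule above can be pictured as follows. This is a minimal sketch of the semantics, not sunpeak's actual implementation; the helper name matchesPartial is hypothetical, and a RegExp stands in for an asymmetric matcher:

```typescript
// Sketch of partial args matching: every expected key must match the
// actual tool call, but extra keys in the actual call are ignored.
// (Hypothetical helper for illustration; a RegExp stands in for a
// matcher like expect.stringMatching.)
function matchesPartial(
  expected: Record<string, unknown>,
  actual: Record<string, unknown>,
): boolean {
  return Object.entries(expected).every(([key, want]) => {
    const got = actual[key];
    if (want instanceof RegExp) {
      return typeof got === 'string' && want.test(got);
    }
    return got === want;
  });
}

// The extra 'limit' key in the actual call does not cause a failure.
console.log(matchesPartial(
  { search: /pizza|austin/i },
  { search: 'Austin pizza tour', limit: 20 },
)); // true
```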

Running Evals

sunpeak test --eval                    # Run all evals
sunpeak test --eval albums             # Filter by name
sunpeak test --eval --unit             # Run evals + unit tests

Output

Each case runs N times per model. The reporter shows pass/fail counts:
tests/evals/albums.eval.ts
  asks for photo albums
    gpt-4o           10/10 passed (100%)  avg 1.2s
    claude-sonnet    9/10 passed  (90%)   avg 0.8s
    gemini-flash     6/10 passed  (60%)   avg 0.9s
      └ failures: called 'get-photos' instead of 'show-albums' (4x)

Summary: 25/30 passed (83%) across 3 models

Per-Eval Overrides

Individual eval files can override the global model list, run count, or pass threshold:
export default defineEval({
  models: ['gpt-4o'],       // Only test this model
  runs: 5,                  // Override default run count
  threshold: 0.8,           // Pass at 80% instead of 100%
  cases: [/* ... */],
});
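The threshold gates each case's aggregated result per model. A minimal sketch of the assumed semantics (based on "pass at 80% instead of 100%"; the default requires every run to pass):

```typescript
// Sketch of the assumed threshold semantics: a case passes for a model
// when the fraction of passing runs meets the threshold. The default
// of 1.0 means all runs must pass.
function casePasses(passed: number, runs: number, threshold = 1.0): boolean {
  return passed / runs >= threshold;
}
```

With threshold: 0.8, a 9/10 result like claude-sonnet's in the sample output would pass, while gemini-flash's 6/10 would still fail.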