Overview
Evals test whether different LLMs call your tools correctly. They connect to your MCP server over the MCP protocol, discover its tools, send prompts to multiple models (GPT-4o, Claude, Gemini, etc.), and assert that each model calls the right tools with the right arguments. Eval cases can also include App Context for follow-up prompts that depend on model-visible UI state. Because LLM responses are non-deterministic, each eval case runs multiple times per model to measure reliability.
Evals work with any MCP server. For sunpeak framework projects, the dev server starts automatically. For standalone use, set the server option in eval.config.ts to the URL of your running server.
Evals are not included in the default sunpeak test run because they cost money (API credits). Run them explicitly with --eval.
Prerequisites
- Vercel AI SDK: install the ai package
- Provider packages: install only the ones you need:
  - @ai-sdk/openai for GPT-4o, GPT-4o-mini, o4-mini
  - @ai-sdk/anthropic for Claude Sonnet
  - @ai-sdk/google for Gemini 2.0 Flash
- API keys: set in tests/evals/.env (gitignored) or as environment variables
Setup
Evals are scaffolded automatically by sunpeak new and sunpeak test init. Both commands create the following directory structure:
tests/evals/
├── eval.config.ts # Model list, run count, defaults
├── .env # API keys (gitignored)
├── .env.example # Template showing required keys
└── *.eval.ts # Eval spec files
Configuration
// tests/evals/eval.config.ts
import { defineEvalConfig } from 'sunpeak/eval';
export default defineEvalConfig({
// Omit server for sunpeak projects (auto-detected and auto-started).
// For non-sunpeak projects:
// server: 'http://localhost:8000/mcp',
models: [
'gpt-4o', // OPENAI_API_KEY
'claude-sonnet-4-20250514', // ANTHROPIC_API_KEY
'gemini-2.0-flash', // GOOGLE_GENERATIVE_AI_API_KEY
],
defaults: {
runs: 10, // Number of times to run each case per model
maxSteps: 1, // Max tool call steps per run
temperature: 0, // 0 for most deterministic results
timeout: 30_000, // Timeout per run in ms
},
});
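Because every case runs once per model per run, the number of billed model requests grows multiplicatively with the settings above. The following is a rough sketch for estimating an upper bound before running evals. The helper maxModelRequests and the EvalBudget type are hypothetical and not part of sunpeak, and the estimate assumes at most one model request per step.

```typescript
// Hypothetical helper, not part of sunpeak: estimates an upper bound on
// model requests for one eval run, assuming one request per step.
interface EvalBudget {
  cases: number;    // total eval cases across all *.eval.ts files
  models: number;   // entries in the `models` array
  runs: number;     // `defaults.runs`
  maxSteps: number; // `defaults.maxSteps`
}

function maxModelRequests({ cases, models, runs, maxSteps }: EvalBudget): number {
  return cases * models * runs * maxSteps;
}

// 3 cases x 3 models x 10 runs x 1 step = 90 requests at most
const requests = maxModelRequests({ cases: 3, models: 3, runs: 10, maxSteps: 1 });
```

Lowering runs or trimming the models list while iterating on tool descriptions keeps this number, and the API bill, small.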
API keys are loaded automatically from tests/evals/.env. Copy .env.example to .env and fill in your keys:
# tests/evals/.env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_GENERATIVE_AI_API_KEY=...
Writing Evals
Each eval file default-exports the result of defineEval, which takes an array of cases. Each case has a prompt, an optional App Context, and an expectation for which tool gets called:
// tests/evals/albums.eval.ts
import { expect } from 'vitest';
import { defineEval } from 'sunpeak/eval';
export default defineEval({
cases: [
{
name: 'asks for photo albums',
prompt: 'Show me my photo albums',
expect: { tool: 'show-albums' },
},
{
name: 'asks for food photos',
prompt: 'Show me photos from my Austin pizza tour',
expect: {
tool: 'show-albums',
args: { search: expect.stringMatching(/pizza|austin/i) },
},
},
{
name: 'uses selected app state for follow-up',
prompt: 'Book this one',
appContext: {
structuredContent: {
selectedFlight: { carrier: 'delta', flightNumber: 'DL123' },
},
},
expect: {
tool: 'book-flight',
args: { carrier: 'delta' },
},
},
],
});
Use appContext when you need to test a follow-up turn that depends on state shared by the rendered MCP App, such as the selected flight, focused row, current cart, or active review decision. The shape matches updateModelContext: pass structuredContent for JSON state and content for model-visible content blocks. Each eval run exposes that context to the model before the prompt.
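The example earlier only sets structuredContent. As a rough illustration of combining both fields, the sketch below adds a content entry alongside the JSON state. The local types AppContextSketch and TextBlock are assumptions made for this example: the text block shape follows the usual MCP text content block, and sunpeak's own exported types may differ.

```typescript
// Illustrative local types; not imported from sunpeak.
type TextBlock = { type: 'text'; text: string };

interface AppContextSketch {
  structuredContent?: Record<string, unknown>; // JSON state shared by the app
  content?: TextBlock[];                       // model-visible content blocks
}

// State a flight-picker app might share after the user selects a row.
const appContext: AppContextSketch = {
  structuredContent: {
    selectedFlight: { carrier: 'delta', flightNumber: 'DL123' },
  },
  content: [
    { type: 'text', text: 'The user has selected Delta flight DL123.' },
  ],
};
```

An object like this would go in the appContext field of a case, next to prompt and expect.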
Assertion Levels
There are three ways to check results:
A single-tool expectation checks that the first tool call matches:
expect: {
tool: 'show-albums',
args: { category: expect.stringMatching(/travel/i) },
}
An ordered sequence checks the order of tool calls across multiple steps:
maxSteps: 3,
expect: [
{ tool: 'review-post' },
{ tool: 'publish-post' },
],
A custom assert function gives you full access to the result:
assert: (result) => {
expect(result.toolCalls).toHaveLength(1);
expect(result.toolCalls[0].name).toBe('show-albums');
},
Args use partial matching: every key listed in the expectation must match, and extra keys in the actual tool call are allowed. Vitest asymmetric matchers (expect.stringMatching, expect.arrayContaining, etc.) work in args expectations.
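To make the partial-matching rule concrete, here is a minimal self-contained sketch of the idea. It is not sunpeak's implementation: partialMatch is a hypothetical function, and a plain RegExp stands in for an asymmetric matcher such as expect.stringMatching.

```typescript
// True for non-null, non-array, non-RegExp objects.
function isPlainObject(v: unknown): v is Record<string, unknown> {
  return typeof v === 'object' && v !== null && !Array.isArray(v) && !(v instanceof RegExp);
}

// Hypothetical sketch: every expected key must match; extra actual keys are ignored.
function partialMatch(expected: unknown, actual: unknown): boolean {
  if (expected instanceof RegExp) {
    // Stand-in for expect.stringMatching(...)
    return typeof actual === 'string' && expected.test(actual);
  }
  if (Array.isArray(expected)) {
    if (!Array.isArray(actual) || actual.length !== expected.length) return false;
    for (let i = 0; i < expected.length; i++) {
      if (!partialMatch(expected[i], actual[i])) return false;
    }
    return true;
  }
  if (isPlainObject(expected)) {
    if (!isPlainObject(actual)) return false;
    for (const key of Object.keys(expected)) {
      if (!partialMatch(expected[key], actual[key])) return false;
    }
    return true;
  }
  return Object.is(expected, actual);
}

const actualArgs = { search: 'Austin pizza tour', limit: 20 };
const extraKeyIgnored = partialMatch({ search: /pizza|austin/i }, actualArgs); // true
const wrongValue = partialMatch({ search: /tacos/i }, actualArgs);             // false
const missingKey = partialMatch({ category: 'food' }, actualArgs);             // false
```

The limit key never appears in the expectation, so it does not affect the result. Only keys you list are checked.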
Running Evals
# pnpm
pnpm test:eval # Run all evals
pnpm test:eval -- albums # Filter by name
pnpm test:eval -- --unit # Run evals + unit tests
# npm
npm run test:eval # Run all evals
npm run test:eval -- albums # Filter by name
npm run test:eval -- --unit # Run evals + unit tests
# yarn
yarn test:eval # Run all evals
yarn test:eval albums # Filter by name
yarn test:eval --unit # Run evals + unit tests
Output
Each case runs N times per model. The reporter shows pass/fail counts:
tests/evals/albums.eval.ts
asks for photo albums
gpt-4o 10/10 passed (100%) avg 1.2s
claude-sonnet 9/10 passed (90%) avg 0.8s
gemini-flash 6/10 passed (60%) avg 0.9s
└ failures: called 'get-photos' instead of 'show-albums' (4x)
Summary: 25/30 passed (83%) across 3 models
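The summary line is the sum of the per-model counts. The sketch below reproduces that arithmetic for the sample report above. The ModelResult type and summarize function are hypothetical names for illustration, not sunpeak exports, and rounding to the nearest whole percent is an assumption.

```typescript
interface ModelResult {
  model: string;
  passed: number; // runs that met the expectation
  runs: number;   // total runs for this model
}

// Hypothetical helper: adds up per-model counts into one overall pass rate.
function summarize(results: ModelResult[]): { passed: number; total: number; percent: number } {
  const passed = results.reduce((sum, r) => sum + r.passed, 0);
  const total = results.reduce((sum, r) => sum + r.runs, 0);
  const percent = total === 0 ? 0 : Math.round((passed / total) * 100);
  return { passed, total, percent };
}

const summary = summarize([
  { model: 'gpt-4o', passed: 10, runs: 10 },
  { model: 'claude-sonnet', passed: 9, runs: 10 },
  { model: 'gemini-flash', passed: 6, runs: 10 },
]);
// summary: { passed: 25, total: 30, percent: 83 }
```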
Per-Eval Overrides
Individual eval files can override the global model list, run count, or pass threshold:
export default defineEval({
models: ['gpt-4o'], // Only test this model
runs: 5, // Override default run count
threshold: 0.8, // Pass at 80% instead of 100%
cases: [/* ... */],
});
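As a rough sketch of what a threshold means in practice, the check below compares a pass rate against it. The function meetsThreshold is hypothetical, and it assumes the threshold is applied to the pass rate of a single model on a single case. The source does not spell out that granularity.

```typescript
// Hypothetical helper: true when the observed pass rate reaches the threshold.
// The default of 1 mirrors the 100% default mentioned in the comment above.
function meetsThreshold(passed: number, runs: number, threshold = 1): boolean {
  if (runs <= 0) throw new RangeError('runs must be positive');
  return passed / runs >= threshold;
}

const sonnetAtEighty = meetsThreshold(9, 10, 0.8); // true: 0.9 >= 0.8
const geminiAtEighty = meetsThreshold(6, 10, 0.8); // false: 0.6 < 0.8
const sonnetAtDefault = meetsThreshold(9, 10);     // false: 0.9 < 1
```

With the sample results from the Output section, a threshold of 0.8 would let the 9/10 model pass while the 6/10 model still fails.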