Overview
Evals test whether different LLMs call your tools correctly. They connect to your MCP server over the MCP protocol, discover its tools, send prompts to multiple models (GPT-4o, Claude, Gemini, etc.), and assert that each model calls the right tools with the right arguments. Each eval case runs multiple times per model to measure reliability across non-deterministic LLM responses.

Evals work with any MCP server. For sunpeak framework projects, the dev server starts automatically; for standalone use, point the evals at your running server.

Evals are not included in the default `sunpeak test` run because they cost money (API credits). Run them explicitly with the `--eval` flag.
Prerequisites
- **Vercel AI SDK:** `pnpm add ai`
- **Provider packages:** install only the ones you need:
  - `pnpm add @ai-sdk/openai` for GPT-4o, GPT-4o-mini, o4-mini
  - `pnpm add @ai-sdk/anthropic` for Claude Sonnet, Claude Haiku
  - `pnpm add @ai-sdk/google` for Gemini 2.0 Flash
- **API keys:** set in `tests/evals/.env` (gitignored) or as environment variables
Setup
Evals are scaffolded automatically by `sunpeak new` and `sunpeak test init`. The directory structure:
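The scaffolded layout is not reproduced here; a plausible structure, assuming eval files live under `tests/evals/` (the `.eval.ts` naming is an assumption):

```
tests/
  evals/
    .env.example     # API key template (copy to .env, which is gitignored)
    weather.eval.ts  # one eval suite per file
```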
Configuration
API keys live in `tests/evals/.env`. Copy `.env.example` to `.env` and fill in your keys:
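The exact contents of `.env.example` are not shown above; the variable names below are the standard ones read by the Vercel AI SDK provider packages, so a filled-in `.env` would likely look like this (include only the providers you installed):

```
# tests/evals/.env (gitignored)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_GENERATIVE_AI_API_KEY=...
```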
Writing Evals
Each eval file exports a `defineEval` suite with an array of cases. Each case has a prompt and an expectation for which tool gets called:
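The real `defineEval` signature is not shown in this section, so the sketch below uses a local stand-in typed to mirror the assumed shape; the file name, tool names, and case fields are illustrative, not the framework's actual API:

```typescript
// Hypothetical shape of tests/evals/weather.eval.ts. The real defineEval
// would be imported from the framework; this stand-in only mirrors the
// assumed type so the example is self-contained.
type EvalCase = {
  prompt: string; // sent to each model under test
  expect: { tool: string; args?: Record<string, unknown> };
};
type EvalSuite = { name: string; cases: EvalCase[] };
const defineEval = (suite: EvalSuite): EvalSuite => suite; // stand-in

const suite = defineEval({
  name: "weather tools",
  cases: [
    {
      prompt: "What's the weather in Tokyo right now?",
      expect: { tool: "get_weather", args: { city: "Tokyo" } },
    },
    {
      prompt: "Will it rain in Paris this weekend?",
      expect: { tool: "get_forecast", args: { city: "Paris" } },
    },
  ],
});
console.log(suite.cases.length); // 2
```

Each case pairs one prompt with one expected tool call; the runner replays every case against every configured model.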
Assertion Levels
There are three ways to check results. The simplest, single tool, checks that the first tool call matches the expected tool name and arguments. Asymmetric matchers (`expect.stringMatching`, `expect.arrayContaining`, etc.) work inside args expectations.
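To show what a matcher in an args expectation buys you, here is a minimal sketch: `stringMatching` below is a local stand-in illustrating how a Jest-style asymmetric matcher behaves, and the `search_docs` tool and its arguments are hypothetical:

```typescript
// Local stand-in for expect.stringMatching: an object whose
// asymmetricMatch() decides whether an actual value is acceptable.
const stringMatching = (re: RegExp) => ({
  asymmetricMatch: (actual: unknown) =>
    typeof actual === "string" && re.test(actual),
});

// Expected args for a hypothetical `search_docs` tool call:
// the query only has to mention "install"; the limit must match exactly.
const expectedArgs = {
  query: stringMatching(/install/i),
  limit: 5,
};

// Arguments a model actually produced:
const actualArgs = { query: "How do I install the CLI?", limit: 5 };

const ok =
  expectedArgs.query.asymmetricMatch(actualArgs.query) &&
  expectedArgs.limit === actualArgs.limit;
console.log(ok); // true
```

Matchers keep evals robust to harmless variation: models phrase string arguments differently from run to run, so pinning exact strings would fail cases that are actually correct.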