Testing Multi-Tool MCP Apps: Tool Contracts, Workflows, and Disambiguation

May 13, 2026 Abe Wheeler

Testing multi-tool MCP Apps across ChatGPT and Claude hosts.

Most MCP App tutorials show a single tool returning data to a single resource component. That’s the “hello world” of MCP Apps. Real apps have multiple tools: a search tool, a detail tool, a create tool, an update tool. Those tools share data formats, depend on each other’s output, and compete for the same user prompts.

Testing a single tool is straightforward. Testing multiple tools working together requires a different set of patterns, because the bugs that matter most live in the gaps between tools.

TL;DR: Test multi-tool MCP Apps at three levels. Contract tests verify that each tool’s output matches what its consumers expect. Workflow integration tests chain mcp.callTool() calls to verify tools work together end to end. Disambiguation evals check that LLMs pick the right tool when tools have overlapping intent. Use simulation files with multi-step message arrays to test the UI through complete workflows. Run contract tests and workflow tests in CI with pnpm test:e2e, and disambiguation evals on the main branch with pnpm test:eval.

What Goes Wrong in Multi-Tool Apps

Single-tool bugs are usually obvious: the handler crashes, the component renders incorrectly, or the schema rejects valid input. Multi-tool bugs are subtler because they happen at the boundaries.

Here are the three failure modes that come up repeatedly:

Contract mismatches. Tool A returns { items: [{ id, name }] } and tool B expects { results: [{ itemId, title }] }. Both tools work perfectly in isolation. Together they break because someone renamed fields in tool A without updating tool B. Unit tests with mocked data hide this bug because each test mocks the data it expects.

Tool confusion. Your app has get-orders and show-order-history. A user says “show me my orders.” GPT-4o calls show-order-history. Claude calls get-orders. Neither is wrong, but your app behaves differently depending on which model runs it. Without evals, you won’t know until users report inconsistent behavior.

State assumptions. Tool A uses useAppState to set { selectedItem: "abc" }. Tool B reads selectedItem from app state and assumes it exists. If the user triggers tool B without running tool A first, the app crashes. This is a cross-tool dependency that no test catches unless you test the sequence explicitly.

Contract Testing Between Tools

A contract test verifies that a tool’s output contains the fields its consumers need. It’s not about testing the tool handler’s logic (that’s a unit test). It’s about testing the interface between tools.

Say you have an e-commerce app with three tools: search-products, product-detail, and add-to-cart. The flow has four steps: search returns a list of products with IDs; the user picks one; detail shows the full product for that ID; the user adds it to the cart using the product ID and a variant ID.

Here’s how to write contract tests for this chain:

import { test, expect } from 'sunpeak/test';

test('search-products output contains fields product-detail needs', async ({ mcp }) => {
  const result = await mcp.callTool('search-products', { query: 'headphones' });
  const content = result.structuredContent;

  // product-detail expects an id field on each item
  expect(content.items.length).toBeGreaterThan(0);
  expect(content.items[0]).toHaveProperty('id');
  expect(typeof content.items[0].id).toBe('string');
});

test('product-detail output contains fields add-to-cart needs', async ({ mcp }) => {
  const result = await mcp.callTool('product-detail', { id: 'prod-001' });
  const content = result.structuredContent;

  // add-to-cart expects id and variants
  expect(content).toHaveProperty('id');
  expect(content).toHaveProperty('variants');
  expect(content.variants.length).toBeGreaterThan(0);
  expect(content.variants[0]).toHaveProperty('variantId');
  expect(content.variants[0]).toHaveProperty('price');
});

These tests call real tool handlers through the mcp fixture, so they exercise the actual output shape. When someone renames variantId to sku in the product-detail handler, the contract test fails before the change reaches production.

You don’t need to test every field. Focus on the fields that downstream tools and resource components actually use. If product-detail returns 20 fields but add-to-cart only needs id and variantId, test those two.
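One way to keep contract assertions focused is to list the fields a consumer reads in one place and check them with a small helper. The sketch below is plain TypeScript with no framework dependency. The `missingFields` helper and its path syntax (dot-separated keys, with `[]` meaning "a non-empty array, checked element by element") are hypothetical names introduced for illustration, not part of sunpeak.

```typescript
// Hypothetical helper: returns the required paths that are absent from `value`.
function missingFields(value: unknown, paths: string[]): string[] {
  return paths.filter((path) => !hasPath(value, path.split('.')));
}

function hasPath(value: unknown, segments: string[]): boolean {
  // No segments left: the path exists if we landed on a defined value.
  if (segments.length === 0) return value !== undefined;
  if (typeof value !== 'object' || value === null) return false;

  const [head = '', ...rest] = segments;
  const isArraySegment = head.endsWith('[]');
  const key = isArraySegment ? head.slice(0, -2) : head;
  const child = (value as Record<string, unknown>)[key];

  if (!isArraySegment) return hasPath(child, rest);
  // An empty array cannot satisfy a consumer that needs at least one element.
  if (!Array.isArray(child) || child.length === 0) return false;
  return child.every((element) => hasPath(element, rest));
}

// The two fields add-to-cart reads from product-detail output.
const addToCartNeeds = ['id', 'variants[].variantId'];

// Simulates the rename described above: variantId became sku.
const renamed = { id: 'prod-001', variants: [{ sku: 'v-black', price: 249 }] };
console.log(missingFields(renamed, addToCartNeeds)); // → ["variants[].variantId"]
```

Inside a contract test, the result of `mcp.callTool()` could be passed to the helper and the returned list asserted to be empty, so a failure message names the exact field that went missing.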

Typing Contracts with Shared Types

Contracts are strongest when backed by TypeScript types. Define shared types that both the producing tool and the consuming tool reference:

// src/types/product.ts
export interface ProductSummary {
  id: string;
  name: string;
  price: number;
  imageUrl: string;
}

export interface ProductDetail extends ProductSummary {
  description: string;
  variants: ProductVariant[];
}

export interface ProductVariant {
  variantId: string;
  label: string;
  price: number;
  inStock: boolean;
}

When both tools import from the same type file, the TypeScript compiler catches contract breaks at build time. The contract test is a runtime safety net for cases where the types are satisfied but the data is still wrong. An empty variants array, for example, is a valid ProductVariant[] yet leaves add-to-cart with nothing to select.
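The runtime side of that safety net can be sketched as a type guard that is deliberately stricter than the type. The guard names below (`isProductVariant`, `isUsableProductDetail`) are hypothetical, and the interfaces are repeated from the shared type file only so the block is self-contained.

```typescript
// Same shapes as src/types/product.ts, flattened and repeated for self-containment.
interface ProductVariant {
  variantId: string;
  label: string;
  price: number;
  inStock: boolean;
}

interface ProductDetail {
  id: string;
  name: string;
  price: number;
  imageUrl: string;
  description: string;
  variants: ProductVariant[];
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null;
}

// Hypothetical guard: checks each ProductVariant field's primitive type.
function isProductVariant(value: unknown): value is ProductVariant {
  return (
    isRecord(value) &&
    typeof value.variantId === 'string' &&
    typeof value.label === 'string' &&
    typeof value.price === 'number' &&
    typeof value.inStock === 'boolean'
  );
}

// Hypothetical guard: stricter than the type, because it also rejects an
// empty variants array, which compiles fine but is useless to add-to-cart.
function isUsableProductDetail(value: unknown): value is ProductDetail {
  if (!isRecord(value)) return false;
  const variants = value.variants;
  return (
    typeof value.id === 'string' &&
    typeof value.name === 'string' &&
    typeof value.price === 'number' &&
    typeof value.imageUrl === 'string' &&
    typeof value.description === 'string' &&
    Array.isArray(variants) &&
    variants.length > 0 &&
    variants.every(isProductVariant)
  );
}
```

A contract test could assert that the guard returns true for the structuredContent of a real product-detail call, which covers both the field names and the non-empty requirement in one line.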

Workflow Integration Tests

Contract tests verify interfaces. Workflow integration tests verify that a multi-step sequence produces the right outcome. The difference is that workflow tests chain multiple callTool() calls, using the output of each call as input to the next.

import { test, expect } from 'sunpeak/test';

test('search → detail → add-to-cart workflow', async ({ mcp }) => {
  // Step 1: Search
  const searchResult = await mcp.callTool('search-products', {
    query: 'wireless headphones',
  });
  expect(searchResult.isError).toBeFalsy();
  const firstProduct = searchResult.structuredContent.items[0];

  // Step 2: Get detail using the ID from search
  const detailResult = await mcp.callTool('product-detail', {
    id: firstProduct.id,
  });
  expect(detailResult.isError).toBeFalsy();
  const firstVariant = detailResult.structuredContent.variants[0];

  // Step 3: Add to cart using IDs from previous steps
  const cartResult = await mcp.callTool('add-to-cart', {
    productId: firstProduct.id,
    variantId: firstVariant.variantId,
    quantity: 1,
  });
  expect(cartResult.isError).toBeFalsy();
  expect(cartResult.structuredContent.cartItems).toHaveLength(1);
  expect(cartResult.structuredContent.cartItems[0].productId).toBe(firstProduct.id);
});

This test catches bugs that no unit test or single-tool integration test can find:

If search-products returns IDs in a format that product-detail doesn’t accept
If product-detail returns variant data that add-to-cart can’t parse
If the cart tool silently drops the product because of a type coercion issue

Testing Error Paths in Workflows

Happy paths are the starting point. Workflow tests should also cover what happens when a step gets bad input or returns edge-case data:

test('add-to-cart with invalid product ID returns error', async ({ mcp }) => {
  const result = await mcp.callTool('add-to-cart', {
    productId: 'nonexistent-id',
    variantId: 'v-001',
    quantity: 1,
  });
  expect(result.isError).toBeTruthy();

  const errorContent = result.content?.[0];
  expect(errorContent?.text).toContain('not found');
});

test('product-detail succeeds when every variant is out of stock', async ({ mcp }) => {
  const result = await mcp.callTool('product-detail', { id: 'prod-discontinued' });
  expect(result.isError).toBeFalsy();
  // The resource must be able to handle a product whose variants are all out of stock.
  // Check the array is non-empty first: every() is vacuously true on [].
  const variants = result.structuredContent.variants;
  expect(variants.length).toBeGreaterThan(0);
  expect(variants.every((v: { inStock: boolean }) => !v.inStock)).toBe(true);
});

Simulation Files for Multi-Tool Workflows

Simulation files define deterministic tool states for the inspector and for E2E tests. For single-tool apps, a simulation file has one tool call. For multi-tool workflows, you need simulation files that chain multiple calls so the inspector shows the complete flow.

Here’s a simulation file that models the search-then-detail workflow:

{
  "messages": [
    {
      "role": "user",
      "content": "Find me some wireless headphones"
    },
    {
      "role": "assistant",
      "content": "I found several wireless headphones. Here are your options:",
      "toolCalls": [
        {
          "tool": "search-products",
          "toolInput": { "query": "wireless headphones" },
          "toolResult": {
            "structuredContent": {
              "items": [
                { "id": "prod-001", "name": "Studio Wireless Pro", "price": 249 },
                { "id": "prod-002", "name": "Sport Buds X", "price": 89 }
              ]
            }
          }
        }
      ]
    },
    {
      "role": "user",
      "content": "Tell me more about the Studio Wireless Pro"
    },
    {
      "role": "assistant",
      "content": "Here are the details for the Studio Wireless Pro:",
      "toolCalls": [
        {
          "tool": "product-detail",
          "toolInput": { "id": "prod-001" },
          "toolResult": {
            "structuredContent": {
              "id": "prod-001",
              "name": "Studio Wireless Pro",
              "price": 249,
              "description": "Active noise cancellation with 30-hour battery life.",
              "variants": [
                { "variantId": "v-black", "label": "Black", "price": 249, "inStock": true },
                { "variantId": "v-white", "label": "White", "price": 249, "inStock": false }
              ]
            }
          }
        }
      ]
    }
  ]
}

This simulation lets you test the complete user experience in the inspector. The search results render first, then the detail view renders below. You can verify that both resource components look correct and that the data is consistent between them (the product ID and name match across both views).
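The consistency between steps can also be checked mechanically. The sketch below walks a simulation in message order and reports any product-detail call whose input ID was never offered by an earlier search-products result. The interfaces mirror only the fields visible in the simulation file above, and `findDanglingDetailIds` is a hypothetical helper, not a sunpeak API.

```typescript
// Shape of the simulation file above; only the fields this check reads.
interface ToolCall {
  tool: string;
  toolInput: Record<string, unknown>;
  toolResult: { structuredContent: Record<string, unknown> };
}

interface SimulationMessage {
  role: 'user' | 'assistant';
  content: string;
  toolCalls?: ToolCall[];
}

interface Simulation {
  messages: SimulationMessage[];
}

// Hypothetical check: returns every product-detail input ID that no earlier
// search-products result in the same simulation contained.
function findDanglingDetailIds(simulation: Simulation): string[] {
  const seenIds = new Set<string>();
  const dangling: string[] = [];

  for (const message of simulation.messages) {
    for (const call of message.toolCalls ?? []) {
      if (call.tool === 'search-products') {
        // Remember every ID the search step showed to the user.
        const items = call.toolResult.structuredContent.items;
        if (Array.isArray(items)) {
          for (const item of items) {
            if (typeof item?.id === 'string') seenIds.add(item.id);
          }
        }
      } else if (call.tool === 'product-detail') {
        const id = call.toolInput.id;
        if (typeof id === 'string' && !seenIds.has(id)) dangling.push(id);
      }
    }
  }
  return dangling;
}
```

Running this over each multi-step simulation file in a unit test would flag a hand-edited fixture where the detail step references a product the search step never returned.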

When to Use Multi-Message Simulations

Use multi-message simulation files when you need to:

See how multiple tool outputs appear together in a conversation
Test a resource component that references data from a previous tool call
Verify the UI flow from search through selection to action
Demonstrate a complete workflow in the inspector during development

For simple contract and integration tests, you don’t need simulation files. The mcp fixture calls tool handlers directly and returns structured results. Simulations are for the visual and E2E layer, where you want to see (and test) what the user sees.

Disambiguation Evals

When your app has multiple tools, the LLM needs to pick the right one. This gets tricky when tools overlap in purpose. An app with list-invoices, search-invoices, and get-invoice has three tools that a user might trigger with “show me my invoices.”

Evals test this at the model level. Each eval case sends a prompt and asserts which tool the model calls:

import { defineEval } from 'sunpeak/eval';

defineEval('invoice tools', [
  {
    prompt: 'Show me all my invoices from this month',
    expect: { tool: 'list-invoices' },
  },
  {
    prompt: 'Find the invoice for Acme Corp',
    expect: { tool: 'search-invoices', args: { query: expect.stringMatching(/acme/i) } },
  },
  {
    prompt: 'Pull up invoice INV-2024-0042',
    expect: { tool: 'get-invoice', args: { invoiceId: 'INV-2024-0042' } },
  },
  {
    prompt: 'What invoices do I have?',
    expect: { tool: 'list-invoices' },
  },
  {
    prompt: 'Show me the details on my latest invoice',
    expect: { tool: 'list-invoices' },
  },
]);

Run with pnpm test:eval. Each case runs 10+ times per model, so you can judge a case by how often it passes rather than by a single run of a nondeterministic model. If get-invoice gets called 30% of the time for “show me my invoices,” that’s a signal the tool names or descriptions are too similar.
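The arithmetic behind that signal is simple enough to sketch. The function below takes the tool name a model chose on each run of one prompt and reports per-tool rates plus a confusion flag. `summarizeSelections` and its result shape are hypothetical, the input is assumed to be collected from your eval output by whatever means is available, and the 0.2 default threshold mirrors the 20% rule of thumb in the FAQ at the end of this post.

```typescript
interface SelectionSummary {
  rates: Map<string, number>; // share of runs in which each tool was called
  missRate: number;           // share of runs that did not call the expected tool
  confused: boolean;          // true when missRate exceeds the threshold
}

// Hypothetical helper: `calledTools` holds one tool name per eval run.
function summarizeSelections(
  calledTools: string[],
  expectedTool: string,
  maxMissRate = 0.2,
): SelectionSummary {
  const total = calledTools.length;
  if (total === 0) throw new Error('summarizeSelections needs at least one run');

  const counts = new Map<string, number>();
  for (const tool of calledTools) {
    counts.set(tool, (counts.get(tool) ?? 0) + 1);
  }

  const rates = new Map<string, number>();
  counts.forEach((count, tool) => rates.set(tool, count / total));

  // Divide integer counts once to avoid accumulating floating-point error.
  const misses = total - (counts.get(expectedTool) ?? 0);
  const missRate = misses / total;
  return { rates, missRate, confused: missRate > maxMissRate };
}

// Ten runs of one prompt: seven picked the expected tool, three did not.
const runs = [
  ...new Array<string>(7).fill('list-invoices'),
  ...new Array<string>(3).fill('get-invoice'),
];
const summary = summarizeSelections(runs, 'list-invoices');
console.log(summary.missRate, summary.confused); // → 0.3 true
```

With only ten runs, a single miss moves the rate by ten points, so treat the threshold as a prompt to investigate, not as a precise measurement.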

Fixing Disambiguation Failures

When evals reveal confusion between tools, the fix is usually in the tool schema, not the test. Common patterns:

Make names more distinct. list-invoices vs search-invoices is ambiguous. Consider renaming to list-recent-invoices and search-invoices-by-keyword, or consolidating into a single find-invoices tool with an optional query parameter.

Add discriminating descriptions. If two tools do similar things, their descriptions need to explain when to use each one. “Lists the 20 most recent invoices, sorted by date” vs “Searches all invoices by company name, invoice number, or amount” gives the model clear selection criteria.

Consolidate when possible. If evals show persistent confusion between two tools, consider whether they should be one tool with a mode parameter. Fewer tools means fewer disambiguation decisions for the model, which means fewer wrong calls.
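The consolidation option can be sketched as a single handler whose optional query switches between listing and searching. Everything here is illustrative: the `Invoice` shape, the `findInvoices` name, and the in-memory filtering are assumptions standing in for a real data source, and the default limit of 20 echoes the example description above.

```typescript
interface Invoice {
  invoiceId: string;
  company: string;
  amount: number;
  issuedAt: string; // ISO 8601 date, so plain string comparison sorts chronologically
}

interface FindInvoicesInput {
  query?: string; // omitted: list recent invoices; present: keyword search
  limit?: number;
}

// Hypothetical consolidated handler replacing list-invoices and search-invoices.
function findInvoices(all: Invoice[], input: FindInvoicesInput = {}): Invoice[] {
  const needle = input.query?.trim().toLowerCase();
  const matches = needle
    ? all.filter(
        (invoice) =>
          invoice.company.toLowerCase().includes(needle) ||
          invoice.invoiceId.toLowerCase().includes(needle),
      )
    : all;

  // Newest first in both modes, so the model sees one consistent ordering.
  return [...matches]
    .sort((a, b) => (a.issuedAt < b.issuedAt ? 1 : a.issuedAt > b.issuedAt ? -1 : 0))
    .slice(0, input.limit ?? 20);
}
```

The trade-off is that the model now has to decide whether to fill in `query`, which is usually an easier decision than choosing between two similarly named tools, but it is still worth an eval case of its own.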

Testing Shared Resources

Some MCP Apps use a single resource component that renders output from multiple tools. A project management app might have a task-card resource that renders data from both get-task and create-task. The resource component needs to handle different structuredContent shapes depending on which tool produced the data.

Test this by writing separate E2E tests for each tool that uses the shared resource:

import { test, expect } from 'sunpeak/test';

test('task-card renders get-task output', async ({ inspector }) => {
  const result = await inspector.renderTool('get-task', {
    taskId: 'task-123',
  });
  const app = result.app();
  await expect(app.locator('h2')).toHaveText('Fix login bug');
  await expect(app.locator('[data-testid="status"]')).toHaveText('In Progress');
  await expect(app.locator('[data-testid="assignee"]')).toBeVisible();
});

test('task-card renders create-task output', async ({ inspector }) => {
  const result = await inspector.renderTool('create-task', {
    title: 'Add dark mode',
    assignee: 'alice',
  });
  const app = result.app();
  await expect(app.locator('h2')).toHaveText('Add dark mode');
  await expect(app.locator('[data-testid="status"]')).toHaveText('Open');
  await expect(app.locator('[data-testid="created-badge"]')).toBeVisible();
});

The key is testing each tool path separately and verifying that the shared resource handles both data shapes. If create-task returns a createdAt timestamp but get-task returns an updatedAt timestamp, the resource component needs to handle both, and your tests need to cover both.
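On the component side, one way to handle both shapes is to normalize them into a single props object before rendering. The sketch below assumes the two payloads differ only in their timestamp field, as in the example above. The field names beyond createdAt and updatedAt, and the `toTaskCardProps` helper itself, are hypothetical.

```typescript
// Hypothetical structuredContent shapes for the two tools that feed task-card.
interface GetTaskOutput {
  title: string;
  status: string;
  assignee: string;
  updatedAt: string;
}

interface CreateTaskOutput {
  title: string;
  status: string;
  assignee: string;
  createdAt: string;
}

type TaskCardInput = GetTaskOutput | CreateTaskOutput;

interface TaskCardProps {
  title: string;
  status: string;
  assignee: string;
  timestamp: string;
  timestampLabel: 'Created' | 'Updated';
  showCreatedBadge: boolean;
}

// Normalizes either tool's output so the component renders from one shape.
function toTaskCardProps(input: TaskCardInput): TaskCardProps {
  const base = { title: input.title, status: input.status, assignee: input.assignee };

  // The `in` check narrows the union to CreateTaskOutput.
  if ('createdAt' in input) {
    return { ...base, timestamp: input.createdAt, timestampLabel: 'Created', showCreatedBadge: true };
  }
  return { ...base, timestamp: input.updatedAt, timestampLabel: 'Updated', showCreatedBadge: false };
}
```

Because the normalizer is a pure function, it can be covered by fast unit tests, leaving the E2E tests above to confirm that each tool's real output reaches it intact.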

Testing Cross-Tool State with useAppState

When tools share state through useAppState, test that the state contract is honored across tool boundaries. The producing tool should set state that the consuming tool can read, and the consuming tool should handle the case where the state doesn’t exist yet.

test('configure-dashboard sets state that show-dashboard reads', async ({ mcp }) => {
  // Tool A sets preferences
  const configResult = await mcp.callTool('configure-dashboard', {
    layout: 'grid',
    widgets: ['revenue', 'users', 'errors'],
  });
  expect(configResult.isError).toBeFalsy();

  // Verify the tool instructs the resource to set app state
  const configContent = configResult.structuredContent;
  expect(configContent.defaultState).toEqual({
    layout: 'grid',
    widgets: ['revenue', 'users', 'errors'],
  });
});

test('show-dashboard handles missing configuration state', async ({ mcp }) => {
  // Tool B should work even without prior configuration
  const result = await mcp.callTool('show-dashboard', {});
  expect(result.isError).toBeFalsy();

  // Should fall back to default layout
  const content = result.structuredContent;
  expect(content.defaultState.layout).toBe('list');
});

The first test verifies that the configuration tool produces state in the format the dashboard tool expects. The second test verifies the dashboard tool works without prior configuration, which is the state a new user sees.
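The fallback that the second test relies on can be isolated in one small function, so the consuming tool never reads shared state directly. This is a sketch under assumptions: `DashboardState`, `resolveDashboardState`, and the empty default widget list are hypothetical, while the 'list' default layout matches the assertion in the test above.

```typescript
interface DashboardState {
  layout: 'grid' | 'list';
  widgets: string[];
}

// What a first-time user sees. The empty widget list is an assumption.
const DEFAULT_DASHBOARD_STATE: DashboardState = { layout: 'list', widgets: [] };

// Merges whatever state exists (possibly none, possibly partial) with defaults,
// so the consumer always receives a complete DashboardState.
function resolveDashboardState(
  saved?: Partial<DashboardState> | null,
): DashboardState {
  return {
    layout: saved?.layout ?? DEFAULT_DASHBOARD_STATE.layout,
    // Copy the default array so callers cannot mutate the shared constant.
    widgets: saved?.widgets ?? [...DEFAULT_DASHBOARD_STATE.widgets],
  };
}
```

With this in place, the out-of-order case (the consuming tool running before the producing tool) becomes an ordinary input to test, not a crash to discover in production.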

Organizing Multi-Tool Tests

Multi-tool tests fit into the existing test structure alongside your single-tool tests:

tests/
  unit/
    search-products.test.ts     # handler logic
    product-detail.test.ts      # handler logic
    task-card.test.tsx          # component rendering
  e2e/
    contracts.spec.ts           # cross-tool contracts
    workflows.spec.ts           # multi-step chains
    search.spec.ts              # single-tool E2E
    detail.spec.ts              # single-tool E2E
    shared-resources.spec.ts    # shared resource rendering
  evals/
    disambiguation.eval.ts      # tool selection tests
  simulations/
    search-then-detail.json     # multi-step simulation
    search-results.json         # single-tool simulation
    product-detail.json         # single-tool simulation

Run them as part of your normal test commands:

pnpm test runs unit tests
pnpm test:e2e runs contracts, workflows, and E2E tests
pnpm test:eval runs disambiguation evals

All three run in CI/CD. Contract tests and workflow tests need no external services. Evals need model API keys and cost API credits, so gate them to the main branch.

A Checklist for Multi-Tool Testing

Before you ship a multi-tool MCP App, verify:

Each tool’s output shape is covered by a contract test
Shared TypeScript types are imported by both producer and consumer tools
At least one workflow integration test chains the primary user flow
Error paths are tested (invalid IDs, missing data, out-of-order tool calls)
Disambiguation evals cover the ambiguous prompts your users will send
Shared resources have separate E2E tests for each tool that feeds them
Cross-tool state via useAppState has tests for both the set and get paths
Multi-message simulation files model the complete workflow in the inspector
All tests run on both ChatGPT and Claude hosts

Get Started

npx sunpeak new

Frequently Asked Questions

How do I test MCP Apps with multiple tools?

Test at three levels. First, use contract tests to verify that each tool's output matches what downstream tools and resource components expect. Second, use integration tests with the mcp fixture to chain callTool() calls and verify multi-tool workflows end to end. Third, use evals to verify that LLMs pick the right tool when your app has multiple tools with overlapping intent. Run contract and workflow tests with pnpm test:e2e and evals with pnpm test:eval; pnpm test covers the unit tests underneath them.

What is a tool contract test for MCP Apps?

A contract test verifies the shape of a tool's structuredContent output against what its consumers expect. If tool A returns a list of items and tool B expects an item ID from that list, a contract test calls tool A through the mcp fixture and asserts that the output contains the fields tool B needs. This catches breaking changes when you refactor a tool without updating its consumers.

How do I write simulation files for multi-tool MCP App workflows?

Create a simulation file in tests/simulations/ that includes a messages array with multiple tool calls in sequence. Each assistant message carries a toolCalls array whose entries specify the tool name, toolInput, and toolResult. The inspector renders each tool call in order, so you can test the UI state after each step. Use this to simulate a complete user workflow like search, select, then confirm.

How do I test that an LLM picks the right tool from multiple MCP App tools?

Write disambiguation evals using defineEval() from sunpeak/eval. Each eval case specifies a user prompt and the expected tool name. Run the eval against multiple models (GPT-4o, Claude, Gemini) with 10+ runs per case to get statistical confidence. If a model calls the wrong tool more than 20% of the time, the tool names or descriptions need work.

What breaks most often in multi-tool MCP Apps?

The three most common failures are contract mismatches (tool A changes its output format and tool B still expects the old shape), tool confusion (the LLM picks the wrong tool because two tools have similar names or descriptions), and state assumptions (a tool expects prior state from useAppState that another tool was supposed to set). Contract tests, disambiguation evals, and workflow integration tests catch these, respectively.

How do I test a shared resource component that renders output from multiple tools?

Write separate simulation files for each tool that shares the resource. Each simulation provides different structuredContent to the same resource component. In your e2e tests, call inspector.renderTool() once per tool name and assert that the resource renders correctly for each tool's data shape. This verifies that your resource handles all the content variations it will receive in production.

Should I test multi-tool workflows in unit tests or integration tests?

Integration tests. Unit tests mock the MCP protocol, so they cannot verify that tool A's actual output matches tool B's expectations. Integration tests with the mcp fixture call real tool handlers through the running MCP server, so you can chain callTool() calls and check that data flows correctly between tools. Save unit tests for individual tool handler logic and component rendering.

How do I run multi-tool tests in CI/CD for MCP Apps?

Add contract tests and workflow integration tests to your pnpm test:e2e command in your GitHub Actions workflow. They run with the mcp fixture and need no external services. Add disambiguation evals to a separate job gated to the main branch, since evals cost API credits. The same workflow runs unit tests (pnpm test), integration tests (pnpm test:e2e), and evals (pnpm test:eval) in sequence or parallel.