Benchmarks: 35% Cheaper, 29% Faster Agent Runs

Claims without data are marketing. Here’s our data.

The Setup

We ran 2,500 agent benchmark runs using Claude 3.5 Sonnet across 50 API specs covering all five supported formats (OpenAPI 3.x, Swagger 2.0, GraphQL, AsyncAPI, Postman). Each spec was tested at five compression tiers — Original, Standard, Lean, Ultra-Lean, and Minimal — for 500 runs per tier.

The task: given an API spec and a natural language request, produce the correct API call — right endpoint, right method, right parameters, right auth.

Every run was scored on correctness (did the agent pick the right endpoint and parameters?), and we tracked token usage, cost, and response time.

The Results

Metric           Full Spec   LAP-Lean   Change
Success Rate     0.824       0.851      +3.2%
Avg. Tokens      48.2K       23.1K      -52%
Avg. Cost/Run    $0.37       $0.24      -35%
Avg. Time        4.1s        2.9s       -29%
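The token, cost, and time deltas follow directly from the averages in the table. A quick check (figures taken from the table above):

```python
def pct_change(before: float, after: float) -> float:
    """Relative change in percent; negative means a reduction."""
    return (after - before) / before * 100

# Full-spec vs. LAP-Lean averages from the results table.
print(f"Tokens: {pct_change(48_200, 23_100):+.0f}%")  # -52%
print(f"Cost:   {pct_change(0.37, 0.24):+.0f}%")      # -35%
print(f"Time:   {pct_change(4.1, 2.9):+.0f}%")        # -29%
```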

Read that first row again. Agents didn’t just maintain quality with compressed specs — they got better. Not dramatically better (3.2% isn’t a revolution), but the point stands: cutting 52% of tokens didn’t hurt. It helped.

Why Less Is More

This isn’t as counterintuitive as it sounds. AI models have finite attention. A 130K-token API spec contains maybe 15K tokens of actionable information — endpoint signatures, parameter types, auth requirements. The rest is noise: verbose descriptions, redundant schemas, deprecated endpoints, example values.

When you remove the noise, the signal-to-noise ratio improves. The agent spends less attention on irrelevant content and more on the parts that matter for making correct API calls.

It’s the same reason a concise brief outperforms a 50-page document for human decision-making. Brevity forces clarity.

Cost Breakdown by Format

The savings vary by format because some specs are more bloated than others:

Format         Full Cost   LAP Cost   Savings
OpenAPI 3.x    $0.42       $0.24      43%
Swagger 2.0    $0.38       $0.23      39%
GraphQL        $0.29       $0.21      28%
AsyncAPI       $0.31       $0.22      29%
Postman        $0.44       $0.26      41%

OpenAPI and Postman specs benefit most — they’re the most verbose formats. GraphQL is already relatively lean, so the gains are smaller but still meaningful.

The Tier Trade-Off

LAP offers multiple compression tiers. More aggressive compression saves more tokens but eventually starts dropping useful information:

Tier         Tokens   Score   Cost
Original     48.2K    0.824   $0.37
Standard     35.1K    0.842   $0.31
Lean         23.1K    0.851   $0.24
Ultra-Lean   15.8K    0.839   $0.20
Minimal       9.2K    0.798   $0.16

Lean is the sweet spot. It delivers the best success rate while cutting costs by 35%. Ultra-Lean saves more money but starts losing accuracy. Minimal is useful for massive APIs where you need to fit the spec in a small context window and can tolerate lower accuracy.
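The tier choice is a constrained optimization: fit a token budget, clear an accuracy floor, minimize cost. A hypothetical picker using the figures from the tier table above (the function and its defaults are illustrative, not part of LAP):

```python
# (name, avg_tokens, success_rate, avg_cost_usd) — values from the tier table.
TIERS = [
    ("Original",   48_200, 0.824, 0.37),
    ("Standard",   35_100, 0.842, 0.31),
    ("Lean",       23_100, 0.851, 0.24),
    ("Ultra-Lean", 15_800, 0.839, 0.20),
    ("Minimal",     9_200, 0.798, 0.16),
]

def pick_tier(token_budget: int, min_success: float = 0.84) -> str:
    """Return the cheapest tier that fits the budget and accuracy floor."""
    candidates = [t for t in TIERS
                  if t[1] <= token_budget and t[2] >= min_success]
    if not candidates:
        raise ValueError("no tier satisfies the constraints")
    return min(candidates, key=lambda t: t[3])[0]
```

With a generous budget and the default floor this lands on Lean; tighten the budget below 23.1K tokens and it drops to Ultra-Lean only if you also relax the accuracy floor, which mirrors the trade-off described above.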

At Scale

For a production agent making 1,000 API calls per day:

  • Full specs: ~$370/day → ~$11,100/month
  • LAP-Lean: ~$240/day → ~$7,200/month
  • Monthly savings: ~$3,900

For teams running multiple agents across multiple APIs, the savings compound fast. And that’s just the direct token cost — faster response times also reduce infrastructure costs and improve user experience.
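The at-scale figures above are straightforward to reproduce (assuming a 30-day month, as the bullets imply):

```python
CALLS_PER_DAY = 1_000

def monthly_cost(cost_per_run: float, days: int = 30) -> float:
    """Projected monthly spend at CALLS_PER_DAY API calls per day."""
    return cost_per_run * CALLS_PER_DAY * days

full = monthly_cost(0.37)  # full specs:  ~$11,100/month
lean = monthly_cost(0.24)  # LAP-Lean:    ~$7,200/month
print(f"Monthly savings: ${full - lean:,.0f}")
```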

Methodology Notes

Transparency matters. Here’s what to know about our benchmark:

  • Model: Claude 3.5 Sonnet (via Anthropic API)
  • Temperature: 0 (deterministic)
  • Scoring: Automated — correct endpoint + correct required parameters = 1.0, partial credit for partial matches
  • Specs: 50 real-world APIs from our registry, not synthetic examples
  • Runs: 10 runs per spec per tier = 500 total runs per tier, 2,500 total
  • What we didn’t test: Multi-step agent workflows, tool-use chains, or real API execution. This measures spec comprehension, not end-to-end agent performance.
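The scoring rule in the notes above can be sketched as follows. This is a simplified, hypothetical implementation for illustration; the actual harness lives in the benchmark repository:

```python
def score_call(pred: dict, gold: dict) -> float:
    """Score one predicted API call against the reference call.

    A score of 1.0 requires the correct endpoint, method, and all required
    parameters. Partial credit is the fraction of required parameters
    matched once endpoint and method are correct; a wrong endpoint or
    method scores 0.0. (Simplified sketch of the rubric described above.)
    """
    if pred["endpoint"] != gold["endpoint"] or pred["method"] != gold["method"]:
        return 0.0
    required = gold["required_params"]
    if not required:
        return 1.0
    matched = sum(1 for p in required
                  if pred["params"].get(p) == gold["params"][p])
    return matched / len(required)
```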

The full dataset and methodology are published in our benchmark repository.

What This Means

The token economy of AI agents is real and growing. As agents take on more complex tasks involving multiple APIs, the cost of reading API specs becomes a significant line item.

LAP doesn’t make agents smarter. It makes them more efficient — same quality output with half the input. For teams building agent-powered products, that’s the difference between viable unit economics and a cost problem that scales with usage.


Dig into the data:

The complete benchmark results, methodology, and raw data are available at github.com/Lap-Platform/Lap-benchmark-docs.