
Show HN: Agent-skills-eval – Test whether Agent Skills improve outputs

Introducing agent-skills-eval, an open-source test runner for validating AI agent skills built on Anthropic's agentskills.io standard. It addresses a critical need: proving whether adding a skill actually improves an agent's output. By comparing agent performance with and without a skill and grading both runs with a judge model, it provides concrete evidence, moving skill development from speculative improvement to data-backed verification.

Score: 10 · Comments: 0 · Highest Rank: #10 · Time on Front Page: 4h
First Seen: May 7, 8:00 AM · Last Seen: May 7, 11:00 AM

The Lowdown

The agent-skills-eval project offers a much-needed framework for objectively evaluating the effectiveness of AI agent skills. While it's easy to define new capabilities for AI agents using standards like Anthropic's agentskills.io, proving that these skills actually enhance performance is a significant challenge. This tool provides a systematic approach to measure the real impact of such skills, ensuring that development is guided by data rather than mere assumption.

  • Core Functionality: The system operates by running a target AI model against specific prompts twice: once with the agent skill loaded (with_skill) and once without it (without_skill).
  • Objective Grading: A separate 'judge model' then grades both sets of outputs based on predefined assertions and expected results, providing a pass/fail score for each scenario.
  • Output and Reporting: It generates comprehensive side-by-side reports, including static HTML, clearly demonstrating the performance lift (or lack thereof) attributable to the skill.
  • Flexibility and Compatibility: agent-skills-eval is designed to be highly flexible, working with any OpenAI-compatible chat model for both the target agent and the judge, supporting a wide range of APIs and even local Llama servers.
  • Developer Toolkit: It ships with a command-line interface (CLI) for quick evaluations and a TypeScript SDK for integrating evaluations into CI/CD pipelines, custom dashboards, or other programmatic workflows (a usage sketch follows this list).
  • Advanced Features: Key features include judge-graded outputs with cited assertions, deterministic tool-call assertions for agents that interact with external tools (also sketched below), and portable JSON artifacts for downstream analysis.
  • Standard Compliant: The tool is fully compliant with the agentskills.io specification, validating SKILL.md frontmatter, evals/evals.json schemas, and artifact layouts, while also adding extensions like per-eval defaults and model parameters.
  • Configurability: Evaluation runs can be configured via CLI flags or a YAML file, allowing detailed control over target and judge models, concurrency, logging, and reporting.
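
To make the SDK workflow concrete, here is a minimal sketch of what a programmatic run might look like. The entry point and option names (runEvals, targetModel, judgeModel, skillDir, outDir) are illustrative assumptions, not the project's documented API; check the repository for the real signatures.

```typescript
// Hypothetical sketch of a programmatic evaluation run; identifiers are assumptions,
// not the documented agent-skills-eval API.
import { runEvals } from "agent-skills-eval"; // assumed entry point

async function main() {
  // Run every eval in the skill's evals/evals.json twice: with and without the skill.
  const report = await runEvals({
    skillDir: "./skills/sql-review",          // folder containing SKILL.md and evals/
    targetModel: { model: "gpt-4o-mini", baseUrl: process.env.OPENAI_BASE_URL },
    judgeModel: { model: "gpt-4o" },          // a separate model grades both outputs
    concurrency: 4,
    outDir: "./eval-artifacts",               // JSON artifacts plus static HTML report
  });

  // Compare pass rates to see whether the skill produced a measurable lift.
  console.log(`with skill:    ${report.withSkill.passRate}`);
  console.log(`without skill: ${report.withoutSkill.passRate}`);
  if (report.withSkill.passRate <= report.withoutSkill.passRate) {
    process.exitCode = 1; // fail the CI job when the skill shows no improvement
  }
}

main();
```

A check like the final comparison is what makes this useful in CI: the build fails as soon as a skill stops paying for itself.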
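The deterministic tool-call assertions mentioned above can be pictured as a plain structural comparison rather than a judge call. The sketch below shows the general idea with hypothetical types and field names; the project's actual assertion format may differ.

```typescript
// Illustrative only: a deterministic check that the agent called the expected tools
// with the expected arguments. Types and field names are assumptions.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface ToolCallAssertion {
  name: string;                         // tool that must have been called
  argsSubset?: Record<string, unknown>; // arguments that must match if specified
}

function checkToolCalls(actual: ToolCall[], expected: ToolCallAssertion[]): boolean {
  // Every expected call must appear in the transcript with matching argument values.
  return expected.every((assertion) =>
    actual.some(
      (call) =>
        call.name === assertion.name &&
        Object.entries(assertion.argsSubset ?? {}).every(
          ([key, value]) => JSON.stringify(call.args[key]) === JSON.stringify(value)
        )
    )
  );
}

// Example: assert the agent looked up the weather for Paris before answering.
const passed = checkToolCalls(
  [{ name: "get_weather", args: { city: "Paris", units: "metric" } }],
  [{ name: "get_weather", argsSubset: { city: "Paris" } }]
);
console.log(passed); // true
```

Because the check is a pure structural comparison, it needs no judge model and gives the same verdict on every run.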

In essence, agent-skills-eval empowers developers to move beyond qualitative assessments and embrace a data-driven methodology for building and refining AI agent capabilities. By providing clear, verifiable evidence of skill performance, it fosters more efficient and effective AI development within the agentskills.io ecosystem.