Eval Harness

Use @m4trix/evals to define datasets, test cases, and evaluators for repeatable AI evaluation runs.

Location

examples/evals-example/

How It Works

  1. Dataset — Groups test cases by tags and/or file paths

  2. Test Case — Defines input/output pairs (e.g. prompt + expected score threshold)

  3. Evaluator — Applies scoring logic to each test case

  4. CLI Run — Execute with eval-agents-simple run --dataset "..." --evaluator "..."
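The four steps above can be sketched as a small, self-contained program. This is a hedged illustration of the concepts only — the interfaces, field names, and the length-based scorer below are assumptions for illustration, not the actual `@m4trix/evals` API:

```typescript
// Hypothetical shapes — NOT the real @m4trix/evals types.
interface TestCase {
  name: string;
  tags: string[];
  input: string; // e.g. a prompt
  expected: { minScore: number }; // expected score threshold
}

interface Dataset {
  name: string;
  includedTags: string[]; // groups test cases by tag
}

type Evaluator = (tc: TestCase, output: string) => number;

// 1. A dataset that selects test cases tagged 'demo'.
const demoDataset: Dataset = { name: "demo", includedTags: ["demo"] };

// 2. A test case: input prompt plus an expected score threshold.
const cases: TestCase[] = [
  { name: "greeting", tags: ["demo"], input: "Say hi", expected: { minScore: 0.5 } },
];

// 3. An evaluator applying toy length-based scoring logic.
const lengthEvaluator: Evaluator = (_tc, output) =>
  Math.min(output.length / 10, 1);

// 4. A "run": select cases whose tags intersect the dataset's
// includedTags, score each one, and compare against the threshold.
const selected = cases.filter((tc) =>
  tc.tags.some((t) => demoDataset.includedTags.includes(t))
);
for (const tc of selected) {
  const score = lengthEvaluator(tc, "hello there");
  const pass = score >= tc.expected.minScore;
  console.log(`${tc.name}: score=${score} pass=${pass}`);
}
```

In the real harness, the selection and scoring loop is what the CLI run performs; this sketch only shows how the three definitions relate.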

Setup

pnpm add @m4trix/evals

Create files with these suffixes so the harness can discover them:

  • *.dataset.ts — Dataset definitions

  • *.evaluator.ts — Evaluator definitions

  • *.test-case.ts — Test case definitions

Run Evals

With patterns:
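A hedged invocation sketch: the `run` subcommand and the `--dataset`/`--evaluator` flags come from the usage line above, but the exact pattern syntax (glob-style names here) is an assumption:

```shell
# Match the demo dataset and the score evaluator by name pattern
# (pattern syntax is assumed, not confirmed by the harness docs).
eval-agents-simple run --dataset "demo*" --evaluator "score*"
```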

Key Files in evals-example

  • src/evals/demo.dataset.ts — Dataset with includedTags: ['demo']

  • src/evals/demo.evaluator.ts — Evaluators (score, length, multi-score, diff)

  • src/evals/demo.test-case.ts — Test cases with prompts and expected outputs

  • m4trix-eval.config.ts — Discovery and artifact paths

Config

Optional m4trix-eval.config.ts at project root:
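A hedged sketch of what this config might contain, based on the "discovery and artifact paths" description above. Every field name here is an assumption, not the actual `@m4trix/evals` schema:

```typescript
// Hypothetical config shape — field names are assumptions.
export default {
  discovery: {
    // Glob patterns matching the *.dataset.ts / *.evaluator.ts /
    // *.test-case.ts suffixes described above.
    include: ["src/evals/**/*.{dataset,evaluator,test-case}.ts"],
  },
  artifacts: {
    // Where run outputs are written.
    outputDir: ".eval-artifacts",
  },
};
```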
