Limitations

Know which benchmark, model, curation, and stale-memory limitations still apply.

What to measure

Tie every evaluation claim to its artifacts, suite version, commit, model/provider, and configuration. When possible, pair a no-memory variant with a memory-enabled variant so that any difference can be attributed to memory.
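As a rough illustration of tying a claim to its provenance, the sketch below records one result alongside the fields listed above and refuses to compare two runs unless they differ only in condition. The `EvalClaim` class, its field names, the `paired_delta` helper, and the "no-memory" condition label are hypothetical and not part of the tool; adapt them to whatever your runs actually emit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalClaim:
    """Hypothetical record: one reported result plus its provenance."""
    suite: str
    suite_version: str
    commit: str
    model: str
    provider: str
    condition: str       # e.g. "no-memory" or "full-memory" (labels assumed)
    config_path: str     # configuration used for the run
    artifact_path: str   # where the raw outputs are kept
    score: float


# Fields that must match before two runs count as a valid pair.
PAIRING_FIELDS = ("suite", "suite_version", "commit", "model", "provider")


def paired_delta(baseline: EvalClaim, treatment: EvalClaim) -> float:
    """Return treatment score minus baseline score for a matched pair."""
    for name in PAIRING_FIELDS:
        if getattr(baseline, name) != getattr(treatment, name):
            raise ValueError(f"runs are not comparable: {name} differs")
    return treatment.score - baseline.score
```

Rejecting mismatched pairs up front keeps a memory-versus-no-memory difference from being confused with a change in suite version, commit, or model.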

How to use it

Start with a dry run, and run a real suite only after reviewing its scripts and fixtures. Keep the raw outputs and compare results item by item, not only in aggregate.
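An item-level comparison could look like the following minimal sketch. It assumes each run's raw output has already been reduced to a mapping from item ID to pass/fail; that shape and the `compare_items` helper are assumptions for illustration, not the tool's actual output format.

```python
def compare_items(baseline: dict[str, bool], treatment: dict[str, bool]) -> dict[str, list[str]]:
    """Group item IDs by how their pass/fail outcome changed between two runs."""
    shared = sorted(baseline.keys() & treatment.keys())
    return {
        # Failed in the baseline, passed in the treatment run.
        "improved": [i for i in shared if not baseline[i] and treatment[i]],
        # Passed in the baseline, failed in the treatment run.
        "regressed": [i for i in shared if baseline[i] and not treatment[i]],
        # Same outcome in both runs.
        "unchanged": [i for i in shared if baseline[i] == treatment[i]],
        # Present in only one run; usually a sign the runs are not comparable.
        "only_in_one": sorted(baseline.keys() ^ treatment.keys()),
    }
```

Looking at the "regressed" list matters because an aggregate score can stay flat or improve while individual items get worse.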

Verify

memory eval run --suite evals/examples/memory-smoke --condition full-memory --profile offline --dry-run

Next

Read Run evaluations, Metrics, and Limitations.
