Evaluations
Limitations
Know where benchmark, model, curation, and stale-memory limits remain.
What to measure
Evaluation claims should be tied to artifacts, suite version, commit, model/provider, and configuration. Pair no-memory and memory-enabled variants when possible.
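One way to keep a claim tied to its provenance is to record it next to the fields listed above. The sketch below is only an illustration: the `EvalClaim` schema, the `missing_provenance` helper, and every value in the example are hypothetical and are not part of the tool's own format.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EvalClaim:
    """Provenance for one reported result (hypothetical schema)."""
    statement: str
    artifact_path: str
    suite_version: str
    commit: str
    model: str
    provider: str
    config: str
    condition: str  # e.g. "no-memory" or "full-memory"


def missing_provenance(claim: EvalClaim) -> list[str]:
    """Return the names of fields that were left empty."""
    return [name for name, value in asdict(claim).items() if not value.strip()]


# Placeholder values for illustration only; the commit is left blank on purpose.
claim = EvalClaim(
    statement="example claim text",
    artifact_path="runs/example/results.jsonl",
    suite_version="v1",
    commit="",
    model="example-model",
    provider="example-provider",
    config="offline",
    condition="full-memory",
)
print(missing_provenance(claim))  # ['commit']
```

A check like this could run before a result is published, so a claim with no commit or suite version is caught early.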
How to use it
Start with a dry run, and run a real suite only after reviewing its scripts and fixtures. Keep the raw outputs and compare results item by item, not only in aggregate.
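An item-level comparison between a no-memory run and a memory-enabled run could look like the sketch below. It assumes each run has already been reduced to a mapping from item ID to pass/fail; that shape, the `compare_items` helper, and the toy inputs are assumptions made for illustration, not the tool's output format.

```python
def compare_items(
    baseline: dict[str, bool], memory: dict[str, bool]
) -> dict[str, list[str]]:
    """Bucket item IDs by how the memory-enabled run differs from the baseline."""
    report: dict[str, list[str]] = {
        "improved": [],
        "regressed": [],
        "unchanged": [],
        "unpaired": [],
    }
    # Only items present in both runs can be compared directly.
    for item in sorted(baseline.keys() & memory.keys()):
        if memory[item] and not baseline[item]:
            report["improved"].append(item)
        elif baseline[item] and not memory[item]:
            report["regressed"].append(item)
        else:
            report["unchanged"].append(item)
    # Items seen in only one run are reported instead of silently dropped.
    report["unpaired"] = sorted(baseline.keys() ^ memory.keys())
    return report


# Toy inputs for illustration only, not real results.
no_memory = {"q1": True, "q2": False, "q3": True}
full_memory = {"q1": True, "q2": True, "q3": False, "q4": True}
report = compare_items(no_memory, full_memory)
print(report)
```

With these toy inputs both runs pass two of their first three items, yet `q2` improves and `q3` regresses, which is the kind of change an aggregate score alone would hide.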
Verify
memory eval run --suite evals/examples/memory-smoke --condition full-memory --profile offline --dry-run
Next
Read Run evaluations, Metrics, and Limitations.