Research

Explorations in AI agents, evaluation systems, and building reliable intelligent systems.


Toward Automated Evaluation of AI Agents: A Multi-Layer Framework

We present a framework for automated agent evaluation that combines LLM judges, rule-based validators, and trace-derived test generation. The multi-layer approach achieves 97% agreement with human annotators and 100% defect detection. The paper covers methodology, experimental results, and implications for practice.
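A minimal sketch of how such evaluation layers might compose: a fast rule-based validator gates on structural properties of an agent trace, and an LLM judge (stubbed here) assesses semantic quality. All names, checks, and the pass-if-all-layers-pass policy are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    layer: str
    passed: bool
    detail: str

def rule_based_check(trace: dict) -> Verdict:
    # Hypothetical rule: the trace must contain a final answer and stay
    # within an assumed step budget of 20.
    ok = "final_answer" in trace and len(trace.get("steps", [])) <= 20
    return Verdict("rules", ok, "structural checks")

def llm_judge(trace: dict) -> Verdict:
    # Placeholder for an LLM-judge call; stubbed to approve any
    # non-empty final answer.
    ok = bool(trace.get("final_answer", "").strip())
    return Verdict("llm_judge", ok, "semantic check (stubbed)")

def evaluate(trace: dict) -> bool:
    # Multi-layer policy (assumed): a trace passes only if every layer passes.
    return all(v.passed for v in (rule_based_check(trace), llm_judge(trace)))

print(evaluate({"steps": ["plan", "act"], "final_answer": "42"}))  # True
print(evaluate({"steps": [], "final_answer": ""}))                 # False
```

Layering cheap deterministic checks before an expensive LLM judge is a common cost-control pattern; in practice the judge call would be short-circuited when the rules layer already fails.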