LLM-based agents and systems don't fail like normal software; they regress silently. A tool-chain change, prompt drift, a routing bug, or a hit context limit can each degrade behavior while the agent "still works" … until a real user hits an edge case and the failure spirals.
This poster introduces a Python-first pattern for making agent behavior reproducible, testable, and optimizable by treating agent runs as pytest simulations: multi-turn “episodes” driven by scenario fixtures, seeded user/environment simulators, and strict execution budgets. Each episode produces a structured trace (messages, tool calls, intermediate state, timings) that can be asserted with deterministic checks (e.g., schema correctness, tool-call limits, policy constraints) and scored with lightweight rubrics (e.g., goal completion, instruction adherence, tool efficiency). The result is an agent test suite that behaves like a CI gate: small enough to run on every PR, and realistic enough to catch the failures that unit tests miss.
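To make the pattern concrete, here is a minimal sketch of one episode as a parametrized pytest fixture. All names (`Scenario`, `Trace`, `run_agent`, the simulated user) are hypothetical scaffolding standing in for a real agent loop and simulator; only the pytest mechanics are load-bearing.

```python
import random
from dataclasses import dataclass, field

import pytest


@dataclass
class Trace:
    messages: list = field(default_factory=list)    # full conversation log
    tool_calls: list = field(default_factory=list)  # (tool_name, args) pairs


@dataclass
class Scenario:
    goal: str
    seed: int = 0           # seeds the user simulator for reproducibility
    max_turns: int = 6      # strict execution budget
    max_tool_calls: int = 4


def simulated_user(rng: random.Random) -> str:
    """Seeded stand-in for a user simulator: deterministic per scenario seed."""
    return rng.choice(["It's for next Tuesday.", "Morning works.", "Yes, confirm it."])


def run_agent(scenario: Scenario) -> Trace:
    """Stub agent loop; in practice this calls the real agent under test."""
    rng = random.Random(scenario.seed)
    trace = Trace()
    for turn in range(scenario.max_turns):
        trace.messages.append({"role": "user", "content": simulated_user(rng)})
        trace.tool_calls.append(("calendar.create_event", {"turn": turn}))
        trace.messages.append({"role": "assistant", "content": "Booked."})
        if turn >= 2:  # pretend the goal is reached after a few turns
            break
    return trace


@pytest.fixture(params=[Scenario(goal="book a meeting", seed=s) for s in (0, 1)])
def episode(request):
    """Scenario fixture: each param runs one seeded multi-turn episode."""
    scenario = request.param
    return scenario, run_agent(scenario)


def test_respects_tool_call_budget(episode):
    scenario, trace = episode
    # Deterministic check: hard cap on tool calls (a policy constraint).
    assert len(trace.tool_calls) <= scenario.max_tool_calls


def test_tool_calls_are_well_formed(episode):
    _, trace = episode
    # Deterministic check: every tool call matches the expected shape.
    for name, args in trace.tool_calls:
        assert isinstance(name, str) and isinstance(args, dict)


def test_goal_completion_rubric(episode):
    _, trace = episode
    # Lightweight rubric: did any assistant turn confirm the goal?
    assert any(m["role"] == "assistant" and "Booked" in m["content"]
               for m in trace.messages)
```

Because the simulator is seeded, a failing episode replays deterministically, which is what makes the suite cheap and stable enough to run as a per-PR CI gate.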
Attendees will leave with a practical blueprint and reusable patterns to apply, and a broader view of how to build lightweight test harnesses, within CI/CD and beyond, for more reliable AI applications in Python.