Build a CLI evaluation framework that dynamically executes agent interactions against standard operating procedures (SOPs). Developers should focus on the graph-guided aspect, which validates service-agent performance on multi-step procedures rather than on simple static prompts.
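One possible reading of "graph-guided" is sketched below in Python: the SOP is encoded as a directed graph of steps, and the evaluator asks the agent for its next step at each node, flagging any move the graph does not allow. Everything named here is an assumption made for illustration, not part of the brief: the `SOP_GRAPH` contents, the `evaluate` and `scripted_agent` helpers, the `--steps` and `--start` flags, and the use of a scripted stand-in instead of a real agent call.

```python
import argparse
import json

# Hypothetical SOP for a refund request: each step maps to the steps
# that are allowed to follow it. A step with no successors is terminal.
SOP_GRAPH = {
    "greet": ["verify_identity"],
    "verify_identity": ["lookup_order", "escalate"],
    "lookup_order": ["issue_refund", "escalate"],
    "issue_refund": ["close"],
    "escalate": ["close"],
    "close": [],
}


def evaluate(graph, start, agent, max_steps=20):
    """Walk the SOP graph, asking the agent for its next step each turn.

    `agent` is any callable taking (current_step, path_so_far) and
    returning the name of the step it wants to take next.
    """
    current = start
    path = [current]
    violations = []
    for _ in range(max_steps):
        allowed = graph.get(current, [])
        if not allowed:  # terminal step reached
            break
        choice = agent(current, path)
        if choice not in allowed:
            # Record the off-graph move and stop the episode.
            violations.append((current, choice))
            break
        current = choice
        path.append(current)
    completed = not graph.get(current, []) and not violations
    return {"path": path, "violations": violations, "completed": completed}


def scripted_agent(steps):
    """Stand-in for a real agent: replays a fixed list of steps."""
    remaining = iter(steps)

    def agent(current, path):
        # Once the script runs out, return a step no SOP graph allows.
        return next(remaining, "<no_action>")

    return agent


def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Graph-guided SOP evaluation (illustrative sketch)"
    )
    parser.add_argument("--steps", nargs="+", required=True,
                        help="steps the stand-in agent will attempt, in order")
    parser.add_argument("--start", default="greet",
                        help="SOP step where the episode begins")
    args = parser.parse_args(argv)
    result = evaluate(SOP_GRAPH, args.start, scripted_agent(args.steps))
    print(json.dumps(result))
    # Exit code 0 only if the agent reached a terminal step with no violations.
    return 0 if result["completed"] else 1
```

Calling `main(["--steps", "verify_identity", "escalate", "close"])` replays a conforming path and returns 0, while `main(["--steps", "lookup_order"])` skips identity verification and returns 1. In a real framework the scripted agent would presumably be replaced by a call to the service agent under test, and the graph would be loaded from the SOP definition rather than hard-coded.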