Build a benchmark framework that tests 'continuity'—the ability of an agent system to maintain truth over long sessions. Move beyond simple RAG retrieval metrics to 'persistence' testing.