Develop an evaluation framework for testing the limits of multi-step agentic reasoning at scale. The framework should focus on measuring consistency and error propagation in long-chain operations.
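
The brief does not define "consistency" or "error propagation", so the following is only a minimal sketch under stated assumptions. It takes consistency to mean the share of repeated runs that agree with the most common final answer. It takes error propagation to mean the fraction of steps after the first incorrect step that are also incorrect. The function names (`consistency`, `first_error_step`, `propagation_rate`) and the input shapes (a list of final answers, a list of per-step correctness flags) are hypothetical and chosen for illustration.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def consistency(final_answers: Sequence[Hashable]) -> float:
    """Share of runs whose final answer matches the most common one.

    Assumed definition: the brief does not specify how consistency is scored.
    """
    if not final_answers:
        raise ValueError("need at least one run")
    _, top_count = Counter(final_answers).most_common(1)[0]
    return top_count / len(final_answers)


def first_error_step(step_correct: Sequence[bool]) -> Optional[int]:
    """Index of the first incorrect step, or None if every step is correct."""
    for i, ok in enumerate(step_correct):
        if not ok:
            return i
    return None


def propagation_rate(step_correct: Sequence[bool]) -> Optional[float]:
    """Fraction of steps after the first error that are also incorrect.

    Assumed definition. Returns None when the chain has no error, or when
    the first error is the last step (there is nothing downstream to measure).
    """
    first = first_error_step(step_correct)
    if first is None or first == len(step_correct) - 1:
        return None
    downstream = step_correct[first + 1:]
    return sum(1 for ok in downstream if not ok) / len(downstream)
```

As a usage example, `consistency(["42", "42", "41", "42"])` returns `0.75`. For the trace `[True, True, False, False, True, False]`, `first_error_step` returns `2`, and `propagation_rate` reports that two of the three later steps also fail. Per-step correctness flags presuppose some grader for intermediate steps, which this sketch leaves out of scope.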