Create an evaluation framework that measures an agent's 'self-development' capability: specifically, how well it revises its own system prompt after a task failure.
Suggested repo: self-evolve-agent
"The agent that fixes its own logic when it hits a wall."
Estimated effort: 60h
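
One way to picture the core measurement loop is sketched below. It is a minimal illustration and not a proposed design: `Task`, `ToyAgent`, `EvalResult`, `evaluate`, and the `recovery_rate` metric are all hypothetical names, and the "agent" is a deterministic stand-in that passes a task only when a keyword appears in its system prompt. A real framework would replace `attempt` and `revise_prompt` with calls to an actual model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    name: str
    keyword: str   # toy pass condition: this string must appear in the prompt
    feedback: str  # failure message shown to the agent (hypothetical)


@dataclass
class ToyAgent:
    system_prompt: str
    prompt_history: List[str] = field(default_factory=list)

    def attempt(self, task: Task) -> bool:
        # Stand-in for running the task with the current system prompt.
        return task.keyword in self.system_prompt

    def revise_prompt(self, feedback: str) -> None:
        # Stand-in for self-revision: keep the old prompt, append the feedback.
        self.prompt_history.append(self.system_prompt)
        self.system_prompt += "\n" + feedback


@dataclass
class EvalResult:
    initial_failures: int
    recovered: int
    revisions: int

    @property
    def recovery_rate(self) -> float:
        # Share of initially failed tasks that passed after prompt revision.
        if self.initial_failures == 0:
            return 0.0
        return self.recovered / self.initial_failures


def evaluate(agent: ToyAgent, tasks: List[Task], max_revisions: int = 2) -> EvalResult:
    failures = recovered = revisions = 0
    for task in tasks:
        if agent.attempt(task):
            continue  # only failures exercise self-development
        failures += 1
        for _ in range(max_revisions):
            agent.revise_prompt(task.feedback)
            revisions += 1
            if agent.attempt(task):
                recovered += 1
                break
    return EvalResult(failures, recovered, revisions)


tasks = [
    Task("json-output", keyword="JSON", feedback="Always answer in JSON."),
    Task("citations", keyword="cite", feedback="Always cite sources."),
    Task("unfixable", keyword="XYZZY", feedback="Try harder."),
]
agent = ToyAgent("You are a helpful assistant.")
result = evaluate(agent, tasks)
print(result, f"recovery_rate={result.recovery_rate:.2f}")
```

In this toy run the first two tasks recover after one revision each, while the third never does because its feedback carries no usable signal. Scoring only the initially failed tasks keeps the metric focused on adaptation and separates it from baseline task ability.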