Develop a benchmark runner for 'Creative Problem-Solving' that goes beyond simple brainteasers. The goal is a standardized way to test whether agents can actually synthesize new ideas.
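
One possible shape for such a runner is sketched below in Python. Everything in it is an assumption made for illustration: the `Task` and `Result` types, the `run_benchmark` and `mean_score` functions, and especially `distinct_lines_scorer`, which is only a toy proxy (it counts distinct lines in a response) and not a real measure of creativity. An agent is modeled as any callable that maps a prompt string to a response string.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List

# Hypothetical model of an agent: prompt in, response out.
Agent = Callable[[str], str]


@dataclass
class Task:
    task_id: str
    prompt: str
    # Scores one response; values outside [0.0, 1.0] are clamped by the runner.
    score: Callable[[str], float]


@dataclass
class Result:
    task_id: str
    response: str
    score: float


def distinct_lines_scorer(target: int) -> Callable[[str], float]:
    """Toy scorer (assumption): fraction of `target` distinct non-empty lines."""
    def score(response: str) -> float:
        ideas = {line.strip().lower() for line in response.splitlines() if line.strip()}
        return len(ideas) / max(target, 1)
    return score


def run_benchmark(agent: Agent, tasks: List[Task]) -> List[Result]:
    """Send every task prompt to the agent and score each response."""
    results = []
    for task in tasks:
        response = agent(task.prompt)
        # Clamp so a misbehaving scorer cannot skew the aggregate.
        value = max(0.0, min(1.0, task.score(response)))
        results.append(Result(task.task_id, response, value))
    return results


def mean_score(results: List[Result]) -> float:
    """Average score across tasks; 0.0 when there are no results."""
    return mean(r.score for r in results) if results else 0.0


if __name__ == "__main__":
    tasks = [
        Task("uses-for-brick",
             "List unusual uses for a brick, one per line.",
             distinct_lines_scorer(4)),
    ]

    def toy_agent(prompt: str) -> str:
        # Stand-in agent; the duplicate "Doorstop" is not counted twice.
        return "doorstop\npaperweight\nDoorstop"

    print(mean_score(run_benchmark(toy_agent, tasks)))  # prints 0.5
```

Keeping the scorer as a per-task callable means the same runner could host open-ended tasks with different judging rules, such as rubric checks or human ratings, without changing the loop itself.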