Develop a tool that measures how realistic a simulated agent's behavior is by comparing it against logs of real human behavior. The result is a benchmarking suite that researchers can use to quantify and reduce the 'reality gap' in user simulators.
Suggested repo: realism-bench
"Is your bot behaving like a human or a script?"
Estimated effort: 60h
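
As a rough illustration of the comparison described above, one possible realism score is the Jensen-Shannon divergence between the action-type frequencies in a human log and in an agent log. This is only a sketch under assumptions: the choice of metric, the representation of a log as a flat list of action names, and the function names (action_distribution, js_divergence, realism_gap) are all hypothetical and not part of the original proposal.

```python
import math
from collections import Counter


def action_distribution(actions, vocabulary):
    """Return the relative frequency of each action in `vocabulary`."""
    counts = Counter(actions)
    total = sum(counts[a] for a in vocabulary)
    if total == 0:
        raise ValueError("log contains no actions")
    return [counts[a] / total for a in vocabulary]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions.

    The result lies in [0, 1]: 0 for identical distributions,
    1 for distributions with no overlap.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing; b_i > 0 whenever a_i > 0.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def realism_gap(human_actions, agent_actions):
    """Hypothetical realism score: lower means the agent's action mix
    is closer to the human one. Assumes each log is a list of action names."""
    vocabulary = sorted(set(human_actions) | set(agent_actions))
    p = action_distribution(human_actions, vocabulary)
    q = action_distribution(agent_actions, vocabulary)
    return js_divergence(p, q)
```

A score of 0 would mean the agent uses actions in the same proportions as the humans, and a score of 1 would mean the two logs share no actions at all. Action frequencies ignore ordering and timing, so a fuller benchmark would likely need additional metrics (for example, over action sequences or inter-action delays).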