Develop an automated evaluation suite that tests whether models keep speaker attribution coherent across multi-turn dialogue. The suite should flag instances where a model conflates participants' identities in complex chat logs, and its test cases should be reusable as a standard benchmark dataset.
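As a starting point, one test case and its scoring rule could be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive design: the names `Turn`, `AttributionCase`, `score_attribution`, and `conflation_rate` are hypothetical, and the sketch assumes the model's answer has already been reduced to a single speaker name. A wrong answer that names another participant in the log counts as a conflation; any other wrong answer is scored separately.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    """One utterance in a chat log (hypothetical schema)."""
    speaker: str
    text: str


@dataclass(frozen=True)
class AttributionCase:
    """A chat log plus a who-said-it question with a known answer."""
    turns: tuple[Turn, ...]
    question: str
    expected_speaker: str


def score_attribution(case: AttributionCase, predicted_speaker: str) -> str:
    """Classify a prediction as 'correct', 'conflated', or 'unknown'.

    'conflated' means the model named a different participant from the
    same log; 'unknown' means it named nobody who appears in the log.
    """
    participants = {turn.speaker.casefold() for turn in case.turns}
    predicted = predicted_speaker.strip().casefold()
    if predicted == case.expected_speaker.casefold():
        return "correct"
    if predicted in participants:
        return "conflated"
    return "unknown"


def conflation_rate(cases: list[AttributionCase], predictions: list[str]) -> float:
    """Fraction of cases in which the model conflated two participants."""
    if len(cases) != len(predictions):
        raise ValueError("cases and predictions must have the same length")
    if not cases:
        return 0.0
    conflated = sum(
        score_attribution(case, prediction) == "conflated"
        for case, prediction in zip(cases, predictions)
    )
    return conflated / len(cases)


# Illustrative case; the dialogue is made up for this sketch.
example_case = AttributionCase(
    turns=(
        Turn("Alice", "I booked the flight."),
        Turn("Bob", "I'll handle the hotel."),
        Turn("Alice", "Thanks, Bob."),
    ),
    question="Who booked the flight?",
    expected_speaker="Alice",
)
```

Separating `conflated` from `unknown` keeps identity mix-ups, which are what the suite targets, apart from answers that name nobody in the log. A full suite would still need a step that extracts the speaker name from free-form model output, which this sketch leaves out.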