Create an automated benchmarking suite that tests Qwen3.6-Max-Preview against open-source models on specific coding tasks. Build a script that compares reasoning latency with accuracy for each model.
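
As a starting point, here is a minimal sketch of what such a harness could look like. Several things in it are assumptions rather than part of the request: each model is represented by a hypothetical callable that maps a prompt to generated Python source, "reasoning latency" is taken to mean wall-clock time per generation call, and "accuracy" is the fraction of tasks whose generated code passes every test case. The names `Task`, `Result`, `passes`, `run_benchmark`, and the two stub models are illustrative only. No real client for Qwen3.6-Max-Preview or any open-source model is shown; those would need to be wrapped to fit the same callable signature.

```python
import statistics
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumption: a "model" is any callable mapping a prompt to generated source
# code. Real API or local-model clients would be wrapped to fit this shape.
ModelFn = Callable[[str], str]


@dataclass
class Task:
    name: str
    prompt: str
    entry_point: str                  # function the generated code must define
    cases: List[Tuple[tuple, object]]  # (positional args, expected result)


@dataclass
class Result:
    model: str
    accuracy: float          # fraction of tasks fully solved
    mean_latency_s: float    # mean wall-clock seconds per generation call
    median_latency_s: float


def passes(code: str, task: Task) -> bool:
    """Execute generated code and check every test case.

    WARNING: exec() is not sandboxed. Run untrusted model output only in an
    isolated environment (container, VM, or subprocess with limits).
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        fn = namespace[task.entry_point]
        return all(fn(*args) == expected for args, expected in task.cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors count as failures.
        return False


def run_benchmark(models: Dict[str, ModelFn], tasks: List[Task]) -> List[Result]:
    """Time each generation call and score the output against the task's cases."""
    results = []
    for name, generate in models.items():
        latencies, solved = [], 0
        for task in tasks:
            start = time.perf_counter()
            code = generate(task.prompt)
            latencies.append(time.perf_counter() - start)
            solved += passes(code, task)
        results.append(Result(name, solved / len(tasks),
                              statistics.mean(latencies),
                              statistics.median(latencies)))
    return results


# Hypothetical tasks and stub models, used only to exercise the harness.
TASKS = [
    Task("add", "Write add(a, b) returning the sum.", "add",
         [((1, 2), 3), ((-1, 1), 0)]),
    Task("is_even", "Write is_even(n) returning a bool.", "is_even",
         [((2,), True), ((3,), False)]),
]


def stub_correct(prompt: str) -> str:
    return ("def add(a, b):\n    return a + b\n\n"
            "def is_even(n):\n    return n % 2 == 0\n")


def stub_partial(prompt: str) -> str:
    # Deliberately wrong is_even, so only one of the two tasks passes.
    return ("def add(a, b):\n    return a + b\n\n"
            "def is_even(n):\n    return True\n")


if __name__ == "__main__":
    stubs = {"stub-correct": stub_correct, "stub-partial": stub_partial}
    for r in run_benchmark(stubs, TASKS):
        print(f"{r.model:14s} accuracy={r.accuracy:.2f} "
              f"mean_latency={r.mean_latency_s:.6f}s")
```

Keeping the models behind a plain callable means the timing and scoring logic stays identical across hosted and locally run models. A fuller version would likely add repeated trials per task, timeouts, and a separate measurement of time to first token if the provider exposes streaming.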