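Build an evaluation harness for vision-language models (VLMs) that specifically targets negation and affirmation bias across different language families. The harness should help developers understand why their VLM fails on specific edge cases, for example by showing whether errors cluster on negated inputs or in particular language families.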
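
One possible shape for such a harness is sketched below; it is a minimal illustration under stated assumptions, not a definitive design. Everything named here is hypothetical: the `Probe` record, the `Model` callable interface (image id and statement in, "yes" or "no" out), the `evaluate` function, and the four toy probes. The sketch reports, per language family, how often the model answers "yes" (a rough affirmation-bias signal) and its accuracy on affirmative versus negated statements.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class Probe:
    """One test item (hypothetical schema, not from any existing benchmark)."""
    family: str      # language family, e.g. "Indo-European"
    language: str    # language name or code
    image_id: str    # identifier of the image shown to the model
    statement: str   # statement the model must confirm or reject
    negated: bool    # True if the statement contains a negation
    expected: str    # gold answer: "yes" or "no"


# Assumed model interface: (image_id, statement) -> "yes" / "no".
Model = Callable[[str, str], str]


def evaluate(model: Model, probes: Iterable[Probe]) -> Dict[str, Dict[str, float]]:
    """Aggregate yes-rate and affirmative/negated accuracy per language family."""
    stats = defaultdict(lambda: {"n": 0, "yes": 0,
                                 "aff_n": 0, "aff_ok": 0,
                                 "neg_n": 0, "neg_ok": 0})
    for p in probes:
        answer = model(p.image_id, p.statement).strip().lower()
        s = stats[p.family]
        s["n"] += 1
        s["yes"] += int(answer == "yes")
        key = "neg" if p.negated else "aff"
        s[key + "_n"] += 1
        s[key + "_ok"] += int(answer == p.expected)

    nan = float("nan")
    return {
        family: {
            "yes_rate": s["yes"] / s["n"],
            "affirmative_acc": s["aff_ok"] / s["aff_n"] if s["aff_n"] else nan,
            "negated_acc": s["neg_ok"] / s["neg_n"] if s["neg_n"] else nan,
        }
        for family, s in stats.items()
    }


# Toy probes for illustration only; "img_dog" is assumed to show a dog.
PROBES = [
    Probe("Indo-European", "English", "img_dog", "There is a dog.", False, "yes"),
    Probe("Indo-European", "English", "img_dog", "There is no dog.", True, "no"),
    Probe("Japonic", "Japanese", "img_dog", "犬がいます。", False, "yes"),
    Probe("Japonic", "Japanese", "img_dog", "犬がいません。", True, "no"),
]


def always_yes(image_id: str, statement: str) -> str:
    """Stub standing in for a VLM that ignores negation entirely."""
    return "yes"


report = evaluate(always_yes, PROBES)
for family, metrics in report.items():
    print(family, metrics)
```

With the always-yes stub, every family shows a yes-rate of 1.0, perfect affirmative accuracy, and zero negated accuracy, which is the signature of a model that affirms regardless of negation. A real run would replace `always_yes` with a wrapper around the VLM under test and use matched affirmative/negated pairs so that a gap between the two accuracies can be attributed to negation rather than to image content.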