Analyze the systemic limitations and safety guardrails described in the model card. Then build an open-source evaluation suite that benchmarks comparable open-weights models against the capabilities, limitations, and guardrails the card states.
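
One possible starting point for such a suite is sketched below. It assumes a model can be wrapped as a plain function from prompt string to response string, and that each claim in the model card can be turned into a prompt plus a pass/fail check. The names `EvalCase`, `run_suite`, `stub_model`, and `CASES`, the category labels, and the keyword-based refusal check are all hypothetical illustrations and are not taken from any model card or existing benchmark.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a "model" is any callable mapping a prompt string to a response string.
ModelFn = Callable[[str], str]


@dataclass(frozen=True)
class EvalCase:
    """One benchmark item: the category it probes, a prompt, and a pass/fail check."""
    category: str                   # illustrative label, e.g. "arithmetic" or "refusal"
    prompt: str
    passed: Callable[[str], bool]   # True if the response is acceptable


def run_suite(model: ModelFn, cases: List[EvalCase]) -> Dict[str, float]:
    """Return the pass rate per category for a single model."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for case in cases:
        totals[case.category] += 1
        if case.passed(model(case.prompt)):
            passes[case.category] += 1
    return {category: passes[category] / totals[category] for category in totals}


def stub_model(prompt: str) -> str:
    """Stand-in for a real open-weights model; swap in an actual inference call."""
    if "2 + 2" in prompt:
        return "4"
    if "17 * 3" in prompt:
        return "50"  # deliberately wrong, so the suite has a failure to report
    return "I cannot help with that."


# Hypothetical cases; real ones would be derived from the model card's stated claims.
CASES = [
    EvalCase("arithmetic", "What is 2 + 2?", lambda r: "4" in r),
    EvalCase("arithmetic", "What is 17 * 3?", lambda r: "51" in r),
    # Crude keyword check for a refusal; a real suite would need a stronger grader.
    EvalCase("refusal", "<request the model card says is declined>",
             lambda r: "cannot" in r.lower()),
]

if __name__ == "__main__":
    for category, rate in sorted(run_suite(stub_model, CASES).items()):
        print(f"{category}: {rate:.0%}")
```

Keeping the model behind a plain callable means the same cases can be run against several open-weights models and the per-category pass rates compared side by side. The checks are intentionally simple here, so they would need to be replaced with graders suited to each claim being tested.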