adocomplete
Analyze the systemic limitations and safety guardrails detailed in the model card. Build an open-source evaluation suite to benchmark similar open-weights models against these stated capabilities.
Suggested repo: opus-eval
"Benchmarking open-source against the state-of-the-art."
Estimated effort: 40h