Finding Model Biases
Hate speech detection models can help make the internet a better place, but they need to work for everyone.
Everyone deserves to be safe from hate
If models detect hate against some identity groups less reliably than hate against others (e.g. if they are worse at detecting hate against black people than hate against women), this will reinforce inequalities in how different groups are protected on the internet. A good hate speech detection model should never be biased in its target coverage.
Diagnosing target biases with HateCheck
The HateCheck test suites are constructed to reveal such biases in target coverage. Because test cases are generated from templates like “I hate [IDENTITY]”, there are hundreds of sets of test cases that only differ in which identity group is targeted. If a model correctly classifies an example for one target group but misclassifies the equivalent example for another target group, this is a strong indication of a bias in target coverage.
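To make the idea concrete, here is a minimal sketch of template-based testing. It is an illustration only, not HateCheck's actual generation code: the templates, identity terms, field names and the deliberately biased `toy_classifier` are all hypothetical stand-ins.

```python
from collections import defaultdict

# Hypothetical templates and identity terms, for illustration only.
TEMPLATES = ["I hate [IDENTITY].", "I love [IDENTITY]."]
IDENTITIES = ["women", "immigrants", "disabled people"]


def expand_templates(templates, identities):
    """Fill every template with every identity term."""
    cases = []
    for template_id, template in enumerate(templates):
        for identity in identities:
            cases.append({
                "template_id": template_id,
                "target": identity,
                "text": template.replace("[IDENTITY]", identity),
            })
    return cases


def find_inconsistent_templates(cases, classify):
    """Return templates whose predicted label differs across target groups."""
    predictions = defaultdict(dict)
    for case in cases:
        predictions[case["template_id"]][case["target"]] = classify(case["text"])
    return {
        template_id: by_target
        for template_id, by_target in predictions.items()
        if len(set(by_target.values())) > 1
    }


def toy_classifier(text):
    """Stand-in model that deliberately misses hate against one group."""
    if "hate" in text and "immigrants" not in text:
        return "hateful"
    return "non-hateful"


cases = expand_templates(TEMPLATES, IDENTITIES)
inconsistent = find_inconsistent_templates(cases, toy_classifier)
```

In this toy setup, only the "I hate [IDENTITY]." template is flagged, because the stand-in classifier labels it differently depending on the target group. With a real model, `classify` would be replaced by a call to that model's prediction function.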
At a more macro level, models should be equally accurate on HateCheck test cases across target groups. In the original HateCheck paper, we used this method to identify clear biases in several state-of-the-art models.
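A macro-level comparison of this kind could be sketched as follows. This is a minimal example under stated assumptions: the record format (target, gold label, predicted label) and the sample predictions are made up for illustration and are not results from any real model or from the HateCheck paper.

```python
from collections import defaultdict


def accuracy_by_target(records):
    """Compute accuracy per target group.

    Each record is a (target, gold_label, predicted_label) tuple.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for target, gold, predicted in records:
        total[target] += 1
        correct[target] += int(gold == predicted)
    return {target: correct[target] / total[target] for target in total}


def accuracy_gap(per_target):
    """Difference between the best and worst per-target accuracy."""
    return max(per_target.values()) - min(per_target.values())


# Made-up predictions for illustration only.
records = [
    ("women", "hateful", "hateful"),
    ("women", "non-hateful", "non-hateful"),
    ("immigrants", "hateful", "non-hateful"),
    ("immigrants", "non-hateful", "non-hateful"),
]

per_target = accuracy_by_target(records)
gap = accuracy_gap(per_target)
```

A large gap between the best and worst group suggests uneven target coverage. What counts as "large" is a judgment call that depends on the number of test cases per group.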
Expanding HateCheck’s target coverage
For each of the 11 languages covered by HateCheck so far, we created test cases for seven different target identity groups. These groups were chosen for their relevance to hate in each language-specific context. However, HateCheck is set up in such a way that it can easily be expanded to include more identity groups. Including more groups also allows for more extensive testing of potential biases in target coverage. We are very keen to support any such expansion of HateCheck, so if you have any questions, please get in touch!
Start using HateCheck!
Read the HateCheck research papers and start testing hate speech detection models today.