If you’re a native English speaker, you may never have consciously noticed that masculine and feminine pronouns do not follow the same grammatical pattern. This matters because AI systems – specifically large, open, shared language models – carry an inherent gender bias simply because of this quirk of the English language.
In a Medium post, Robert Munro, the author of a book on Human-in-the-Loop machine learning, walks through his discovery that an asymmetry in gender pronouns has introduced deep gender bias into language-based machine learning models: the masculine possessives (“his”/“his”) and the feminine possessives (“her”/“hers”) do not follow the same grammatical pattern.
Note the use of pronouns in these two sentences:
- I’m driving with him in his car.
- I’m driving with her in her car.
Both sentences use the dependent possessive (“his car,” “her car”), but in the independent position the forms diverge: “the car is his” versus “the car is hers.” This asymmetry has introduced an arbitrary source of bias in AI.
“In the existing datasets there are 100s of examples of the Dependent Possessive “his” as in “his car”. So, the NLP systems can learn that “his” is a pronoun in the Dependent context and then guess correctly because it’s the same word in the Independent context. This isn’t possible for “her/hers” with the different spellings.” – Robert Munro
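The asymmetry Munro describes can be shown as plain data. This is just the English possessive paradigm written out, with a check that the masculine forms coincide while the feminine forms do not:

```python
# English possessive pronoun paradigm, written out as data.
# "Dependent" = before a noun ("his car"); "independent" = standing alone ("the car is his").
possessives = {
    "masculine": {"dependent": "his", "independent": "his"},
    "feminine":  {"dependent": "her", "independent": "hers"},
}

# The masculine forms are identical, so a model that learns "his" in one
# context gets the other for free; the feminine forms are different words.
masculine_same = possessives["masculine"]["dependent"] == possessives["masculine"]["independent"]
feminine_same = possessives["feminine"]["dependent"] == possessives["feminine"]["independent"]
print(masculine_same, feminine_same)  # True False
```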
Why does this matter? Munro turned to Google’s BERT to check how this asymmetry might bias one of the core AI building-block systems. As he explains, a core part of BERT’s architecture is predicting which word is likely to occur at a given position in a sentence, after training over a large amount of raw text. This means the raw frequency of words matters: to avoid bias, the training data would need to contain the different possessive pronouns in equal numbers. Munro’s analysis measured bias by checking whether BERT prefers a sentence like “the car is his” over “the car is hers.”
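A minimal sketch of that kind of probe: ask a masked language model to fill the independent-possessive slot and compare the scores it assigns to “his” and “hers.” The probabilities below are hypothetical stand-ins (a real run would query BERT for the `[MASK]` probabilities); only the ratio computation is shown:

```python
# Sketch of Munro's probe: compare a masked LM's scores for "his" vs
# "hers" in a sentence like "the car is [MASK]."
# The probabilities below are HYPOTHETICAL stand-ins, not real BERT output.

def preference_ratio(p_his: float, p_hers: float) -> float:
    """How many times more likely the model is to fill the slot with 'his'."""
    return p_his / p_hers

# Hypothetical [MASK]-fill probabilities for "the car is [MASK]."
p_his, p_hers = 0.020, 0.004
ratio = preference_ratio(p_his, p_hers)
print(f"'his' preferred {ratio:.1f}x over 'hers'")  # 'his' preferred 5.0x over 'hers'
```

A ratio above 1 means the model leans masculine for that noun; Munro ran this comparison across many nouns to produce the figures below.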
[Chart: BERT preference for “hers” and (independent) “his”]
You can see from the ratios that “mom” is 7.4 times more likely to make BERT predict “hers” than “his” and “money” is 23 times more likely to make BERT predict “his” than “hers.” The only thing that is “hers”? Mom. Which, of course, could be “his” or “hers” equally in the world.
“The world, and almost everything in it, are “his”,” says Munro. “Action” is particularly problematic: it is almost 70 times more likely to be masculine than feminine. This likely reflects the inherent bias in agency ascribed to different gender pronouns in the language BERT was trained on.
BERT is a very important model because it is used in Google search: Google announced in October that BERT would be used in 10% of searches. And BERT thinks that men own and run everything in the world at a rate far higher than women do. Even when Munro tried to bias BERT towards the items most strongly associated with “hers” and “theirs,” BERT still preferred “his.”
We are lucky to have such rich data with which to build AI models of the English language. This, in itself, represents a bias. English’s pronoun system is one of the simplest, as is its noun system, with only a singular/plural distinction and a possessive distinction. As machine learning models are built for more languages, more of this type of bias can be expected.
“The problem with “hers” and “theirs” is a good metaphor for the kinds of biases that we encounter much more frequently in other languages. Some differences will reflect societal bias and some will not. In some cases, the machine learning models will cancel out that bias and in other cases, the machine learning models will amplify that bias. The best way to solve the problem is to carefully construct the right datasets that allow the machine learning algorithms to understand all the variations in a language.” – Robert Munro
Munro has posted a video demonstrating that this bias exists in many systems – Amazon’s models included.
A key takeaway is that this is not a bias introduced by human annotators, nor a bias about the world; it is a bias fundamental to the language itself. It’s perhaps another example of AI’s knack for reflecting our world back to us in unexpected ways.