While we are far from understanding what went wrong with the Apple Card credit approval system, we can ask questions to expose the factors that made this issue a big problem for Apple and Goldman Sachs. It will take a while for the investigation to be completed and we’d be surprised if the finding is that the approval process is illegal. But the internet has spoken: it might not be illegal but it’s unfair, not right and off-brand.
We suspect that the Goldman algorithm was trained on data that included an important bias: in traditional credit card approval, the husband is the primary card holder. As a result, the algorithm assigned higher creditworthiness to the primary card holder, and primary card holder status became a proxy for gender.
The whole situation was made worse by three things: a number of applications coming from a demographic or group that exposed this bias; an AI-enabled product that broke the mental model of Apple family sharing; and a total lack of “human-in-the-loop” recovery, combined with unexplainable and non-intuitive AI.
What made it all worse? Here’s where we think the biggest failures will be found:
- Not sufficiently detecting or understanding a bias and then de-biasing data.
- Lack of understanding of protected traits and their proxies.
- Insufficient testing of algorithm performance on fairness metrics.
- Insufficient testing on edge cases and extreme scenarios.
- Disruption of the mental model of inclusion and sharing that Apple customers expect. Apple Card is unique because it is exclusively an individual card. The AI likely used historical data in which husbands are the primary card holders and so receive the “credit” for the credit.
- Lack of planning for failure – lack of understanding of potential failure and preparation with “human-in-the-loop.”
- Lack of recognition about how people feel when AI is in control and humans aren’t.
These are the specific questions that matter:
Bias is introduced first in the data. While there are ways to deal with bias, it is difficult to remove completely – there’s really no such thing as neutral.
- If a new dataset was needed for Apple Card, how did the developers collect the data?
- How was the population defined?
- If one or more existing datasets were also used, what changes or additions, if any, were needed for the Apple Card user population but were not made?
- What tool was used for evaluating the dataset for bias/de-biasing? When was this completed? What were the insights? Was there a process for adjustment and correction?
- What was the list of protected traits? What was the distribution of these in the population?
- What proxies and imperfect substitutes were identified and used?
- What potential historical sources of bias were there?
- How did the developers agree on what data would be used for the training and test sets? Was the training set tested against the population?
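One simple way to approach the last question is to compare how often each group appears in the training set with how often it appears in the population the product will serve. The sketch below is only an illustration: the group labels, the ten training records and the 50/50 population shares are all made-up assumptions, and `representation_gaps` is a hypothetical helper, not anything we know Apple or Goldman Sachs used.

```python
from collections import Counter


def representation_gaps(training_labels, population_shares):
    """Return (training share - population share) for each group."""
    counts = Counter(training_labels)
    total = len(training_labels)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }


# Hypothetical data: 10 training records and an assumed 50/50 population.
training = ["M"] * 7 + ["F"] * 3
gaps = representation_gaps(training, {"M": 0.5, "F": 0.5})

for group, gap in gaps.items():
    # A positive gap means the group is over-represented in training data.
    print(f"{group}: {gap:+.2f}")
```

A large gap does not prove the model is biased, but it is the kind of insight a team would want documented before choosing training and test sets.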
Designing algorithms is complex, with many subtle and context-specific decisions. The role of AI governance is to track these decisions and tradeoffs. Many systems intentionally or unintentionally make use of proxies for protected traits, which can increase the risk of proxy discrimination.
- What process was used to assess and document the rationale for use of protected traits?
- What check was done for correlations between all the features and the protected traits to identify important proxies?
- What tests on the model were done to understand its performance on different subgroups?
- How did the team examine and document how to aggregate a dataset or decide to use a single model or multiple models for different input groups?
- How did the team document the differences between the contexts applicable in transferring datasets and models, versus the context in which they were used?
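The correlation check asked about above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the feature names, the eight applicants and the 0.3 threshold are all invented, and `flag_proxies` is a hypothetical helper. It uses plain Pearson correlation between each feature and a binary protected trait, which only catches simple, single-feature proxies; combinations of features can still encode a protected trait without any one of them being flagged.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information about the trait
    return cov / math.sqrt(var_x * var_y)


def flag_proxies(features, protected, threshold=0.3):
    """Return features whose absolute correlation with the trait meets the threshold."""
    flagged = {}
    for name, values in features.items():
        r = pearson(values, protected)
        if abs(r) >= threshold:
            flagged[name] = r
    return flagged


# Hypothetical applicants: 1 = female, 1 = primary card holder.
is_female = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "is_primary_holder": [1, 1, 1, 0, 0, 0, 0, 1],
    "years_at_address": [2, 9, 4, 7, 3, 8, 5, 6],
}

proxies = flag_proxies(features, is_female)
print(proxies)
```

In this made-up data, primary card holder status is flagged as a candidate proxy for gender while years at address is not, which is the pattern we suspect in the Apple Card case.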
Fairness decisions and discrimination assurance
Discrimination concerns arise where a group of people shares protected characteristics, such as race, gender or age. What we need to be concerned with is when such a group is disproportionately disadvantaged in ways that cannot be reasonably justified.
- Did the design team understand the concepts of disparate treatment and disparate impact?
- Were protected characteristics identified?
- Were different versions of “fairness” modeled?
- What metric for fairness was chosen and was the rationale documented?
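To make “different versions of fairness” concrete, here is a minimal sketch that computes three common measures for two groups: the ratio of approval rates (often used when discussing disparate impact), the difference in approval rates (demographic parity) and the difference in approval rates among applicants who were in fact creditworthy (equal opportunity). The data, the group names and the `fairness_report` helper are all hypothetical; nothing here reflects the actual Apple Card model or the metric its team chose.

```python
def rate(values):
    """Share of 1s in a list of 0/1 values (0.0 for an empty list)."""
    return sum(values) / len(values) if values else 0.0


def fairness_report(groups, approved, creditworthy, reference, comparison):
    """Compare the comparison group with the reference group on three measures."""
    stats = {}
    for g in (reference, comparison):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        approval_rate = rate([approved[i] for i in idx])
        # True positive rate: approvals among applicants who were creditworthy.
        tpr = rate([approved[i] for i in idx if creditworthy[i] == 1])
        stats[g] = (approval_rate, tpr)
    (rate_ref, tpr_ref), (rate_cmp, tpr_cmp) = stats[reference], stats[comparison]
    return {
        "disparate_impact_ratio": rate_cmp / rate_ref if rate_ref else float("nan"),
        "demographic_parity_diff": rate_ref - rate_cmp,
        "equal_opportunity_diff": tpr_ref - tpr_cmp,
    }


# Hypothetical outcomes for five men and five women.
groups = ["men"] * 5 + ["women"] * 5
approved = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]
creditworthy = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]

report = fairness_report(groups, approved, creditworthy, "men", "women")
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

These measures can disagree with one another, which is why the choice of metric and its rationale need to be documented. Any pass/fail threshold, such as the 0.8 ratio borrowed from the four-fifths rule of thumb in US employment guidance, would be the team's own assumption rather than something this sketch prescribes.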
Human-centered design and mental models
A mental model is a person’s understanding of how something works. We all bring mental models to using a product, and they help set expectations of how it works. In this case, it looks like the most important part of the mental model that people brought to Apple Card was the expectation of inclusion, set both by Apple’s brand positioning and by the familiar family sharing system. The application process broke this mental model. This was likely a particular problem because use of a traditional credit card as a couple likely generates data where the primary card holder (who appears to be the husband) gets the credit.
- What mental model might Apple Card customers have at the outset? How is that mental model set and how will the algorithm disrupt this?
- What were the assumptions that the product team made about how customers may think the AI makes its decisions?
- What about the AI is not intuitive to humans? How can a human explain the AI’s decision?
- How did the product team prepare the support team for AI anxiety and anger post-failure?
- What tools, process and authority do the front-line people have for explaining and mitigating a failure of the AI?
- How important was trust as an outcome for customers? How was this explicitly designed for?
- Did the product team inspect the model with an appropriate testing tool and document the insights and decisions?
- Did they test edge cases and extreme scenarios to check what the model did?
- What were the secondary effects from the reward function (false positives and false negatives) that the team did not plan for?
- How was fairness evaluated? What were the metrics? Was there a document prepared ready for public disclosure?
- How did the team design a system to detect and measure failure during pre-launch testing?
- What were the worst case scenarios if something went wrong?
- What was the plan for handling a worst case scenario?
- Did the team have a diverse set of trusted users willing to use the in-development AI and give feedback? Employees and ex-employees perhaps?
- What was the agreed success metric that determined launch readiness?
- Was there a check-in with legal counsel to get legal sign-off on the planned usage of the dataset?
- How were front-line people dealing with customers? What problems were arising based on the algorithm or the data?
- What questions did customers ask? What made them anxious and what made them angry?
- What non-intuitive results were arising? How were these being explained?
- What was the process for testing and monitoring for model drift over time?
- What was the process for ensuring design included the ability to improve from feedback, so that new sources of bias could be found and the experience of users captured?
- How does the dataset stay up to date over time?
- Are there recurring meetings for the product team to discuss tuning the model?
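As one illustration of what monitoring for model drift can look like, the sketch below computes a Population Stability Index (PSI), a measure often used in credit scoring to compare the distribution of scores at launch with the distribution seen later. We do not know what monitoring Apple or Goldman Sachs had in place; the scores, the bin edges and the `population_stability_index` helper are assumptions made up for this example.

```python
import math


def population_stability_index(baseline, current, edges):
    """PSI between two score samples using shared bin edges.

    Scores that fall outside the edges are ignored when counting.
    """
    def shares(scores):
        counts = [0] * (len(edges) - 1)
        for s in scores:
            for i in range(len(edges) - 1):
                is_last_bin = i == len(edges) - 2
                in_bin = edges[i] <= s < edges[i + 1]
                # The final bin also includes its upper edge.
                if in_bin or (is_last_bin and s == edges[-1]):
                    counts[i] += 1
                    break
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(scores), 1e-6) for c in counts]

    base, curr = shares(baseline), shares(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, curr))


# Hypothetical model scores at launch and some months later.
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
launch_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
later_scores = [0.1, 0.3, 0.6, 0.6, 0.7, 0.8, 0.9, 0.9]

psi_same = population_stability_index(launch_scores, launch_scores, edges)
psi_shifted = population_stability_index(launch_scores, later_scores, edges)
print(f"no change: {psi_same:.3f}, shifted: {psi_shifted:.3f}")
```

A PSI of zero means the two distributions match across the bins; larger values mean the scored population has moved. What level should trigger a review is a governance decision, and running the same comparison separately for each subgroup would show whether drift affects some customers more than others.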
AI governance is different. AI learns from data and data about the world is inherently biased. Add to this that an AI’s behavior is somewhat dependent on its post-design experience. Products that are AI-enabled require a different kind of oversight, a process that looks more like managing people than managing IT.