Apple is famous (infamous?) for not talking much about its AI, which makes it difficult to figure out what direction Apple AI capability might head.
In the AI research community, it’s lore that Apple publishes just enough to help attract and retain AI talent. The company was a platinum sponsor of, and presenter at, the recent Interspeech 2019 conference, the world’s largest conference on the science and technology of spoken language processing. That makes the conference a good way to get a glimpse of what research Apple considers attractive for recruiting and valuable for Siri.
Here are three interesting ideas from Apple’s research on Siri, and voice assistants in general. They are all incremental steps on the path to the holy grail of voice – an assistant that understands the user’s intent.
Siri gets emotional
Siri has a very limited understanding of a user’s intent. Take a command such as “find the nearest police station.” The user’s intent could be to find information or to get directions, and the user may or may not be in an emergency. These combinations would be relevant to how Siri could best respond.
An important signal of intent that humans understand is expression, so Apple researchers are working on determining whether expression can signal intent and how best to model it. In this work, expression is a combination of emotion modelling and acoustics.
- Arousal (calm versus active) and valence (negative expression to positive expression) are used to model emotion.
- Acoustic features (measures of articulation of speech) are modeled with reference to the human vocal tract and its various physical measures.
The researchers found:
- The AI can perceive emotion from valence and arousal and correlate this with expressiveness.
- Better acoustic features help to generate better acoustic and emotion embeddings.
- Articulatory information can help in detecting emotional variations in speech, especially valence detection.
An expression model that uses both acoustics and emotion can reliably detect expression, which lays the foundation for using expression for intent detection.
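To make the idea of combining emotion and acoustics concrete, here is a minimal sketch, not the researchers' model. It assumes arousal and valence are already available as numbers on a -1 to +1 scale, treats the acoustic features as a short hypothetical vector, and uses a hand-picked logistic scorer in place of learned embeddings. The names `Utterance`, `expression_features`, `expression_score` and the weights are all invented for illustration.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    arousal: float         # calm (-1.0) to active (+1.0); the scale is an assumption
    valence: float         # negative (-1.0) to positive (+1.0); the scale is an assumption
    acoustic: List[float]  # hypothetical articulation features


def expression_features(u: Utterance) -> List[float]:
    """Concatenate the two emotion dimensions with the acoustic features."""
    return [u.arousal, u.valence, *u.acoustic]


def expression_score(u: Utterance, weights: List[float], bias: float = 0.0) -> float:
    """Return a logistic score in (0, 1); higher means more expressive speech."""
    x = expression_features(u)
    if len(x) != len(weights):
        raise ValueError("weights must match the feature vector length")
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


# Made-up weights for illustration only; nothing here is learned from data.
WEIGHTS = [1.5, -0.5, 0.8, 0.8]

calm = Utterance(arousal=-0.6, valence=0.2, acoustic=[0.1, 0.2])
urgent = Utterance(arousal=0.9, valence=-0.7, acoustic=[0.8, 0.9])

if __name__ == "__main__":
    print("calm:", round(expression_score(calm, WEIGHTS), 3))
    print("urgent:", round(expression_score(urgent, WEIGHTS), 3))
```

With these invented weights the active, negative-valence utterance scores higher than the calm one. A downstream intent detector could use that kind of signal to tell an emergency “find the nearest police station” from a casual one.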
Trust through mirroring
Apple wants Siri to be in a long-term relationship with users. Long-term relationships require trust. So Apple researchers want to understand what builds trust.
In human-human relationships, trust is built in many ways, one of which is “mirroring,” which happens over the course of an interaction. It’s where people mimic and reflect some of their partner’s behaviors back to them. In a physical environment, these mirroring strategies can be mimicry, social resonance, coordination, synchrony, attunement and the chameleon effect. In a conversational environment, where the partner is a digital assistant, it turns out the same process is at work. Humans trust an agent more if it converses in their own style, and one measurable aspect of that style is “chattiness.” Apple researchers want to know if chattiness – the degree to which a query is concise (high information density) versus talkative (low information density) – can influence whether a user trusts Siri more.
They found that user opinion of the likability and trustworthiness of a digital assistant improves when the assistant mirrors the degree of chattiness of the user. Also importantly, the information necessary to accomplish this mirroring can be extracted from user speech.
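As a rough illustration of mirroring chattiness, the sketch below estimates information density as the share of words in a query that are not filler, then picks a matching response style. This is an assumption-heavy toy, not Apple's method: the filler-word list, the 0.6 threshold and the function names `information_density`, `response_style` and `respond` are all hypothetical.

```python
import re

# Hypothetical filler-word list; a real system would need something far richer.
FILLER_WORDS = {
    "a", "an", "the", "please", "could", "would", "can", "you", "me", "i",
    "hey", "hi", "um", "uh", "so", "just", "like", "what", "is", "it",
    "in", "to", "of", "for", "and", "that",
}


def information_density(query: str) -> float:
    """Fraction of words that carry content (1.0 = concise, near 0.0 = talkative)."""
    words = re.findall(r"[a-z']+", query.lower())
    if not words:
        return 0.0
    content = [w for w in words if w not in FILLER_WORDS]
    return len(content) / len(words)


def response_style(query: str, threshold: float = 0.6) -> str:
    """Mirror the user: dense queries get a concise reply, others a chatty one."""
    return "concise" if information_density(query) >= threshold else "chatty"


def respond(query: str, answer: str) -> str:
    """Wrap the same answer in a style that matches the user's chattiness."""
    if response_style(query) == "concise":
        return answer
    return f"Sure thing! {answer} Let me know if you need anything else."


if __name__ == "__main__":
    terse = "weather Boston tomorrow"
    talkative = "Hey, could you please tell me what the weather is like in Boston tomorrow?"
    for q in (terse, talkative):
        print(response_style(q), "->", respond(q, "Sunny, 70F."))
```

The answer itself is identical in both cases; only the framing changes, which matches the idea of mirroring as a matter of conversational style rather than content.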
This was early research, but it hints at a future where Siri will predict what kind of conversational style will promote trust and adjust its style accordingly.
Labeling gets more efficient
Labeling data for Siri to learn from is expensive and slow. Data needs to represent novel instances and not simply add to what Siri already knows. It’s also a privacy-invasive activity. Not even Apple has an unlimited labeling budget, so it makes sense to be laser-focused on improving the efficiency and effectiveness of the process.
Apple is investigating a process called “active learning.” In a nutshell, it’s a process that takes user signals that indicate a false prediction (inside the confusion zone, shown conceptually below) and uses these directly to refine what training data goes to human annotators. Sort of an AI working on the AI. Prediction errors include user actions that imply that their request was not correctly processed, such as early termination of a piece of music or switching to direct use of an application. The problem is that these don’t happen very often (relatively speaking). Apple’s goal is to make better use of these rare but valuable signals.
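The selection step can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Apple's pipeline: it treats the “confusion zone” as a band of mid-range model confidence (0.4 to 0.6, an invented range), represents the user's implicit error signal as a boolean, and ranks candidates so a fixed annotation budget goes to the most informative requests first. `Interaction`, `in_confusion_zone` and `select_for_annotation` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Interaction:
    request_id: str
    confidence: float          # model confidence in its prediction, 0.0 to 1.0
    user_signaled_error: bool  # e.g. music stopped early, or user opened the app directly


def in_confusion_zone(confidence: float, low: float = 0.4, high: float = 0.6) -> bool:
    """True when the model was unsure; the band limits are assumptions."""
    return low <= confidence <= high


def select_for_annotation(interactions: List[Interaction], budget: int) -> List[Interaction]:
    """Pick up to `budget` interactions to send to human annotators."""
    candidates = [
        i for i in interactions
        if i.user_signaled_error or in_confusion_zone(i.confidence)
    ]

    def priority(item: Interaction):
        # Rare user error signals rank first, then uncertain predictions,
        # then whichever prediction sits closest to a coin flip.
        return (
            item.user_signaled_error,
            in_confusion_zone(item.confidence),
            -abs(item.confidence - 0.5),
        )

    return sorted(candidates, key=priority, reverse=True)[:budget]


if __name__ == "__main__":
    log = [
        Interaction("a", 0.95, False),  # confident, no complaint: skipped
        Interaction("b", 0.55, True),   # unsure and the user signaled an error
        Interaction("c", 0.50, False),  # unsure
        Interaction("d", 0.90, True),   # confident, but the user signaled an error
        Interaction("e", 0.45, False),  # unsure
    ]
    print([i.request_id for i in select_for_annotation(log, budget=3)])
```

Putting the user error signal at the front of the ranking reflects the point that such signals are rare but valuable: they always outrank interactions flagged only by model uncertainty.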
As voice-powered digital assistants become more advanced and operate in more sophisticated ways, building trust, controlling cost and finding novel ways of determining intent all become more important. These three papers offer some hints at priorities within Apple and directions for Siri.