The next generation of voice technology
It is predicted that by 2023 the number of voice assistants in use worldwide will have tripled, reaching 8 billion.
While Amazon Alexa remains the market leader in terms of smart speakers, as mentioned in the previous step, we will begin to see voice activation migrate to other home appliances, vehicles and on-the-go devices.
As the use cases and environments in which we use conversational interfaces broaden, voice technology itself will become more refined. Google AI has already made great progress on the limitations of speech-to-text, such as background noise and handling multiple voices. We are now seeing further developments that point towards more efficient and robust technology in action.
Take, for instance, Google’s experimentation with what they have dubbed Translatotron. Their experts have demonstrated further progress in translating a voice from one language to another whilst preserving the vocal characteristics of the original speaker. So, in practice, if a speaker were to deliver a phrase in Spanish, it could be translated directly into English whilst still sounding true to the original speaker’s voice.
In this example they have used a single sequence-to-sequence, speech-to-speech model, which speeds up the entire process and does away with the usual intermediary text representation you have previously learned about. This method may also improve translation quality: with fewer steps between input and output, there are fewer points at which errors can be introduced and carried forward. This proof of concept shows the potential for voice technology to vastly improve the speed, flow, and exchange of information.
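To see why fewer steps could help, here is a minimal sketch in Python. It is an illustration only, not a description of how Translatotron works: the stage names and the 0.95 accuracy figures are made-up assumptions, and it assumes that errors at each stage are independent and that a single model could match the accuracy of one stage, which is not guaranteed in practice.

```python
# Illustrative only: stage names and accuracy figures are invented
# assumptions, not measurements of any real translation system.

def pipeline_accuracy(stage_accuracies):
    """Chance that every stage gets its part right, assuming independent errors."""
    result = 1.0
    for accuracy in stage_accuracies:
        result *= accuracy
    return result

# A cascaded approach passes the input through three separate stages.
cascaded = {
    "speech_to_text": 0.95,
    "text_translation": 0.95,
    "text_to_speech": 0.95,
}

# A direct approach uses one speech-to-speech stage (hypothetical figure).
direct = {"speech_to_speech": 0.95}

cascaded_accuracy = pipeline_accuracy(cascaded.values())
direct_accuracy = pipeline_accuracy(direct.values())

print(f"Cascaded (3 stages): {cascaded_accuracy:.3f}")
print(f"Direct (1 stage):    {direct_accuracy:.3f}")
```

Under these assumptions, the three-stage chain is less likely to get everything right than the single stage, because a mistake at any stage is carried into the next. Whether a real single model actually reaches that level of accuracy is a separate question that only testing can answer.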
There are also leaps being made with regard to how much information can be gleaned from the sound of our voices. Chinese technology giant Huawei announced in 2018 that they were developing a voice assistant that could respond to human emotion, and Amazon have been continually investing in detecting emotion in voice. This takes conversational interfaces to another level: analysing voice to understand mood, active or passive states, and wellness, and using these factors to respond with greater personalisation and precision.
In the field of healthcare, Pillo is a voice-activated robot that helps monitor a user’s health. It can remind a user to take medication, provide nutritional advice, or alert family members if help is required. Of course, health information is often sensitive and private. As conversational interfaces grow more intelligent and combine with other aspects of AI, like facial recognition or accent detection, a more detailed data profile of a person could be compiled, raising concerns about privacy and data misuse.
To combat this, researchers at Imperial College have used AI to develop a layer of protection between voice assistants and the cloud. It removes emotion-based information from a user’s voice and converts the voice to a neutral state before it is sent on, preserving the user’s privacy. Their results showed “that identification of sensitive emotional state of the speaker is reduced by ~96 %.”
This push and pull, with technology firms advancing in one direction and user advocates and researchers developing alternatives in another, is inevitable in the wake of more sophisticated artificial intelligence. With big data managed by large technology providers, transparency will be increasingly important in maintaining users’ trust.
In steps 2.4 and 2.5 you reflected on the ethical considerations around voice technology, particularly how voice products are made and how bias and stereotypes can be perpetuated in voice assistants. As conversational interfaces become more ubiquitous, refined and personalised, technology leaders and researchers will need to keep monitoring how users can be protected, especially where their data is concerned.
Have your say
What do you think about the future of advanced, emotionally intelligent voice technology and providers being able to build more personalised, accurate experiences? Can you identify the potential benefits?
How do you feel about the idea of technology firms being able to cater to your mood, emotional state or stress level? Would you trust them with a potential voiceprint that contains more details about you?
Discuss your thoughts with your fellow learners in the comments section.