So how it works is that a user says something that's usually called an utterance. If I said, "Alexa, book me a taxi", that's an utterance. Within that utterance, first of all, there's a wake word. A wake word is Alexa, hey Google, hey Siri, or hey Cortana, if anyone uses Cortana. Then you have a launch request, which is "ask so and so", "open so and so", "launch xyz". And then you have the invocation phrase, which might be "my fantastic taxi company": "Hey Alexa, open my fantastic taxi company." So the whole thing is made up of a wake word, a launch request, and an invocation phrase, and all of it wrapped up together is called an utterance.
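That three-part breakdown can be sketched in a few lines. This is a toy illustration only - none of these names come from a real Alexa API, and it assumes the utterance starts with a known wake word:

```python
# Toy sketch: split an utterance into wake word, launch request, and
# invocation phrase. The word lists here are illustrative, not official.
WAKE_WORDS = ("alexa", "hey google", "hey siri", "hey cortana")
LAUNCH_REQUESTS = ("open", "ask", "launch")

def parse_utterance(utterance):
    """Naively split an utterance into its three parts."""
    text = utterance.lower().rstrip(".")
    wake = next(w for w in WAKE_WORDS if text.startswith(w))
    rest = text[len(wake):].lstrip(", ")
    launch = next(l for l in LAUNCH_REQUESTS if rest.startswith(l))
    invocation = rest[len(launch):].strip()
    return {"wake_word": wake, "launch_request": launch, "invocation": invocation}

parse_utterance("Alexa, open my fantastic taxi company")
# -> {'wake_word': 'alexa', 'launch_request': 'open',
#     'invocation': 'my fantastic taxi company'}
```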
Anything someone says to any conversational system is an utterance. The audio you've just spoken is recorded by Alexa and sent to the cloud, where a process called ASR, automatic speech recognition, goes through the audio sample and turns it into text. That text is then fed through an NLP engine, natural language processing, which cleans up the text so that it's in a legible format. If I say, "Can I have a taxi for six. No, I mean seven. Actually, half past six."
I've said a lot of stuff there, but all I really care about is that I need a taxi for half past six. Part of the natural language processing is to figure out all of those corrections I've made and extract from that sentence an intent - the thing that the user is trying to do. That intent is then sent to the application - the third-party application, if it's your taxi service - and then you need to respond to that intent. So you take whatever's in that intent, and intents in Alexa are made up of what's called slots.
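One small piece of that clean-up can be sketched like this - a deliberately crude stand-in for a real NLU engine, which just assumes that when a user corrects themselves, the last value they said wins:

```python
import re

# Toy NLU clean-up: find every time expression in the utterance and keep
# only the last one, on the assumption that later corrections win.
# Real natural language processing is far more sophisticated than this.
NUMBER_WORDS = "one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve"
TIME_PATTERN = re.compile(rf"\b(half past (?:{NUMBER_WORDS})|(?:{NUMBER_WORDS}))\b")

def extract_time(utterance):
    """Return the last time expression mentioned, or None if there isn't one."""
    matches = TIME_PATTERN.findall(utterance.lower())
    return matches[-1] if matches else None

extract_time("Can I have a taxi for six. No, I mean seven. Actually, half past six.")
# -> 'half past six'
```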
So if I want to book a taxi, I'll have a book-a-taxi intent, but then there are certain values that the system needs in order to actually book that taxi. I'll need to know what time you need it for. I'll need to know what address you're leaving from. Those are what's called slots - values that give me enough information to be able to fulfill something. I then handle that logic in my code base.
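The slot-filling idea can be sketched as follows. The intent name, slot names, and prompts are all made up for illustration - they're not Amazon's actual schema:

```python
# Hypothetical "BookTaxi" intent with two required slots. If a slot is
# missing, the skill asks a follow-up question; once every slot is
# filled, the booking logic can run.
REQUIRED_SLOTS = ("pickup_time", "pickup_address")

PROMPTS = {
    "pickup_time": "What time do you need the taxi for?",
    "pickup_address": "What address are you leaving from?",
}

def next_prompt(intent):
    """Return the question for the first missing slot, or None when
    every slot is filled and the intent can be fulfilled."""
    slots = intent.get("slots", {})
    for slot in REQUIRED_SLOTS:
        if not slots.get(slot):
            return PROMPTS[slot]
    return None  # all slots filled: ready to make the booking

next_prompt({"name": "BookTaxi", "slots": {"pickup_time": "half past six"}})
# -> 'What address are you leaving from?'
```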
I'll make the booking, create the response, and send the response back as text to the Alexa cloud. Then text to speech - which is another type of technology - takes that text, translates it into speech, and creates an audio file that is then read back out through Alexa. So that's an overview of how the vast majority of these platforms work. There might be some differences and nuances here and there, but broadly speaking, that's the process and the stack involved. A skill is Amazon's word for a voice application. Google call them actions, and other people may call them experiences. Just think of them as apps - that's probably the best way.
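The "send the response back as text" step can be sketched too. This follows roughly the shape an Alexa skill returns - a JSON body whose plain-text speech field the text-to-speech stage turns back into audio - though the helper function itself is made up for illustration:

```python
def build_response(speech_text):
    """Wrap the text we want spoken in roughly the shape of an Alexa
    skill response; the platform's TTS turns the text back into audio."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech_text},
            "shouldEndSession": True,
        },
    }

build_response("Your taxi is booked for half past six.")
```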
As a user, you won't necessarily know that the skills or apps are created by individual companies, people, or organisations. You don't hear that credited. You just say your utterance and receive a response. It's not like an app store, where, with icons and images, you may see that this is made by this company or that's made by that company. You don't see any of that, so it's all kind of hidden. It all appears as though it's all from Alexa or all from the Google system.