How does voice AI talk like a human?

We pull back the curtain on how we build human-like voice AI, in plain, non-technical language.

Since the founding of our startup, we have given more than a hundred product demonstrations to clients around the world.

Usually, it is quite easy for our clients to understand what we do:

Cool! You guys build voice AI that talks with customers, over the phone.

But what really sparks huge curiosity (and possibly suspicion), and the question we always get, is:

But howww do you guys do that?

Indeed, it is quite impressive to hear a machine realistically replicate human conversation, especially when most voice assistants today can only answer simple questions in a somewhat robotic voice.

We thus feel a responsibility to open this black box for our customers, and for everyone else interested in the tech, because we believe:

  • Everyone should be able to understand tech, if they want to.
  • Everyone has the right to be well informed before making a purchase.

Therefore, in this blog, we will pull back the curtain on how we build life-like voice AI that helps businesses accomplish tasks. Instead of using endless complicated AI terms, we will explain everything in plain, non-technical language.

An ideal system would allow voice AI to function exactly like a human speaker, and that is the system we are constantly working toward.

Just like us humans, voice AI generates a real-time response to every sentence it hears from the customer. The response mainly depends on:

  • Who the customer is. Voice AI is backed by rich customer data such as gender, job, age, etc.
  • The context of the conversation, such as the customer’s mood and intention from earlier turns in the same call.
  • The aim of the call. Voice AI calls customers to meet a goal, not to chat at random.

For example, let’s say voice AI is going to call a customer named Jack on behalf of a car-selling platform. Jack’s browsing history suggests he has recently been looking to buy a car, so voice AI is used to call Jack and sell him a new one.

To cover the first point, who the customer is, voice AI retrieves all relevant data about Jack from its database, where millions of customer records are stored (with consent).

Then a unique user profile of Jack is generated, as in the following picture.


Based on the user profile of Jack, AI has the following conversation with him:

To be more specific, let us look at how AI analyzes and responds to this particular sentence in the conversation:

These are the steps voice AI takes:


Step 1: Understanding the customer’s meaning

In this step, voice AI tries to understand what Jack means by the words above.

  • Using a technology called ASR (Automatic Speech Recognition), AI transcribes Jack’s voice response into text. More advanced ASR can also identify a customer’s emotion from their tone.

  • Then the text response is analyzed by a technology called NLU (Natural Language Understanding).

What NLU does is label the customer’s intention based on the text. There are hundreds of thousands of labels in voice AI’s database, and AI picks the one that is the closest match.

For this sentence, AI labels it with “price too high”.

There is more that NLU can do, though. For instance, it can also identify the customer’s emotion, this time based on the text. It also extracts important data points (slots) from the sentences, such as figures and objects, for later analysis. But to keep this example simple, we will skip these features for now.
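Real NLU models are trained on huge amounts of data, but the basic idea of intent labeling can be sketched in a few lines. The labels and keywords below are made up for illustration and are not our actual label set:

```python
# Toy intent labeler: picks the label whose keywords best match the text.
# Real NLU uses trained models over huge label sets; this is only a sketch.

INTENT_KEYWORDS = {
    "price too high": {"expensive", "price", "afford", "cost", "budget"},
    "not interested": {"no", "stop", "busy"},
    "wants details": {"tell", "more", "details", "which", "model"},
}

def label_intent(text: str) -> str:
    words = set(text.lower().replace(",", " ").replace(".", " ").split())
    # Score each label by how many of its keywords appear in the text.
    scores = {label: len(words & kw) for label, kw in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(label_intent("That sounds too expensive, it's over my budget"))
# -> price too high
```

A production system would score every candidate label with a statistical model instead of counting keywords, but the input and output are the same: text in, the closest-matching intent label out.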


Step 2: Collecting information to decide on the next action

Now AI knows Jack’s reaction is “price too high”, but what does it really imply? Does it mean he finds the price totally unacceptable, or that with a change of strategy there is still a chance he accepts the deal?

To decide on the next step to take, voice AI needs to gather and analyze all relevant information about the conversation. This includes not only the ASR and NLU results for the current words, but also the dialogue history and useful information from the customer’s profile.

A component called the memory platform gathers all this information, as shown in the following picture.
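As a rough sketch, you can think of the memory platform as a single container that merges the customer profile, the call goal, and the dialogue history so far. The field names below are our own invention for illustration, not an actual schema:

```python
from dataclasses import dataclass, field

# A toy "memory platform": one object that aggregates everything the
# dialogue manager needs to decide the next action.

@dataclass
class DialogueState:
    profile: dict                                # customer data from the database
    call_goal: str                               # the aim of the call
    history: list = field(default_factory=list)  # (speaker, text, intent) turns

    def record_turn(self, speaker: str, text: str, intent: str = "") -> None:
        self.history.append((speaker, text, intent))

    def last_intent(self) -> str:
        # Most recent customer intent, if any.
        for speaker, _, intent in reversed(self.history):
            if speaker == "customer" and intent:
                return intent
        return "unknown"

state = DialogueState(
    profile={"name": "Jack", "likes_discounts": True, "style": "direct"},
    call_goal="sell a new car",
)
state.record_turn("ai", "Hi Jack, I'm calling about a car you viewed.")
state.record_turn("customer", "That price is too high for me.", "price too high")
print(state.last_intent())  # -> price too high
```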


Step 3: Action decision

Aggregating all the information from these different sources, voice AI comes to some judgements:

  • Jack is not uninterested in the car; he just thinks it is expensive. However, the price is only slightly higher than his budget.
  • We know he likes to seek coupons and discounts for all kinds of purchases, and he likes to communicate in a direct manner.

Based on this information, voice AI decides on the next action: continue selling the car to Jack, but offer him a small discount and a free maintenance package, since that is what he likes.

It will also point out all these benefits directly and push for the order, since Jack likes blunt talk.

This whole decision process happens in a so-called “dialogue management system”, where complex algorithms run after every response from the customer to decide what to say next.
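Real dialogue management involves far richer algorithms, but the core mapping, from the customer’s intent plus their profile to the next action, can be sketched with a few simple rules. The rules and profile fields below are hypothetical:

```python
# Toy dialogue policy: intent + profile -> next action.
# Real systems use far more sophisticated logic; this only shows the mapping.

def decide_action(intent: str, profile: dict) -> dict:
    action = {"action": "continue_pitch", "offers": [], "tone": "neutral"}
    if intent == "price too high":
        # Price objection: keep selling, but sweeten the deal
        # if the profile says this customer responds to discounts.
        if profile.get("likes_discounts"):
            action["offers"] = ["small discount", "free maintenance package"]
    elif intent == "not interested":
        action["action"] = "wrap_up_politely"
    # Match the customer's preferred communication style.
    if profile.get("style") == "direct":
        action["tone"] = "direct"
    return action

jack = {"likes_discounts": True, "style": "direct"}
print(decide_action("price too high", jack))
# -> {'action': 'continue_pitch',
#     'offers': ['small discount', 'free maintenance package'],
#     'tone': 'direct'}
```

In practice, such decisions come from learned policies and scoring models rather than hand-written rules, but the input and output are the same shape: conversation state in, next action out.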


Step 4: Generating a response

Based on the action decision in the last step, voice AI generates the following response to Jack:

The technologies used here are NLG (Natural Language Generation) and TTS (Text to Speech).

By now, voice AI has completed one round of conversation with Jack. 

Every other round of the conversation in the call goes through the same workflow, again and again. This way, voice AI can speak with the customer continuously, aiming to sell the car.
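Putting the four steps together, each round of the call is one pass through the same pipeline. The sketch below stubs out ASR and TTS (which need real audio) and wires toy versions of the other stages; all function names and replies are our own invention:

```python
# One round of the voice AI workflow, with ASR/TTS stubbed out.
# A real asr() would transcribe audio; a real tts() would synthesize speech.

def asr(audio: str) -> str:
    return audio  # stub: pretend the "audio" is already text

def nlu(text: str) -> str:
    return "price too high" if "expensive" in text.lower() else "unknown"

def decide(intent: str) -> str:
    return "offer_discount" if intent == "price too high" else "ask_question"

def nlg(action: str) -> str:
    replies = {
        "offer_discount": "I can offer you a discount and free maintenance.",
        "ask_question": "Could you tell me a bit more?",
    }
    return replies[action]

def tts(text: str) -> str:
    return f"[speaking] {text}"  # stub: would produce audio

def one_round(customer_audio: str) -> str:
    text = asr(customer_audio)   # Step 1a: speech -> text
    intent = nlu(text)           # Step 1b: text -> intent
    action = decide(intent)      # Steps 2-3: choose the next action
    reply = nlg(action)          # Step 4a: action -> text
    return tts(reply)            # Step 4b: text -> speech

print(one_round("That car is too expensive."))
# -> [speaking] I can offer you a discount and free maintenance.
```

The call simply loops `one_round` until the conversation ends, which is why the same workflow repeats for every customer sentence.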


Last note

A quick recap:

Voice AI generates a real-time response to every sentence from the customer, combining the analysis of:

  • The current feedback from the customer (the ASR and NLU results).
  • The conversation history, i.e. what happened earlier in the call (from the memory platform).
  • The aim of the call and the customer profile from the database.

We have illustrated how voice AI interacts with customers, just like another human being, using the simplest example we could think of.

But the whole system can be much, much more sophisticated. 

In this system, there are some parts we have done pretty well, and some parts that still need painstaking effort to improve. We will cover these in a future blog post.
