Voice Recognition and UX

In Usability, UX Design by relding

For as long as most of us can remember, computers have been an important part of our lives. And all that time we've been interacting with them one way or another. Back in the 1950s, this interaction consisted of pressing keys and entering commands into the command line. Before that, commands had to be given to the computer using punch cards or paper tape. It was not until 1978 and the introduction of the VT100, a video terminal developed by DEC, that users could see command line information more quickly. When they emerged in the 1980s, GUIs made computers much easier to use. Nowadays, it is almost impossible to imagine working on a computer without the iconic mouse.

Today, when almost all of us have at least one smartphone in our pocket, it is only a matter of time before voice interaction becomes the preferred input method. And according to the demos shown at CES 2017 and all the buzz on tech blogs, it seems that time is now.

In this article, we will try to explain the implications of voice interaction and its potential for UX design. Just as smartphones evolved from a UX playground to user-friendly designs, we expect voice-enabled devices to do the same. With its seemingly limitless possibilities, voice interaction could redefine UX from the ground up through innovative and user-centric design.

2017: The year of voice

This year's CES brought us a great many announcements about new product launches. From a cat feeder that prevents overeating by recognizing the cat's microchip to Qualcomm's new Snapdragon 835 chip, the products announced at CES will surely fill the tech news for months to come. Besides dozens of exciting product announcements, CES 2017 also revealed a new trend that could redefine the way we interact with our electronic devices: voice interaction.

During his speech at CES, Shawn DuBravac, Chief Economist of the Consumer Technology Association, stated that this year marks an inflection point in computers' ability to translate speech to text. In 1994, when these experiments first started, the error rate of such translation was roughly 100 percent. By 2013, the error rate had dropped to 23 percent. This year, it is expected to drop as low as 6 percent.

Leading the trend in the field of voice interaction are Apple's Siri and Google Now, but from what we saw at CES this year, the leaderboard is about to change. After stealing the show last year, Amazon's voice-activated assistant Alexa was everywhere you looked. And according to the latest news from Amazon, it will not take long for Alexa to become omnipresent. While Siri and Google Now are already embedded in smartphones, Amazon's Echo (Alexa) takes a different approach by offering voice interaction on stationary devices, from Lenovo smart devices to Whirlpool's appliances.

And cars!

Tesla's Model S has a built-in voice command feature that combines Google's voice recognition with its search and maps features. This useful feature is often completely overlooked, mainly because most users intuitively end up using the massive touch screen. We could say that people are visual creatures or that old habits die hard. But using a touch screen to set up navigation while driving is much more complicated (and dangerous) than simply saying “Drive home”. And Tesla is not the only one experimenting with voice. Several vehicle manufacturers already offer vehicles with fully integrated communication and entertainment systems. Ford and Fiat go a step further and, like Tesla, offer voice recognition for most of the systems in their cars, which are controlled through a complex set of voice commands.

One voice to rule them all

Although each is marketed as completely different from, and better than, the others, all three of the aforementioned solutions, along with Microsoft's Cortana, work in a similar way. They all operate in stand-by, patiently waiting for the user to say the “activation” phrase, after which they execute the user's command, such as playing music, setting a timer, or finding information about something. For instance, if you ask each of them to do a rather simple task such as “Send an email to John Doe”, the results would be as follows:

Siri and Google Now will perform the task without any issues. They will recognize the name from your contacts, let you dictate your message, and send it out. The minor difference is that Siri lets you specify the subject line. The major difference is the obvious one: Siri works only with Apple Mail, and Google Now only with Gmail.

Cortana manages to do the task, but not as smoothly as Siri and Google Now. It still has some trouble with misheard words, for example asking whether you want to text someone instead of emailing them.

Alexa could not even understand the question.

The above example is just one among many tests conducted on the subject, and most of them clearly indicate that Alexa is not the brightest of the group. That is surprising if you consider that Echo (Alexa) came to life years after Siri and Google Now. Besides being somewhat “slow”, Alexa's biggest disadvantage is the lack of a screen that could show visual output such as search results. Echo has a smartphone app, but it is focused only on adjusting some settings and functions, while the whole interaction takes place on the device itself. Reading this has probably got you thinking that Alexa is one big step backwards. So why would anyone pay for a device that works worse than the ones they already have?

The answer lies in error prevention. Alexa's design follows the principle that it is better to prevent errors from occurring than to just help users recover from them. In line with the improved speech recognition results mentioned above, errors in understanding natural language have dropped significantly, and there is no reason why this positive trend would not continue in the following years. Still, there is one very common type of error with smartphone voice interaction: completely failing to detect the “activation” phrase in noisy environments.

Because of this, Siri often finds it difficult to understand voice commands when there is sound in the background, whether it is plain noise or music. Siri also fails if the device is tucked in a pocket or not within reach. In contrast to Siri, Echo's only priority is voice interaction. With its seven microphones and a strong emphasis on distinguishing voice from background noise, Alexa will reply to you even if it is on the opposite side of the room.

The potential of voice

When a new technology becomes available, some people want to start from scratch and reinvent the wheel. Removing the visual display is just one example, as it changes the entire interaction experience, and we have to ask ourselves: does the shift from visual to voice mean that all the rules have changed?


There are no images that would help articulate actions more clearly. There are no animations that would explain complex actions more easily. There is nothing to click. Think about it. One of the most fundamental UX elements of the Internet, the hyperlink, is no longer there.

And, as the popularity of voice interaction continues to rise, UX professionals are quickly starting to realize that words are becoming more important than ever. With no visual cues to guide the user, words are the only thing by which the user will evaluate and rate the overall customer experience. As designers will have to rely 100 percent on words and phrases, there will be a strong need for a standardized set of words and phrases that allows users to move seamlessly between different voice systems. Needless to say, memorizing a different set of commands for each voice system is something that most users will not want to do.

One of the key concerns for UX experts who lead the transition from visual to voice will be the need for constant interpretation of actions between the two interfaces. Without a mouse to perform actions, they will have to anticipate the user's intent at every step of the process and provide a meaningful response. For instance, simply saying “Delete it” can be a valid voice command in different voice systems, but the consequences of such an action can be completely different in each of them. As there are infinite ways to say something, UX designers will have to be sure they are asking the right questions in order to get the right verbal response from the user.

In line with the above, designers can start designing for a finite set of possibilities that are most likely to follow from their use case. And that's the key: to define the reason why the user performs the interaction in the first place. Some of the use cases with the greatest potential involve people with certain disabilities, such as not being able to use a mouse or a keyboard, and users performing complex tasks, such as driving special vehicles or operating large equipment.
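
To make the idea of a finite set of possibilities more concrete, here is a minimal sketch in Python. It is not taken from any real voice platform: the intent names, the sample phrases, and the `respond` helper are all assumptions made up for illustration. It shows a small, fixed vocabulary mapped to intents, and a confirmation question for a destructive command such as “Delete it”, whose consequences depend on context.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, deliberately small vocabulary: each intent lists the
# phrases we expect users to say. A real system would use a trained
# language-understanding model instead of substring matching.
INTENT_PHRASES = {
    "delete_item": ("delete it", "remove it", "get rid of it"),
    "send_email": ("send an email", "send a mail"),
    "set_timer": ("set a timer", "start a timer"),
}

# Intents whose consequences are hard to undo get a confirmation step.
DESTRUCTIVE_INTENTS = {"delete_item"}


@dataclass
class Response:
    intent: Optional[str]
    prompt: str
    needs_confirmation: bool = False


def match_intent(utterance: str) -> Optional[str]:
    """Return the first intent whose sample phrase occurs in the utterance."""
    text = utterance.lower().strip()
    for intent, phrases in INTENT_PHRASES.items():
        if any(phrase in text for phrase in phrases):
            return intent
    return None


def respond(utterance: str, current_item: Optional[str] = None) -> Response:
    """Choose a spoken reply; ask a question whenever the intent is unclear or risky."""
    intent = match_intent(utterance)
    if intent is None:
        # Unknown command: remind the user of the supported vocabulary.
        return Response(None, "Sorry, you can say: delete it, send an email, or set a timer.")
    if intent in DESTRUCTIVE_INTENTS:
        if current_item is None:
            # "It" has no referent, so ask instead of guessing.
            return Response(intent, "What would you like to delete?", True)
        return Response(intent, f"Do you want to delete {current_item}?", True)
    return Response(intent, f"OK, starting: {intent.replace('_', ' ')}.")
```

With this sketch, `respond("Delete it", current_item="the draft to John")` asks for confirmation before anything is removed, while `respond("Delete it")` with no context asks what “it” refers to. Keeping the vocabulary small is the design choice here: fewer phrases are easier for users to remember and easier for the system to recognize.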

There is also the question of privacy. Switching to voice interaction may open a completely new attack surface for malicious actors. Most devices that offer voice interaction store and remember user credentials. If our voice is the key to this new kingdom, we have to protect it in order to prevent someone from recording it and using it to synthesize commands we never gave. These are just some of the obvious privacy concerns that UX experts need to address, but as the popularity of voice continues to rise, so will the need to protect voice-enabled devices from unwanted access.

But the biggest difficulty of all is social awkwardness. Even if you set aside how accurate voice recognition has become, and the fact that several voice AIs are already in use, there is still a great deal of awkwardness in using them in public. Just as with the small earphones that made people look like they were talking to themselves, the awkwardness of using voice interaction will probably fade over time. But for now, the problem is still here and UX designers must take it into account when designing for voice interaction. One way of dealing with it is to come up with the shortest dialogue flow possible. This approach can sometimes have a negative impact on accuracy, so a tradeoff between duration and accuracy must be found.


As we said at the beginning of this post, voice is the next big thing for UX design and we expect that its popularity will grow rapidly in the years to come.

Most of the UX paradigms that have been providing a shared basis for developing new ideas can't even be applied to voice. This means that UX experts need to start embracing and refining this technology. As part of achieving this goal, they need to carefully shape the vocabulary used in voice interactions and try to anticipate the user's intent at each and every step of the process.

Once the dust settles, it will be up to UX experts to strengthen user engagement by personalizing and improving the way AI voice assistants respond to commands. The first steps will be to make the whole interaction process clearer and more understandable. Once we manage to achieve that, the focus will probably shift to creating complete branded personas. But, as we have learned from currently available voice AIs, as long as there is a clear distinction between a robot and a human, people can really enjoy interacting with their artificial companion.
