Introduction to telephony speech technologies

Innovative speech solutions, which offer a natural, friendly, conversational and interactive user experience, are enabling new levels of competitive differentiation by taking advantage of advances in telephony and Internet convergence.
By Shaun Cochrane, Executive director, Intelleca Voice & Mobile.
Johannesburg, 30 Jul 2004

Sophisticated telephone solutions are using standards such as VoiceXML and speech application language tags, coupled with automatic speech recognition, text-to-speech technology and speaker verification technology to offer unprecedented levels of customer service.

In this Industry Insight, the first in a series focusing on developments in speech technology, I will look at the definitions of some of the components of speech solutions, namely VoiceXML (VXML), automatic speech recognition (ASR), text-to-speech (TTS), speaker verification (SV) and speech application language tags (SALT).

VoiceXML is an application of the extensible markup language (XML) that describes the interaction between a caller and a server. Just as HTML is the Internet's industry standard, VoiceXML is the dominant standard for converged telephony services. VoiceXML enjoys the support of over 130 companies, and the VoiceXML 2.0 specification has been accepted by the W3C (World Wide Web Consortium) as part of its voice browser recommendation. VoiceXML media gateways allow telephony services to interpret 'pages' of VoiceXML, much as a Web browser such as Internet Explorer interprets HTML pages. Companies can therefore leverage existing services available via the Web and, at the same time, keep the telephony channel and Internet services in sync.

When combined with complementary voice technologies (ASR, TTS and SV), VoiceXML allows for complex automated services, all accessible via a telephone. In essence, your phone and voice become your 'mouse'.

VoiceXML supports natural language interactions, which means that the user is not locked into a limited script, but can speak naturally. In what is called a 'modeless' or 'conversational' mode, the user can interrupt the system with an out-of-context question and thus redirect the session. The goal is to make the exchange as natural as possible. For example, a user can pick up a phone, dial a speech banking application, and ask to make a transfer from a current account to a savings account. The voice request activates a database query, and the query result is then converted back to a voice message that gives the user the information requested.
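To make the idea of a VoiceXML 'page' more concrete, the following is a minimal sketch in Python of how a Web application might generate one for the transfer example above. The element names (`vxml`, `form`, `field`, `prompt`, `grammar`, `block`, `submit`) come from the VoiceXML 2.0 specification; the field names, prompt wording, grammar file names and submit URL are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"

# Hypothetical fields for a transfer dialogue: (field name, question to ask).
FIELDS = (
    ("from_account", "Which account would you like to transfer from?"),
    ("to_account", "Which account would you like to transfer to?"),
    ("amount", "How much would you like to transfer?"),
)


def build_transfer_page(submit_url: str) -> str:
    """Return a minimal VoiceXML 2.0 page that collects a transfer request."""
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form", {"id": "transfer"})

    for name, question in FIELDS:
        field = ET.SubElement(form, "field", {"name": name})
        prompt = ET.SubElement(field, "prompt")
        prompt.text = question
        # Each field points at a speech grammar; these file names are assumed.
        ET.SubElement(
            field,
            "grammar",
            {"src": f"{name}.grxml", "type": "application/srgs+xml"},
        )

    # After the fields, a block posts the collected values to the Web
    # application, which can run the query and return the next page.
    block = ET.SubElement(form, "block")
    namelist = " ".join(name for name, _ in FIELDS)
    ET.SubElement(block, "submit", {"next": submit_url, "namelist": namelist})

    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    print(build_transfer_page("http://example.com/transfer"))
```

The page is served over HTTP like any Web page; the media gateway, rather than a browser, fetches it, plays the prompts and collects the caller's answers.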

Vertical market solution

ASR expands a company's automated customer service capabilities by allowing callers to get information quickly and perform complex transactions by using their voices to direct the system. ASR is being used in a wide range of vertical markets including financial services, healthcare, travel and tourism, entertainment and government. ASR is a technology that allows users of automated telephony systems to choose from a list of predefined options by speaking their entries rather than having to punch numbers on a keypad.

ASR has become popular and is replacing DTMF in the customer service departments of large companies, and is also being used by government agencies. Basic ASR systems recognise single-word entries, such as yes-or-no responses and small word lists. This makes it possible for people to work their way through automated menus without having to enter dozens of numbers manually while attempting to navigate the typical maze of DTMF IVR menus.
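As a rough illustration of how a basic, small-word-list menu relates to its DTMF equivalent, the sketch below maps each menu option to both a spoken word and a key press. It assumes the recogniser has already turned the caller's speech into text; the menu options and the function name `resolve_option` are hypothetical and not taken from any particular product.

```python
from typing import Optional

# Hypothetical main menu: each option can be chosen by saying the word
# or by pressing the corresponding DTMF key.
MENU = {
    "balance": "1",
    "transfer": "2",
    "statement": "3",
}


def resolve_option(caller_input: str) -> Optional[str]:
    """Map a recognised word or a DTMF digit to a menu option.

    Returns None when the input matches nothing in the predefined list,
    in which case a real system would typically re-prompt the caller.
    """
    text = caller_input.strip().lower()
    if text in MENU:
        return text
    for option, digit in MENU.items():
        if text == digit:
            return option
    return None


if __name__ == "__main__":
    for sample in ("Transfer", "2", "weather"):
        print(sample, "->", resolve_option(sample))
```

Because the spoken word list is small and fixed, the recogniser only has to tell a handful of words apart, which is what makes this basic form of ASR practical.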

Take the example of a caller logging on to a DTMF banking solution and scheduling a transfer: the average number of key entries is 30. A speech banking solution would achieve the same transaction using at most seven spoken phrases. Callers using DTMF services also tend to get lost in the maze of menus, and simply hang up and start again.

Sophisticated ASR systems allow the user to enter direct queries or responses, such as a request for driving directions or the telephone number of a hotel in a particular town. This shortens the menu navigation process by reducing the number of decision points. It also reduces the number of instructions that the user must receive and interpret.

For institutions that rely heavily on customer service, such as airlines and financial institutions, ASR makes it possible to increase the usage of automated services, resulting in a reduced call load on call centre agents. As a result of the reduced call load, agents are empowered to provide enhanced customer service and focus on complex interactions, such as enhanced problem solving, sales or customer retention.

Audio output

TTS software takes text and processes it into audio output. The primary use of TTS is to provide information - over the telephone - that is too voluminous, rapidly changing or unpredictable for pre-recorded voice recordings. TTS can enable the reading of computer display information for a visually challenged person, or it may simply be used to augment the reading of a text message. Current TTS applications include voice-enabled e-mail or the reading of driving directions from a database to a caller. With constant improvements in the natural 'sound' of TTS engines, we are beginning to see organisations use TTS for all their IVR voice prompts.

SV verifies a person's identity through their unique voice input. SV is deployed as a verification solution, typically controlling access to sensitive or private information. In the past, telephone verification was achieved by typing in alphanumeric sequences (account number and PIN); in contrast, SV relies on the unique biometric characteristics of a person's voice, and it is therefore more secure than using PINs and account numbers alone for user authentication.

By comparing a caller's voice print with the reference voice print associated with the claimed account, SV software can verify if the caller is authorised to access desired information.

SV provides a dual layer of user authentication, requiring both knowledge of the correct pass-phrase and a matching voice print, rather than just a PIN or password, which can be stolen or guessed.
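The dual-layer check can be sketched in a few lines of Python. This is only an illustration of the logic, under strong simplifying assumptions: a voice print is represented here as a short list of numbers compared by cosine similarity, and the 0.85 threshold is arbitrary. Commercial SV engines use far richer statistical models, and all names below are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine similarity of two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("voice prints must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def verify_caller(
    spoken_phrase: str,
    expected_phrase: str,
    caller_print: Sequence[float],
    reference_print: Sequence[float],
    threshold: float = 0.85,  # assumed value, for illustration only
) -> bool:
    """Accept the caller only if BOTH layers pass.

    Layer 1: the recognised pass-phrase matches the one on record.
    Layer 2: the caller's voice print is close enough to the reference
             voice print stored for the claimed account.
    """
    phrase_ok = spoken_phrase.strip().lower() == expected_phrase.strip().lower()
    voice_ok = cosine_similarity(caller_print, reference_print) >= threshold
    return phrase_ok and voice_ok


if __name__ == "__main__":
    reference = [0.2, 0.8, 0.5]
    print(verify_caller("My voice is my password", "my voice is my password",
                        [0.21, 0.79, 0.52], reference))
```

The point of the sketch is the final `and`: someone who has overheard the pass-phrase still fails the voice print comparison, and a matching voice without the correct pass-phrase fails as well.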

There is a growing interest in SV solutions due to greater security concerns and legislation (such as FICA). Healthcare organisations, for example, have implemented new security and privacy standards such as "entity authentication" in order to protect personal health information. SV technology has also been beneficial to call centres that can now automate caller authentication, PIN reset, and other labour-intensive processes, resulting in a compelling return on investment.

Ask and you will receive

SALT concentrates on multimodal communication, or the ability to ask for information over a cellphone, PDA, or other handheld device, and obtain a text response. Multimodal access enables users to interact with an application in a variety of ways: they will be able to input data using speech, a keyboard, keypad, mouse and/or stylus, and receive output as synthesised speech, audio, plain text, motion video, and/or graphics.

Each of these modes can be used independently or concurrently. The full specification for SALT is being developed by the SALT Forum, an open industry initiative committed to developing a royalty-free, platform-independent standard.

In the next Industry Insight, I will look at the differences between and benefits of open and proprietary speech technology platforms.
