
Building successful speech applications is a labour of love


Johannesburg, 25 Oct 2005

With costs that are difficult to control, call centres worldwide are increasingly relying on technologies such as interactive voice response and speech and language recognition to contain expenditure while maintaining high levels of customer service. This is the word from Dr Bill Scholz, Architect Director: Business & Strategic Account Development, Global Communications & Media, Unisys Corporation.

The design and deployment of a high-quality speech application presents a unique challenge, requiring not only traditional systems analysis and software development skills but also the specialised skills of the speech scientist, human factors expert and business process analyst.

The process can be initiated much like any traditional software development as defined by the Software Engineering Institute, which has as a minimum the following activities:

* Identification of business needs and constraints;
* Elicitation and collection of requirements;
* Architecture design;
* Detailed design;
* Implementation;
* Testing;
* Deployment; and
* Maintenance.

But the trio of activities - architecture design, detailed design and implementation - which has served as the core of application development for a generation, is insufficient for producing a speech application. Experience has shown that successful speech application development requires additional skills typically offered by non-traditional contributors such as speech scientists, linguists, human factors experts and business analysts.

Speech scientists and linguists address a speech application's requirement for audio prompts expressed in the user's colloquial native language, and manage its requirement for spoken language recognition. Human factors expertise guides the process of dialogue design. Business process analysis clarifies the integration between the voice user interface and the back-end business activity at the application's core.

Voice user interface - persona, style, and new versus repeat users

Speech application users form an impression of an application just as you form an impression of the person with whom you are speaking during a conversation. Users impute human-like attitudes and behaviours to it, which makes selecting the right persona an essential goal for a speech application designer: prompt content is tailored through careful choice of gender and age, as well as a general demeanour appropriate to the application's domain.

The designer must also select the appropriate dialogue style of the application. The style of an application designed for repeat users differs sharply from applications built for new users or infrequent repeat users. Repeat user applications typically have terse, abbreviated prompts often containing domain-specific jargon. Grammars are designed to permit users to speak using the same domain-specific jargon, and built-in help is minimal or absent altogether. Users who want to conduct specific business rapidly and efficiently see all of these attributes as beneficial. By contrast, applications designed for new users have carefully worded, unambiguous prompts, which avoid esoteric or domain-specific terminology wherever possible, and help is included.
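The contrast between styles can be sketched in code. The following is an illustrative example only; the prompt texts, the `repeat_caller` flag and the step names are invented for this sketch, not drawn from any particular product.

```python
# Hypothetical sketch: selecting a dialogue style per caller type.
PROMPTS = {
    "balance": {
        # Terse prompt with domain jargon, aimed at experienced repeat callers.
        "repeat": "Account?",
        # Carefully worded, unambiguous prompt for new or infrequent callers.
        "new": ("Please say the number of the account whose balance you "
                "would like to hear. You can say 'help' at any time."),
    },
}

def select_prompt(step: str, repeat_caller: bool) -> str:
    """Return the prompt text appropriate to the caller's experience level."""
    style = "repeat" if repeat_caller else "new"
    return PROMPTS[step][style]
```

A real application would also vary grammar coverage and help behaviour by caller type, not just prompt wording.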

Prompt creation

Both art and mechanics are involved in the preparation of prompts. Once designers master the art of prompt composition, they are faced with the mechanical task of decomposing prompts into variable versus constant components and choosing whether to render prompts using text-to-speech (TTS) or audio recordings.

The prompt decomposition mechanics are not independent of the application design since it is the application's responsibility to reassemble complete prompts from fragments, where some fragments are only defined at runtime and must be retrieved from a database or back-end system. Developer tools to manage the mechanics of prompt construction permit decomposition of the prompt into phrases, fragments or words that are reassembled at runtime. TTS technology has matured to the point where it can be used as the sole mechanism for prompt generation, or used just for dynamic information, and blended seamlessly with pre-recorded static prompts. Application logic controls the assembly of the fragments.
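The decomposition into constant and variable fragments can be sketched as follows. This is a minimal illustration, assuming static fragments map to pre-recorded audio and variable fragments (marked `{name}` in a template) are rendered with TTS at runtime; the template syntax is an assumption of this sketch.

```python
import re

def assemble_prompt(template: str, values: dict) -> list:
    """Split a prompt template into constant and variable fragments.

    Constant fragments would be played from pre-recorded audio files;
    variable fragments, known only at runtime, would be rendered via TTS.
    """
    parts = []
    for token in re.split(r"(\{[a-z_]+\})", template):
        if not token:
            continue
        if token.startswith("{") and token.endswith("}"):
            # Dynamic fragment: value retrieved at runtime, rendered via TTS.
            parts.append(("tts", str(values[token[1:-1]])))
        else:
            # Static fragment: played from a pre-recorded recording.
            parts.append(("audio", token))
    return parts

fragments = assemble_prompt(
    "Your balance is {amount} rand as of {date}.",
    {"amount": 1250, "date": "25 October"},
)
```

The application logic would then walk the fragment list in order, playing recordings and synthesised audio back-to-back so the caller hears one seamless prompt.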

Grammars versus SLMs

Most older speech applications use grammar-constrained recognisers, which are pre-programmed with the possible human responses; users cannot respond outside the limits set by the programmer. Statistical language models (SLMs) are a newer approach that builds a model of possible responses from actual human utterances, which means they offer far greater variability in response recognition.

The downside is that SLMs are expensive to generate, since thousands of recorded responses are required to build one.
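The contrast can be illustrated with a toy comparison. This is not a production recogniser; the phrases and the simple unigram scoring are invented for illustration, and a real SLM would be trained on thousands of transcribed utterances with far more sophisticated modelling.

```python
from collections import Counter

# Grammar-constrained: the programmer enumerates every legal response.
GRAMMAR = {"check balance", "transfer funds", "speak to an agent"}

def grammar_recognise(utterance: str):
    """Return the utterance if it is in the grammar, else None (rejection)."""
    return utterance if utterance in GRAMMAR else None

# Toy statistical model: word frequencies collected from observed caller
# responses (here only a handful, for illustration).
observed = ["what is my balance", "my balance please", "balance check"]
counts = Counter(word for utt in observed for word in utt.split())
total = sum(counts.values())

def slm_score(utterance: str) -> float:
    """Average unigram probability: any wording gets a score, not a rejection."""
    words = utterance.split()
    return sum(counts[w] / total for w in words) / len(words)
```

The grammar matcher rejects anything outside its pre-programmed set, while the statistical model assigns every utterance a likelihood, which is why SLM-based systems tolerate more varied phrasing.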

Back-end integration

Nearly all speech applications seek to link the user or customer to some form of back-end system, such as a database, legacy application, or custom collection of business rules. The Internet client-server architecture supports a number of methods for achieving this, offering organisations seeking to implement speech applications a flexible approach to their unique systems architecture.
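The shape of such an integration can be sketched as a thin service boundary between the voice layer and the business logic. Everything here is an illustrative assumption: the account data, function names and prompt wording are invented, and in a real deployment the back-end call would go over HTTP or middleware to a database or legacy system.

```python
# Stand-in for a legacy database or business-rules service.
ACCOUNTS = {"1001": {"balance": 2500.00}}

def backend_get_balance(account_id: str) -> float:
    """In a real deployment this would be a remote call to a back-end system."""
    return ACCOUNTS[account_id]["balance"]

def handle_balance_request(account_id: str) -> str:
    """Voice layer: turn a back-end result into a spoken prompt."""
    balance = backend_get_balance(account_id)
    return f"Your balance is {balance:.2f} rand."
```

Keeping the boundary narrow like this lets the voice user interface and the back-end evolve independently, which is the flexibility the client-server architecture offers.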

Debugging environment

Testing and debugging a speech application each have their own unique complexities. Two important methods for debugging speech applications are Usability Analysis and Wizard of Oz (WOZ) testing.

Usability Analysis evaluates the usability of the voice user interface, with its primary focus being to verify that the prompts elicit the desired understanding in users, and that whatever words or phrases they use to reply are understood by the computer. Usability Analysis can be time-consuming and costly if a working prototype of an application must be deployed to perform the tests, particularly if the result is the need to change the call flow, prompts or grammars.

WOZ testing sees the application designer or developer serving as an application "wizard", listening live to a caller's utterances through a computer and using mouse clicks to steer the application through a graphic rendering of the call flow. WOZ is less costly than usability testing and can be used to collect usability data prior to committing time and resource to implementing call flow and building the necessary recogniser grammars.
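The essence of a WOZ harness is a call-flow graph whose transitions are chosen by a human instead of a recogniser. The sketch below makes that concrete; the state names and graph are hypothetical, and a real harness would also play prompts and log the caller's utterances for later grammar design.

```python
# Call flow as a graph: each state lists the states the wizard may click to.
CALL_FLOW = {
    "greeting": ["main_menu"],
    "main_menu": ["balance", "transfer", "goodbye"],
    "balance": ["main_menu", "goodbye"],
    "transfer": ["main_menu", "goodbye"],
    "goodbye": [],
}

def wizard_step(state: str, choice: str) -> str:
    """Advance to the state the wizard clicked, standing in for the
    recogniser that has not yet been built."""
    if choice not in CALL_FLOW[state]:
        raise ValueError(f"{choice!r} is not reachable from {state!r}")
    return choice
```

Because no grammars or recognition code exist yet, the whole dialogue design can be exercised with real callers before any recogniser work begins, which is exactly where WOZ saves cost.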

The development and deployment of a successful speech application requires part science, part art and part focus on software development fundamentals. To borrow from Mike Cohen, now that we've overcome scepticism about the capabilities of the technology, our primary focus has become the design of the voice user interface, with emphasis on conversational context, rigorous methodology and close interaction with the growing community of practitioners.

* Michael M Cohen is a research associate in the Program in Experimental Psychology at the University of California - Santa Cruz. His research interests include speech perception and production, speech reading, information integration, learning and computer facial animation.


Unisys

Unisys is a worldwide information technology services and solutions company. Our people combine expertise in consulting, systems integration, outsourcing, infrastructure and server technology with precision thinking and relentless execution to help clients, in more than 100 countries, quickly and efficiently achieve competitive advantage. For more information, visit www.unisys.co.za.

Editorial contacts

Melanie Spencer
Predictive Communications
(011) 608 1700
melanie@predictive.co.za