VoiceXML opens up the Web to telephony

Johannesburg, 09 Jan 2002

VoiceXML, or voice extensible mark-up language, allows users to interact with the Internet through voice recognition technology. Shaun Cochrane, Intelleca GM, provides an overview of the technology and its application.

VoiceXML is a mark-up language used for developing voice user interfaces, generally for the telephone. It allows people with any type of telephone to access the Internet to retrieve and send e-mail, check sports scores, make reservations, and more.

It can also support natural language through automatic speech recognition (ASR), so a user is not locked into a limited script, but can communicate through natural speech. In what is called a "conversational" mode, a user can even interrupt the system with an out-of-context question and thus redirect the session. The goal is to make the exchange as natural as possible, as if two people were interacting.

ASR and dual tone multi-frequency (DTMF, the system used by touch-tone telephones) are used for input, while pre-recorded audio files and text-to-speech synthesis (TTS) are used for output.
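As a rough illustration of how these input and output modes can appear in mark-up, the Python sketch below embeds a small VoiceXML 1.0-style page and lists which of its elements accept input and which produce output. The page itself, the file names (welcome.wav, city.gram, city-keys.gram) and the submit URL are hypothetical, and the element names follow the VoiceXML 1.0 specification only as understood here; individual platforms may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical VoiceXML 1.0-style page. File names and the URL are made up.
# The text inside <audio> is intended as a TTS fallback if the recording
# cannot be played.
VXML_PAGE = """<vxml version="1.0">
  <form id="weather">
    <field name="city">
      <prompt>
        <audio src="welcome.wav">Welcome to the weather line.</audio>
        Say a city name, or key in a city code.
      </prompt>
      <grammar src="city.gram"/>
      <dtmf src="city-keys.gram"/>
      <filled>
        <submit next="http://example.com/weather"/>
      </filled>
    </field>
  </form>
</vxml>"""

# Labels used by this sketch only; they are not part of VoiceXML.
INPUT_TAGS = {
    "grammar": "speech grammar (ASR)",
    "dtmf": "touch-tone grammar (DTMF)",
}
OUTPUT_TAGS = {
    "prompt": "text-to-speech prompt (TTS)",
    "audio": "pre-recorded audio",
}


def classify_io(vxml_text):
    """Return (inputs, outputs) found in the page, in document order."""
    root = ET.fromstring(vxml_text)
    inputs, outputs = [], []
    for elem in root.iter():
        if elem.tag in INPUT_TAGS:
            inputs.append(INPUT_TAGS[elem.tag])
        elif elem.tag in OUTPUT_TAGS:
            outputs.append(OUTPUT_TAGS[elem.tag])
    return inputs, outputs


if __name__ == "__main__":
    found_inputs, found_outputs = classify_io(VXML_PAGE)
    print("Input: ", found_inputs)
    print("Output:", found_outputs)
```

The sketch only inspects the mark-up; it does not perform any recognition or playback, which remain the job of the telephony platform.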

VoiceXML is based on the World Wide Web Consortium's (W3C) extensible mark-up language (XML). VoiceXML 1.0 was created through a collaboration of AT&T, IBM, Lucent Technologies and Motorola, each of which was working on its own approach but joined forces to create an open standard.

Using XML, a Web programmer can enable voice recognition through the addition of a few simple tags. The VoiceXML "parser" runs on a telephony platform. It provides the telephony gateway to the Web, thus opening it up to any person with a telephone.

Platform vendors, application developers and service providers all stand to benefit from the development of this common language, as it ensures portability. Because it uses familiar Web infrastructure, including Web servers and tools, VoiceXML has greatly simplified speech recognition application development.

Two main areas have been identified where VoiceXML will be used extensively:

1. As a way to voice-enable a Web site.

2. As a solution to building next-generation interactive voice response (IVR) telephone services.

Voice portals are becoming increasingly popular. A voice portal is a telephone service where callers dial a phone number to retrieve information such as weather reports, TV guides and flight information.

Voice portals demonstrate the power and flexibility of speech recognition-based telephone services. However, they are not the only application for VoiceXML. The technology is also used for voice-enabled intranets, contact centres, notification services, unified messaging and v-commerce (voice commerce) solutions.

Because the application logic, which runs on a Web server, is separate from the voice dialogues, which run on a telephony server, developers can build phone services without having to buy or run telephony equipment themselves.

VoiceXML features

The open architecture and high-level common interfaces with the Web's many computing resources paved the way for the rapid growth of VoiceXML. HTML (hyper text mark-up language) and HTTP (hyper text transfer protocol) hide the complexity of building interactive applications. Just as a Web developer does not need to understand how all the HTML coding is eventually interpreted into a page on a browser, VoiceXML eliminates the need for developers to understand the many complexities of telephony platforms.

It has features that control audio output, presentation logic and control flow, event handling, and basic telephony connections. Beyond the scope of the language are application logic, state management, dialogue generation and sequencing, database operations, and interfaces to legacy systems (such as screen scraping). These are all handled by traditional Web application programming techniques.
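To make this division of labour concrete, the following Python sketch shows application logic on the Web server side building a VoiceXML page on the fly, much as a Web application would build an HTML page. It is a minimal sketch under stated assumptions: the FORECASTS table stands in for a database or legacy system, and the function name render_forecast_page and the forecast data are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Stand-in for a database or legacy-system lookup (hypothetical data).
FORECASTS = {
    "johannesburg": "sunny, 26 degrees",
    "cape town": "windy, 22 degrees",
}


def render_forecast_page(city):
    """Build a VoiceXML page that reads out the forecast for `city`.

    The data lookup (application logic) happens here, on the Web server
    side; the returned mark-up only describes what the caller will hear.
    """
    forecast = FORECASTS.get(city.lower())

    vxml = ET.Element("vxml", version="1.0")
    form = ET.SubElement(vxml, "form")
    block = ET.SubElement(form, "block")
    prompt = ET.SubElement(block, "prompt")
    if forecast is None:
        prompt.text = f"Sorry, there is no forecast for {city}."
    else:
        prompt.text = f"The forecast for {city} is {forecast}."

    # Serialise to a string, as a Web application would do for its HTTP response.
    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    print(render_forecast_page("Johannesburg"))
```

In a real service the string returned here would be sent as the body of an HTTP response to the telephony server, which would then speak the prompt to the caller.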

Architecture

A VoiceXML application consists of the following components:

1. Application server - typically a Web server, which runs the application logic, and may contain a database and interfaces to an external database or transaction server.

2. VoiceXML telephony server - a platform which runs a VoiceXML parser that acts as a client to the application server. The parser understands the VoiceXML dialogues and controls speech and telephony resources (ASR, TTS, audio play and record functions, telephone network interfaces).

3. Internet-style network - a TCP/IP-based packet network that connects the application server and telephony server via HTTP.

4. Telephone network - the public switched telephone network (PSTN), a private telephone network such as a PBX (private branch exchange), or a VOIP (voice over Internet protocol) packet network.

5. The caller - any telephone that can connect to the telephone network.

How it works

The caller dials the voice portal telephone number, and the call is routed to the VoiceXML telephony server. The appropriate page is retrieved via HTTP from the Web server and passed to the VoiceXML parser, which then begins controlling the ASR, TTS and audio playback. Once the caller speaks a valid command, the ASR interprets it and the VoiceXML telephony server sends a new HTTP request to the Web server.
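The call flow described above can be sketched in a few lines of Python. This is a simplified simulation, not a real VoiceXML platform: the PAGES dictionary stands in for the Web server, fetch() stands in for an HTTP request, the caller's speech is supplied as plain text instead of being recognised by an ASR engine, and all URLs and page contents are hypothetical.

```python
import xml.etree.ElementTree as ET

# Stand-in for the application server: URL -> VoiceXML page.
# A real telephony server would fetch these over HTTP.
PAGES = {
    "/main": """<vxml version="1.0">
  <menu>
    <prompt>Say weather or flights.</prompt>
    <choice next="/weather">weather</choice>
    <choice next="/flights">flights</choice>
  </menu>
</vxml>""",
    "/weather": """<vxml version="1.0">
  <form><block><prompt>Today is sunny.</prompt></block></form>
</vxml>""",
    "/flights": """<vxml version="1.0">
  <form><block><prompt>All flights are on time.</prompt></block></form>
</vxml>""",
}


def fetch(url):
    """Pretend HTTP GET against the application server."""
    return PAGES[url]


def handle_call(start_url, utterances):
    """Simulate one call and return a log of the telephony server's actions."""
    log = []
    spoken = iter(utterances)
    url = start_url
    while url is not None:
        log.append(f"GET {url}")
        root = ET.fromstring(fetch(url))

        # Output: play every prompt on the page (TTS or audio on a real platform).
        for prompt in root.iter("prompt"):
            log.append(f"PLAY {prompt.text.strip()}")

        # Input: map each menu choice to the page it leads to.
        choices = {c.text.strip().lower(): c.get("next") for c in root.iter("choice")}
        if not choices:
            break  # no menu on this page, so the call ends here

        heard = next(spoken, None)
        if heard is None:
            break  # caller said nothing more
        log.append(f"HEARD {heard}")

        # A real platform would re-prompt on a mismatch; this sketch just stops.
        url = choices.get(heard.strip().lower())
    return log


if __name__ == "__main__":
    for line in handle_call("/main", ["weather"]):
        print(line)
```

Run as a script, this prints the two page requests with the prompts and the recognised command between them, mirroring the sequence of fetch, play, listen and fetch again.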

VoiceXML is a powerful yet relatively simple language for building voice services. The ability to build on Web architecture, tools and technology will allow developers to focus on new, exciting and personalised telephone applications. New language and call features currently under development - and enhancements in ASR and TTS - hold the promise of an even richer, voice-enabled Web environment.
