Examine the technology behind the W3C VoiceXML standard and get ahead of the curve

Tags: TELECOMMUNICATIONS, W3C, Peter V. Mikhalenko, VoiceXML document, voice service, VoiceXML

Takeaway: This article examines the future of the VoiceXML language standard being developed by the W3C working group.

In the recent articles about CCXML and SCXML, we talked about voice applications and their logic, and often referred to the VoiceXML language. It is still the backbone of voice applications and dialog systems, and in this article I'd like to talk about it and its future. In 2000, the VoiceXML Forum (formed by AT&T, IBM, Lucent, and Motorola) released VoiceXML 1.0 to the public. Shortly thereafter, VoiceXML 1.0 was submitted to the W3C as the basis for the creation of a new international standard. VoiceXML 2.0, developed by the W3C Voice Browser working group, is the result of this work based on input from W3C Member companies.

What is it for


Originally it was designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications.

The top-level element is <vxml>, which is mainly a container for dialogs. There are two types of dialogs: forms and menus. Forms present information and gather input; menus offer choices of what to do next. Listing A is the simplest possible application, with a single form and no successor dialog (hello-world.vxml.txt):

Listing A


<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd"
   version="2.0">
  <form>
    <block>Hello World!</block>
  </form>
</vxml>

VoiceXML's main goal is to bring the full power of Web development and content delivery to voice response applications, and to free the authors of such applications from low-level programming and resource management. It enables integration of voice services with data services using the familiar client-server paradigm.

A voice service is viewed as a sequence of interaction dialogs between a user and an implementation platform. The dialogs are provided by document servers, which may be external to the implementation platform. Document servers maintain overall service logic, perform database and legacy system operations, and produce dialogs.

A VoiceXML document specifies each interaction dialog to be conducted by a VoiceXML interpreter. User input affects dialog interpretation and is collected into requests submitted to a document server. The document server replies with another VoiceXML document to continue the user's session with other dialogs. Summarized, VoiceXML is a markup language that:

  • Shields application authors from low-level and platform-specific details;
  • Separates user interaction code (in VoiceXML) from service logic (e.g. CGI scripts);
  • Minimizes client/server interactions by specifying multiple interactions per document;
  • Promotes service portability across implementation platforms. VoiceXML is a common language for content providers, tool providers, and platform providers;
  • Is easy to use for simple interactions, and yet provides language features to support complex dialogs.

Requirements for hardware and software


According to the spec, the "http" URI scheme must be supported for document acquisition. In some cases, the document request is generated by the interpretation of a VoiceXML document, while other requests are generated by the interpreter context in response to events outside the scope of the language, for example an incoming phone call. An implementation platform must support audio output using audio files and text-to-speech (TTS). An implementation platform is also required to detect and report character and/or spoken input simultaneously and to control input detection interval duration with a timer whose length is specified by a VoiceXML document.
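As a rough illustration of these requirements, the sketch below combines prerecorded audio with a TTS fallback and sets the input timer from within the document. The file name welcome.wav is hypothetical, and the example assumes the platform supports the built-in boolean field type.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <!-- Wait up to 5 seconds for input before a noinput event is thrown -->
  <property name="timeout" value="5s"/>
  <form id="greeting">
    <!-- Assumption: the platform supports the built-in boolean type -->
    <field name="proceed" type="boolean">
      <prompt>
        <!-- Prerecorded audio; the inner text is rendered by TTS
             if welcome.wav (a hypothetical file) cannot be played -->
        <audio src="welcome.wav">Welcome to the service.</audio>
        Would you like to continue?
      </prompt>
      <noinput>
        I did not hear you.
        <reprompt/>
      </noinput>
    </field>
  </form>
</vxml>
```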

The VoiceXML application platform must report characters (for example, DTMF) entered by a user, and must support the XML form of DTMF grammars described in the W3C Speech Recognition Grammar Specification (SRGS). It also must be able to record audio received from the user.
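Recording is exposed to the author through the <record> form item. The following is a minimal sketch, not a complete service: the servlet path /servlet/save_message is a made-up endpoint, and the timing values are arbitrary.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <form id="leave_message">
    <!-- Record up to 10 seconds; stop after 4 seconds of silence
         or when the caller presses any DTMF key -->
    <record name="msg" beep="true" maxtime="10s"
            finalsilence="4000ms" dtmfterm="true" type="audio/x-wav">
      <prompt>At the tone, please say your message.</prompt>
      <noinput>I did not hear anything. Please try again.</noinput>
    </record>
    <block>
      <!-- Hypothetical endpoint; the recording is sent as multipart form data -->
      <submit next="/servlet/save_message" method="post"
              enctype="multipart/form-data" namelist="msg"/>
    </block>
  </form>
</vxml>
```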

So how does it work?


We have already seen this basic concept in the SCXML language. A VoiceXML document (or a set of related documents called an application) forms a conversational finite state machine. The user is always in one conversational state, or dialog, at a time. Each dialog determines the next dialog transition. Transitions are specified using URIs, which define the next document and dialog to use. Execution is terminated when a dialog does not specify a successor, or if it has an element that explicitly exits the conversation.
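A small sketch of such a state machine, with two dialogs of my own invention: the first form names its successor with a URI fragment, and the second ends the conversation explicitly.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <form id="intro">
    <block>
      Welcome.
      <!-- "#goodbye" is a dialog in this document; a URI such as
           "other.vxml#start" would name a dialog in another document -->
      <goto next="#goodbye"/>
    </block>
  </form>
  <form id="goodbye">
    <block>
      Thank you for calling.
      <!-- Explicitly terminates the conversation -->
      <exit/>
    </block>
  </form>
</vxml>
```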

There are two kinds of dialogs: forms and menus. Forms define an interaction that collects values for a set of form item variables. Each field may specify a grammar that defines the allowable inputs for that field. If a form-level grammar is present, it can be used to fill several fields from one utterance. A menu presents the user with a choice of options and then transitions to another dialog based on that choice.
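For comparison with the form in Listing A, a menu might look like the following sketch. The dialog names and choices are illustrative only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <menu id="main">
    <!-- <enumerate/> reads out the text of each choice -->
    <prompt>Say one of: <enumerate/></prompt>
    <choice next="#weather">weather</choice>
    <choice next="#news">news</choice>
    <noinput>Please say one of: <enumerate/></noinput>
  </menu>
  <form id="weather">
    <block>Here is the weather. <exit/></block>
  </form>
  <form id="news">
    <block>Here is the news. <exit/></block>
  </form>
</vxml>
```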

A subdialog is like a function call, in that it provides a mechanism for invoking a new interaction, and returning to the original form. Variable instances, grammars, and state information are saved and are available upon returning to the calling document. Subdialogs can be used, for example, to create a confirmation sequence that may require a database query; to create a set of components that may be shared among documents in a single application; or to create a reusable library of dialogs shared among many applications.
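The function-call analogy can be sketched as follows. The form names, the parameter name "heard", and the grammar file are assumptions for illustration, and the built-in boolean type is assumed to be available on the platform.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <form id="order">
    <field name="city">
      <prompt>What city?</prompt>
      <grammar src="city.grxml" type="application/srgs+xml"/>
    </field>
    <!-- "Call" the confirm dialog, passing the recognized city -->
    <subdialog name="check" src="#confirm">
      <param name="heard" expr="city"/>
      <filled>
        <!-- Returned variables are properties of the subdialog variable -->
        <if cond="check.answer">
          <prompt>Thank you.</prompt>
        <else/>
          <!-- Ask for the city again and repeat the confirmation -->
          <clear namelist="city check"/>
        </if>
      </filled>
    </subdialog>
  </form>

  <form id="confirm">
    <!-- Parameters must be declared in the called dialog -->
    <var name="heard"/>
    <field name="answer" type="boolean">
      <prompt>Did you say <value expr="heard"/>?</prompt>
      <filled>
        <!-- Return to the caller with the answer -->
        <return namelist="answer"/>
      </filled>
    </field>
  </form>
</vxml>
```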

A session begins when the user starts to interact with a VoiceXML interpreter context, continues as documents are loaded and processed, and ends when requested by the user, a document, or the interpreter context.

An application is a set of documents sharing the same application root document. Whenever the user interacts with a document in an application, its application root document is also loaded. The application root document remains loaded while the user is transitioning between other documents in the same application, and it is unloaded when the user transitions to a document that is not in the application. Figure A (fig1.gif) shows the transition of documents (D) in an application that share a common application root document (root).

Figure A

Document transitions
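A document joins an application by naming the root document in the application attribute of <vxml>. In the sketch below, app-root.vxml and its variable "greeting" are hypothetical; the root is assumed to contain a declaration such as <var name="greeting" expr="'Hello'"/>.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- A leaf document; app-root.vxml is a hypothetical root document -->
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0"
      application="app-root.vxml">
  <form id="say_hello">
    <block>
      <!-- Variables declared in the root document are visible
           in application scope from every leaf document -->
      <prompt><value expr="application.greeting"/></prompt>
    </block>
  </form>
</vxml>
```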

Each dialog has one or more speech and/or DTMF grammars associated with it. In machine directed applications, each dialog's grammars are active only when the user is in that dialog. In mixed initiative applications, where the user and the machine alternate in determining what to do next, some of the dialogs are flagged to make their grammars active (i.e., listened for) even when the user is in another dialog in the same document, or on another loaded document in the same application.
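A mixed initiative form could be sketched roughly as below. The grammar file weather.grxml is hypothetical and is assumed to return "country" and "city" slots from a single utterance; the submit target reuses the servlet path from the weather example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <!-- scope="document" keeps this form's grammar active while the
       user is in any other dialog of this document -->
  <form id="weather_info" scope="document">
    <!-- Hypothetical form-level grammar that can fill both fields -->
    <grammar src="weather.grxml" type="application/srgs+xml"/>
    <initial name="start">
      <prompt>For which country and city do you want the weather?</prompt>
    </initial>
    <!-- Visited only for the slots the first utterance left unfilled -->
    <field name="country">
      <prompt>What country?</prompt>
    </field>
    <field name="city">
      <prompt>What city?</prompt>
    </field>
    <!-- Runs once both fields have values -->
    <filled mode="all" namelist="country city">
      <submit next="/servlet/weather" namelist="city country"/>
    </filled>
  </form>
</vxml>
```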

VoiceXML provides a form-filling mechanism for handling "normal" user input. In addition, VoiceXML defines a mechanism for handling events not covered by the form mechanism. Events are thrown by the platform under a variety of circumstances, such as when the user does not respond, doesn't respond intelligibly, requests help, etc. The interpreter also throws events if it finds a semantic error in a VoiceXML document. Events are caught by catch elements or their syntactic shorthand.
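The sketch below shows both styles of event handler on one field: the shorthand elements <noinput>, <nomatch>, and <help>, and a general <catch>. The prompts and the grammar file are placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <form id="ask_city">
    <field name="city">
      <prompt>What city?</prompt>
      <grammar src="city.grxml" type="application/srgs+xml"/>
      <!-- Shorthand for <catch event="noinput"> -->
      <noinput>I did not hear you. <reprompt/></noinput>
      <!-- The count attribute selects a handler by occurrence number -->
      <nomatch count="1">I did not understand what you said. <reprompt/></nomatch>
      <nomatch count="3">Sorry, I am still having trouble. Goodbye. <exit/></nomatch>
      <!-- Shorthand for <catch event="help"> -->
      <help>Please speak the city for which you want the weather.</help>
      <!-- General form, here catching fetch errors -->
      <catch event="error.badfetch">
        The service is not available right now.
        <exit/>
      </catch>
    </field>
  </form>
</vxml>
```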

Example application


Let’s look at an application that uses the simplest and most common type of form, in which the form items are executed exactly once, in sequential order, to implement a computer-directed interaction. Listing B is a weather information service that uses such a form to provide weather information for a specified country and city (weather.vxml.txt).

Listing B


<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd">
<form id="weather_info">
 <block>Welcome to the weather information service.</block>
 <field name="country">
  <prompt>What country?</prompt>
  <grammar src="country.grxml"  type="application/srgs+xml"/>
  <catch event="help">
     Please speak the country for which you want the weather.
  </catch>
 </field>
 <field name="city">
  <prompt>What city?</prompt>
  <grammar src="city.grxml" type="application/srgs+xml"/>
  <catch event="help">
     Please speak the city for which you want the weather.
  </catch>
 </field>
 <block>
  <submit next="/servlet/weather" namelist="city country"/>
 </block>
</form>
</vxml>

This dialog proceeds sequentially:

C (computer): Welcome to the weather information service. What country?
H (human): Help
C: Please speak the country for which you want the weather.
H: Georgia
C: What city?
H: Macon
C: I did not understand what you said. What city?
H: Tbilisi
C: The conditions in Tbilisi Georgia are sunny and clear at 11 AM ...
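The last line of the dialog comes from the document server, not from Listing B: the <submit> sends the collected values to /servlet/weather, which answers with a new VoiceXML document. What the servlet actually returns is not shown in this article, so the following is only a guess at the general shape of such a reply.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative reply that a document server might generate
     for city=Tbilisi and country=Georgia -->
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <form id="report">
    <block>
      The conditions in Tbilisi Georgia are sunny and clear at 11 AM.
      <!-- No successor dialog is named, so the session ends here -->
      <exit/>
    </block>
  </form>
</vxml>
```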

User input


The <grammar> element is used to provide either a speech grammar or a DTMF grammar. A speech grammar specifies a set of utterances that a user may speak to perform an action or supply information and, for a matching utterance, returns a corresponding semantic interpretation. Listing C is an example of an inline grammar written in the XML form of the W3C Speech Recognition Grammar Specification (SRGS) (grammar1.xml.txt).

Listing C


<grammar mode="voice" xml:lang="en-US" version="1.0" root="command">
  <!-- Command is an action on an object -->
  <!-- e.g. "open a window" -->
  <rule id="command" scope="public">
    <ruleref uri="#action"/> <ruleref uri="#object"/>
  </rule>

  <rule id="action">
    <one-of>
      <item> open </item>
      <item> close </item>
      <item> delete </item>
      <item> move </item>
    </one-of>
  </rule>

  <rule id="object">
    <item repeat="0-1">
      <one-of> <item> the </item> <item> a </item> </one-of>
    </item>
    <one-of>
      <item> window </item>
      <item> file </item>
      <item> menu </item>
    </one-of>
  </rule>
</grammar>

A DTMF grammar specifies a set of key presses that a user may use to perform an action or supply information and, for matching DTMF input, returns a corresponding semantic interpretation. All VoiceXML platforms are required to support the DTMF grammar XML format. Listing D is an example of a simple inline XML DTMF grammar that accepts as input either "1 2 3" or "#" (grammar2.xml.txt).

Listing D


<grammar mode="dtmf" version="1.0" root="root">
  <rule id="root" scope="public">
    <one-of>
      <item> 1 2 3 </item>
      <item> # </item>
    </one-of>
  </rule>
</grammar>

System output


The <prompt> element controls the output of synthesized speech and prerecorded audio. Conceptually, prompts are instantaneously queued for play, so interpretation proceeds until the user needs to provide an input. At this point, the prompts are played, and the system waits for user input. Once the input is received from the speech recognition subsystem (or the DTMF recognizer), interpretation proceeds.

The content of the <prompt> element is modeled on the W3C Speech Synthesis Markup Language (SSML). A good introduction to SSML is also available.
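A short sketch of SSML-style markup inside prompts follows. The wording and the grammar file are placeholders; the second prompt shows how the count attribute lets a field vary its wording on repeated attempts.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <form id="ask_city">
    <field name="city">
      <!-- bargein="false": the caller cannot interrupt this prompt -->
      <prompt bargein="false" count="1">
        Welcome to the <emphasis>weather</emphasis> service.
        <break time="500ms"/>
        What city?
      </prompt>
      <!-- Used from the second attempt onward -->
      <prompt count="2">Please say the name of a city.</prompt>
      <grammar src="city.grxml" type="application/srgs+xml"/>
    </field>
  </form>
</vxml>
```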

Beyond average

Certainly this article is just an introduction and cannot cover all the details and features of VoiceXML. VoiceXML is a W3C-endorsed markup language that lets developers write advanced telephony applications with a simplicity undreamed of until recent years; the average Web developer can write a telephony application about as easily as an average HTML Web page. Because VoiceXML is a tag-based markup language, its structure is similar to HTML in many ways, but it is an auditory medium rather than a primarily visual one: the end user navigates a "telephony page" with voice commands instead of clicking buttons on a Web page. With VoiceXML, you do not need to invest in expensive hardware and software for a telephony application, or in a dedicated location to store all your telephony equipment. Many voice application hosts and providers, such as Skype, are ready to provide you with a free voice application, or with enhanced functionality for a little extra payment.
