
Artificial Intelligence Journal


From Desktop to Server: Speech Recognition Moves Upstream


Speech recognition is the process by which computer-based software converts audible voice into digital text. When people think of computer-based speech recognition, most picture someone sitting at a desk, wearing a headset microphone, dictating large volumes of text into a desktop system. Over the past decade, however, speech-recognition technology has moved from the desktop to the server - from use by an individual to use by the enterprise.

Take the following examples:
"Can you tell me what IBM stock closed at last night?" "I'd like to transfer two thousand dollars from my savings to my checking account please" "Connect me to Kim Kemble in Boca Raton" "Subjective colon 32-year-old white female in for evaluation period Patient describes injury of left ankle period "Objective colon Patient presents in mild distress and pain period Heart colon Regular rate and rhythm period"

What do the above have in common? They're all snippets of dialog extracted from actual applications that utilize speech recognition. What's more, all of these applications implement speech recognition over the telephone: callers can conduct business with voice-enabled automated systems simply by speaking.

Before we go any further, let's take a look at some basic speech terms.

  • Dictation: The composing and recording of thoughts into audible voice files
  • Transcription: The process of transcribing dictation into text, either manually (i.e., by typing) or by using speech recognition
  • Command and control: The use of spoken words and phrases to direct an application to perform a task
  • Speaker independence: The extent to which a speech recognition system must understand each individual speaker's voice characteristics to be able to process speech

A Short History
IBM, which holds more than 150 patents in voice recognition, has been researching speech for more than 30 years. But it wasn't until the early 1990s that faster hardware and improved software made speech technology practical to implement.

The first speech systems ran on large UNIX servers, and resource requirements were high. They needed special digital signal processing (DSP) hardware to assist in the speech-recognition process, because most systems in those days simply weren't powerful enough to support such computationally intensive recognition. They also required several hours of "training," and users had to talk in a style known as discrete speech, inserting short pauses between words ("you ... had ... to ... talk ... like ... this"). Even though it was not the most natural way to speak, for people who routinely dealt with large volumes of text (doctors and lawyers, in particular), it was a significant breakthrough.

Then came continuous speech, which meant users didn't have to insert pauses between words. This continuous speech was limited to short commands and phrases, and dictation still required discrete speech. In 1993, a personal voice product, IBM's Personal Dictation System, was released for OS/2. It was one of the first commercially available, high-accuracy voice-recognition products. The following year saw the announcement of IBM VoiceType Dictation for Windows and OS/2 ("You talk, it types"). Four years later, IBM released the industry's first desktop continuous-dictation product. Users no longer had to pause between words, whether they were dictating or using commands, and could speak at a natural pace. Training requirements were decreased from several hours to several minutes, and recognition accuracy continued to improve.

Over the past several years, voice technology has moved from the desktop to the enterprise in the form of voice middleware. Voice middleware encompasses platforms and applications that run on servers, such as IBM's WebSphere Voice Server, serving hundreds or thousands of customers via the telephone or Internet. Generally, these server-based voice applications are written to service a limited vocabulary and a large number of users, such as bank customers. No "training" of a caller's voice is required.

An example of voice middleware is a customer service application that uses Web technology. This new application might give customers a voice interface to the same Web application content that had previously been accessible only through the Internet. For instance, a customer may now call a voice-enabled Web application server at a brokerage firm and complete a trade - without operator assistance. This is done by speaking commands and listening to the same information that might normally be "seen" using a browser on a PC or workstation.

Another example of voice middleware is a voice-enabled flight information system, where a caller can receive flight information directly (such as late arrivals) rather than waiting on hold to speak to an agent. Today a caller can simply call a number, state the flight number and city, and receive the flight information audibly over the phone.

At the same time that speech recognition was finding its way to the enterprise, it also moved to the device. Embedded speech technology now enables mobile devices - typically constrained in memory, processor speed, and storage space - to deploy voice, providing low-resource, small-vocabulary command-and-control speech recognition in a variety of languages. The software also supports a variety of real-time operating systems and microprocessors, making the development of robust mobile speech solutions easy and practical for both device and application developers.

The convergence of computers with telephones and handheld devices continues. The human voice becomes a Web browser. Surf the Web in the car while a text-to-speech application reads back the content, then tell the car to turn on the radio when finished. Check the status of an order without having to punch a telephone keypad. Use a PDA to conduct a banking transaction without touching the keyboard. What could be easier than just talking?

Trends and Directions
Voice technology, which for a long time had been confined to research, is now putting a natural interface on the computing environment, from end-user devices to infrastructure, crossing national boundaries. Worldwide spending on voice recognition will reach $41 billion by 2005, according to the Kelsey Group, a market research firm.

There are several forces driving the growth:

  • Companies view voice as a way to improve their call-center service while reducing costs. Voice recognition allows companies to use automation to serve customers over the phone, 24/7, without subjecting them to hold times or requiring them to respond to rigidly structured menus. Then there are the business savings: a typical customer service call costs $5 to $10 to support; automated voice recognition can lower that to 10 to 30 cents. The market-research firm Datamonitor says call center managers are seeing an increase in customer acceptance of automation and self-service, along with cost savings.
  • The rise of telematics, which combines computers and wireless telecommunications with motor vehicles to provide customized services such as driving directions, emergency roadside assistance, personalized news, sports, and weather information, and access to e-mail and other productivity tools. The Kelsey Group predicts U.S. and European spending on telematics will exceed $6.4 billion by 2006.
  • Companies looking to voice-enable the Internet and their IT establishments, whether it's providing information to consumers through "voice portals" or allowing employees to access corporate databases over the phone through spoken commands
  • The ability to squeeze convenient speech recognition into ever-smaller devices, such as phones, PDAs, and other mobile devices

This is happening not only in the U.S., but also across the globe. For the most part, companies looking to deploy voice face a lot of similar concerns. They want to know what business applications will bring more value to their customers and set them apart from their competition.

Key to the growth of voice is VoiceXML. VoiceXML has been the catalyst for the deployment of speaker-independent, limited-vocabulary automatic speech-recognition systems in recent years. It has allowed applications such as voice portals and speech-enabled call centers to grow, paving the way for Web access via phone. This allows call centers to automate simple customer requests and reserve their live agents for more complicated tasks and inquiries, making more efficient use of them. The significance of VoiceXML is manifold:

  • It's an industry-standard language (sponsored by the VoiceXML Forum), designed to leverage the skills of the Web development community.
  • It makes speech application programming much more accessible to developers. If you know HTML or XML, VoiceXML will look very familiar to you. Not only does it use the well-known element-attribute (or tag) style, it also utilizes the very same infrastructure that exists within the Web today.
  • VoiceXML takes speech application programming to a higher level. Previously, you had to know quite a bit about speech recognition and possibly even linguistics to be able to develop an effective speech application. With VoiceXML, the system does a lot of things for your application "under the covers" so your application can focus on the task at hand.
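The element-attribute style mentioned above is easy to picture with a short sketch. The following Python snippet uses only the standard xml.etree module to assemble a minimal, hypothetical VoiceXML-style document - one form that prompts a caller and collects a single spoken field. The form id, field name, and prompt wording are invented for illustration, not taken from any shipped application.

```python
import xml.etree.ElementTree as ET

# Assemble a minimal, hypothetical VoiceXML-style document: one form
# that prompts the caller and collects a single spoken field. The ids
# and prompt text below are invented for illustration.
vxml = ET.Element("vxml", version="2.0")
form = ET.SubElement(vxml, "form", id="account_menu")
field = ET.SubElement(form, "field", name="service")
prompt = ET.SubElement(field, "prompt")
prompt.text = "Would you like your savings or checking balance?"

document = ET.tostring(vxml, encoding="unicode")
print(document)
```

A real VoiceXML document carries additional declarations and is interpreted by a voice browser rather than a desktop script, but the point stands: the markup is ordinary XML, so familiar Web skills and tooling apply directly.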

Another key factor in the growth of voice is Natural Language Understanding (NLU) technology. NLU enables applications to implement a conversational interaction with end users. No longer do users have to speak in short commands and phrases; they can now speak almost as if they were talking to a live agent. NLU combines breakthrough research and development in the fields of voice recognition, linguistics, statistics, human factors, and artificial intelligence. When NLU is combined with speech recognition, the user is able to speak in a more unstructured, conversational style, resulting in a more comfortable and productive user experience. For example:
    System: "Welcome to the Mutual Funds Trading System. How can I help you?"
    User: "I'd like to buy more Fidelity, please."
    System: "You currently own 500 shares of Fidelity Magellan. How many shares would you like to buy?"
    User: "What's the going price?"
    System: "The current price of Fidelity Magellan isSWhat else can I do for you?
    User: "HmmmmmSI think I'll buy 300 more"
    System: "Purchasing 300 shares of Fidelity Magellan at market priceSplease say 'yes' to confirm."
    User: "No, make that $2,000."
    System: "Purchasing 2000 dollars of Fidelity Magellan at market priceSplease say yes to confirm."
    User: "Do it."
    System: "2000 dollars, or 125.25 shares, of Fidelity Magellan purchased at market price. What else can I do for you?"

When the user says "I'd like to buy more Fidelity please," the software must extract the meaning from what was said. Clearly, there are extra words in this sentence - words that don't affect the meaning of the sentence - and the software must extract the pertinent words from the sentence to determine its meaning. In this sentence, the key concepts are "buy" (an action), "more" (implying a fund the user already owns), and "Fidelity" (a fund company).

As you can see from its response, the application determined that the user owns Fidelity Magellan, and then asks how many shares the user would like to buy. In this example, the user responds NOT with the number of shares to buy, but with a totally different question, "What's the going price?" This means the caller is in control, rather than being driven by long menus. At this point, the application must "switch gears" and obtain the current price of a fund. From the context, or what was previously said by the user, the application knows that the user wants the price of Fidelity Magellan. You can see from the example how NLU technology can make the interaction between a user and the system more intuitive and effective.
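The concept extraction described above can be caricatured in a few lines. The sketch below is crude keyword spotting, not IBM's NLU technology - production systems rely on statistical models trained on large corpora - but it illustrates the idea of pulling an action, a fund, and the "more" qualifier out of a free-form utterance. All names and tables here are invented.

```python
# Crude keyword-spotting sketch of NLU concept extraction. Real NLU
# systems use statistical models, not lookup tables like these; the
# names below are invented for illustration.
ACTIONS = {"buy", "sell", "transfer"}
FUNDS = {"fidelity": "Fidelity Magellan"}  # assumes the caller owns this fund

def extract_concepts(utterance):
    """Pull the action, fund, and 'more' qualifier out of an utterance."""
    words = utterance.lower().replace(",", "").replace(".", "").split()
    concepts = {}
    for word in words:
        if word in ACTIONS:
            concepts["action"] = word       # e.g., "buy"
        if word in FUNDS:
            concepts["fund"] = FUNDS[word]  # resolve to the owned fund
    concepts["more"] = "more" in words      # implies an existing holding
    return concepts

print(extract_concepts("I'd like to buy more Fidelity, please"))
# -> {'action': 'buy', 'fund': 'Fidelity Magellan', 'more': True}
```

Note how the extra words ("I'd like to", "please") simply fall away: only the words that carry meaning survive into the concept set, which is exactly the behavior the mutual-fund dialog depends on.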

And NLU is not just a vision of the future. One of the most sophisticated uses of NLU technology is being deployed by the 401(k) management company T. Rowe Price, whose system is being rolled out to one million users who will be able to manage their retirement accounts using the enhanced Plan Account Line. The system doesn't require a caller to use a particular script. "We believe that most callers will save at least 30 percent of their time," says Heidi Walsh, vice president and senior marketing manager.

Developments in speech-recognition technology haven't been limited to telephony. Dictation has also leapt from PCs to the server. Most recently, IBM announced the WebSphere Voice Server for Transcription. This offering was introduced in early 2002 and provided large-vocabulary continuous dictation to the enterprise. Aimed at solution developers and service providers with document-workflow solutions, the Voice Server for Transcription can automate what has traditionally been a very manual and resource-intensive process - that of dictation transcription.

For many years, physicians, lawyers, and other professionals whose work requires producing high volumes of text have relied on typists and transcriptionists to convert their dictation into documents. With the WebSphere Voice Server for Transcription, the professional's dictation can be transcribed automatically, leaving transcriptionists to only correct and edit (rather than transcribing from scratch), thus improving overall productivity and turnaround. Since skilled transcriptionists are expensive and hard to find, automated transcription makes the process more efficient.

So what exactly is transcription? Take the example at the beginning of this article. If this audio were sent to the WebSphere Voice Server for Transcription, it would result in something like this:

    Subjective: 32-year-old white female in for evaluation. Patient describes injury of left ankle.
    Objective: Patient presents in mild distress and pain.
    Heart: Regular rate and rhythm.

A transcriptionist would edit and correct the transcribed text, and the workflow application would use the edited text to fill in the appropriate fields (e.g., "Assessment" and "Plan").
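The conversion from spoken punctuation to formatted text can be sketched simply. The snippet below is an illustrative toy, not the Voice Server's actual formatting engine: it handles only three spoken tokens, whereas a production transcription server supports many more formatting and section commands.

```python
# Toy formatter: turn spoken punctuation tokens in a dictation
# transcript into formatted text. The token table is illustrative; a
# production transcription server handles far more commands than this.
SPOKEN_PUNCTUATION = {"colon": ":", "period": ".", "comma": ","}

def format_dictation(tokens):
    out = []
    for token in tokens:
        mark = SPOKEN_PUNCTUATION.get(token.lower())
        if mark and out:
            out[-1] += mark      # attach the mark to the preceding word
        else:
            out.append(token)
    return " ".join(out)

print(format_dictation("Heart colon Regular rate and rhythm period".split()))
# -> Heart: Regular rate and rhythm.
```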

The Case for Speech
With the emphasis on improving customer service and customer-relationship management during the past decade, Interactive Voice Response (IVR) systems ("press 1 for savings, press 2 for checking...") have become ubiquitous. Most IVR systems were originally installed to automate customer service, or contact center, applications and reduce call-handling time. However, callers often bypass IVRs because of complicated menus, resulting in a higher-than-anticipated volume of calls ending up with a Customer Service Representative (CSR). Voice-enabled applications replace complex menu choices by allowing callers to go directly to a selection with a simple spoken request. These systems are usually preferred by customers, encouraging them to participate in self-service voice dialogues rather than opting out to talk to a CSR. Voice recognition reduces the length of calls by 50% or more over menu-driven IVRs, increasing the use of voice automation (versus CSRs) from 20% to 60%. This results in a much quicker return on investment (ROI) of the voice system and increased customer satisfaction.

Let's take a look at the financial industry specifically. With voice recognition, financial institutions can reduce contact center costs, particularly on nonrevenue-generating calls. The typical fully loaded cost of a CSR is roughly $45,000 per year (assuming a base salary of approximately $36,000), or roughly $1 to $1.50 per call. While IVR systems drive this cost down, voice enablement reduces it further by flattening menus and speeding up navigation. With voice recognition, call costs can be reduced to around 30¢ a call or less, depending upon call volume.
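The per-call arithmetic above can be made concrete. This back-of-envelope sketch uses the article's rough cost estimates together with an assumed, purely hypothetical annual call volume; it is not based on any measured deployment.

```python
# Back-of-envelope sketch of the per-call economics described above.
# Costs are the article's rough estimates (kept in cents so the
# arithmetic stays exact); the call volume is a hypothetical figure.
csr_cost_cents = 125        # midpoint of the $1-$1.50 CSR-handled call
voice_cost_cents = 30       # roughly 30 cents with voice recognition
calls_per_year = 1_000_000  # assumed contact-center volume

savings_dollars = calls_per_year * (csr_cost_cents - voice_cost_cents) / 100
print(f"Annual savings if these calls are voice-automated: ${savings_dollars:,.0f}")
# -> Annual savings if these calls are voice-automated: $950,000
```

Even at a fraction of that automation rate, the gap between roughly $1.25 and 30¢ per call is what makes the payback periods discussed below plausible.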

Any increase in automated calls further drives down contact center costs. This kind of return on investment (ROI) means quick system payback. For example, some analysts claim that the payback on even "massive-scale, high-availability" voice-enabled contact center systems has been less than 18 months.

One large financial corporation is currently voice-enabling its automated basic banking services, including balance inquiry and funds transfer. Its cost justification for using speech recognition rests on three predictions. First, it expects to shorten incoming call length by flattening touch-tone menus and allowing a caller to jump directly to a desired action. Second, it expects to increase the percentage of automated calls by 2% a year, which represents a large number of calls. Last, it expects to capture roughly 20% of the calls where the caller does nothing and simply defaults to a CSR; with a voice-enabled interface, it believes those callers can be persuaded to use the system rather than wait for a transfer.

There are many examples of enterprises adopting speech into applications in their organizations - and not just within the financial industry. While a business case can be built to show a quick return on investment, ROI is just one reason for justifying voice-enabled applications. Others include:

  • Furthering a strategic or corporate goal
  • Providing a service that previously could not be offered
  • Enhancing the performance of an existing system

What's Next
In the '80s, the introduction of the PC increased the population of people who could access information. Then came the Internet, which used PCs as terminals to access the Web. Now we're moving toward small devices for an even larger population (including people who don't necessarily use a PC). Advances in voice technology have enabled people to speak directly to devices, rather than use traditional input methods such as the mouse or the keyboard. Soon people will be able to use speech when it's easier to say something than to type it or wade through long menus, a graphical interface when a visual representation serves their needs best, and touch when that's the easiest way to make a selection. This is known as a multimodal interface: it combines all of the different ways to use technology, employing the user interface most appropriate to the task at hand.

For example, consider a busy mobile worker on the way to the airport who receives a call from a manager, wanting him in Hong Kong for a customer meeting - instead of Tokyo, as originally planned. Using a cell phone, the worker calls the voice-enabled, automated flight reservation number of an airline and requests a list of available flights to Hong Kong. Since the worker is using speech recognition, he gets immediate attention, rather than holding for the next available operator. Shortly after hanging up, a schedule of all available flights is displayed on his wireless PDA. The worker taps a selection, sending it back to the airline reservation server. The flight is booked. The worker used the interface (whether it was speech, graphics, or touch) that was the most convenient to use at the time, all within the same task.

In terms of the base speech technology, we'll continue to see improvements in recognition accuracy, including more languages and dialects supported and tools to make the creation of domain-specific language models easier. We'll see unstructured dictation (where you don't need to dictate punctuation and formatting) and we'll also start seeing speech recognition more fully integrated with other voice technologies, such as speech synthesis, language translation, speaker verification, and natural language understanding. Imagine getting the minutes from a conference call automatically transcribed for you, with individual speakers identified as they speak, and then automatically translating these minutes into other languages - all in real time while the conference call is going on.

Speech recognition technology is being used on the desktop by doctors, lawyers, and even students to input large quantities of text... It's being embedded in PDAs and smart phones to make mobile computing easier and more natural... It's being deployed in cars... It's being used by enterprises to enhance their self-service customer-facing applications... It's being used in real time by professionals to fill in forms, such as insurance claims and trouble reports... It's being used at kiosks in airports, shopping malls, movie theatres, and theme parks so customers can get real-time information... It's being used in the home to control appliances... The possibilities are endless. Speech is quickly becoming a key user interface of choice. And, given its history in the past decade, significant progress will continue to be made; by 2010, speech recognition will truly be pervasive.

Today, faster chip speeds and more sophisticated algorithms mean voice recognition is performing better than ever before. New speech-enabled applications are hitting the market as businesses and consumers realize that voice is the most natural way to access information from the Internet, mobile phones, car dashboards, or handheld organizers. Voice technology may have started with desktop computers, but today, speech is making its way beyond the desktop world to the various touch points of an increasingly mobile e-business world.

More Stories By Kimberlee Kemble

Kimberlee Kemble is program manager for Technical Marketing, IBM Pervasive Computing, Boca Raton, Florida. Kim has been with IBM since 1982 and has been working in the field of speech recognition since 1994. Currently, she coordinates education and training programs for several IBM voice products and technologies, with a special focus on VoiceXML and voice user interface design. Kim has authored many tutorials, white papers, and articles ranging from the highly technical to the non-technical. She manages the marketing programs for the IBM WebSphere Voice Server for Transcription. She is active in the VoiceXML Forum Publication Review Board and Education Committee, and has taught VoiceXML classes and presented voice-related topics at industry conferences.
