Principles of ASR Technology and Its Application in CALL


Abstract. With recent advances in multimedia technology, computer-aided language learning (CALL) has emerged as a tempting alternative to traditional modes of supplementing or replacing direct student-teacher interaction, such as the language laboratory or audio-tape-based self-study. The integration of sound, voice interaction, text, video, and animation has made it possible to create self-paced interactive learning environments that promise to enhance the classroom model of language learning significantly. This paper makes a case for using automatic speech recognition (ASR) and speech processing technology in CALL. It is proposed not only that speech technology is an essential component of CALL, but that it is, in fact, ready to be deployed successfully in second language education.

Key words: ASR; speech technology; CALL; voice interaction

1.Introduction

In order to appreciate the potential benefit of using speech technology in CALL, a basic understanding of both the core technology and its limitations--what it can and cannot do--is therefore essential. In the following section, we will present an overview of speech recognition. An overview of current research trends will help identify the kinds of technological advances that lend themselves to being deployed in computer-based language instruction. Next, to illustrate the potential use of speech technology, we will examine a number of innovative language learning applications that offer voice-interactive capabilities.

2.Principles of ASR Technology

Humans and machines process speech in fundamentally different ways. Complex cognitive processes account for the human ability to associate acoustic signals with meanings and intentions. For a computer, on the other hand, speech is essentially a series of digital values. However, despite these differences, the core problem of speech recognition is the same for both humans and machines: namely, of finding the best match between a given speech sound and its corresponding word string. Automatic speech recognition technology attempts to simulate and optimize this process computationally.

Since the early 1970s, a number of different approaches to ASR have been proposed and implemented, including Dynamic Time Warping, template matching, knowledge-based expert systems, neural nets, and Hidden Markov Modeling (HMM). HMM-based modeling applies sophisticated statistical and probabilistic computations to the problem of pattern matching at the sub-word level. The generalized HMM-based approach to speech recognition has proven an effective, if not the most effective, method for creating high-performance speaker-independent recognition engines that can cope with large vocabularies; the vast majority of today's commercial systems deploy this technique. Therefore, we focus our technical discussion on an explanation of this technique.

An HMM-based speech recognizer consists of five basic components: (a) an acoustic signal analyzer which computes a spectral representation of the incoming speech; (b) a set of phone models (HMMs) trained on large amounts of actual speech data; (c) a lexicon for converting sub-word phone sequences into words; (d) a statistical language model or grammar network that defines the recognition task in terms of legitimate word combinations at the sentence level; (e) a decoder, which is a search algorithm for computing the best match between a spoken utterance and its corresponding word string. Figure 1 shows a schematic representation of the components of a speech recognizer and their functional interaction.
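To make the division of labor concrete, the following sketch lays out these five components as parts of a single pipeline. All class names, fields, and method signatures are illustrative placeholders, not the interface of any particular recognition toolkit.

```python
# Illustrative sketch of the five components of an HMM-based recognizer;
# every name and signature below is hypothetical.
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SpeechRecognizer:
    """Container for the recognizer components (a)-(e) described above."""
    phone_models: Dict[str, object]                # (b) trained HMMs, one per phone
    lexicon: Dict[str, List[List[str]]]            # (c) word -> phone sequences
    language_model: Dict[Tuple[str, str], float]   # (d) word-sequence statistics

    def analyze_signal(self, samples: List[int]) -> List[List[float]]:
        """(a) Turn raw audio samples into frames of acoustic features."""
        raise NotImplementedError

    def decode(self, features: List[List[float]]) -> List[str]:
        """(e) Search for the word string that best matches the features."""
        raise NotImplementedError
```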

2.1.Signal Analysis

The first step in automatic speech recognition consists of analyzing the incoming speech signal. When a person speaks into an ASR device--usually through a high quality noise-canceling microphone--the computer samples the analog input into a series of 16- or 8-bit values at a particular sampling frequency. These values are grouped together in predetermined overlapping temporal intervals called "frames." These numbers provide a precise description of the speech signal's amplitude. In a second step, a number of acoustically relevant parameters such as energy, spectral features, and pitch information, are extracted from the speech signal. During training, this information is used to model that particular portion of the speech signal. During recognition, this information is matched against the pre-existing model of the signal.
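As a rough illustration of the framing and feature-extraction step, the sketch below splits a 16 kHz signal into overlapping 25 ms frames and computes per-frame energy plus a few coarse spectral values. The frame sizes and features are simplified stand-ins for the cepstral features (e.g., MFCCs) that real systems typically use.

```python
# Minimal framing and feature-extraction sketch, assuming 16 kHz audio.
import numpy as np

def extract_features(samples, sample_rate=16000, frame_ms=25, step_ms=10):
    """Split a signal into overlapping frames and compute simple features."""
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g., 400 samples per frame
    step = int(sample_rate * step_ms / 1000)         # successive frames overlap
    window = np.hamming(frame_len)
    features = []
    for start in range(0, len(samples) - frame_len + 1, step):
        frame = np.asarray(samples[start:start + frame_len], dtype=float)
        energy = np.log(np.sum(frame ** 2) + 1e-10)        # frame energy
        spectrum = np.abs(np.fft.rfft(frame * window))     # coarse spectral shape
        features.append([energy] + np.log(spectrum[:12] + 1e-10).tolist())
    return features
```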

2.2.Phone Models

Training a machine to recognize spoken language amounts to modeling the basic sounds of speech (phones). Automatic speech recognition strings together these models to form words. Recognizing an incoming speech signal involves matching the observed acoustic sequence against a set of HMMs. An HMM can model phones or other sub-word units, or it can model words or even whole sentences. Phones are either modeled as individual sounds--so-called monophones--or as phone combinations that model several phones and the transitions between them (biphones or triphones). After comparing the incoming acoustic signal with the HMMs representing the sounds of the language, the system computes a hypothesis based on the sequence of models that most closely resembles the incoming signal. The HMM for each linguistic unit (phone or word) contains a probabilistic representation of all the possible pronunciations of that unit, just as a model of a handwritten cursive letter would have to capture the many different ways in which that letter can be written.
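The toy model below sketches what a single three-state, left-to-right phone HMM with one-dimensional Gaussian emissions might look like, together with a forward-algorithm likelihood. All parameter values are invented for illustration, and the likelihood is summed over all end states for simplicity.

```python
# Toy three-state, left-to-right phone HMM with Gaussian emissions.
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Left-to-right topology: each state either loops on itself or moves ahead.
trans = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.6, 0.4],
                  [0.0, 0.0, 1.0]])
means = np.array([0.0, 1.0, 2.0])   # toy emission means
vars_ = np.array([1.0, 1.0, 1.0])   # toy emission variances

def forward_likelihood(observations):
    """P(observations | phone model) via the forward algorithm."""
    alpha = np.zeros(3)
    alpha[0] = gaussian_pdf(observations[0], means[0], vars_[0])  # start in state 0
    for obs in observations[1:]:
        alpha = (alpha @ trans) * gaussian_pdf(obs, means, vars_)
    return float(alpha.sum())   # summed over all end states for simplicity

print(forward_likelihood([0.1, 0.9, 1.1, 2.2]))
```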

Building HMMs--a process called training--requires a large amount of speech data of the type the system is expected to recognize. Large-vocabulary speaker-independent continuous dictation systems are typically trained on tens of thousands of read utterances by a cross-section of the population, including members of different dialect regions and age-groups. As a general rule, an automatic speech recognizer cannot correctly process speech that differs in kind from the speech it has been trained on. This is why most commercial dictation systems, when trained on standard American English, perform poorly when encountering accented speech, whether by non-native speakers or by speakers of different dialects. We will return to this point in our discussion of voice-interactive CALL applications.

2.3.Lexicon

The lexicon, or dictionary, contains the phonetic spelling for all the words that are expected to be observed by the recognizer. It serves as a reference for converting the phone sequence determined by the search algorithm into a word. It must be carefully designed to cover the entire lexical domain in which the system is expected to perform. If the recognizer encounters a word it does not "know" (i.e., a word not defined in the lexicon), it will either choose the closest match or return an out-of-vocabulary recognition error. Whether a recognition error is registered as a misrecognition or an out-of-vocabulary error depends in part on the vocabulary size. If, for example, the vocabulary is too small for an unrestricted dictation task--say, less than 3K words--the out-of-vocabulary error rate is likely to be very high. If the vocabulary is too large, the chance of misrecognition errors increases because, with more similar-sounding words, confusability rises. The vocabulary size in most commercial dictation systems tends to vary between 5K and 60K words.
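A toy lexicon of the kind described here might look as follows; the phone symbols are ARPAbet-style and the entries are invented. A failed lookup corresponds to the out-of-vocabulary case discussed above.

```python
# Toy pronunciation lexicon: each word maps to one or more phone sequences.
LEXICON = {
    "bear": [["B", "EH", "R"]],
    "bare": [["B", "EH", "R"]],                   # homophone of "bear"
    "attacked": [["AH", "T", "AE", "K", "T"]],
    "him": [["HH", "IH", "M"], ["IH", "M"]],      # full and reduced variants
}

def phones_for(word):
    """Return the pronunciation variants for a word, or flag it as OOV."""
    try:
        return LEXICON[word.lower()]
    except KeyError:
        raise ValueError(f"out-of-vocabulary word: {word!r}")

print(phones_for("bear"))
```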

2.4.The Language Model

The language model predicts the most likely continuation of an utterance on the basis of statistical information about the frequency with which word sequences occur on average in the language to be recognized. For example, the word sequence A bare attacked him will have a very low probability in any language model based on standard English usage, whereas the sequence A bear attacked him will have a higher probability of occurring. Thus the language model helps constrain the recognition hypothesis produced on the basis of the acoustic decoding, just as context helps decipher an unintelligible word in a handwritten note. Like the HMMs, an efficient language model must be trained on large amounts of data, in this case texts collected from the target domain.
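A minimal bigram language model, estimated from a toy four-sentence corpus with add-alpha smoothing, is enough to reproduce the bear/bare contrast mentioned above; the corpus and smoothing scheme are deliberately simplistic.

```python
# Toy bigram language model estimated from a four-sentence corpus.
from collections import Counter

corpus = [
    "a bear attacked him".split(),
    "the bear ran away".split(),
    "he saw a bear".split(),
    "her feet were bare".split(),
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))

def bigram_prob(sentence, alpha=0.01):
    """Probability of a word sequence under an add-alpha smoothed bigram model."""
    words = sentence.split()
    vocab_size = len(unigrams)
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * vocab_size)
    return p

print(bigram_prob("a bear attacked him"))   # relatively high
print(bigram_prob("a bare attacked him"))   # much lower
```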

In ASR applications with constrained lexical domain and/or simple task definition, the language model consists of a grammatical network that defines the possible word sequences to be accepted by the system without providing any statistical information. This type of design is suitable for CALL applications in which the possible word combinations and phrases are known in advance and can be easily anticipated (e.g., based on user data collected with a system pre-prototype). Because of the a priori constraining function of a grammar network, applications with clearly defined task grammars tend to perform at much higher accuracy rates than the quality of the acoustic recognition would suggest.
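For a task with a constrained lexical domain, the grammar network can simply enumerate the legal word sequences, as in the following sketch; the phrases are invented and serve only to show that no statistical weighting is involved.

```python
# Sketch of a small, fully enumerated task grammar (no probabilities).
import itertools

REQUESTS = ["i would like", "could i have"]
ITEMS = ["a coffee", "a ticket to paris"]
POLITE = ["", " please"]

ALLOWED = {
    (request + " " + item + polite).strip()
    for request, item, polite in itertools.product(REQUESTS, ITEMS, POLITE)
}

def in_grammar(utterance):
    """Accept only utterances that appear in the enumerated network."""
    return utterance.lower().strip() in ALLOWED

print(in_grammar("Could I have a coffee please"))   # True
print(in_grammar("Give me some coffee"))            # False: not in the network
```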

2.5.Decoder

Simply put, the decoder is an algorithm that tries to find the utterance that maximizes the probability that a given sequence of speech sounds corresponds to that utterance. This is a search problem, and especially in large vocabulary systems careful consideration must be given to questions of efficiency and optimization, for example to whether the decoder should pursue only the most likely hypothesis or a number of them in parallel (Young, 1996). An exhaustive search of all possible completions of an utterance might ultimately be more accurate but of questionable value if one has to wait two days to get a result. Trade-offs are therefore necessary to maximize the search results while at the same time minimizing the amount of CPU and recognition time.
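The schematic beam-search decoder below captures the trade-off just described: partial hypotheses are extended word by word and all but the highest-scoring ones are pruned, so the search stays fast at the cost of possibly missing the globally best match. The acoustic and language-model scoring functions are placeholders to be supplied by the caller, and all names are illustrative. Widening the beam approaches the exhaustive search mentioned above, at a corresponding cost in computation time.

```python
# Schematic beam-search decoder over word sequences (scores are log scores).
import heapq

def beam_search(frames, vocabulary, acoustic_score, lm_score,
                beam_width=3, max_words=4):
    """Extend partial hypotheses word by word, pruning all but the best."""
    beam = [(0.0, [])]                     # (total log score, word sequence)
    for _ in range(max_words):
        candidates = []
        for score, words in beam:
            for word in vocabulary:
                new_score = (score
                             + acoustic_score(frames, words + [word])
                             + lm_score(words, word))
                candidates.append((new_score, words + [word]))
        # Prune: keep only the beam_width highest-scoring partial hypotheses.
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])
```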

3.Current Trends in Voice-Interactive CALL

3.1.Pronunciation Training

A useful and remarkably successful application of speech recognition and processing technology has been demonstrated by a number of research and commercial laboratories in the area of pronunciation training. Voice-interactive pronunciation tutors prompt students to repeat spoken words and phrases or to read aloud sentences in the target language for the purpose of practicing both the sounds and the intonation of the language. The key to teaching pronunciation successfully is corrective feedback, more specifically, a type of feedback that does not rely on the student's own perception. A number of experimental systems have implemented automatic pronunciation scoring as a means to evaluate spoken learner productions in terms of fluency, segmental quality (phonemes) and supra-segmental features (intonation). The automatically generated proficiency score can then be used as a basis for providing other modes of corrective feedback. We discuss segmental and supra-segmental feedback in more detail below.
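How such a proficiency score might be assembled is sketched below, combining an average per-phone acoustic likelihood (segmental quality) with simple fluency measures such as rate of speech and pause ratio. The weights and the 0-100 scaling are invented and do not reflect any published scoring algorithm.

```python
# Hedged sketch of a combined pronunciation/fluency score; weights are invented.
import numpy as np

def pronunciation_score(phone_log_likelihoods, total_time_s, pause_time_s):
    """Combine segmental quality and fluency measures into one score."""
    segmental = float(np.mean(phone_log_likelihoods))            # phoneme-level quality
    rate_of_speech = len(phone_log_likelihoods) / total_time_s   # phones per second
    pause_ratio = pause_time_s / total_time_s                    # hesitation measure
    # Illustrative weighting, scaled to a 0-100 proficiency score.
    raw = 0.6 * segmental + 0.3 * rate_of_speech - 0.5 * pause_ratio
    return float(np.clip(50 + 10 * raw, 0, 100))

print(pronunciation_score([-1.2, -0.8, -1.5, -0.9], total_time_s=1.6, pause_time_s=0.2))
```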

3.2.Reading Aloud

Reading aloud exercises literacy skills in both second language and literacy education. Intensive practice in reading aloud helps students establish the conventional association between sounds and their written form, a skill that requires years of practice in young children and in students of languages with non-phonetic writing systems, such as Japanese or Chinese. Teaching children and students how to read their own native language or a foreign one is thus an area where speech recognition technology can make a significant difference. Imagine a reading tutor that not only listens to children and students reading aloud a story presented on the screen, but also intervenes to provide help when needed and corrects mistakes.

Designing a basic recognition network for a voice-interactive reading tutor is relatively straightforward. There is only one correct spoken response to any given written prompt, and the system "knows" in advance what the student will be trying to say. However, the technical challenge is to recognize and respond adequately to the disfluencies of inexperienced readers. Such disfluencies include hesitations, mispronunciations, false starts, and self-corrections.

In the early 1990s, Cowan and Jones (1991), McCandless (1992), and Phillips, Zue, and McCandless (1993) among others demonstrated the technical feasibility of a voice-interactive reading tutor, without, however, providing empirical user data. One of the first fielded prototype systems for teaching reading to young children was developed by the Center for Teaching and Learning (CTL) in 1991 (Kantrov, 1991). The simple but robust multimedia application used an isolated-word, speaker-dependent recognizer and a limited reading vocabulary (18+ words). The system was designed to expand children's reading vocabulary by embedding new words within the context of a goal-oriented game: children are called upon to help a bear overcome obstacles on his way home; reading the word correctly removes the obstacle. Results of three field trials in two Boston-area public schools indicated that the problems with the application were related to the human interface and input mode (microphones), rather than the speech recognition component per se. Ironically, recognition errors, especially misrecognition of correctly read words, contributed positively to the pedagogical effect of the application: the children got additional reading practice, because they had to repeat the words several times until the machine responded appropriately.

One of the most ambitious automated reading coaches currently under development is Project LISTEN at Carnegie Mellon University (CMU). Designed to combat illiteracy, the fully automated prototype uses continuous speech recognition to listen to children read continuous text and to automatically trigger pedagogically appropriate interventions. The system features a personalized agent, "Emily," who provides feedback and assistance when necessary. The system incorporates expert knowledge on individual reading assistance that is both pedagogically relevant and technically feasible. Emily intervenes when the child misreads one or more words in the current sentence, gets stuck, or clicks on a word to get help. On the other hand, to reduce frustration in children with reading difficulties, the system deliberately refrains from treating false starts, self-corrections, or hesitations as "mistakes." Instead, errors of this type are modeled and included in the recognition grammar as acceptable.

An experimental trial of the system was conducted among 12 second graders at an urban school in Pittsburgh. Results showed that the children could read at a reading level 0.6 years more advanced when using the automated reading coach, and the average number of reading mistakes fell from 12.3% (without assistance) to 2.6% (with assistance) in texts with similar difficulty.

An improved version of CMU's reading coach running in real time on an affordable PC platform was fielded in 1996 among 8 of the poorest third-grade readers at Fort Pitt, PA to measure improvements in reading performance over an eight-month period of using the system. While the earlier study measured reading performance only in terms of student word error rates, the improved system implements algorithms for measuring reading fluency in young children. Relevant performance variables include reading rate, inter-word latency (silence), disfluency (false starts, self-corrections, omissions), and time spent with the assistant. Comparing subjects' reading fluency levels at the beginning of using the system with those at the end, the experiments suggest an overall improvement in reading accuracy of 16% and a 35% decrease in inter-word latency. After using the system for eight months, students' reading levels improved by an average of two years. These results are encouraging in that they show how careful system design and evaluation based on user data can lead to useful and practical applications.

3.3.Teaching Linguistic Structures and Limited Conversation

Apart from supporting systems for teaching basic pronunciation and literacy skills, ASR technology is being deployed in automated language tutors that offer practice in a variety of higher-level linguistic skills ranging from highly constrained grammar and vocabulary drills to limited conversational skills in simulated real-life situations. Prior to implementing any such system, a choice needs to be made between two fundamentally different system design types: closed response vs. open response design. In both designs, students are prompted for speech input by a combination of written, spoken, or graphical stimuli. However, the designs differ significantly with reference to the type of verbal computer-student interaction they support. In closed response systems, students must choose one response from a limited number of possible responses presented on the screen. Students know exactly what they are allowed to say in response to any given prompt. By contrast, in systems with open response design, the network remains hidden and the student is challenged to generate the appropriate response without any cues from the system.

One of the first implementations of a closed response design was the Voice Interactive Language Instruction System (VILIS) developed at SRI. This system elicits spoken student responses by presenting queries about graphical displays of maps and charts. Students infer the right answers to a set of multiple-choice questions and produce spoken responses.

A more recent prototype currently under development at SRI is the Voice Interactive Language Training System (VILTS), a system designed to foster speaking and listening skills for beginning through advanced L2 learners of French. The system incorporates authentic, unscripted conversational materials collected from French speakers into an engaging, flexible, and user-centered lesson architecture. The system deploys speech recognition to guide students through the lessons and automatic pronunciation scoring to provide feedback on the fluency of student responses. As far as we know, only the pronunciation scoring aspect of the system has been validated in experimental trials.

In pedagogically more sophisticated systems, the query-response mode is highly contextualized and presented as part of a simulated conversation with a virtual interlocutor. To stimulate student interest, closed response queries are often presented in the form of games or goal-driven tasks. One commercial system that exploits the full potential of this design is TraciTalk, a voice-driven multimedia CALL system aimed at more advanced ESL learners. In a series of loosely connected scenarios, the system engages students in solving a mystery. Prior to each scenario, students are given a task (e.g., eliciting a certain type of information), and they accomplish this task by verbally interacting with characters on the screen. Each voice interaction offers several possible responses, and each spoken response moves the conversation in a slightly different direction. There are many paths through each scenario, and not every path yields the desired information. This motivates students to return to the beginning of the scene and try out a different interrogation strategy. Moreover, TraciTalk features an agent that students can ask for assistance and that accepts spoken commands for navigating the system. Apart from being more fun and interesting, games and task-oriented programs implicitly provide positive feedback by giving students the feeling of having solved a problem solely by communicating in the target language.

The speech recognition technology underlying closed response query implementations is very simple, even in the more sophisticated systems. For any given interaction, the task perplexity is low and the vocabulary size is comparatively small. As a result, these systems tend to be very robust. Recognition accuracy rates in the low to upper 90% range can be expected depending on task definition, vocabulary size, and the degree of non-native disfluency.

The basic principle of an open response design is that students have to come up with a response entirely on their own, without any help from the system. Such systems present a greater challenge to the student and consequently lend themselves to pedagogically more ambitious implementations. Internally, however, systems of this type process students' responses as if they were selected from a multiple-choice list (Waters, 1994). As a minimum, all possible correct responses must be included in the grammar network. If, in addition, the system is supposed to provide detailed feedback to incorrect or questionable input, any potential mistakes must be modeled and anticipated in the grammar network. An open response design can be either very simple or dauntingly complex. While it is easy to implement an open response design for simple question-answer drills (e.g., "What's the color of grass?"), designing a system capable of holding up a prolonged conversation on "How do I get to the train station?" requires a multi-level network grammar based on data collected from students, natural language processing capabilities, and strategies for recovering from misunderstandings. In the following, we provide a sense of the range of possibilities associated with this type of CALL design.
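For a simple open response drill of the kind mentioned above, the grammar can enumerate both the correct answers and a handful of anticipated learner errors, each tied to targeted feedback. The sketch below uses invented example sentences and feedback messages.

```python
# Sketch of an open-response grammar for one question-answer drill:
# correct answers plus anticipated errors mapped to targeted feedback.
RESPONSE_GRAMMAR = {
    "what's the color of grass": {
        "correct": {"grass is green", "it is green", "green"},
        "anticipated_errors": {
            "grass are green": "Use 'is' with the singular noun 'grass'.",
            "it green": "Remember the verb: say 'it is green'.",
        },
    },
}

def give_feedback(question, recognized_answer):
    """Map a recognized answer to praise, targeted feedback, or a retry prompt."""
    entry = RESPONSE_GRAMMAR[question]
    answer = recognized_answer.lower().strip()
    if answer in entry["correct"]:
        return "Correct!"
    if answer in entry["anticipated_errors"]:
        return entry["anticipated_errors"][answer]
    return "Sorry, I did not understand. Please try again."

print(give_feedback("what's the color of grass", "Grass are green"))
```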

4.Conclusion

In the preceding sections, we reviewed the current state of speech technology and introduced a number of research prototypes that illustrate the range of speech-enabled CALL applications that are currently technically and pedagogically feasible. With the exception of a few exploratory open response dialog systems, most of these systems are designed to teach and evaluate linguistic form (pronunciation, fluency, vocabulary study, or grammatical structure). This is no coincidence. Formal features can be clearly identified and integrated into a focused task design, which means that robust performance can be expected. Furthermore, mastering linguistic form remains an important component of L2 instruction, despite the emphasis on communication (Holland, 1995). Prolonged, focused practice of a large number of items is still considered an effective means of expanding and reinforcing linguistic competence (Waters, 1994). However, such practice is time consuming. CALL can automate these aspects of language training, thereby freeing up valuable class time that would otherwise be spent on drills.

5.References

[1] Egan, K. (1996). Speech recognition application to language learning: Echos. Proceedings of CALICO, July.

[2] Ehsani, F. (1996). Air traffic control task for Japanese (Tech. Rep. No. 7-96). Menlo Park, CA: Entropic, Inc.

[3] Ehsani, F., Bernstein, J., Najmi, A., & Todic, O. (1997). Subarashii: Japanese interactive spoken language education. Proceedings of Eurospeech, Sept., 681-684.

[4] Ehsani, F., Bernstein, J., & Najmi, A. (in press). An interactive dialog system for learning Japanese. Speech Communication.

[5] Franco, H., Neumeyer, L., Kim, Y., & Ronen, O. (1997). Automatic pronunciation scoring for language instruction. Proceedings of ICASSP, April, 1471-1474.

[6] Haskin, D. (1997, September 23). Voice recognition reaches new height with Dragon NaturallySpeaking. PC Magazine, 16.

[7] Higgins, J. (1988). Language learners and computers: Human intelligence and artificial unintelligence. Singapore: Longman Group.

[8] Hiller, S., Rooney, E., Vaughan, R., Eckert, M., Laver, J., & Jack, M. (1994). An automated system for computer-aided pronunciation learning. Computer Assisted Language Learning, 7(1), 51-63.

[9] Holland, M. (1995). The case for intelligent CALL. In M. Holland, J. D. Kaplan, & M. R. Sams (Eds.), Intelligent language tutors: Theory shaping technology. Mahwah, NJ: Lawrence Erlbaum Associates.

[10] Hubbard, P. (1988). An integrated framework for CALL courseware evaluation. CALICO Journal, Dec., 51-72.

[11] James, E. (1976). The acquisition of prosodic features of speech using a speech visualizer. International Review of Applied Linguistics, 14, 227-243.
