

Lecture 8. Comparative study with human speech translation capability

It is extremely difficult to evaluate the accuracy of speech translation theoretically. If the speech synthesis module is excluded from the evaluation, the system is evaluated by feeding a number of test sentences into it and assessing the quality of the output. In this sense, the method for evaluating speech translation is essentially the same as that for evaluating automatic text translation. For speech translation, however, the utterances being evaluated are not strings of text but speech.

Two methods are used to evaluate translation quality: one method where the translations are manually given subjective ratings on a five-point scale, and another that compares the similarity between the output of the system and previously prepared reference translations. A number of rating scales have been proposed for the latter, including BLEU, NIST, and word error rate (WER). Recently, these scales have come to be widely used. Since these results are simple numerical values, it is possible to use them to compare two different systems. What these scores cannot answer, however, is how the system with the higher score will perform in the real world.
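Of the automatic scales mentioned above, word error rate is the simplest to make concrete. The sketch below is a minimal illustration rather than the exact tooling used in these evaluations: it computes WER as the word-level Levenshtein distance between a hypothesis and a single reference translation.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance between the hypothesis
    and the reference, normalized by the reference length."""
    r, h = reference.split(), hypothesis.split()
    # Dynamic-programming table for the Levenshtein distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(r)][len(h)] / len(r)

# Two word errors (two deletions) against a six-word reference: WER = 1/3.
print(wer("the hotel is near the station", "the hotel near station"))
```

Lower is better; note that, unlike a similarity score, WER is not bounded by 1, since insertions can push the edit distance past the reference length.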

A method has been proposed to resolve this issue by estimating system performance in human terms, namely by estimating the system's equivalent Test of English for International Communication (TOEIC) score. First, native speakers of Japanese with known TOEIC scores ("TOEIC takers") listen to test Japanese sentences and are asked to translate them into spoken English. Next, the TOEIC takers' translations are compared against the output of the speech-translation system by Japanese-English bilingual evaluators. The human win rate is then calculated as the proportion of test sentences for which the human's translation is better. Once the human win rate has been calculated for all TOEIC takers, regression analysis is used to estimate the TOEIC score of the speech-translation system. Figure 2 shows system performance converted into TOEIC scores. When using relatively short utterances like those in basic travel conversation (BTEC), the speech-translation system is nearly always accurate. On conversational speech (MAD and FED), however, its performance is equivalent to a TOEIC score of 600 for the Japanese speakers. Furthermore, performance drops significantly on long, rare, or complex utterances. There is thus still room for improvement in performance.
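The win-rate-to-TOEIC conversion described above can be sketched as a simple linear regression. The taker scores and win rates below are invented for illustration, and the names (`fit_line`, `human_win_rates`, and so on) are hypothetical, not those of the cited study:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical data: each TOEIC taker's score, and the fraction of test
# sentences on which bilingual evaluators preferred that taker's
# translation over the system's output (the "human win rate").
toeic_scores = [450, 550, 650, 750, 850]
human_win_rates = [0.30, 0.42, 0.55, 0.63, 0.78]

a, b = fit_line(toeic_scores, human_win_rates)
# The system's estimated TOEIC score is the point where the regression
# line crosses a 50% win rate, i.e. where humans and system break even.
system_toeic = (0.5 - b) / a
print(round(system_toeic))
```

For this toy data the break-even point lands around 619; with real data, one regression per evaluation corpus (BTEC, MAD, FED) yields the per-task scores reported in Figure 2.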

Field experiments using a speech-translation device. A field experiment was conducted in downtown Kyoto from 30 July to 24 August 2007, with the objective of evaluating the characteristics of communication mediated by a stand-alone speech-translation device about the size of a personal organizer, and of evaluating the device's usability. The field experiment was set up as follows, in order to minimize the restrictions on the test subjects:

(1) The people with whom the subjects conversed were not selected ahead of time, in order to collect a diverse range of expressions while using the speech translation device in realistic travel situations, such as transportation, shopping, and dining.

(2) Although the subjects were told the purpose of the dialog ahead of time, no restrictions were placed on the exact destination or proper names of items to purchase.

(3) Subjects were allowed to change the topic freely depending on the flow of the conversation.

(4) Subjects were allowed to move to different locations as appropriate, in accordance with the task.

(5) No time limit was placed on single dialogs.

In the case of transportation, the objective was considered to have been met if the subject was able to obtain information about the destination or to actually travel there. For shopping and dining, the objective was met if the subject completed the purchase of the article or the meal and received a receipt. In addition to quantitative evaluations of speech recognition rates, dialog response rates, and translation rates, the experiment also evaluated the level of understanding based on questionnaires. As shown in Figure 3, in the evaluation of the level of understanding of 50 native English speakers, about 80% said that the other person understood nearly everything that they said, and over 80% said they understood at least half of what the other person said. This result suggests that the performance of speech-translation devices could be sufficient for communication.

Lecture 9. Worldwide trends in research and development

International evaluation workshops give a strong boost to the development of speech-translation technologies. An international evaluation workshop is a kind of contest: the organizers provide a common dataset, and the participating research institutes compete by building systems that are evaluated quantitatively. The strengths and weaknesses of the proposed algorithms are judged from the evaluation results, and the top algorithms are then widely adopted in subsequent research and development. This allows research institutes to work both competitively and cooperatively, making research more efficient. Some representative international evaluation workshops are presented here, along with the automatic evaluation technologies that support this competitive style of research.

(a) The International Workshop on Spoken Language Translation (IWSLT) is organized by C-STAR, an international consortium for speech-translation research whose members include ATR in Japan, CMU in the United States, the Institute for Research in Science and Technology (IRST) in Italy, the Chinese Academy of Sciences (CAS), and the Electronics and Telecommunications Research Institute (ETRI) in Korea. The workshop has been held since 2004; the number of participating institutes grows every year, and it has become a core event in speech-translation research. The subject of the workshop is the speech translation of travel conversations from Japanese, Chinese, Spanish, Italian, and other languages into English. Two distinguishing features of the IWSLT are that it is aimed at peaceful uses (travel conversation) and that translation accuracy is fairly good, because the task is compact.

(b) Global Autonomous Language Exploitation (GALE) [8] is a project of the US Defense Advanced Research Projects Agency (DARPA). It is closed and non-public, and US $50 million is invested in the project per year. The purpose of the project is to translate Arabic and Chinese text and speech into English and to extract intelligence from them. A large number of institutions are divided into three teams that compete over performance. The teams operate in fiscal-year units in which targets are assigned, and every year performance is evaluated by outside institutions. In the United States, research on automatic translation is currently strongly dependent on DARPA budgets, and the inclinations of the US Department of Defense are strongly reflected in it.

Methods for evaluating translation quality have become a major point of debate at these workshops. Translation quality has various aspects, such as fluency and adequacy, and evaluating it has been considered a highly knowledge-intensive task. A recently proposed evaluation method called BLEU can automatically calculate evaluation scores that correlate strongly with subjective human judgments. This makes it possible to develop and evaluate systems repeatedly in short cycles, at little cost in time or money, which has made translation research and development much more efficient.
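As a rough illustration of how BLEU-style scoring works, here is a minimal, unsmoothed sentence-level sketch; it is not the exact formulation or tooling used at these workshops, which aggregate statistics over a whole test corpus and usually apply smoothing:

```python
from collections import Counter
import math

def sentence_bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Minimal sentence-level BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty. No smoothing."""
    ref, hyp = reference.split(), hypothesis.split()
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # "Modified" precision: clip each hypothesis n-gram count by its
        # count in the reference, so repeated words are not over-rewarded.
        overlap = sum(min(count, ref_ngrams[g]) for g, count in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # any zero precision collapses the geometric mean
        log_precision += math.log(overlap / total) / max_n
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_precision)
```

A perfect match scores 1.0, and the score falls toward 0 as n-gram overlap shrinks; the hard zero on any missing n-gram order is exactly why corpus-level aggregation or smoothing is needed for short sentences.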

References: