Machine Transliteration Survey
SARVNAZ KARIMI, NICTA and The University of Melbourne
FALK SCHOLER and ANDREW TURPIN, RMIT University, Melbourne

Machine transliteration is the process of automatically transforming the script of a word from a source language to a target language, while preserving pronunciation. The development of algorithms specifically for machine transliteration began over a decade ago based on the phonetics of source and target languages, followed by approaches using statistical and language-specific methods. In this survey, we review the key methodologies introduced in the transliteration literature. The approaches are categorized based on the resources and algorithms used, and their effectiveness is compared.

Categories and Subject Descriptors: A.1 [General Literature]: Introductory and Survey
General Terms: Algorithms, Experimentation, Languages
Additional Key Words and Phrases: Automatic translation, machine learning, machine transliteration, natural language processing, transliteration evaluation

ACM Reference Format:
Karimi, S., Scholer, F., and Turpin, A. 2011. Machine transliteration survey. ACM Comput. Surv. 43, 3, Article 17 (April 2011), 46 pages. DOI = 10.1145/1922649.1922654 http://doi.acm.org/10.1145/1922649.1922654

1. INTRODUCTION

Machine translation (MT) is an essential component of many multilingual applications, and a highly demanded technology in its own right. In today's global environment, the main applications that require MT include cross-lingual information retrieval and cross-lingual question-answering. Multilingual chat applications, talking translators, and real-time translation of emails and websites are some examples of the modern commercial applications of machine translation. Conventionally, dictionaries have aided human translation, and have also been used for dictionary-based machine translation.
NICTA is funded by the Australian government as represented by the Department of Broadband, Communication and Digital Economy, and the Australian Research Council through the ICT Centre of Excellence Programme. This work is also supported by an ARC discovery grant (A. Turpin). This work is based on S. Karimi's Ph.D. thesis completed at RMIT University, Melbourne, Australia.
Authors' addresses: S. Karimi, National ICT Australia, Victoria Research Laboratory, The University of Melbourne, Parkville, Vic 3010, Australia; email: email@example.com; F. Scholer, School of Computer Science and Information Technology, RMIT University, Melbourne, Vic 3001, Australia; email: firstname.lastname@example.org; A. Turpin, School of Computer Science and Information Technology, RMIT University, Melbourne, Vic 3001, Australia; email: email@example.com. Corresponding author: Sarvnaz Karimi, firstname.lastname@example.org.

While typical dictionaries contain around 50,000 to 150,000 entries, in practice, many more words can be found in texts. For example, a collection of Associated Press newswire text collected over 10 months has
© 2011 ACM 0360-0300/2011/04-ART17 $10.00 DOI 10.1145/1922649.1922654 http://doi.acm.org/10.1145/1922649.1922654
ACM Computing Surveys, Vol. 43, No. 3, Article 17, Publication date: April 2011.
44 million words comprising 300,000 distinct English words.¹ The "out-of-dictionary" terms are typically names, such as companies, people, places, and products [Dale 2007]. In such cases transliteration, where the out-of-dictionary words are spelled out in the target language, is necessary.

Machine transliteration emerged around a decade ago as part of machine translation to deal with proper nouns and technical terms that are translated with preserved pronunciation. Transliteration is a subfield of computational linguistics, and its language processing requirements make the nature of the task language-specific. Although many studies introduce statistical methods as a general-purpose solution both for translation and transliteration, many of the approaches introduced in the literature benefit from specific knowledge of the languages under consideration.

In this survey, we first introduce the key terminology and linguistic background useful for understanding the rest of the article. A general discussion of the challenges that automated transliteration systems face, including scripts of different languages, missing sounds, transliteration variants, and language of origin, follows the Key Concepts section (Section 2). Then, the specific terminology and formulations used throughout this survey are introduced. A description of the state-of-the-art machine transliteration methods follows the formulation section. Literature on transliteration falls into two major groups: generative transliteration and transliteration extraction. Generative transliteration focuses on algorithms that transliterate newly appearing terms that do not exist in any translation lexicon. Transliteration extraction, on the other hand, enriches the translation lexicon using existing transliteration instances from large multilingual corpora such as the Web, to reduce the requirement for on-the-fly transliteration.
This second category is also considered a method of extracting a large and up-to-date collection of transliterations from live resources such as the Web. We review both of these categories, with an emphasis on generative methods, as these constitute the core of transliteration technology. We also examine the evaluation procedure undertaken in these studies, and the difficulties that arise with the non-standard evaluation methodologies that are often used in the transliteration area.

2. KEY CONCEPTS

Some of the common linguistic background concepts and general terminology used throughout this survey are explained in this section.² More detailed information on writing systems, alphabets, and phonetics of different languages can be found in IPA (International Phonetic Alphabet³) publications, available for all the existing languages.

¹ The statistics are on words with no pre-processing such as lemmatization.
² Definitions are based on Crystal [2003, 2006].
³ http://www.omniglot.com/writing/ipa.htm

Phonetics and Phonology. Phonetics is the study of the sounds of human speech, and is concerned with the actual properties of speech sounds, their production, audition, and perception. Phonetics deals with sounds independently, rather than with the contexts in which they are used in languages. Phonology, on the other hand, studies sound systems and abstract sound units, such as phonemics, distinctive features, phonotactics, and phonological rules. Phonology, therefore, is language-specific, while phonetic definitions apply across languages. The phonetic representation of a sound is shown using [ ], and phonemes are represented by / /. For example, the phonetic version of both the Persian letter "پ" and the English letter "p" is [p].

Phoneme. A phoneme is the smallest unit of speech that distinguishes meaning. Phonemes are the important units of each word, and substituting them causes the
meaning of a word to change. For example, if we substitute the sound [b] with [p] in the word "big" [bɪg], the word changes to "pig". Therefore /b/ is a phoneme. Note that the smallest physical segment of sound is called a phone. In other words, phones are the physical realization of phonemes. Also, the phonic varieties of a phoneme are called allophones. Some transliteration algorithms use phonemes to break down words into their constituent parts, prior to transliteration (explained in Section 5.1, Phonetic-based transliteration systems).

Grapheme. A grapheme is the fundamental unit in written language. It includes alphabetic letters, Chinese characters, numerals, punctuation marks, and all the individual symbols of any writing system. In a phonemic orthography, a grapheme corresponds to one phoneme. In spelling systems that are nonphonemic, such as the spellings used most widely for written English, multiple graphemes may represent a single phoneme. These are called digraphs (two graphemes for a single phoneme) and trigraphs (three graphemes). For example, the word "ship" contains four graphemes (s, h, i, and p) but only three phonemes, because "sh" is a digraph. In Section 5.2, transliteration methods that use the grapheme concept are introduced (spelling-based transliteration systems).

Syllable. A syllable is a unit of pronunciation. A syllable is generally larger than a single sound and smaller than a word. Typically, a syllable is made up of a syllable peak, which is often a vowel, with optional initial and final margins, which are mostly consonants.

Writing system. A writing system is a symbolic system for representing expressible elements or statements in language.
A writing system has four sets of specifications:
(1) a set of defined symbols that are individually called characters or graphemes, and collectively called a script;
(2) a set of rules and conventions which arbitrarily assign meaning to the graphemes, their ordering, and relations, and are understood and shared by a community;
(3) a language, whose constructions are represented and recalled by the interpretation of these elements and rules; and
(4) some physical means of distinctly representing the symbols by application to a permanent or semi-permanent medium, so that they may be interpreted.

There are four distinct writing systems, called logographic, syllabic, featural, and alphabetic or segmental. The writing systems of some languages fall into only one of these categories; however, other languages use more than one of these systems.
(1) Logographic writing systems use logograms, where a single written character is used to represent a complete grammatical word. Most Chinese characters are logograms.
(2) Syllabic writing systems define a syllabary as a set of written symbols that represent or approximate the syllables that constitute words. Symbols in a syllabary typically represent either a consonant sound followed by a vowel sound, or a single vowel. The Japanese writing system falls into this category.
(3) Featural writing systems contain symbols that do not represent whole phonemes, but rather the elements or features that collectively constitute the phonemes. The only prominent featural writing system is Korean Hangul. Hangul has three levels of phonological representation: featural symbols, alphabetic letters (combined features), and syllabic blocks (combined letters).
(4) Alphabetic or segmental writing systems possess an alphabet, that is, a small set of letters or symbols each of which represents a phoneme of a spoken language. The Arabic and Latin writing systems are segmental.
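The grapheme and digraph definitions given earlier in this section can be made concrete with a small sketch. The Python snippet below is an illustration only and is not taken from any surveyed system: it greedily groups the letters of an English word into units that each stand for one phoneme. The table `MULTIGRAPHS` and the function name `segment_graphemes` are hypothetical, and the table is deliberately tiny; real English orthography needs far richer, context-dependent rules.

```python
# Illustrative only: a tiny, hand-picked table of English digraphs and
# trigraphs (an assumption for this sketch, not a complete inventory).
MULTIGRAPHS = ("tch", "sh", "ch", "th", "ph", "kh")


def segment_graphemes(word):
    """Greedily split a word into units that each map to one phoneme.

    Longer multigraphs are tried first, so "tch" wins over "ch".
    """
    units = []
    i = 0
    lowered = word.lower()
    candidates = sorted(MULTIGRAPHS, key=len, reverse=True)
    while i < len(lowered):
        for multigraph in candidates:
            if lowered.startswith(multigraph, i):
                units.append(multigraph)
                i += len(multigraph)
                break
        else:
            # No multigraph matches here: the single letter is its own unit.
            units.append(lowered[i])
            i += 1
    return units


print(segment_graphemes("ship"))   # ['sh', 'i', 'p']
print(segment_graphemes("match"))  # ['m', 'a', 'tch']
```

For "ship" the four letters collapse into three units, matching the three phonemes counted in the grapheme definition.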
3. COMMON CHALLENGES IN MACHINE TRANSLITERATION

The main challenges that machine transliteration systems encounter can be divided into five categories: script specifications, missing sounds, transliteration variants, language of origin, and deciding whether to translate or transliterate a name (or part of it). While other specific challenges also arise, these are less generic and generally language-pair dependent. For example, in Chinese the character association in person names is gender-specific [Li et al. 2007], and in Japanese different impressions are conveyed by different Kanji ideograms, making the selection of correct strings for a name difficult [Xu et al. 2006]. We cover these other challenges on a study-by-study basis (Section 5 onwards).

3.1. Script Specifications

The possibility of different scripts between the source and target languages is the first hurdle that transliteration systems need to tackle. A script, as explained in Section 2, is a representation of one or more writing systems, and is composed of symbols used to represent text. All of the symbols have a common characteristic which justifies their consideration as a distinct set. One script can be used for several different languages; for example, Latin script covers all of Western Europe, and Arabic script is used for Arabic and for some non-Semitic languages written in the Arabic alphabet, including Persian, Urdu, Pashto, Malay, and Balti. On the other hand, some written languages require multiple scripts; for example, Japanese is written in at least three scripts: the Hiragana and Katakana⁴ syllabaries and the Kanji ideographs imported from China. Computational processing of such different language scripts requires awareness of the symbols comprising the language, for example, the ability to handle different character encodings.
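As a rough illustration of the script awareness just described, the Python sketch below labels each letter of a string with a script name taken from the first word of its Unicode character name. This is a heuristic assumed here for brevity, since Python's standard library does not expose the Unicode Script property directly, and the helper name `scripts_used` is hypothetical.

```python
import unicodedata


def scripts_used(text):
    """Return the set of script labels seen in text.

    Heuristic: the first word of a character's Unicode name
    (e.g. "LATIN", "ARABIC", "HIRAGANA", "KATAKANA", "CJK") is
    used as a stand-in for its script.
    """
    labels = set()
    for char in text:
        if not char.isalpha():
            continue  # skip digits, spaces, and punctuation
        name = unicodedata.name(char, "")
        if name:
            labels.add(name.split()[0])
    return labels


# One Japanese phrase mixing Kanji, Hiragana, and Katakana.
print(scripts_used("東京のホテル"))
# A Persian word and an English word.
print(scripts_used("پاپ"), scripts_used("pope"))
```

The Japanese phrase yields three labels, one per script, whereas the Persian and English words each yield a single label. A production system would more likely rely on a library that implements the Unicode Script property rather than on character names.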
While some scripts are written using separate characters (such as Latin), others introduce intermediate forms for characters that occur in the middle of a word. For example, in Persian script some letters change their form based on their position in the word. The character "پ" [p] is written "پـ" [p] when it appears at the beginning of a word, "ـپـ" [p] in the middle, and "ـپ" [p] at the end; however, this rule is sometimes violated when "پ" is adjoined to special letters such as "ا" in "پاپ" /pɒp/.

Another important aspect of a language script is the direction in which it is written. Some languages are written right-to-left (RTL), and some are written left-to-right (LTR). For example, Persian, Arabic, Hebrew, and Taana scripts are RTL, whereas the script of English and other languages that use the Latin alphabet is LTR.

In general, a transliteration system, which manipulates the characters of words, should be designed carefully to process the scripts of the source and target languages, taking all of the above-mentioned specifications into account. Figure 1 shows some transliteration examples in different languages with different scripts. Persian and Arabic examples are shown left-to-right at the character correspondence level to match their English versions.

In Section 5 in particular, we explain different transliteration methods that investigate different language pairs, and thus may opt for more phonetic-based or more orthographic methods. Particular language pairs may also lead to the introduction of engineering steps, such as preprocessing the data.

⁴ Katakana is a Japanese syllabary, one component of the Japanese writing system along with Hiragana, Kanji, and in some cases the Latin alphabet. The word Katakana means "fragmentary kana," as the Katakana scripts are derived from components of more complex Kanji.
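Two of the script properties discussed in this section, positional letter forms and writing direction, can be inspected with Python's standard `unicodedata` module. The sketch below is a minimal illustration under stated assumptions, not a method from the surveyed literature: the names `PE_FORMS`, `to_base_letters`, and `direction` are hypothetical, and classifying a word by its first strongly directional character is a simplification of the full Unicode bidirectional algorithm.

```python
import unicodedata

# Unicode "presentation form" code points for the Persian letter pe.
# These legacy code points encode one positional shape each; modern text
# normally stores only the base letter U+067E and lets the renderer shape it.
PE_FORMS = {
    "isolated": "\uFB56",
    "final": "\uFB57",
    "initial": "\uFB58",
    "medial": "\uFB59",
}


def to_base_letters(text):
    """Fold positional presentation forms back to base letters (NFKC)."""
    return unicodedata.normalize("NFKC", text)


def direction(word):
    """Classify a word as 'RTL' or 'LTR' from its first strong character."""
    for char in word:
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class in ("R", "AL"):
            return "RTL"
        if bidi_class == "L":
            return "LTR"
    return "LTR"  # assumed default when no strongly directional character is found


for position, form in PE_FORMS.items():
    print(position, unicodedata.name(form), "->", unicodedata.name(to_base_letters(form)))

print(direction("پاپ"), direction("pope"))
```

Folding presentation forms to a single base code point before alignment or training means that all four shapes of a letter are treated as the same transliteration unit, which is usually the desired behaviour when a system operates on characters.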
Fig. 1. Transliteration examples in four language pairs. Letter correspondence shows how the source and target letters are aligned, as they are the smallest transliteration units that correspond.

3.2. Missing Sounds

Different human languages have their own sound structure, and the symbols of the language script correspond to these sounds. If a sound has no letter of its own in a language, it is represented using a digraph or trigraph. For example, the English digraph "sh" corresponds to the sound [ʃ]. Cross-lingual sound translation (the function of transliteration) introduces new sounds to a target language, which the target language does not necessarily accommodate. That is, sounds cannot always be pronounced the same way as in their original language after being imported to the target language. Such sounds are conventionally substituted by a sequence of sound units, which in turn are rendered as a sequence of letters in the target language. For example, the sound [x] has no equivalent character in English and is reserved for foreign words. Many other languages support this sound, however. The equivalent Persian and Arabic letter for this sound is "خ" [x], which Persian speakers usually transliterate to the digraph "kh" in English, whereas in some other languages with Latin script, such as Czech, it is written as "ch". The same sound is the guttural rhotic, written with the character "r", in some French accents.

Transliteration systems need to learn (usually in their training step) both the convention of writing the missing sounds in each of the languages involved in isolation, and the convention of exporting the sounds from one language to the other.

3.3. Transliteration Variants

The evaluation process of transliteration is not straightforward.
Transliteration is a creative process that allows multiple variants of a source term to be valid, based on the opinions of different human transliterators. Different dialects in the same language can also lead to transliteration variants for a given source term. While gathering all possible variants for all of the words in one corpus is not feasible (simply because not all speakers of those languages can be called upon in the evaluation process), there is no particular standard for such a comparison, other than conventions developed