Challenge 1 – Ambiguity
o The multiple meanings of a word may not coincide across two languages. The English word "mouse" refers to both an animal and an electronic device, whereas Chinese uses separate words for the two senses. Choosing the wrong translation variant is a potential cause of miscomprehension.
Figure 3: Translation ambiguity example. Source side (English): "... to buy a mouse". Target side (Chinese words): 老鼠 (an animal); 鼠标 / 滑鼠 (an electronic device).
(16/84)
Challenge 2 – Language Style
News Domain
o Tries to deliver rich information in very economical language.
o Short, simply structured sentences make the text easy to understand.
o Many abbreviations, dates, and named entities.
China's Li Duihong won the women's 25-meter sport pistol Olympic gold with a total of 687.9 points early this morning Beijing time. (Guangming Daily, 1996/07/02)
我国女子运动员李对红今天在女子运动手枪决赛中,以687.9环战胜所有对手,并创造新的奥运记录。(《光明日报》1996年7月2日)
(17/84)
Challenge 2 – Language Style
Law Domain
o Very rigorous, even at the cost of repeating terms.
o Uses few pronouns and abbreviations, to avoid any ambiguity.
o High frequency of the words shall, may, must, and be to.
o Long sentences with long subordinate clauses.
When an international treaty that relates to a contract and which the People's Republic of China has concluded or participated in has provisions that differ from the law of the People's Republic of China, the provisions of the said treaty shall be applied, but with the exception of clauses to which the People's Republic of China has declared reservation.
中华人民共和国缔结或者参加的与合同有关的国际条约同中华人民共和国法律有不同规定的,适用该国际条约的规定。但是,中华人民共和国声明保留的条款除外。
(18/84)
Challenge 3 – Out-Of-Vocabulary
o Terminology: words or phrases that mainly occur in specific contexts with specific meanings.
o Variants, increasing, combination, etc.
Figure 4: Out-of-Vocabulary example. Distribution chart with two shares, 91.64% and 8.36%; legend labels as extracted: Entity, New Words, Terminology, Dissect, Culture items, Common words/Phrase. Terminology example: BHT = 2,6-二叔丁基-4-甲基苯酚 (2,6-di-tert-butyl-4-methylphenol).
(19/84)
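The out-of-vocabulary problem can be quantified as the share of test tokens that never occur in the training data. The following is a minimal sketch under stated assumptions: the function name `oov_rate`, the whitespace tokenization, and the toy sentences are illustrative and do not come from the slides, and the result is unrelated to the percentages in Figure 4.

```python
def oov_rate(train_tokens, test_tokens):
    """Return the fraction of test tokens that never appear in the training tokens."""
    vocab = set(train_tokens)
    if not test_tokens:
        # An empty test set has no unseen tokens by convention.
        return 0.0
    unseen = [tok for tok in test_tokens if tok not in vocab]
    return len(unseen) / len(test_tokens)


# Hypothetical toy data: the term "BHT" is absent from the training text.
train = "the patient took the medicine".split()
test = "the patient took BHT".split()
rate = oov_rate(train, test)
print(f"OOV rate: {rate:.2%}")
```

Counting tokens rather than distinct word types weights frequent unknown words more heavily; a type-based rate would apply `set()` to the test tokens first.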
Domain Adaptation
o Because SMT is corpus-driven, how closely the domain of the training data matches that of the test data is a significant factor that cannot be ignored.
o In practice, the domain of the available training data often does not match the target domain.
o Unfortunately, training resources in specific domains are usually relatively scarce.
o In such scenarios, various domain adaptation techniques are employed to improve domain-specific translation quality by leveraging general-domain data.
(20/84)
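One common way to combine scarce in-domain data with general-domain data is to linearly interpolate the two models' translation probabilities. The sketch below illustrates that idea only; it is not necessarily the technique these slides go on to describe. The function `interpolate`, the weight `lam`, and all probability values are hypothetical, reusing the "mouse" example from Figure 3.

```python
def interpolate(p_in, p_gen, lam):
    """Mix two translation distributions: lam * p_in + (1 - lam) * p_gen."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    candidates = set(p_in) | set(p_gen)
    return {
        t: lam * p_in.get(t, 0.0) + (1.0 - lam) * p_gen.get(t, 0.0)
        for t in candidates
    }


# Hypothetical p(target | "mouse") from a general-domain and an IT-domain model.
p_general = {"老鼠": 0.7, "鼠标": 0.3}
p_it = {"老鼠": 0.1, "鼠标": 0.9}

mixed = interpolate(p_it, p_general, lam=0.8)
best = max(mixed, key=mixed.get)
print(best, round(mixed[best], 2))
```

With a weight favoring the in-domain model, the device sense wins even though the general-domain model prefers the animal sense. In practice the weight would be tuned on held-out in-domain data rather than fixed by hand.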