Learning to Scale Multilingual Representations for Vision-Language Tasks

Andrea Burns; Donghyun Kim; Derry Wijaya; Kate Saenko; Bryan A. Plummer

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2020) 12349 LNCS 197-213

DOI: 10.1007/978-3-030-58548-8_12

Abstract

Current multilingual vision-language models either require a large number of additional parameters for each supported language, or suffer performance degradation as languages are added. In this paper, we propose a Scalable Multilingual Aligned Language Representation (SMALR) that supports many languages with few model parameters without sacrificing downstream task performance. SMALR learns a fixed-size language-agnostic representation for most words in a multilingual vocabulary, keeping language-specific features for just a few. We use a masked cross-language modeling loss to align features with context from other languages. Additionally, we propose a cross-lingual consistency module that ensures predictions made for a query and its machine translation are comparable. The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date. We evaluate on multilingual image-sentence retrieval and outperform prior work by 3–4% with less than 1/5th the training parameters compared to other word embedding methods.
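To make the parameter argument in the abstract concrete, the following is a minimal sketch of a hybrid lookup: one shared, fixed-size table for most words plus separate vectors for a small set of language-specific words. It is not the SMALR implementation. The class `HybridEmbedding`, the helper `per_language_parameters`, and the use of a CRC32 hash to pick a shared slot are all assumptions made for illustration; the hash merely stands in for whatever learned mapping the paper uses and does not align translations.

```python
import random
import zlib


class HybridEmbedding:
    """Toy lookup: a shared fixed-size table plus a few language-specific vectors.

    Hypothetical illustration only; not the model described in the paper.
    """

    def __init__(self, shared_slots, dim, specific_words, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.shared_slots = shared_slots
        # Shared table: its size does not grow as languages are added.
        self.shared = [
            [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(shared_slots)
        ]
        # One private vector per (language, word) pair that is kept language-specific.
        self.specific = {
            (lang, word): [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for lang, word in specific_words
        }

    def lookup(self, lang, word):
        """Return the language-specific vector if one exists, else a shared one."""
        key = (lang, word)
        if key in self.specific:
            return self.specific[key]
        # Assumption: a deterministic hash stands in for a learned word-to-slot mapping.
        slot = zlib.crc32(word.encode("utf-8")) % self.shared_slots
        return self.shared[slot]

    def num_parameters(self):
        """Count embedding weights in the shared table and the specific vectors."""
        return (self.shared_slots + len(self.specific)) * self.dim


def per_language_parameters(vocab_sizes, dim):
    """Embedding weights needed if every language keeps its own full table."""
    return sum(vocab_sizes.values()) * dim


if __name__ == "__main__":
    emb = HybridEmbedding(
        shared_slots=1000,
        dim=8,
        specific_words=[("en", "bank"), ("de", "Bank")],
    )
    print(emb.num_parameters())
    print(per_language_parameters({"en": 5000, "de": 5000}, dim=8))
```

With the made-up sizes in the usage lines, the hybrid table holds (1000 + 2) × 8 weights, while two full per-language tables of 5000 words each hold 10000 × 8. The numbers are illustrative and say nothing about the figures reported in the paper.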

Cite (APA)

Burns, A., Kim, D., Wijaya, D., Saenko, K., & Plummer, B. A. (2020). Learning to Scale Multilingual Representations for Vision-Language Tasks. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12349 LNCS, pp. 197–213). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-58548-8_12
