Abstract
Code-switching, the interleaving of two or more languages within a sentence or discourse, is pervasive in multilingual societies. Accurate language models for code-switched text are critical for NLP tasks. State-of-the-art data-intensive neural language models are difficult to train well from scarce language-labeled code-switched text. A potential solution is to use deep generative models to synthesize large volumes of realistic code-switched text. Although generative adversarial networks and variational autoencoders can synthesize plausible monolingual text from a continuous latent space, they cannot adequately handle code-switched text, owing to its informal style and the complex interplay between the constituent languages. We introduce VACS, a novel variational autoencoder architecture specifically tailored to code-switching phenomena. VACS encodes to and decodes from a two-level hierarchical representation, which models syntactic contextual signals in the lower level and language-switching signals in the upper level. Decoding representations sampled from the prior produces well-formed, diverse code-switched sentences. Extensive experiments show that augmenting natural monolingual data with synthetic code-switched text yields a significant (33.06%) drop in perplexity.
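The abstract only sketches the two-level hierarchical generative process. As a toy illustration (not the paper's actual model: the dimensions, the linear conditioning map, the vocabulary, and the scoring decoder here are all hypothetical stand-ins), sampling could proceed top-down, drawing an upper-level language-switching latent first and conditioning the lower-level syntactic latent on it before decoding:

```python
import random

rng = random.Random(0)

def sample_two_level_prior(d_lang=4, d_syn=8):
    # Upper level: language-switching signal, drawn from a standard normal prior.
    z_lang = [rng.gauss(0.0, 1.0) for _ in range(d_lang)]
    # Lower level: syntactic/contextual signal conditioned on z_lang via a
    # hypothetical linear map plus Gaussian noise (illustrative only).
    w = [[rng.gauss(0.0, 0.1) for _ in range(d_lang)] for _ in range(d_syn)]
    z_syn = [sum(wi * zi for wi, zi in zip(row, z_lang)) + rng.gauss(0.0, 1.0)
             for row in w]
    return z_lang, z_syn

def decode(z_lang, z_syn, vocab=("this", "movie", "ekdum", "mast", "hai")):
    # Toy stand-in for the decoder: score each token from both latents and
    # emit the top three; the real decoder is a neural language model that
    # generates a full code-switched sentence autoregressively.
    z = z_lang + z_syn
    scores = {tok: sum(rng.gauss(0.0, 1.0) * zi for zi in z) for tok in vocab}
    return sorted(vocab, key=lambda t: -scores[t])[:3]

z_lang, z_syn = sample_two_level_prior()
print(decode(z_lang, z_syn))
```

The point of the hierarchy is that switching behavior (which language appears where) varies on a coarser scale than word-level syntax, so it gets its own upper-level latent that the lower level is conditioned on.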
Citation
Samanta, B., Reddy, S., Jagirdar, H., Ganguly, N., & Chakrabarti, S. (2019). A deep generative model for code-switched text. In IJCAI International Joint Conference on Artificial Intelligence (Vol. 2019-August, pp. 5175–5181). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2019/719