Iterative rule segmentation under minimum description length for unsupervised transduction grammar induction

Abstract

We argue that for purely incremental unsupervised learning of phrasal inversion transduction grammars, a minimum description length driven, iterative top-down rule segmentation approach that is the polar opposite of Saers, Addanki, and Wu's previous 2012 bottom-up iterative rule chunking model yields significantly better translation accuracy and grammar parsimony. We still aim for unsupervised bilingual grammar induction such that training and testing are optimized upon the same exact underlying model, a basic principle of machine learning and statistical prediction that has been unduly ignored in statistical machine translation models of late, where most decoders are badly mismatched to the training assumptions. Our novel approach learns phrasal translations by recursively subsegmenting the training corpus, as opposed to our previous model, in which we started with a token-based transduction grammar and iteratively built larger chunks. Moreover, the rule segmentation decisions in our approach are driven by a minimum description length objective, whereas the rule chunking decisions were driven by a maximum likelihood objective. We demonstrate empirically how this trades off maximum likelihood against model size, aiming for a more parsimonious grammar that escapes the perfect overfitting to the training data that we start out with, and gradually generalizes to previously unseen sentence translations so long as the model shrinks enough to warrant a looser fit to the training data. Experimental results show that our approach produces a significantly smaller and better model than the chunking-based approach. © 2013 Springer-Verlag.
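The tradeoff the abstract describes can be made concrete with a toy sketch of the minimum description length criterion. In this hedged illustration (the rule encoding, bit costs, and numbers below are invented for exposition, not taken from the paper), the total description length is the bits needed to encode the grammar plus the bits needed to encode the corpus given the grammar, and a candidate rule segmentation is accepted only if it lowers that total:

```python
def model_bits(rules, bits_per_symbol=4.0):
    # Cost of encoding the grammar itself: each rule is a sequence of
    # symbols, each costing a fixed number of bits (a toy uniform prior).
    return sum(len(rhs) * bits_per_symbol for rhs in rules)

def total_dl(rules, corpus_neg_log_prob_bits):
    # Total description length = bits for the model
    # + bits for the corpus under the model (its negative log-likelihood).
    return model_bits(rules) + corpus_neg_log_prob_bits

# Toy example: two long phrasal rules share the subphrase ("a", "b").
# Segmenting them into shorter, reusable rules shrinks the model,
# at the cost of a slightly looser fit to the training data.
before = total_dl([("a", "b", "c", "d"), ("a", "b", "e", "f")],
                  corpus_neg_log_prob_bits=100.0)   # 32 + 100 = 132
after = total_dl([("a", "b"), ("c", "d"), ("e", "f")],
                 corpus_neg_log_prob_bits=103.0)    # 24 + 103 = 127
accept = after < before  # MDL criterion: accept if total DL drops
```

Iterating this acceptance test top-down over candidate segmentations is what lets the grammar shrink away from perfect overfitting while the data-encoding term keeps the fit from loosening too far.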

Citation (APA)

Saers, M., Addanki, K., & Wu, D. (2013). Iterative rule segmentation under minimum description length for unsupervised transduction grammar induction. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7978 LNAI, pp. 224–235). https://doi.org/10.1007/978-3-642-39593-2_20
