We present a study on the text simplification operations undertaken collaboratively by Simple English Wikipedia contributors. The aim is to understand whether a complex-simple parallel corpus involving this version of Wikipedia is appropriate as data source to induce simplification rules, and whether we can automatically categorise the different operations performed by humans. A subset of the corpus was first manually analysed to identify its transformation operations. We then built machine learning models to attempt to automatically classify segments based on such transformations. This classification could be used, e.g., to filter out potentially noisy transformations. Our results show that the most common transformation operations performed by humans are paraphrasing (39.80%) and drop of information (26.76%), which are some of the most difficult operations to generalise from data. They are also the most difficult operations to identify automatically, with the lowest overall classifier accuracy among all operations (73% and 59%, respectively).
CITATION STYLE
Amancio, M. A., & Specia, L. (2014). An analysis of crowdsourced text simplifications. In Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations, PITR 2014 at the 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2014 (pp. 123–130). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/w14-1214
Mendeley helps you to discover research relevant for your work.