This chapter presents an application of machine learning methods that work at the character level. More precisely, several string kernels, one of which is based on Local Rank Distance, are combined to obtain state-of-the-art results for native language identification (NLI). A broad set of NLI experiments is conducted to compare the string kernels approach with other state-of-the-art methods on English, Arabic, and Norwegian corpora. In all the experiments, string kernels obtain better results than the state of the art, sometimes by a very large margin. For instance, there is a 32.3% improvement in accuracy over the state-of-the-art system when the systems based on string kernels are trained on the TOEFL11 corpus and tested on the TOEFL11-Big corpus. The results are even more impressive considering that the proposed approach is language independent and linguistic theory neutral. To gain additional insight into the string kernels approach, the features that the classifier selects as most discriminant are analyzed in this chapter. Because the features used by the proposed model are p-grams of various lengths, the analysis also offers information about localized language transfer effects.
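To make the notion of character p-gram features concrete, the following is a minimal sketch of a blended p-spectrum string kernel, which counts the character p-grams shared by two texts. It is an illustration only and is not the chapter's implementation: the function names, the choice of p values, and the normalization are assumptions, and the kernel based on Local Rank Distance is not shown.

```python
from collections import Counter


def char_pgrams(text, p):
    """Count every contiguous character p-gram in text (hypothetical helper)."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))


def spectrum_kernel(s, t, p_values=(2, 3)):
    """Sum, over each p, the dot product of the two p-gram count vectors.

    The p_values default is an assumption chosen for illustration.
    """
    total = 0
    for p in p_values:
        counts_s = char_pgrams(s, p)
        counts_t = char_pgrams(t, p)
        # Counter returns 0 for p-grams that do not occur in t.
        total += sum(n * counts_t[gram] for gram, n in counts_s.items())
    return total


def normalized_kernel(s, t, p_values=(2, 3)):
    """Scale the kernel so that a text compared with itself scores 1."""
    denom = (spectrum_kernel(s, s, p_values) * spectrum_kernel(t, t, p_values)) ** 0.5
    return spectrum_kernel(s, t, p_values) / denom if denom else 0.0
```

A kernel of this kind compares texts directly as character sequences, without tokenization or other language-specific preprocessing, which is consistent with the language-independent character of the approach described above.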
CITATION STYLE
Ionescu, R. T., & Popescu, M. (2016). Native Language Identification with String Kernels. In Advances in Computer Vision and Pattern Recognition (pp. 193–227). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-319-30367-3_8