Language identification determines the language of a given document. Categorizing content by language can improve search results for multilingual documents. In this work, we classify each line of text into a particular language, focusing on short phrases of 2–6 words across 15 Indian languages. The method detects whether a given document is multilingual and identifies the Indian languages it contains. The approach combines an n-gram technique with a list of short distinctive words. The n-gram model is language independent, whereas the short-word method requires less computation. The results show the effectiveness of our approach on synthetic data.
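The combination described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the two-language toy word lists, the training phrases, and the decision rule (try the short-word lookup first, fall back to character n-gram scoring when the lookup is inconclusive) are all assumptions made for illustration. The abstract does not specify how the two methods are combined, what n is, or how profiles are built.

```python
from collections import Counter

# Hypothetical toy data for two languages; the paper covers 15 Indian
# languages and its actual word lists and training text are not given here.
SHORT_WORDS = {
    "hindi": {"है", "और", "यह"},
    "marathi": {"आहे", "आणि", "हा"},
}
TRAIN = {
    "hindi": ["यह एक किताब है", "राम और श्याम"],
    "marathi": ["हा एक मुलगा आहे", "राम आणि श्याम"],
}


def char_ngrams(text, n=3):
    """Return character n-grams of the text, padded with one space per side."""
    padded = f" {text.strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_profile(samples, n=3):
    """Map each n-gram seen in the samples to its relative frequency."""
    counts = Counter()
    for sample in samples:
        counts.update(char_ngrams(sample, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def ngram_score(text, profile, n=3):
    """Average profile frequency of the text's n-grams (0.0 if none match)."""
    grams = char_ngrams(text, n)
    if not grams:
        return 0.0
    return sum(profile.get(gram, 0.0) for gram in grams) / len(grams)


def identify_line(line, short_words, profiles, n=3):
    """Assumed rule: short-word lookup first, n-gram scoring as fallback."""
    tokens = line.split()
    hits = {lang: sum(tok in words for tok in tokens)
            for lang, words in short_words.items()}
    best = max(hits.values(), default=0)
    winners = [lang for lang, count in hits.items() if count == best]
    if best > 0 and len(winners) == 1:
        return winners[0]  # cheap path: a distinctive short word decided it
    # No hit or a tie: fall back to the character n-gram profiles.
    return max(profiles, key=lambda lang: ngram_score(line, profiles[lang], n))


def identify_document(text, short_words, profiles, n=3):
    """Label each non-empty line; more than one distinct label means multilingual."""
    return [identify_line(line, short_words, profiles, n)
            for line in text.splitlines() if line.strip()]


PROFILES = {lang: build_profile(samples) for lang, samples in TRAIN.items()}

if __name__ == "__main__":
    labels = identify_document("यह घर है\nहा मुलगा आहे", SHORT_WORDS, PROFILES)
    print(labels, "multilingual:", len(set(labels)) > 1)
```

In this sketch the short-word lookup costs only a set membership test per token, which reflects the abstract's point that it needs less computation, while the n-gram profiles require no language-specific resources beyond sample text. A real system would need far larger training text and a more careful scoring and tie-breaking scheme.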
CITATION STYLE
Bhaskaran, S., Paul, G., Gupta, D., & Amudha, J. (2021). Indian language identification for short text. In Advances in Intelligent Systems and Computing (Vol. 1086, pp. 47–58). Springer. https://doi.org/10.1007/978-981-15-1275-9_5