Abstract
Most social media messages are written in languages other than English, but commonly used text mining tools were designed only for English. This paper introduces the Unicode Convolutional Neural Network (UnicodeCNN) for analyzing text written in any language. The UnicodeCNN does not require the language to be known in advance, allows the language to change arbitrarily mid-sentence, and is robust to the misspellings and grammatical mistakes commonly found in social media. We demonstrate the UnicodeCNN's effectiveness on the challenging task of content-based tweet geolocation using a dataset with 900 million tweets written in more than 100 languages. Whereas previous work restricted itself to predicting a tweet's country or city of origin (and only worked on tweets written in certain languages from highly populated cities), we predict the exact GPS locations of tweets (and our method works on tweets written in any language sent from anywhere in the world). We predict GPS coordinates using the mixture of von Mises-Fisher (MvMF) distribution. The MvMF exploits the Earth's spherical geometry to improve predictions, a task that previous work considered too computationally difficult. On English tweets, our model's predictions average more than 300km closer to the true location than previous work, and in other languages our model's predictions are up to 1500km more accurate. Remarkably, the UnicodeCNN can learn geographic knowledge in one language and automatically transfer that knowledge to other languages.
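To make the role of the MvMF concrete, the following is a minimal sketch of how a mixture of von Mises-Fisher densities can be evaluated on the unit sphere. It illustrates the distribution itself, not the authors' implementation. The function names, the example component centers, the weights, and the concentration values (`kappas`) are all assumptions chosen for this sketch; the abstract does not say how the model parameterizes or learns them.

```python
import math


def to_unit_vector(lat_deg, lon_deg):
    """Convert GPS coordinates in degrees to a 3D unit vector on the sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


def vmf_log_density(x, mu, kappa):
    """Log density of a von Mises-Fisher distribution on the 2-sphere.

    Uses f(x) = kappa / (4*pi*sinh(kappa)) * exp(kappa * mu.x), rewritten as
    4*pi*sinh(kappa) = 2*pi*exp(kappa)*(1 - exp(-2*kappa)) so that large
    kappa does not overflow. Requires kappa > 0.
    """
    dot = sum(a * b for a, b in zip(x, mu))
    log_norm = (
        math.log(kappa)
        - math.log(2.0 * math.pi)
        - kappa
        - math.log(-math.expm1(-2.0 * kappa))
    )
    return log_norm + kappa * dot


def mvmf_log_density(x, weights, centers, kappas):
    """Log density of a vMF mixture, combined with a stable log-sum-exp."""
    terms = [
        math.log(w) + vmf_log_density(x, mu, k)
        for w, mu, k in zip(weights, centers, kappas)
    ]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


# Hypothetical two-component mixture centered on New York and Tokyo.
centers = [to_unit_vector(40.7, -74.0), to_unit_vector(35.7, 139.7)]
weights = [0.5, 0.5]
kappas = [100.0, 100.0]

new_york = to_unit_vector(40.7, -74.0)
london = to_unit_vector(51.5, -0.1)
print(mvmf_log_density(new_york, weights, centers, kappas))
print(mvmf_log_density(london, weights, centers, kappas))
```

Because each density depends only on the dot product between unit vectors, distance is measured along the sphere rather than on a flat latitude/longitude grid, so points near the poles or across the antimeridian need no special handling.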
Citation
Izbicki, M., Papalexakis, V., & Tsotras, V. (2019). Geolocating tweets in any language at any location. In International Conference on Information and Knowledge Management, Proceedings (pp. 89–98). Association for Computing Machinery. https://doi.org/10.1145/3357384.3357926