The need for multilingual information retrieval is rising steadily. Specialized text-based forums on the Web are a valuable source of such information. However, extraction of informative messages is often hindered by a large amount of non-informative posts (so-called off-topic posts) and by the informal language commonly used on forums. This paper addresses the task of automatically identifying posts that are potentially useful for sharing professional experience within text forums, irrespective of the forum’s language. For our experiments we selected subsets from various text forums in different languages. Manual annotation was performed by native-speaking experts. Textual, thread-based, and social graph features were extracted. To select satisfactory language-independent forum features, we used gradient boosting models, the relative influence metric for model analysis, and the NDCG metric for measuring the quality of the selection method. We have formed a satisfactory set of forum features that indicate a post’s utility, do not demand sophisticated linguistic analysis, and are suitable for practical use.
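As a rough illustration of the NDCG metric mentioned in the abstract, the sketch below scores one ranked list of posts against expert labels. It assumes the common logarithmic-discount form of DCG with binary relevance; the abstract does not state which NDCG variant or cutoff the authors used, and the function names and sample labels are hypothetical.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of relevance grades listed in ranked order."""
    # Rank 0 gets discount log2(2) = 1, rank 1 gets log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances, k=None):
    """NDCG@k: DCG of the given ranking divided by DCG of the ideal ranking."""
    top = relevances[:k] if k is not None else relevances
    ideal = sorted(relevances, reverse=True)
    ideal = ideal[:k] if k is not None else ideal
    ideal_dcg = dcg(ideal)
    # A list with no relevant posts has no meaningful ranking; return 0.0.
    return dcg(top) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical expert labels (1 = useful, 0 = off-topic), in model-ranked order.
ranked_labels = [1, 0, 1, 0, 0]
score = ndcg(ranked_labels)
```

A score of 1.0 means every useful post is ranked ahead of every off-topic one; lower values indicate that useful posts appear further down the list.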
Grozin, V. A., Gusarova, N. F., & Dobrenko, N. V. (2015). Feature selection for language independent text forum summarization. In Communications in Computer and Information Science (Vol. 518, pp. 63–71). Springer Verlag. https://doi.org/10.1007/978-3-319-24543-0_5