Abstract
Data mining has become an essential element of today's information world. Industries and other sources produce huge amounts of data every day. For textual analysis, internet users generate large volumes of data in the form of tweets on Twitter; updates, posts, and comments on Facebook and blogs; short messages; and emails. Analysing such data can yield valuable information and insights about the subject under study, but social media text is available only in a very raw form. Social media users do not usually write text in the format required by analytics algorithms. Social media text typically contains misspelt words, links, hashtags, mentions of people, short forms of words and phrases, word elongations, emotional symbols, and many other raw forms. When the available text pre-processing techniques (tokenization, lowercasing, stemming, lemmatization, stop-word removal, and normalization) are applied to this raw, uncleaned data, the removal of many words and phrases results in information loss or information modification. Hence, although the curse of dimensionality is reduced, it becomes difficult to extract as many insights as possible from the data. We propose advanced and robust pre-processing techniques that increase information preservation in social media text while keeping the semantics of the data unchanged.
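The information loss described in the abstract can be illustrated with a minimal sketch. This is not the authors' method, which the abstract does not specify. It is an assumed illustration that contrasts a conventional cleaner, which discards links, mentions, hashtags, and emoticons, with one that replaces them by placeholder tokens. The function names, the placeholder tokens (`<url>`, `<user>`, `<hashtag>`, `<emoticon>`), and the regular expressions are all hypothetical choices made for this example.

```python
import re

# All patterns below are illustrative assumptions, not taken from the paper.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
EMOTICON_RE = re.compile(r"[:;=][\-^']?[)(DPp]")
ELONGATION_RE = re.compile(r"([a-zA-Z])\1{2,}")  # a letter repeated 3+ times


def naive_clean(text):
    """Conventional cleaning: drop links, mentions and hashtags entirely,
    then keep only lowercase alphabetic tokens."""
    text = text.lower()
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return re.findall(r"[a-z]+", text)


def preserving_clean(text):
    """Hypothetical information-preserving cleaning: replace social media
    artefacts with placeholder tokens instead of deleting them."""
    text = URL_RE.sub(" <url> ", text)
    text = MENTION_RE.sub(" <user> ", text)
    text = HASHTAG_RE.sub(r" <hashtag> \1 ", text)    # keep the hashtag word
    text = EMOTICON_RE.sub(" <emoticon> ", text)      # before lowercasing ":D"
    text = ELONGATION_RE.sub(r"\1\1", text)           # "looooove" -> "loove"
    return re.findall(r"<[a-z]+>|[a-z0-9]+", text.lower())


if __name__ == "__main__":
    sample = "@anna I looooove this!!! :D #DataMining https://example.com/x"
    print(naive_clean(sample))
    print(preserving_clean(sample))
```

On the sample above, the naive cleaner loses the mention, the hashtag, and the link, and it turns the emoticon into a stray token `d`. The placeholder version keeps a marker for each of them, and the shortened elongation still signals emphasis.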
Emaduddin, S. M., Ullah, R., Mazahir, I., & Uddin, M. Z. (2022). Enhancing Information Preservation in Social Media Text Analytics Using Advanced and Robust Pre-processing Techniques. International Journal of Media and Information Literacy, 7(1), 60–70. https://doi.org/10.13187/ijmil.2022.1.60