Hashtag Segmentation: A Comparative Study Involving the Viterbi, Triangular Matrix and Word Breaker Algorithms

2Citations
Citations of this article
15Readers
Mendeley users who have this article in their library.
Get full text

Abstract

Microblogging and social networking sites such as Twitter, Facebook, and Instagram, are becoming increasingly popular, registering more than 500 million posts each day. Twitter uses hashtags that are dynamic, user-generated text, preceded by a pound (#) symbol, to retrieve similar posts or topics, to mark events or to tag channels. Following segmentation, hashtags can be used for many Natural Language Processing (NLP) applications. These include sentiment analysis, text classification, named entity recognition, and sarcasm detection. This study delves into a comparison of three algorithms, namely the Viterbi, Triangular matrix and Word breaker algorithms, to determine the best among the three, for the segmentation of hashtags. These algorithms utilize different resources, to calculate the probability of the segmented parts, in order to rank the possible generated segmentations. For example, while the Viterbi and Triangular Matrix algorithms use two statistical corpora of unigram and bigram, the Word Breaker algorithm uses the n-gram language model. According to conducted experiment, the Viterbi algorithm is better for hashtag segmentation than the Triangular Matrix algorithm. This can be attributed to the manner in which the Viterbi algorithm conducts the backtracking. On the other hand, the Word Breaker algorithm, which can ascertain the meaningful tokens in the form of words, before proceeding with the segmentation of the remaining characters, is considered superior to both the Viterbi and Triangular Matrix algorithms, particularly when it comes to the detection of unknown words. Used together with the Good-Turing smoothing algorithm, the Word Breaker algorithm achieved 86.64% f1-score on a large language model.

Cite

CITATION STYLE

APA

Abd-Hood, S. F., & Omar, N. (2021). Hashtag Segmentation: A Comparative Study Involving the Viterbi, Triangular Matrix and Word Breaker Algorithms. Journal of Advances in Information Technology, 12(4), 311–318. https://doi.org/10.12720/jait.12.4.311-318

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free