
Entity Based Sentiment Analysis on Twitter

by Siddharth Batra, Deepak Rao
Science ()

Abstract

The aim of our work is to use the Twitter corpus to ascertain the opinion about entities that matter and to enable consumption of these opinions in a user-friendly way. We focus on classifying the opinions as positive, negative, or neutral. Since there aren't large enough datasets of labeled tweets, limiting the sentiment categories to these three enables us to leverage other similar but larger datasets for training custom sentiment language models. We begin by extracting entities from the Twitter dataset using the Stanford NER [8]. URLs and username tags (person) are also treated as entities to augment the entities found by the NER. To learn a sentiment language model, we use a corpus of 200,000 product reviews that have been labeled as positive or negative. Using this corpus, the sentiment language model computes the probability that a given unigram or bigram is being used in a positive context and the probability that it is being used in a negative context. Using this sentiment language model, we analyze all tweets associated with an entity and classify whether the overall opinion of that entity is positive or negative, and by how much.
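The sentiment language model described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenizer, the add-one smoothing, the mean log-odds aggregation, and all function names (`ngrams`, `train`, `p_positive`, `entity_score`) are assumptions made for illustration. The sketch also assumes the positive and negative review sets are of similar size, since it does not correct for class imbalance.

```python
import math
import re
from collections import Counter

# Assumed tokenizer: lowercase alphanumeric runs (not taken from the paper).
TOKEN_RE = re.compile(r"[a-z0-9']+")


def ngrams(text):
    """Return the unigrams and bigrams of a lowercased text."""
    tokens = TOKEN_RE.findall(text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def train(reviews):
    """Count n-gram occurrences per label.

    `reviews` is an iterable of (text, label) pairs with label "pos" or "neg".
    """
    counts = {"pos": Counter(), "neg": Counter()}
    for text, label in reviews:
        counts[label].update(ngrams(text))
    return counts


def p_positive(gram, counts, alpha=1.0):
    """Smoothed estimate of P(positive context | n-gram).

    Unseen n-grams get 0.5, so they do not push the score either way.
    """
    pos = counts["pos"][gram] + alpha
    neg = counts["neg"][gram] + alpha
    return pos / (pos + neg)


def entity_score(tweets, counts):
    """Mean log-odds (positive vs. negative) over all n-grams in the tweets.

    The sign gives the overall polarity for the entity and the magnitude
    indicates by how much. Returns 0.0 when there is nothing to score.
    """
    total, n = 0.0, 0
    for tweet in tweets:
        for gram in ngrams(tweet):
            p = p_positive(gram, counts)
            total += math.log(p / (1.0 - p))
            n += 1
    return total / n if n else 0.0
```

In use, `train` would be called once on the labeled review corpus, and `entity_score` would then be called with all tweets that mention a given entity.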


Available from nlp.stanford.edu

Sentiment Analysis on Twitter

Akshi Kumar and Teeja Mary Sebastian
Department of Computer Engineering, Delhi Technological University, Delhi, India

Abstract

With the rise of the social networking epoch, there has been a surge of user-generated content. Microblogging sites have millions of people sharing their thoughts daily because of their characteristically short and simple manner of expression. We propose and investigate a paradigm to mine the sentiment from a popular real-time microblogging service, Twitter, where users post real-time reactions to and opinions about "everything". In this paper, we expound a hybrid approach using both corpus-based and dictionary-based methods to determine the semantic orientation of the opinion words in tweets. A case study is presented to illustrate the use and effectiveness of the proposed system.

Keywords: Microblogging, Twitter, Sentiment Analysis

1. Introduction

The ongoing increase in wide-area network connectivity promises vastly augmented opportunities for collaboration and resource sharing. Nowadays, social networking sites such as Twitter (www.twitter.com), Facebook (www.facebook.com), MySpace (www.myspace.com), and YouTube (www.youtube.com) have gained so much popularity that we cannot ignore them. They have become one of the most important applications of Web 2.0 [1]. They allow people to build connection networks with other people in an easy and timely way, to share various kinds of information, and to use a set of services such as picture sharing, blogs, and wikis. It is evident that the advent of real-time information networking sites like Twitter has spawned the creation of an unequaled public collection of opinions about every global entity that is of interest. Although Twitter may provide an excellent channel for opinion creation and presentation, it poses newer and different challenges, and the process is incomplete without adept tools for analyzing those opinions to expedite their consumption.

More recently, there have been several research projects that apply sentiment analysis to Twitter corpora in order to extract general public opinion regarding political issues [2]. Due to the increase of hostile and negative communication over social networking sites like Facebook and Twitter, the Government of India recently tried to allay concerns over censorship of these sites, where Web users continued to speak out against any proposed restriction on the posting of content. As reported in one of the Indian national newspapers [3], "Union Minister for Communications and Information Minister, Kapil Sibal, proposed content screening & censorship of social networks like Twitter and Facebook". Instigated by this, our research set out to use sentiment analysis to gauge the public mood and detect any rising antagonistic or negative feeling on social media. Although we firmly believe that censorship is not the right path to follow, this recent trend of research on sentiment mining in Twitter can be utilized and extended for a gamut of practical applications that range from applications in business (marketing intelligence; product and service benchmarking and improvement) and applications as sub-component technology (recommender systems, summarization, question answering) to applications in politics. This motivated us to propose a model which retrieves tweets on a certain topic through the Twitter API and calculates the sentiment orientation/score of each tweet. The area of sentiment analysis intends to comprehend these opinions and distribute them into categories such as positive, negative, and neutral. Until now, most sentiment analysis work has been done on review sites [4]. Review sites provide sentiments about products or movies, thus restricting the domain of application solely to business.

Sentiment analysis on Twitter posts is the next step in the field of sentiment analysis, as tweets give us a richer and more varied resource of opinions and sentiments that can be about anything from the latest phone users bought or the movie they watched to political issues, religious views, or an individual's state of mind. Thus, the foray into Twitter as the corpus allows us to move into different dimensions and diverse applications.

IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 4, No 3, July 2012. ISSN (Online): 1694-0814. www.IJCSI.org. Copyright (c) 2012 International Journal of Computer Science Issues. All Rights Reserved.

2. Related Work

Applying sentiment analysis to Twitter is an emerging trend, with researchers recognizing its scientific challenges and potential applications. The challenges unique to this problem area are largely attributed to the dominantly informal tone of microblogging. Pak and Paroubek [5] give a rationale for using microblogging, and more particularly Twitter, as a corpus for sentiment analysis. They cited the following reasons:

  • Microblogging platforms are used by different people to express their opinion about different topics, thus it is a valuable source of people's opinions.
  • Twitter contains an enormous number of text posts and it grows every day. The collected corpus can be arbitrarily large.
  • Twitter's audience varies from regular users to celebrities, company representatives, politicians, and even country presidents. Therefore, it is possible to collect text posts of users from different social and interest groups.
  • Twitter's audience is represented by users from many countries.

Parikh and Movassate [6] implemented two Naive Bayes unigram models, a Naive Bayes bigram model, and a Maximum Entropy model to classify tweets. They found that the Naive Bayes classifiers worked much better than the Maximum Entropy model. Go et al. [7] proposed a solution using distant supervision, in which their training data consisted of tweets with emoticons. This approach was initially introduced by Read [8]. The emoticons served as noisy labels. They built models using Naive Bayes, MaxEnt, and Support Vector Machines (SVM). Their feature space consisted of unigrams, bigrams, and POS tags. They reported that SVM outperformed the other models and that unigrams were more effective as features. Pak and Paroubek [5] have done similar work but classify the tweets as objective, positive, or negative. In order to collect a corpus of objective posts, they retrieved text messages from the Twitter accounts of popular newspapers and magazines, such as "New York Times" and "Washington Post". Their classifier is based on the multinomial Naïve Bayes classifier and uses N-grams and POS tags as features. Barbosa et al. [9] also classified tweets as objective or subjective, and then classified the subjective tweets as positive or negative. The feature space included tweet-specific features such as retweets, hashtags, links, punctuation, and exclamation marks, in conjunction with features such as the prior polarity of words and their POS tags. Mining for entity opinions in Twitter, Batra and Rao [10] used a dataset of roughly 60 million tweets spanning two months starting from June 2009. Entities were extracted using the Stanford NER, and user tags and URLs were used to augment the entities found. A corpus of 200,000 product reviews that had been labeled as positive or negative was used to train the model. Using this corpus, the model computed the probability that a given unigram or bigram was being used in a positive context and the probability that it was being used in a negative context. Bifet and Frank [11] used Twitter streaming data provided by the Firehose, which gave all messages from every user in real time. They experimented with three fast incremental methods that are well suited to data streams: multinomial naive Bayes, stochastic gradient descent (SGD), and the Hoeffding tree. They concluded that the SGD-based model, used with an appropriate learning rate, was the best. Agarwal et al. [12] approached the task of mining sentiment from Twitter as a 3-way task of classifying sentiment into positive, negative, and neutral classes. They experimented with three types of models: a unigram model, a feature-based model, and a tree-kernel-based model. For the tree-kernel-based model they designed a new tree representation for tweets. The feature-based model uses 100 features, and the unigram model uses over 10,000 features. They concluded that features which combine the prior polarity of words with their part-of-speech tags are most important for the classification task. The tree-kernel-based model outperformed the other two.
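The distant-supervision idea described above, in which emoticons serve as noisy labels, can be sketched as follows. This is an illustrative sketch only and not the setup of any cited work: the emoticon lists, the tokenizer, and the function names (`noisy_label`, `tokenize`, `train_nb`, `classify`) are assumptions, and only a multinomial Naive Bayes classifier over unigrams with add-one smoothing is shown, not MaxEnt or SVM.

```python
import math
import re
from collections import Counter

# Assumed emoticon lists; real systems would use larger, curated sets.
POS_EMOTICONS = (":)", ":-)", ":D")
NEG_EMOTICONS = (":(", ":-(")


def noisy_label(tweet):
    """Return "pos" or "neg" from emoticons, or None if absent or conflicting."""
    has_pos = any(e in tweet for e in POS_EMOTICONS)
    has_neg = any(e in tweet for e in NEG_EMOTICONS)
    if has_pos == has_neg:
        return None
    return "pos" if has_pos else "neg"


def tokenize(tweet):
    """Remove the emoticons so the label does not leak, then split into unigrams."""
    for e in POS_EMOTICONS + NEG_EMOTICONS:
        tweet = tweet.replace(e, " ")
    return re.findall(r"[a-z']+", tweet.lower())


def train_nb(tweets):
    """Collect per-class unigram counts from emoticon-labeled tweets."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    doc_counts = Counter()
    for tweet in tweets:
        label = noisy_label(tweet)
        if label is None:
            continue  # unlabeled or ambiguous tweets are skipped
        doc_counts[label] += 1
        word_counts[label].update(tokenize(tweet))
    return word_counts, doc_counts


def classify(tweet, word_counts, doc_counts, alpha=1.0):
    """Multinomial Naive Bayes over unigrams with add-alpha smoothing."""
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in ("pos", "neg"):
        # Smoothed log prior, so a class with no training tweets avoids log(0).
        score = math.log((doc_counts[label] + alpha) / (total_docs + 2 * alpha))
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        for tok in tokenize(tweet):
            if tok in vocab:  # out-of-vocabulary words are ignored
                score += math.log((word_counts[label][tok] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Stripping the emoticons before tokenizing matters here: if they stayed in the feature set, the classifier would simply learn the labels back and generalize poorly to tweets without emoticons.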
Sentiment analysis tasks can be done at several levels of granularity, namely the word level, phrase or sentence level, document level, and feature level [13]. As Twitter allows its users to share short pieces of information known as "tweets" (limited to 140 characters), word-level granularity aptly suits its setting. A survey of the literature substantiates that the methods of automatically annotating sentiment at the word level fall into the following two categories: (1) dictionary-based approaches and (2) corpus-based approaches. Further, to automate sentiment analysis, different approaches have been applied to predict the sentiments of words, expressions, or documents. These include Natural Language Processing (NLP) and Machine Learning (ML) algorithms [14]. In our attempt to mine the sentiment from Twitter data, we introduce a hybrid approach which combines the advantages of both dictionary-based and corpus-based methods with a combination of NLP-based and ML-based techniques. The following sections illustrate the proposed paradigm.

3. Data Characteristics

Twitter is a social networking and microblogging service that lets its users post real-time messages, called tweets. Tweets have many unique characteristics, which pose new challenges and shape the means of carrying out sentiment analysis on them as compared to other domains. The following are some key characteristics of tweets:
