Characterizing Microblogs with Topic Models

by Daniel Ramage, Susan Dumais, Dan Liebling
International AAAI Conference on Weblogs and Social Media

Abstract

As microblogging grows in popularity, services like Twitter are coming to support information gathering needs above and beyond their traditional roles as social networks. But most users' interaction with Twitter is still primarily focused on their social graphs, forcing the often inappropriate conflation of "people I follow" with "stuff I want to read." We characterize some information needs that the current Twitter interface fails to support, and argue for better representations of content for solving these challenges. We present a scalable implementation of a partially supervised learning model (Labeled LDA) that maps the content of the Twitter feed into dimensions. These dimensions correspond roughly to substance, style, status, and social characteristics of posts. We characterize users and tweets using this model, and present results on two information-consumption-oriented tasks.

Available from www.aaai.org
Daniel Ramage, Stanford University, 353 Serra Mall, Stanford, CA, dramage@cs.stanford.edu
Susan Dumais, Microsoft Research, One Microsoft Way, Redmond, WA, sdumais@microsoft.com
Dan Liebling, Microsoft Research, One Microsoft Way, Redmond, WA, danl@microsoft.com

Introduction

Millions of people turn to microblogging services like Twitter to gather real-time news or opinion about people, things, or events of interest. Such services are used for social networking, e.g., to stay in touch with friends and colleagues. In addition, microblogging sites are used as publishing platforms to create and consume content from sets of users with overlapping and disparate interests.

Consider a hypothetical user @jane who follows user @frank because of the latter's posts about college football. However, @frank additionally uses Twitter to coordinate social arrangements with friends and occasionally posts political viewpoints. Currently, @jane has few tools to filter non-football content from @frank.
In short, Twitter assumes that all posts from the people @jane follows are posts she wants to read. Similarly, @jane has a limited set of options for identifying new people to follow. She can look at lists of users in the social graph (e.g., those followed by @frank), or she can search by keyword and then browse the posters of the returned tweets. However, it remains difficult to find people who are like @frank in general or, more challengingly, like @frank but with less social chatter or different political views.

Copyright © 2010, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

The example above illustrates two of the many content-oriented information needs that are currently unmet on Twitter. Solving these challenges will require going beyond the traditional network-based analysis techniques that are often applied to microblogs and social networks to develop new tools for analyzing and understanding Twitter content.

Content analysis on Twitter poses unique challenges: posts are short (140 characters or fewer) with language unlike the standard written English on which many supervised models in machine learning and NLP are trained and evaluated. Effectively modeling content on Twitter requires techniques that can readily adapt to the data at hand and require little supervision.

Our approach borrows the machinery of latent variable topic models like the popular unsupervised model Latent Dirichlet Allocation (LDA) (Blei, Ng, & Jordan, 2003). Latent variable topic models have been applied widely to problems in text modeling, and require no manually constructed training data.
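To make the point about non-standard tweet language concrete, the sketch below shows one way tweets might be turned into the bag-of-words counts that topic models such as LDA take as input. It is a minimal illustration under stated assumptions, not the authors' preprocessing: the names `TOKEN_RE`, `tokenize`, and `bag_of_words`, and the small emoticon pattern, are hypothetical choices made for this example.

```python
import re
from collections import Counter

# Hypothetical token pattern (an assumption, not the paper's tokenizer):
# keep @mentions, #hashtags, and a few simple emoticons as single tokens,
# and otherwise keep runs of letters, digits, and apostrophes.
TOKEN_RE = re.compile(
    r"""
      @\w+                  # @user mention
    | \#\w+                 # hashtag
    | [:;=][-']?[()DPp]     # simple emoticons such as :) ;-) :D
    | [A-Za-z0-9']+         # ordinary word
    """,
    re.VERBOSE,
)


def tokenize(tweet):
    """Split one tweet into tokens, preserving Twitter-specific symbols."""
    tokens = []
    for match in TOKEN_RE.finditer(tweet):
        token = match.group(0)
        # Lowercase words, mentions, and hashtags; leave emoticons untouched.
        if token[0] not in ":;=":
            token = token.lower()
        tokens.append(token)
    return tokens


def bag_of_words(tweets):
    """Return one Counter of token frequencies per tweet."""
    return [Counter(tokenize(tweet)) for tweet in tweets]


if __name__ == "__main__":
    for counts in bag_of_words(["@frank Go Cardinal!! #football :)"]):
        print(dict(counts))
```

Keeping mentions, hashtags, and emoticons as whole tokens is a design choice for this sketch: a tokenizer built for standard written English would typically split or discard them, losing exactly the signals that distinguish tweets from other text.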
These models distill collections of text documents (here, tweets) into distributions of words that tend to co-occur in similar documents; these sets of related words are referred to as "topics." While LDA and related models have a long history of application to news articles and academic abstracts, one open question is whether they will work on documents as short as Twitter posts and with text that varies greatly from the traditionally studied collections. Here we find that the answer is yes. In this paper, we use Labeled LDA (Ramage et al., 2009), which extends LDA by incorporating supervision in the form of implied tweet-level labels where available, enabling explicit models of text content associated with hashtags, replies, emoticons, and the like.

What types of patterns can latent variable topic models discover from tweets? The Understanding Following Behavior section argues that the latent topics can be roughly categorized into four types: substance topics about events and ideas, social topics recognizing language used toward a social end, status topics denoting personal updates, and style topics that embody broader trends in language usage. Next, in the Modeling Posts with Labeled LDA section, we outline some applications of the mixture of latent and labeled topics, demonstrating the specificity of learned vocabularies associated with the various label types. Then, in the Characterizing Content on Twitter section, we characterize selected Twitter users along these learned dimensions, showing that topic models can provide
interpretable summaries or characterizations of users' tweet streams. Finally, in the Ranking Experiments section, we demonstrate the approach's effectiveness at modeling Twitter content with a set of experiments on users' quality rankings of their own subscribed feeds.

Related work

Most of the published research about Twitter has focused on questions related to Twitter's network and community structure. For example, Krishnamurthy, Gill, & Arlitt (2008) summarize general features of the Twitter social network such as topological and geographical properties, patterns of growth, and user behaviors. Others, such as Java et al. (2007), argue from a network perspective that user activities on Twitter can be thought of as information seeking, information sharing, or as a social activity.

Less work has presented a systematic analysis of the textual content of posts on Twitter. Recent work has examined content with respect to specific Twitter conventions: @user mentions in (Honeycutt & Herring, 2009) and re-tweeting, or re-posting someone else's post, in (boyd, Golder, & Lotan, 2010). Notably, Naaman, Boase, & Lai (2010) characterize content on Twitter and other "Social Awareness Streams" via a manual coding of tweets into categories of varying specificity, from "Information Sharing" to "Self Promotion." Naaman et al. extrapolate from these categories, inducing two kinds of users: "informers" that pass on non-personal information and "meformers" that mostly tweet about themselves. Others have proposed forms of content analysis on Twitter with specific focuses, such as modeling conversations (Ritter, Cherry, & Dolan, 2010). Although rich with insight, these works do not present automatic methods for organizing and categorizing all Twitter posts by content, the problem we approach here.

Understanding Following Behavior

What needs drive following and reading behavior on Twitter, and to what extent does Twitter satisfy them?
To help organize our own intuitions, we conducted in-depth structured interviews with four active Twitter users (whose numbers of following and followed users ranged from dozens to thousands), and followed up with a web-based survey of 56 more users. We found that both the content of posts and social factors played important roles when our interviewees decided whether to follow a user. Distilling our conversations down to their essence, we found that all those interviewed made distinctions between people worth following for the subjects they write about (substance, e.g., a hobby or professional interest), because of some social value (social, e.g., for making plans with friends), because of (dis)interest in personal life updates from the poster (status, e.g., where someone is or what they are doing), or because of the tone or style of the posts (style, e.g., humor or wit).

To examine these intuitions in a broader context, we conducted a web-based survey cataloging reasons that underlie users' following decisions on Twitter, as determined from our interviews and other direct interaction with regular Twitter users. Fifty-six respondents within Microsoft completed the survey during one week in November 2009. Of these, 65% were male and 75% were between the ages of 26 and 45; 67% were very active consumers of information, reading posts several times a day; 37% posted more than once per day; and 54% posted with frequency between once a day and once a month. While this sample does not represent the full range of Twitter's demographics, we believe it provides useful insight into challenges facing Twitter users more generally.

Respondents were asked how often they considered 26 reasons when making decisions about whom to follow, with most reasons falling into one of the substance, status, social, and style categories identified earlier.
Each respondent rated each reason on a five-point scale running from "rarely" through "sometimes," "about half the time," and "often" to "almost always." The most common reasons for following represent a mixture of the four categories of reasons: the two most common reasons were "professional interest" and "technology" (substance). These particular substantive topics reflected the demographics of the respondents. The next most commonly used reasons were "tone of presentation" (style), "keeping up with friends" (social), "networking" (social), and "interested in personal updates" (status). Low-ranked reasons included "being polite by following back" and "short-term needs (like travel info)."

Respondents were also queried about nine reasons for un-following users, i.e., removing users from their streams. We found that "too many posts in general" was the most common reason for a user to be un-followed. Other common reasons were "too much status/personal info" (status), "too much content outside my interest set" (substance), and "didn't like tone or style" (style). Respondents rarely un-followed for social reasons like "too many conversations with other people." The least common reason was, unsurprisingly, "not enough posts," because such users are rarely seen by their followers simply through lack of activity. Twenty-four users provided additional reasons for un-following: 10 mentioned spam, 8 mentioned insufficiently interesting, boring, or duplicative posts, and 6 un-followed because of offensive posts (e.g., religious or political views, general tone, or posts about other people).

In response to an open-ended question about what an ideal interface to Twitter would do differently, survey respondents identified two main challenges related to content on Twitter, underscoring the importance of improved models of Twitter content. First, new users have difficulty discovering feeds worth subscribing to.
Later, they have too much content in their feeds, and lose the most interesting or relevant posts in a stream of thousands of posts of lesser utility. Of the 45 respondents who answered this question, 16 wanted improved capabilities for filtering their feeds by user, topic, or context (e.g., "organize into topics of interest", "ignore temporarily people, tags or topics"). In addition, 11 wanted improved interfaces for following, such as organization into topics or
