Abstract
Work on cross-document coreference resolution (CDCR) has primarily focused on news articles, with little to no work on social media. Yet social media may be particularly challenging, since short messages provide little context and informal names are pervasive. We introduce a new Twitter corpus with entity annotations, grouped into entity clusters, that supports CDCR. Our corpus draws from Twitter data surrounding the 2013 Grammy music awards ceremony, providing a large set of annotated tweets focused on a single event. To establish a baseline, we evaluate two CDCR systems and consider the performance impact of each system component. Furthermore, we augment one system to include temporal information, which can be helpful when documents (such as tweets) arrive in a specific order. Finally, we include annotations linking the entities to a knowledge base to support entity linking. Our corpus is available at https://bitbucket.org/mdredze/tgx.
Cite
Dredze, M., Andrews, N., & DeYoung, J. (2016). Twitter at the Grammys: A Social Media Corpus for Entity Linking and Disambiguation. In EMNLP 2016 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the 4th International Workshop on Natural Language Processing for Social Media, SocialNLP 2016 (pp. 20–25). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w16-6204