Building LinkedIn's Real-time Activity Data Pipeline

  • Goodhope K
  • Koshy J
  • Kreps J
Abstract

One trend in the implementation of modern web systems is the use of activity data in the form of log or event messages that capture user and server activity. This data is at the heart of many internet systems in the domains of advertising, relevance, search, recommendation systems, and security, as well as continuing to fulfill its traditional role in analytics and reporting. Many of these uses place real-time demands on data feeds. Activity data is extremely high volume, and real-time pipelines present new design challenges. This paper discusses the design and engineering problems we encountered in moving LinkedIn's data pipeline from a batch-oriented file aggregation mechanism to a real-time publish-subscribe system called Kafka. This pipeline currently runs in production at LinkedIn and handles more than 10 billion message writes each day with a sustained peak of over 172,000 messages per second. Kafka supports dozens of subscribing systems and delivers more than 55 billion messages to these consumer processes each day. We discuss the origins of this system, missteps on the path to real-time, and the design and engineering problems we encountered along the way.
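The abstract's central idea, replacing point-to-point batch file aggregation with a topic-based publish-subscribe feed, can be illustrated with a toy in-memory sketch. All names below are invented for illustration; Kafka's real design is a distributed, persistent commit log accessed through brokers, not an in-process bus.

```python
# Illustrative sketch only: a minimal in-memory publish-subscribe bus.
# Producers publish activity events to a named topic; every subscriber
# independently receives each message, so new downstream systems can be
# added without touching the producers.
from collections import defaultdict

class MiniPubSub:
    def __init__(self):
        # topic name -> list of per-subscriber message queues
        self._subscribers = defaultdict(list)

    def subscribe(self, topic):
        """Register a new consumer on a topic; returns its private queue."""
        queue = []
        self._subscribers[topic].append(queue)
        return queue

    def publish(self, topic, message):
        """Deliver a copy of the message to every subscriber of the topic."""
        for queue in self._subscribers[topic]:
            queue.append(message)

bus = MiniPubSub()
analytics = bus.subscribe("page_views")
security = bus.subscribe("page_views")
bus.publish("page_views", {"member": 42, "page": "/feed"})
# Both the analytics and security consumers now hold the same event.
```

The point of the sketch is the decoupling: in the batch-file world each new consumer meant another bespoke data hand-off, whereas with a pub-sub topic the producer writes once and any number of subscribing systems read independently.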

Citation (APA)
Goodhope, K., Koshy, J., & Kreps, J. (2012). Building LinkedIn’s Real-time Activity Data Pipeline. IEEE Data Eng, 1–13.
