Building LinkedIn's Real-time Activity Data Pipeline

  • Goodhope K
  • Koshy J
  • Kreps J
Abstract

One trend in the implementation of modern web systems is the use of activity data in the form of log or event messages that capture user and server activity. This data is at the heart of many internet systems in the domains of advertising, relevance, search, recommendation systems, and security, as well as continuing to fulfill its traditional role in analytics and reporting. Many of these uses place real-time demands on data feeds. Activity data is extremely high volume, and real-time pipelines present new design challenges. This paper discusses the design and engineering problems we encountered in moving LinkedIn's data pipeline from a batch-oriented file aggregation mechanism to a real-time publish-subscribe system called Kafka. This pipeline currently runs in production at LinkedIn and handles more than 10 billion message writes each day with a sustained peak of over 172,000 messages per second. Kafka supports dozens of subscribing systems and delivers more than 55 billion messages to these consumer processes each day. We discuss the origins of this system, missteps on the path to real-time, and the design and engineering problems we encountered along the way.
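The abstract's central idea, replacing point-to-point batch file aggregation with a topic-based publish-subscribe feed, can be illustrated with a toy in-memory sketch. All names below are invented for illustration; Kafka's real design is a distributed, persistent commit log accessed through brokers, not an in-process bus.

```python
# Illustrative sketch only: a minimal in-memory publish-subscribe bus.
# Producers publish activity events to a named topic; every subscriber
# independently receives each message, so new downstream systems can be
# added without touching the producers.
from collections import defaultdict

class MiniPubSub:
    def __init__(self):
        # topic name -> list of per-subscriber message queues
        self._subscribers = defaultdict(list)

    def subscribe(self, topic):
        """Register a new consumer on a topic; returns its private queue."""
        queue = []
        self._subscribers[topic].append(queue)
        return queue

    def publish(self, topic, message):
        """Deliver a copy of the message to every subscriber of the topic."""
        for queue in self._subscribers[topic]:
            queue.append(message)

bus = MiniPubSub()
analytics = bus.subscribe("page_views")
security = bus.subscribe("page_views")
bus.publish("page_views", {"member": 42, "page": "/feed"})
# Both the analytics and security consumers now hold the same event.
```

The point of the sketch is the decoupling: in the batch-file world each new consumer meant another bespoke data hand-off, whereas with a pub-sub topic the producer writes once and any number of subscribing systems read independently.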

Citation (APA)
Goodhope, K., Koshy, J., & Kreps, J. (2012). Building LinkedIn’s Real-time Activity Data Pipeline. IEEE Data Eng, 1–13.
