Generalizing streaming pipeline design for big data

Abstract

Streaming data refers to data sent to a cloud or a processing centre in real time. Although applications that process data streams live and generate useful insights already exist, the field is still in its infancy when it comes to complex stream processing. Current streaming data analytics tools represent the third-/fourth-generation data processing capability in the big data hierarchy, which includes the Hadoop ecosystem, Apache Storm™, Apache Kafka™ and their kind, the Apache Spark™ framework, and now Apache Flink™ with its non-batch, stateful streaming core. None of these can, on its own, handle every aspect of a data processing pipeline. Good synergy between these technologies is essential to meet the management, streaming, processing and fault-tolerance requirements of various data-driven applications. Companies tailor their pipelines exclusively to their own requirements, since building a general framework entails mammoth interfacing and configuration effort that is not cost-effective for them. In this paper, we envision and implement such a generalized minimal stream processing pipeline and measure its performance on several data sets in terms of the delays and latencies of data arrival at pivotal checkpoints in the pipeline. We also virtualize the pipeline in a Docker™ container with little loss in performance.
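
As a rough illustration of the checkpoint-based latency measurement described in the abstract, the sketch below timestamps records at a Kafka producer and computes the arrival delay at a consumer checkpoint. This is not code from the paper: the broker address, topic name and record schema are assumptions, and the kafka-python client stands in for whatever client the authors used.

```python
# Minimal sketch: measure data-arrival delay across a Kafka hop.
# Assumptions (not from the paper): local single broker, hypothetical
# topic "sensor-stream", JSON records carrying a production timestamp.
import json
import time

from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

BROKER = "localhost:9092"   # assumed broker address
TOPIC = "sensor-stream"     # hypothetical topic name


def produce_records(n=100):
    """Attach a production timestamp to each record before sending."""
    producer = KafkaProducer(
        bootstrap_servers=BROKER,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for i in range(n):
        producer.send(TOPIC, {"id": i, "t_produced": time.time()})
    producer.flush()


def consume_and_measure():
    """Compute the delay between production and this consumer checkpoint."""
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKER,
        auto_offset_reset="earliest",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
        consumer_timeout_ms=5000,
    )
    for msg in consumer:
        delay_ms = (time.time() - msg.value["t_produced"]) * 1000
        print(f"record {msg.value['id']}: {delay_ms:.1f} ms")
```

The same pattern extends to a multi-stage pipeline: each stage (e.g. a Spark or Flink job) appends its own timestamp to the record, so per-hop delays can be recovered at the final checkpoint.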

Citation

Rengarajan, K., & Menon, V. K. (2020). Generalizing streaming pipeline design for big data. In Advances in Intelligent Systems and Computing (Vol. 1085, pp. 149–160). Springer. https://doi.org/10.1007/978-981-15-1366-4_12
