An increasingly important system requirement for distributed stream processing applications is to provide strong correctness guarantees under unexpected failures and out-of-order data so that its results can be authoritative (not needing complementary batch results). Although existing systems have put a lot of effort into addressing some specific issues, such as consistency and completeness, how to enable users to make flexible and transparent trade-off decisions among correctness, performance, and cost still remains a practical challenge. Specifically, similar mechanisms are usually applied to tackle both consistency and completeness, which can result in unnecessary performance penalties. We present Apache Kafka's core design for stream processing, which relies on its persistent log architecture as the storage and inter-processor communication layers to achieve correctness guarantees. Kafka Streams, a scalable stream processing client library in Apache Kafka, defines the processing logic as read-process-write cycles in which all processing state updates and result outputs are captured as log appends. Idempotent and transactional write protocols are utilized to guarantee exactly-once semantics. Furthermore, revision-based speculative processing is employed to emit results as soon as possible while handling out-of-order data. We also demonstrate how Kafka Streams behaves in practice with large-scale deployments and performance insights exhibiting its flexible and low-overhead trade-offs.
CITATION STYLE
Wang, G., Chen, L., Dikshit, A., Gustafson, J., Chen, B., Sax, M. J., … Rao, J. (2021). Consistency and Completeness: Rethinking Distributed Stream Processing in Apache Kafka. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 2602–2613). Association for Computing Machinery. https://doi.org/10.1145/3448016.3457556
Mendeley helps you to discover research relevant for your work.