The Data Lake is emerging as a Big Data storage and management solution that can store any type of data at scale and execute data transformations for analysis. This greater flexibility in storage, however, increases the risk of Data Lakes becoming data swamps. In this paper, we show how provenance contributes to data management within a Data Lake infrastructure. We study the challenges of integrating provenance and propose a reference architecture for provenance usage in a Data Lake. Finally, we discuss the applicability of our tools in the proposed architecture.
Suriarachchi, I., & Plale, B. (2016). Provenance as essential infrastructure for Data Lakes. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9672, pp. 178–182). Springer Verlag. https://doi.org/10.1007/978-3-319-40593-3_16