Towards a Scalable and Efficient ETL

Abstract

Extract, transform, and load (ETL) processes are crucial for building data repositories from a variety of self-contained sources. Despite their complexity and cost, ETL processes have reached some maturity for traditional, XML, and graph data sources. However, they face two main challenges: (1) they do not scale to large and highly varied data sources, including web data; and (2) the target data warehouse must be deployed in a polystore. This paper reviews research efforts along these lines. It then proposes a conceptual model of these processes using BPMN (Business Process Modeling Notation). The modeled processes are automatically converted to scripts that run on the Spark framework. The solution is packaged according to a new distributed architecture (Open ETL) that supports both batch and stream processing. To make the approach concrete and open to evaluation, a real case study is considered that uses the LUBM benchmark and involves heterogeneous data sources.

Cite

Gueddoudj, E. Y., & Chikh, A. (2023). Towards a Scalable and Efficient ETL. International Journal of Computing and Digital Systems, 14(1), 10223–10231. https://doi.org/10.12785/ijcds/140195
