Towards a Scalable and Efficient ETL

El Yazid Gueddoudj; Azeddine Chikh

Journal Article, Open Access

International Journal of Computing and Digital Systems (2023) 14(1) 10223-10231

DOI: 10.12785/ijcds/140195

4 Citations

33 Readers

Abstract

Extract, transform, and load (ETL) processes are crucial for building data repositories from a variety of self-contained sources. Despite their complexity and cost, ETL processes have reached a degree of maturity for traditional, XML, and graph data sources. However, they face a twofold challenge: (1) they do not scale to large and highly varied data sources, including web data; and (2) the target data warehouse must be deployed in a polystore. This paper reviews research efforts that address these challenges. It then proposes a conceptual model of ETL processes using BPMN (Business Process Modeling Notation). These models are automatically converted into scripts that run on the Spark framework. The solution is packaged in a new distributed architecture (Open ETL) that supports both batch and stream processing. To make the approach concrete and open to evaluation, a real case study based on the LUBM benchmark, which involves heterogeneous data sources, is presented.

Cite

Gueddoudj, E. Y., & Chikh, A. (2023). Towards a Scalable and Efficient ETL. International Journal of Computing and Digital Systems, 14(1), 10223–10231. https://doi.org/10.12785/ijcds/140195
