Abstract
Extract, transform, and load (ETL) processes are crucial for building data repositories from a variety of self-contained sources. Despite their complexity and cost, ETL processes have reached some maturity for traditional, XML, and graph data sources. However, they face two main challenges: (1) they do not scale to large and highly varied data sources, including web data; and (2) the target data warehouse must be deployed in a polystore. The paper reviews research efforts along this line. It then proposes a conceptual model of these processes using BPMN (Business Process Modeling Notation). The modeled processes are automatically converted to scripts to be implemented within the Spark framework. The solution is packaged according to a new distributed architecture (Open ETL) that supports both batch and stream processing. To make our new approach more concrete and evaluable, we consider a real case study using the LUBM benchmark, which involves heterogeneous data sources.
Gueddoudj, E. Y., & Chikh, A. (2023). Towards a Scalable and Efficient ETL. International Journal of Computing and Digital Systems, 14(1), 10223–10231. https://doi.org/10.12785/ijcds/140195