Detecting Duplicates in Real-Time Data Warehouse Using Bloom Filter-Based Approach

Abstract

Data warehousing has been a topic of intense research for the past few years. A data warehouse serves primarily as a central repository for data coming from disparate sources. Generally, fresh data is loaded into the central repository in disconnected mode through batch processing. Hence, there is always a chance that the data available in the central warehouse is not real-time. Such stale data is of little use to most commercial real-time applications, such as real-time transport monitoring, smart cities, the semantic web, online transaction processing, and sensor networks. To fully realize these applications, fresh data needs to be readily available for critical decision making. In particular, they demand quick, real-time accumulation of data from diverse sources into the main data warehouse. This paper focuses on maintaining consistency and providing real-time data updates in the data warehouse. Specifically, it targets the detection of duplicates in a streaming environment with a limited amount of memory. For this purpose, it employs a probabilistic data structure called a Bloom filter. The Bloom filter sets bits in a bit array when information is added to the data warehouse. The technique is reported to give nearly 100% accurate results without false positives, with a worst-case error rate of 0.01%. For the implementation, a data structure called a time frame Bloom filter (TBF) is used, which is essentially a bitmap of the information. Using this method, one can insert, update, delete, and search message data in the data warehouse very quickly. To make the approach scalable, more than one Bloom filter can be used to address inconsistency issues.
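
The bit-setting idea described in the abstract can be illustrated with a minimal sketch of a plain Bloom filter used to drop duplicates from a stream. This is not the authors' time frame Bloom filter (TBF): the class name `BloomFilter`, the helper `deduplicate`, the array size, the number of hash functions, and the SHA-256 double-hashing scheme are all assumptions made for illustration. A plain Bloom filter also does not support the update and delete operations mentioned above.

```python
import hashlib


class BloomFilter:
    """Illustrative plain Bloom filter (hypothetical, not the paper's TBF)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 7):
        # Sizes are arbitrary assumptions chosen for the example.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        # Set the bits for this item, as happens when data is added.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means definitely unseen; True means probably seen.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def deduplicate(stream, bloom: BloomFilter):
    """Yield only the records the filter has not (probably) seen before."""
    for record in stream:
        if bloom.might_contain(record):
            continue  # probable duplicate; may rarely be a false positive
        bloom.add(record)
        yield record


if __name__ == "__main__":
    messages = ["msg-1", "msg-2", "msg-1", "msg-3", "msg-2"]
    print(list(deduplicate(messages, BloomFilter())))
```

Memory use is fixed by the bit array regardless of how many records pass through, which is what makes this kind of structure attractive for streaming duplicate detection under a memory limit. The trade-off is that a membership test can occasionally report an unseen record as a duplicate.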

Cite

Rizwan, S., Adil, S. H., & Islam, N. (2020). Detecting Duplicates in Real-Time Data Warehouse Using Bloom Filter-Based Approach. In Communications in Computer and Information Science (Vol. 1198, pp. 762–771). Springer. https://doi.org/10.1007/978-981-15-5232-8_65
