Digital libraries, e-commerce brokers, and similar large information-oriented systems rely on consistent data to offer high-quality services. However, the presence of duplicate, quasi-replica, or near-duplicate entries ("dirty data") in their repositories wastes storage resources directly and degrades service delivery indirectly. Significant investment in this field by interested parties has prompted the search for better methods of removing replicas from data repositories. Prior approaches relied on pre-trained SVM classifiers to handle such dirty data. Newer distributed deduplication systems achieve higher reliability by distributing data chunks across multiple cloud servers; the security requirements of data confidentiality and tag consistency are also met by introducing a deterministic secret sharing scheme in distributed storage systems, instead of the convergent encryption used in previous deduplication systems. This work proposes an Unsupervised Duplicate Detection (UDD) mechanism, a query-dependent record matching method that requires no pre-trained data set. UDD uses two cooperating classifiers, a weighted component similarity summing (WCSS) classifier and an SVM classifier, that iteratively identify duplicates in the query results from data sources. The approach achieves the same deduplication quality as genetic programming (GP) systems but at a significantly better performance rate (time). A practical implementation of the proposed approach validates the claim.
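The WCSS classifier described above can be illustrated with a minimal sketch: each record pair gets a score that is a weighted sum of per-field similarities, and pairs above a threshold are labeled duplicates. The field names, weights, threshold, and choice of Jaccard token similarity below are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a weighted component similarity summing (WCSS)
# classifier. Fields, weights, threshold, and the Jaccard measure are
# assumptions for illustration, not the published algorithm's details.

def jaccard(a, b):
    """Token-level Jaccard similarity between two field values."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def wcss_score(rec1, rec2, weights):
    """Weighted sum of per-field similarities for one record pair."""
    return sum(w * jaccard(rec1[f], rec2[f]) for f, w in weights.items())

def classify_pair(rec1, rec2, weights, threshold=0.7):
    """Label a pair as a duplicate when the weighted similarity is high."""
    return wcss_score(rec1, rec2, weights) >= threshold

# Example: two near-duplicate book records from a query result.
weights = {"title": 0.6, "author": 0.4}  # assumed field weights
r1 = {"title": "Digital Libraries Overview", "author": "A Smith"}
r2 = {"title": "digital libraries overview", "author": "A. Smith"}
print(classify_pair(r1, r2, weights))  # → True
```

In the full UDD loop, pairs confidently labeled by WCSS would then serve as training examples for the SVM classifier, which in turn refines the labels over further iterations; that feedback loop is omitted here for brevity.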
Jabeen, S. A., Prasanth, Y., & Prasad, G. S. (2019). UDD based procedure for record deduplication over digital storage systems. International Journal of Engineering and Advanced Technology, 8(4), 1850–1856.