Techniques for data extraction from heterogeneous sources with data security

Abstract

Data extraction is the process of mining or fetching relevant information from unstructured data or from heterogeneous data sources. This paper mines data from three different sources (an online website, flat files, and a database) and analyzes the extracted data in terms of precision, recall, and accuracy. In an environment of heterogeneous data sources, data extraction is a crucial issue, and heterogeneity continues to spread in the present scenario, so this paper addresses the different sources for data extraction and provides a single framework to perform the required tasks. Healthcare data are used to illustrate the processing, starting from extraction out of the three sources, through dividing the records into two clusters based on a threshold value computed with cosine similarity, and finally to the calculation of precision, recall, and accuracy for analysis. When fetching data online, a simple string cannot be retrieved from a website: the backend of each page is HTML, so this paper focuses on extracting the HTML of the page while mining data from a web server. A webpage contains many HTML tags, and not all of them can be removed, because complex tags are beyond the reach of regular expressions. Nevertheless, about 60% filtered data can be attained, as demonstrated in this paper, since most of the unwanted HTML is removed. During filtration it should also be noted that content containing Google APIs cannot be removed, so the filtered data retain the content and tags that do not involve Google APIs. To provide data security during extraction, a connection string is used to prevent tampering with the data. This paper also examines one of the debated concepts in the big data generation, the Data Lake.
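The pipeline the abstract describes (regex-based HTML filtration, a cosine-similarity threshold splitting records into two clusters, and precision/recall/accuracy) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names, the term-frequency vectors, and the sample threshold are assumptions made for the sketch.

```python
import math
import re
from collections import Counter


def strip_html(page: str) -> str:
    """Remove script/style blocks and simple HTML tags with regular
    expressions. Complex or malformed tags can survive this pass, which
    is why only partial (roughly 60%) filtration is achievable."""
    page = re.sub(r"(?is)<(script|style).*?>.*?</\1>", " ", page)
    page = re.sub(r"(?s)<[^>]+>", " ", page)  # simple, well-formed tags
    return re.sub(r"\s+", " ", page).strip()


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over simple term-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def split_by_threshold(docs, query, threshold):
    """Divide documents into two clusters: those whose similarity to the
    query reaches the threshold, and those below it."""
    relevant, other = [], []
    for d in docs:
        (relevant if cosine_similarity(d, query) >= threshold
         else other).append(d)
    return relevant, other


def precision_recall_accuracy(predicted, actual, universe):
    """Standard retrieval metrics over sets of record identifiers."""
    tp = len(predicted & actual)
    fp = len(predicted - actual)
    fn = len(actual - predicted)
    tn = len(universe) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(universe)
    return precision, recall, accuracy
```

The threshold that separates the two clusters would, as in the paper, be derived from the cosine-similarity scores themselves rather than fixed in advance.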
The idea of the Data Lake originates in the field of business. A Data Lake is an architectural approach designed to store all potentially relevant data in a centrally located repository. The data stored in this central repository are fetched from public as well as enterprise sources and are further used for organization, discovery of hidden facts, understanding of new concepts, analysis of stored information, and so on. Because the Data Lake is a new and disruptive concept, its adoption raises many challenges and privacy concerns. This paper also highlights some of the issues posed by the Data Lake.

APA

Kumari, K., & Mrunalini, M. (2019). Techniques for data extraction from heterogeneous sources with data security. International Journal of Recent Technology and Engineering, 8(2), 2152–2159. https://doi.org/10.35940/ijrte.B3254.078219
