Optimized Term Extraction Method Based on Computing Merged Partial C-Values

Abstract

Assessing the completeness of a document collection with respect to the terminological coverage of a domain of interest is a complicated task that requires substantial computational resources and human effort. Automated term extraction (ATE) is an important step within this task in our OntoElect approach. It outputs the bags of terms extracted from incrementally enlarged partial document collections, which are used for measuring terminological saturation. Saturation is measured iteratively, using our measure of terminological distance between two bags of terms. The bags of retained significant terms extracted at the i-th and (i+1)-st iterations are compared until the terminological distance between them is detected to have fallen below the individual term significance threshold. The flaw of our conventional approach is that the sequence of input datasets is built by adding an increment of several documents to the previous dataset. Hence, the major part of the documents undergoes term extraction repeatedly, which is counter-productive. In this paper, we propose the optimized pipeline based on the modified C-value method and prove its validity. It processes disjoint partitions of a collection rather than incrementally enlarged datasets. It computes partial C-values and then merges them into the resulting bags of terms. We prove that the results of extraction are statistically the same for the conventional and optimized pipelines. We support this formal result with evaluation experiments that demonstrate independence from the document collection and the domain. By comparing the run times, we demonstrate the efficiency of the optimized pipeline. We also show experimentally that the optimized pipeline scales up effectively to process document collections of industrial size.
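
The overall shape of the optimized pipeline can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state the merge rule or the distance formula, so two assumptions are made here purely for illustration: partial scores are merged by per-term summation, and the distance between two bags is the sum of absolute per-term score differences. The names `merge`, `distance`, `saturate`, and `toy_scorer` are hypothetical, and `toy_scorer` is a plain word count standing in for a real partial C-value computation.

```python
from collections import Counter
from typing import Callable, Iterable, Optional, Tuple


def merge(bag: Counter, partial: Counter) -> Counter:
    """Merge partial scores into the running bag.

    ASSUMPTION: per-term summation; the abstract does not give the merge rule.
    """
    merged = Counter(bag)  # copy so the previous bag stays intact
    for term, score in partial.items():
        merged[term] += score
    return merged


def distance(prev: Counter, curr: Counter) -> float:
    """Illustrative distance between two bags of scored terms.

    ASSUMPTION: sum of absolute score differences over all terms;
    this is a stand-in for the terminological distance measure.
    """
    terms = set(prev) | set(curr)
    return sum(abs(curr.get(t, 0.0) - prev.get(t, 0.0)) for t in terms)


def saturate(
    partitions: Iterable[list],
    score_partition: Callable[[list], Counter],
    eps: float,
) -> Tuple[Counter, Optional[int]]:
    """Process disjoint partitions one by one, each exactly once.

    Returns the final bag and the 1-based iteration at which the distance
    fell below eps, or None if that never happened.
    """
    bag: Counter = Counter()
    for i, part in enumerate(partitions, start=1):
        new_bag = merge(bag, score_partition(part))
        if i > 1 and distance(bag, new_bag) < eps:
            return new_bag, i
        bag = new_bag
    return bag, None


def toy_scorer(docs: list) -> Counter:
    """Hypothetical stand-in scorer: word frequency, NOT a C-value."""
    return Counter(w for d in docs for w in d.lower().split())
```

Calling `saturate(partitions, toy_scorer, eps)` on a list of disjoint document lists visits every document once, in contrast to the conventional pipeline, where each enlarged dataset re-processes all earlier documents.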

Cite

Kosa, V., Chaves-Fraga, D., Dobrovolskyi, H., & Ermolayev, V. (2020). Optimized Term Extraction Method Based on Computing Merged Partial C-Values. In Communications in Computer and Information Science (Vol. 1175 CCIS, pp. 24–49). Springer. https://doi.org/10.1007/978-3-030-39459-2_2