Automatic segmentation of big data of patent texts

Mustafa Sofean

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2017) 10440 LNCS 343-351

DOI: 10.1007/978-3-319-64283-3_25

6 Citations

7 Readers

Abstract

Patent documents are abundant, lengthy, and written in highly technical language, so reading and analyzing them can be complex and time-consuming. Automatic patent segmentation can help. This work attempts to automatically segment the description part of patent texts into semantic sections. Our goal is to develop a robust and scalable segmentation tool that structures patent texts into pre-defined sections and serves as a pre-processing step for patent information retrieval (IR) and information extraction (IE) tasks. To do so, an established set of guidelines is used to define the segments in the description part of the patent text. Based on those guidelines, a segmentation tool called PatSeg is developed from a combination of text mining techniques: a rule-based algorithm identifies the headings inside the patent text, a machine learning technique classifies the headings into pre-defined sections, and heuristics identify the sections of the patent text that do not contain headings. Our methods achieve an accuracy of up to 94%. In addition, we propose a big data approach based on Hadoop ecosystem modules to apply our methods to very large collections of patent documents.
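
The three-stage idea in the abstract (find headings by rule, map each heading to a pre-defined section, handle text that has no heading) can be illustrated with a minimal Python sketch. This is not PatSeg itself: the abstract does not give the actual rules, section labels, or classifier, so the heading pattern `HEADING_RE`, the labels and cue words in `SECTION_CUES`, and the `unlabeled` fallback are all hypothetical. In particular, a simple keyword lookup stands in here for the machine learning classifier.

```python
import re

# Hypothetical heading rule: a short, all-upper-case line without sentence
# punctuation. The paper's real rules are not stated in the abstract.
HEADING_RE = re.compile(r"^[A-Z][A-Z0-9 ,\-]{2,80}$")

# Hypothetical section labels and cue words. This keyword lookup is only a
# stand-in for the machine-learned heading classifier the abstract mentions.
SECTION_CUES = {
    "background": ("BACKGROUND", "PRIOR ART"),
    "summary": ("SUMMARY",),
    "drawings": ("DRAWING", "FIGURE"),
    "detailed_description": ("DETAILED DESCRIPTION", "EMBODIMENT"),
}


def is_heading(line):
    """Return True if the line matches the assumed heading pattern."""
    return bool(HEADING_RE.match(line.strip()))


def classify_heading(heading):
    """Map a heading to a section label via cue words, else 'other'."""
    upper = heading.upper()
    for label, cues in SECTION_CUES.items():
        if any(cue in upper for cue in cues):
            return label
    return "other"


def segment(description):
    """Split a description into (label, text) pairs.

    Text that appears before any heading gets the placeholder label
    'unlabeled'; the paper uses heuristics for such text, which are
    not reproduced here.
    """
    segments = []
    label, buffer = "unlabeled", []
    for line in description.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if is_heading(line):
            if buffer:
                segments.append((label, " ".join(buffer)))
            label, buffer = classify_heading(line), []
        else:
            buffer.append(line.strip())
    if buffer:
        segments.append((label, " ".join(buffer)))
    return segments
```

Calling `segment` on a description string returns the body text grouped under the label of the most recent heading. At scale, a function of this kind could be applied to each document independently, which is what makes the task a natural fit for the Hadoop-based processing the abstract proposes.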

Cite (APA)

Sofean, M. (2017). Automatic segmentation of big data of patent texts. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10440 LNCS, pp. 343–351). Springer Verlag. https://doi.org/10.1007/978-3-319-64283-3_25
