Patent documents are abundant, lengthy, and written in highly technical language, so reading and analyzing them is complex and time-consuming. Automatic patent segmentation can help. This work segments the description part of patent texts into semantic sections automatically. Our goal is to develop a robust and scalable segmentation tool that structures patent texts into pre-defined sections and serves as a pre-processing step for patent information retrieval (IR) and information extraction (IE) tasks. To do so, we use an established set of guidelines to define the segments in the description part of the patent text. Based on those guidelines, we developed a segmentation tool called PatSeg that combines several text mining techniques: a rule-based algorithm identifies the headings inside the patent text, a machine learning technique classifies the headings into pre-defined sections, and heuristics identify the sections of the patent text that do not contain headings. Our methods achieve an accuracy of up to 94%. In addition, we propose a big data approach based on Hadoop ecosystem modules to apply our methods to very large collections of patent documents.
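The three-stage pipeline described above (heading detection, heading classification, handling of text without headings) can be illustrated with a minimal Python sketch. This is not PatSeg itself: the heading rule, the section labels, and the cue words in `SECTION_KEYWORDS` are all assumptions made for illustration, and a simple keyword lookup stands in for the machine learning classifier, whose model and features the abstract does not specify. Text that appears before any heading is only tagged `unlabeled`; the heuristics for heading-less sections are not reproduced here.

```python
# Hypothetical section labels and cue words; the abstract does not list them.
SECTION_KEYWORDS = {
    "technical_field": ("field",),
    "background": ("background", "prior art"),
    "summary": ("summary",),
    "drawings": ("drawings", "figures"),
    "detailed_description": ("detailed description", "embodiment"),
}


def is_heading(line):
    """Assumed rule: a short, mostly upper-case line with no end punctuation."""
    text = line.strip()
    if not text or len(text.split()) > 10 or text[-1] in ".;,:":
        return False
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    return sum(c.isupper() for c in letters) / len(letters) > 0.8


def classify_heading(heading):
    """Keyword lookup used as a stand-in for the trained heading classifier."""
    lowered = heading.lower()
    for label, cues in SECTION_KEYWORDS.items():
        if any(cue in lowered for cue in cues):
            return label
    return "other"


def segment(description):
    """Split a description into (section_label, text) pairs, in document order."""
    sections = []
    label, buffer = "unlabeled", []
    for line in description.splitlines():
        if is_heading(line):
            # A new heading closes the section collected so far.
            if buffer:
                sections.append((label, " ".join(buffer)))
            label, buffer = classify_heading(line), []
        elif line.strip():
            buffer.append(line.strip())
    if buffer:
        sections.append((label, " ".join(buffer)))
    return sections
```

Because `segment` processes one document at a time and keeps no shared state, a function of this shape could in principle be applied per document inside a distributed map step, which is one plausible reading of the Hadoop-based approach mentioned above.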
CITATION STYLE
Sofean, M. (2017). Automatic segmentation of big data of patent texts. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10440 LNCS, pp. 343–351). Springer Verlag. https://doi.org/10.1007/978-3-319-64283-3_25