We describe an original approach to exploring corpora of scientific texts in PDF format in the area of biomedical research, built around a broad topic of interest, e.g., cancer, thyroid cancer, or a biological process. Our methodology is based on indexing large lists of appropriate key terms and then bi-clustering the resulting term occurrence matrices. In our approach the position of a phrase inside the text (abstract or body) is not considered, but we include statistics based on occurrence frequency. We treat documents as bags of words, and the results are reduced to a list of unique values. Bi-clustering is used to obtain lists of key terms that discriminate between sub-types of the studied category, e.g., different cancers or different sub-classes of a given cancer. We demonstrate the usefulness of the algorithm by searching for lists of genes characteristic of cancer types.
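The pipeline outlined in the abstract (bag-of-words counting of key terms, followed by bi-clustering of the term occurrence matrix) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names `term_occurrence_matrix` and `alternating_bicluster`, the toy texts, and the short gene list are hypothetical, and the naive alternating heuristic is a stand-in rather than the bi-clustering algorithm used in the paper. PDF text extraction is assumed to have been done already.

```python
import re
from collections import Counter

# Toy "documents" (already extracted from PDF) and a toy key-term list.
# Both are illustrative assumptions, not data from the paper.
DOCS = [
    "BRAF mutation and RET rearrangement in thyroid cancer; BRAF is frequent.",
    "RET and BRAF alterations were observed.",
    "BRCA1 and HER2 status in breast cancer; HER2 amplification.",
    "HER2 positive tumours with BRCA1 loss; HER2 targeted therapy.",
]
KEY_TERMS = ["BRAF", "RET", "BRCA1", "HER2"]


def term_occurrence_matrix(docs, key_terms):
    """Rows are documents, columns are key terms, cells are occurrence counts."""
    matrix = []
    for text in docs:
        # Bag of words: token positions are discarded, only frequencies are kept.
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        matrix.append([counts[term.lower()] for term in key_terms])
    return matrix


def alternating_bicluster(matrix, k, n_iter=10):
    """Naive co-clustering heuristic (hypothetical stand-in, not the paper's method).

    Alternately assigns each row to the column cluster that holds most of its
    counts, then each column to the row cluster that holds most of its counts.
    Ties are resolved in favour of the lowest cluster index.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    col_labels = [j % k for j in range(n_cols)]  # arbitrary starting partition
    row_labels = [0] * n_rows
    for _ in range(n_iter):
        for i in range(n_rows):
            mass = [0] * k
            for j in range(n_cols):
                mass[col_labels[j]] += matrix[i][j]
            row_labels[i] = mass.index(max(mass))
        new_col_labels = []
        for j in range(n_cols):
            mass = [0] * k
            for i in range(n_rows):
                mass[row_labels[i]] += matrix[i][j]
            new_col_labels.append(mass.index(max(mass)))
        if new_col_labels == col_labels:
            break  # partition is stable
        col_labels = new_col_labels
    return row_labels, col_labels


if __name__ == "__main__":
    m = term_occurrence_matrix(DOCS, KEY_TERMS)
    doc_clusters, term_clusters = alternating_bicluster(m, k=2)
    for cluster in range(2):
        terms = [t for t, c in zip(KEY_TERMS, term_clusters) if c == cluster]
        print(cluster, terms)
```

On this toy input the two term clusters separate the thyroid-related terms from the breast-related ones. A real corpus would call for a proper bi-clustering method and a much larger key-term list.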
Łancucki, R., Foszner, P., & Polanski, A. (2018). Searching through scientific PDF files supported by bi-clustering of key terms matrices. In Advances in Intelligent Systems and Computing (Vol. 659, pp. 144–153). Springer Verlag. https://doi.org/10.1007/978-3-319-67792-7_15