An automatic approach to generate corpus in Spanish

Edwin Puertas; Jorge Andres Alvarado-Valencia; Luis Gabriel Moreno-Sandoval; Alexandra Pomares-Quimbaya

Conference Proceedings

Communications in Computer and Information Science (2018) 885 150-161

DOI: 10.1007/978-3-319-98998-3_12

Abstract

A corpus is an indispensable linguistic resource for any natural language processing application. Some corpora have been created manually or semi-automatically for a specific domain. In this paper, we present an automatic approach to generating a corpus from digital information sources such as Wikipedia and web pages. Information is extracted from Wikipedia by delimiting the domain: a propagation algorithm determines the categories associated with a domain region, and a set of seeds delimits the search. Information is extracted from web pages efficiently by determining the patterns associated with the structure of each page, with the purpose of defining the quality of the extraction.
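
The abstract does not specify the propagation algorithm, so the following is only a minimal sketch of one plausible reading: a breadth-first propagation over a category graph that starts from seed categories and stops at a depth limit. The function name `propagate_categories`, the `max_depth` parameter, and the toy category graph are all assumptions made for illustration and are not taken from the paper.

```python
from collections import deque


def propagate_categories(subcats, seeds, max_depth=2):
    """Breadth-first propagation from seed categories (illustrative only).

    subcats   -- dict mapping a category name to its list of subcategories
    seeds     -- iterable of seed categories that delimit the search
    max_depth -- assumed cut-off on how far propagation may travel
    Returns a dict mapping each reached category to its distance from a seed.
    """
    depth = {seed: 0 for seed in seeds}
    queue = deque(depth)
    while queue:
        category = queue.popleft()
        if depth[category] == max_depth:
            continue  # do not expand categories at the depth limit
        for child in subcats.get(category, ()):
            if child not in depth:  # visit each category only once
                depth[child] = depth[category] + 1
                queue.append(child)
    return depth


# Hypothetical toy category graph, not real Wikipedia data.
toy_graph = {
    "Medicina": ["Enfermedades", "Farmacología"],
    "Enfermedades": ["Enfermedades infecciosas"],
    "Enfermedades infecciosas": ["Virología"],
    "Farmacología": [],
}

domain = propagate_categories(toy_graph, ["Medicina"], max_depth=2)
```

In this sketch the depth limit plays the role of delimiting the domain region: "Virología" sits three steps from the seed and is therefore left out. A real pipeline would need to obtain the category graph from Wikipedia itself and would likely use a more selective stopping criterion than a fixed depth.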

Cite

Puertas, E., Alvarado-Valencia, J. A., Moreno-Sandoval, L. G., & Pomares-Quimbaya, A. (2018). An automatic approach to generate corpus in Spanish. In Communications in Computer and Information Science (Vol. 885, pp. 150–161). Springer Verlag. https://doi.org/10.1007/978-3-319-98998-3_12
