Sourcerer: Mining and searching internet-scale software repositories

Erik Linstead; Sushil Bajracharya; Trung Ngo; Paul Rigor; Cristina Lopes; Pierre Baldi

Journal Article

Data Mining and Knowledge Discovery (2009) 18(2) 300-336

DOI: 10.1007/s10618-008-0118-x

194 Citations

133 Readers

Abstract

Large repositories of source code available over the Internet, or within large organizations, create new challenges and opportunities for data mining and statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, fingerprinting, and database storage of open source software at Internet scale. In one experiment, we gather 4,632 Java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, method call, and lexical containment distributions. We then develop and apply unsupervised, probabilistic, topic and author-topic (AT) models to automatically discover the topics embedded in the code and extract topic-word, document-topic, and AT distributions. In addition to serving as a convenient summary for program function and developer activities, these and other related distributions provide a statistical and information-theoretic basis for quantifying and analyzing source file similarity, developer similarity and competence, topic scattering, and document tangling, with direct applications to software engineering and software development staffing. Finally, by combining software textual content with structural information captured by our CodeRank approach, we are able to significantly improve software retrieval performance, increasing the area under the curve (AUC) retrieval metric to 0.92, roughly 10-30% better than previous approaches based on text alone. A prototype of the system is available at: http://sourcerer.ics.uci.edu. © 2008 Springer Science+Business Media, LLC.

Cite

Linstead, E., Bajracharya, S., Ngo, T., Rigor, P., Lopes, C., & Baldi, P. (2009). Sourcerer: Mining and searching internet-scale software repositories. Data Mining and Knowledge Discovery, 18(2), 300–336. https://doi.org/10.1007/s10618-008-0118-x
