We develop and apply statistical topic models to software as a means of extracting concepts from source code. The effectiveness of the technique is demonstrated on 1,555 projects from SourceForge and Apache, consisting of 113,000 files and 19 million lines of code. In addition to providing an automated, unsupervised solution to the problem of summarizing program functionality, the approach provides a probabilistic framework with which to analyze and visualize source file similarity. Finally, we introduce an information-theoretic approach for computing tangling and scattering of extracted concepts and present preliminary results.
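The abstract does not give the formulas behind the tangling and scattering measures, so the following is only a minimal sketch under stated assumptions. It assumes that a fitted topic model yields a file-by-topic weight matrix, that scattering of a concept can be read as the normalized Shannon entropy of that topic's weight across files, and that tangling of a file can be read as the normalized entropy of that file's weight across topics. The matrix `doc_topic` and the function names are hypothetical and chosen for illustration.

```python
import math

def normalized_entropy(weights):
    """Shannon entropy of a non-negative weight vector, scaled to [0, 1]."""
    total = sum(weights)
    if total <= 0 or len(weights) < 2:
        return 0.0
    # Normalize to a probability distribution, skipping zero entries
    # (the term 0 * log 0 is taken to be 0).
    probs = [w / total for w in weights if w > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    # Divide by the maximum possible entropy, log(n), to land in [0, 1].
    return entropy / math.log(len(weights))

# Hypothetical file-by-topic weights: rows are files, columns are topics.
# In practice these would come from a fitted topic model (assumption).
doc_topic = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.34, 0.33, 0.33],
]

def scattering(doc_topic, topic):
    """How evenly one topic is spread across files (assumed definition)."""
    return normalized_entropy([row[topic] for row in doc_topic])

def tangling(doc_topic, doc):
    """How evenly one file is spread across topics (assumed definition)."""
    return normalized_entropy(doc_topic[doc])

if __name__ == "__main__":
    for t in range(3):
        print(f"scattering of topic {t}: {scattering(doc_topic, t):.3f}")
    for d in range(3):
        print(f"tangling of file {d}: {tangling(doc_topic, d):.3f}")
```

Under these assumptions, a value near 0 indicates a concept confined to one file (or a file devoted to one concept), and a value near 1 indicates an even spread.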