Using Latent Dirichlet Allocation for automatic categorization of software

by Kai Tian, Meghan Revelle, Denys Poshyvanyk
2009 6th IEEE International Working Conference on Mining Software Repositories (2009)

Available from ieeexplore.ieee.org
Kai Tian, Meghan Revelle, and Denys Poshyvanyk
Computer Science Department
The College of William and Mary
Williamsburg, VA 23185
{ktian, meghan, denys}@cs.wm.edu


Abstract
In this paper, we propose a technique called LACT
for automatically categorizing software systems in
open-source repositories. LACT is based on Latent
Dirichlet Allocation, an information retrieval method
which is used to index and analyze source code
documents as mixtures of probabilistic topics. For an
initial evaluation, we performed two studies. In the first
study, LACT was compared against an existing tool,
MUDABlue, for classifying 41 software systems written
in C into problem domain categories. The results
indicate that LACT can automatically produce
meaningful category names and yield classification
results comparable to MUDABlue. In the second study,
we applied LACT to 43 software systems written in
different programming languages such as C/C++,
Java, C#, PHP, and Perl. The results indicate that
LACT can be used effectively for the automatic
categorization of software systems regardless of the
underlying programming language or paradigm.
Moreover, both studies indicate that LACT can identify
several new categories that are based on libraries,
architectures, or programming languages, which is a
promising improvement as compared to manual
categorization and existing techniques.
1. Introduction
Open-source software repositories such as
SourceForge.net maintain massive amounts of source
code and software artifacts. To facilitate easier
browsing and searching of such repositories, software
systems are placed into categories (e.g., text editors,
anti-virus tools, and databases). These categories group
systems by their functionality, and classification is
performed manually by users or administrators. This
manual categorization is labor-intensive and
time-consuming, and it requires an understanding of the
underlying functionality of each software system in the repository.
Automatic categorization is a desirable alternative to
the current practice since it eliminates manual effort.
An existing research prototype, MUDABlue [6], has
successfully used Latent Semantic Indexing (LSI) [3],
an Information Retrieval (IR) technique, to
automatically categorize software systems in open-
source software repositories. Latent Dirichlet
Allocation (LDA) [2] is an alternative IR approach in
which documents can be viewed as mixtures of
topics, which may make it more amenable to software
categorization than LSI. If we consider a software
system in an open-source repository to be a document,
the distribution of topics in that document can be used
to automatically place the software system into
categories. In this paper, we propose a novel technique
called LACT for automatically classifying software
systems in open-source repositories. LACT works by
using LDA’s topic-document distributions that are
gleaned from comments and identifiers in source code.
We conducted two initial studies, one aimed at
comparing LACT with MUDABlue on a previously
published dataset, and the other studying LACT when
categorizing software systems written in different
programming languages. The next sections present the
details of LACT and the results of our studies.
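
As a rough illustration of the idea that a system's topic distribution can drive categorization, the sketch below groups systems under every topic whose share in the system reaches a cutoff. It is not the procedure used by LACT, which is described in the following sections. The function name `assign_categories`, the `threshold` value, and the three example systems with their probabilities are all assumptions made for illustration.

```python
from typing import Dict, List


def assign_categories(
    theta: Dict[str, List[float]], threshold: float = 0.2
) -> Dict[int, List[str]]:
    """Group systems under each topic whose probability meets the threshold.

    theta maps a system name to its topic distribution (one probability per
    topic). The threshold is an illustrative assumption, not a value taken
    from the paper.
    """
    categories: Dict[int, List[str]] = {}
    for system, topic_probs in theta.items():
        for topic_id, prob in enumerate(topic_probs):
            if prob >= threshold:
                categories.setdefault(topic_id, []).append(system)
    return categories


# Hypothetical topic distributions for three systems over three topics.
theta = {
    "editor_a": [0.70, 0.25, 0.05],
    "editor_b": [0.60, 0.05, 0.35],
    "db_tool": [0.05, 0.90, 0.05],
}
categories = assign_categories(theta, threshold=0.2)
for topic_id, systems in sorted(categories.items()):
    print(topic_id, systems)
```

Because a document is a mixture of topics, one system can land in several categories at once, which is the property that motivates using LDA here.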
2. Using Latent Dirichlet Allocation for Software Categorization
LDA is a probabilistic topic model originally used in
natural language processing, but it has also been
applied to software artifacts [1, 8-10, 12]. In LDA,
documents are represented as mixtures over latent
topics, and each topic is characterized by a distribution
over words [2]. Given a corpus of documents, LDA
identifies a set of topics, associates a set of words with
each topic, and defines a finite mixture of these topics
for each document. Our proposed technique, LACT,
uses LDA as described in the following steps:
1. Parse software systems. We consider a software
system as a collection of words (i.e., identifiers and
comments). Each system is parsed and represented
as a document in a corpus (see Table 1).
2. Index corpus with LDA. We use GibbsLDA++¹
to index the resulting corpus. Topic-document or

¹ http://gibbslda.sourceforge.net/ (accessed and verified on 03/01/09)
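
Step 1 above can be sketched as follows: each system's source text is reduced to a bag of lowercase words drawn from its identifiers and comments, and each system becomes one document in the corpus. This is a minimal sketch under stated assumptions, not the authors' parser. The splitting of camelCase and underscore identifiers, the absence of keyword or stop-word filtering, and the helper names `source_to_words` and `to_corpus_text` are all assumptions. The output layout (a first line with the document count, then one document per line) follows our understanding of the input format that GibbsLDA++ documents; check it against the tool before relying on it.

```python
import re
from typing import Dict, List

# Identifier-like tokens; this also picks up the words inside comments
# (and, as a simplification, inside string literals).
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Splits camelCase parts, e.g. "parseXMLFile" -> "parse", "XML", "File".
PART_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def source_to_words(source: str) -> List[str]:
    """Turn source text into lowercase words (assumed preprocessing)."""
    words: List[str] = []
    for token in TOKEN_RE.findall(source):
        for chunk in token.split("_"):
            words.extend(part.lower() for part in PART_RE.findall(chunk))
    return words


def to_corpus_text(systems: Dict[str, str]) -> str:
    """Render one document per system, preceded by the document count."""
    lines = [str(len(systems))]
    for source in systems.values():
        lines.append(" ".join(source_to_words(source)))
    return "\n".join(lines) + "\n"


# Two hypothetical, tiny "systems" for illustration only.
systems = {
    "tiny_editor": "/* Open the text buffer */\nint openTextBuffer(char *file_name);",
    "tiny_db": "// run SQL query\nvoid runQuery(DBConn *conn);",
}
corpus_text = to_corpus_text(systems)
print(corpus_text)
```

Language keywords such as `int` and `void` survive this sketch; a real pipeline would likely filter them, but the excerpt does not say how.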