What Can OSS Mailing Lists Tell Us? A Preliminary Psychometric Text Analysis of the Apache Developer Mailing List
- ISBN: 076952950X
- DOI: 10.1109/MSR.2007.35
Abstract
Developer mailing lists are a rich source of information about Open Source Software (OSS) development. The unstructured nature of email makes extracting information difficult. We use a psychometrically based linguistic analysis tool, the LIWC, to examine the Apache httpd server developer mailing list. We conduct three preliminary experiments to assess the appropriateness of this tool for information extraction from mailing lists. First, using LIWC dimensions that are correlated with the big five personality traits, we assess the personality of four top developers against a baseline for the entire mailing list. The two developers who were responsible for the major Apache releases had similar personalities. Their personalities differed from the baseline and from those of the other developers. Second, the first and last 50 emails for two top developers who have left the project are examined. The analysis shows promise in understanding why developers join and leave a project. Third, we examine word usage on the mailing list for two major Apache releases. The differences may reflect the relative success of each release.
Peter C. Rigby ∗
Software Engineering Group
University of Victoria, B.C., Canada
pcr@uvic.ca
Ahmed E. Hassan
Dept. of Electrical and Computer Engineering
University of Victoria, B.C., Canada
ahmed@ece.uvic.ca
1 Introduction
Compared to most development artifacts, such as source code or bug reports, mailing lists are less structured, allowing discussion of a wider range of topics. These lists embed information about the Open Source Software (OSS) development process, design decisions, and developer characteristics. While flexibility is important during development, it complicates the mining of useful information from mailing lists.
Text analysis tools have been used on a variety of artifacts to understand and predict aspects of the development process. For example, Mockus and Votta [10] used text analysis on CVS logs to characterize changes to the system, such as corrective and perfective changes. More recently, Li et al. [6] combined manual classification, text analysis, and machine learning to classify bug databases. Although these techniques could be applied to the analysis of messages on mailing lists, the range of topics and the ambiguity of the discussion on a mailing list are larger than in a bug database or CVS commit log. This ambiguity makes the classification step more difficult. Instead of creating our own dictionary or extracting one from the large corpus of emails, we use a context-independent text analysis tool from psychology to examine a mailing list. This limits our research to the psychological and social aspects of the mailing list, but affords us greater external validity.
∗ Rigby acknowledges the support of an NSERC CGSD scholarship.
Our primary goal is to assess the usefulness of the Linguistic Inquiry and Word Count (LIWC) tool as a predictor and classifier, as well as a tool to understand the intricacies of OSS development. We conduct three distinct, preliminary experiments on the Apache httpd server's developer mailing list. In the next subsection, we discuss the motivation and rationale for each experiment.
1.1 Overview of our Experiments
What is the personality type of OSS developers?
Raymond [16] describes how modesty and fully acknowledging contributions from others are essential traits of the founders of Perl and Linux. We are unaware of any research that empirically examines whether a particular personality type is successful as an OSS developer. We build on the efforts of others who correlated word counts with the big five personality traits, a standard measure of personality [12], to assess the personality of core OSS developers. Conscientiousness is one of the big five personality traits. Are core Apache developers more diligent than the general mailing list population?
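As a rough illustration of the dictionary-based word counting that underlies this kind of analysis, the sketch below computes, for one message, the percentage of words that fall into each of a few word categories. The category names, the word lists, the function `category_percentages`, and the sample message are all hypothetical; they stand in for the much larger LIWC dictionary, which is not reproduced here, and the sketch is not the procedure used in our experiments.

```python
import re
from collections import Counter

# Hypothetical toy categories for illustration only; the real LIWC
# dictionary is far larger and its contents are not reproduced here.
TOY_CATEGORIES = {
    "first_person": {"i", "me", "my", "mine"},
    "negation": {"no", "not", "never"},
    "positive_emotion": {"good", "great", "thanks", "nice"},
}


def category_percentages(text, categories=TOY_CATEGORIES):
    """Return, per category, the percentage of words in `text` that belong to it."""
    # Lowercase the text and split it into simple alphabetic tokens.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    counts = Counter()
    for word in words:
        for name, vocabulary in categories.items():
            if word in vocabulary:
                counts[name] += 1
    # Report percentages so that long and short messages are comparable.
    return {
        name: (100.0 * counts[name] / total if total else 0.0)
        for name in categories
    }


# Made-up sample message, not taken from the Apache mailing list.
message = "Thanks, the patch looks good. I have not tested my change on Windows."
profile = category_percentages(message)
```

Averaging such per-message profiles over all emails from one developer, and over the entire list, would give the kind of developer-versus-baseline comparison described above, under the assumption that each category has already been related to a personality trait by prior work.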
Do the language and attitude of a developer change as he or she moves from being a new, to a current, to a departing developer?
Fourth International Workshop on Mining Software Repositories (MSR'07)
0-7695-2950-X/07 $20.00 © 2007