Using Program Mutation for the Empirical Assessment of Fault Detection Techniques: A Comparison of Concurrency Testing and Model Checking
Abstract
As a result of advances in hardware technology such as multi-core processors, there is an increased need for concurrent software development. Unfortunately, developing correct concurrent code is more difficult than developing correct sequential code. This difficulty is due in part to the many different, possibly unexpected, executions of the program, and leads to the need for special quality assurance techniques for concurrent programs such as randomized testing and formal state space exploration. This thesis focuses on the complementary relationship between such different state-of-the-art quality assurance approaches in an effort to better understand the best bug detection methods for concurrent software. An approach is presented that allows the assessment and comparison of different software quality assurance tools using metrics to measure the effectiveness and efficiency of each technique at finding concurrency bugs. Using program mutation, the assessment method creates a range of faulty versions of a program and then evaluates the ability of various testing and formal analysis tools to detect these faults. The approach is implemented and automated in an experimental mutation analysis framework (ExMAn), which makes the results easy to reproduce. A comparison of the IBM concurrency testing tool ConTest and the NASA model checker Java PathFinder is given to demonstrate the approach.
by
Jeremy S. Bradbury
A thesis submitted to the
School of Computing
in conformity with the requirements for
the degree of Doctor of Philosophy
Queen’s University
Kingston, Ontario, Canada
June 2007
Copyright © Jeremy S. Bradbury, 2007
Co-Authorship
Chapters 4 and 5 were published previously in papers co-authored with my supervisors
James R. Cordy and Juergen Dingel. Both Chapter 4 and Chapter 5 were published in
the proceedings of the 2nd Workshop on Mutation Analysis (Mutation 2006) [BCD06a,
BCD06b]. In Chapter 6, the description of the controlled experiment comparing ConTest
and Java PathFinder with depth-first search was accepted for publication at the 3rd
Workshop on Mutation Analysis (Mutation 2007) in another paper co-authored with my
supervisors [BCD07]. For all three papers I was the primary author and conducted the
research under the supervision of, and in collaboration with, James R. Cordy and Juergen
Dingel. The concepts and ideas in Chapter 3 were published in an earlier form in the
proceedings of the 6th International ACM SIGPLAN-SIGSOFT Workshop on Program Analysis
for Software Tools and Engineering (PASTE 2005) [BCD05].
Acknowledgments
I would like to thank the Natural Sciences and Engineering Research Council of Canada
(NSERC) and the Ontario Graduate Scholarship (OGS) program for their generous financial
support.
I would like to thank my supervisors, Juergen Dingel and Jim Cordy, for all of their
guidance in helping me grow as a researcher and complete my Ph.D. dissertation. They have
both been excellent mentors who have provided me with two great examples of the type of
researcher one should strive to become. I would like to thank the members of my thesis
examination committee for their helpful comments and insights: John Hatcliff, Mohammad
Zulkernine, Tom Dean and Bob Crawford. I would also like to thank all of my friends and
members of the School of Computing who have helped me in one way or another along the
way. In particular I would like to thank: Richard Zanibbi, Dean Jin, Michelle Crane, Chris
McAloney, Derek Shimozawa and Adrian Thurston.
I would like to thank my parents, Goldie and Gerald, my sister, Pamela, and my
grandmothers, Jessie and Gladys, for their love and support. It has taken a long time to
achieve this goal and my family has always supported the choices I have made.
Finally, I would like to thank Michelle Cortes for her love and support. She has always
been there to listen.
Statement of Originality
I, Jeremy Bradbury, certify that the research work presented in this thesis is my own and
was conducted under the supervision of James R. Cordy and Juergen Dingel. All references
to the work of other people are properly cited.
Contents
Abstract
Co-Authorship
Acknowledgments
Statement of Originality
Contents
List of Tables
List of Figures
1 Introduction
1.1 Motivation
1.2 Problem
1.3 Thesis Statement and Scope of Research
1.4 Contributions
1.5 Organization of Thesis
2 Background
2.1 Systems of Interest
2.1.1 Java Concurrency Mechanisms
2.1.2 Concurrent Programs
2.1.3 Consequences of Bugs in Concurrent Programs
2.2 Bug Detection Techniques
2.2.1 Testing
2.2.2 Model Checking
2.2.3 Other Techniques
2.3 Program Mutation
2.3.1 Assessment of Testing with Mutation
2.3.2 Assessment of Formal Analysis with Mutation
2.3.3 Assessment of Hybrid Techniques with Mutation
2.4 Empirical Software Engineering
2.4.1 Field Studies
2.4.2 Controlled Experiments
2.4.3 Benchmarks
2.4.4 Case Studies
3 An Empirical Methodology for Comparing the Effectiveness and Efficiency of Fault Detection Techniques
3.1 Motivation
3.2 Methodology Overview
3.2.1 Metrics
3.3 Experimental Setup
3.4 Experimental Procedure
3.5 Threats to Validity
3.6 Possible Outcomes
3.7 Summary
4 ExMAn: A Generic and Customizable Framework for Experimental Mutation Analysis
4.1 Challenges with Implementing a Mutation Analysis Framework
4.2 Related Work: Existing Mutation Tools
4.3 Overview of ExMAn
4.3.1 Architecture
4.3.2 Process Description
4.4 ExMAn in Practice
4.5 Summary
5 ConMAn: Mutation Operators for Concurrent Java (J2SE 5.0)
5.1 Motivation
5.2 Related Work: Existing Mutation Operators for Java
5.2.1 Method Mutation Operators
5.2.2 Class Mutation Operators
5.3 Bug Patterns for Java Concurrency
5.4 Concurrent Mutation Operators
5.4.1 Modify Parameters of Concurrent Method
5.4.2 Modify the Occurrence of Concurrency Method Calls: Remove, Replace, and Exchange
5.4.3 Modify Keywords: Add and Remove
5.4.4 Switch Concurrent Objects
5.4.5 Modify Critical Region: Shift, Expand, Shrink and Split
5.4.6 Bug Pattern Classification of ConMAn Operators
5.5 Summary
6.1 Experimental Definition
6.2 Experimental Setup
6.2.1 Selection of Approaches for Comparison
6.2.2 Selection of Example Programs
6.2.3 Selection of Mutation Operators
6.2.4 Selection of Quality Artifacts
6.2.5 Selection of Experimental Environment
6.3 Experimental Procedure
6.3.1 Mutant Generation
6.3.2 Analysis
6.4 Experimental Outcome
6.4.1 Effectiveness
6.4.2 Efficiency
6.5 Threats to Validity
6.6 Related Work
6.7 Conclusion
7 Summary and Conclusions
7.1 Summary
7.2 Contributions
7.3 Limitations
7.4 Future Work
7.4.1 Further Empirical Studies
7.4.2 Validating Program Mutation for Concurrent Software
7.4.3 Optimization of Testing and Model Checking Based on Empirical Assessment
7.5 Conclusion
Bibliography
List of Tables
5.1 Method mutation operators [KO91, OLR+96]
5.2 Class mutation operators used in MuJava [MKO02]
5.3 The relationship between the new ConMAn mutation operators and the concurrency features provided by J2SE 5.0
5.4 The ConMAn mutation operators for Java
5.5 Concurrency bug patterns vs. ConMAn mutation operators
6.1 Approaches under comparison: ConTest, JPF Depth-First Search, JPF Random Simulation
6.2 General size metrics for the example programs
6.3 Concurrency metrics for the example programs
6.4 Example programs categorized by thread count
6.5 Example program properties
6.6 The number of mutants generated for each example program
6.7 The distribution of mutants for each example program
6.8 The mutant scores of ConTest, JPF Depth-First Search and JPF Random Simulation for each example program
6.9 The mutant scores of ConTest + JPF Depth-First Search, ConTest + JPF Random Simulation and JPF Depth-First Search + JPF Random Simulation for each example program
6.10 The mutant scores of ConTest in combination with JPF Depth-First Search and JPF Random Simulation for each example program
List of Figures
2.1 TicketsOrderSim: A concurrent simulation of multiple agents selling tickets for a flight
2.2 Example fault in run method of TicketsOrderSim program
2.3 A comparison of conventional testing and testing with ConTest [EFN+02]
2.4 Java PathFinder [JPF]
2.5 The Bandera/Bogor tool set [Ban]
2.6 Results from the Wojcicki and Strooper questionnaire
3.1 Experimental mutation analysis
3.2 High-level experimental methodology activity diagram
3.3 Detailed experimental setup activity diagram
3.4 The relationship between mutation operators, the approaches under comparison and the fault model
3.5 Detailed experimental procedure activity diagram
4.1 ExMAn architecture
4.2 ExMAn process
4.3 ExMAn Tool Profile Creator dialog
4.4 ExMAn Create/Edit Project dialog
6.1 Experimental mutation analysis using the ExMAn framework and the ConMAn operators
6.2 Activity diagram of testing procedure
6.3 Activity diagram of model checking procedure with depth-first search
6.4 Activity diagram of model checking procedure with random simulation
6.5 Detailed mutant results for ConTest, Java PathFinder Depth-First Search and Java PathFinder Random Simulation
6.6 ConTest, JPF Depth-First Search and JPF Random Simulation: mutants detected by all three tools, two tools, one tool or neither tool
6.7 ConTest and JPF Depth-First Search: mutants detected by both, one or neither tool
6.8 ConTest and JPF Random Simulation: mutants detected by both, one or neither tool
1.3 Thesis Statement and Scope of Research
Thesis Statement: For concurrent applications, a mutation analysis approach will allow for a
meaningful assessment of the fault detection capabilities of testing and formal analysis.
In the above statement, the term mutation analysis refers to “...the process of modifying
syntactic software artifacts” [OAL06]. The term fault refers to “An incorrect step, process,
or data definition in a computer program” [IEE02]. The terms testing and formal analysis are
also defined in the standard way. Testing is “The process of operating a system or component
under specified conditions, observing or recording the results, and making an evaluation of
some aspect of the system or component” [IEE02]. Formal analysis is “...a mathematically
well-founded automated technique for reasoning about the semantics of software with respect
to a precise specification of intended behavior for which the sources of unsoundness are
defined” [DHP+07]. An example of a formal analysis technique is model checking, which
assesses the correctness of a temporal logic specification or assertion with respect to a
model of a system or component.
The term assessment refers to the statistical evaluation of testing and formal analysis
using mutation measurements to assess the effectiveness and efficiency of each technique.
The goal of such an assessment is to answer the following kinds of questions:
• “Is testing tool A better or worse at finding faults than model checker B?”
• “Is testing tool A more efficient at finding faults than model checker B?”
The scope of the research will be restricted to concurrent Java applications. The
concurrent Java applications under consideration are multi-threaded programs that have
shared variables. We selected this kind of application for two reasons:
1. Fault Detection Tool Maturity: existing testing and model checking tools have been
used in this domain for a number of years.
• Chapter 7: a summary of the thesis, a list of contributions, a discussion of limitations
and future work and a statement of conclusions.
Background
This chapter provides an overview of the background required for the work presented in
later chapters of this thesis. We first present an overview of the systems of interest for our
research (Section 2.1). That is, we review concurrency mechanisms in Java (J2SE 5.0), an
example of a concurrent Java application, and the types of bugs that are typically exhibited
in concurrent Java applications. Second, we review existing bug detection techniques,
including testing and model checking, that can be used to detect faults in concurrent Java
applications (Section 2.2). Third, we summarize related work that uses program mutation
(Section 2.3). Recall that program mutation is used in our approach to compare different
fault detection techniques. Fourth, we describe empirical software engineering techniques,
including field studies, controlled experiments, benchmarks and case studies, that are used
to better understand fault detection techniques (Section 2.4).
2.1 Systems of Interest
2.1.1 Java Concurrency Mechanisms
While the majority of software systems currently developed in industry are single-threaded
sequential programs, there is mounting evidence that “applications will increasingly need
to be concurrent if they want to fully exploit CPU throughput gains that have now started
becoming available and will continue to materialize over the next several years” [Sut05]. For
example, the Intel Core Duo processor is a dual core processor used in Apple’s MacBook Pro
and Lenovo Thinkpads. In order to fully exploit this processor, as well as other multicore
processors, source code needs to be concurrent. In the past, advances in single processors
gave sequential programs a free speed-up, which will no longer occur with multicore
technologies.
Many imperative programming languages like Java, which are often used in the development
of sequential programs, can also be used for the development of concurrent applications.
For example, Java provides a number of synchronization events (e.g., wait, notifyAll)
for the development of concurrent programs that can affect the scheduling of threads and
access to variables in the shared state [CS98]. The variations in the scheduling of threads
mean that the execution of concurrent Java programs is non-deterministic. The interleaving
space of a concurrent Java program consists of all possible thread schedules [EFBA03].
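To give a feel for how quickly the interleaving space grows, the following minimal sketch counts schedules for a deliberately simplified model. The model is an assumption made for illustration and is not taken from the thesis: each thread executes a fixed sequence of atomic steps, and no synchronization restricts the order in which threads are scheduled. Under that assumption the count is the multinomial coefficient (threads × steps)! / (steps!)^threads. The class and method names are hypothetical.

```java
import java.math.BigInteger;

/** Counts thread schedules for a simplified model (an illustrative assumption). */
class InterleavingCount {

    /** Returns n! as a BigInteger. */
    static BigInteger factorial(int n) {
        BigInteger result = BigInteger.ONE;
        for (int i = 2; i <= n; i++) {
            result = result.multiply(BigInteger.valueOf(i));
        }
        return result;
    }

    /**
     * Number of distinct interleavings of `threads` threads that each execute
     * `steps` atomic steps in a fixed order, with no synchronization:
     * (threads * steps)! / (steps!)^threads.
     */
    static BigInteger count(int threads, int steps) {
        BigInteger numerator = factorial(threads * steps);
        BigInteger denominator = factorial(steps).pow(threads);
        return numerator.divide(denominator);
    }

    public static void main(String[] args) {
        System.out.println(count(2, 2));   // prints 6
        System.out.println(count(3, 3));   // prints 1680
        System.out.println(count(10, 10)); // a very large number
    }
}
```

Real programs have fewer feasible schedules than this bound suggests, because locks and other synchronization rule some orderings out, but the growth is still far too fast to cover by running a test a handful of times.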
Threads. Java concurrency is built around the notion of multi-threaded programs.
The Java documentation for java.lang.Thread defines a thread as “...a thread of execution
in a program.” A typical thread is created and then started using the start() method and
terminates once it has finished running. While a thread is alive it can often alternate
between being runnable and not runnable. A number of methods exist that can affect the
status of a thread:
• sleep(): will cause the current thread to become not runnable for a certain amount of
time.
• yield(): will cause the currently running thread to pause.
• join(): will cause the caller thread to wait for a target thread to terminate.
• wait(): will cause the caller thread to wait until a condition is satisfied. Another thread
notifies the caller that a condition is satisfied using the notify() or notifyAll() method.
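The methods listed above can be seen together in a small sketch. It is not taken from the thesis: the class ThreadLifecycleDemo, its fields and the value 42 are hypothetical, and the lambda syntax requires Java 8 or later rather than the J2SE 5.0 release discussed in the text. A consumer thread waits on a condition, a producer sleeps briefly and then publishes a value with notifyAll(), and the caller uses join() to wait for both threads to terminate.

```java
/** Sketch of start(), sleep(), join(), wait() and notifyAll(); all names are illustrative. */
class ThreadLifecycleDemo {
    private boolean ready = false;   // condition guarded by this object's lock
    private int value = 0;

    /** Blocks the caller until another thread publishes a value. */
    synchronized int awaitValue() throws InterruptedException {
        while (!ready) {   // loop guards against spurious wake-ups
            wait();        // releases the lock; the caller becomes not runnable
        }
        return value;
    }

    /** Publishes a value and wakes every thread waiting on this object. */
    synchronized void publish(int v) {
        value = v;
        ready = true;
        notifyAll();
    }

    static int runDemo() {
        ThreadLifecycleDemo shared = new ThreadLifecycleDemo();
        int[] seen = new int[1];

        Thread consumer = new Thread(() -> {
            try {
                seen[0] = shared.awaitValue();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread producer = new Thread(() -> {
            try {
                Thread.sleep(10);   // not runnable for roughly 10 ms
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            shared.publish(42);
        });

        consumer.start();
        producer.start();
        try {
            consumer.join();   // the caller waits for each thread to terminate
            producer.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException("interrupted while joining", e);
        }
        return seen[0];        // safe to read: join() orders it after the write
    }

    public static void main(String[] args) {
        System.out.println(runDemo());   // prints 42
    }
}
```

The while loop around wait() is the usual idiom: the waiting thread rechecks its condition after every wake-up instead of assuming that a notification implies the condition holds.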
    import java.io.IOException;

    public class Main {
        /*
         * First parameter is the number of threads
         * Second parameter is the cushion
         */
        private static int numberThreads = 10;
        private static int cushion = 3;

        public static void main(String[] args) {
            if (args.length < 2) {
                System.out.println("ERROR: Expected 2 parameters");
            }
            else {
                numberThreads = Integer.parseInt(args[0]);
                cushion = Integer.parseInt(args[1]);
                new Simulator(numberThreads, cushion);
            }
        }
    }

    public class Simulator implements Runnable {
        static int Num_Of_Seats_Sold = 0;
        int Maximum_Capacity, Num_of_tickets_issued;
        boolean StopSales = false;
        Thread threadArr[];

        public Simulator(int size, int cushion) {
            Num_of_tickets_issued = size;
            Maximum_Capacity = Num_of_tickets_issued - cushion;
            threadArr = new Thread[Num_of_tickets_issued];
            /* start the selling of the tickets: create agent threads */
            for (int i = 0; i < Num_of_tickets_issued; i++) {
                System.out.println("Creating thread # " + i);
                threadArr[i] = new Thread(this);
                threadArr[i].start();
            }
        }

        /* the selling agent: makes the sale and checks if limit was reached */
        public void run() {
            synchronized (this) {
                if (StopSales == false) {
                    Num_Of_Seats_Sold++;
                }
                if (Num_Of_Seats_Sold == Maximum_Capacity) {
                    System.out.println("Over capacity");
                    StopSales = true;
                }
            }
            if (Num_Of_Seats_Sold > Maximum_Capacity)
                throw new RuntimeException("bug found - oversold seats!!");
        }
    }
Figure 2.1: TicketsOrderSim: A concurrent simulation of multiple agents selling tickets for a flight
Figure 2.1). This example demonstrates the basics of thread creation in Java as well as
protecting a critical region with synchronization.
The program has two classes: Main and Simulator. The Main class instantiates the
Simulator. The Simulator class has two methods: a constructor and a run method. The
constructor has two parameters which define the number of ticket agent threads (size) and
the number of excess tickets available (cushion). In other words, there are size agents each
selling one ticket and there are actually only size - cushion tickets available. The only
task that the constructor is responsible for is creating and starting all of the agent threads.
The run method defines the behavior of each agent thread. The method contains a
critical region protected by a synchronized block which ensures that only one agent thread
can sell a ticket, update the total number of tickets sold, and set a StopSales flag when
the number of available tickets has been sold. An agent thread has to obtain an implicit
• Livelock: “...similar to deadlock in that the program does not make progress. However,
in a deadlocked computation there is no possible execution sequence which succeeds,
whereas in a livelocked computation there are successful computations, but there
are also one or more execution sequences in which no thread makes progress” [LSW07].
• Starvation: “...An example of starvation is when a thread tries to access a synchronised
block and the JVM always gives the lock to some other waiting thread” [LSW07].
Starvation may occur because of thread priorities.
• Dormancy: “...occurs when a non-runnable thread fails to become runnable” [LSW07].
• Incoincidence: occurs “...through a call completing at the wrong time (excluding
consequences already listed above)” [LSW07].
As we have shown above, the development of concurrent software offers a new set of
challenges not present in the development of sequential code. In addition to the new
consequences of concurrency bugs, concurrent software also has an additional difficulty
regarding the detection of bugs. A bug that leads to a deadlock or race condition may only
occur in a very small number of execution interleavings, which makes some bugs extremely
difficult to detect prior to software deployment.
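The observation that a bug appears in some interleavings but not in others can be made concrete with a toy model. The sketch below is an illustration under stated assumptions, not code from the thesis: two threads each perform an unsynchronized increment that is split into a read step and a write step, and the hypothetical class LostUpdateInterleavings replays every possible schedule deterministically instead of running real threads.

```java
import java.util.ArrayList;
import java.util.List;

/** Enumerates every schedule of two threads that each perform an unsynchronized counter++. */
class LostUpdateInterleavings {

    /** Collects all orderings of the remaining steps of thread 0 and thread 1. */
    static void schedules(String prefix, int left0, int left1, List<String> out) {
        if (left0 == 0 && left1 == 0) {
            out.add(prefix);
            return;
        }
        if (left0 > 0) schedules(prefix + "0", left0 - 1, left1, out);
        if (left1 > 0) schedules(prefix + "1", left0, left1 - 1, out);
    }

    /** Replays one schedule: a thread's first step reads the counter, its second writes read + 1. */
    static int run(String schedule) {
        int counter = 0;
        int[] local = new int[2];
        boolean[] hasRead = new boolean[2];
        for (char c : schedule.toCharArray()) {
            int t = c - '0';
            if (!hasRead[t]) {
                local[t] = counter;       // read step
                hasRead[t] = true;
            } else {
                counter = local[t] + 1;   // write step
            }
        }
        return counter;
    }

    /** Returns {number of schedules, number of schedules that lose an update}. */
    static int[] summarize() {
        List<String> all = new ArrayList<>();
        schedules("", 2, 2, all);
        int faulty = 0;
        for (String s : all) {
            if (run(s) != 2) faulty++;
        }
        return new int[] { all.size(), faulty };
    }

    public static void main(String[] args) {
        int[] r = summarize();
        // prints "6 schedules, 4 lose an update"
        System.out.println(r[0] + " schedules, " + r[1] + " lose an update");
    }
}
```

In this tiny model most schedules fail because the unprotected window covers the whole thread body. When the unprotected window is short relative to the rest of the execution, the failing fraction can be far smaller, which is the situation the paragraph above describes.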
To explore this further let us consider the TicketsOrderSim example from Section 2.1.2.
An example of a fault in this program would be if the synchronized block in the run method
only contained the if statement with the condition StopSales == false and not the if statement
with the condition Num_Of_Seats_Sold == Maximum_Capacity (see Figure 2.2). If this fault were
present in the TicketsOrderSim program, some (but not all) of the interleavings would
exhibit incorrect behavior. For instance, it would be possible for one agent thread to sell
the last available ticket and then for a second agent thread to sell an additional ticket,
with the result that the StopSales flag does not get set. Other possible interleavings would
not exhibit incorrect behavior as long as each agent thread does not get interrupted in
between the if
[Figure 2.3 flowcharts not reproduced. Panel (a), Conventional testing, contains the nodes Run Test, Check Results (Correct / Problem), Fix Bug and Finish. Panel (b), Testing with ConTest, adds the nodes Check Coverage (Target Reached / Not Reached), Rerun Test with heuristically generated interleaving, Record interleaving, Update Coverage and Rerun test using replay.]
Figure 2.3: A comparison of conventional testing and testing with ConTest [EFN+02]
of the software to ensure that the bug no longer occurs (see Figure 2.3(a)).
Due to the non-determinism of the execution of concurrent source code and the high
number of possible interleavings, concurrency testing cannot rely on coverage metrics alone
to guarantee that code is correct. In addition to ensuring that all code is covered, we must
also provide some probabilistic confidence that bugs that manifest themselves in only a few
of the interleavings are found. For example, since a race condition or deadlock may only
occur in a small subset of the possible interleaving space, the more interleavings we test,
the higher our confidence that the bug that caused the race condition or deadlock will be
found. Executing a test case once using one possible interleaving provides the lowest
confidence, while increasing the number of interleavings executed increases our confidence
that bugs will be detected. The current state of the art in concurrency testing is to use the
time duration of testing as a measure of confidence [SL05].
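The relationship between the number of interleavings tested and the resulting confidence can be sketched numerically. The model below is an assumption introduced for illustration and does not appear in the thesis: each test run is treated as an independent draw that hits a failing interleaving with a fixed probability p, so the chance of seeing the bug at least once in n runs is 1 - (1 - p)^n. Real schedulers do not sample interleavings independently or uniformly, so the figures are indicative only. The class name DetectionConfidence is hypothetical.

```java
/** Probability of observing a rare interleaving under an independent-sampling assumption. */
class DetectionConfidence {

    /**
     * Chance that at least one of `runs` executions hits a failing interleaving,
     * assuming each run fails independently with probability `p`.
     */
    static double detectionProbability(double p, int runs) {
        if (p < 0.0 || p > 1.0 || runs < 0) {
            throw new IllegalArgumentException("p must be in [0,1] and runs must be >= 0");
        }
        return 1.0 - Math.pow(1.0 - p, runs);
    }

    public static void main(String[] args) {
        // A bug assumed to be present in 1% of interleavings.
        double p = 0.01;
        for (int runs : new int[] { 1, 10, 100, 1000 }) {
            System.out.printf("%d runs -> %.4f%n", runs, detectionProbability(p, runs));
        }
    }
}
```

Under this assumption a single run gives the lowest confidence, and confidence rises towards 1 as more runs are added, which matches the qualitative argument in the text.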
schedulers in concurrency testing allow for the exploration of different interleavings, they
do not provide the ability to reproduce or replay a race condition or deadlock. In other
words, a test case that produced a bug on one execution is not guaranteed to produce
the bug the next time the code is executed. In general, determining if a bug is fixed by
rerunning a test case is a non-trivial activity. In order to aid replay in concurrency testing,
tools have been developed such as the IBM tool DejaVu [CS98].
2.2.2 Model Checking
Software model checking is a formal methods approach that typically involves developing a
finite state model of a software system and specifying a set of assertions or temporal logic
properties that the software system should satisfy. The model checker determines if the
model of the software system satisfies the specified properties by conducting an exhaustive
state space search. The exhaustive search means that all possible interleavings of the model
of a concurrent system are examined and thus provides a high level of confidence regarding
the quality of the software. Although model checking can provide more confidence than
testing, it usually requires a long time to search the state space.
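The idea of an exhaustive state space search can be illustrated with a toy explicit-state search. The sketch below is not how Java PathFinder or any other real model checker is implemented. It assumes a hand-written model of two threads that acquire two locks in opposite order, where each thread's program counter runs from 0 (nothing acquired) to 4 (terminated), and it performs a depth-first search over all reachable states looking for a deadlock. The class DeadlockSearch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy depth-first state space search for a deadlock in a two-thread, two-lock model. */
class DeadlockSearch {
    static final int DONE = 4;   // pc: 0 acquire first, 1 acquire second, 2 and 3 release, 4 done

    /** True if thread `t` can take its next step in the state given by both program counters. */
    static boolean enabled(int[] pc, int t) {
        int other = pc[1 - t];
        switch (pc[t]) {
            case 0:  return other != 2;               // my first lock is the other's second lock
            case 1:  return other < 1 || other > 3;   // my second lock is the other's first lock
            case 2:
            case 3:  return true;                     // releasing a lock never blocks
            default: return false;                    // thread has terminated
        }
    }

    /** Returns the schedule (thread ids) that reaches a deadlock, or null if none is found here. */
    static List<Integer> search(int[] pc, Set<Integer> visited, List<Integer> path) {
        if (!visited.add(pc[0] * 10 + pc[1])) {
            return null;                              // state already explored
        }
        boolean anyEnabled = false;
        for (int t = 0; t < 2; t++) {
            if (!enabled(pc, t)) continue;
            anyEnabled = true;
            pc[t]++;
            path.add(t);
            List<Integer> found = search(pc, visited, path);
            if (found != null) return found;
            path.remove(path.size() - 1);             // backtrack
            pc[t]--;
        }
        boolean allDone = pc[0] == DONE && pc[1] == DONE;
        if (!anyEnabled && !allDone) {
            return new ArrayList<>(path);             // deadlock: report the path that led here
        }
        return null;
    }

    static List<Integer> findDeadlock() {
        return search(new int[] { 0, 0 }, new HashSet<>(), new ArrayList<>());
    }

    public static void main(String[] args) {
        System.out.println(findDeadlock());   // prints [0, 1]
    }
}
```

The returned schedule says that thread 0 takes its first lock and then thread 1 takes its first lock, after which neither can proceed. The set of visited states is what makes the search exhaustive without exploring the same state twice, and it is also the structure whose size limits how large a model can be checked.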
Typically model checkers are used to prove correctness; however, model checkers also
provide benefits as debuggers. Additionally, many software model checkers allow for
simulation of the model, which can be used like testing to identify possible race conditions
or deadlocks. A shift in the focus of techniques like model checking from proofs of
correctness to debugging and testing has been advocated by a number of researchers including
Rushby [Rus00]. Recent research in formal analysis suggests that this shift is indeed taking
place and increases the practicality and acceptance of formal techniques. An example of
this shift is the roughly quarter of a million assertions in the Microsoft Office code
[Hoa03]. The primary application of these assertions is “...not to the
proof of program correctness, but to the diagnosis of their errors” [Hoa03].
[Figure 2.4 diagram not reproduced. It shows the verification target (a Java bytecode program, *.class and *.jar files, covering system and apps), Core JPF with its Virtual Machine and Search Strategy, the choice generator, library abstraction, MJI, data/scheduling heuristics, the vm listener and search listener (VM observation and search observation), the property checker, and a verification report containing the error path and property violation. The sample verification report reads:]

    *******************************error path
    ..
    Step #11 Thread #0
    oldclassic.java:65 event1.wait_for_event();
    oldclassic.java:37 wait();
    ..
    Step #14 Thread #1
    oldclassic.java:95 event2.wait_for_event();
    oldclassic.java:37 wait();
    *******************************thread stacks
    Thread: Thread-0
    at java.lang.Object.wait(java/lang/Object.java:429)
    at Event.wait_for_event(oldclassic.java:37)
    ..
    Thread: Thread-1
    at java.lang.Object.wait(java/lang/Object.java:429)
    at Event.wait_for_event(oldclassic.java:37)
    ..
    ======================
    1 Error Found: Deadlock

Figure 2.4: Java PathFinder [JPF]
Today’s state-of-the-art software model checkers are automatic, scalable, and only leave
a small semantic gap (if any) between the source artifacts used by developers and the model
artifacts required for analysis. The ability of software model checkers to directly analyze
source code and the increase in size of systems that can be analyzed has helped them
become a viable option for software debugging. For example, in most model checkers a
counter-example is produced if the verification of a property fails. When a counter-example
is produced it can be used to locate the error in source code. Intuitively, the detection
of a property or assertion violation, such as a violation of a method pre-condition, a loop
invariant, a class representation invariant, an interface usage rule, or a temporal property
should be more insightful than simply knowing that there was a failure of a possibly global
test case.
Several software model checkers support the analysis of concurrent Java including Java
PathFinder (JPF) [HP00, VHB+03, JPF], developed at NASA, and the Bandera/Bogor
tool set [CDH+00, HDPR02, RDH03], developed at Kansas State University. We chose to
use JPF for our experiments in this thesis. However, in the future we also plan to conduct
further experiments with Bandera/Bogor.
to defect detection techniques and tools.
BugBench [LLQ+05] is a general defect detection benchmark of 17 programs with real
bugs. Four of the programs in the BugBench benchmark have a concurrency bug. An
HTTP server (HTTPD) contains a data race and three different versions of the database
MSQL contain a data race and two atomicity bugs.
Another example benchmark is the IBM Concurrency Benchmark [EU04]. The IBM
benchmark contains 40 programs ranging in size from 57 to 17000 lines of code. The task
samples include student created programs, tool developer programs, open source programs,
and a commercial product. One of these programs, TicketsOrderSim, was presented in
Section 2.1.2. In Chapter 6 we will use TicketsOrderSim and other example programs from
the IBM Concurrency Benchmark to demonstrate our research methods.
2.4.4 Case Studies
Case studies “...are used primarily for exploratory investigations, both prospectively and
retrospectively, that attempt to understand and explain phenomenon or construct a theory.
They are generally observational or descriptive in nature...” [PSE06]. An example of a case
study involving testing and model checking was conducted by Chockler, Farchi, Glazberg
et al., who compared ConTest and the ExpliSAT model checker within real projects at
IBM [CFG+06]. The results of the case study focused on the usage and the comprehensive-
ness of the results of each tool. Overall, ConTest was found to be easier to use but was not
as comprehensive in identifying potential problems in the software. The comprehensiveness
considered by Chockler et al. is a similar measurement to our effectiveness.
An Empirical Methodology for Comparing the Effectiveness and Efficiency of Fault Detection Techniques
In this chapter, we provide an overview of our empirical methodology. A methodology is
defined as “a body of methods, rules, and postulates employed by a discipline : a particular
procedure or set of procedures” (Merriam-Webster Dictionary). Our methodology is developed
with the intention of being able to compare different fault detection techniques based on
their effectiveness as well as their efficiency at finding faults. We first motivate the need
for a new methodology and approach in Section 3.1. In Section 3.2 we give a high-level
overview of the approach, including definitions of the metrics collected. We outline the
experimental setup methods in Section 3.3 and the experimental procedure in Section 3.4.
In our descriptions of both
3.2.1 Metrics
We collect 4 kinds of metrics in our methodology: mutant score, ability to kill a type of
mutant, ease to kill a mutant and cost to kill a mutant. We have chosen these 4 metrics be-
cause they provide quantitative measurements pertaining to the effectiveness and efficiency
of detecting mutants (proxies for real faults).
To evaluate the effectiveness of a quality assurance technique or tool at killing (detecting)
faults we use the mutant score. Recall that we previously described the mutant score metric
in Section 2.3.1. The mutant score provides a good comparative measurement to quantify
the ability of different fault detection techniques at finding mutant faults.
mutant score of technique t = the percentage of mutants detected (killed) by a tech-
nique t (e.g., testing, model checking)
To evaluate the effectiveness of a quality assurance technique or tool at killing a par-
ticular kind of fault we measure the ability to kill. We use this metric to help identify
any relationships regarding the kinds of faults that are found by a given quality assurance
technique. If we consider each mutant operator as generating a different kind of mutant we
can measure the ability to detect mutants generated by a given operator.
ability to kill mutants of type s by technique t = the percentage of mutants of
type s that are detected (killed) by technique t.
To evaluate the effectiveness of a quality assurance technique or tool at killing a par-
ticular mutant we measure the ease to kill. The ease to kill a mutant is a metric used by
Andrews et al. [ABL05].
ease to kill a mutant m by technique t = the percentage of quality artifacts used by
technique t that detect (kill) mutant m.
To evaluate the efficiency of a quality assurance technique or tool at killing mutants we
measure the cost to kill a mutant. We record both the real time and the CPU time and
then can compare techniques based on either the real or CPU time required to detect faults.
cost to kill a mutant m by technique t = the total time (e.g., real or CPU time) re-
quired by a technique t to detect (kill) a mutant m.
Although we only use the mutant score, ability to kill, ease to kill and cost to kill metrics,
other metrics might also be useful in our methodology. It might also be appropriate to
measure the effectiveness and efficiency in terms of tool-specific metrics. For example, in
concurrency testing, efficiency could be measured as the number of interleavings explored
before a mutant is detected.
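The four metric definitions above can be illustrated with a minimal Python sketch. It is not the implementation used in this thesis. The data layout (a dictionary from mutants to per-artifact kill results), the function names, and the sample mutants are assumptions made for illustration. The cost to kill additionally assumes that artifacts are run one after another and that time is accumulated until the first artifact kills the mutant.

```python
from collections import defaultdict

# kills[mutant][artifact] is True when that artifact detects (kills) the mutant.
# All identifiers and data in this sketch are illustrative assumptions.

def is_killed(kills, mutant):
    return any(kills[mutant].values())

def mutant_score(kills):
    """Percentage of mutants killed by the technique."""
    if not kills:
        return 0.0
    return 100.0 * sum(is_killed(kills, m) for m in kills) / len(kills)

def ability_to_kill(kills, operator_of):
    """Percentage of mutants killed, per mutation operator (mutant type)."""
    total, killed = defaultdict(int), defaultdict(int)
    for m in kills:
        total[operator_of[m]] += 1
        killed[operator_of[m]] += is_killed(kills, m)
    return {op: 100.0 * killed[op] / total[op] for op in total}

def ease_to_kill(kills, mutant):
    """Percentage of the technique's artifacts that kill the given mutant."""
    results = kills[mutant]
    return 100.0 * sum(results.values()) / len(results) if results else 0.0

def cost_to_kill(kills, run_times, mutant):
    """Time accumulated until the first artifact kills the mutant, else None.

    Assumes artifacts run sequentially in the order listed in run_times.
    """
    elapsed = 0.0
    for artifact, seconds in run_times[mutant]:
        elapsed += seconds
        if kills[mutant][artifact]:
            return elapsed
    return None

# Hypothetical results for one technique with two artifacts (t1, t2).
kills = {
    "m1": {"t1": True, "t2": False},
    "m2": {"t1": False, "t2": False},
    "m3": {"t1": True, "t2": True},
    "m4": {"t1": False, "t2": True},
}
operator_of = {"m1": "RNA", "m2": "RNA", "m3": "RJS", "m4": "RJS"}
run_times = {m: [("t1", 1.0), ("t2", 2.5)] for m in kills}
```

The same functions can be applied with either real time or CPU time in run_times, which matches the choice described for the cost to kill metric.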
3.3 Experimental Setup
In this section we define the setup of an experiment based on our methodology. The setup
includes rules and guidelines for selecting the approaches under comparison, the example
programs used in the experiment, the mutation operators used to generate faults, the quality
artifacts used by the approaches under comparison and the experimental environment (see
Figure 3.3).
Selection of Approaches for Comparison. When selecting approaches or tools to
evaluate it is important to ask two questions:
1. Are the approaches or tools intended for the same type of applications? If the tools
are specific to different domains or specific to different types of application it may
be unfair to compare them if the context of the experiment is outside the scope for
which either tool was originally intended. For example, comparing model checking
with a conventional testing technique that does not have any mechanism for exploring
multiple interleavings for concurrent software would make little sense. Nor would it
make sense to compare an analysis technique for embedded systems with one for
large-scale distributed systems.
2. Do the approaches or tools have similar goals? Are they intended to be used for the
same purpose and do they find the same types of faults? For example, comparing the
fault detection capabilities of two tools where one tool is only capable of detecting
deadlocks and the other tool is only capable of detecting race conditions has very little
value.
Figure 3.3: Detailed experimental setup activity diagram
[Activity diagram, listed here in order with the questions asked at each activity and the
guard labels on its outgoing transitions:]
Selection of Approaches for Comparison
(1) Are the approaches intended for the same type of applications?
(2) Do the approaches have similar goals?
Guard to proceed: [Yes to both questions]; otherwise: [No to either question]
Selection of Example Programs
(1) Are the example programs representative of the type of programs each approach is
intended for?
(2) Are the example programs developed by an independent source?
Guard to proceed: [Yes to both questions]; otherwise: [No to either question]
Selection of Mutation Operators
(1) Are the mutant operators systematically created from an existing fault classification?
(2) Are the mutants generated by the mutation operators the same type of faults detectable
using the approaches?
Guard to proceed: [Yes to both questions]; otherwise: [No to either question]
Selection of Quality Artifacts
(1) Are the artifacts of any approach more mature or advanced?
(2) Do the artifacts of one approach provide an advantage over the artifacts of another
approach?
Guard to proceed: [No to both questions]; otherwise: [Yes to either question]
Selection of Experimental Environment
(1) Are there any factors in the experimental environment that can give one approach an
advantage?
(2) Are there any other factors that could affect the results of the experiment in general?
Guard to proceed: [No to both questions]; otherwise: [Yes to either question]
The above activity diagram presents the ideal approach to setting up an experiment in our methodology.
When applying the experimental setup to real experiments we acknowledge that it may not be possible to
answer all of the questions for a given activity before proceeding to the next activity. In cases where
question(s) cannot be addressed it is up to the discretion of the researcher to assess the effects of not
answering the question(s) on the validity of the experiment.
Selection of Example Programs. When selecting example programs for an experi-
ment it is important to ask the following:
1. Are the example programs representative of the type of programs each approach is
intended for? Given a particular domain, are the example programs representative of
all of the programs in the domain with respect to purpose, programming constructs,
size, and usage? For example, if the approaches are designed for the analysis of reactive
systems, then the use of only sequential programs would be inappropriate. An ideal
source of representative programs is a domain-specific benchmark.
2. Are the example programs developed by an independent source? It is not always possible
to get example programs from a third party; however, every effort should be made
to do so in order to reduce potential bias. A source of independently developed programs is
the open source community.
Depending on the approaches under comparison we may be limited in our selection of
example programs. One possible limitation may be the size of the example programs due
to the scalability of one or more of the approaches. In this case we may have no choice but
to choose a set of example programs that are not representative of the type of programs we
are interested in. If this occurs we could narrow the scope of our experiment and discuss
the selection of example programs as a possible threat to validity in our experiment. We
discuss threats to validity in Section 3.5.
Figure 3.4: The relationship between mutation operators, the approaches under comparison and
the fault model
[Diagram: the Mutant Operators m1, m2, …, mn and the Faults Detectable by Approaches
a1, a2, …, ap are both related to the faults f1, f2, …, fq of an Existing Fault Classification.]
Selection of Mutation Operators. When creating or selecting mutation operators
to generate faulty program versions we need to ask two questions:
1. Are the mutation operators systematically created from an existing fault classification?
A fault classification identifies the kinds of faults that occur in a particular domain.
It is important that mutation operators are based on an existing classification to
ensure that operators are representative of real programmer faults. Creating the
mutation operators in a systematic approach ensures the operators are comprehensive.
In Chapter 5 we will present our concurrency mutation operators which are based on
an existing bug pattern taxonomy for Java concurrency [FNU03]. We will also discuss
related operators including the class-level mutation operators [MKO02] which are
based on a fault model of subtype inheritance and polymorphism [OAW+01].
2. Are the mutants generated by the mutation operators the same type of faults detectable
using the approaches? In Figure 3.4 we present an abstract representation of the
relationship between the mutation operators, the fault classification and the fault
detection tools. In order to ensure that the fault detection techniques are at least
capable of detecting the mutants generated by the mutation operators we need to
ensure that the mutation operators and the fault detection tools map to the same
kinds of faults in the classification. For example, if f1 faults were not detectable by
any of the fault detection tools we might want to remove the mutation operator m2
if it only generates f1 faults.
The above questions are primarily concerned with ensuring that the mutants generated
from the operators appear in the underlying fault model and that the mutants are capable
of being detected by the approaches.
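The second question can be made concrete with a small Python sketch that uses the m, f and a notation of Figure 3.4. The mappings below are invented for illustration and are not taken from any real fault classification. The function keeps an operator when at least one fault kind it generates is detectable by at least one approach; a stricter experiment might instead require detectability by every approach.

```python
def select_operators(operator_faults, approach_faults):
    """Split operators into those worth keeping and those to consider removing.

    operator_faults: dict operator -> set of fault kinds the operator generates
    approach_faults: dict approach -> set of fault kinds the approach can detect
    """
    # Fault kinds that at least one approach is capable of detecting.
    detectable = set().union(*approach_faults.values())
    kept = [op for op, faults in operator_faults.items() if faults & detectable]
    removed = [op for op, faults in operator_faults.items() if not faults & detectable]
    return kept, removed

# Hypothetical mappings: m2 only generates f1 faults, which no approach detects.
operator_faults = {"m1": {"f1", "f2"}, "m2": {"f1"}, "m3": {"f3"}}
approach_faults = {"a1": {"f2"}, "a2": {"f2", "f3"}}
kept, removed = select_operators(operator_faults, approach_faults)
```

With these sample mappings, m2 is the only candidate for removal, which mirrors the f1 example given in the text.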
Selection of Quality Artifacts. When selecting quality artifacts (e.g., test cases,
assertions, temporal properties) we should ensure the following questions are addressed:
1. Are the artifacts of any approach more mature or advanced? For example, a com-
parison of testing using a mature test suite with model checking using only a small
number of superficial assertions or properties would bias the experiment towards a
favorable outcome for testing when in fact the difference in the two approaches might
only be attributable to the more mature quality artifacts.
2. Do the artifacts of one approach provide an advantage over the artifacts of another
approach? This question refers to any other factors that may unfairly influence the
comparison. For example, the artifacts used by the approaches might be equally
mature but the artifacts for one approach might cover parts of the code not covered
by the artifacts of the other.
One way to ensure no bias in the quality artifacts used by each approach is to use the
same artifacts for the different approaches. However, this is not always possible.
Selection of the Experimental Environment. When selecting the experimental
environment there are several questions that should be considered:
1. Are there any factors in the experimental environment that can give one approach
an advantage? For example, running an experiment on a multi-processor system
when some of the analysis approaches are not multi-threaded might cause a biased
comparison in terms of efficiency and invalidate the measurements.
2. Are there any other factors that could affect the results of the experiment in general?
This question considers other factors that do not necessarily bias one approach but
may invalidate the measurements. For example, if the experiment is being run on a
shared system then measuring the efficiency of each approach should be done with
respect to CPU time not real time.
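The distinction between real time and CPU time can be sketched in Python as follows. This is only an illustration, not the measurement code used in our experiments. It assumes the analysis tool is started as a child process, and it relies on os.times(), whose child CPU fields are only meaningful on Unix-like systems. The function name timed_run is hypothetical.

```python
import os
import subprocess
import sys
import time

def timed_run(command):
    """Run a command and return (real_seconds, cpu_seconds).

    cpu_seconds is the user plus system CPU time consumed by the child
    process, which is less sensitive to other load on a shared system
    than the elapsed real time.
    """
    before = os.times()
    start = time.perf_counter()
    subprocess.run(command, check=False,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    real = time.perf_counter() - start
    after = os.times()
    cpu = ((after.children_user - before.children_user)
           + (after.children_system - before.children_system))
    return real, cpu

# Example: time a trivial child process (a stand-in for an analysis tool run).
real_seconds, cpu_seconds = timed_run([sys.executable, "-c", "pass"])
```

On a heavily loaded shared machine the real time of the same run can vary widely between repetitions, while the CPU time should stay comparatively stable.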
3.4 Experimental Procedure
In Section 3.3 we outlined how to set up an experiment using our methodology. We now
outline how to compare the fault detection capabilities of different techniques and tools in
our experimental procedure (see Figure 3.5). There are three main steps of the experimental
procedure which are repeated for each example program:
1. Mutant generation: A set of mutation operators is applied to an example program to
generate mutant programs. Each mutant is the example program with one syntactic
change.
2. Analysis: Our approach first analyzes the example program to determine the expected
observable output. The expected output could include any output generated by the
program or the analysis technique. For example, if the analysis approach was testing
we could include the standard command-line output, the standard error produced
by any exceptions, and the timing information. After obtaining the expected output
we analyze each mutant program and compare the mutant output with the expected
output. For example, the output of the mutant can be compared with that of the original
program using the diff program under Linux. The output may have to be normalized before
it is compared. For example, the output of a concurrent example program may have to
be sorted to account for different interleavings.
The analysis process for each technique or tool can be conducted sequentially or
concurrently.
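The comparison in the analysis step, including the optional normalization, can be sketched in Python as follows. This is a minimal illustration rather than the procedure's actual implementation. It uses the standard difflib module in place of the diff program, and the function names and sample outputs are assumptions.

```python
import difflib

def normalize(output, sort_lines=True):
    """Strip trailing whitespace and optionally sort lines.

    Sorting makes outputs that differ only in line order compare as equal,
    which accounts for different interleavings of a concurrent program.
    """
    lines = [line.rstrip() for line in output.splitlines()]
    return sorted(lines) if sort_lines else lines

def compare_outputs(expected, actual, sort_lines=True):
    """Return (killed, diff_lines) for a mutant's output against the expected output."""
    a = normalize(expected, sort_lines)
    b = normalize(actual, sort_lines)
    diff = list(difflib.unified_diff(a, b, "original", "mutant", lineterm=""))
    return a != b, diff

# Hypothetical outputs: same lines in a different order, then a missing line.
expected_output = "A done\nB done\n"
reordered_output = "B done\nA done\n"
truncated_output = "A done\n"
```

Sorting is a trade-off: it avoids reporting harmless reorderings as kills, but it can also hide a fault whose only visible effect is a change in output order.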
Figure 3.5: Detailed experimental procedure activity diagram
[Activity diagram: Mutate Example Program; then, for each analysis approach #1 to #n,
Apply Analysis Approach to Original Program, Apply Analysis Approach to Mutant Program, and
Compare Results of Original Program and Mutant, with guards [More mutants] and
[No more mutants]; then Merge and Display Results, with guards [More example programs] and
[No more example programs].]
between theory and practice” [WRH+00] and “...refer to the extent to which the experimental
setting actually reflects the construct under study” [WRH+00]. In our methodology there
is a potential for threats to construct validity if the fault detection techniques and tools are
not used in the way in which they are intended to be used. We discussed this issue briefly
in Section 3.3 when we outlined the importance of selecting tools with similar goals that
are applied to the same type of applications. If we ensure that the tools have similar goals
we can limit the need to modify how the tools are used.
Conclusion validity. Threats to conclusion validity are “...concerned with issues that
affect the ability to draw the correct conclusion about relations between the treatment and
the outcome of an experiment” [WRH+00]. Several concerns in our methodology related to
conclusion validity are the confidence that our measurements are correct and the statistical
tests used. First, we ensure that our measurements are recorded correctly by automating
the collection of measurements. In Chapter 4 we will present a framework to automate
our experimental methodology. Second, in order to ensure that the statistical tests used to
evaluate the measurements allow for correct conclusions we must ensure that none of the
statistical test assumptions are violated and that we use a test with high enough power. The
statistical test used in different experiments based on our methodology can vary. Therefore,
at the methodology level we can not account for threats to conclusion validity due to the
statistical tests. These threats need to be considered at the experiment level when an
experiment is designed based on the methods outlined in Sections 3.3 and 3.4.
3.6 Possible Outcomes
Recall that the purpose of experiments based on our methodology is to compare the effec-
tiveness and efficiency of fault detection techniques and tools. We now describe the most
probable outcomes of comparing different fault detection techniques and provide specific
examples of possible outcomes when comparing testing and model checking. In Chapter 6
we will conduct an actual experiment using the testing tool ConTest and the model checker
Java PathFinder.
In terms of effectiveness, there are two outcomes that are most likely:
1. The fault detection techniques are complementary. For example, it may be the case
that model checking can find bugs that testing can not find and vice versa. In a
concurrent setting testing may not be able to find certain bugs at all, while model
checking can. In this situation we may be able to identify ways to use the techniques
in combination such that the overall effectiveness of detecting faults is increased.
2. The fault detection techniques are alternatives. For example, it may be the case that
model checking and testing are equally likely to find most of the faults in a concurrent
program. In this situation the use of both techniques in combination would provide
very little, if any, benefit over either approach in isolation.
Although we outline two distinct outcomes with respect to effectiveness, in reality there
is a spectrum of different outcomes. On one end of the spectrum we have completely
complementary approaches which do not detect any of the same faults. On the other
end of the spectrum we have two completely alternative approaches that detect all of the
same faults. In between we have mixed results in which some faults are detected by both
approaches and some are not.
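One way to place an experimental result on this spectrum is to compare the sets of mutants killed by each technique. The Python sketch below is an illustration only. The use of the Jaccard index as the overlap measure is our assumption for this example and is not a metric defined by the methodology; the mutant names are invented.

```python
def compare_effectiveness(killed_a, killed_b):
    """Summarize how two techniques' sets of killed mutants relate.

    overlap is the Jaccard index: 0.0 for completely complementary
    techniques (no shared kills), 1.0 for completely alternative
    techniques (identical kills), and None if neither kills anything.
    """
    both = killed_a & killed_b
    union = killed_a | killed_b
    return {
        "both": both,
        "only_a": killed_a - killed_b,
        "only_b": killed_b - killed_a,
        "overlap": len(both) / len(union) if union else None,
    }

# Hypothetical kill sets for a testing tool and a model checker.
testing_kills = {"m1", "m2", "m3"}
model_checking_kills = {"m3", "m4"}
summary = compare_effectiveness(testing_kills, model_checking_kills)
```

In this example the overlap is 0.25, a mixed result that lies closer to the complementary end of the spectrum.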
In terms of efficiency, there are also a number of possible outcomes:
1. Overall, one fault detection technique is more efficient. For example, if the outcome
is that testing is always more efficient than model checking with a given tool, we
would still like to know by what factor it is more efficient, because the analysis of
the properties might still provide increased insight to developers that can be weighed
against its increased cost. Therefore, we would also like to know the kinds of properties
ExMAn: A Generic and Customizable Framework for Experimental Mutation Analysis
Current mutation analysis tools are primarily used to compare different test suites and
are tied to a particular programming language. In this chapter we present the ExMAn
experimental mutation analysis framework – ExMAn is automated, general and flexible and
allows for the comparison of different quality assurance techniques such as testing, model
checking, and static analysis. ExMAn is an implementation of the empirical methodology
presented in Chapter 3. The goal of ExMAn is to provide a tool that facilitates the mutation-
based comparison of different fault detection approaches by:
• providing a maximal degree of automation, which reduces effort and the possibilities
for errors
• providing a maximal degree of customization, which allows the tool to be used for a
large variety of experiments
• supporting reproducibility of experiments by other researchers
In this chapter we will first provide an overview of the challenges with implementing
a mutation analysis framework in Section 4.1. In Section 4.2 we will provide an overview
of existing mutation analysis tools that have influenced the design and implementation of
ExMAn. In Section 4.3 we will provide a description of ExMAn’s architecture as well as
the functionality of the ExMAn framework. In Section 4.4 we outline 7 usage scenarios for
ExMAn. We will present a summary of the ExMAn framework in Section 4.5.
4.1 Challenges with Implementing a Mutation Analysis Framework
In Section 2.3 we introduced program mutation – a technique that has been used in the
testing community for 30 years. In Chapter 3 we proposed a generalized methodology
called experimental mutation analysis (see Figure 3.1) for comparing the effectiveness and
efficiency of different fault detection techniques. Implementing a generalized experimental
mutation analysis approach to empirically assess different quality assurance techniques is
a challenging problem. A mutation approach that supports the comparison of different
quality techniques would have to provide a high degree of automation and customizability.
The high degree of automation is required to execute the mutation analysis process and is
essential to allow for experimental results to be reproduced. Automation can be achieved
through automatically generated scripts to handle the generation of mutants, the mutant
analysis, and the generation of results such as mutant score. Customizability is necessary
because the approach has to be language and quality artifact independent. On the one hand,
language independence means that pluggable mutation generators and compilers are ideal.
On the other hand, quality artifact independence means the approach should support the
comparison of different pluggable quality assurance tools that use artifacts including test
cases, assertions, temporal logic properties, and more. In the absence of such a framework,
running a wide variety of experiments would mean a considerable amount of effort.
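The idea of automatically generated scripts combined with pluggable tools can be illustrated with a short Python sketch. It does not show how ExMAn generates its scripts. The command template with {src} and {out} placeholders, the tool name run_tool, and the directory names are all hypothetical.

```python
import shlex

def build_analysis_script(tool_command, mutant_dirs, results_dir):
    """Generate a shell script that runs one tool on every mutant.

    tool_command is a user-supplied template with {src} and {out}
    placeholders, so a different tool can be plugged in by changing
    only the template (an assumption made for this sketch).
    """
    lines = ["#!/bin/sh", f"mkdir -p {shlex.quote(results_dir)}"]
    for index, mutant_dir in enumerate(mutant_dirs, start=1):
        out_file = f"{results_dir}/mutant_{index}.out"
        # Quote paths so that directory names containing spaces are safe.
        lines.append(tool_command.format(src=shlex.quote(mutant_dir),
                                         out=shlex.quote(out_file)))
    return "\n".join(lines) + "\n"

# run_tool is a hypothetical command standing in for a real analysis tool.
script = build_analysis_script("run_tool {src} > {out} 2>&1",
                               ["mutants/m1", "mutants/m 2"], "results")
```

Because the tool invocation is a template rather than hard-coded, the same generator can produce scripts for a testing tool, a model checker, or a compiler.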
We have developed the ExMAn (EXperimental Mutation ANalysis) framework as a
realization of our generalized methodology from Chapter 3. That is, ExMAn is a reusable
implementation for building different customized mutation analysis tools for comparing
different quality assurance techniques.
4.2 Related Work: Existing Mutation Tools
There are several mutation tools including Mothra [DGK+88, DO91], Proteum [DM96], and
MuJava [OMK04, MOK05] that our work builds upon. The Mothra tool is a mutation tool
for Fortran programs that allows for the application of method level mutation operators
(e.g., relational operator replacement) (see Table 5.1). The Proteum tool is a mutation
analysis tool for C programs. MuJava is the most recent mutation tool and was designed
for use with Java and includes a subset of the method-level operators available in Mothra
as well as a set of class mutation operators (see Table 5.2) to handle object oriented issues
such as polymorphism and inheritance (e.g., mutate the public keyword into protected).
The difference between ExMAn and these tools is that although each is highly automated
they were designed to apply mutation analysis to testing. Thus, each is programming-language
dependent and assumes only test cases as quality artifacts. Despite this limitation, all of
these tools are excellent for mutation testing and we have learned from their design in
building ExMAn as a flexible alternative.
Figure 4.1: ExMAn architecture
[Architecture diagram. Components shown: Mutant Generator, Compiler (Optional), QA Tool 1 to
QA Tool n, Artifact Generator 1 to Artifact Generator n (Optional), Plugin Interface, Script
Generator & Executor, Source Viewer, Mutant Viewer, Compiler Viewer (Optional), Quality
Artifact Selectors (Tool 1 to Tool n), QA Tool Viewers (Tool 1 to Tool n), Results Generator &
Viewer, and Hybrid Artifact Set Generator. The legend distinguishes built-in components from
external tool or plugin components.]
The architecture consists of built-in components (appear inside dark grey box) and external tool components
and plugin components (appear outside of grey box at top of diagram). The built-in components in the light
grey box provide the ExMAn user interface and allow for control of the external tool components via the
Script Generator & Executor. The plugin components are accessed using a plugin interface. Arrows in the
diagram represent the typical control flow path between components.
4.3 Overview of ExMAn
4.3.1 Architecture
The ExMAn architecture is composed of three kinds of components: built-in components,
plugin components, and external tool components. The built-in components are general
components that are used in all types of experiments (see Figure 4.1). We will discuss
most of the general components in our description of the ExMAn process in Section 4.3.2.
However, we will discuss one important built-in component, the Script Generator & Execu-
tor, now. This built-in component provides the interface to the external tool components
such as a mutant generator. This component builds and executes scripts when requested
Figure 4.4: ExMAn Create/Edit Project dialog
one of the quality assurance tools (e.g., dynamic analysis, testing, model checking)
requires compiled source code as input. The progress of the compilation is reported
in the Compile Viewer. Some mutation generation tools produce mutants that are
not syntactically correct and thus will not compile. Only mutants that compile will
be used in the following steps.
4. Select Quality Artifacts: for each quality assurance analysis tool being analyzed using
mutation a set of quality artifacts is selected. For example, with model checking a set
of temporal logic properties can be selected from a property pool. The property pool
can be generated by an optional Artifact Generator plugin or we can use an existing
property pool. The selection of quality artifacts can be conducted randomly or by
hand using a Quality Artifact Selector & Viewer. For example we could randomly
select 20 properties from a property pool or select them by hand. Each quality artifact
can also be viewed in a dialog interface.
5. Run Analysis with Original Source Code & Mutants: Quality Analysis Tool Viewers
call automatically generated scripts which allow all of the quality assurance tools to
be run automatically. For each tool’s set of quality artifacts, we first evaluate each
artifact using the original source to determine the expected outputs. Next we evaluate
the artifacts for all of the mutant versions of the original program. During this step
all of the tool analysis results and analysis execution times of each artifact with each
program version are recorded and the progress is reported. Quality Analysis Tool
Viewers also provide an interface to customize the running of the analysis by placing
limits on the size of output and the amount of CPU time. For example, a mutant
might cause the original program to go into an infinite loop and never terminate
which would be a problem if we are evaluating a test suite. Fortunately, the user can
account for this by placing relative or absolute limits on the resources used by the
mutant programs. If relative limits are used, the resources used by the original
program are recorded, the resources used by each mutant are monitored, and the
mutant is terminated once it exceeds a relative threshold (e.g., 60 seconds of CPU
time more than the original program).
6. Collection and Display of Results: results using all of the quality assurance tools are
displayed in tabular form in the Results Generator & Viewer. The data presented
includes the quality artifact vs. mutant raw data, the mutant score and analysis
time for each quality artifact and the ease to kill each mutant (i.e. the number of
quality artifacts that kill each mutant). We also can generate hybrid sets of quality
artifacts from all quality assurance tools that have undergone mutation analysis using
the Hybrid Artifact Set Generator. For instance, if different artifacts are used with
different tools we report the combined set of quality artifacts that will achieve the
highest mutant score. Additionally, we can generate the hybrid set of artifacts that
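The selection of a hybrid set that achieves the highest mutant score, as described in step 6, can be sketched in Python with a greedy strategy. This is an assumption made for illustration and not necessarily the algorithm used by the Hybrid Artifact Set Generator. The greedy choice always reaches the highest achievable mutant score, but the resulting set is not guaranteed to be the smallest possible. The artifact names are invented.

```python
def hybrid_artifact_set(kills_by_artifact):
    """Greedily pick artifacts, across all tools, until every killable mutant is covered.

    kills_by_artifact: dict (tool, artifact) -> set of mutants that artifact kills
    Returns the chosen (tool, artifact) pairs in the order they were selected.
    """
    remaining = set().union(*kills_by_artifact.values())
    chosen = []
    while remaining:
        # Pick the artifact that kills the most not-yet-covered mutants.
        best = max(kills_by_artifact,
                   key=lambda a: len(kills_by_artifact[a] & remaining))
        chosen.append(best)
        remaining -= kills_by_artifact[best]
    return chosen

# Hypothetical results: test cases t1, t2 for ConTest and properties p1, p2 for JPF.
kills_by_artifact = {
    ("ConTest", "t1"): {"m1", "m2"},
    ("ConTest", "t2"): {"m2"},
    ("JPF", "p1"): {"m3", "m4"},
    ("JPF", "p2"): {"m4"},
}
hybrid = hybrid_artifact_set(kills_by_artifact)
```

For the sample data, one test case and one property together kill all four mutants, so the remaining artifacts add nothing to the mutant score.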
ConMAn: Mutation Operators for Concurrent Java (J2SE 5.0)
The current version of Java (J2SE 5.0) provides a high level of support for concurrency
in comparison to previous versions. For example, programmers using J2SE 5.0 can now
achieve synchronization between concurrent threads using explicit locks, semaphores, bar-
riers, latches, or exchangers. Furthermore, built-in concurrent data structures such as hash
maps and queues, built-in thread pools, and atomic variables are all at the programmer’s
disposal (see Section 2.1.1 for more details).
We are interested in using mutation analysis to evaluate, compare and improve quality
assurance techniques for concurrent Java programs. Furthermore, we believe that the cur-
rent set of method mutation operators and class operators proposed in the literature are
insufficient to mutate concurrent Java source code because the majority of operators do
not directly mutate the portions of code responsible for synchronization. In this chapter
we will provide an overview of a new set of concurrent mutation operators – the CONcur-
rency Mutation ANalysis (ConMAn) operators. We will justify the ConMAn operators by
categorizing them with an existing bug pattern taxonomy for concurrency. Most of the bug
Java has been proposed in previous work – for instance the MuJava tool [MOK05] discussed
in Section 4.2. Recall that MuJava included two general types of mutation operators for
Java: method level operators [KO91, MOK05] and class level operators [MKO02]. In gen-
eral, the method and class level mutation operators do not directly mutate the synchroniza-
tion portions of the source code in Java (J2SE 5.0) that handle concurrency. Furthermore,
we conjecture that additional operators are needed in order to provide a more comprehen-
sive set of operators that can truly reflect the types of bugs that often occur in concurrent
programs. In this chapter we present a set of concurrent operators for Java (J2SE 5.0). We
believe our new set of concurrency mutation operators, used in conjunction with existing
method and class level operators, provides a more comprehensive set of mutation metrics for
the comparison and improvement of quality assurance testing and analysis for concurrency.
5.2 Related Work: Existing Mutation Operators for Java
Currently, there are two main groups of operators for mutating Java source code: method
and class mutation operators. As previously stated we believe the existing operators are
complementary to our concurrency mutation operators.
5.2.1 Method Mutation Operators
Method level operators [KO91, MOK05] have been used in previous mutation tools for
other programming languages besides Java (e.g., the Mothra tool set for mutating Fortran
programs [DGK+88]). These operators are applied to statements, operands and operators
(see Table 5.1). Operators applied to statements perform actions such as modification,
replacement, and deletion. Operators applied to operands are primarily replacements.
Operators applied to operators include insertion, deletion, and replacement. The sufficient
set of method level mutation operators in Table 5.1 have been implemented in the MuJava
Operator Category | Concurrency Mutation Operators for Java (J2SE 5.0)
Modify Parameters of Concurrent Methods | MXT - Modify Method-X Time (wait(), sleep(), join(), and await() method calls)
Modify Parameters of Concurrent Methods | MSP - Modify Synchronized Block Parameter
Modify Parameters of Concurrent Methods | ESP - Exchange Synchronized Block Parameters
Modify Parameters of Concurrent Methods | MSF - Modify Semaphore Fairness
Modify Parameters of Concurrent Methods | MXC - Modify Permit Count in Semaphore and Modify Thread Count in Latches and Barriers
Modify Parameters of Concurrent Methods | MBR - Modify Barrier Runnable Parameter
Modify the Occurrence of Concurrency Method Calls | RTXC - Remove Thread Method-X Call (wait(), join(), sleep(), yield(), notify(), notifyAll() methods)
Modify the Occurrence of Concurrency Method Calls | RCXC - Remove Concurrency Mechanism Method-X Call (methods in Locks, Semaphores, Latches, Barriers, etc.)
Modify the Occurrence of Concurrency Method Calls | RNA - Replace NotifyAll() with Notify()
Modify the Occurrence of Concurrency Method Calls | RJS - Replace Join() with Sleep()
Modify the Occurrence of Concurrency Method Calls | ELPA - Exchange Lock/Permit Acquisition
Modify the Occurrence of Concurrency Method Calls | EAN - Exchange Atomic Call with Non-Atomic
Modify Keyword | ASTK - Add Static Keyword to Method
Modify Keyword | RSTK - Remove Static Keyword from Method
Modify Keyword | ASK - Add Synchronized Keyword to Method
Modify Keyword | RSK - Remove Synchronized Keyword from Method
Modify Keyword | RSB - Remove Synchronized Block
Modify Keyword | RVK - Remove Volatile Keyword
Modify Keyword | RFU - Remove Finally Around Unlock
Switch Concurrent Objects | RXO - Replace One Concurrency Mechanism-X with Another (Locks, Semaphores, etc.)
Switch Concurrent Objects | EELO - Exchange Explicit Lock Objects
Modify Critical Region | SHCR - Shift Critical Region
Modify Critical Region | SKCR - Shrink Critical Region
Modify Critical Region | EXCR - Expand Critical Region
Modify Critical Region | SPCR - Split Critical Region
Table 5.4: The ConMAn mutation operators for Java
mutation for concurrency was also suggested by Ghosh who proposed two mutation opera-
tors (RSYNCHM and RSYNCHB) for removing the synchronized keyword from methods and
removing synchronized blocks [Gho02]. The operators proposed by Ghosh are equivalent to
the Remove Synchronized Keyword from Method (RSK) and Remove Synchronized Block
(RSB) operators presented later in this chapter.
5.4.1 Modify Parameters of Concurrent Method
These operators involve modifying the parameters of methods for thread and concurrency
classes. Some of the method level mutation operators that modify operands are similar to
the operators proposed here.
MXT - Modify Method-X Timeout
The MXT operator can be applied to the wait(), sleep(), and join() method calls (introduced
in Section 2.1.1) that include an optional timeout parameter. For example, in Java a call
to wait() with the optional timeout parameter will cause a thread to no longer be runnable
until a condition is satisfied or a timeout has occurred. The MXT operator replaces the timeout
parameter, t, of the wait() method by some appropriately chosen fraction or multiple of t
(e.g., t/2 and t*2). We could replace the timeout parameter by a variable of an equivalent
type. However, since we know that the parameter represents a time value it is just as
meaningful to mutate the method to both increase and decrease the time by a factor of 2.
Original Code:
    long time = 10000;
    try {
        wait(time);
    } catch ...

MXT Mutant for wait():
    long time = 10000;
    try {
        wait(time * 2);
        // or replace with time / 2
    } catch ...
The MXT operator with the wait() method is most likely to result in an interference bug
or a data race when the time is decreased. The MXT operator with the sleep() and join()
methods is most likely to result in the sleep() bug pattern. For example, in a situation where
a sleep() or join() is used by a caller thread to wait for another thread, reducing the time
may cause the caller thread to not wait long enough for the other thread to complete.
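The following is a minimal sketch of why a timed join() gives no completion guarantee, which is the situation an MXT mutant with a reduced timeout can expose. The class name TimedJoinSketch and its method are hypothetical and not part of the thesis or of any tool. A latch holds the worker open so that the outcome does not depend on scheduling.

```java
import java.util.concurrent.CountDownLatch;

final class TimedJoinSketch {
    /**
     * Starts a worker that cannot finish until it is released, then performs a
     * timed join. Returns true if the worker was still running when the timed
     * join returned. timeoutMillis must be positive (join(0) waits forever).
     */
    static boolean workerStillAliveAfterJoin(long timeoutMillis) {
        CountDownLatch release = new CountDownLatch(1); // keeps the worker busy
        Thread worker = new Thread(() -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        try {
            worker.join(timeoutMillis);   // returns on timeout even if the worker is unfinished
            boolean alive = worker.isAlive();
            release.countDown();          // let the worker complete
            worker.join();                // untimed join returns only after termination
            return alive;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch the caller proceeds while the worker is still alive for any finite timeout, so code that assumes the worker has completed after a timed join() or a sleep() relies on timing rather than on synchronization.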
The MXT operator can also be applied to the optional timeout parameter in await()
method calls. Both barriers and latches have an await() method. In barriers the await()
method is used to cause a thread to wait until all threads have reached the barrier. In
latches the await() method is used by threads to wait until the latch has finished counting
down, that is until all operations in a set are complete. For example:
Original Code:
    CountDownLatch latch1
        = new CountDownLatch(1);
    ...
    long time = 50;
    latch1.await(time,
        TimeUnit.MILLISECONDS);
    ...

MXT Mutant for await():
    CountDownLatch latch1
        = new CountDownLatch(1);
    ...
    long time = 50;
    latch1.await(time / 2,
        TimeUnit.MILLISECONDS);
    // or replace time with time * 2
    ...
The MXT operator when applied to an await() method call will most likely result in an
interference bug.
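As a small illustration of the timed await() discussed above, the sketch below shows that CountDownLatch.await(long, TimeUnit) returns false when the timeout elapses before the count reaches zero. The class TimedAwaitSketch and its method names are assumptions made for this example only.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class TimedAwaitSketch {
    /** Timed await that reports whether the latch actually reached zero. */
    static boolean awaitLatch(CountDownLatch latch, long timeMillis) {
        try {
            return latch.await(timeMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** The latch is never counted down, so the await can only end by timing out. */
    static boolean awaitWithoutCountDown(long timeMillis) {
        return awaitLatch(new CountDownLatch(1), timeMillis);
    }

    /** The latch is counted down first, so the await succeeds immediately. */
    static boolean awaitAfterCountDown(long timeMillis) {
        CountDownLatch latch1 = new CountDownLatch(1);
        latch1.countDown();
        return awaitLatch(latch1, timeMillis);
    }
}
```

If the caller ignores the boolean result, a shortened timeout lets it continue before the awaited operations are complete, which is one way the interference described above could arise.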
MSP - Modify Synchronized Block Parameter
Common parameters for a synchronized block include the this keyword, indicating that
synchronization occurs with respect to the instance object of the class, and implicit monitor
objects. If the keyword this or an object is used as a parameter for a synchronized block we
can replace the parameter by another object or the keyword this. For example:
Original Code:
    private Object lock1 = new Object();
    private Object lock2 = new Object();
    ...
    public void methodA() {
        synchronized (lock1) { ... }
    }
    ...

MSP Mutant:
    private Object lock1
        = new Object();
    private Object lock2
        = new Object();
    ...
    public void methodA() {
        synchronized (lock2) { ... }
    }
    ...

Another MSP Mutant:
    private Object lock1
        = new Object();
    private Object lock2
        = new Object();
    ...
    public void methodA() {
        synchronized (this) { ... }
    }
    ...
The MSP operator will result in the wrong lock bug pattern.
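One way to observe the wrong lock pattern without relying on thread timing is Thread.holdsLock(), which reports whether the current thread owns a given monitor. The sketch below is illustrative only; the class WrongLockSketch and its method names are hypothetical.

```java
final class WrongLockSketch {
    private final Object lock1 = new Object();
    private final Object lock2 = new Object();

    /** Original: the critical region runs while holding lock1. */
    boolean originalHoldsLock1() {
        synchronized (lock1) {
            return Thread.holdsLock(lock1);
        }
    }

    /** MSP mutant: the block parameter is replaced by lock2, so lock1 is never acquired. */
    boolean mutantHoldsLock1() {
        synchronized (lock2) {
            return Thread.holdsLock(lock1);
        }
    }
}
```

Any other code that still synchronizes on lock1 is therefore no longer mutually exclusive with the mutated region.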
ESP - Exchange Synchronized Block Parameters
If a critical region is guarded by multiple synchronized blocks with implicit monitor locks
the ESP operator exchanges two adjacent lock objects. For example:
acquire() method to obtain a permit may receive one prior to an already waiting thread -
this is known as barging3.
Original Code:
    int permits = 10;
    private final Semaphore sem
        = new Semaphore(permits, true);
    ...

MSF Mutant:
    int permits = 10;
    private final Semaphore sem
        = new Semaphore(permits, false);
    ...
MXC - Modify Concurrency Mechanism-X Count
The MXC operator is applied to parameters in three of Java’s concurrency mechanisms:
semaphores, latches, and barriers. A latch allows a set of threads to count down a set of
operations and a barrier allows a set of threads to wait at a point until a number of threads
reach that point. The count being modified in semaphores is the number of permits, and in
latches and barriers it is the number of threads. We will next provide an example of the
MXC operator for semaphores, latches, and barriers.
The constructor of the Semaphore class has a parameter that refers to the maximum
number of available permits that are used to limit the number of threads accessing
the shared resource. Access is acquired using the acquire() method and released using the
release() method. Both the acquire() and release() method calls have optional count parameters
referring to the number of permits being acquired or released. The MXC operator modifies
the number of permits, p, in calls to these methods by decrementing (p--) and incrementing
(p++) it by 1. For example:
Original Code:
    int permits = 10;
    private final Semaphore sem
        = new Semaphore(permits, true);
    ...

MXC Mutant for a Semaphore:
    int permits = 10;
    private final Semaphore sem
        = new Semaphore(permits--, true);
    ...
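The effect of a changed permit count can be sketched by counting how many non-blocking acquisitions succeed. The class PermitCountSketch and its methods are hypothetical names for this example. Because a post-decrement expression in Java evaluates to the value before the decrement, the sketch passes the reduced count explicitly (permits - 1) and includes a small check of that evaluation rule.

```java
import java.util.concurrent.Semaphore;

final class PermitCountSketch {
    /** Counts how many tryAcquire() calls succeed on a fair semaphore created with the given permits. */
    static int successfulAcquires(int permits, int attempts) {
        Semaphore sem = new Semaphore(permits, true);
        int granted = 0;
        for (int i = 0; i < attempts; i++) {
            if (sem.tryAcquire()) {
                granted++;
            }
        }
        return granted;
    }

    /** Shows what a literal post-decrement argument passes to the constructor. */
    static int permitsWithPostDecrement() {
        int permits = 10;
        Semaphore sem = new Semaphore(permits--, true); // passes 10, then permits becomes 9
        return sem.availablePermits();
    }
}
```

With one permit fewer, one fewer thread can enter the guarded region at a time; with one permit more, the semaphore admits a thread beyond the intended limit.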
A potential bug that can occur from modifying permit counts in Semaphores. In the above
3 java.util.concurrent documentation
Original Code:
    int permits = 10;
    private final Semaphore sem = new Semaphore(permits, true);
    ...
    sem.acquire();
    ...
    sem.release();
    ...

RCXC Mutant for a Semaphore:
    int permits = 10;
    private final Semaphore sem
        = new Semaphore(permits, true);
    ...
    // removed sem.acquire();
    ...
    sem.release();
    ...

Another RCXC Mutant for a Semaphore:
    int permits = 10;
    private final Semaphore sem
        = new Semaphore(permits, true);
    ...
    sem.acquire();
    ...
    // removed sem.release();
    ...
Due to the similar nature of applying the RCXC operator for other concurrency mechanisms
we will not provide any additional examples.
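A minimal sketch of how the two semaphore mutants above change the permit count is given below. The class RemovedCallSketch and its methods are hypothetical. It uses tryAcquire() in place of acquire() only so that the example cannot block.

```java
import java.util.concurrent.Semaphore;

final class RemovedCallSketch {
    /**
     * Repeats an acquire/use/release cycle on a semaphore with two permits.
     * When callRelease is false the release() call is "removed", as in the
     * second RCXC mutant, and permits are never returned.
     */
    static int permitsLeft(boolean callRelease, int rounds) {
        Semaphore sem = new Semaphore(2, true);
        for (int i = 0; i < rounds; i++) {
            if (!sem.tryAcquire()) {
                break; // out of permits; a blocking acquire() would wait here
            }
            // ... use the shared resource ...
            if (callRelease) {
                sem.release();
            }
        }
        return sem.availablePermits();
    }

    /** First RCXC mutant: release() without a matching acquire() adds a permit. */
    static int permitsAfterReleaseOnly() {
        Semaphore sem = new Semaphore(2, true);
        sem.release();
        return sem.availablePermits();
    }
}
```

Removing release() eventually exhausts the permits, while removing acquire() raises the count above its initial value, so more threads than intended can enter the guarded region.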
RNA - Replace NotifyAll() with Notify()
The RNA operator replaces a notifyAll() with a notify() and is an example of the notify instead
of notify all bug pattern.
Original Code:
    ... notifyAll(); ...

RNA Mutant:
    ... notify(); ...
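The sketch below illustrates the consequence of the RNA mutant with two threads waiting on the same monitor. The class NotifySketch and its members are hypothetical. The sketch polls thread states and uses bounded joins, so its result depends on timing assumptions (both waiters finish within the join timeout once woken, and no spurious wakeup occurs); it is an illustration, not a rigorous test.

```java
final class NotifySketch {
    private final Object monitor = new Object();
    private boolean ready = false;

    /** Starts a daemon thread that waits on the monitor until ready is set. */
    private Thread startWaiter() {
        Thread t = new Thread(() -> {
            synchronized (monitor) {
                while (!ready) {
                    try {
                        monitor.wait();
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Returns how many of two waiters are still blocked after one notification. */
    static int waitersStillBlocked(boolean useNotifyAll) {
        NotifySketch s = new NotifySketch();
        Thread a = s.startWaiter();
        Thread b = s.startWaiter();
        try {
            // Poll until both threads are parked inside wait().
            while (a.getState() != Thread.State.WAITING
                    || b.getState() != Thread.State.WAITING) {
                Thread.sleep(1);
            }
            synchronized (s.monitor) {
                s.ready = true;
                if (useNotifyAll) {
                    s.monitor.notifyAll();
                } else {
                    s.monitor.notify(); // RNA mutant: wakes at most one waiter
                }
            }
            a.join(300);
            b.join(300);
            int blocked = (a.isAlive() ? 1 : 0) + (b.isAlive() ? 1 : 0);
            synchronized (s.monitor) {
                s.monitor.notifyAll(); // clean up any waiter left behind
            }
            return blocked;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Under these assumptions, notify() leaves one waiter blocked even though the condition it waits for has become true, whereas notifyAll() releases both.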
RJS - Replace Join() with Sleep()
The RJS operator replaces a join() with a sleep() and is an example of the sleep() bug pattern.
Original Code:
    ... join(); ...

RJS Mutant:
    ... sleep(10000); ...
ELPA - Exchange Lock/Permit Acquisition
In a semaphore the acquire(), acquireUninterruptibly() and tryAcquire() methods can be used to
obtain one or more permits to access a shared resource. The ELPA operator exchanges one
method for another, which can lead to potential timing changes as well as starvation. For
example, an acquire() method will try to obtain one or more permits and will block and
wait until the permit or permits become available. If the thread that invoked the acquire()
method is interrupted it will no longer continue to block and wait. If the acquire() method
invocation is changed to acquireUninterruptibly() it will behave exactly the same except it can
no longer be interrupted. Thus, in situations where the semaphore is unfair or if for other
reasons the number of requested permits never becomes available, the thread that invoked
acquireUninterruptibly() will stay dormant and wait. If an acquire() method invocation is
changed to a tryAcquire() then a permit will be acquired if one is available; otherwise the
thread will not block and wait. tryAcquire() will acquire a permit or permits unfairly even
if the fairness setting is set to fair. Use of tryAcquire() may cause starvation for threads
waiting for permits.
Original Code:
    int permits = 10;
    private final Semaphore sem = new Semaphore(permits, true);
    ...
    sem.acquire();
    ...

ELPA Mutant:
    int permits = 10;
    private final Semaphore sem = new Semaphore(permits, true);
    ...
    sem.acquireUninterruptibly();
    ...
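The behavioural differences between the three acquisition methods can be sketched in a single thread. The class AcquisitionSketch and its method names are hypothetical. The sketch relies on the documented behaviour that acquire() throws InterruptedException when the calling thread's interrupt status is already set on entry, while acquireUninterruptibly() does not react to interruption.

```java
import java.util.concurrent.Semaphore;

final class AcquisitionSketch {
    /** acquire() on an already interrupted thread throws instead of taking the permit. */
    static boolean acquireThrowsWhenInterrupted() {
        Semaphore sem = new Semaphore(1, true);
        Thread.currentThread().interrupt();
        try {
            sem.acquire();
            Thread.interrupted(); // not expected to be reached; clear the flag anyway
            sem.release();
            return false;
        } catch (InterruptedException e) {
            return true; // throwing the exception also clears the interrupt status
        }
    }

    /** ELPA mutant: acquireUninterruptibly() takes the permit and leaves the interrupt pending. */
    static boolean uninterruptiblyIgnoresInterrupt() {
        Semaphore sem = new Semaphore(1, true);
        Thread.currentThread().interrupt();
        sem.acquireUninterruptibly();
        boolean stillInterrupted = Thread.interrupted(); // read and clear the flag
        sem.release();
        return stillInterrupted;
    }

    /** tryAcquire() never blocks: with no permits available it simply returns false. */
    static boolean tryAcquireOnEmptySemaphore() {
        return new Semaphore(0, true).tryAcquire();
    }
}
```

A caller that relied on interruption to cancel a pending acquire() loses that escape route under the mutant, and a caller switched to tryAcquire() must handle a false result instead of assuming it holds a permit.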
Original Code:
    public synchronized void aMethod()
    { ... }

RSK Mutant:
    public void aMethod()
    { ... }
RSB - Remove Synchronized Block
Similar to the RSK operator, the RSB operator removes the synchronized keyword from
around a statement block which can cause a no lock bug. For example:
Original Code:
    synchronized (this) {
        <statement c1>
    }

RSB Mutant:
    // synchronized (this) is removed
    <statement c1>
    ...
RVK - Remove Volatile Keyword
The volatile keyword is used with a shared variable and prevents operations on the variable
from being reordered in memory with other operations. In the below example we remove
the volatile keyword from a shared long variable. If a long variable, which is 64-bit, is not
declared volatile then reads and writes will be treated as two 32-bit operations instead of one
operation. Therefore, the RVK operator can cause a situation where a nonatomic operation
is assumed to be atomic. For example:
Original Code:
    volatile long x;

RVK Mutant:
    long x;
RFU - Remove Finally Around Unlock
The finally keyword is important in releasing explicit locks. In the example below, finally
ensures that the unlock() method call will occur after a try block regardless of whether or
not an exception is thrown. If finally is removed, the unlock() will not occur in the presence
of an exception, which can cause a blocking critical section bug.
Original Code:
    private Lock lock1
        = new ReentrantLock();
    ...
    lock1.lock();
    try {
        ...
    } finally {
        lock1.unlock();
    }
    ...

RFU Mutant:
    private Lock lock1
        = new ReentrantLock();
    ...
    lock1.lock();
    try {
        ...
    }
    lock1.unlock();
    ...
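The blocking critical section that results from RFU can be sketched with ReentrantLock.isLocked(), which reports whether any thread holds the lock. The class FinallyUnlockSketch and its methods are hypothetical, and the failure is simulated with an unchecked exception. The mutant variant here drops the try statement entirely so that the sketch compiles.

```java
import java.util.concurrent.locks.ReentrantLock;

final class FinallyUnlockSketch {
    private static void failingBody() {
        throw new IllegalStateException("simulated failure in the critical region");
    }

    /** Original pattern: unlock() sits in a finally block. */
    static boolean lockedAfterFailureWithFinally() {
        ReentrantLock lock1 = new ReentrantLock();
        try {
            lock1.lock();
            try {
                failingBody();
            } finally {
                lock1.unlock(); // runs even though failingBody() threw
            }
        } catch (IllegalStateException expected) {
            // the failure propagates, but the lock was released on the way out
        }
        return lock1.isLocked();
    }

    /** RFU-style mutant: unlock() is an ordinary statement after the body. */
    static boolean lockedAfterFailureWithoutFinally() {
        ReentrantLock lock1 = new ReentrantLock();
        try {
            lock1.lock();
            failingBody();
            lock1.unlock(); // skipped when failingBody() throws
        } catch (IllegalStateException expected) {
            // the failure propagates and the lock stays held
        }
        return lock1.isLocked();
    }
}
```

In the mutant variant the lock remains held after the exception, so any other thread that later calls lock1.lock() would block indefinitely.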
5.4.4 Switch Concurrent Objects
When multiple instances of the same concurrent class type exist we can replace one con-
current object with the other.
RXO - Replace One Concurrency Mechanism-X with Another
When two instances of the same concurrency mechanism exist we replace a call to one with
a call to the other. For example, consider the replacement of Lock method calls:
Original Code:
    private Lock lock1
        = new ReentrantLock();
    private Lock lock2
        = new ReentrantLock();
    ...
    lock1.lock();
    ...

RXO Mutant for Locks:
    private Lock lock1
        = new ReentrantLock();
    private Lock lock2
        = new ReentrantLock();
    ...
    // should be call to lock1.lock()
    lock2.lock();
    ...
We can also apply the RXO operator when two or more objects exist of type Semaphore,
CountDownLatch, CyclicBarrier, Exchanger, and more. For example, consider the application of
the RXO operator with two Semaphores and two Barriers:
Original Code:
    <statement n1>
    <statement n2>
    synchronized (this) {
        // critical region
        <statement c1>
        <statement c2>
    }
    <statement n3>
    <statement n4>
    ...

EXCR Mutant:
    <statement n1>
    synchronized (this) {
        <statement n2>
        // critical region
        <statement c1>
        <statement c2>
        <statement n3>
    }
    <statement n4>
    ...
The EXCR operator can also cause correctness issues and consequences such as deadlock
when an expanded critical region overlaps with or subsumes another critical region.
SKCR - Shrink Critical Region
Shrinking a critical region will have similar consequences (interference) to shifting a region
since both the SHCR and SKCR operators move statements that require synchronization
outside the critical section. Below we provide an example of the SKCR operator using a
Lock.
Original Code:
    private Lock lock1
        = new ReentrantLock();
    ...
    public void m1() {
        <statement n1>
        lock1.lock();
        try {
            // critical region
            <statement c1>
            <statement c2>
            <statement c3>
        } finally {
            lock1.unlock();
        }
        <statement n2>
    ...

SKCR Mutant:
    private Lock lock1
        = new ReentrantLock();
    ...
    public void m1() {
        <statement n1>
        // critical region
        <statement c1>
        lock1.lock();
        try {
            <statement c2>
        } finally {
            lock1.unlock();
        }
        <statement c3>
        <statement n2>
    ...
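To make the effect of shrinking the region concrete, the sketch below records, for each of the three critical statements, whether lock1 is held when the statement runs. The class ShrinkRegionSketch and its methods are hypothetical, and the placeholder statements c1 to c3 are modelled by a single helper that queries ReentrantLock.isHeldByCurrentThread().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

final class ShrinkRegionSketch {
    private final ReentrantLock lock1 = new ReentrantLock();
    private final List<Boolean> guarded = new ArrayList<>();

    /** Stands in for a statement c1..c3 and records whether it ran under lock1. */
    private void statement() {
        guarded.add(lock1.isHeldByCurrentThread());
    }

    /** Original: c1, c2 and c3 all execute inside the critical region. */
    List<Boolean> original() {
        guarded.clear();
        lock1.lock();
        try {
            statement(); // c1
            statement(); // c2
            statement(); // c3
        } finally {
            lock1.unlock();
        }
        return new ArrayList<>(guarded);
    }

    /** SKCR mutant: only c2 remains inside the critical region. */
    List<Boolean> skcrMutant() {
        guarded.clear();
        statement(); // c1 now runs without the lock
        lock1.lock();
        try {
            statement(); // c2
        } finally {
            lock1.unlock();
        }
        statement(); // c3 now runs without the lock
        return new ArrayList<>(guarded);
    }
}
```

The statements that were moved outside the region are exactly those for which the recorded flag is false, which is where interference from other threads becomes possible.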
Concurrency Bug Pattern | Mutation Operators
Nonatomic operations assumed to be atomic bug pattern | RVK, EAN
Two-stage access bug pattern | SPCR
Wrong lock or no lock bug pattern | MSP, ESP, EELO, SHCR, SKCR, EXCR, RSB, RSK, ASTK, RSTK, RCXC, RXO
Double-checked locking bug pattern | –
The sleep() bug pattern | MXT, RJS, RTXC
Losing a notify bug pattern | RTXC, RCXC
Notify instead of notify all bug pattern | RNA
Other missing or nonexistent signals bug pattern | MXC, MBR, RCXC
A “blocking” critical section bug pattern | RFU, RCXC
The orphaned thread bug pattern | –
The interference bug pattern | MXT, RTXC, RCXC
The deadlock (deadly embrace) bug pattern | ESP, EXCR, EELO, RXO, ASK
Starvation bug pattern | MSF, ELPA
Resource exhaustion bug pattern | MXC
Incorrect count initialization bug pattern | MXC
Table 5.5: Concurrency bug patterns vs. ConMAn mutation operators
5.5 Summary
We have presented our set of ConMAn mutation operators to be used in the comparison
of different test suites and testing strategies for concurrent Java as well as different fault
detection techniques for concurrency. Although we are primarily interested in our ConMAn
operators as comparative metrics we believe that these operators can also serve a role
similar to method and class level mutation operators as both comparative metrics and
coverage criteria. Our new concurrency operators should be viewed as a complement to, not a
replacement for, the existing operators used in tools like MuJava. For example, using the
ConMAn operators can cause direct concurrency bugs while using the method and class
level operators can cause indirect concurrency bugs. We define direct concurrency bugs as
bugs that result from mistakes in the source code that manages concurrency mechanisms
like synchronization and access to shared variables. We define indirect concurrency bugs
as bugs that occur elsewhere in the source code but may have an effect on concurrency
mechanisms in other parts of a program.


