Sign up & Download
Sign in

S2E : A Platform for In-Vivo Multi-Path Analysis of Software Systems

by Vitaly Chipounov, Volodymyr Kuznetsov, George Candea
Security (2011)

Abstract

This paper presents S2E, a platform for analyzing the properties and behavior of software systems. We demonstrate S2E's use in developing practical tools for comprehensive performance profiling, reverse engineering of proprietary software, and bug finding for both kernel-mode and user-mode binaries. Building these tools on top of S2E took less than 770 LOC and 40 person-hours each. S2E's novelty consists of its ability to scale to large real systems, such as a full Windows stack. S2E is based on two new ideas: selective symbolic execution, a way to automatically minimize the amount of code that has to be executed symbolically given a target analysis, and relaxed execution consistency models, a way to make principled performance/accuracy trade-offs in complex analyses. These techniques give S2E three key abilities: to simultaneously analyze entire families of execution paths, instead of just one execution at a time; to perform the analyses in-vivo within a real software stack-user programs, libraries, kernel, drivers, etc.-instead of using abstract models of these layers; and to operate directly on binaries, thus being able to analyze even proprietary software. Conceptually, S2E is an automated path explorer with modular path analyzers: the explorer drives the target system down all execution paths of interest, while analyzers check properties of each such path (e.g., to look for bugs) or simply collect information (e.g., count page faults). Desired paths can be specified in multiple ways, and S2E users can either combine existing analyzers to build a custom analysis tool, or write new analyzers using the S2E API.

Cite this document (BETA)

Available from portal.acm.org
Page 1
hidden

S2E : A Platform for In-Vivo Multi-Path Analysis of Software Systems

S2E: A Platform for
In-Vivo Multi-Path Analysis of Software Systems
Vitaly Chipounov, Volodymyr Kuznetsov, George Candea
School of Computer and Communication Sciences
´Ecole Polytechnique Fe´de´rale de Lausanne (EPFL), Switzerland
{vitaly.chipounov,vova.kuznetsov,george.candea}@epfl.ch
Abstract
This paper presents S2E, a platform for analyzing the properties and
behavior of software systems. We demonstrate S2E’s use in devel-
oping practical tools for comprehensive performance profiling, re-
verse engineering of proprietary software, and bug finding for both
kernel-mode and user-mode binaries. Building these tools on top of
S2E took less than 770 LOC and 40 person-hours each.
S2E’s novelty consists of its ability to scale to large real sys-
tems, such as a full Windows stack. S2E is based on two new ideas:
selective symbolic execution, a way to automatically minimize the
amount of code that has to be executed symbolically given a target
analysis, and relaxed execution consistency models, a way to make
principled performance/accuracy trade-offs in complex analyses.
These techniques give S2E three key abilities: to simultaneously
analyze entire families of execution paths, instead of just one exe-
cution at a time; to perform the analyses in-vivo within a real soft-
ware stack—user programs, libraries, kernel, drivers, etc.—instead
of using abstract models of these layers; and to operate directly on
binaries, thus being able to analyze even proprietary software.
Conceptually, S2E is an automated path explorer with modular
path analyzers: the explorer drives the target system down all ex-
ecution paths of interest, while analyzers check properties of each
such path (e.g., to look for bugs) or simply collect information (e.g.,
count page faults). Desired paths can be specified in multiple ways,
and S2E users can either combine existing analyzers to build a cus-
tom analysis tool, or write new analyzers using the S2E API.
Categories and Subject Descriptors D.2.4 [Software/Program
Verification]
General Terms Reliability, Verification, Performance, Security
1. Introduction
System developers routinely need to analyze the behavior of what
they build. One basic analysis is to understand observed behavior,
such as why a given web server is slow on a SPECweb benchmark.
More sophisticated analyses aim to characterize future behavior in
previously unseen circumstances, such as what will a web server’s
maximum latency and minimum throughput be, once deployed at
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. To copy otherwise, to republish, to post on servers or to redistribute
to lists, requires prior specific permission and/or a fee.
ASPLOS’11, March 5–11, 2011, Newport Beach, California, USA.
Copyright © 2011 ACM 978-1-4503-0266-1/11/03. . . $10.00
a customer site. Ideally, system designers would also like to be
able to do quick what-if analyses, such as determining whether
aligning a certain data structure on a page boundary will avoid all
cache misses and thus increase performance. For small programs,
experienced developers can often reason through some of these
questions based on code alone. The goal of our work is to make it
feasible to answer such questions for large, complex, real systems.
We introduce in this paper a platform that enables easy con-
struction of analysis tools (such as oprofile, valgrind, bug finders,
or reverse engineering tools) that simultaneously offer the follow-
ing three properties: (1) they efficiently analyze entire families of
execution paths; (2) they maximize realism by running the analy-
ses within a real software stack; and (3) they are able to directly
analyze binaries. We explain these properties below.
First, predictive analyses often must reason about entire fami-
lies of paths through the target system, not just one path. For exam-
ple, security analyses must check that there exist no corner cases
that could violate a desired security policy; recent work has em-
ployed model checking [29] and symbolic execution [11] to find
bugs in real systems—these are all multi-path analyses. One of
our case studies demonstrates multi-path analysis of performance
properties: instead of profiling solely one execution path, we derive
performance envelopes that characterize the performance of entire
families of paths. Such analyses can check real-time requirements
(e.g., that an interrupt handler will never exceed a given bound on
execution time), or can help with capacity planning (e.g., deter-
mine how many web servers to provision for a web farm). In the
end, properties shown to hold for all paths constitute proofs, which
are in essence the ultimate prediction of a system’s behavior.
Second, an accurate estimate of program behavior often requires
taking into account the whole environment surrounding the ana-
lyzed program: libraries, kernel, drivers, etc.—in other words, it
requires in-vivo1 analysis. Even small programs interact with their
environment (e.g., to read/write files or send/receive network pack-
ets), so understanding program behavior requires understanding the
nature of these interactions. Some tools execute the real environ-
ment, but allow calls from different execution paths to interfere
inconsistently with each other [12, 18]. Most approaches abstract
away the environment behind a model [2, 11], but writing abstract
models is labor-intensive (taking in some cases multiple person-
years [2]), models are rarely 100% accurate, and they tend to lose
1 In vivo is Latin for “within the living” and refers to experimenting using
a whole live system; in vitro uses a synthetic or partial system. In life sci-
ences, in vivo testing—animal testing or clinical trials—is often preferred,
because, when organisms or tissues are disrupted (as in the case of in vitro
settings), results can be substantially less representative. Analogously, in-
vivo program analysis captures all interactions of the analyzed code with its
surrounding system, not just with a simplified abstraction of that system.
1
Page 2
hidden
accuracy as the modeled system evolves. It is therefore preferable
that target programs interact directly with their real environment
during analysis in a way that keeps multi-path analysis consistent.
Third, real systems are made up of many components from
various vendors; access to all corresponding source code is rarely
feasible and, even when source code is available, building the code
exactly as in the shipped software product is difficult [5]. Thus, in
order to be practical, analyses ought to operate directly on binaries.
Scalability is the key challenge of performing analyses that are
in-vivo, multi-path, and operate on binaries. Going from single-
path analysis to multi-path analysis turns a linear problem into an
exponential one, because the number of paths through a program
increases exponentially in the number of branches—the “path ex-
plosion” problem [7]. It is therefore not feasible today to execute
fully symbolically an entire software stack (programs, libraries, OS
kernel, drivers, etc.) as would be necessary if we wanted consistent
in-vivo multi-path analysis.
We describe in this paper S2E, a general platform for developing
multi-path in-vivo analysis tools that are practical even for large,
complex systems, such as an entire Windows software stack. First,
S2E simultaneously exercises entire families of execution paths in a
scalable manner by using selective symbolic execution and relaxed
execution consistency models. Second, S2E employs virtualization
to perform the desired analyses in vivo; this removes the need
for the stubs or abstract models required by most state-of-the-art
symbolic execution engines and model checkers [3, 11, 18, 29, 36].
Third, S2E uses dynamic binary translation to directly interpret x86
machine code, so it can analyze a wide range of software, including
proprietary systems, even if self-modifying or JITed, as well as
obfuscated and packed/encrypted binaries.
The S2E platform offers an automated path exploration mech-
anism and modular path analyzers. The explorer drives in parallel
the target system down all execution paths of interest, while ana-
lyzers check properties of each such path (e.g., to look for bugs)
or simply collect information (e.g., count page faults). An analysis
tool built on top of S2E glues together path selectors with path ana-
lyzers. Selectors guide S2E’s path explorer by specifying the paths
of interest: all paths that touch a specific memory object, paths in-
fluenced by a specific parameter, paths inside a target code module,
etc. Analyzers can be pieced together from S2E-provided analyzers,
or can be written from scratch using the S2E API.
S2E comes with ready-made selectors and analyzers that pro-
vide a wide range of analyses out of the box. The typical S2E user
only needs to define in a configuration file the desired selector(s)
and analyzer(s) along with the corresponding parameters, start up
the desired software stack inside the S2E virtual machine, and run
the S2E launcher in the guest OS, which starts the desired applica-
tion and communicates with the S2E VM underneath. For example,
one may want to verify the code that handles license keys in a
proprietary program, such as Adobe Photoshop. The user installs
the program in the S2E Windows VM and launches the program
using s2e.exe C:\Program Files\Adobe\Photoshop. From inside
the guest OS, the s2e.exe launcher communicates with S2E via
custom opcodes (described in §4). In the S2E configuration file,
the tester may choose a memory-checker analyzer along with a
path selector that returns a symbolic string whenever Photoshop
reads HKEY LOCAL MACHINE\Software\Photoshop\LicenseKey
from the Windows registry. S2E then automatically explores the
code paths in Photoshop that are influenced by the value of the
license key and looks for memory safety errors along those paths.
Developing a new analysis tool with S2E takes on the order of
20-40 person-hours and a few hundred LOC. To illustrate S2E’s gen-
erality, we present here three very different tools built using S2E: a
multi-path in-vivo performance profiler, a reverse engineering tool,
and a tool for automatically testing proprietary software.
This paper makes the following four contributions:
• Selective symbolic execution, a new technique for automatic
bidirectional symbolic–concrete state conversion that enables
execution to seamlessly and correctly weave back and forth
between symbolic and concrete mode;
• Execution consistency models, a systematic way to reason
about the trade-offs involved in over/under-approximation of
paths in software system analyses;
• A general platform for performing diverse in-vivo multi-path
analyses in a way that scales to large real systems;
• The first use of symbolic execution in performance analysis.
In the rest of the paper, we describe selective symbolic execu-
tion (§2), execution consistency models (§3), S2E’s APIs for devel-
oping analysis tools (§4), the S2E prototype (§5), evaluation (§6),
related work (§7), and conclusions (§8).
2. Selective Symbolic Execution
In devising a way to efficiently exercise entire families of paths, we
were inspired by the successful use of symbolic execution [22] in
automated software testing [11, 18]. The idea is to treat a program
as a superposition of possible execution paths. For example, a
program that is all linear code except for one conditional statement
if (x>0) then ... else ... can be viewed as a superposition of two
possible paths: one for x>0 and another one for x≤0. To exercise
all paths, it is not necessary to try all possible values of x, but rather
just one value greater than 0 and one value less than 0.
We unfurl this superposition of paths into a symbolic execution
tree, in which each possible execution corresponds to a path from
the root of the tree to a leaf corresponding to a terminal state. The
mechanics of doing so consist of marking variables as symbolic at
the beginning of the program, i.e., instead of allowing a variable x
to take on a concrete value (say, x=5), it is viewed as a superposi-
tion λ of all possible values x could take. Then, any time a branch
instruction is conditioned on a predicate p that depends (directly
or indirectly) on x, execution is split into two executions Ei and
Ek, two copies of the program’s state are created, and Ei’s path
remembers that the variables involved in p must be constrained to
make p true, while Ej’s path remembers that p must be false.
The process repeats recursively: Ei may further split into Eii
and Eik , and so on. Every execution of a branch statement creates
a new set of children, and thus what would normally be a linear
execution (if concrete values were used) now turns into a tree
of executions (since symbolic values are used). A node s in the
tree represents a program state (a set of variables with formulae
constraining the variables’ values), and an edge si → sj indicates
that sj is si’s successor on any path satisfying the constraints in
sj . Paths in the tree can be pursued simultaneously, as the tree
unfurls; since program state is copied, the paths can be explored
independently. Copy-on-write is used to make this process efficient.
S2E is based on the key observation that often only some fami-
lies of paths are of interest. For example, one may want to exhaus-
tively explore all paths through a small program, but not care about
all paths through the libraries it uses or the OS kernel. This means
that, when entering that program, S2E should split executions to ex-
plore the various paths, but whenever it calls into some other part
of the system, such as a library, multi-path execution can cease and
execution can revert to single-path. Then, when execution returns
to the program, multi-path execution must be resumed.
Multi-path execution corresponds to expanding a family of
paths by exploring the various side branches as they appear, while
switching to single-path mode corresponds to corseting the family
of paths. In multi-path mode, the tree grows in width and depth; in
2

Sign up today - FREE

Mendeley saves you time finding and organizing research. Learn more

  • All your research in one place
  • Add and import papers easily
  • Access it anywhere, anytime

Start using Mendeley in seconds!

Already have an account? Sign in

Readership Statistics

30 Readers on Mendeley
by Discipline
 
 
 
by Academic Status
 
30% Ph.D. Student
 
13% Assistant Professor
 
10% Student (Master)
by Country
 
47% United States
 
13% China
 
7% Portugal