
Distilling free-form natural laws from experimental data.

by Michael Schmidt, Hod Lipson
Science (2009)

Abstract

For centuries, scientists have attempted to identify and document analytical laws that underlie physical phenomena in nature. Despite the prevalence of computing power, the process of finding natural laws and their corresponding equations has resisted automation. A key challenge to finding analytic relations automatically is defining algorithmically what makes a correlation in observed data important and insightful. We propose a principle for the identification of nontriviality. We demonstrated this approach by automatically searching motion-tracking data captured from various physical systems, ranging from simple harmonic oscillators to chaotic double-pendula. Without any prior knowledge about physics, kinematics, or geometry, the algorithm discovered Hamiltonians, Lagrangians, and other laws of geometric and momentum conservation. The discovery rate accelerated as laws found for simpler systems were used to bootstrap explanations for more complex systems, gradually uncovering the "alphabet" used to describe those systems.

Available from www.ncbi.nlm.nih.gov
Distilling Free-Form Natural Laws from Experimental Data
Michael Schmidt¹ and Hod Lipson²,³*
Mathematical symmetries and invariants underlie
nearly all physical laws in nature (1), suggesting
that the search for many natural laws is inseparably
a search for conserved quantities and invariant
equations (2, 3). Automated techniques for
generating, collecting, and storing data from
scientific measurements have become increasingly
precise and powerful, but automated processes for
distilling this data into knowledge in the form of
analytical natural laws have not kept pace. Thus,
there is a pressing practical need (4, 5) for
improved forms of scientific data mining (6, 7).
The most prohibitive obstacle to overcome in order
to search for conservation laws computationally is
finding meaningful and nontrivial invariants. There
exist an infinite number of identities that are
numerically invariant but have no connection to the
natural physics or dynamics of the system. We
introduce a principle for identifying only the
useful analytical relations that are related to the
system dynamics. We then demonstrate how a search
algorithm based on this principle identifies
meaningful analytical links in data captured from
various physical systems (Fig. 1).
Our goal is to find natural relations where they
exist, with minimal restrictions on their analytical
form (i.e., free-form). Many methods exist for
modeling scientific data: Some use fixed-form
parametric models derived from expert knowledge, and
others use numerical models (such as neural
networks) aimed at prediction. Still others have
explored restricted model spaces using greedy
monomial search (8, 9). Alternatively, we seek the
principal unconstrained analytical expression that
explains symbolically precise conserved relations,
thus helping distill data into scientific knowledge.
Symbolic regression (10) is an established method
based on evolutionary computation (11) for searching
the space of mathematical expressions while
minimizing various error metrics [see section S4 in
the supporting online material (SOM)]. Unlike
traditional linear and nonlinear regression methods
that fit parameters to an equation of a given form,
symbolic regression searches both the parameters and
the form of equations simultaneously (see SOM
section S6). Initial expressions are formed by
randomly combining mathematical building blocks such
as algebraic operators {+, –, ÷, ×}, analytical
functions (for example, sine and cosine), constants,
and state variables. New equations are formed by
recombining previous equations and probabilistically
varying their subexpressions. The algorithm retains
equations that model the experimental data better
than others and abandons unpromising solutions.
After equations reach a desired level of accuracy,
the algorithm terminates, returning a set of
equations that are most likely to correspond to the
intrinsic mechanisms underlying the observed system.
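
The loop described above (random initial expressions, probabilistic variation of subexpressions, retention of the better candidates) can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the authors' implementation: the nested-tuple encoding, the operator set, the mutation-only variation (recombination is omitted for brevity), and every function name are hypothetical choices made here.

```python
import math
import random

# Hypothetical building blocks; the paper's actual set is described in its SOM.
BINARY = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
UNARY = {"sin": math.sin, "cos": math.cos}

def random_expr(rng, depth=3):
    """Randomly combine building blocks into a nested-tuple expression of x."""
    if depth == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.5 else round(rng.uniform(-2, 2), 2)
    if rng.random() < 0.3:
        return (rng.choice(sorted(UNARY)), random_expr(rng, depth - 1))
    return (rng.choice(sorted(BINARY)),
            random_expr(rng, depth - 1), random_expr(rng, depth - 1))

def evaluate(expr, x):
    """Evaluate an expression tree at a single value of x."""
    if expr == "x":
        return x
    if isinstance(expr, (int, float)):
        return expr
    if len(expr) == 2:
        return UNARY[expr[0]](evaluate(expr[1], x))
    return BINARY[expr[0]](evaluate(expr[1], x), evaluate(expr[2], x))

def mutate(expr, rng):
    """Probabilistically replace one subexpression with a fresh random one."""
    if not isinstance(expr, tuple) or rng.random() < 0.2:
        return random_expr(rng, depth=2)
    i = rng.randrange(1, len(expr))
    return expr[:i] + (mutate(expr[i], rng),) + expr[i + 1:]

def mean_abs_error(expr, data):
    """Mean absolute error over (x, y) samples; non-finite results count as inf."""
    try:
        err = sum(abs(evaluate(expr, x) - y) for x, y in data) / len(data)
    except (ValueError, OverflowError):
        return math.inf
    return err if math.isfinite(err) else math.inf

def search(data, generations=60, pop_size=50, seed=0):
    """Keep the better half each generation and refill with mutated survivors."""
    rng = random.Random(seed)
    population = [random_expr(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda e: mean_abs_error(e, data))
        survivors = population[: pop_size // 2]
        population = survivors + [mutate(rng.choice(survivors), rng) for _ in survivors]
    return min(population, key=lambda e: mean_abs_error(e, data))

# Synthetic target y = x^2 + 1, used only to exercise the loop.
data = [(x / 4, (x / 4) ** 2 + 1.0) for x in range(-8, 9)]
best = search(data)
print(best, mean_abs_error(best, data))
```

Because the better half always survives, the best error in the population never increases from one generation to the next; whether the search reaches the exact target form depends on the seed and the budget.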
Although symbolic regression is typically
used to find explicit (12–14) and differential
equations (15), this method cannot readily find
conservation laws or invariant equations. Rather
than trying to model a specific signal, we are
trying to detect any underlying physical law that
the system obeys, which may or may not be
constant (e.g., a Lagrangian).
A particular challenge is requiring the law to be a
function of the system’s state while avoiding
trivial or meaningless relations. For any system
over the state space x, there are infinitely many
trivial equations over x that satisfy a conserved
quantity, such as $\sin^2(x_1) + \cos^2(x_1)$ or
$x_1 + 4.56 - x_2 x_1 / x_2$. Additionally, there
are infinitely many arbitrarily close trivial
conservations, such as $4.56 + 1/(100 + x_1^2)$. To
distinguish good conservation law candidates from
poor ones, we need a more robust principle than
simply invariance alone.
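
A quick numerical check makes the point concrete. The sketch below evaluates the three trivial expressions at random states that have no dynamics at all and reports how much each value varies. The sampling ranges and function names are assumptions made for illustration; `x2` is kept away from zero so the division is defined.

```python
import math
import random

def candidate_values(x1, x2):
    """Evaluate the three trivial expressions from the text at one state."""
    return {
        "sin^2(x1) + cos^2(x1)": math.sin(x1) ** 2 + math.cos(x1) ** 2,
        "x1 + 4.56 - x2*x1/x2": x1 + 4.56 - x2 * x1 / x2,
        "4.56 + 1/(100 + x1^2)": 4.56 + 1.0 / (100.0 + x1 ** 2),
    }

def spreads(states):
    """Max minus min of each expression over a list of (x1, x2) states."""
    rows = [candidate_values(x1, x2) for x1, x2 in states]
    return {name: max(r[name] for r in rows) - min(r[name] for r in rows)
            for name in rows[0]}

rng = random.Random(0)
# Arbitrary states unrelated to any physical system.
states = [(rng.uniform(-5, 5), rng.uniform(1, 5)) for _ in range(1000)]
for name, spread in spreads(states).items():
    print(f"{name}: spread = {spread:.2e}")
```

The first two expressions are constant up to floating-point rounding, and the third varies by at most about 0.002 over this range, even though none of them says anything about how a system behaves.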
The identification of nontrivial relations is a
major challenge, even for human scientists: Many
published invariant quantities have turned out to be
coincidental (16). The mere appearance of a
conserved value is insufficient for a conservation
law. The key insight into identifying nontrivial
conservation laws computationally is that the
candidate equations should predict connections
between dynamics of subcomponents of the system.
More precisely, the conservation equation should be
able to predict connections among derivatives of
groups of variables over time, relations that we can
also readily calculate from new experimental data.

Fig. 1. Mining physical systems. We captured the angles and angular velocities
of a chaotic double-pendulum (A) over time using motion tracking (B), then we
automatically searched for equations that describe a single natural law relating
these variables. Without any prior knowledge about physics or geometry, the
algorithm found the conservation law (C), which turns out to be the double
pendulum’s Hamiltonian. Actual pendulum, data, and results are shown.

¹Computational Biology, Cornell University, Ithaca, NY 14853, USA.
²School of Mechanical and Aerospace Engineering, Cornell University, Ithaca, NY 14853, USA.
³Computing and Information Science, Cornell University, Ithaca, NY 14853, USA.
*To whom correspondence should be addressed. E-mail: hod.lipson@cornell.edu

www.sciencemag.org SCIENCE VOL 324 3 APRIL 2009 81
REPORTS
One instance of such a metric is the partial
derivatives between pairs of variables (see SOM
section S1). For example, in a two-dimensional
system we could measure variables $x(t)$ and $y(t)$
over time. The system’s partial derivatives
estimated from time-series data would then be
$x'/y' \approx \Delta x/\Delta y$ and
$y'/x' \approx \Delta y/\Delta x$ (where $x'$ and
$y'$ represent the time derivatives of $x$ and $y$).
Similarly, given a candidate conservation law
equation $f(x, y)$, we can derive the same values
through differentiation:
$(df/dy)/(df/dx) \approx dx/dy$ and
$(df/dx)/(df/dy) \approx dy/dx$. We can now compare
$\Delta x/\Delta y$ values from the experimental
data with $dx/dy$ values from a candidate
conservation expression $f(x, y)$ to measure how
well it predicts intrinsic relations in the system.
In higher-dimensional systems, multiple variable
pairings and higher-order derivatives yield a
plethora of criteria to use. See SOM sections S2 and
S3 for generalization to higher-dimensional systems.
Using the partial-derivative pairs, we define a new
type of search criterion for measuring how well a
candidate analytical expression represents a
nontrivial invariance over the experimental data.
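
The comparison described above can be sketched for a two-variable system as follows. Several choices here are assumptions for illustration only: the synthetic circular trajectory, the finite-difference estimate of the candidate's partial derivatives (the paper derives them symbolically), the log-based error, and all function names. For an exactly conserved $f(x, y)$, implicit differentiation gives $dx/dy = -(\partial f/\partial y)/(\partial f/\partial x)$; the sketch keeps that minus sign explicitly, whereas the text states the relation only approximately and without it.

```python
import math

def numeric_pairs(xs, ys):
    """Estimate delta-x / delta-y along the trajectory with central differences."""
    return [(xs[i + 1] - xs[i - 1]) / (ys[i + 1] - ys[i - 1])
            for i in range(1, len(xs) - 1)]

def predicted_pair(f, x, y, h=1e-5):
    """dx/dy implied by f(x, y) = const, via implicit differentiation."""
    df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return -df_dy / df_dx

def pair_error(f, xs, ys):
    """Mean log-scaled mismatch between data pairs and candidate pairs."""
    data_pairs = numeric_pairs(xs, ys)
    errors = [math.log(1 + abs(d - predicted_pair(f, xs[i + 1], ys[i + 1])))
              for i, d in enumerate(data_pairs)]
    return sum(errors) / len(errors)

# Synthetic trajectory on the unit circle, restricted to an arc where
# neither delta-y nor x passes through zero.
ts = [0.2 + 0.01 * k for k in range(101)]
xs = [math.cos(t) for t in ts]
ys = [math.sin(t) for t in ts]

candidates = {
    "x^2 + y^2": lambda x, y: x * x + y * y,               # the true invariant
    "x^2 - y^2": lambda x, y: x * x - y * y,               # wrong relation
    "4.56 + 1/(100 + x^2)": lambda x, y: 4.56 + 1.0 / (100.0 + x * x),  # nearly constant
}
for name, f in candidates.items():
    print(f"{name}: pair error = {pair_error(f, xs, ys):.3g}")
```

The nearly constant expression scores poorly here even though its value barely changes along the trajectory, because it predicts the wrong relation between changes in `x` and changes in `y`.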
An important consequence of the
partial-derivative–pair measure is that it can also
identify relations that represent other nontrivial
identities of the system beyond invariants and
conservation laws. For example, if the system is
confined to a manifold, the manifold equation can
also derive accurate partial-derivative pairs.
Similarly, the partial-derivative pair can identify
equations such as Lagrangian equations, the energy
equivalent to the equation of motion in classical
mechanics, which summarize the system’s dynamics but
are not invariant.
One can control, to an extent, the type of law that
the system might find by choosing what variables to
provide to the algorithm. For example, if we only
provide position coordinates, the algorithm is
forced to converge on a manifold equation of the
system’s state space. If we provide velocities, the
algorithm is biased to find energy laws. If we
additionally supply accelerations, the algorithm is
biased to find force identities and equations of
motion. However, given these or other types of
variables, other or previously unknown analytical
laws may exist.
We used an algorithm (Fig. 2) to search for
analytical laws in data captured from several
synthetic and physical systems using various
sets of system variables. We present results for
a number of physical experimental systems (see
SOM section S7 for a study of synthetic systems,
geometric symmetries, and manifolds). A video
is available online (see SOM section S14).
We collected data from standard experimental
systems typically used in undergraduate physics
education: an air-track oscillator and a double
pendulum (Fig. 3). We used motion-tracking
software to record the devices’ positions over
time. We then numerically calculated velocities
and accelerations (see SOM section S11). All
data sets are available in SOM section S15.
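
As a rough illustration of the numerical differentiation step, the sketch below estimates velocities and accelerations from uniformly sampled positions with central differences. This is an assumed stand-in, not the procedure of SOM section S11, which is not reproduced here; real motion-tracking data are noisy and would typically need smoothing before differentiation. The function name and the synthetic signal are hypothetical.

```python
import math

def central_differences(positions, dt):
    """Return (velocities, accelerations) at the interior samples of a
    uniformly sampled position track with time step dt."""
    n = len(positions)
    velocities = [(positions[i + 1] - positions[i - 1]) / (2 * dt)
                  for i in range(1, n - 1)]
    accelerations = [(positions[i + 1] - 2 * positions[i] + positions[i - 1]) / dt ** 2
                     for i in range(1, n - 1)]
    return velocities, accelerations

# Synthetic, noise-free oscillation x(t) = sin(2t), used only as a check.
dt = 0.001
times = [k * dt for k in range(2001)]
positions = [math.sin(2.0 * t) for t in times]
velocities, accelerations = central_differences(positions, dt)
```

Both estimates are defined only at interior samples, so each returned list is two entries shorter than the position track.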
Without any additional information, system models,
or theoretical knowledge, the search with the
partial-derivative–pairs criterion produced several
analytical law expressions directly from these data.
For each system, the algorithm outputs a short list
of ~10 equations that have maximal accuracy found
for different sizes (complexities) of equations (see
SOM section S8). We then inspect this list manually
to select the final equation. Often the list
consists of varying approximations or elaborations
on a particular law equation, but it can contain
qualitatively different equations, as discussed
below.
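
One plausible way to build such a short list is to keep, for each equation size, the most accurate candidate, and then drop any entry that is no more accurate than a smaller one. The sketch below does that; the selection rule, the function name, and the example candidates with their sizes and errors are all hypothetical illustrations, not values or procedures taken from the paper or its SOM.

```python
def accuracy_complexity_front(candidates):
    """Given (expression, size, error) triples, return the most accurate
    candidate per size, keeping only sizes that improve on all smaller ones."""
    best_by_size = {}
    for expr, size, error in candidates:
        if size not in best_by_size or error < best_by_size[size][1]:
            best_by_size[size] = (expr, error)

    front, best_error = [], float("inf")
    for size in sorted(best_by_size):
        expr, error = best_by_size[size]
        if error < best_error:  # larger equations must earn their extra size
            front.append((expr, size, error))
            best_error = error
    return front

# Made-up candidates for illustration only.
candidates = [
    ("c", 1, 0.9),
    ("x + c", 3, 0.5),
    ("x * c", 3, 0.7),
    ("x*x + c", 5, 0.5),
    ("x*x + y*y", 7, 0.01),
]
for expr, size, error in accuracy_complexity_front(candidates):
    print(f"size {size}: {expr} (error {error})")
```

A list built this way is what a person would then inspect by hand, as the text describes, since the trade-off between size and accuracy is left to judgment.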
We experimented on two configurations of the air
track: (i) two-spring single-mass and (ii)
three-spring double-mass. Similarly, we collected
time-series data from a pendulum and a double
pendulum (Fig. 3) with the use of motion-tracking
(SOM section S12).
The single-car air track is a harmonic oscillator
with slight damping from the air and its two
springs. With only minimal noise and damp-
Fig. 2. Computational approach for detecting conservation laws from experimentally collected data. (A)
First, calculate partial derivatives between variables from the data, then search for equations that may
describe a physical invariance. To measure how well an equation describes an invariance, derive the same
partial derivatives symbolically to compare with the data. (B) The representation of a symbolic equation in
computer memory is a list of successive mathematical operations (see SOM section S6). (C) This list
representation corresponds to a graph, where nodes represent mathematical building blocks and leaves
represent parameters and system variables. Both (B) and (C) correspond to the same equation. The
algorithm varies these structures to search the space of equations.
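
To make the idea in panels (B) and (C) concrete, here is a minimal sketch of an equation stored as a list of successive operations, where each step may refer to variables, constants, or the results of earlier steps. The tuple format, the operator names, and the example equation are assumptions chosen for illustration; the encoding actually used is described in SOM section S6 and may differ.

```python
import math

BINARY = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b, "mul": lambda a, b: a * b}
UNARY = {"sin": math.sin, "cos": math.cos}
SYMBOL = {"add": "+", "sub": "-", "mul": "*"}

def run(program, variables):
    """Execute the operation list in order and return the last result."""
    values = []

    def operand(ref):
        kind, key = ref
        if kind == "var":
            return variables[key]
        if kind == "const":
            return key
        return values[key]  # ("step", i): result of an earlier operation

    for op, *refs in program:
        args = [operand(r) for r in refs]
        values.append(BINARY[op](*args) if op in BINARY else UNARY[op](*args))
    return values[-1]

def to_infix(program):
    """Unfold the same list into the nested expression it encodes."""
    texts = []

    def operand(ref):
        kind, key = ref
        return texts[key] if kind == "step" else str(key)

    for op, *refs in program:
        args = [operand(r) for r in refs]
        texts.append(f"({args[0]} {SYMBOL[op]} {args[1]})" if op in BINARY
                     else f"{op}({args[0]})")
    return texts[-1]

# Hypothetical example equation: x*x + cos(y).
PROGRAM = [
    ("mul", ("var", "x"), ("var", "x")),   # step 0
    ("cos", ("var", "y")),                 # step 1
    ("add", ("step", 0), ("step", 1)),     # step 2
]
print(to_infix(PROGRAM), "=", run(PROGRAM, {"x": 2.0, "y": 0.0}))
```

The flat list and the nested expression returned by `to_infix` describe the same equation, which mirrors the caption's statement that (B) and (C) correspond to each other.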