Extraction of physical laws from joint experimental data

by I Grabec
European Physical Journal B (2007)

Abstract

The extraction of a physical law y=yo(x) from joint experimental data about x and y is treated. The joint, the marginal and the conditional probability density functions (PDF) are expressed by given data over an estimator whose kernel is the instrument scattering function. As an optimal estimator of yo(x) the conditional average is proposed. The analysis of its properties is based upon a new definition of prediction quality. The joint experimental information and the redundancy of joint measurements are expressed by the relative entropy. With the number of experiments the redundancy on average increases, while the experimental information converges to a certain limit value. The difference between this limit value and the experimental information at a finite number of data represents the discrepancy between the experimentally determined and the true properties of the phenomenon. The sum of the discrepancy measure and the redundancy is utilized as a cost function. By its minimum a reasonable number of data for the extraction of the law yo(x) is specified. The mutual information is defined by the marginal and the conditional PDFs of the variables. The ratio between mutual information and marginal information is used to indicate which variable is the independent one. The properties of the introduced statistics are demonstrated on deterministically and randomly related variables.

Available from arxiv.org
arXiv:0704.0151v1 [physics.data-an] 2 Apr 2007
Igor Grabec
Faculty of Mechanical Engineering, University of Ljubljana,
Aškerčeva 6, PP 394, 1001 Ljubljana, Slovenia,
Tel: +386 01 4771 605, Fax: +386 01 4253 135,
E-mail: igor.grabec@fs.uni-lj.si
PACS. 06.20.DK Measurement and error theory – 02.50.+s Probability theory, stochastic processes, and
statistics – 89.70.+c Information science
1 Introduction
The progress of natural sciences depends on advances in experimental techniques and in the modeling of relations between experimental data in terms of physical laws.[1,2] Computers have revolutionized the acquisition of experimental data, while modeling still awaits corresponding progress. For this purpose the modeling process should be described in terms of operations that can be performed autonomously by a computer. A step in this direction was taken recently by a nonparametric statistical modeling of the probability distribution of measured data.[3] Nonparametric modeling requires no a priori assumptions about the probability density function (PDF) of measured data and therefore provides for a fairly general and autonomous experimental modeling of physical laws by a computer.[1,4] Moreover, the inaccuracy of measurement caused by stochastic influences can be properly accounted for in nonparametric modeling, which leads to the expression of the experimental information, the redundancy of repeated measurements and the model cost function in terms of the entropy of information. These quantities have already been applied to formulate an optimal nonparametric modeling of a PDF in the simplest case of a one-dimensional variable.[3] More frequently than modeling a PDF, however, the problem is to extract a physical law from joint data about several variables and to analyze its properties. The aim of this article is therefore to propose a general statistical approach to the solution of this problem as well.
As an optimal statistical estimator of an experimen-
tal physical law we propose the conditional average (CA)
that is determined by the conditional PDF.[1] This esti-
mator represents a nonparametric regression whose struc-
ture is case independent; hence it can be generally pro-
grammed and autonomously determined by a computer.
Due to these convenient properties, we consider CA as a
basis for the autonomous extraction of experimental phys-
ical laws in data acquisition systems.
The fundamental steps of the proposed approach to
extraction of experimental physical laws from given data
are explained in the second section. We first define the
estimators of the joint, the marginal and the conditional
PDFs and derive from them the conditional average as
an optimal estimator of a physical law that is hidden in
joint data. In order to estimate the number of data ap-
propriate for the extraction of a physical law, we further
introduce the statistics that characterize the information
provided by joint measurements. In the third section of
the article the properties of the CA estimator and the
other introduced statistics are demonstrated on cases of
deterministically and randomly related data.
2 Statistics of joint measurements
2.1 Uncertainty of experimental observation
Without loss of generality we consider a phenomenon that
can be quantitatively characterized by two scalar valued
variables x and y comprising a vector z = (x, y). We fur-
ther assume that the phenomenon can be experimentally
explored by repetition of joint measurements on a two–
channel instrument having equal spans Sx = (−L,L),
Sy = (−L,L). Their Cartesian product Sxy = Sx ⊗ Sy
determines the joint span. We treat a measurement of a
joint datum as a process in which the measured object
generates the instrument output z = (x, y). The basic
properties of the instrument and measurement procedure
can be characterized by a calibration based on a set of
objects {wkl = (uk, vl); k = 1, . . . l = 1, . . .} that repre-
sent joint physical units. Using these units, a scale net can
be determined in the joint span Sxy of the instrument. In
order to simplify the notation, we further omit the indices
of units.
A common property of measurements is that the out-
put of the instrument fluctuates even when calibration
is repeated.[1,2] We describe this property by the joint
PDF ψ(z|w), which characterizes the scattering of the in-
strument output at a given joint unit w. For the sake
of simplicity, we consider an instrument whose channels
can be calibrated mutually independently. In this case the
instrument scattering function is expressed by the prod-
uct of scattering functions corresponding to both channels
ψ(z|w) = ψ(x|u)ψ(y|v). Their mean values u, v, and stan-
dard deviations σx, σy represent an element of the instru-
ment scale and the scattering of instrument output at the
joint calibration. These values can be estimated statisti-
cally by the sample mean and variance of both components
measured during repeated calibration by a joint unit w.
The standard deviation σ characterizes the uncertainty
of the measurement procedure performed on a unit.[1,2]
We further consider the most frequent case in which the
output scattering does not depend on the channel index
and the position w = (u, v) on the joint scale. In this
case it can be expressed as a function of the difference
z − w = (x − u, y − v) and a common standard devia-
tion σ = σx = σy as ψ(z|w) = ψ(z − w, σ). We consider
scattering of instrument output during calibration as a
consequence of random disturbances in the measurement
system. When these disturbances are caused by contribu-
tions from mutually independent sources, the central limit
theorem of the probability theory leads us to the Gaussian
scattering function ψ(z−w, σ) = g(x−u, σ)g(y−v, σ), in
which the scattering of a single component is determined
by:
$$\psi(x|u) = g(x-u,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(x-u)^2}{2\sigma^2}\right]. \tag{1}$$
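As a minimal sketch of Eq. (1), the Gaussian scattering function and its two-channel product can be written in a few lines of Python. The function names `gauss` and `psi`, and the values σ = 0.2 and L = 10 used in the normalization check, are assumptions chosen for illustration only.

```python
import math

def gauss(d, sigma):
    """Gaussian scattering function g(d, sigma) of Eq. (1), with d = x - u."""
    return math.exp(-d * d / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)

def psi(x, y, u, v, sigma):
    """Joint scattering function psi(z - w, sigma) = g(x - u, sigma) g(y - v, sigma)."""
    return gauss(x - u, sigma) * gauss(y - v, sigma)

# Midpoint-rule check that g integrates to about 1 over a span much wider than sigma.
sigma, L, n = 0.2, 10.0, 4000          # assumed illustrative values
h = 2.0 * L / n
total = sum(gauss(-L + (k + 0.5) * h, sigma) for k in range(n)) * h
```

The check relies on σ ≪ L, so that the part of the Gaussian outside the span (−L, L) is negligible.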
2.2 Estimation of probability density functions
Let us consider a single measurement which yields a joint
datum z1 = (x1, y1). We assume that this joint datum
appears at the outputs of instrument channels, since it is
the most probable at a given state z of the observed phe-
nomenon and the instrument during measurement. There-
fore, we utilize the measured datum z1 as the center of the
probability distribution ψ(z− z1, σ) = ψ(x− x1, σ)ψ(y −
y1, σ) that represents the corresponding state.
Consider next a series of N repeated measurements
which yield the basic data set {zi; i = 1, . . . , N}. In ac-
cordance with the above–given interpretation of measured
data we adapt to them the distributions {ψ(z−zi, σ); i =
1, . . . , N}. If the data z1, . . . , zN are spaced more than σ
apart, we assume that their scattering is caused by varia-
tion of the state z in repeated measurements and generally
consider z as a random vector variable. Its joint PDF is
determined by the statistical average over distributions
{ψ(z− zi, σ); i = 1, . . . , N} as:
$$f_N(\mathbf{z}) = \frac{1}{N}\sum_{i=1}^{N}\psi(\mathbf{z}-\mathbf{z}_i,\sigma). \tag{2}$$
This function represents an experimental model of the PDF and resembles Parzen's kernel estimator, which is often used in statistical modeling of PDFs.[5,4] In Parzen's modeling, however, the kernel width σ plays the role of a smoothing parameter whose value decreases with the number of data N, which is not consistent with the general properties of measurements. In contrast, we consider σ as an instrumental parameter that is determined by the inaccuracy of measurement.[3,4] In the majority of experimental observations σ is constant during measurements, and hence need not be further indicated in the scattering function ψ.
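The estimator of Eq. (2) can be sketched as an average of Gaussian scattering functions centered at the measured data, with σ held fixed as an instrument property. The three data points, σ = 0.2 and the span L = 3 below are hypothetical values used only to check the normalization.

```python
import math

def gauss(d, sigma):
    """Gaussian scattering function of Eq. (1)."""
    return math.exp(-d * d / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)

def joint_pdf(x, y, data, sigma):
    """Eq. (2): average of scattering functions centered at the data z_i = (x_i, y_i)."""
    return sum(gauss(x - xi, sigma) * gauss(y - yi, sigma) for xi, yi in data) / len(data)

data = [(-1.0, 0.5), (0.0, 0.0), (1.2, -0.4)]   # hypothetical joint data
sigma = 0.2                                     # assumed instrument scattering width

# Midpoint-rule check that f_N integrates to about 1 over the joint span (-L, L)^2.
L, n = 3.0, 300
h = 2.0 * L / n
grid = [-L + (k + 0.5) * h for k in range(n)]
mass = sum(joint_pdf(x, y, data, sigma) for x in grid for y in grid) * h * h
```

Note that σ is not tuned to the number of data here, in line with its interpretation as an instrumental parameter rather than a smoothing parameter.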
From the joint PDF f(z) = f(x, y) the marginal PDF
f(x) of a component x is obtained by integration over the
other component, for example:
$$f(x) = \int_{S_y} f(x,y)\,dy. \tag{3}$$
The conditional PDF of the variable y at a given condition
x is then defined by the ratio of the joint PDF and the
marginal PDF of the condition:
$$f(y|x) = \frac{f(x,y)}{f(x)}. \tag{4}$$
Using the experimental model of joint PDF (2) we obtain
for the marginal and conditional PDFs the following kernel
estimators:
$$f_N(x) = \frac{1}{N}\sum_{i=1}^{N}\psi(x-x_i,\sigma), \tag{5}$$
$$f_N(y|x) = \frac{\sum_{i=1}^{N}\psi(x-x_i,\sigma)\,\psi(y-y_i,\sigma)}{\sum_{i=1}^{N}\psi(x-x_i,\sigma)}. \tag{6}$$
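A possible implementation of the kernel estimators in Eqs. (5) and (6) is sketched below. The data and the condition x = 0.3 are hypothetical. The sketch assumes that the condition x lies within a few σ of at least one datum; far away from all data the denominator of Eq. (6) underflows to zero in floating-point arithmetic.

```python
import math

def gauss(d, sigma):
    """Gaussian scattering function of Eq. (1)."""
    return math.exp(-d * d / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)

def marginal_pdf(x, data, sigma):
    """Eq. (5): kernel estimator of the marginal PDF f_N(x)."""
    return sum(gauss(x - xi, sigma) for xi, _ in data) / len(data)

def conditional_pdf(y, x, data, sigma):
    """Eq. (6): kernel estimator of the conditional PDF f_N(y|x)."""
    num = sum(gauss(x - xi, sigma) * gauss(y - yi, sigma) for xi, yi in data)
    den = sum(gauss(x - xi, sigma) for xi, _ in data)
    return num / den

data = [(-1.0, 0.5), (0.0, 0.0), (1.2, -0.4)]   # hypothetical joint data
sigma = 0.2                                     # assumed scattering width

# The conditional PDF should integrate to about 1 over y for a fixed condition x.
L, n = 3.0, 600
h = 2.0 * L / n
area = sum(conditional_pdf(-L + (k + 0.5) * h, 0.3, data, sigma) for k in range(n)) * h
```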
2.3 Estimation of a physical law
It is often observed that the joint PDF resembles a crest along some line y = ŷ(x). We consider ŷ(x) as an estimator of a hidden physical law y = yo(x) that provides for a prediction of a value y from the given value x. If we repeat joint measurements, and consider only those that yield the value x, we can generally observe that the corresponding values of the variable y are scattered, at least due to the stochastic character of the measurements. As an optimal predictor of the variable y at the given value x, we consider the value ŷ that yields the minimum of the mean square prediction error D at a given x:
$$D = E\left[(\hat{y}-y)^2\,|\,x\right] = \min(\hat{y}). \tag{7}$$
The minimum takes place when dD/dŷ = 0. The solution of this equation yields as the optimal predictor ŷ the conditional average
$$\hat{y}(x) = E[y|x] = \int_{S_y} y\,f(y|x)\,dy. \tag{8}$$
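The step from Eq. (7) to Eq. (8) can be written out explicitly. The following lines are a short worked derivation that uses only the normalization of the conditional PDF, ∫ f(y|x) dy = 1:

```latex
\begin{align*}
D(\hat{y}) &= \int_{S_y} (\hat{y}-y)^2 f(y|x)\,dy, \\
\frac{dD}{d\hat{y}} &= 2\int_{S_y} (\hat{y}-y)\, f(y|x)\,dy
  = 2\left(\hat{y} - \int_{S_y} y\, f(y|x)\,dy\right) = 0, \\
\frac{d^2D}{d\hat{y}^2} &= 2 > 0 .
\end{align*}
```

The positive second derivative confirms that the stationary point is a minimum of D.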
By using Eq. 6 for the conditional probability, we obtain
for CA the superposition
$$\hat{y}_N(x) = \frac{\sum_{i=1}^{N} y_i\,\psi(x-x_i,\sigma)}{\sum_{i=1}^{N}\psi(x-x_i,\sigma)} = \sum_{i=1}^{N} y_i\,C_i(x). \tag{9}$$
The coefficients
$$C_i(x) = \frac{\psi(x-x_i,\sigma)}{\sum_{j=1}^{N}\psi(x-x_j,\sigma)} \tag{10}$$
represent a normalized measure of similarity between the
given value x and sample values xi and satisfy the condi-
tions:
$$\sum_{i=1}^{N} C_i(x) = 1, \tag{11}$$
$$0 \le C_i(x) \le 1. \tag{12}$$
The more similar the given value x is to a datum xi, the larger is the coefficient Ci(x), and with it the contribution of the corresponding term yiCi(x) to the sum in Eq. (9). The prediction of the value ŷN(x), which best corresponds to the given value x, thus resembles the associative recall of memorized items in the brains of intelligent beings, and could therefore be treated as a basis for the development of computerized autonomous modelers of physical laws and related machine intelligence.[1]
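The conditional average of Eqs. (9) to (12) can be sketched as follows, assuming the Gaussian scattering function of Eq. (1). The names `ca_weights` and `ca_predict` and the three sample points are hypothetical. As a known limitation of this sketch, a condition x far from all data makes every kernel value underflow to zero, so the normalization fails there.

```python
import math

def ca_weights(x, xs, sigma):
    """Coefficients C_i(x) of Eq. (10) for a Gaussian scattering function."""
    # The Gaussian normalization constant cancels in the ratio, so it is omitted.
    k = [math.exp(-(x - xi) ** 2 / (2.0 * sigma ** 2)) for xi in xs]
    total = sum(k)
    return [ki / total for ki in k]

def ca_predict(x, xs, ys, sigma):
    """Conditional average of Eq. (9): weighted sum of the sample values y_i."""
    return sum(c * yi for c, yi in zip(ca_weights(x, xs, sigma), ys))

xs = [0.0, 1.0, 2.0]      # hypothetical sample values x_i
ys = [0.0, 1.0, 4.0]      # hypothetical sample values y_i
sigma = 0.2               # assumed scattering width

w = ca_weights(1.0, xs, sigma)               # dominated by the datum at x_i = 1
y_at_datum = ca_predict(1.0, xs, ys, sigma)  # close to y_i = 1
y_between = ca_predict(0.5, xs, ys, sigma)   # close to the mean of 0 and 1
```

At a datum the prediction is dominated by that datum, while halfway between two neighboring data the two nearest samples contribute equally.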
The predictor Eq. (9) is completely determined by the set of measured data {zi; i = 1, . . . , N} and the instrument scattering function ψ. It is not based on any a priori assumption about the functional relation between the variables x and y, as is the case, for example, when a physical law is described by some regression function whose parameters are adapted to given data. The conditional average Eq. (9) can thus be treated as a nonparametric regression, although the scattering functions ψ(z − zi, σ) still depend on the parameters zi, σ. However, these parameters, as well as the form of the function ψ, are completely specified by measurements. They represent a property of the observed phenomenon and not an assumed auxiliary of the modeling. Since the form of the CA predictor does not depend on the specific phenomenon under consideration, it can be considered as a generally applicable basis for statistical modeling of physical laws in terms of experimental data in an autonomous computer. Conveniently, Eq. (9) can be simply generalized to a multi-dimensional case by substituting the condition and the estimated variable by the corresponding vectors.[1] Moreover, the ordering into dependent and independent variables is done automatically by a specification of the condition.
2.3.1 Description of predictor quality
We can interpret a phenomenon which is characterized by the vector z = (x, y) as a process that maps the variable x to the variable y. When the variables x and y are stochastic, we most generally describe this mapping by the joint PDF f(x, y). Similarly, we can interpret the prediction of the variable ŷ(x) from the given value x as a process that runs in parallel with the observed phenomenon. This process is also generally characterized by the PDF f(x, ŷ), while the relation between the variables y and ŷ is characterized by the PDF f(y, ŷ). The better the predictor is, the more the distribution f(y, ŷ) is concentrated along the line y = ŷ(x). For a good predictor we generally expect that the prediction error Er = y − ŷ is close to 0. Since both variables are considered as stochastic ones, we expect that the first and second moments of the prediction error E[y − ŷ], E[(y − ŷ)²] are small, while for an exact prediction E[y − ŷ] = 0 and E[(y − ŷ)²] = 0. The second moment of the error is equal to E[(y − ŷ)²] = Var(y) + Var(ŷ) − 2Cov(y, ŷ) + (my − mŷ)², where my = E[y] and mŷ = E[ŷ] denote mean values. If the variables y and ŷ are statistically independent and have equal mean values, the covariance vanishes: Cov(y, ŷ) = 0, and my − mŷ = 0, so that E[(y − ŷ)²] = Var(y) + Var(ŷ). Based upon this property we introduce a relative statistic called the predictor quality with the formula
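$$Q = 1 - \frac{E[(y-\hat{y})^2]}{\mathrm{Var}(y)+\mathrm{Var}(\hat{y})} = \frac{2\,\mathrm{Cov}(y,\hat{y})}{\mathrm{Var}(y)+\mathrm{Var}(\hat{y})} - \frac{(m_y-m_{\hat{y}})^2}{\mathrm{Var}(y)+\mathrm{Var}(\hat{y})}. \tag{13}$$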
Its value equals 1 for an exact prediction, ŷ = y, while it equals 0 if the variables y, ŷ are statistically independent and have equal mean values. If the mean values differ, my − mŷ ≠ 0, the quality Q can also be negative.
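A sample estimate of the predictor quality in Eq. (13) can be sketched as below. The function name `predictor_quality` and the test values are hypothetical, and the expectations are replaced by plain sample averages with divisor n, which is an assumption of this sketch rather than a prescription of the text.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def predictor_quality(y, y_hat):
    """Sample estimate of Q in Eq. (13): 1 - E[(y - y_hat)^2] / (Var(y) + Var(y_hat))."""
    m_y, m_h = mean(y), mean(y_hat)
    mse = mean([(a - b) ** 2 for a, b in zip(y, y_hat)])
    var_y = mean([(a - m_y) ** 2 for a in y])
    var_h = mean([(b - m_h) ** 2 for b in y_hat])
    return 1.0 - mse / (var_y + var_h)

y = [1.0, 2.0, 3.0, 4.0]                                # hypothetical observed values
q_exact = predictor_quality(y, y)                       # exact prediction
q_shift = predictor_quality(y, [a + 5.0 for a in y])    # same shape, shifted mean
```

For the exact prediction the estimate is 1, while the prediction with a shifted mean gives a negative quality, as stated for the case my − mŷ ≠ 0.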
When the predictor is determined by the conditional
average (8), we obtain for its mean value
$$m_{\hat{y}} = E[\hat{y}] = \int \hat{y}(x)\,f(x)\,dx = \iint y\,f(y|x)\,f(x)\,dx\,dy = \iint y\,f(y,x)\,dx\,dy = E[y] = m_y. \tag{14}$$
Since in this case my − mŷ = 0, we further get
$$Q = \frac{2\,\mathrm{Cov}(y,\hat{y})}{\mathrm{Var}(y)+\mathrm{Var}(\hat{y})}. \tag{15}$$
Similarly we get for the covariance
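$$\begin{aligned}
\mathrm{Cov}(y,\hat{y}) &= \iint (y-m_y)\,(\hat{y}(x)-m_{\hat{y}})\,f(y,x)\,dx\,dy \\
&= \int (\hat{y}(x)-m_{\hat{y}})\left[\int (y-m_y)\,f(y|x)\,dy\right] f(x)\,dx \\
&= \int (\hat{y}(x)-m_{\hat{y}})^2 f(x)\,dx = \mathrm{Var}(\hat{y}),
\end{aligned} \tag{16}$$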
so that the expected quality of the CA predictor is
$$Q = \frac{2\,\mathrm{Var}(\hat{y})}{\mathrm{Var}(y)+\mathrm{Var}(\hat{y})}. \tag{17}$$
In the case when the relation between both components of the vector z is determined by some physical law yo(x), and only the measurement procedure introduces an additive noise ν with zero mean E[ν] = 0 and variance E[ν²] = σ², we can express the variable y as y = yo(x) + ν. In this case the equations E[(y − ŷ)²] = σ² and Var(y) = Var(ŷ) + σ² hold, and we get for the expected predictor quality the expression:
$$Q = \frac{2\,\mathrm{Var}(\hat{y})}{2\,\mathrm{Var}(\hat{y})+\sigma^2}. \tag{18}$$
For Var(ŷ) ≫ σ²/2 we have Q ≈ 1, while for Var(ŷ) ≪ σ²/2 we have Q ≈ 0. In the latter case ŷ ≈ constant, while y fluctuates around this constant, and consequently the prediction quality is low.
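The substitution that leads to Eq. (18) can be spelled out, followed by one numerical illustration. The values Var(ŷ) = 1 and σ = 0.2 in the second line are assumed for illustration and are not taken from the text:

```latex
\begin{align*}
Q &= 1 - \frac{E[(y-\hat{y})^2]}{\mathrm{Var}(y)+\mathrm{Var}(\hat{y})}
   = 1 - \frac{\sigma^2}{2\,\mathrm{Var}(\hat{y})+\sigma^2}
   = \frac{2\,\mathrm{Var}(\hat{y})}{2\,\mathrm{Var}(\hat{y})+\sigma^2}, \\
Q &= \frac{2 \cdot 1}{2 \cdot 1 + 0.04} \approx 0.980
   \quad\text{for the assumed values } \mathrm{Var}(\hat{y}) = 1,\ \sigma = 0.2 .
\end{align*}
```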
Since generally Var(y) ≥ Var(ŷ) and Var(ŷ) ≥ 0, we obtain from Eq. (17) the inequality 0 ≤ Q ≤ 1. It describes a mean property, which need not be fulfilled exactly if the conditional average is statistically estimated from a finite number of samples N, but we can expect that it holds ever better with an increasing N. We can also generally expect that with an increasing N the statistically estimated CA ever better represents the underlying physical law y = yo(x). However, with an increasing N the cost of experiments increases, and consequently the question arises: "How to specify a number of samples N that is reasonable for the experimental estimation of a hidden law yo(x)?"
2.4 Experimental information
In order to answer the last question, we proceed with the
description of the indeterminacy of the vector variable z
in terms of the entropy of information. Following the def-
initions given for a scalar random variable in the previous
article,[3] we first describe the indeterminacy of the com-
ponent x. For this purpose we introduce a uniform refer-
ence PDF ρ(x) = 1/(2L) that hypothetically corresponds
to the most indeterminate noninformative observation of
variable x; or to equivalently prepared initial states of the
instrument before executing the experiments in a series
of observations. By using this reference and the marginal
PDF f(x), we first define the indeterminacy of a continu-
ous random variable by the negative value of the relative
entropy[6,7]
$$H_x = -\int_{S_x} f(x)\log\left(\frac{f(x)}{\rho(x)}\right)dx. \tag{19}$$
Using the expressions for the reference, instrumental scat-
tering function, and experimentally estimated PDF, we
obtain the expressions for the uncertainty Hu of calibra-
tion performed on a unit u, the uncertainty Hx of the
component x, experimental information Ix provided by
N measurements of x, and the redundancy Rx of these
measurements as follows [3]:
$$\begin{aligned}
H_u &= -\int_{S_x}\psi(x,u)\log(\psi(x,u))\,dx - \log(2L),\\
H_x &= -\int_{S_x} f_N(x)\log(f_N(x))\,dx - \log(2L),\\
I_x(N) &= H_x - H_u,\\
R_x(N) &= \log(N) - I_x(N).
\end{aligned} \tag{20}$$
Similar equations are obtained for the component y by
substituting x→ y.
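The quantities of Eq. (20) can be evaluated numerically for a Gaussian scattering function, as in the sketch below. It assumes σ ≪ L, so that the calibration term can use the closed-form differential entropy of a Gaussian, ½ log(2πeσ²). The function name, the grid size and the sample positions are hypothetical.

```python
import math

def gauss(d, sigma):
    """Gaussian scattering function of Eq. (1)."""
    return math.exp(-d * d / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)

def experimental_information_1d(xs, sigma, L, n=4000):
    """I_x(N) = H_x - H_u of Eq. (20), by midpoint-rule integration over (-L, L)."""
    h = 2.0 * L / n
    integral = 0.0                      # accumulates the integral of f_N log f_N
    for k in range(n):
        x = -L + (k + 0.5) * h
        f = sum(gauss(x - xi, sigma) for xi in xs) / len(xs)
        if f > 0.0:                     # skip points where f_N underflows to zero
            integral += f * math.log(f) * h
    h_x = -integral - math.log(2.0 * L)
    # H_u in closed form for a Gaussian with sigma << L (assumption of this sketch).
    h_u = 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2) - math.log(2.0 * L)
    return h_x - h_u

sigma, L = 0.2, 10.0                                              # assumed values
i_one = experimental_information_1d([0.0], sigma, L)              # single measurement
i_far = experimental_information_1d([-4.0, 0.0, 4.0], sigma, L)   # well separated data
i_same = experimental_information_1d([0.0, 0.0, 0.0], sigma, L)   # repeated datum
r_far = math.log(3) - i_far                                       # redundancy R_x(3)
```

In this sketch a single measurement yields approximately zero information, three well separated data yield approximately log(3) nats with almost no redundancy, and three coinciding data yield no more information than one.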
In order to describe the uncertainty of the random vector z, we utilize the reference PDF that is uniform inside the joint span Sxy: ρ(z) = ρ(x)ρ(y) = 1/(2L)², and vanishes elsewhere. By analogy with the scalar variable we define the indeterminacy of the random vector z by the negative value of the relative entropy:[6]
$$H_{xy} = -\iint_{S_{xy}} f(\mathbf{z})\log\left(\frac{f(\mathbf{z})}{\rho(\mathbf{z})}\right)dx\,dy. \tag{21}$$
In the case of a uniform reference PDF we obtain
$$H_{xy} = -\iint_{S_{xy}} f(\mathbf{z})\log(f(\mathbf{z}))\,dx\,dy - 2\log(2L). \tag{22}$$
With this formula we then express the uncertainty of the
joint instrument calibration as
$$H_w = -\iint_{S_{xy}} \psi(\mathbf{z},\mathbf{w})\log(\psi(\mathbf{z},\mathbf{w}))\,dx\,dy - 2\log(2L). \tag{23}$$
For σ ≪ L we obtain from the Gaussian scattering func-
tion ψ(z, zi) = g(x− xi, σ)g(y − yi, σ) the approximation
$$H_w \approx \log\left(\frac{\sigma^2}{L^2}\right) + \log\frac{\pi}{2} + 1. \tag{24}$$
The uncertainty of calibration depends on the ratio be-
tween the scattering width 2σ and the instrument span 2L
in both directions. The number 2 log(σ/L) determines the
lowest possible uncertainty of measurement on the given
two–channel instrument, as achieved at its joint calibra-
tion.
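The approximation in Eq. (24) follows from the entropy of a Gaussian, which equals ½ log(2πeσ²) per channel when σ ≪ L. A short worked version is given below. The numerical values σ = 0.2 and L = 10 in the last line are assumed for illustration:

```latex
\begin{align*}
-\iint \psi \log\psi \,dx\,dy &= 2\cdot\tfrac{1}{2}\log(2\pi e\,\sigma^2) = \log(2\pi\sigma^2) + 1, \\
H_w &\approx \log(2\pi\sigma^2) + 1 - \log(4L^2)
     = \log\!\left(\frac{\sigma^2}{L^2}\right) + \log\frac{\pi}{2} + 1, \\
H_w &\approx -7.824 + 0.452 + 1 \approx -6.37 \ \text{nats}
     \quad (\sigma = 0.2,\ L = 10 \text{ assumed}).
\end{align*}
```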
The indeterminacy of the random vector z, which char-
acterizes the scattering of experimental data, is defined by
the estimated joint PDF as
$$H_{xy} = -\iint_{S_{xy}} f_N(\mathbf{z})\log(f_N(\mathbf{z}))\,dx\,dy - 2\log(2L) \tag{25}$$
and is generally greater than the uncertainty of calibra-
tion described by Hw. Since Hw denotes the lowest possi-
ble indeterminacy of observation carried out over a given
instrument, we define the joint experimental information
Ixy about vector z = (x, y) by the difference
$$\begin{aligned}
I_{xy}(N) &= H_{xy} - H_w \\
&= -\iint f_N(\mathbf{z})\log(f_N(\mathbf{z}))\,dx\,dy + \iint \psi(\mathbf{z},\mathbf{w})\log(\psi(\mathbf{z},\mathbf{w}))\,dx\,dy.
\end{aligned} \tag{26}$$
Most properties of the uncertainty and information appertaining to a random vector are similar to those in the case of a scalar variable. For example, the reference density ρ(z) can be arbitrarily selected since it is excluded from the specification of the experimental information.[3] Furthermore, the joint experimental information Ixy(1) provided by a single measurement is zero. For a measurement which yields multiple samples z1, . . . , zN that are mutually separated by several σ in both directions, the distributions ψ(z, zi) = g(x − xi, σ)g(y − yi, σ) are nonoverlapping and the first integral on the right of Eq. (26) can be approximated as
$$-\frac{1}{N}\sum_{i=1}^{N}\iint \psi(\mathbf{z},\mathbf{z}_i)\log\left[\frac{1}{N}\sum_{j=1}^{N}\psi(\mathbf{z},\mathbf{z}_j)\right]dx\,dy \approx \log(N) - \iint \psi(\mathbf{z},\mathbf{z}_1)\log\psi(\mathbf{z},\mathbf{z}_1)\,dx\,dy \tag{27}$$
so that we get Ixy(N) ≈ log(N). If the distributions ψ(z, zi) are overlapping but not concentrated at a single point, the inequality 0 ≤ Ixy(N) ≤ log(N) holds generally. Similarly to the entropy of information for a discrete random variable, the experimental information describes how much information is provided by N experiments performed by an instrument that is not infinitely accurate.[6] In accordance with these properties the experimental information describes the complexity of experimental data in units of information entropy, which are here nats.
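The joint experimental information of Eq. (26) can be estimated on a grid, as in the following sketch. It assumes a Gaussian scattering function with σ ≪ L, uses the closed form log(2πeσ²) for the calibration integral, and exploits the fact that the term 2 log(2L) cancels in the difference Hxy − Hw. The function name, the grid resolution and the data are hypothetical.

```python
import math

def gauss(d, sigma):
    """Gaussian scattering function of Eq. (1)."""
    return math.exp(-d * d / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)

def joint_information(data, sigma, L, n=200):
    """I_xy(N) = H_xy - H_w of Eq. (26), by a midpoint rule on an n x n grid."""
    h = 2.0 * L / n
    grid = [-L + (k + 0.5) * h for k in range(n)]
    N = len(data)
    # psi is separable, so the 1-D kernel values are computed once per datum.
    gx = [[gauss(x - xi, sigma) for x in grid] for xi, _ in data]
    gy = [[gauss(y - yi, sigma) for y in grid] for _, yi in data]
    integral = 0.0                                  # sum of f_N log f_N over the grid
    for a in range(n):
        for b in range(n):
            f = sum(gx[i][a] * gy[i][b] for i in range(N)) / N
            if f > 0.0:                             # skip underflowed grid points
                integral += f * math.log(f)
    h_data = -integral * h * h                      # -iint f_N log f_N dx dy
    h_psi = math.log(2.0 * math.pi * math.e * sigma ** 2)   # -iint psi log psi dx dy
    return h_data - h_psi

sigma, L = 0.2, 2.0                                 # assumed values
i_single = joint_information([(0.0, 0.0)], sigma, L)
i_pair = joint_information([(-1.0, -1.0), (1.0, 1.0)], sigma, L)
```

With these assumptions a single datum gives Ixy(1) ≈ 0, and two data separated by many σ give Ixy(2) ≈ log(2), in agreement with the limits stated above.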
When the distributions ψ(z, zi) are nonoverlapping, N repeated experiments yield the maximal possible information log(N). However, with an increasing number N, ever more overlapping of distributions ψ(z, zi) takes place, and therefore the experimental information Ixy(N) increases more slowly than log(N). Consequently, the repetition of joint measurements becomes on average ever more redundant with an increasing number N. The difference
$$R_{xy}(N) = \log(N) - I_{xy}(N) \tag{28}$$
thus represents the redundancy of repeated joint measure-
ments in N experiments. Since the overlapping of distri-
butions ψ(z, zi) increases with an increasing number of ex-
periments, the experimental information on average tends
to a constant value Ixy(∞), and along with this, the re-
dundancy increases with N .
The number
$$K_{xy}(N) = e^{I_{xy}(N)} \tag{29}$$
describes how many nonoverlapping distributions are needed to represent the experimental observation. With an increasing N, the number Kxy(N) tends to a fixed value Kxy(∞) that can be well estimated already from a finite number of experiments. We could conjecture that Kxy(∞) approximately determines a reasonable number of experiments that provide sufficient data for an acceptable modeling of the joint PDF. However, it is still better to determine such a number from a properly introduced cost function of the experimental observation. With this aim we consider the difference Dxy(N) = Ixy(∞) − Ixy(N) as the measure of the discrepancy between the experimentally observed and the true properties of the phenomenon. An information cost function is then composed of the redundancy and the discrepancy measure:
$$C_{xy}(N) = R_{xy}(N) + D_{xy}(N). \tag{30}$$
Since the redundancy on average increases, while the discrepancy measure decreases with the number of measurements N, we expect that the cost function Cxy(N) exhibits a minimum at a certain number No, which could be considered as an optimal one for the experimental modeling of a phenomenon. From the definition of redundancy and the discrepancy measure we further obtain Cxy(N) = Rxy(N) + Dxy(N) = log(N) − 2Ixy(N) + Ixy(∞). Since the last term is a constant for a given phenomenon, it is not essential for the determination of No, and can be omitted from the definition of the cost function. This yields a simpler version
$$C_{xy}(N) = \log(N) - 2I_{xy}(N), \tag{31}$$
which is more convenient for application since it does not include the limit value Ixy(∞). In a previous article [3] we proposed a cost function composed of the redundancy and the information measure of the discrepancy between the hypothetical and experimentally observed PDFs. However, such a definition is less convenient than the present one, although the values of No determined from both cost functions do not differ essentially. Numerical investigations also show that the optimal number No approximately corresponds to Kxy(∞) = exp(Ixy(∞)) if the distribution of the data points is approximately uniform.
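The selection of No from Eq. (31) can be sketched as a simple search for the minimum of the cost over N. The information curve used below, Ixy(N) = log(K N / (K + N − 1)) with K = 25, is a hypothetical toy model chosen only because it satisfies Ixy(1) = 0, Ixy(N) ≤ log(N) and Ixy(∞) = log(K). It is not derived from the text or from any data.

```python
import math

def cost(N, i_xy):
    """Simplified information cost function of Eq. (31)."""
    return math.log(N) - 2.0 * i_xy

def optimal_number(i_by_n):
    """Return N_o, the N with the smallest cost, from a dict {N: I_xy(N)}."""
    return min(i_by_n, key=lambda N: cost(N, i_by_n[N]))

# Hypothetical saturating information curve (toy model, see the lead-in).
K = 25.0
i_by_n = {N: math.log(K * N / (K + N - 1.0)) for N in range(1, 101)}
n_opt = optimal_number(i_by_n)
```

For this toy curve the minimum lies at N = K − 1, that is, close to Kxy(∞), which mirrors the numerical observation quoted above. In practice the values Ixy(N) would come from the estimated joint PDF rather than from a formula.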
Although the experimental information of a vector vari-
able and its scalar components exhibits similar properties,
their values generally do not coincide since the overlapping
of distributions ψ(z, zi) generally differs from that of dis-
tributions ψ(x, xi) or ψ(y, yi). Therefore, the experimen-
tal information provided by joint measurements generally
differs from that provided by measurements of single com-
ponents.
2.5 Mutual information and determination of one variable by the other
In order to describe the information corresponding to the
relation between variables x, y we introduce conditional
entropy. At a given value x we express the entropy per-
taining to the variable y by the conditional PDF as
$$H_{y|x} = -\int_{S_y} f(y|x)\log\left(\frac{f(y|x)}{\rho(y)}\right)dy. \tag{32}$$
If we express in Eq. (21) the joint PDF by the conditional one f(z) = f(y|x)f(x) we obtain the following equation:
$$H_{xy} = \bar{H}_{y|x} + H_x, \tag{33}$$
in which $\bar{H}_{y|x}$ denotes the average conditional entropy of information
$$\bar{H}_{y|x} = \int_{S_x} H_{y|x}\,f(x)\,dx. \tag{34}$$
When we exchange the meaning of the variables we get
$$H_{xy} = \bar{H}_{x|y} + H_y. \tag{35}$$
Based on these equations and Eq. (26) we obtain the following relation between the joint and the conditional information
$$\begin{aligned}
I_{xy} &= \bar{H}_{x|y} + H_y - H_u - H_v \\
&= I_{y|x} + I_x = I_{x|y} + I_y,
\end{aligned} \tag{36}$$
where the conditional information is defined by
$$I_{x|y} = \bar{H}_{x|y} - H_u \quad\text{or}\quad I_{y|x} = \bar{H}_{y|x} - H_v. \tag{37}$$
When the components of the vector z are statistically
independent, the joint PDF is equal to the product of
marginal probabilities and the joint information is given
by the sum Ixy = Ix + Iy, which represents the maxi-
mal possible information that could be provided by joint
measurements. However, when x and y are not statisti-
cally independent, the joint information is less than the
maximal possible one: Ixy < Ix + Iy. The difference
$$I_m = I_x + I_y - I_{xy} = I_x - I_{x|y} = I_y - I_{y|x} \tag{38}$$
can be interpreted as the experimental information that
a measurement of one variable provides about another one
and is consequently called the mutual information.[6,8,9,10]
In accordance with the previous interpretation of the re-
dundancy, it follows from the last two terms in Eq. (38)
that the mutual information also describes how redun-
dant on average is a measurement of the variable y at a
given x or vice versa. In accordance with the definition of
the redundancy of a certain number N of measurements
Rx(N) = log(N) − Ix, we further define also the mutual
redundancy of N joint measurements
$$R_m(N) = \log(N) - I_m(N). \tag{39}$$
If we then take into account all the definitions of the re-
dundancies and types of information, we obtain the for-
mula:
$$R_{xy}(N) = R_x(N) + R_y(N) - R_m(N). \tag{40}$$
It should be pointed out that the redundancies Rxy(N), Rx(N), Ry(N), and Rm(N) generally increase with N, while the corresponding experimental information tends to fixed values that correspond to the amount of data needed for presenting related variables.
In order to describe quantitatively how well the value of the variable y is determined on average by the value of x, we propose a relative measure of determination given by the ratio
$$D_{y|x} = \frac{I_m}{I_y} = 1 - \frac{I_{y|x}}{I_y}. \tag{41}$$
If Dy|x > Dx|y, the value of the variable x better deter-
mines the value of y than vice versa. In this case the vari-
able x could be considered as more fundamental for the
description of the phenomenon, and consequently as an
independent one. In the case of functional dependence de-
scribed by a physical law y = yo(x), the relative measure
of determination is Dy|x = 1, while for the statistically
independent variables x and y it is Dy|x = 0.
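The relations in Eqs. (38) to (41) reduce to simple arithmetic once the marginal and joint information values are known. The sketch below uses hypothetical values of Ix, Iy and Ixy (in nats) for N = 50. They are not taken from the text and serve only to show how the quantities combine.

```python
import math

def mutual_information(i_x, i_y, i_xy):
    """Eq. (38): I_m = I_x + I_y - I_xy."""
    return i_x + i_y - i_xy

def determination(i_m, i_target):
    """Eq. (41): D_{y|x} = I_m / I_y (pass I_x instead of I_y to obtain D_{x|y})."""
    return i_m / i_target

# Hypothetical information values in nats for N = 50 joint measurements.
N, i_x, i_y, i_xy = 50, 3.0, 2.5, 3.2
i_m = mutual_information(i_x, i_y, i_xy)
d_y_given_x = determination(i_m, i_y)
d_x_given_y = determination(i_m, i_x)

# Redundancies of Eqs. (20), (28) and (39).
r_x, r_y = math.log(N) - i_x, math.log(N) - i_y
r_m, r_xy = math.log(N) - i_m, math.log(N) - i_xy
```

With these assumed numbers D(y|x) exceeds D(x|y), so x would be treated as the independent variable, and the redundancies satisfy Eq. (40).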
The entropy of information is generally decreased if the distribution of scattered experimental data at a given x is compressed to the estimated physical law ŷ(x). The corresponding information gain is in drastic contrast to the information loss that is caused by the noise in a measurement system.[11]
3 Illustration of statistics
3.1 Data with a hidden law
The purpose of this section is to demonstrate graphically
the basic properties of the statistics introduced above. For
this purpose it is most convenient to generate data nu-
merically since in this case the relation between the vari-
ables x and y, as well as the properties of the scatter-
ing function ψ(z), can be simply set. For our demonstra-
tion we arbitrarily selected a third order polynomial law
yo(x) = [x(x − 5)(x + 10)]/100 and the Gaussian scatter-
ing function with standard deviation σ = 0.2. To simulate
the basic data set {xi, yi; i = 1, . . . , N}, we first calcu-
lated 50 sample values xi by summing two random terms
obtained from a generator with a uniform distribution in
the interval [−8,+8] and from a Gaussian generator hav-
ing the mean value 0 and standard deviation σ = 0.2.
The corresponding sample values yi were then calculated
as a sum of terms obtained from the selected law yo(xi)
and the same random Gaussian generator with a different
seed. The generated data {xi, yi; i = 1, . . . , 50} were used
as centers of scattering function when estimating the joint
PDF based on Eq. (2). An example of such PDF is shown
in Fig. 1, while the corresponding joint data of the basic
set are shown by points in the top curve of Fig. 2 together
with the underlying law yo(x).
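The data generation described above can be sketched with the Python standard library. The law yo(x), the interval [−8, +8], σ = 0.2 and N = 50 are taken from the text. The seeds and the use of one generator for both x terms and a second generator for the y noise are assumptions of this sketch.

```python
import random

def y_law(x):
    """Third-order polynomial law y_o(x) = x (x - 5) (x + 10) / 100 from the text."""
    return x * (x - 5.0) * (x + 10.0) / 100.0

def generate_data(n, sigma, seed_x, seed_y):
    """Simulate n joint data (x_i, y_i) with Gaussian scattering in both channels."""
    rng_x = random.Random(seed_x)       # generator for the x terms (assumed seed)
    rng_y = random.Random(seed_y)       # separately seeded generator for the y noise
    data = []
    for _ in range(n):
        x = rng_x.uniform(-8.0, 8.0) + rng_x.gauss(0.0, sigma)
        y = y_law(x) + rng_y.gauss(0.0, sigma)
        data.append((x, y))
    return data

basic = generate_data(50, 0.2, seed_x=1, seed_y=2)   # basic data set
test = generate_data(50, 0.2, seed_x=3, seed_y=4)    # test set with different seeds
```

The resulting pairs can serve as centers of the scattering functions in Eq. (2), and a second call with other seeds gives an independent test set.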
Fig. 1. The joint PDF f(x, y) utilized to demonstrate the properties of the conditional average predictor (N = 50, σ = 0.2; axes: X, Y, PDF).

Fig. 2. Testing of CA predictor (σ = 0.2, N = 50, Q = 0.977; axes: X, Y). Curves representing the underlying law and given data yo, y (top), test and predicted data yt, yp (middle), and prediction error Er = yp − yt (bottom) are displaced in the vertical direction for better visualization.

The conditional average predictor, which corresponds to the presented example, was modeled by inserting data from the basic data set into Eq. (9). To demonstrate its performance, we additionally generated a test data set by the same procedure as in the case of the basic data set, but with different seeds of all the random generators. Using the values xi,t of the test set, we then predicted the corresponding values ŷi by the modeled CA predictor. With this procedure we simulated a situation that is normally
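The test of the conditional average (CA) predictor described in this paragraph can be sketched as follows. Eq. (9) is not reproduced in this section; the sketch assumes the form that follows from a product Gaussian scattering function, namely a kernel-weighted mean of the yi with weights given by the Gaussian of (x − xi)/σ. The function names and seeds are hypothetical, and the quality Q shown in Fig. 2 is not computed because its definition is not given here.

```python
import numpy as np

def law(x):
    """Third-order polynomial law y_o(x) from the text."""
    return x * (x - 5.0) * (x + 10.0) / 100.0

def generate_data(n, sigma, rng):
    """Uniform x on [-8, 8] plus Gaussian scatter; y = y_o(x) plus scatter."""
    x = rng.uniform(-8.0, 8.0, n) + rng.normal(0.0, sigma, n)
    y = law(x) + rng.normal(0.0, sigma, n)
    return x, y

def ca_predict(x_query, xs, ys, sigma):
    """Assumed CA predictor: sum_i y_i g(x - x_i) / sum_i g(x - x_i),
    where g is a Gaussian of standard deviation sigma."""
    d = (np.asarray(x_query, dtype=float)[:, None] - xs[None, :]) / sigma
    e = 0.5 * d**2
    # Shift the exponents row by row so that the largest weight equals 1;
    # this avoids 0/0 when a query point lies far from all data.
    w = np.exp(-(e - e.min(axis=1, keepdims=True)))
    return (w * ys).sum(axis=1) / w.sum(axis=1)

sigma = 0.2
xs, ys = generate_data(50, sigma, np.random.default_rng(1))  # basic set
xt, yt = generate_data(50, sigma, np.random.default_rng(2))  # test set, new seeds

yp = ca_predict(xt, xs, ys, sigma)   # predicted values
err = yp - yt                        # prediction error Er = yp - yt, as in Fig. 2
rms_error = float(np.sqrt(np.mean(err**2)))
print(f"RMS prediction error on the test set: {rms_error:.3f}")
```

Because the prediction is a weighted average with positive weights, every predicted value lies between the smallest and largest yi of the basic set.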
Fig. 4. Dependence of log(N), experimental information Ixy, mutual information Im, redundancy Rxy, and cost function Cxy on the number of samples N determined by various statistical data sets (σ = 0.2; horizontal axis: N, vertical axis in nat).
corresponds to the ideal case with no scattering, is also presented by the curve log(N), since it represents the basis for defining the redundancy. As in the one-dimensional case [3], the experimental information Ixy in the two-dimensional case also converges with increasing N to a fixed value. In the presented case the limit value is Ixy(∞) ≈ 3.2, which yields the number K∞ ≈ 25. This number is approximately equal to the ratio of the standard deviation of the variable x to the scattering width σ and describes how many uniformly distributed samples are needed to represent the PDF of the data.[3] Because the experimental information converges to a fixed value, the curve Ixy(N) starts to deviate from log(N) with increasing N. Consequently the redundancy Rxy = log(N) − Ixy(N) starts to increase, which further leads to the minimum of the cost function Cxy(N) = log(N) − 2Ixy(N).
Fig. 5. Dependence of log(N), experimental information Ixy, marginal informations Ix, Iy, and mutual information Im on the number of samples N (σ = 0.2; horizontal axis: N, vertical axis in nat).
The minimum is not well pronounced due to statistical variations, but it occurs at approximately No ≈ 30. Not surprisingly, the optimal number No approximately corresponds to K∞ and also to NCA.
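The numbers quoted in this paragraph can be checked with a few lines of Python. The first part only evaluates exp(Ixy(∞)) and the ratio of the standard deviation of x to σ. The second part uses a purely hypothetical saturating model I(N) = log(NK/(N + K)), which is not taken from the paper; it is chosen only because it behaves like log(N) for small N and tends to log(K) for large N, so the cost function log(N) − 2I(N) given in the text can be seen to develop a minimum.

```python
import math

# Part 1: numbers quoted in the text.
I_inf = 3.2                    # limit value of I_xy in nat, from the text
K_inf = math.exp(I_inf)        # close to the quoted K_inf of about 25

# Standard deviation of a uniform variable on [-8, 8] is 16/sqrt(12);
# the small Gaussian term added to x is neglected here.
ratio = (16.0 / math.sqrt(12.0)) / 0.2

# Part 2: hypothetical saturating model of the experimental information.
def info_toy(n, k):
    """Illustrative assumption: ~log(n) for n << k, tends to log(k) for n >> k."""
    return math.log(n * k / (n + k))

def redundancy(n, k):
    """R(N) = log(N) - I(N), as defined in the text."""
    return math.log(n) - info_toy(n, k)

def cost(n, k):
    """C(N) = log(N) - 2 I(N), as defined in the text."""
    return math.log(n) - 2.0 * info_toy(n, k)

K = 25                         # hypothetical limit number of samples
n_opt = min(range(1, 101), key=lambda n: cost(n, K))

print(f"K_inf = exp({I_inf}) = {K_inf:.1f}")
print(f"std(x) / sigma = {ratio:.1f}")
print(f"minimum of the toy cost function at N = {n_opt}")
```

In this toy model the redundancy grows monotonically with N and the cost minimum falls at N = K, which mirrors the statement that No roughly corresponds to K∞. The actual curves in Fig. 4 are subject to statistical variations that the model ignores.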
Like the joint experimental information Ixy, the marginal experimental informations Ix and Iy also converge to fixed values with increasing N.[3] These statistics are presented in Fig. 5 for the same data generator as applied in the case of Fig. 4. The sample values of the variable x are spread over a larger interval than those of the variable y. Hence there is less overlapping of the scattering functions comprising the marginal PDF of x, and consequently Ix is larger than Iy. It is also characteristic that Ixy is larger than Ix, since the data points in the joint span Sxy are more separated than in the marginal span Sx. Since the mutual information Im is defined as Im = Ix + Iy − Ixy, its properties depend on both the marginal and the joint information, and consequently Im converges more quickly to the limit value than the experimental information Ixy.

Fig. 6. Dependence of log(N), experimental information Ixy, redundancy Rxy, and cost function Cxy on the number of samples N determined from various data sets and scattering widths σ (curves shown for σ = 0.1 and σ = 0.4; horizontal axis: N, vertical axis in nat).
To demonstrate the influence of the scattering width on the presented statistics, the calculations were repeated with σ = 0.1 and 0.4. The results are presented in Fig. 6. For the sake of clear presentation, the curves representing the mutual information Im are omitted. As could be expected, the limit value of Ixy increases with decreasing σ. This property is consistent with the well-known fact that more information can be obtained by experimental observation when using an instrument of higher accuracy, which corresponds to a smaller scattering width. Conversely, the redundancy of measurement decreases with decreasing scattering width, and along with it the optimal number No increases.
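The trend described in this paragraph can be illustrated with a rough estimate. The sketch below assumes that the rule of thumb stated earlier for σ = 0.2 (K∞ approximately equal to the standard deviation of x divided by σ) also applies to the other two widths; that extension is an assumption, and the resulting numbers are not read from Fig. 6.

```python
import math

# Standard deviation of x, approximated by that of a uniform variable on [-8, 8].
std_x = 16.0 / math.sqrt(12.0)

estimates = {}
for sigma in (0.1, 0.2, 0.4):
    k_inf = std_x / sigma                        # rough number of distinguishable samples
    estimates[sigma] = (k_inf, math.log(k_inf))  # (K_inf, limit information in nat)

for sigma, (k_inf, i_inf) in estimates.items():
    print(f"sigma = {sigma}: K_inf ~ {k_inf:.1f}, I_inf ~ {i_inf:.2f} nat")
```

Under this assumption, halving σ doubles K∞ and raises the limit information by log 2, which is about 0.69 nat. This is in line with the qualitative statement that a more accurate instrument yields more information and a larger optimal number of samples.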
Fig. 7. Dependence of relative measure of determination Dy|x (top lines) and Dx|y (bottom lines) on the number of samples N determined from various statistical data sets (σ = 0.2).
From the calculated mutual and marginal information, the relative measures of determination Dy|x and Dx|y were further determined using various statistical data sets. The results are presented in Fig. 7 for the case of scattering width σ = 0.2. When the number of data N surpasses the interval around the optimal number No, statistical variations of Dy|x and Dx|y become less pronounced and their values settle close to the limit ones. The limit value of Dx|y is substantially lower than that of Dy|x. This is a consequence of the fact that in our case the variable y is uniquely determined by the underlying law yo(x) based upon the variable x, but not vice versa. In our case, there are three values of the variable x corresponding to a value of y in a certain interval. Consequently, y is better determined by a given x than vice versa, which further yields Dy|x > Dx|y. Hence the relative measure of determination indicates that the variable x could be considered more fundamental for the description of the relation between the variables x and y.

Fig. 8. The joint PDF f(x, y) of N = 500 statistically independent random data with σ = 0.2 (axes: X, Y, PDF).
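The relation between the mutual information and the relative measures of determination can be made concrete with a small numerical sketch. The definition of D is not reproduced in this section; following the abstract, it is assumed here to be the ratio of the mutual information to the marginal information of the determined variable, that is Dy|x = Im/Iy and Dx|y = Im/Ix. The input values are made up for illustration and are not read from Fig. 5 or Fig. 7.

```python
def mutual_information(i_x, i_y, i_xy):
    """I_m = I_x + I_y - I_xy, as defined in the text."""
    return i_x + i_y - i_xy

def determination(i_x, i_y, i_xy):
    """Assumed definition: D_y|x = I_m / I_y and D_x|y = I_m / I_x."""
    i_m = mutual_information(i_x, i_y, i_xy)
    return i_m / i_y, i_m / i_x

# Hypothetical limit values in nat, chosen so that I_xy > I_x > I_y
# as described in the text.
i_x, i_y, i_xy = 3.0, 2.0, 3.2

i_m = mutual_information(i_x, i_y, i_xy)
d_yx, d_xy = determination(i_x, i_y, i_xy)
print(f"I_m = {i_m:.2f} nat, D_y|x = {d_yx:.2f}, D_x|y = {d_xy:.2f}")
```

With this assumed definition, Ix > Iy implies Dy|x > Dx|y, which agrees with the ordering reported for the data with a hidden law.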
3.2 Data without a hidden law
To support the last conclusion let us examine an exam-
ple in which the sample values of the variables x and
y were calculated by two statistically independent ran-
dom generators. The corresponding joint PDF is shown
in Fig. 8, while the properties of the other statistics are
demonstrated by Figs. 9, 10 and 11.
Fig. 9. Dependence of log(N), experimental information Ixy, redundancy Rxy, and cost function Cxy on the number of samples N determined by various statistical data sets and scattering widths σ (random data, σ = 0.2; the legend also lists Im; horizontal axis: N, vertical axis in nat).

Fig. 10. Dependence of log(N), experimental information Ixy, marginal informations Ix, Iy, and mutual information Im on the number of samples N in the case of statistically independent random variables x, y (σ = 0.2).

Fig. 11. Dependence of relative measure of determination Dy|x (top lines) and Dx|y (bottom lines) on the number of random samples N in the case of statistically independent random data with σ = 0.2.

The properties of the presented statistics can be understood by examining the overlapping of the scattering functions comprising the estimator of the joint PDF. In the previous case with the underlying law yo(x), the joint data are distributed along the corresponding line where −8 ≤ x ≤ +8, while in the last case they lie in the square region −8 ≤ x ≤ +8, −8 ≤ y ≤ +8. Consequently, the number of samples with nonoverlapping scattering functions in the last case is approximately L/σ = 16 times larger than in the previous case. In the last case we can therefore expect the optimal number of samples in the interval around Nro ≈ 16No = 480. Since in the last case a larger region is covered by the joint PDF, the overlapping of scattering functions is less probable than previously, and therefore the joint experimental information Ixy deviates less quickly from the line log(N) with increasing N. Therefore, the redundancy increases less quickly and the minimum of the cost function occurs at a much higher number, Nro = 480, which corresponds well to our estimation. Since in the last case the experimental information Ixy converges less quickly to the limit value than the marginal informations Ix, Iy, the mutual information Im first increases and later decreases to its limit value. Related to this is the approach of the relative measures of determination Dy|x, Dx|y to much lower limit values than in the previous case. Since the marginal informations Ix and Iy are approximately equal, the curves representing Dy|x, Dx|y join with increasing N, and there is no argument to consider either variable as the more fundamental one for the description of the phenomenon under examination. This conclusion is consistent with the fact that the centers of the scattering functions are determined by two statistically independent random generators. However, the limit values of the statistics Dy|x, Dx|y are not equal to zero, since the region −8 ≤ x ≤ +8, −8 ≤ y ≤ +8 where the data appear is limited, while the characteristic region −σ ≤ x ≤ +σ, −σ ≤ y ≤ +σ covered by the joint scattering function does not vanish.
4 Conclusions
Following the procedures proposed in the previous article
[3], we have shown how the joint PDF of a vector variable
z = (x, y) can be estimated nonparametrically based upon
measured data. For this purpose the inaccuracy of joint
measurements was considered by including the scattering
function in the estimator. It is essential that the properties
of the scattering function need not be specified a priori,
but can be determined experimentally by a calibration
procedure. The joint PDF was then transformed
into the conditional PDF that provides for an extraction
of the law yo(x) that relates the measured variables x, y.
For this purpose the estimation by the conditional average
yo(x) ≈ E[y|x] is proposed. The quality of the prediction
by the conditional average is described in terms of the estimation error and the variance of the measured data. Notably, the quality converges to a limit value that represents the measure of applicability of the proposed approach. Examining this convergence makes it feasible to estimate an appropriate number of joint data needed for the modeling of the law. Importantly, the conditional average makes feasible a nonparametric, autonomous extraction of the underlying law from the measured data.
Using the joint PDF estimator we have also defined
the experimental information, the redundancy of measure-
ment and the cost function of experimental exploration. It
is characteristic that experimental information converges
with an increasing number of joint samples to a certain
limit value which characterizes the number of nonoverlap-
ping scattering distributions in the estimator of the joint
PDF. The most essential terms of the cost function are
the experimental information and the redundancy. Dur-
ing cost minimization the experimental information pro-
vides for a proper adaptation of the joint PDF model to
the experimental data, while the redundancy prevents an
excessive growth of the number of experiments. By the
position of the cost function minimum we introduced the
optimal number of the data that is needed to represent the
phenomenon under exploration. This number roughly cor-
responds to the ratio between the magnitude of the charac-
teristic region where joint data appear and the magnitude
of the characteristic region covered by the joint scattering
function. It also corresponds to the appropriate number
estimated from the quality of prediction by the conditional
average. Based upon the experimental information corre-
sponding to the joint and marginal PDFs, the mutual in-
formation has been introduced and further utilized in the
definition of the relative measure of determination of one
variable by another. This statistic provides an argument
for considering one variable as a fundamental one for the
description of the phenomenon.
In this article we graphically present the properties of
the proposed statistics by two characteristic examples that
represent data related by a certain law and statistically
independent random data. The exhibited properties agree
well with the expectations given by experimental science.
The problems related to the extraction of laws representing relations such as y² + x² = 1, and the expression of physical laws by differential equations or analytical modeling, were not considered. For these purposes, statistical methods are developed in the fields of pattern recognition, system identification and artificial intelligence.
Acknowledgment
The research was supported by the Ministry of Science
and Technology of Slovenia and EU COST.
References
1. I. Grabec and W. Sachse, Synergetics of Measurement, Prediction and Control (Springer-Verlag, Berlin, 1997).
2. J. C. G. Lesurf, Information and Measurement (Institute of Physics Publishing, Bristol, 2002).
3. I. Grabec, Experimental modeling of physical laws, Eur. Phys. J. B, 22 129-135 (2001).
4. R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis (J. Wiley and Sons, New York, 1973), Ch. 4.
5. E. Parzen, Ann. Math. Stat., 35 1065-1076 (1962).
6. T. M. Cover and J. A. Thomas, Elements of Information Theory (John Wiley & Sons, New York, 1991).
7. A. N. Kolmogorov, IEEE Trans. Inf. Theory, IT-2 102-108 (1956).
8. B. S. Clarke and A. R. Barron, IEEE Trans. Inf. Theory, 36 (6) 453-471 (1990).
9. D. Haussler and M. Opper, Annals of Statistics, 25 (6) 2451-2492 (1997).
10. D. Haussler, IEEE Trans. Inf. Theory, 43 (4) 1276-1280 (1997).
11. C. E. Shannon, Bell Syst. Tech. J., 27 379-423 (1948).
