
A Bayesian Foundation for Physical Theories

by Roberto C Alamino
Complexity (2010)

Abstract

Bayesian probability theory is used as a framework to develop a formalism for the scientific method based on principles of inductive reasoning. The formalism allows for precise definitions of the key concepts in theories of physics and also leads to a well-defined procedure to select one or more theories among a family of (well-defined) candidates by ranking them according to their posterior probability distributions, which result from Bayes's theorem by incorporating into an initial prior the information extracted from a dataset, ultimately defined by experimental evidence. Examples with different levels of complexity are given and three main applications to basic cosmological questions are analysed: (i) the typicality of human observers, (ii) the multiverse hypothesis and, very briefly, (iii) the anthropic principle. Finally, it is demonstrated that this formulation can address problems that were until now outside the scope of scientific research, by presenting the isolated worlds problem and its resolution within the framework.

arXiv:1008.1635v1 [physics.data-an] 10 Aug 2010
Keywords Bayesian inference · Scientific method · Cosmology
PACS 01.70.+w · 02.50.Tt
Mathematics Subject Classification (2000) 03A10 · 03B48 · 62C12
This project was partially supported by EPSRC grant EP/E049516/1.
Roberto C. Alamino
Neural Computing Research Group, Aston University
E-mail: alaminrc@aston.ac.uk
1 Introduction
Science is fundamentally based on the processing of information accumulated
as we humans continually observe the endless spectrum of natural phenomena
occurring in our universe. The philosophical foundations of the task of
trying to make sense of all this collected information in a rational way, which
received the name of scientific method, were constructed based on the
principle that scientific knowledge should rely entirely on the empirical
evidence collected from these observed phenomena and that the theories
elaborated to explain them should not make any unnecessary assumption beyond
what is required to account for the facts. Constructing a scientific theory
would then be equivalent to inferring patterns, or an underlying connecting
structure, from this acquired information. The very hypothesis of this
possibility relies on the also empirical observation of the regularity and
predictability of nature.
The above paragraph is full of ill-defined concepts, but it contains the
essence of what we would require from an ideal scientific theory. One of
our aims in this work will be to give more precise definitions not only to
the ideas presented above, but also to many others that appear in connection
with physics in particular and with science in general. It is important
to clarify at this point what the theory to be developed here is not. The
present work is not a sociological theory of how scientists develop
scientific knowledge. It is not concerned with the historical details of how
theories were discovered or developed, nor with the mental processes that
resulted in their final formulations. A theory like that would have to take
into consideration not only sociological and psychological aspects, but also
their complex relations with the economic and political environment in which
the theories appeared. The theory to be developed here, although a better
word would be framework, is concerned with what could be considered the ideal
mathematical methods that scientific inquiry should use, and whose neglect
will lead to wrong or suboptimal results. The author thinks that
he does not have the proper training to analyse the historical contingencies
of the development of theories by humans and that, in fact, social scientists
and historians are better suited to this task.
That said, let us point out that physics can be considered the best example
of the application of the scientific method, providing us with descriptions of
nature with astonishing quantitative accuracy as in the well-known case of
quantum electrodynamics. The success of physics is indisputable as can be
easily attested by the improvement of quality of life brought by our present
technology. This high efficiency of physics is a direct result of taking seriously
the basic tenets of the scientific method to discern between what should be
considered science and what should not.
One of the most fundamental of these criteria is the concept of
falsifiability, which is a central point in the Popperian description of
science [1]. Falsifiability can be a difficult concept to identify in
theoretical physics, especially when theoretical advancements run ahead of
experimental technology, as is presently the case, for instance, with quantum
gravity, where the difficulty of falsifying tentative theories is well
recognised [2]. Partly because of that, physicists also rely on other
characteristics of physical theories, like mathematical simplicity and
elegance, as guides in the development of theories even if they are not
rigorously defined concepts. Obviously, it must always be stressed that these
criteria are only guidelines that must be discarded if experiments turn out
to disprove them. A rather illustrative example, although apparently
nonsensical for today's scientists, is the idea that every substance is
composed of a mixture of only four elements: water, fire, air and earth. This
can be considered a rather elegant theory, as it requires only four basic
elements in comparison with the more than one hundred elements known today,
but nonetheless it was proved wrong beyond doubt by the overwhelming
experimental evidence accumulated since its formulation.
Given that what tells science apart from mysticism and other non-scientific
human activities is the strict reliance on rational thought, many attempts
were made to formalise the scientific method using inductive logic, with
deductive logic as a particular case (see text and references in [3]).
Although it was recognised from the very early stages that inductive,
probabilistic reasoning should underpin these efforts, it is fair to say that
the correct formulation was only achieved with the modern development of
Bayesian inference [4]. Its application to this problem, however, has not yet
made use of the full power of the theory, although there exists at least one
highly developed candidate framework, called Minimum Description Length
(MDL) [5].
MDL fuses concepts of information theory, computation and probability
theory in an attempt to develop a method to decide between theories given
some data. The main idea of MDL is to choose the best hypothesis by
calculating the length of the shortest description, in a certain code, of the
hypothesis together with the data compressed using the hypothesis itself.
We will not give any detailed account of MDL here, but a comparison of its
main points with the work we develop here will be made in section 9.
It can be shown that Bayesian inference is the only way to deal with
information that complies with a sensible set of requirements for inductive
and logical reasoning [4,6,7]. Although many misconceptions about the Bayesian
interpretation of probability are still used against it on fundamental grounds,
with strong attacks against its use and the use of probabilistic reasoning in
general within the scientific method [8], the success of Bayesian methods in
machine learning applications, which are ultimately an attempt to understand
and formalise an ideal process of thinking, is an experimental fact [9,10].
These fundamental questions will not be discussed here and the interested
reader can refer to Jaynes' book [4] for a thorough study of these and
other fundamental points. A huge supporting bibliography containing all the
relevant works can also be found there and, in order to save space, will not
be presented here.
In this work, we will first use the Bayesian inference framework to introduce
the reasoning that leads to a formalisation of the concept of scientific
method, which is done in section 2. The basic processes that compose
the scientific methodology according to this framework, which will be called
information acquisition, modeling and testing/selection, are introduced and
briefly discussed there. The precise meaning of information acquisition is
explained in section 3. Section 4 then analyses the modeling process of
theories in more detail. Testing, which we will use as a short word
for testing/selection, is studied in section 5, after which follows a
discussion of the important related concept of falsifiability in section 6.
In particular, this latter section will contain the two most important
definitions of this paper, those of a scientific theory and of the scientific
method. These sections contain the core of our framework and the next sections
will be concerned with applications. In section 7 the formalism developed here
is used to analyse three current cosmological issues in theoretical physics,
namely the question of the typicality of human observers, the preference for a
multiverse description and a brief analysis of the meaning of the (weak)
anthropic principle. Section 8 will introduce the isolated worlds problem and
show how the present framework yields a solution that extends the
applicability of the scientific method beyond today's range. A discussion and
the final conclusions are given in section 9.
2 Bayesian Reasoning and Physical Theories
Usually, most discussions about the scientific method begin with a philo-
sophical consideration of truth. The concept of truth can be stated in more
pragmatic terms by asserting that there is a well defined underlying process
generating every acquired piece of information from nature. This is not done
here as the formalism to be developed does not require this concept to be
considered at all, even if a completely positivist point of view is taken.
Every formalisation of a subject starts by trying to establish precise
definitions of the key concepts the theory wants to capture. In the case of the
scientific method, there are many intuitive and philosophical requirements
that scientific theories should fulfill. A non-comprehensive list includes Ock-
ham’s razor, falsifiability, elegance and explanatory power. Some of these
concepts are easier to formalise than others. We will try to address those
which seem most important in our opinion.
Before that, however, let us first consider a simplified model to begin our
discussion. The most fertile ground for this is machine learning. The basic
task of this discipline is to study mathematical models, usually to be
implemented in computers, that are able to do some kind of “learning”. Learning
is mainly associated with identifying patterns in a given dataset. Consider as
an example the perceptron, a very simple machine learning model thoroughly
studied by methods of statistical physics [11,10]. A perceptron is a
mathematical model where a student perceptron characterised by a vector
$S \in \mathbb{R}^M$ tries to infer a rule given a set of examples. In a
supervised learning situation, the rule is encoded by the vector
$B \in \mathbb{R}^M$, which is usually called the teacher perceptron. The
student then has access to a dataset $D_t$ formed by $t$ pairs
$(\xi_i, \sigma_i)$, $i = 1, \ldots, t$, of questions $\xi_i$ and answers
$\sigma_i = f(B, \xi_i)$ and has to infer from it the value of $B$. The
function $f$ is called the activation function of the teacher and in the
simplest scenario is known to the student, although we can consider a
situation where the student will have to infer its functional form
as well [12]. This process of “discovering the rule” is called generalisation
and is measured by the so-called generalisation error, which is the
probability that the student answers a new question wrongly. Note that the
activation function can have a stochastic component, which is usually
associated with some kind of noise that distorts the dataset.
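The teacher and student setup described above can be sketched in a few lines. This is only an illustration under stated assumptions: the activation function is taken to be the sign of the overlap between $B$ and the question, the questions are drawn as Gaussian vectors, and the student uses the classic perceptron correction rule rather than a Bayesian update. None of these choices, nor the names `make_dataset`, `train_student` and `generalisation_error`, come from the text.

```python
import random

def sign(x):
    """Activation function f assumed here: the sign of the overlap."""
    return 1 if x >= 0 else -1

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def make_dataset(teacher, t, rng):
    """Build D_t: t random questions xi with answers sigma = f(B, xi)."""
    dataset = []
    for _ in range(t):
        xi = [rng.gauss(0.0, 1.0) for _ in teacher]
        dataset.append((xi, sign(dot(teacher, xi))))
    return dataset

def train_student(dataset, dim, epochs=50):
    """Perceptron rule (an assumption): correct S after every wrong answer."""
    student = [0.0] * dim
    for _ in range(epochs):
        for xi, sigma in dataset:
            if sign(dot(student, xi)) != sigma:
                student = [s + sigma * x for s, x in zip(student, xi)]
    return student

def generalisation_error(student, teacher, rng, trials=5000):
    """Monte Carlo estimate of the probability of a wrong answer to a new question."""
    wrong = 0
    for _ in range(trials):
        xi = [rng.gauss(0.0, 1.0) for _ in teacher]
        if sign(dot(student, xi)) != sign(dot(teacher, xi)):
            wrong += 1
    return wrong / trials

rng = random.Random(0)
teacher = [rng.gauss(0.0, 1.0) for _ in range(5)]   # B, hidden from the student
dataset = make_dataset(teacher, 200, rng)            # D_t with t = 200
student = train_student(dataset, dim=5)              # S inferred from D_t
error = generalisation_error(student, teacher, rng)
untrained_error = generalisation_error([0.0] * 5, teacher, rng)
```

Comparing `error` with `untrained_error` shows how much of the rule the student has recovered from the dataset alone.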
The perceptron was inspired by the landmark paper of McCulloch and
Pitts [13], which introduced a simplified mathematical model for real human
neurons and initiated the field of artificial neural networks [14]. In fact,
neural networks are formed by collections of interconnected perceptrons.
Similarly,
but in a more complicated way, when humans do science the main objective
is to try to infer natural rules, which are supposed to exist, based on some
dataset encoding information derived from observations about a system. Al-
though nature may be independent of humans, science is realised by human
brains, which are devices that evolved to process information, making this
processing an important part of science. The process of human thinking by
deductive logic has been studied since ancient times with a theory of deduc-
tive logic being already proposed by Aristotle. However, pure deductive logic
is not enough to capture all characteristics of thinking. It can be shown [4,
7] that given some set of reasonable requirements that plausible reasoning
must follow in the form of Cox axioms [15], probability theory is the correct
mathematical description of it. In addition, Bayesian inference becomes the
correct method to update probabilities based on new evidence. As it is natu-
ral to require that the scientific method follows the formal rules of plausible
reasoning, Bayesian inference is the framework we choose to formalise it.
The concepts to be formalised in the following have been used in many
recent discussions about fundamental aspects of physical theories, especially
cosmological models [16,17]. A precise definition of these concepts may help
not only to better pose the relevant questions, but also to make sure that
all conclusions are correctly derived from the analysis of known data and
possible hypotheses.
The aim of the scientific method is to generate scientific theories.
Intuitively, a physical theory $T$, or any other scientific theory actually,
is constructed to “explain” some observed facts, although the meaning of
“explaining” is a philosophical question which will not be explored in depth
in this work. We will assume that what we call observed facts can be
described either as the result of physical measurements obtained from
experiments chosen from some set, or as the information obtained by some kind
of reasoning, such as the result of a mathematical theorem that is supposed to
answer some mathematical question. The experiments or questions in general
will be assumed to be chosen from a question set $\Xi$, to be better defined
in the next section. In addition, we also assume that the results obtained as
answers to the question, the result of a measurement for instance, can always
be taken as real vectors, as any information can in some way be encoded by
them, although sometimes a non-numerical description may be more illustrative,
as in the case where the measurement corresponds to some quality such as
“colour”. This process by which information is acquired is a
basic part of the scientific method and we will call it information acquisition.
Acquiring information is required both to construct and to test a theory.
Using the same notation we used for the perceptron, $D_t$ will denote a
collection of $t$ data points, with each data point composed of a label
denoting the question and the corresponding acquired answer.
The process of constructing a theory, or modeling, will be discussed in
detail in section 4. With the knowledge of some dataset $D_t$ it is possible
to elaborate many different theories about the acquired information. When
faced with different theories, then, we would like to be able to compare their
predictions to some dataset in order to assess their relative agreement with
the information we were able to extract. Consider, for instance, a set of $M$
theories $T_1, \ldots, T_M$ created to explain a dataset with questions drawn
from some set $\Xi$. The most principled method to choose the “best” theory is
to rank them by their posterior probabilities given the data using Bayes' theorem
$$P(T_i|D_t) = \frac{P(D_t|T_i)\,P(T_i)}{\sum_{j=1}^{M} P(D_t|T_j)\,P(T_j)}. \tag{1}$$
The probability $P(D_t|T_i)$ of the observed dataset given the theory, also
known as the likelihood of the theory, represents the idea of “predictions” of
the theories. This probability is normalised for fixed values of $t$ as
$$\sum_{D_t} P(D_t|T_i) = 1. \tag{2}$$
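As a minimal sketch of equations (1) and (2), consider a purely hypothetical family of three "coin" theories, each asserting a fixed probability for the outcome 1 in a sequence of independent binary measurements. The theories, the dataset and the function names below are illustrative assumptions, not part of the formalism itself.

```python
from itertools import product

def posterior(priors, likelihoods):
    """Equation (1): P(T_i|D_t) from priors P(T_i) and likelihoods P(D_t|T_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)
    if evidence == 0:
        raise ValueError("no theory assigns non-zero probability to the dataset")
    return [j / evidence for j in joint]

def coin_likelihood(bias, dataset):
    """P(D_t|T) for a hypothetical theory: independent outcomes, P(1) = bias."""
    prob = 1.0
    for outcome in dataset:
        prob *= bias if outcome == 1 else 1.0 - bias
    return prob

def normalisation(bias, t):
    """Equation (2): sum of P(D_t|T) over all datasets of fixed size t."""
    return sum(coin_likelihood(bias, d) for d in product((0, 1), repeat=t))

biases = [0.2, 0.5, 0.8]               # three candidate theories T_1, T_2, T_3
priors = [1.0 / 3.0] * 3               # no initial preference among them
dataset = [1, 1, 0, 1, 1, 1, 0, 1]     # t = 8 measured answers
likelihoods = [coin_likelihood(b, dataset) for b in biases]
post = posterior(priors, likelihoods)

# Rank theory indices from most to least probable.
ranking = sorted(range(len(biases)), key=lambda i: post[i], reverse=True)
```

With six outcomes equal to 1 out of eight, the theory with bias 0.8 is ranked first, and `normalisation` confirms that each likelihood sums to one over all datasets of a given size.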
Of course it is possible to include a probability distribution over the value
of t as well. Suppose, for instance, that the probability of taking a measure-
ment of some sort is constant in time. Then, the probability of taking t
measurements in a fixed interval of time is given by a Poissonian distribution
$$P(t) = \frac{\bar{t}^{\,t} e^{-\bar{t}}}{t!}, \tag{3}$$
where $\bar{t}$ is the average value of the number of measurements. However, this
can be easily accommodated in the formalism by including t as a random
variable in all the probability formulas and we will avoid this sophistication
as it does not affect any of the main considerations.
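One way to see why this sophistication can be avoided is the following short derivation. It rests on an assumption the text does not state explicitly, namely that the distribution $P(t)$ of the number of measurements is the same for every theory. Under that assumption the factor $P(t)$ appears in every term and cancels from the posterior:

$$P(T_i|D_t, t) = \frac{P(D_t|t, T_i)\,P(t)\,P(T_i)}{\sum_{j=1}^{M} P(D_t|t, T_j)\,P(t)\,P(T_j)} = \frac{P(D_t|t, T_i)\,P(T_i)}{\sum_{j=1}^{M} P(D_t|t, T_j)\,P(T_j)}.$$

If instead a theory made its own prediction for the rate of measurements, the factor $P(t|T_j)$ would remain inside the sum and simply act as one more contribution to that theory's likelihood.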
The probability distribution $P(T_i)$ in equation (1) is called the prior
distribution of the theories. It gives a ranking of preference of the theories
according to those intrinsic characteristics of each theory which are
independent of the dataset. Note that by fixing the number of theories to be
compared ($M$ according to the given notation) we avoid any ambiguities with
respect to the fact that we never really know how many theories are possible.
We will always be concerned with comparing theories that we have, not ones
that we do not yet have.
Equation (1) is the fundamental formula of Bayesian inference and will
also be our central principle. Based on the analysis of its consequences, we
will be able to study the process of testing physical theories, which will be
dealt with in detail in section 5.
It may be argued that we should be able to evaluate the plausibility of a
theory even if we have only one. But then, by using equation (1), we would
always find probability one for this theory as there is no other. That
is not a weakness of the procedure, for it is designed to choose one among
many possible theories. To evaluate the probability of one theory being right
according to the data, the correct procedure is to consider a secondary pair
of hypotheses: is the theory correct or wrong? These two hypotheses can then
be compared using Bayesian inference, giving the desired result. For a more
detailed discussion, the reader is referred to [4]. This latter case, however,
does not weigh other characteristics of a theory and would not discriminate
between two different theories that explain the data with the same
precision.
Theories cannot be proved right, but can be proved wrong, even if only
in principle. We are only keen to consider a theory scientific if we are able to
falsify it, which is accomplished by making predictions that can be checked
by comparing with the results of the experiments. This procedure is part of
the testing process. Still, falsifiability is such an important concept that we
will dedicate the entire section 6 to its study.
As can be appreciated from the exposition above, information acquisition,
modeling and testing are interrelated, dependent processes. In fact,
we claim that these three basic elements are all that is needed to formalise
the scientific method. Science should then be carried out by iterating the
above processes in whatever order is necessary to produce a probability
distribution for scientific theories at some instant $t$ after processing all
the information available in the dataset $D_t$.
Note that this does not produce a unique theory unless the probability of
all other theories is reduced to zero at some point. Selection of one theory, if
desired, can be carried out by choosing the most probable theory after some
dataset has been measured. In the next sections we will start to develop the
formalism and give more precise definitions that will enable us to study in
detail each one of the processes described above.
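The iteration described above can be sketched as a loop in which the posterior after each new data point becomes the prior for the next one. The sketch assumes that data points are independent given the theory, which the general formalism does not require, and it uses two hypothetical theories that each predict a fixed probability for a binary outcome.

```python
def update(belief, point_likelihoods):
    """One Bayes step: belief holds P(T_i|D_{t-1}), point_likelihoods holds
    P(new data point|T_i); the return value is P(T_i|D_t)."""
    joint = [b * l for b, l in zip(belief, point_likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical theories: each predicts the probability of the outcome 1.
predictions = [0.5, 0.9]
belief = [0.5, 0.5]          # prior P(T_i) before any measurement
history = [belief]           # probability distribution over theories at each t

for outcome in [1, 1, 0, 1]:
    point_likelihoods = [p if outcome == 1 else 1.0 - p for p in predictions]
    belief = update(belief, point_likelihoods)
    history.append(belief)

# Selection, if desired: index of the most probable theory after D_t.
selected = belief.index(max(belief))
```

Under the independence assumption, the final `belief` coincides with what equation (1) gives when applied once to the whole dataset.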
3 Information Acquisition
In order to be able to formalise the scientific method, we will first attempt
to give definitions for its basic concepts. It became clear from the arguments
of the last section that a fundamental concept of the theory should be that
of a question, and we will therefore give tentative formal definitions of the
basic elements that are required for it, as well as try to relate them to their
natural (physical, mathematical) interpretations.
Definition 1 (Questions and Measured Answers) A question $Q \in \mathcal{Q}$
is a mathematically describable object encoding all the information about a
specific physical experiment or mathematical reasoning. A measured answer
$A \in \mathcal{A}$ to a question $Q$ is another describable object that
encodes the information acquired by carrying out the experiment/reasoning
encoded by $Q$, which is represented symbolically by
A = Q. (4)
The sets $\mathcal{Q}$ and $\mathcal{A}$ can contain any kind of describable
objects, not necessarily of the same kind.
The concept of a question encompasses both that of a physical experiment and
that of a mathematical/logical reasoning. For a physical experiment, the
question must encode the experimental setup, the method of measurement and any
other relevant information that is necessary to reproduce it. In the case of a
reasoning, the method of reasoning and the hypotheses used to obtain the
measured answer should be part of it. The encoded information should also
include initial conditions related to the systems under study or
characteristics of these systems like, for instance, the mass or the
charge of a particle in a high energy experiment.
The freedom allowed with respect to the nature of questions and measured
answers is necessary to guarantee that any kind of inquiry is possible. For
instance, in the case of the perceptron, a question consists of a
multidimensional Euclidean vector, with coordinates usually being boolean or
real, together with the activation function. On the other hand, questions can
also be formulated in more familiar languages, like for example “What is the
mass of the electron?” or “Are there any integers $x, y, z$ satisfying
$x^n + y^n = z^n$ for integer $n > 2$?”. In quantum mechanics, as another
example, questions can be associated with the calculation of eigenvalues of
Hermitian operators (and any mathematical method used to do it) with measured
answers corresponding to their actual numerical values.
Definition 2 (Question Set) Consider an index set $I$. The set
$$\Xi = \{Q_i\}_{i \in I}, \tag{5}$$
where each element $Q_i$ indexed by values of the index set $I$ corresponds to
a possible question about a system, is called a question set.
The index set $I$ can be either discrete or continuous. It can also be finite
or infinite. We will not restrict it in any way a priori. Note that the index
of a question can be related to numerical parameters encoded by it. This has
the profound consequence that the index set $I$ can then, in some situations,
acquire a physical interpretation beyond that of a simple mathematical
object. Indeed, we know that concepts such as mass and charge, for instance,
label in quantum field theory the possible representations of symmetry groups
and correspond to the characteristics that define the elementary particles.
Definition 3 (Dataset) Given a question set Ξ as defined above, the set
$$D_t = \{(\xi_\mu, \sigma_\mu)\}_{\mu=1,\ldots,t}, \tag{6}$$
of ordered pairs of questions ξµ ∈ Ξ and corresponding measured answers
σµ = ξµ, called data points, where t is an integer, is called a dataset.
The notation is borrowed from machine learning (see again the perceptron
example given in the previous section). Although we are using the notation t
to index the dataset, this must not be interpreted as a time index in general,
but only in particular cases if it is designed as such. We could try to generalise
t by allowing its values to be in any set, but this is unnecessary in most
practical situations and therefore we will not attempt to do that.
There are many cases where the experiments $\xi_\mu$ can be chosen
independently of each other, but it is conceivable that in many situations
this is not possible. For instance, suppose the acquired information
corresponds to values of some Markovian random variable at each time step.
Then it is possible to choose the initial state of the system, but the
following states will be defined by its stochastic dynamics and are not
independently chosen. The specific way in which the correlations are encoded
is defined by the theory $T$, which will be defined in the next section, and
is part of its structure.
It is important to notice at this point that in the case of physical
experiments, the resulting value can be corrupted by many different kinds of
noise. Usually, when many independent sources of noise act together, the
central limit theorem implies that the net result is approximately Gaussian
noise, but this may not always be the case. In the same way as the dataset
correlations, we will see that the concept of noise and its modeling also
depends on the theory $T$ being considered and, as such, will be described in
more detail when we discuss the process of modeling, in the next section. As a
trivial example, electric experiments are made with devices with a noise model
that depends on electromagnetism itself.
At many points during the scientific investigation of a physical process,
experiments are made in an attempt to obtain more information to continue
the process of modeling. In these cases, there are many methods that can be
used to decide what kind of questions should be asked. An interesting
approach is named learning by queries. Queries are questions that are chosen
so as to maximise the amount of information acquired with their answers. The
theory of learning by queries has been analysed and used successfully in many
applications in neural networks, with the possibility of being generalised to
other situations. We will not delve into details here; more information is
available in [18] and references therein.
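One common way to make "maximising the amount of information" concrete is to pick the question whose answer has the largest expected information gain about which theory holds. The sketch below uses that criterion for yes/no answers. It is an illustrative assumption and not necessarily the specific method analysed in [18], and the questions and predicted probabilities are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a yes/no answer with P(yes) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def expected_information_gain(belief, predictions):
    """Mutual information between the answer and the theory index.

    belief[i] is the current P(T_i); predictions[i] is P(yes|T_i, question).
    """
    mixture = sum(b * p for b, p in zip(belief, predictions))
    expected_entropy = sum(b * binary_entropy(p) for b, p in zip(belief, predictions))
    return binary_entropy(mixture) - expected_entropy

def best_query(belief, predictions_by_question):
    """Return the question label with the largest expected information gain."""
    return max(
        predictions_by_question,
        key=lambda q: expected_information_gain(belief, predictions_by_question[q]),
    )

belief = [0.5, 0.5]                      # two theories, equally probable
predictions_by_question = {
    "q_agree": [0.7, 0.7],               # both theories predict the same answer
    "q_mild": [0.6, 0.4],                # theories disagree slightly
    "q_decisive": [0.0, 1.0],            # theories predict opposite answers
}
chosen = best_query(belief, predictions_by_question)
```

A question on which all theories agree carries no information for discriminating between them, while a question on which they make opposite certain predictions yields one full bit.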
Let us then finish this section by giving the definition of the first of the
three processes that will compose the scientific method.
Definition 4 (Information Acquisition) The process of constructing a
dataset $D_t$ by obtaining the measured answers to some subset of the
question set $\Xi$ is called information acquisition.
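Definitions 1 to 4 can be mirrored by a very small data structure. In the sketch below the question set is a dictionary from index to question, a dataset is a list of (question, measured answer) pairs, and information acquisition is a function that fills that list. The free-fall example, the function names and the value of g are assumptions made only for illustration; `drop_time` stands in for an actual experiment.

```python
import math

def acquire(question_set, measure, chosen_indices):
    """Information acquisition: build D_t as a list of (question, answer) pairs
    for the chosen subset of the question set."""
    return [(question_set[i], measure(question_set[i])) for i in chosen_indices]

def drop_time(height_m):
    """Stand-in for an experiment: fall time in seconds from a given height."""
    g = 9.81  # assumed gravitational acceleration in m/s^2
    return math.sqrt(2.0 * height_m / g)

# Question set Xi indexed by I = {0, 1, 2, 3}; each question is a drop height
# in metres, so the index is tied to a numerical parameter of the question.
question_set = {0: 1.0, 1: 2.0, 2: 5.0, 3: 10.0}

dataset = acquire(question_set, drop_time, chosen_indices=[0, 2, 3])
t = len(dataset)  # number of data points in D_t
```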
4 Modeling
The process of modeling is the most involved aspect of the scientific method.
Here is where the inspiration of the researcher comes into play. To date there
is no algorithmic methodology for creating a model, only guidelines. A
formalisation of this process would require traits like creativity to be modeled
and, until we have a better understanding of the process of thinking, this will
remain uncertain ground. Modeling the process of creating a theory is not
our objective. Here it is assumed that there is a way to model the theory and
we will try to assess its characteristics. We start by giving a more precise
definition of a theory. Note that we still are not defining a scientific theory,
for which we still need some more considerations.
Definition 5 (Theory) A theory $T = (\alpha, \pi)$ is composed of an algorithm
$\alpha$ depending on a set of $p$ free parameters
$\pi = \{\pi_1, \ldots, \pi_p\}$, called the theory's fundamental constants,
that generates in a finite time a probability distribution $P(D_t|T)$ for any
possible dataset $D_t$ with questions belonging to a specific question set
$\Xi$. In this sense, we say that the theory $T$ answers the question set
$\Xi$ and the probability distributions $P(D_t|T)$ are called the
theoretical answers of the theory.
In physical theories, the fundamental constants are called physical
constants, which must be dimensionless [19], and this is the motivation for the
adopted terminology. The values of the theory's constants are not part of
the theory and need to be obtained from experiments, or better, from the
measured answers. The correct way to obtain them, once more, is to use
Bayesian inference to extract their values from the available datasets. We will
talk more about information processing using the datasets in section 5
when we discuss how to test the theories.
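To illustrate how the value of a fundamental constant might be extracted from a dataset by Bayesian inference, the sketch below evaluates the posterior of a single constant on a grid of candidate values. The theory used (answers equal to the constant times the question, plus Gaussian noise of known width), the uniform prior on the grid and all names are assumptions chosen for brevity.

```python
import math

def log_likelihood(constant, dataset, noise_sd):
    """log P(D_t|T) for a hypothetical theory sigma = constant * xi + Gaussian noise."""
    total = 0.0
    for xi, sigma in dataset:
        residual = sigma - constant * xi
        total += -0.5 * (residual / noise_sd) ** 2
        total -= math.log(noise_sd * math.sqrt(2.0 * math.pi))
    return total

def grid_posterior(grid, dataset, noise_sd):
    """Posterior over candidate values of the constant, uniform prior on the grid."""
    logs = [log_likelihood(c, dataset, noise_sd) for c in grid]
    peak = max(logs)                         # subtract the maximum for stability
    weights = [math.exp(v - peak) for v in logs]
    norm = sum(weights)
    return [w / norm for w in weights]

grid = [0.5 * k for k in range(9)]           # candidates 0.0, 0.5, ..., 4.0
dataset = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # (question, measured answer)
post = grid_posterior(grid, dataset, noise_sd=0.5)
best = grid[post.index(max(post))]           # most probable value on the grid
```

The full list `post` carries more information than `best` alone, since its spread indicates how well the dataset constrains the constant.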
In principle, it can be argued that the algorithm $\alpha$ can be either
stochastic or deterministic. This is, however, just a matter of convention.
Any stochasticity of an algorithm $\alpha$ can always be transferred to its
theoretical answers by modifying the resulting probability distributions of
the datasets accordingly. However, there is a sense in which we can talk about
a deterministic or stochastic theory that is reminiscent of physical theories.
This can be formalised in the following way.
Definition 6 (Stochastic and Deterministic Theories) A determin-
istic theory is one where all theoretical answers are delta functions, i.e.,
probability distributions with zero variance. Otherwise, the theory is called
stochastic.
It is worthwhile to clarify at this point that we are not trying to delve into
the philosophical debates about the meaning of the mathematical structure
of the theory. This can be illustrated by the question of the interpretation
of quantum mechanics. The mathematical structure of quantum mechanics
is well defined and it is this structure that we are calling the theory. The
pictorial representations in terms of human language or perception, although
being extremely important in our opinion, are questions that are out of the
scope of our framework and will not be dealt with here beyond what we
consider to be necessary to develop our ideas. However, we agree that this is
an interesting future research direction.
The given definition of a theory is minimal in the sense that the internal
structure of the algorithm is not completely specified. We can however try to
identify this structure. Consider a dataset Dt. As already pointed out in the
previous section, the theory must be able to describe correlations between its
data points, the noise affecting the experiments (in the relevant situations)
and the probability distributions corresponding to ideal theoretical answers
in the absence of noise.
The interdependence of a certain set of variables can always be repre-
sented by a Bayesian network [20], which is a directed acyclic graph where
each node represents one of the variables and each directed edge represents
conditional dependence. A Bayesian network is a particular case of more
general probabilistic structures known as graphical models which have been
used in applications of statistical mechanics for some time now. Most well
known statistical mechanical models, like the Ising model, can be written in
this language. By using the information contained in the network and the
chain rule of probability, P (Dt|T ) can be broken down into factors repre-
senting the correlation structure in the dataset according to the theory. An
interesting feature is that, in addition to the nodes representing the data
points, a Bayesian network may contain hidden nodes, which are quantities
that cannot be directly observed but whose state can influence the other
variables, thereby encoding non-explicit correlations. We will see how these
hidden nodes appear in a physical context when we analyse an example of a
quantum mechanical system later on.
Fig. 1 A Bayesian network representing four variables q1, q2, q3, q4 and their
dependences.
Figure 1 shows an example of a Bayesian network representing a Markov
model of order two involving four variables q1, q2, q3 and q4. Following the
dependence encoded in the graph we can then write
P (q1, q2, q3, q4) = P (q4|q3, q2)P (q3|q2, q1)P (q2|q1)P (q1). (7)
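The factorisation in equation (7) can be checked numerically. The following is only a minimal sketch, assuming binary variables and arbitrary randomly generated conditional probability tables (the names `p_q1`, `p_q2`, `p_q3`, `p_q4`, `random_table` and `joint` are hypothetical and not part of the formalism). It builds the joint distribution factor by factor and sums it over all configurations.

```python
import itertools
import random

rng = random.Random(0)  # fixed seed so the illustrative tables are reproducible

def random_table(n_parents):
    """Map each binary parent tuple to an arbitrary P(child = 1 | parents)."""
    return {parents: rng.random()
            for parents in itertools.product((0, 1), repeat=n_parents)}

def bernoulli(p_one, value):
    """Probability of `value` for a binary variable with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

# Hypothetical conditional probability tables for the four binary variables.
p_q1 = rng.random()      # P(q1 = 1)
p_q2 = random_table(1)   # P(q2 = 1 | q1)
p_q3 = random_table(2)   # P(q3 = 1 | q2, q1)
p_q4 = random_table(2)   # P(q4 = 1 | q3, q2)

def joint(q1, q2, q3, q4):
    """Joint probability assembled with the same factors as equation (7)."""
    return (bernoulli(p_q4[(q3, q2)], q4)
            * bernoulli(p_q3[(q2, q1)], q3)
            * bernoulli(p_q2[(q1,)], q2)
            * bernoulli(p_q1, q1))

# Sum of the factorised joint distribution over all 16 configurations.
total = sum(joint(*q) for q in itertools.product((0, 1), repeat=4))
```

Because every factor is itself a normalised conditional distribution, `total` should equal one up to rounding, and conditioning q4 on all three earlier variables should give back the table that depends only on (q3, q2), which is the order-two Markov property.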
Note that when the dataset comes from a physical process developing
in time, the case where the conditional probabilities of the system states
at every instant, given the former instants, are delta functions represents a
process with no random component in its time development, which
conforms to our definition of a deterministic theory. The classic comparison is
between Newtonian mechanics, which at a fundamental level does not include
any statistical element, and quantum mechanics, which is fundamentally
based on stochastic processes. Even though the time development of the wave
function is deterministic, the results of what we will later see
to be completely noiseless measurements, the ideal theoretical answers of
the theory, remain stochastic and are given by probability distributions over
possible values which are not delta functions.
Let us now focus on how the theory characterises the effect of noise on
the answers. The modeling of the noise presupposes that the experimental
conditions encoded by the questions can affect the answers in such a way
that what would be some unique value in the absence of noise becomes a
probability distribution over values. Therefore, this “corruption” can be
described by some probability distribution. This is more easily seen in the
case of only one data point (ξ, σ). The generalisation is then straightforward.
Let us then assume that σ = ξ̃ ∈ A is the corrupted value obtained as a measured
answer. Let us call the noiseless value of this answer a, which will not be
experimentally accessible due to the noise. According to Bayesian inference
theory, we should then marginalise over all unmeasurable possibilities and
write
\[
P(\xi, \sigma | T) = \sum_{a} P(\xi, \sigma, a | T)
= \sum_{a} P(\sigma | a, \xi, T)\, P(a | \xi, T)\, P(\xi | T). \tag{8}
\]
If the way the questions (or experiments, for instance) are chosen does
not depend on the theory, the last term in this formula cancels out when the
posterior distribution for the theory is normalised. The term P (a|ξ, T ) is then
the distribution of possible answers to the question ξ (or experimental results)
without the interference of the noise, which we call the noiseless theoretical
answer. The noise modeling is then contained in the first term P (σ|a, ξ, T ).
Of course, if there is more than one data point, a Bayesian network specifying
the correlation of all variables, including the possible measured answers (·),
should be given by the theory.
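The marginalisation over the inaccessible noiseless answer can be sketched for the simplest case. This is only an illustration under stated assumptions: binary answers and a symmetric bit-flip noise model with flip probability `flip_prob`, neither of which is prescribed by the formalism. The function name `noisy_answer_distribution` is hypothetical.

```python
def noisy_answer_distribution(noiseless, flip_prob):
    """Return P(sigma | xi, T) = sum_a P(sigma | a, xi, T) P(a | xi, T).

    `noiseless` maps each noiseless answer a in {0, 1} to P(a | xi, T).
    The noise model P(sigma | a, xi, T) is an assumed symmetric bit flip.
    """
    measured = {0: 0.0, 1: 0.0}
    for a, p_a in noiseless.items():
        for sigma in (0, 1):
            p_sigma_given_a = flip_prob if sigma != a else 1.0 - flip_prob
            measured[sigma] += p_sigma_given_a * p_a
    return measured

# A deterministic theoretical answer (a delta function on a = 1) ...
deterministic = {0: 0.0, 1: 1.0}
# ... becomes a proper distribution over measured values once noise is present.
measured = noisy_answer_distribution(deterministic, flip_prob=0.1)
```

With `flip_prob=0.0` the measured distribution coincides with the noiseless theoretical answer, which is the noiseless limit used in the examples of this section.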
The algorithm α should then describe all the above stated relationships.
It is known that algorithms can be described in many ways by using different
languages. The algorithm α can also be seen as a mathematical structure
defined by a set of a axioms {α1, ..., αa} dependent on the fundamental
constants pi. A method for encoding mathematical structures, originating from
model theory, is explained for instance in [21]. The physical experiments
encoded in the question set Ξ and the possible values of measurements, the
measured answers, are then associated with theorems derived from these axioms.
The construction of the algorithm α is usually subject to many constraints,
the most fundamental one being (mathematical) consistency. For
example, if a physical theory can calculate, let us say, the entropy of a black
hole by several different methods, they must give the same answer. Other
constraints can nevertheless be imposed like, for instance, Tegmark’s
Computable Universe Hypothesis [21], which could be implemented by requiring
that the probability distributions which the theory is supposed to generate
must be computable, i.e., the theory must allow for their calculation in a
finite time, which in particular we have already included in our definition
of theory. Any constraint to be enforced should ultimately be part of α and
be incorporated in its axioms in some form. It is then fair to think of
both the number of axioms needed to construct the theory and the number
of free parameters in it as leading in some way to the concept of complexity of
the theory.
The concept of complexity is obviously not a simple one. There are many
measures of complexity in the literature and new ones are proposed from
time to time, with the exact idea of what is the meaning of “complex” being
different in each of them. A popular measure that seems to be able to cap-
ture many desired characteristics is the one introduced by Solomonoff [22]
in 1960 and independently five years later by Kolmogorov [23]. It is known
as algorithmic complexity or alternatively as Kolmogorov complexity (KC).
The KC of a given object is defined as the length of the shortest description
of an algorithm that can reproduce the object in some universal language
[5,24]. Although this length changes with the particular language used to
express the algorithm, Kolmogorov was able to show that the description
lengths differ only by an additive constant. KC is however an accurate concept
only for dealing with algorithms of classical information theory. If we broaden
our spectrum of theories by allowing α to be not only a classical algorithm, but
also a quantum one [25], then the usual KC is not an appropriate measure
of complexity in general. There are for this case proposals for considering a
quantum version of KC [26,27]; however, there is no clear general complexity
measure that is able to encompass both cases. This problem is related to the
fact that the frontier between the classical and the quantum is not completely
understood yet.
Even if we restrict our analysis to classical algorithms, Kolmogorov com-
plexity has still one drawback, namely, the fact that it is uncomputable. One
solution proposed in the MDL approach is to use instead the prefix Kol-
mogorov complexity (PKC) which is the restriction of Kolmogorov complex-
ity to self-delimiting codes, for which there always exists a Turing Machine
that can identify if a codeword is or is not part of the code in finite time [5].
Be it classical or quantum, the basic principle behind Kolmogorov com-
plexity is that the complexity of any object is actually related to its regularity,
or more specifically, to its compressibility. Consider a classical string of zeros
and ones. The more regular the string is, the smaller is the program needed
to reproduce that string. In other words, we can compress the string in a
number of characters less than its size. In fact, compressibility is related to
the number of symmetries of an object O and is fundamentally linked to the
size of its automorphism group Aut(O), or the group of its symmetries. The
more symmetric an object is, the shorter the description needed to reproduce
it. The relationship between Aut(T ) and complexity of description for a
theory T is also discussed in [21] and is behind the idea that a GUT should
have a larger symmetry than the effective low-energy theories unified by it.
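The link between regularity and compressibility can be illustrated with a general-purpose compressor. Since Kolmogorov complexity is uncomputable, the sketch below uses the length of a zlib-compressed string only as a crude, computable upper-bound proxy. This is our assumption for illustration and not a measure used in the text, and the helper name `compressed_length` is hypothetical.

```python
import random
import zlib

def compressed_length(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (a rough proxy for KC)."""
    return len(zlib.compress(data, 9))

# A highly regular string of zeros and ones: the pattern "01" repeated.
regular = b"01" * 5000

# An irregular string of the same length drawn from a pseudo-random generator.
rng = random.Random(0)
irregular = bytes(rng.choice(b"01") for _ in range(10000))

regular_size = compressed_length(regular)
irregular_size = compressed_length(irregular)
```

The regular string should compress to far fewer bytes than the irregular one, reflecting the fact that a short program ("repeat 01 five thousand times") suffices to reproduce it.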
The exact measure of complexity to be used will not be important for our
discussion, only the fact that it is a quantity that is possible to measure and
can approximate the relative importance of this characteristic in each theory.
The reason is that we would like to compare relative complexities in
order to use them as an additional criterion for selecting theories, which will
be included in the prior distribution P (T ). In fact, there may be situations
where KC or PKC are not the most convenient choices and simpler
approximations adapted to the theories being analysed would capture the
relative importance of each theory better than these options.
There is a subtle point here. Note that in the above considerations about
complexity we only talked about algorithms. The way we defined it, a theory
is obviously not only its algorithmic part α, and complexity is a concept that
should be extended to include somehow the dependence of the theory on the
set pi of its fundamental constants as well. However, contrary to the approach
taken in MDL, we choose to consider the complexity of these two components
of the theory separately for reasons that will become clearer as we proceed in
our study. We still associate the word “complexity” with some measure of
the length of the algorithm α, but we deal with pi in a different way. It is
important to note that what is considered to be a fundamental constant
in one theory can be a quantity that may depend on other fundamental
constants in a different theory and, therefore, may be calculable in it. This
latter theory would then have fewer fundamental constants than the former
and intuitively it would make sense to say that it is “more fundamental”.
This suggests a way to define a concept which we will call the fundamental
status of a theory, related to the set of its fundamental constants pi.
Definition 7 (Fundamental Status) Given two theories T1 and T2
answering the same question set Ξ with sets of fundamental constants given
respectively by pi(1) and pi(2), theory T1 is said to be more fundamental, or
to possess a higher fundamental status, than theory T2 if the cardinality
of pi(1) is less than that of pi(2), or stated in other terms, if theory T1 has
fewer fundamental constants than theory T2.
If two theories answer the same question set, even if it is a very restricted
one, it is reasonable to expect that their fundamental status and complexity
should be related. In fact, as the more fundamental theory will have fewer
constants, the decrease in the number of constants must result in an
increase in the number of relations between the remaining ones, which implies
an increase in the complexity of α. In other words, a theory gets rid of
constants by including constraints that relate their values. From this simple
observation we are then led to the startling idea that the more fundamental
the theory, the more complex it is. At first sight this must seem highly
counter-intuitive. This is obviously a result of the highly difficult task of
defining complexity in a manner that agrees with all the intuitive ideas that
we usually associate with this word. However, upon closer analysis, all
concepts involved are intuitively reasonable when we consider that theories we
would usually classify as highly fundamental, string theory for instance,
rely on fewer physical constants but are mathematically
much more challenging, to the point of taking the efforts of many physicists
during a considerable amount of time to provide ways to use them to calculate
quantities that would be simpler in theories considered less fundamental. This
is not an isolated example. We can easily realise that theories we call more
fundamental in general require much more subtle concepts and much more
mathematical sophistication than the others. Actually this could be
understood as the relation between generalisation and fitting in machine learning
[28]. Less fundamental theories favour fitting as they have more adjustable
parameters, while more fundamental ones favour generalisation. We will have
more to say about that in the next section.
Again, the above expressed ideas seem to be at odds with our assertion
that GUTs should be more symmetric theories, as symmetries allow for
compression of data. In fact, although symmetries allow for the reproduction
of an object from less information about it, the description of the
symmetry should now be incorporated into the description of the program that
reproduces the object, which means that although the input that generates
the object is now smaller, the program that interprets it should be larger
because it must “know” somehow what the symmetry is and how to implement
it.
A very important observation must be made at this point. At the most
fundamental level, it is possible to encode the algorithm α plus the set of
parameters pi with some assigned values into a number N . This number can
be fed to a universal Turing machine Υ that can then provide the required
probability distributions of the data points. In this sense, there is only one
theory Υ with only one numerical parameter N that is then chosen based
on the dataset by using Bayesian inference. This description blurs the
difference between parameters and algorithm that we defined. However, this
is a question of hierarchy. We do want to differentiate between a set of
axioms and numerical parameters, even if they can be considered the same
thing at a higher, purely mathematical, level. We can then break down N
into independent free parameters and view axioms as fixing some of these free
parameters, which then become what we called α. These fixed parameters are
not allowed to change even if more information is acquired. Changing them
is equivalent to considering a different theory. The remaining free parameters
are allowed to change as more information is collected without the result
being considered a different theory. Although we can see from this that the
precise distinction between a theory and its constants is a question of
attributing semantic “baggage”, as considered in [21], this separation is
important for relating the input of this theoretical Turing machine Υ to the
physical world or whatever system is being studied.
In the following paragraphs we will analyse some basic examples that will
serve to illustrate many of the concepts discussed above. The examples given
are only intended to clarify the definitions introduced, applications of the
theory being left to the appropriate sections.
Example 1 (Fundamental Status and Complexity) Let us illustrate our basic
idea that the more fundamental a theory is, the more complex it tends to
become with an example of a very simple system, a set of analytic
real-valued functions with real arguments. In this set, every function can
be expanded in a possibly infinite polynomial around zero. Let us consider
a sequence of theories where the question set is composed of the possible
values of the argument and the answers are real numbers corresponding
to applying the function we are trying to discover to these values. A set of
five possible theories is given by
T1 = The function is a polynomial,
T2 = The function is a polynomial of finite degree n,
T3 = The function is a polynomial of degree 3,
T4 = The function is a polynomial of degree 2,
T5 = The function is sin x.
Note that in T1 there is an infinite number of coefficients to adjust from
the data, but the theory is very simple to state. In fact, we could say that it
has only one axiom. T2 has far fewer free parameters, as now the number
is finite. By adding one extra axiom saying that the polynomial has a finite
degree, we drastically reduced the number of free parameters and now we
have a more fundamental theory.
It could be argued that T1 should be more fundamental than T2 as the
latter is a special case of the former. However, this is a misleading argument,
as being very general, T1 in fact leaves much more structure to be adjusted
by data than T2. Note that the most general theory is the one that actually
assumes nothing, as any other theory can be obtained from it by “adjusting”
its infinite number of free parameters.
Considering again T2, we see that in fact we could break down the
finiteness axiom of its degree into a countably infinite number of axioms by
saying that we are fixing an infinite number of parameters, although the exact
number of non-zero parameters is now a new parameter. T3 and T4 now have one
more axiom that fixes the degree n. The number of free parameters is again
reduced, increasing the fundamental status of the corresponding theory in
the process. Intuitively, it would be hard to decide which one of T3 and T4
is more complex, but according to our conjecture that a more fundamental
theory should be more complex, T4 should somehow be attributed
higher complexity, although it is not clear to us at the moment how to do
that in a rigorous way.
Finally, the theory T5 is obviously more complex than all the others as it
requires the specification of all the polynomial coefficients with the recursion
formula that defines the sine function. In contrast, it is the most fundamental
theory of all since it simply has no adjustable parameter.
Example 2 (Perceptron) Let us see how the concepts developed here are re-
lated to the perceptron, already described briefly in section 2. For simplicity,
let us suppose that the possible question set is Ξ = {0, 1}n, for some integer
n. In fact, it becomes convenient to enumerate the questions using as an
index the question itself. For instance, if we had n = 2 we would write the
question set as
Ξ = {Q00, Q01, Q10, Q11} = {00, 01, 10, 11}. (9)
Suppose the activation function of the teacher perceptron f(B, ξi) to be
a Boolean function, by which we mean that its range is in the set {0, 1}.
In physical terms, the teacher perceptron is a “toy universe” representing
nature and its activation function is what we would call the “physical laws”
of the model. The vector B plays the role of a physical constant, in fact, the
only physical (theory’s) constant existent in this “universe”. The questions
can be seen as binary strings encoding every possible experimental setup in
this universe designed to measure one binary digit as a result.
Let us assume that we start our modeling with a certain dataset of t
measurements
Dt = {(ξ1, σ1), ..., (ξt, σt)}, (10)
with ξi ∈ Ξ and σi ∈ {0, 1}, from which we suppose (in this case correctly,
but this is not always true) that the ranges of the possible answers are all
equal to Ai = {0, 1}. Again, by a stroke of genius or a lucky guess, we
suppose that physical laws depend only on one physical constant, namely
S, which in perceptron terminology is the student’s synaptic vector, through
some Boolean function g(S, ξ), the activation function of the student. Then
our physical theory, call it T p, will have a set of physical constants which is
the singleton
\[
\pi^{(p)} = \left\{ \pi^{(p)}_{1} \right\} = \{ S \}. \tag{11}
\]
First, let us decide about the dependence of the data points. Our theory
will consider the data points to be independent and, this being so, we can
write their probability distribution as
\[
P(D_t | T^{p}) = \prod_{i=1}^{t} P(\xi_i, \sigma_i | T^{p})
= \prod_{i=1}^{t} P(\sigma_i | \xi_i, T^{p})\, P(\xi_i | T^{p}). \tag{12}
\]
In the majority of applications, the probability distribution of the
questions given by the last term P (ξi|T p) does not depend on the theory and
consequently this term is cancelled by the corresponding term in the
normalisation of equation (1). The remaining factor describes how the actual
obtained answers are related to the possible questions according to the
theory. This represents in general the noise model, but we will assume that
noise is negligible in our experimental setup. The probability for our given
dataset is then modeled by
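\[
\prod_{i=1}^{t} P(\sigma_i | \xi_i, T^{p}) = \prod_{i=1}^{t} \delta(\sigma_i, \tilde{\xi}_i), \tag{13}
\]
where δ is a Kronecker delta and \tilde{\xi}_i = g(S, ξi) is the answer
predicted by the theory for the question ξi. The algorithm α in this
case is composed of the description of the activation function g plus the
hypothesis that the information acquisition process is noiseless.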
If we would like to compare, let us say, different theories Ti, each
corresponding to a different student activation function gi, then substitution of
the above distribution into equation (1) gives
\[
P(T_i | D_t) \propto \prod_{j=1}^{t} \delta\big(\sigma_j, g_i(S, \xi_j)\big), \tag{14}
\]
where we considered the prior distribution to be uniform over all the possible
theories. The interesting aspect of this equation is that, if any theory misses
the correct answer for any one of the data points, the deltas guarantee that
its probability is reduced to zero. This is obviously a very rare case and the
reason this happens is that, by considering the theory noiseless, we attributed
a strong decisive power to the dataset. In this sense, these theories can in
principle be proved wrong. This means that these theories have the main
characteristic of being falsifiable theories, a concept about which we will
have more to discuss in the next section.
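The way noiseless theories are eliminated by a single wrong answer can be sketched for n = 2. Everything specific below is an assumption made for illustration: the three candidate student activation functions (`g_step`, `g_parity`, `g_all`), the fixed value S = (1, 1), and a teacher that happens to compute parity. With a uniform prior, the posterior is the normalised product of Kronecker deltas.

```python
def g_step(S, xi):
    """Hypothetical activation: 1 if the weighted sum is positive."""
    return 1 if sum(s * x for s, x in zip(S, xi)) > 0 else 0

def g_parity(S, xi):
    """Hypothetical activation: parity of the weighted sum."""
    return sum(s * x for s, x in zip(S, xi)) % 2

def g_all(S, xi):
    """Hypothetical activation: 1 only if every weighted input contributes."""
    return 1 if sum(s * x for s, x in zip(S, xi)) == sum(S) else 0

CANDIDATES = {"step": g_step, "parity": g_parity, "all": g_all}

def posterior(dataset, S):
    """Posterior over candidate theories with a uniform prior and no noise."""
    likelihood = {}
    for name, g in CANDIDATES.items():
        # Product of Kronecker deltas: 1 if every answer is reproduced, else 0.
        likelihood[name] = 1.0 if all(g(S, xi) == sigma for xi, sigma in dataset) else 0.0
    total = sum(likelihood.values())
    if total == 0.0:
        raise ValueError("every candidate theory is falsified by the dataset")
    return {name: value / total for name, value in likelihood.items()}

S = (1, 1)
full_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
partial = posterior(full_data[:3], S)   # "step" and "parity" both survive
final = posterior(full_data, S)         # the last data point falsifies "step"
```

In this toy setting the first three measurements already rule out `g_all`, and the fourth rules out `g_step`, leaving all posterior mass on the parity theory.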
Example 3 (Quantum Mechanics) Let us deal with a more familiar example
in physics. Consider the theory to be analysed as quantum mechanics,
symbolised by TQM , and a question set given by Ξ = {x,p}, with x the
position operator and p the momentum operator of some system which will
be analysed with some experimental setup. In fact, the way we are describing
the question set is sloppy, as we should actually append a description of
the experimental setup. We shall however ignore it for simplicity, as this will
not be relevant for the point we are trying to clarify, as will be seen in the
following.
In order to simplify the present analysis, we will suppose that the theory
describes the system by assuming that there is no time evolution between
two measurements of these observables. The rules of quantum mechanics state
that if the system is in the state |ψ〉 at some instant, then a measurement of
an observable O will result in one of the eigenvalues oi of O with probability
|〈oi|ψ〉|² and, after the measurement, the new state becomes an eigenstate
of that operator. In principle the measurement noise can be reduced to zero
as long as these rules apply.
Fig. 2 The Bayesian network representing the variables in the quantum mechanics
example (nodes ξ1, ξ2, ξ3, ψ1, ψ2, ψ3 and σ1, σ2, σ3). The dark circles
associated to the state of the system represent hidden nodes that must be
summed over
Let us analyse what happens if the measured dataset is
D3 = {(x, x1), (p, p2), (x, x3)}, (15)
where the first element in the pair indicates the measured observable and the
second the value obtained. Here we will consider that the data are collected in
chronological order from left to right. As the state of the system depends on
the result of the former measurement, the data is not independent. In fact, the
dependence structure can be represented by the Bayesian network of figure
2, where the state of the system is represented by hidden nodes, as it is not
measurable and must be summed over. This kind of description should not be
confused with a hidden variable interpretation of quantum mechanics; it is
just a graphical way of visualising the interdependency of each element
of its structure. The probability of the dataset is then decomposed,
substituting the corresponding values according to the graphical model, as
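\[
\begin{aligned}
P(D_3 | T_{QM}) = \sum_{\psi_1, \psi_2, \psi_3}
& P(x_3 | \mathbf{x}, \psi_3, T_{QM})\, P(\psi_3 | p_2, T_{QM}) \\
& \times P(p_2 | \mathbf{p}, \psi_2, T_{QM})\, P(\psi_2 | x_1, T_{QM}) \\
& \times P(x_1 | \mathbf{x}, \psi_1, T_{QM})\, P(\xi_1 | T_{QM})\, P(\xi_2 | T_{QM})\, P(\xi_3 | T_{QM}) \\
& \times P(\psi_1 | T_{QM}),
\end{aligned} \tag{16}
\]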
where the sum over the ψi’s runs over all possible values of the state of the
system according to the theory, which is all the possible equivalence classes
defining a different state in the Hilbert space of the system, and we did not
substitute the question in the last factors describing the probability of the
ξ’s for the sake of clarity. In any case, as we are free to choose what to
measure independent of the order, these will drop out when probabilities are
normalised.
As we are considering noise to be negligible, we have that
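\[
P(\psi_t | \sigma_{t-1}, T_{QM}) = |\langle \sigma_{t-1} | \psi_t \rangle|^2, \tag{17}
\]
\[
P(\sigma_t | \xi_t, \psi_t, T_{QM}) = |\langle \sigma_t | \psi_t \rangle|^2, \tag{18}
\]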
where the index t indicates the time step to which each variable is related.
If we considered a continuous time index with a time evolution between
measurements given by a time evolution operator U , this would affect the
first equation as
\[
P(\psi_{t+\Delta t} | \xi_t, \sigma_t, T_{QM}) = |\langle \sigma_t | U^{\dagger}(\Delta t) | \psi_{t+\Delta t} \rangle|^2. \tag{19}
\]
Noise would affect directly equation (16), where we would need to have a
sum over the possible answers at each measurement. The probability chain
rule would again be used as this would be a Markov chain. The first term in
the chain would then be written as
\[
P(x_3 | A(\mathbf{x}), T_{QM})\, P(A(\mathbf{x}) | \mathbf{x}, \psi_3, T_{QM})\, P(\psi_3 | A(\mathbf{p}), \mathbf{p}, T_{QM}), \tag{20}
\]
where A(·) corresponds to the answer predicted by the theory at the
corresponding time step, with the time indices not explicitly written. Let us
recall once more that we are neglecting a term with the probability of the
experiments, as their order would not depend on the theory (we are allowed to
choose freely whether to measure x or p) and they would cancel out
with the normalisation. The noise model would enter only in the first factor.
For instance, white Gaussian noise with unit variance in the measurement
process would give
\[
P(x_3 | A(\mathbf{x}), T_{QM}) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(x_3 - A(\mathbf{x})\right)^2}. \tag{21}
\]
In this case, the set of physical constants pi and the algorithm α are much
more difficult to describe.
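The noiseless chain of measurements can be sketched for a finite-dimensional stand-in. The code below is only an illustration under strong assumptions that are ours and not the text's: a single qubit replaces the position and momentum example, the non-commuting observables are the Pauli operators Z and X, their eigenvalues are non-degenerate, and there is no time evolution between measurements. In that case the state after each measurement is fixed by the previous outcome, so the sum over hidden nodes reduces to a single term and the dataset probability is a product of Born-rule factors.

```python
import math

R = 1.0 / math.sqrt(2.0)

# Eigenvectors of the assumed observables, indexed by eigenvalue.
EIGENSTATES = {
    "Z": {+1: (1.0, 0.0), -1: (0.0, 1.0)},
    "X": {+1: (R, R), -1: (R, -R)},
}

def born_probability(eigenstate, state):
    """|<o|psi>|^2 for two-component state vectors."""
    amplitude = sum(e.conjugate() * s for e, s in zip(eigenstate, state))
    return abs(amplitude) ** 2

def dataset_probability(initial_state, dataset):
    """Probability of a chronologically ordered list of (observable, outcome)."""
    state = initial_state
    probability = 1.0
    for observable, outcome in dataset:
        eigenstate = EIGENSTATES[observable][outcome]
        probability *= born_probability(eigenstate, state)
        state = eigenstate  # the state becomes the measured eigenstate
    return probability

# Analogue of D3: measure Z, then X, then Z again, starting from the Z = +1 state.
d3 = [("Z", +1), ("X", +1), ("Z", -1)]
p_d3 = dataset_probability((1.0, 0.0), d3)
```

Starting in the Z = +1 eigenstate, the first factor is one and each of the following two is one half, so the sequence has probability one quarter. The third result depends on the second measurement, which illustrates why the data points are not independent.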
For completeness, let us finish again by giving a more formal definition
of the process analysed in this section.
Definition 8 (Modeling) The process of creating a theory T that answers
the question set Ξ is called modeling.
5 Testing
Contrary to what we have done in the last two sections, we will start this
section by defining from the beginning the process we want to analyse.
Definition 9 (Testing) Given a set of M theories {T1, T2, ..., TM} and a
dataset Dt, the process of calculating the posterior distribution
\[
P(T_i | D_t) = \frac{P(D_t | T_i)\, P(T_i)}{\sum_{j=1}^{M} P(D_t | T_j)\, P(T_j)}, \tag{22}
\]
is called testing.
Testing in a Bayesian framework then involves the recalculation of the
posterior distribution of the candidate theories according to Bayes’ rule. This
can either be done using only the original dataset Dt in order to rank a
new theory (or theories) among the other existent ones or by acquiring new
information, thus expanding the dataset to Dt′ , and again recalculating the
rank of the existing theories. Of course, both things can also be done at the
same time.
Ideally, the recalculation of the posterior distributions should be made
with the entire dataset, which in machine learning applications receives the
name off-line learning. There may be however situations where old data
points become systematically less important or less reliable. In these cases
the posterior can be recalculated for each new data point at time t + 1 us-
ing the posterior at time t as a prior, a practice known as on-line learning.
Here we will deal exclusively with off-line learning situations. References to
Bayesian on-line learning can be found for instance in reference [29].
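The difference between the two update schemes can be sketched with a toy family of theories. The example is an assumption made for illustration: two hypothetical theories of a binary experiment, `fair` and `biased`, with independent data points. Off-line testing applies Bayes' rule once to the whole dataset, while on-line learning reuses the posterior at time t as the prior at time t + 1.

```python
def bayes_update(prior, likelihoods):
    """Posterior over theories from a prior and per-theory likelihoods."""
    unnormalised = {name: prior[name] * likelihoods[name] for name in prior}
    total = sum(unnormalised.values())
    return {name: weight / total for name, weight in unnormalised.items()}

def bernoulli_likelihood(p_one, data):
    """P(data | theory) for independent binary outcomes with P(1) = p_one."""
    result = 1.0
    for outcome in data:
        result *= p_one if outcome == 1 else 1.0 - p_one
    return result

# Hypothetical candidate theories, each fixing the probability of outcome 1.
theories = {"fair": 0.5, "biased": 0.8}
prior = {"fair": 0.5, "biased": 0.5}
data = [1, 1, 0, 1, 1, 1, 0, 1]

# Off-line: one update with the entire dataset.
offline = bayes_update(
    prior, {name: bernoulli_likelihood(p, data) for name, p in theories.items()}
)

# On-line: one update per data point, carrying the posterior forward.
online = dict(prior)
for outcome in data:
    online = bayes_update(
        online,
        {name: bernoulli_likelihood(p, [outcome]) for name, p in theories.items()},
    )
```

For independent data points and no discounting of old data, the two procedures should agree up to rounding. They differ only when older points are deliberately down-weighted, which this sketch does not attempt.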
As we had already shown in equation (1), discussed briefly in section
2, in order to calculate the posterior distribution of the candidate theories
given the dataset, a prior distribution over theories should be defined. It is
obviously possible not to favour any theory in the absence of information.
However, our experience tells us that physical theories have some desirable
properties. For instance, as we climb up the ladder that leads to the standard
model, the symmetry of the theory becomes larger. Most of the efforts related to
grand unified theories (GUTs) rely on finding a more encompassing symmetry
group that breaks down to the known gauge groups of electromagnetic, weak
and strong interactions at smaller energies. In this sense, more symmetric
theories are preferred to less symmetric ones.
Deciding about theories that “explain” the data equally well is the job
of the prior. One of the most used criteria in this scenario is the well known
principle of Ockham’s razor, which simply states that if more than one theory
explains the dataset equally well, the simplest should be preferred.
Ockham’s razor is justified on philosophical grounds based on our belief
that nature ought to be simple, the blame being on us for not being able
to describe it correctly. “Simple” is always understood as mathematically
simple. There is however no guarantee that nature’s laws are indeed simple
in these terms. A more rational and practical justification for Ockham’s
razor would be that if the theories explain the dataset equally well, we then
choose the simplest one to carry out the necessary calculations in order to
spend fewer resources, which now clearly favours theories with lower
computational complexity. On the other hand, machine learning teaches us that
in every inference task there is a trade-off between generalisation ability
and fitting (see [10,28]). By using enough adjustable parameters it is
eventually possible to perfectly fit any dataset, but generalisation, the
capacity of inferring correctly a new data point, becomes compromised.
Whatever the rationale for Ockham’s razor is, choosing the simplest
theory requires a precise measure of how complex a theory is, something that we
have already discussed in the previous section. We will then assume that the
prior distribution over a theory T = (α, pi) must depend on some measure of
the complexity of the algorithm α, which contains the mathematical
description of the theory. However, we claim that the prior over any physical
theory should also depend on the concept of fundamental status. As already said,
more fundamental theories have fewer adjustable parameters and more
generalisation power. In order to consider this, note that the prior
distribution of any theory can always be written as
P (T ) = P (α, pi) = P (pi|α)P (α). (23)
The distribution P (α) is the part of the prior that selects theories with
lower complexity. The factor P (pi|α) can then be used to select the more
fundamental theory by using some probability distribution that decreases with
the cardinality of pi. In the previous section we argued that more
fundamental theories should be more complex. As the exact relation is not clear
at the moment, the above way of writing the prior is general enough to include
this dependence in the term P (pi|α). In fact, we propose that these are the only
characteristics that are needed in the prior.
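One way such a prior could look is sketched below. The functional forms are assumptions chosen purely for illustration and are not proposed in the text: a factor 2^(-K) for an algorithm with (approximate) description length K in bits, and a geometric penalty `constant_penalty ** n` for a theory with n fundamental constants. The product is normalised over a finite family of candidates rather than factor by factor, and the candidate names and numbers are hypothetical.

```python
def theory_prior(theories, constant_penalty=0.5):
    """Normalised prior over a finite family of candidate theories.

    `theories` maps a name to (complexity_in_bits, number_of_constants).
    The weight mimics P(alpha) P(pi | alpha) with assumed functional forms.
    """
    weights = {
        name: 2.0 ** (-complexity) * constant_penalty ** n_constants
        for name, (complexity, n_constants) in theories.items()
    }
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

# Hypothetical candidates: (description length of alpha, cardinality of pi).
candidates = {
    "A": (10, 3),
    "B": (10, 1),
    "C": (12, 1),
}
prior = theory_prior(candidates)
```

With equal complexity, the theory with fewer constants ("B" against "A") receives the larger prior weight, and with an equal number of constants the shorter algorithm ("B" against "C") is favoured.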
As in every Bayesian inference task, the problem of choosing the correct
prior is a difficult one, which does not mean though that it has no objective
solution. For instance, the maximum entropy principle is a method of
choosing priors by using Shannon’s entropy as a measure of lack of information
and maximising it subject to constraints that encode all the information
available [4]. In this case, if two different observers have the same amount of
information, they agree about their priors. This is no more subjective than
the situation in relativity where two observers may only agree about the size
of time intervals or the order of spacelike separated events if they share the
same velocity, although given enough information they can carry out their
own calculations and discover how the other observer is measuring these
observables. Of course the choice of the prior depends on many criteria
according to the aims of the task to be done; however, the objectivity means
that once these criteria are agreed, there is one way to construct priors such
that again, if two observers have the same information, they end up with the
same prior.
A question may arise with respect to the factor of the prior related to the
fundamental status of the theory. Why not use the complexity of the set of
fundamental constants of the theory as a prior instead of just its cardinality?
Although this can be done, and it is in some sense the path taken in MDL,
in principle fundamental constants should not be judged by their simplicity
and should be learned using the dataset. In fact, as we have already discussed
when we argued that the whole theory can be reduced to a number to be
fed to a universal Turing machine, the theory’s constants work as a set of
indices defining a possibly continuous set of theories, each one corresponding
to different values of these constants, and choosing them is equivalent to using
equation (1) to rank the candidate theories. There are nevertheless situations
when it is desirable to compare different theories, but the exact numerical
values of the fundamental constants are not important. The solution in this
case is also the Bayesian one, but now we should sum over the unknown
values of the parameters, which gives the expression
P(\alpha \mid D_t) = \sum_{\pi} P(\alpha, \pi \mid D_t) = \sum_{\pi} P(T \mid D_t), \qquad (24)
to which the formula for P (T |Dt) can then be applied as before.
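The marginalisation in equation (24) can be sketched for a finite family of
theories as follows. The algorithm labels, the values of the constants and the
posterior probabilities are hypothetical numbers used only to show the
bookkeeping.

```python
from collections import defaultdict

def marginal_posterior(joint_posterior):
    """Sum P(alpha, pi | D_t) over pi to obtain P(alpha | D_t), as in (24).

    joint_posterior maps (alpha, constants) pairs to posterior probabilities.
    """
    marginal = defaultdict(float)
    for (alpha, _constants), prob in joint_posterior.items():
        marginal[alpha] += prob
    return dict(marginal)

# Hypothetical joint posterior over two algorithms and two settings of the
# constants for each of them.
joint = {
    ("linear", (1.0,)): 0.125,
    ("linear", (2.0,)): 0.5,
    ("quadratic", (1.0,)): 0.25,
    ("quadratic", (2.0,)): 0.125,
}
marginal = marginal_posterior(joint)
```

For continuous constants the sum would be replaced by an integral; the
discrete version is used here only to keep the example self-contained.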
A very fundamental idea that must be clearly understood is that we only
compare theories constructed with respect to the same question set Ξ. When
the question sets of two theories are different, there is no basis to compare
them in general, for they have different scopes. For instance, compare hydro-
dynamics and quantum mechanics. In principle, quantum mechanics is able
to answer all questions of hydrodynamics (although this assertion has many
subtleties), but it makes no sense to decide between both by comparing their
different question sets. Depending on the questions to be answered, hydrody-
namics can even be the preferred theory as we do not need the full power of
quantum theory to calculate flows of water in confined geometries. However,
this very example allows us to define a new concept that relates theories with
different question sets: the notion of the power of a theory.
Definition 10 (Power) Consider a theory T1 that answers the question set
Ξ(1). A theory T2 that answers the question set Ξ(2) such that Ξ(2) ⊂ Ξ(1)
is said to be less powerful than theory T1 and, conversely, T1 is said to be
more powerful than T2.
According to this definition, quantum mechanics is more powerful than
hydrodynamics, an assertion that agrees with our intuition for the power of
a theory. For the rest of this work, the word “powerful” will be used in the
strict sense of this definition unless otherwise stated. In principle, hydrody-
namics can be obtained from quantum mechanics by restricting its question
set to a smaller subset relevant to its scope. Another classic example,
besides hydrodynamics versus quantum mechanics, is Newtonian gravity versus
general relativity. For most practical applications the full power of general
relativity is not only unnecessary but usually undesirable, because the more
involved calculations yield an answer that is, for all practical purposes, no
better than the Newtonian one.
Once more, it is not clear what the correct quantitative measure of power is
and how a prior over it should be constructed. For instance, a reasonable
measure would be the size (or cardinality) of the corresponding question set,
but nothing prevents us from choosing a monotonic function of this value.
This is another point where more research is required.
The importance of considering to which question set each theory cor-
responds when comparing them was already noticed by Brody [8]. In fact,
what we called the question set here corresponds precisely to his concept of
scope of the theory. Brody arrives at much the same conclusion we reached
here, namely that it is pointless to compare theories with different scopes,
although he did not accept the probabilistic interpretation, holding instead
to a purely frequentist one. Again, many of his criticisms find appropriate
answers in Jaynes [4].
The ultimate goal of theoretical physics is to find the most powerful theory
of all, one that can answer any possible question set about nature. This is
expressed by the idea of unification. Unification actually expresses the
sensible human belief that nature is not broken into small non-overlapping
domains of unrelated theories, and that describing it in these terms is a
limitation of our body of knowledge. The classic example of trying to merge general
relativity and quantum field theory in a theory of quantum gravity involves
this concept in a subtle way.
The concept of unification however is subtler than it seems. For instance,
if we consider two different theories answering two disjoint question sets, it
may be possible to consider a larger, more powerful theory than both, that is
trivially formed by the union of these two original theories. By union we mean
that all resulting sets should be the union of the specific sets of each theory.
In this sense, disjoint question sets should be understood as sets such that
the sets formed by their answers are independent. Although in this trivial
case we would not be too compelled to call the resulting theory a true unified
theory, we can use it as a guide to define more formally this concept.
Definition 11 (Unified Theory) Consider any two question sets Ξ(1) and
Ξ(2). A unified theory is a theory that can answer the union of these sets
ΞU = Ξ(1) ∪ Ξ(2).
Note that a unified theory is always, by definition, more powerful than
any theory answering only Ξ(1) or Ξ(2), but there is nothing that guarantees
that it is more fundamental. In fact, the trivial unification based on the union
of two theories is actually less fundamental as it will always have at least the
same number of constants as the individual theories that it is composed of.
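Definitions 10 and 11 can be phrased with ordinary set operations, as in the
following sketch. Representing question sets as Python frozensets of strings,
as well as the function names and the example questions, are assumptions made
purely for illustration.

```python
def is_more_powerful(xi_1, xi_2):
    """True if a theory answering xi_1 is more powerful than one answering
    xi_2, i.e. if xi_2 is a proper subset of xi_1 (Definition 10)."""
    return xi_2 < xi_1

def unified_question_set(xi_1, xi_2):
    """Question set that a unified theory must answer (Definition 11)."""
    return xi_1 | xi_2

# Hypothetical question sets.
xi_hydro = frozenset({"flow of water in a pipe"})
xi_qm = xi_hydro | {"spectrum of hydrogen"}

xi_gravity = frozenset({"perihelion of Mercury"})
xi_quantum = frozenset({"spectrum of hydrogen"})
xi_unified = unified_question_set(xi_gravity, xi_quantum)
```

In this representation the unified set is strictly larger than each of its
parts whenever neither original set contains the other, which is the situation
the text has in mind; if one set already contains the other, the union adds
nothing to the larger one.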
Returning to the example of quantum gravity, let us consider the ap-
proaches of Loop Quantum Gravity (LQG) and String Theory (ST). The
fundamental problem is that there is a question set ΞG which is answered
by general relativity in a highly successful way. But as gravity is supposed
to be universal and quantum systems also gravitate, their question sets
overlap at some point and these theories should be unified somehow. The first
attempts, however, resulted only in theories that could not be made
consistent. Therefore, both LQG and ST try to be unified theories in the
sense that they both try to answer the union of two different question sets.
However, ST aims to be a more fundamental theory, as it is argued to rely on
fewer adjustable parameters, assuming the existence of fewer fundamental
physical constants. This can be seen by comparing both to the standard model
(SM). LQG is not more fundamental than the SM, as it has at least the same
number of adjustable parameters corresponding to masses, couplings and mixing
angles: being a theory just about gravity, it does not change the basic setup
of the SM. In contrast, ST has fewer parameters than the SM in the sense that
many of the physical constants that must be obtained from experiments in the
latter are in principle calculable in the former without the need to be
learned from any dataset.
Quantum gravity is also a good example to understand how difficult it is
to assess the complexity of a theory in general. The catch is that it is not a
fully developed theory. Even if either ST or LQG, or maybe both, are correct,
many calculations cannot be done yet in these frameworks, for they are too
complicated and not completely understood. This means that there is not
yet, for instance, an algorithm αST that allows ST to give answers to the full
question set it is supposed to answer and, therefore, no way to estimate its
full complexity in order to compare with other possible theories, although
partial comparisons using restricted question sets can be done. However,
many theoreticians believe that fundamental status is an even more desirable
characteristic than simplicity and use the fact that ST is arguably more
fundamental in its favor.
6 Falsifiability
Falsifiability is such a delicate and fundamental concept related to physical
theories that it deserves a separate discussion. String theory, for instance,
is usually attacked on the grounds that it may not be falsifiable. The main
difference between science and religion is also supposed to rely on falsifiability.
Falsifiability is a concept related to the testing process of a theory. It is
well known that not being falsifiable does not mean that a theory does not
agree with the experiments. For instance, the trivial case of a theory that
has no axioms, only adjustable parameters corresponding to the probabilities
of all possible answers to all questions, can never be disproved.
Consider, once more, the case of a polynomial of degree n. Any dataset
with n+1 or fewer experimental data points obtained from some function can
be fitted by this polynomial. However, one more data point can be enough
to test whether the polynomial is the desired function. In the more general
case of inferring probabilities for answers (what we called the theoretical
answers), in order to be falsifiable a theory must be able to answer some
question in such a way that some new data would be able to disprove it,
where by “disproving” we mean that the probability of that theory given
the data would be reduced to zero after this new dataset is measured. In
terms of the formalism developed up to this point, we will explore a possible
definition of falsifiability.
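The polynomial example can be made concrete with a short sketch. The
generating function f(x) = 2^x, the degree n = 2 and the helper name are
assumptions chosen for illustration; exact rational arithmetic is used so that
"fits" and "fails" are unambiguous.

```python
from fractions import Fraction

def lagrange_interpolate(points):
    """Return a function evaluating the unique polynomial of degree at most
    len(points) - 1 through the given (x, y) points (distinct x values)."""
    pts = [(Fraction(x), Fraction(y)) for x, y in points]

    def poly(x):
        x = Fraction(x)
        total = Fraction(0)
        for i, (xi, yi) in enumerate(pts):
            term = yi
            for j, (xj, _) in enumerate(pts):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total

    return poly

# Hypothetical data generated by f(x) = 2**x, which is not a polynomial.
n = 2
data = [(x, 2 ** x) for x in range(n + 1)]  # n + 1 = 3 data points
fit = lagrange_interpolate(data)

fits_all = all(fit(x) == y for x, y in data)  # the n + 1 points are fitted
new_point = (3, 2 ** 3)                       # one additional measurement
survives = fit(new_point[0]) == new_point[1]  # the polynomial predicts 7, not 8
```

Here the quadratic reproduces the first three points exactly, but the fourth
point is enough to rule it out as the generating function.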
According to equation (1), the only way to reduce the posterior of a
theory to zero is if the probability of the dataset given the theory is zero
or, equivalently, if the theoretical answer evaluated for that dataset is
identically zero. If there is one dataset for which this happens, then the
theory can be considered to make a prediction which is falsifiable if
experiments can be tailored to check the theory’s predictions about this
dataset. Of course this requirement is void if there is only one dataset to
consider. The concept of
falsifiability depends on the existence of new data to be measured, or new
information to be acquired. This suggests the following definition.
Definition 12 (Falsifiability) A theory T, answering a question set Ξ, is
called falsifiable if there exist further data points to be obtained beyond
those used to construct the theory and at least one dataset Dt with zero
probability
P(Dt|T) = 0. (25)
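A minimal sketch of Definition 12 for a finite collection of candidate
datasets is given below, together with the Bayesian update that drives the
posterior of a falsified theory to zero. The two coin-toss "theories", the
function names and the equal priors are hypothetical; in particular, the
uniform likelihood is only a stand-in for a theory that assigns non-zero
probability to every dataset.

```python
from itertools import product

def is_falsifiable(likelihood, candidate_datasets, more_data_available=True):
    """Definition 12 sketch: further data must exist and at least one
    dataset must have P(D | T) = 0."""
    return more_data_available and any(
        likelihood(d) == 0 for d in candidate_datasets
    )

def always_heads(dataset):
    """Hypothetical theory: every toss gives heads."""
    return 1.0 if all(x == "H" for x in dataset) else 0.0

def everything_goes(dataset):
    """Stand-in for a theory giving non-zero probability to every dataset."""
    return 0.5 ** len(dataset)

def posterior(priors, likelihoods):
    """Bayes's theorem over a finite family: P(T | D) ∝ P(D | T) P(T)."""
    joint = {t: priors[t] * likelihoods[t] for t in priors}
    evidence = sum(joint.values())
    if evidence == 0:
        raise ValueError("the data have zero probability under every theory")
    return {t: j / evidence for t, j in joint.items()}

datasets = list(product("HT", repeat=2))  # all outcomes of two tosses
observed = ("H", "T")
post = posterior(
    {"always_heads": 0.5, "everything_goes": 0.5},
    {"always_heads": always_heads(observed),
     "everything_goes": everything_goes(observed)},
)
```

In this toy setting the first theory is falsifiable and is in fact falsified by
the observed tosses, while the second can never be ruled out by any dataset.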
The first observation is again the strong dependence of the definition
on the considered question set. A falsifiable theory can give rise to a
non-falsifiable one if the question set is restricted to questions such that
P(Dt|T) ≠ 0 for any possible dataset. This implies that, for any theory,
a concept of absolute falsifiability could only be defined if the set of all
addressable questions is known. If this set is not fully known, it is not
possible to claim that a theory is not falsifiable in principle, although it
may not be falsifiable in practice, in the sense that the theory restricted
to the part of the question set that is experimentally accessible may not be.
This leads to the somewhat obvious conclusion that a theory cannot be
said to be absolutely non-falsifiable until the whole question set it can
answer is known. By this definition, for instance, there would still be no
grounds at the moment to say that ST is fundamentally non-falsifiable. In
fact, many theories of quantum gravity are in this position, as they deal
with high-energy phenomena that are inaccessible to our present accelerators.
Another example would be Hawking radiation. It is a falsifiable prediction,
since measuring it depends in principle only on more advanced technology,
although it is not possible to do so at present.
The second point to be noted is that, if there are no more data points to
be taken, there is no way to falsify the theory. This looks like an obvious
assertion and, indeed, it is. However, it is unlikely that a situation like
this will ever happen in physics. Although a Theory of Everything is
supposed to account for every possible phenomenon, there is still no way
to claim that all possible measurements will ever be done. A possible
hand-waving argument would be to use the holographic principle to argue that,
as observations seem to imply that the universe will never stop its
expansion, if we assume that the information contained in the universe is
given by the size of its horizon, this information will be ever increasing.
This would mean that even if at some time t all the information about the
universe was collected in the dataset Dt, at t + dt more information would be
created, and so on ad infinitum.
Although falsifiability is a difficult criterion to evaluate in a theory, in
the previous section we saw that the more fundamental constants (adjustable
parameters) a theory has, the easier it is for it to agree with the data.
Therefore it is reasonable to expect that the less fundamental a theory is,
the smaller the chances that it is falsifiable. According to our claim that
more fundamental theories should be more complex, we would then be led to the
conclusion that a less complex theory is not necessarily more falsifiable.
The “everything goes” theory, for instance, has only this axiom and an
infinite number of adjustable parameters to be measured by experiments, being
fundamentally non-falsifiable, which agrees with our claim.
Another kind of prediction that a theory is usually expected to make is to
derive and answer new questions not contained in the original question set Ξ
used as a basis for the construction of the theory. There are many arguments
in favour of requiring that a theory should be able to do so. The practical
one, for instance, is that this kind of prediction is the one that leads to
technological advancements, which in turn enhance society’s quality of life
and bring economic wealth. This concept will be
called the predictive power of the theory. The trivial “everything goes” theory
stated above is not only non-falsifiable, but also has no predictive power at
all.
Note that predictive power and falsifiability are related, but are different
concepts. A theory can make predictions beyond the original dataset that
may not be falsifiable, and a theory may be falsifiable without addressing
questions beyond the original question set.
However, again, intuition says that the larger the predictive power of a
theory, the more probable it is to be falsifiable. This can be understood if
we note that the higher the predictive power of a theory, the larger becomes
the question set and, therefore, the larger is the number of possible datasets,
increasing the probability that at least one is impossible to observe.
This culminates in the two most important definitions of this paper: sci-
entific theory and scientific method. Although these definitions were actually
stated before, unlike the situation up to now, each one of the concepts in-
volved is now precisely defined through the sequence of all previous definitions
in this work.
Definition 13 (Scientific Theory) A theory T is called a scientific the-
ory if it is falsifiable.
Definition 14 (Scientific Method) The iteration of the processes of ac-
quiring information, modeling and testing in any order necessary to produce
and rank scientific theories that answer a specific question set Ξ is called
the scientific method.
We require that all physical theories should be scientific theories unless
all the information available in the universe’s entire chronological history
is contained in a hypothetically obtained dataset D∞. As discussed above, the
existence of this dataset may even be impossible in principle, although we
are far from a rigorous proof. In any case, if all possible experiments were
really done and their results catalogued, science would be truly over and the
above questions would be simply meaningless, the only work remaining being to
find a way of compressing the information in the most mnemonic form.
7 Cosmology
As our first cosmological example to be analysed here, we consider an inter-
esting question which has been revived recently and is related to the concept
of typicality in cosmology. Bayesian selection of physical theories has been
invoked in a number of papers [16,17,30] to justify or argue against this
concept. Let us examine these ideas using some of the framework developed
here.
The idea that theories where our observations are typical are favoured is
expressed more specifically in [30]. The argument boils down to the fact that
in the case of equal priors, the probability of the theory depends only on
the likelihood of the dataset,
P(T|Dt) ∝ P(Dt|T), (26)
and then the theory that predicts that the data have higher probability,
i.e., are “more typical”, is preferred.
Of course the line of reasoning could not be clearer. Indeed, the best
theory given equal priors is the one that gives the higher probability to the
dataset. Actually, the best one is the one which gives the dataset
probability 1 at the expense of other possible datasets. If the dataset Dt
contains all data that will ever be possible to measure, then that is the
only possible choice. But let us analyse the dataset more carefully.
First, let us define the question set. Just one experiment is enough, namely
Q = ’What kind of observer?’. Let us allow for N + 1 possible answers, i.e.,
A = {0, 1, ..., N} with 0 corresponding to no observer, 1 to a human observer
and the other integers to different kinds of observers like aliens, Boltzmann
brains, etc. The 0 value is necessary because the experiment is designed to
find an observer somewhere. If no observer is found, then the zero value must
be attributed. For example, we can divide the visible universe into a grid and
attribute to each region of the grid one of these numbers corresponding to
the result of the experiment Q.
As we have already described, we could sum over the regions to answer
the question of what kind of observer exists anywhere, but the experiment
should still be well defined, and the probability of finding some observer
somewhere must be provided by the theory even if we marginalise over it in
the end. Having defined all the necessary elements, we can discuss the
typicality issue again.
Suppose now that the only place we looked at is the Earth. Most of our
data will have the value 1. The theory that says that humans are the only
type of observers would be the best. As soon as we have other places to look
at, this changes. Actually, as far as we know, “no observer” is the typical
result for the universe. Clearly this is a simplification, but it is not far
from the ones used in the above papers. The most important lesson is how
decisive it is to define the question set well.
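The dependence of the verdict on the dataset can be illustrated with a toy
version of the grid experiment. The two candidate distributions, the
assumption that grid regions are independent and all numbers below are
hypothetical and serve only to show how the ranking flips when the dataset is
enlarged.

```python
import math

def log_likelihood(theory, observations):
    """Sum of log P(answer | theory) over grid regions, assumed independent."""
    return sum(math.log(theory[a]) for a in observations)

# Answers: 0 = no observer, 1 = human observer, 2 = another kind of observer.
# Both distributions are hypothetical.
humans_typical = {0: 0.01, 1: 0.98, 2: 0.01}
empty_typical = {0: 0.98, 1: 0.01, 2: 0.01}

earth_only = [1, 1, 1]          # we only looked at regions on the Earth
wider_grid = [1] + [0] * 20     # the Earth plus twenty empty regions

prefers_humans_on_earth = (
    log_likelihood(humans_typical, earth_only)
    > log_likelihood(empty_typical, earth_only)
)
prefers_empty_on_grid = (
    log_likelihood(empty_typical, wider_grid)
    > log_likelihood(humans_typical, wider_grid)
)
```

With equal priors, the theory in which humans are typical wins on the
Earth-only data, while the theory in which empty regions are typical wins once
the wider grid is included.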
The second issue we are concerned with is related to the multiverse
hypothesis. Consider the idea contained in [16], the suggestion that
multiverse theories are always favoured by Bayesian reasoning because they
predict probability 1 of existence for any value of a measurement while
single-universe theories do not, which implies that the likelihoods of the
former are always larger than those of the latter for the existence of these
values. The multiverse theory gives probability one for the “existence” of
solutions for some question set Ξ while the universe theory gives less than
one. Therefore, assigning equal priors to both, the likelihoods will select
the multiverse theory. In order to analyse this assertion, there is again the
need for a precise definition of the questions the theories must answer. In
this case the question set is not the original Ξ, but a derived one Ξ′ that
contains the single question Q′ = ’Does the set of answers for Ξ exist?’ The
possible range of the answers for this question is the set A′ = {Yes, No}. By
definition, the probability distribution of the answer A(Q′) in the
multiverse theory M is given by
P(A(Q′)|Q′, M) = δ(Yes, A(Q′)), (27)
which means that the probability of a ’Yes’ is one and of a ’No’ is zero. The
same distribution for the universe theory U is
P(A(Q′)|Q′, U) = P δ(Yes, A(Q′)) + (1 − P) δ(No, A(Q′)), (28)
which gives probability P to ’Yes’ and 1− P to ’No’.
Once more, comparing the theories given some data is the desirable sce-
nario. If the answer is ’No’, then obviously the multiverse is ruled out, so let
us suppose that our data says that the answer is ’Yes’. There are three cases:
(1) P = 1. Then both theories give the same likelihood for the data. If
they have equal priors, both theories have the same probability and cannot
be decided on the basis of the dataset.
(2) 0 < P < 1. Then the likelihood of the data given by the multiverse
is indeed higher and the multiverse solution must be preferred. Although this
may sound strange, the point is that we are now certain that the data exist
and the universe theory predicts that they may not, so it is, based only
on the data, less desirable if these are the only two alternatives.
(3) P = 0. The universe is obviously ruled out as it gives the wrong
answer.
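The three cases above follow directly from Bayes's theorem applied to the
distributions (27) and (28). The sketch below works them out with exact
fractions; the function names and the equal priors are assumptions for
illustration.

```python
from fractions import Fraction

def likelihood_multiverse(answer):
    """Distribution (27): probability one for 'Yes', zero for 'No'."""
    return Fraction(1) if answer == "Yes" else Fraction(0)

def likelihood_universe(answer, p):
    """Distribution (28): probability p for 'Yes' and 1 - p for 'No'."""
    return p if answer == "Yes" else 1 - p

def posterior_multiverse(answer, p, prior_m=Fraction(1, 2)):
    """Posterior probability of the multiverse theory M when M and the
    universe theory U are the only two candidates."""
    numerator = likelihood_multiverse(answer) * prior_m
    evidence = numerator + likelihood_universe(answer, p) * (1 - prior_m)
    if evidence == 0:
        raise ValueError("the data have zero probability under both theories")
    return numerator / evidence

case_1 = posterior_multiverse("Yes", Fraction(1))     # P = 1
case_2 = posterior_multiverse("Yes", Fraction(1, 2))  # 0 < P < 1
case_3 = posterior_multiverse("Yes", Fraction(0))     # P = 0
ruled_out = posterior_multiverse("No", Fraction(1, 2))
```

With equal priors and the answer 'Yes', the posterior of M reduces to
1/(1 + P): the theories tie at P = 1, M is preferred for 0 < P < 1, and U is
excluded at P = 0. An answer 'No' gives M zero posterior whenever P < 1.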
Note that in this example, the nature of the question admits only a precise
definition. The distributions of the answers are Kronecker deltas, which
means that they have no spread. As in the previous case, there is no
definite answer that is always correct. The posterior probabilities again
depend on the data and can be determined only by it.
The specific way in which the likelihoods are constructed shows that the
advantage of multiverse theories over universe ones is not straightforward.
The above defined “theories” are perfectly fine from the formal point of
view. However, different requirements must be taken into consideration when
constructing the theories. As already discussed, there are many of them. If
each element is not well defined, there is no proven superiority of one over
another in the present search for a theory that describes our world. This
superiority, as in any other physical theory, must be decided on the basis of
collected data which, at present, are not enough for a decision to be made.
Finally, as the last cosmological issue to be analysed, let us consider the
weak version of the anthropic principle, as the strong one is non-falsifiable.
The analysis will be very simple as we only want to make the point that this
“principle” is only a label for a noiseless observation in a broader dataset.
The noiseless observation is that humans exist. The broader dataset
is the data accumulated about the physical and chemical requirements that
are necessary for us to live. Given these requirements, and the fact that we
indeed exist, any theory related to any question set to which this is
relevant should have answers with distributions that give reasonable values
for these datasets.
The fact that the anthropic principle can be used to calculate the range
of many physical constants can simply be reformulated according to the
framework developed in this paper as the assertion that the theories that
do not take into consideration this noiseless information from the beginning
automatically have zero probability. The fact that the constants calculated
in this way may not be specific values but lie in some range just reflects
the fact that the other observations about what is needed for human life are
not known in a noiseless way. From this point of view, the status conferred
by the word “principle” becomes highly questionable.
8 The Isolated Worlds Problem
In principle, there is nothing that forbids the existence of some parts of
our universe that never interacted and will never do so. Although highly
speculative, this possibility is not forbidden by any law of physics
presently known. The isolated worlds problem can then be described in the
following way. Suppose that there are two regions of the universe which, by
definition, never interacted and will never do so. Apart from that, both
regions have physical laws that allowed the development of intelligent life;
we call these regions the two isolated worlds. According to the usual
scientific considerations, nothing from one region is measurable from the
other and, therefore, each region should be considered by the other as
non-existent. However, by hypothesis, both regions exist. Is there a
fundamental problem here? Is science unable to address this question?
We will argue now that this question is actually addressable, in principle,
through our framework if some special condition is met. Suppose that there
are many theories available to describe our universe and, when compared to
the available datasets corresponding to all the physical knowledge acquired
up to that moment, one of them is ranked with a probability much higher
than any other. Now suppose that the mathematical structure of this theory
not only predicts the existence of two isolated worlds according to the above
definitions, but requires it.
Now, even if there is no way of directly measuring one world from the
other, the whole framework developed here forces us to admit that the
probability of both worlds existing should be considered higher than that of
their not existing and, the higher the probability of the theory requiring
them becomes as more datasets are added, the more their existence should be
considered real. A more surprising conclusion is that, if the theory
requiring them is falsifiable, we should assume that the existence of these
two isolated worlds is also falsifiable.
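One way to read this argument quantitatively is through the law of total
probability: the probability that the isolated worlds exist is the posterior
probability of each theory weighted by the probability that theory assigns to
their existence. The sketch below assumes a finite family of theories and uses
hypothetical posterior values; it is an illustration, not a calculation
carried out in this work.

```python
def probability_of_claim(posteriors, claim_given_theory):
    """P(E | D) = sum over T of P(E | T) P(T | D), for a finite family.

    posteriors maps each theory to P(T | D); claim_given_theory maps each
    theory to the probability it assigns to the claim E.
    """
    return sum(posteriors[t] * claim_given_theory[t] for t in posteriors)

# Hypothetical family: one theory requires the isolated worlds, the other
# forbids them.
claim = {"T_requires": 1.0, "T_forbids": 0.0}

early = probability_of_claim({"T_requires": 0.9, "T_forbids": 0.1}, claim)
later = probability_of_claim({"T_requires": 0.99, "T_forbids": 0.01}, claim)
```

As more datasets raise the posterior of the theory that requires the worlds,
the probability of their existence rises with it, even though no direct
measurement of one world from the other is possible.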
It could be argued that the above discussion is based on highly questionable
speculations. We can counter-argue that, though speculative, none of the
arguments above is impossible in principle, meaning that there is no no-go
theorem forbidding them. Moreover, stronger support can be found in the fact
that there is indeed a theory where a similar situation exists: eternal
inflation [31]. In the eternal inflation scenario, new universes are being
formed all the time, and these universes do not interact at all.
This interesting analysis shows that the Bayesian framework outlined here
is capable of extending the scope of scientific enquiry to questions beyond
what is usually accepted as addressable.
9 Conclusions
The main objective of this work was to develop a formalisation of the scientific
method using the framework of Bayesian inference. We used this framework
to describe the basic processes that compose the scientific methodology. Re-
lying on it and using ideas related to characteristics of theories of physics,
we were able to give formal definitions to many important, albeit previously
only intuitive, concepts. This formalisation allowed a clear separation be-
tween these concepts and their study provided a better understanding of
their relevance in the structure of physical theories. The sequence of defi-
nitions presented culminated in the two most important ones in the paper,
those of a scientific theory and of the scientific method, both agreeing with
all our intuitive requirements.
Many important conclusions can then be drawn by the use of the for-
malism. The first important insight is that any theory is only defined with
respect to some set of questions and their precise definition is absolutely
crucial for comparing two or more theories. This implies that comparison be-
tween theories that answer different question sets is not a sensible procedure
in general. The question sets define the scope of the theories to be compared,
which do not need to be theories about everything but can be restricted to
the questions the researcher is interested in inside some scientific area or sim-
ply about some specific system. This allows the application of the methods
presented here to select the most appropriate theory to describe a limited set
of questions which, as already argued in the main text, does not need to be
the most powerful or fundamental theory in general.
Another key insight is the role played by the theory’s fundamental
constants, which in physical theories are the fundamental physical constants.
As a consequence of this identification, it was possible to make a clear
distinction between these constants π and the theory’s algorithm α, which
then allowed the formalisation of a concept of fundamental status of a
theory that is in full agreement with the idea as used in theoretical physics
today. The more difficult question of the complexity of a theory was then
restricted to the complexity of the algorithm α. We discussed how difficult
it is to define a measure of complexity, with many attempts to do so in the
literature. Still, based on our studies, we argued that there is a sense in which
we can conjecture that the more fundamental a theory is, the more complex
it tends to become although we still cannot prove this assertion rigorously, as
the complexity concept is not so rigorously defined. However, this conjecture
is related to the interplay between generalisation ability and fitting, which is
a well known issue in machine learning.
Another key point is that it becomes clear that although Bayes’s rule is
a fundamental principle of theory selection, and therefore of the scientific
method, it is not sufficient to capture all the concepts of science. Other
elements, like falsifiability and fundamental status, which are not directly
related to inductive reasoning are also important. One possibility is that
these principles are still derivable from some more fundamental principle.
We see no candidate for this principle other than maximum entropy.
One indication that this may be the correct path to take is the fact that even
Bayes’s rule seems to be derivable from it.
By applying the developed framework to some cosmological problems, we
arrived at the following conclusions. With respect to the typicality of the
human observer, we saw that the exact formulation of the question is of
utmost importance and no proper answer can be given without a proper
formulation. We concluded that there is no basis for favouring multiverse
theories on theoretical grounds alone and any preference can only be
attributed to experimental data. Finally, we argued that attributing the
status of a physical principle to anthropic reasoning may not be appropriate,
as it can be viewed as a simple case of inferring some characteristics of a
physical theory from the very trivial observation that we exist plus the
additional knowledge, already acquired through other experimental evidence,
of what is needed for our survival.
The last result of this work is an important one. We showed, by means of a
problem we called the isolated worlds problem, that through pure inference
science can address situations where a direct positivist approach would
render it useless and, moreover, would not even consider the question a valid
scientific one. This shows that, using Bayesian inference, the extent to
which science can be used is enlarged to situations which were beyond it in
other formulations.
Although we did not discuss the concept of truth anywhere in the paper, as
we actually promised not to do, we need to include a brief observation about
it. The representation of the probabilistic dependences of a theory which we
argued can be expressed in the form of a Bayesian network allows for the
introduction in the theory of hidden nodes. When some variable appears in
a theory only as a hidden node, it is fair to discuss on philosophical grounds
if any reality can be attributed to this variable. The positivist viewpoint
answers this question as a “no”, while the mathematical universe hypothesis,
on the contrary, would answer it as a “yes”.
Concerning the differences between our framework and Minimum Description
Length [5], MDL suggests choosing theories by minimising the
complexity of the description of the hypothesis plus the dataset. Although
MDL may be desirable for many applications and a good approximation to
Bayesian inference, our framework allows us to address questions that do
not appear in MDL. These questions, like falsifiability and unification, play a
very important role in the development and selection of physical theories and
we hope that the use of our formalism can lead to a better understanding of
them.
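The relation between the two criteria can be illustrated with a minimal
sketch. It assumes an idealised two-part code in which a theory with prior
probability P(H) is described in -log2 P(H) bits and the dataset in
-log2 P(D|H) bits (Shannon code lengths). The coin-bias theories, their
priors and the dataset are invented for illustration and do not come from
the paper. Under these assumptions the total description length equals the
negative log of the unnormalised posterior, so both criteria select the same
theory.

```python
import math

# Hypothetical candidate theories: name -> (coin bias, prior probability).
# All values are invented for illustration; the priors sum to 1.
theories = {
    "fair": (0.5, 0.5),
    "biased": (0.8, 0.3),
    "very_biased": (0.95, 0.2),
}

# Invented dataset of coin flips (1 = heads, 0 = tails).
data = [1, 1, 0, 1, 1, 1, 0, 1]


def log_likelihood(bias, flips):
    """Natural log of P(D | H) for independent flips with the given bias."""
    return sum(math.log(bias if x == 1 else 1.0 - bias) for x in flips)


def log_posterior_unnormalised(name):
    """log P(H) + log P(D | H); the evidence P(D) is the same for every H."""
    bias, prior = theories[name]
    return math.log(prior) + log_likelihood(bias, data)


def code_length_bits(name):
    """Two-part description length in bits, assuming Shannon code lengths."""
    bias, prior = theories[name]
    hypothesis_bits = -math.log2(prior)
    data_bits = -log_likelihood(bias, data) / math.log(2)
    return hypothesis_bits + data_bits


# Bayesian ranking: highest posterior. MDL ranking: shortest description.
best_bayes = max(theories, key=log_posterior_unnormalised)
best_mdl = min(theories, key=code_length_bits)

print("Bayes selects:", best_bayes)
print("MDL selects:  ", best_mdl)
```

The agreement here follows from the assumed code lengths. Practical MDL
codes need not correspond exactly to a prior, which is one sense in which
MDL is an approximation to Bayesian inference.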
As a final comment, let us highlight that by using ideas coming from
probability theory and machine learning we were able to give a mathematical
framework to questions that could be considered to lie only in the sphere
of philosophy. This shows how important a part of the scientific endeavour
pure philosophy is, something that seems to be forgotten nowadays.
Acknowledgements I would like to thank Dr. Juan P. Neirotti and Prof. David
Saad for stimulating discussions. The comments and suggestions made by Prof.
Ariel Caticha about the ideas in the manuscript were deeply inspiring and
thoughtful, and I would like to thank him for taking the time to read this
work. I would also like to thank Prof. Nestor Caticha from the University of
Sao Paulo, where the main part of this work was done, for introducing me to
Bayesian theory and helping me to see its relevance to physics.
References
1. K.R. Popper, The Logic of Scientific Discovery (Routledge Classics) (Rout-
ledge, 2002)
2. G. Amelino-Camelia, arXiv:0806.0339v1 [gr-qc] (2008)
3. B. Gower, Scientific Method: A Historical and Philosophical Introduction
(Routledge, 1996)
4. E.T. Jaynes, Probability Theory: The Logic of Science (Cambridge University
Press, 2003)
5. M. Li, P. Vitanyi, An Introduction to Kolmogorov Complexity and Its Applica-
tions (Texts in Computer Science) (Springer, 1997)
6. A. Caticha, in Bayesian Inference and Maximum Entropy Methods in Science
and Engineering: 27th International Workshop on Bayesian Inference and
Maximum Entropy Methods in Science and Engineering, vol. 954, ed. by
K.H. Knuth, A. Caticha, J.L. Center, A. Giffin, C.C. Rodríguez (AIP,
Saratoga Springs (NY), 2007), pp. 11–22
7. A. Caticha, Lectures on Probability, Entropy, and Statistical Physics (2008)
8. T. Brody, The Philosophy Behind Physics (Springer-Verlag, 1994)
9. D. Sivia, J. Skilling, Data Analysis: A Bayesian Tutorial, 2nd edn. (Oxford
University Press, USA, 2006)
10. A. Engel, C. van den Broeck, Statistical Mechanics of Learning (Cambridge
University Press, 2001)
11. F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of
Brain Mechanisms (Spartan Books, 1962)
12. J.P. Neirotti, N. Caticha, Phys. Rev. E 67(4), 041912 (2003). DOI 10.1103/
PhysRevE.67.041912
13. W. McCulloch, W. Pitts, Bulletin of Mathematical Biology 5(4), 115 (1943).
DOI 10.1007/BF02478259
14. J.A. Hertz, A. Krogh, R.G. Palmer, Introduction to the Theory of Neural Com-
putation (Westview Press, 1991)
15. R.T. Cox, American Journal of Physics 14(1), 1 (1946). DOI 10.1119/1.1990764
16. D.N. Page, arXiv:0707.4169v1 pp. 1–7 (2007)
17. J.B. Hartle, M. Srednicki, Physical Review D (Particles, Fields, Gravitation,
and Cosmology) 75(12), 123523 (2007)
18. P. Sollich, Phys. Rev. E 49(5), 4637 (1994)
19. M.J. Duff, arXiv:hep-th/0208093 (2002)
20. J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference (Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988)
21. M. Tegmark, Foundations of Physics 38, 101 (2008)
22. R. Solomonoff, Report V-131, Zator Co. (1960)
23. A.N. Kolmogorov, Probl. Peredachi Inf. 1(1), 3 (1965)
24. T.M. Cover, J. Thomas, Elements of Information Theory (John Wiley & Sons,
New York, NY, 1991)
25. M.A. Nielsen, I.L. Chuang, Quantum Computation and Quantum Information
(Cambridge University Press, 2000)
26. A. Berthiaume, W. van Dam, S. Laplante, Journal of Computer and System
Sciences 63(2), 201 (2001)
27. P. Vitanyi, in COCO ’00: Proceedings of the 15th Annual IEEE Conference on
Computational Complexity (IEEE Computer Society, Washington, DC, USA,
2000), p. 263
28. C.M. Bishop, Pattern Recognition and Machine Learning (Information Science
and Statistics) (Springer, 2006)
29. M. Opper, A Bayesian approach to on-line learning (Cambridge University
Press, New York, NY, USA, 1998), pp. 363–378
30. D.N. Page, Physical Review D (Particles, Fields, Gravitation, and Cosmology)
78(2), 023514 (2008). DOI 10.1103/PhysRevD.78.023514
31. A.H. Guth, arXiv:astro-ph/0101507v1 pp. 1–15 (2001)
