Solomonoff's theory of inductive inference
Solomonoff's theory of inductive inference is a mathematical theory of induction introduced by Ray Solomonoff, based on probability theory and theoretical computer science. In essence, Solomonoff's induction derives the posterior probability of any computable theory, given a sequence of observed data. This posterior probability is derived from Bayes' rule and some universal prior, that is, a prior that assigns a positive probability to every computable theory.
Solomonoff's induction naturally formalizes Occam's razor by assigning larger prior credences to theories that require a shorter algorithmic description.
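In symbols (this is one common way to state the weighting; the exact constant depends on the chosen universal reference machine), a theory $T$ whose shortest description has length $K(T)$ bits receives a prior weight of roughly

$$P(T) \;\approx\; 2^{-K(T)},$$

so every additional bit of description length halves a theory's prior credence.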
Origin
Philosophical
The theory has philosophical foundations and was introduced by Ray Solomonoff around 1960. It is a mathematically formalized combination of Occam's razor and the Principle of Multiple Explanations. All computable theories which perfectly describe previous observations are used to calculate the probability of the next observation, with more weight put on the shorter computable theories. Marcus Hutter's universal artificial intelligence builds upon this to calculate the expected value of an action.
Principle
Solomonoff's induction has been argued to be the computational formalization of pure Bayesianism. To understand this, recall that Bayesianism derives the posterior probability $P(T \mid D)$ of a theory $T$ given data $D$ by applying Bayes' rule, which yields

$$P(T \mid D) = \frac{P(D \mid T)\, P(T)}{\sum_{A} P(D \mid A)\, P(A)},$$

where the sum in the denominator runs over theory $T$ and all theories $A$ that are alternatives to it. For this equation to make sense, the quantities $P(D \mid A)$ and $P(A)$ must be well-defined for all theories $A$ (including $T$ itself). In other words, any theory must define a probability distribution over observable data $D$. Solomonoff's induction essentially boils down to demanding in addition that all such probability distributions be computable.

Interestingly, the set of computable probability distributions is a subset of the set of all programs, which is countable. Similarly, the sets of observable data considered by Solomonoff were finite. Without loss of generality, we can thus consider that any observable data is a finite bit string. As a result, Solomonoff's induction can be defined by only invoking discrete probability distributions.
Solomonoff's induction then allows one to make probabilistic predictions of future data $x$, by simply obeying the laws of probability. Namely, we have

$$P(x \mid D) = \sum_{T} P(x \mid T, D)\, P(T \mid D).$$

This quantity can be interpreted as the average of the predictions $P(x \mid T, D)$ of all theories $T$ given past data $D$, weighted by their posterior credences $P(T \mid D)$.
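As a toy illustration of the two formulas above, the following sketch applies them to a tiny, hand-picked hypothesis class; the three predictors, their description lengths and the function names are invented for the example, whereas genuine Solomonoff induction mixes over all computable distributions.

```python
# Toy illustration of Bayesian mixture prediction over a tiny, hand-picked
# hypothesis class.  Real Solomonoff induction mixes over *all* computable
# predictors, weighted by 2^(-description length); here we use three
# invented predictors and invented description lengths purely for illustration.

def all_zeros(history):
    """Theory 1: the sequence is all zeros -> P(next bit = 1) = 0."""
    return 0.0

def alternating(history):
    """Theory 2: bits alternate, starting with 0."""
    if not history:
        return 0.0
    return 1.0 - history[-1]

def fair_coin(history):
    """Theory 3: bits are independent fair coin flips."""
    return 0.5

# (theory, assumed description length in bits) -- the lengths are invented.
theories = [(all_zeros, 5), (alternating, 8), (fair_coin, 10)]

def likelihood(theory, data):
    """P(data | theory): product of the theory's per-bit probabilities."""
    p = 1.0
    for i, bit in enumerate(data):
        q = theory(data[:i])            # theory's P(next = 1 | history)
        p *= q if bit == 1 else 1.0 - q
    return p

def posterior_and_prediction(data):
    # Prior proportional to 2^(-description length), normalized over the class.
    priors = [2.0 ** -length for _, length in theories]
    z = sum(priors)
    priors = [p / z for p in priors]

    # Bayes' rule: posterior proportional to prior * likelihood.
    joint = [prior * likelihood(t, data) for (t, _), prior in zip(theories, priors)]
    evidence = sum(joint)
    posteriors = [j / evidence for j in joint]

    # Mixture prediction: P(next = 1 | data) = sum_T P(next = 1 | T, data) P(T | data).
    prediction = sum(post * t(data) for (t, _), post in zip(theories, posteriors))
    return posteriors, prediction

if __name__ == "__main__":
    data = [0, 1, 0, 1]                 # observed bits so far
    posteriors, p_next = posterior_and_prediction(data)
    for (t, _), post in zip(theories, posteriors):
        print(f"P({t.__name__} | data) = {post:.3f}")
    print(f"P(next bit = 1 | data) = {p_next:.3f}")
```

Running it on the observed bits 0, 1, 0, 1 concentrates the posterior on the alternating theory and accordingly yields a low probability that the next bit is 1.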
Mathematical
The proof of the "razor" is based on the known mathematical properties of a probability distribution over a countable set. These properties are relevant because the infinite set of all programs is a denumerable set. The sum S of the probabilities of all programs must be exactly equal to one; thus the probabilities must roughly decrease as we enumerate the infinite set of all programs, otherwise S would be strictly greater than one. To be more precise, for every $\varepsilon > 0$, there is some length $l$ such that the total probability of all programs longer than $l$ is at most $\varepsilon$. This does not, however, preclude very long programs from having very high probability.

Fundamental ingredients of the theory are the concepts of algorithmic probability and Kolmogorov complexity. The universal prior probability of any prefix p of a computable sequence x is the sum of the probabilities of all programs that compute something starting with p. Given some p and any computable but unknown probability distribution from which x is sampled, the universal prior and Bayes' theorem can be used to predict the yet unseen parts of x in optimal fashion.
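Written out explicitly (this is one standard form; details such as the prefix-free coding of programs depend on the chosen universal machine $U$), the universal prior probability of a finite prefix $p$ is

$$M(p) \;=\; \sum_{q \,:\, U(q) \text{ starts with } p} 2^{-|q|},$$

where the sum ranges over the programs $q$ whose output on $U$ begins with $p$, and $|q|$ denotes the length of $q$ in bits. Bayes' theorem then gives $M(pa)/M(p)$ as the predictive probability that the bit $a$ follows the prefix $p$.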
Mathematical guarantees
Solomonoff's completeness
The remarkable property of Solomonoff's induction is its completeness. In essence, the completeness theorem guarantees that the expected cumulative errors made by the predictions based on Solomonoff's induction are upper-bounded by the Kolmogorov complexity of the data generating process. The errors can be measured using the Kullback–Leibler divergence or the square of the difference between the induction's prediction and the probability assigned by the data generating process.
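One frequently quoted form of the bound, stated here for binary sequences (the exact constant varies slightly between presentations), reads: if the data are drawn from a computable measure $\mu$ and $M$ denotes Solomonoff's universal predictor, then

$$\sum_{n=1}^{\infty} \mathbb{E}_{\mu}\!\left[\big(M(x_n = 1 \mid x_{<n}) - \mu(x_n = 1 \mid x_{<n})\big)^{2}\right] \;\le\; \frac{\ln 2}{2}\, K(\mu),$$

so the total expected squared prediction error is bounded by a constant multiple of the Kolmogorov complexity $K(\mu)$ of the data generating process, and the per-step error therefore converges to zero.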
Solomonoff's incompleteness
Unfortunately, Solomonoff also proved that Solomonoff's induction is uncomputable. In fact, he showed that computability and completeness are mutually exclusive: any complete theory must be uncomputable. The proof of this fact relies on a game between the induction and the environment. Essentially, any computable induction can be tricked by a computable environment, namely the environment that always negates the computable induction's prediction. This fact can be regarded as an instance of the no free lunch theorem.
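The diagonalization at the heart of this argument can be sketched in a few lines of code; the `predictor` argument below stands for an arbitrary computable induction method returning its estimate of P(next bit = 1 | history), and the example predictor is only an invented placeholder.

```python
# Sketch of the incompleteness argument: a computable environment that
# defeats any fixed computable predictor.  `predictor` is a stand-in for an
# arbitrary computable induction method mapping a bit history to
# P(next bit = 1 | history).

def adversarial_environment(predictor, steps=20):
    """Generate the sequence whose next bit always negates the predictor."""
    history = []
    for _ in range(steps):
        p_one = predictor(history)
        # Emit whichever bit the predictor considers *less* likely.
        next_bit = 0 if p_one >= 0.5 else 1
        history.append(next_bit)
    return history

def example_predictor(history):
    """A toy computable predictor: Laplace-rule frequency of 1s seen so far."""
    return (sum(history) + 1) / (len(history) + 2)

if __name__ == "__main__":
    sequence = adversarial_environment(example_predictor)
    print(sequence)   # every bit is the one the predictor rated less likely
```

Because the environment only needs to run the predictor, it is computable whenever the predictor is, and by construction every one of its bits is the outcome the predictor rated less likely.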
Modern applications
Artificial intelligence
Though Solomonoff's inductive inference is not computable, several AIXI-derived algorithms approximate it in order to make it run on a modern computer. The more computing power they are given, the closer their predictions are to the predictions of inductive inference.

Another direction of inductive inference is based on E. Mark Gold's model of learning in the limit from 1967 and has since developed more and more models of learning. The general scenario is the following: given a class S of computable functions, is there a learner which, for any input of the form $(f(0), f(1), \ldots, f(n))$, outputs a hypothesis (an index $e$)? A learner M learns a function f if almost all its hypotheses are the same index e, which generates the function f; M learns S if M learns every f in S. Basic results are that all recursively enumerable classes of functions are learnable while the class REC of all computable functions is not learnable.
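As a concrete, deliberately simplified illustration of Gold's scenario, the sketch below learns, in the limit, a function from the invented class S = { f(x) = a·x + b : a, b natural numbers } by always conjecturing the least index consistent with the data seen so far; the class, its numbering and the function names are ours and serve only to illustrate identification by enumeration.

```python
# Identification in the limit by enumeration, sketched for a small invented
# class of computable functions: S = { f(x) = a*x + b : a, b natural numbers }.
# The learner always conjectures the least index (in a fixed enumeration of
# (a, b) pairs) consistent with all data seen so far.  After finitely many
# examples its conjecture stabilizes on a correct index -- Gold's criterion.

from itertools import count

def enumerate_hypotheses():
    """Fixed enumeration of (a, b) pairs by increasing a + b, then a."""
    for total in count(0):
        for a in range(total + 1):
            yield (a, total - a)

def consistent(hypothesis, data):
    a, b = hypothesis
    return all(a * x + b == y for x, y in data)

def learner(data):
    """Return the least index e whose hypothesis agrees with the data."""
    for e, hyp in enumerate(enumerate_hypotheses()):
        if consistent(hyp, data):
            return e, hyp

if __name__ == "__main__":
    target = lambda x: 3 * x + 2          # the unknown function f in S
    data = []
    for x in range(6):                    # feed f(0), f(1), ... one by one
        data.append((x, target(x)))
        e, hyp = learner(data)
        print(f"after {x + 1} examples: index {e}, f(x) = {hyp[0]}*x + {hyp[1]}")
```

After enough examples the conjectured index stops changing and is correct, although, as in Gold's model generally, the learner never announces that convergence has occurred.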
Many related models have been considered, and the learning of classes of recursively enumerable sets from positive data is a topic studied from Gold's pioneering paper in 1967 onwards. A far-reaching extension of Gold's approach is developed in Schmidhuber's theory of generalized Kolmogorov complexities, which are kinds of super-recursive algorithms.
Turing machines
The third mathematically based direction of inductive inference makes use of the theory of automata and computation. In this context, the process of inductive inference is performed by an abstract automaton called an inductive Turing machine.

Inductive Turing machines represent the next step in the development of computer science, providing better models for contemporary computers and computer networks and forming an important class of super-recursive algorithms, as they satisfy all conditions in the definition of an algorithm. Namely, each inductive Turing machine is a type of effective method in which a definite list of well-defined instructions for completing a task, when given an initial state, will proceed through a well-defined series of successive states, eventually terminating in an end-state. The difference between an inductive Turing machine and a Turing machine is that, to produce the result, a Turing machine has to stop, while in some cases an inductive Turing machine can do this without stopping. Stephen Kleene used the name calculation procedure or algorithm for procedures that could run forever without stopping. Kleene also demanded that such an algorithm must eventually exhibit "some object". This condition is satisfied by inductive Turing machines, as their results are exhibited after a finite number of steps, but inductive Turing machines do not always tell at which step the result has been obtained.
Simple inductive Turing machines are equivalent to certain other models of computation, while more advanced inductive Turing machines are much more powerful. It is proved that limiting partial recursive functions, trial and error predicates, general Turing machines, and simple inductive Turing machines are equivalent models of computation. However, simple inductive Turing machines and general Turing machines give direct constructions of computing automata, which are thoroughly grounded in physical machines, whereas trial and error predicates, limiting recursive functions and limiting partial recursive functions present syntactic systems of symbols with formal rules for their manipulation. Simple inductive Turing machines and general Turing machines are related to limiting partial recursive functions and trial and error predicates as Turing machines are related to partial recursive functions and lambda calculus.
Note that only simple inductive Turing machines have the same structure as Turing machines. Other types of inductive Turing machines have an essentially more advanced structure owing to structured memory and more powerful instructions. Their use for inference and learning allows higher efficiency and better reflects how people learn.
Some researchers confuse computations of inductive Turing machines with non-stopping computations or with infinite-time computations. First, some computations of inductive Turing machines halt. As in the case of conventional Turing machines, some halting computations give a result, while others do not. Second, some non-stopping computations of inductive Turing machines give results, while others do not. Rules of inductive Turing machines determine when a computation gives a result: an inductive Turing machine produces output from time to time, and once this output stops changing, it is considered the result of the computation. Descriptions of this rule in some papers are incorrect; for instance, Davis formulates the rule for obtaining a result without stopping as "once the correct output has been produced any subsequent output will simply repeat this correct result." Third, contrary to a widespread misconception, inductive Turing machines always give their results after a finite number of steps, in contrast to infinite and infinite-time computations.
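The "result = output that stops changing" rule can be illustrated with a small simulation; the task, the enumeration h and the step cutoff below are invented for the illustration, and a genuine inductive Turing machine runs without any such cutoff.

```python
# Toy simulation of the "result = output that stops changing" rule of an
# inductive Turing machine.  The machine below decides, in the limit, whether
# `target` ever appears in the computable enumeration h(0), h(1), h(2), ...
# It keeps emitting its current guess; the guess it eventually stops changing
# is the result -- but the machine never announces *when* that has happened.

def h(n):
    """Some fixed computable enumeration (invented example)."""
    return n * n % 17

def inductive_machine(target, max_steps=50):
    guess = 0                       # current output: "target not seen so far"
    for n in range(max_steps):      # a real machine would not bound this loop
        if h(n) == target:
            guess = 1               # output changes; later steps keep it at 1
        yield n, guess              # emit the current output at every step

if __name__ == "__main__":
    for step, guess in inductive_machine(target=13):
        print(f"step {step}: current output = {guess}")
    # If the printed output stabilizes, the stable value is the result of the
    # computation, even though no single step certifies that stabilization.
```
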
There are two main distinctions between conventional Turing machines and simple inductive Turing machines. The first is that even simple inductive Turing machines can do much more than conventional Turing machines. The second is that a conventional Turing machine always signals when the result is obtained, while a simple inductive Turing machine in some cases signals that the result has been reached and in other cases does not. People have the illusion that a computer always informs them when the result is obtained. In contrast, users in many cases have to decide themselves whether the computed result is what they need or whether it is necessary to continue the computation. Indeed, everyday desktop computer applications like word processors and spreadsheets spend most of their time waiting in event loops, and do not terminate until directed to do so by users.