Recently I have been looking a bit into algorithmic information theory, and there is this really beautiful idea of Solomonoff induction.

We first build a lexicographic tree of all strings with $n$ bits. This tree has $2^n$ leaf nodes, each corresponding to a binary string of $n$ bits. Each internal node corresponds to the set of strings sharing a common prefix.

We assign a prior belief to each string, expressing how likely we think it is; the mass of an internal node is then the total prior mass of the leaves below it.
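To make this concrete, here is a tiny Python sketch. The uniform prior and the length $n = 4$ are placeholders purely for illustration; any non-negative weights summing to 1 would do.

```python
from itertools import product

n = 4  # illustrative length; any n works

# Toy prior: uniform over all 2^n strings of length n.
# Any assignment of non-negative weights summing to 1 would do just as well.
prior = {"".join(bits): 2 ** -n for bits in product("01", repeat=n)}

def mass(prefix):
    """Mass of the node for `prefix`: total prior weight of all
    length-n strings that start with `prefix`."""
    return sum(p for s, p in prior.items() if s.startswith(prefix))

# The root carries the whole unit of mass, split between its two subtrees.
assert abs(mass("") - 1.0) < 1e-12
```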

Given any sequence, we can traverse the tree according to that sequence. We start from the root node and begin reading the sequence: if a bit is 0, we move to the left child; if a bit is 1, we move to the right child. This means that after seeing the first $k$ bits, we have moved to the node corresponding to the set of all hypotheses consistent with our observations so far.

Our prediction process is extremely simple: we look at the two children of the current node and predict the bit leading to the child with the larger probability mass.
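Continuing the sketch above, the traversal and the prediction rule together look like this (ties are broken towards 0, an arbitrary choice):

```python
def predict_next(prefix):
    """Prediction rule: pick the child of the current node with more mass
    (ties broken towards 0, an arbitrary choice)."""
    return "0" if mass(prefix + "0") >= mass(prefix + "1") else "1"

def predictions(sequence):
    """Traverse the tree along `sequence` (0 = left child, 1 = right child),
    predicting each bit just before it is revealed."""
    prefix, guesses = "", []
    for bit in sequence:
        guesses.append(predict_next(prefix))
        prefix += bit
    return guesses
```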

Suppose we start with 1 unit of probability mass. That is, initially we know nothing about the input, and our hypothesis set is the entire space (which has unit probability). As we observe the input, we eliminate entire branches of the tree and lose some probability mass. The important observation is that each time we make a mistake, we lose at least half of the remaining probability mass: a wrong prediction means the true bit led us to the child carrying no more than half of the current node's mass.
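Here is a small numerical check of this claim, building on the toy prior and helpers from the snippets above; the example string is arbitrary.

```python
import math

def count_mistakes(true_string):
    """Run the predictor along `true_string`, tracking the mass that survives."""
    prefix, remaining, mistakes = "", mass(""), 0
    for bit in true_string:
        guess = predict_next(prefix)
        prefix += bit
        surviving = mass(prefix)
        if guess != bit:
            mistakes += 1
            # A wrong guess means the true bit led to the lighter child,
            # so at least half of the current mass has been discarded.
            assert surviving <= remaining / 2
        remaining = surviving
    return mistakes

x = "0110"  # arbitrary example string
assert count_mistakes(x) <= -math.log2(prior[x])
```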

However, if our prior assigns probability $2^{-H(x)}$ to the true string $x$, then any ancestor node of $x$ must carry at least that much probability mass, since the mass of a node includes the mass of every leaf below it. This means that as we traverse the tree towards $x$ we can never reach a node with probability mass $< 2^{-H(x)}$. But remember that each time we make a mistake, we lose at least half of our probability mass; if we make more than $k$ mistakes, we must have reached a node with $< 2^{-k}$ probability mass remaining. Taking $k = H(x)$, this contradicts what we just concluded! Therefore we can never make more than $H(x)$ mistakes.
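In symbols (writing $m(v)$ for the probability mass at node $v$, notation used only in this derivation): if after making $k$ mistakes we sit at some node $v$ on the path to $x$, then, since $v$ is an ancestor of $x$ and each mistake at least halved the mass while correct predictions never increased it,

$$
2^{-H(x)} \;\le\; m(v) \;\le\; 2^{-k}\, m(\mathrm{root}) \;=\; 2^{-k},
$$

which forces $k \le H(x)$.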

The question, then, is to find some prior probability $\mu(x)$ such that for any string $x$ we can conjure up, $-\log \mu(x)$ is a finite number, i.e. $\mu(x) > 0$. Such a prior probability is called ‘‘universal’’. It is not at all clear that such a probability can exist, and it is a remarkable observation that it does, and in a very natural way, which we will explain next.