Optimal binary search tree


In computer science, an optimal binary search tree, sometimes called a weight-balanced binary tree, is a binary search tree which provides the smallest possible search time for a given sequence of accesses. Optimal BSTs are generally divided into two types: static and dynamic.
In the static optimality problem, the tree cannot be modified after it has been constructed. In this case, there exists some particular layout of the nodes of the tree which provides the smallest expected search time for the given access probabilities. Various algorithms exist to construct or approximate the statically optimal tree given the information on the access probabilities of the elements.
In the dynamic optimality problem, the tree can be modified at any time, typically by permitting tree rotations. The tree is considered to have a cursor starting at the root which it can move or use to perform modifications. In this case, there exists some minimal-cost sequence of these operations which causes the cursor to visit every node in the target access sequence in order. The splay tree is conjectured to have a constant competitive ratio compared to the dynamically optimal tree in all cases, though this has not yet been proven.

Static optimality

Definition

In the static optimality problem as defined by Knuth, we are given a set of $n$ ordered elements and a set of $2n+1$ probabilities. We will denote the elements $a_1$ through $a_n$ and the probabilities $A_1$ through $A_n$ and $B_0$ through $B_n$. $A_i$ is the probability of a search being done for element $a_i$. For $1 \le i < n$, $B_i$ is the probability of a search being done for an element between $a_i$ and $a_{i+1}$, $B_0$ is the probability of a search being done for an element strictly less than $a_1$, and $B_n$ is the probability of a search being done for an element strictly greater than $a_n$. These $2n+1$ probabilities cover all possible searches, and therefore add up to one.
The static optimality problem is the optimization problem of finding the binary search tree that minimizes the expected search time, given the $2n+1$ probabilities. As the number of possible trees on a set of $n$ elements is $\binom{2n}{n}\frac{1}{n+1}$, which is exponential in $n$, brute-force search is not usually a feasible solution.
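To give a feel for how quickly the search space grows, the following sketch evaluates the count of binary search trees on $n$ keys, which is the $n$-th Catalan number $\binom{2n}{n}/(n+1)$. The function name `count_bsts` is only an illustrative choice.

```python
from math import comb


def count_bsts(n: int) -> int:
    """Number of distinct binary search trees on n ordered keys (n-th Catalan number)."""
    return comb(2 * n, n) // (n + 1)


# Print the count for a few tree sizes to show the exponential growth.
for n in (1, 2, 3, 5, 10, 20):
    print(n, count_bsts(n))
```

Already at $n = 20$ there are more than six billion candidate trees, which is why enumeration is impractical.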

Knuth's dynamic programming algorithm

In 1971, Knuth published a relatively straightforward dynamic programming algorithm capable of constructing the statically optimal tree in only $O(n^2)$ time. Knuth's primary insight was that the static optimality problem exhibits optimal substructure; that is, if a certain tree is statically optimal for a given probability distribution, then its left and right subtrees must also be statically optimal for their appropriate subsets of the distribution.
To see this, consider what Knuth calls the "weighted path length" of a tree. The weighted path length of a tree on n elements is the sum of the lengths of all possible search paths, weighted by their respective probabilities. The tree with the minimal weighted path length is, by definition, statically optimal.
Weighted path lengths have a useful property. Let $E$ be the weighted path length of a binary tree, $E_L$ be the weighted path length of its left subtree, and $E_R$ be the weighted path length of its right subtree. Also let $W$ be the sum of all the probabilities in the tree. Observe that when either subtree is attached to the root, the depth of each of its elements is increased by one. Also observe that the root itself has a depth of one. This means that the difference in weighted path length between a tree and its two subtrees is exactly the sum of every single probability in the tree, leading to the following recurrence:
$$E = E_L + E_R + W$$
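As a small check of the recurrence $E = E_L + E_R + W$, the sketch below computes the total weight and weighted path length of a given tree recursively. Several things here are assumptions made for illustration: a tree is a nested tuple `(left, r, right)` whose root holds the $r$-th key, `None` stands for an empty subtree, an empty subtree is treated as a single external leaf whose weight and path length both equal the corresponding gap probability, and the lists `A` (padded at index 0) and `B` hold the key and gap weights. Integer weights are used instead of probabilities to keep the arithmetic exact; dividing by their sum gives probabilities.

```python
def weight_and_cost(tree, A, B, lo, hi):
    """Return (W, E) for a subtree that holds keys lo..hi (1-indexed).

    tree is None for an empty key range, or a tuple (left, r, right)
    whose root is key r. A[k] is the weight of key k (A[0] is unused);
    B[k] is the weight of the gap just after key k (B[0] is below key 1).
    """
    if tree is None:
        # Empty range: a single external leaf for the gap B[lo - 1].
        assert lo == hi + 1
        return B[lo - 1], B[lo - 1]
    left, r, right = tree
    w_left, e_left = weight_and_cost(left, A, B, lo, r - 1)
    w_right, e_right = weight_and_cost(right, A, B, r + 1, hi)
    w = w_left + A[r] + w_right
    # Attaching both subtrees under the root deepens every node by one,
    # which adds exactly the total weight w to the path length.
    return w, e_left + e_right + w


A = [0, 4, 2, 6]   # weights of keys 1..3 (index 0 is padding)
B = [1, 2, 3, 2]   # weights of the four gaps around them

balanced = ((None, 1, None), 2, (None, 3, None))
chain = (None, 1, (None, 2, (None, 3, None)))

print(weight_and_cost(balanced, A, B, 1, 3))  # (20, 46)
print(weight_and_cost(chain, A, B, 1, 3))     # (20, 54)
```

Both trees have the same total weight, but the balanced shape has the smaller weighted path length for these particular weights.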
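This recurrence leads to a natural dynamic programming solution. Let $E_{i,j}$ be the weighted path length of the statically optimal search tree for all values between $a_i$ and $a_j$, let $W_{i,j}$ be the total weight of that tree, and let $R_{i,j}$ be the index of its root. The algorithm can be built using the following formulas:
$$E_{i,i-1} = W_{i,i-1} = B_{i-1} \quad \text{for } 1 \le i \le n+1$$
$$W_{i,j} = W_{i,j-1} + A_j + B_j$$
$$E_{i,j} = \min_{i \le r \le j} \left( E_{i,r-1} + E_{r+1,j} + W_{i,j} \right) \quad \text{for } 1 \le i \le j \le n$$
Evaluated naively, these formulas take $O(n^3)$ time. Knuth's observation that an optimal root $R_{i,j}$ can always be found between $R_{i,j-1}$ and $R_{i+1,j}$ narrows the search for each root enough to bring the total down to $O(n^2)$.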
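The dynamic program can be sketched as follows. This is an illustrative implementation under stated assumptions rather than Knuth's original code: the base case treats an empty key range $i, i-1$ as having weight and path length $B_{i-1}$, the lists `A` (padded at index 0) and `B` hold the key and gap weights, and the root search for each range is restricted to lie between the roots of the two neighbouring shorter ranges, which is the standard restriction that brings the running time to $O(n^2)$. The names `optimal_bst` and `build_tree` are hypothetical.

```python
def optimal_bst(A, B):
    """Fill the tables E (weighted path length) and R (root index).

    A[k] is the weight of key k for k = 1..n (A[0] is padding);
    B[k] is the weight of the gap after key k for k = 0..n.
    Returns (E, R), where E[i][j] and R[i][j] describe keys i..j.
    """
    n = len(A) - 1
    assert len(B) == n + 1
    E = [[0] * (n + 1) for _ in range(n + 2)]
    W = [[0] * (n + 1) for _ in range(n + 2)]
    R = [[0] * (n + 1) for _ in range(n + 2)]
    for i in range(1, n + 2):
        # Empty range: only the gap B[i - 1] remains.
        E[i][i - 1] = W[i][i - 1] = B[i - 1]
    for length in range(1, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            W[i][j] = W[i][j - 1] + A[j] + B[j]
            # Restrict candidate roots using the neighbouring sub-ranges.
            lo, hi = (i, j) if length == 1 else (R[i][j - 1], R[i + 1][j])
            best, best_r = None, None
            for r in range(lo, hi + 1):
                cost = E[i][r - 1] + E[r + 1][j]
                if best is None or cost < best:  # keep the leftmost minimum
                    best, best_r = cost, r
            E[i][j] = best + W[i][j]
            R[i][j] = best_r
    return E, R


def build_tree(R, i, j):
    """Rebuild the optimal tree for keys i..j as nested (left, root, right) tuples."""
    if i > j:
        return None
    r = R[i][j]
    return (build_tree(R, i, r - 1), r, build_tree(R, r + 1, j))


A = [0, 4, 2, 6]   # integer weights for keys 1..3; divide by 20 for probabilities
B = [1, 2, 3, 2]   # integer weights for the four gaps
E, R = optimal_bst(A, B)
print(E[1][3])                # 46
print(build_tree(R, 1, 3))    # ((None, 1, None), 2, (None, 3, None))
```

For these weights the optimal tree has key 2 at the root, with a weighted path length of 46 (or 2.3 after normalising by the total weight of 20).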

Mehlhorn's approximation algorithm

While the $O(n^2)$ time taken by Knuth's algorithm is substantially better than the exponential time required for a brute-force search, it is still too slow to be practical when the number of elements in the tree is very large.
In 1975, Kurt Mehlhorn published a paper proving that a much simpler algorithm could be used to closely approximate the statically optimal tree in only $O(n)$ time. In this algorithm, the root of the tree is chosen so as to most closely balance the total weight of the left and right subtrees. This strategy is then applied recursively on each subtree.
That this strategy produces a good approximation can be seen intuitively by noting that the weights of the subtrees along any path form something very close to a geometrically decreasing sequence. In fact, this strategy generates a tree whose weighted path length is at most
$$2 + \frac{H}{1 - \log_2(\sqrt{5} - 1)}$$
where $H$ is the entropy of the probability distribution. Since no optimal binary search tree can ever do better than a weighted path length of
$$\frac{H}{\log_2 3}$$
this approximation is very close.
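The weight-balancing rule can be sketched as below. This is a simplified illustration, not Mehlhorn's published procedure: it locates each root by binary search over prefix sums, which gives $O(n \log n)$ time overall rather than the $O(n)$ bound of the paper. The function name `approx_bst`, the tuple representation `(left, root, right)`, and the padded weight lists `A` and `B` are assumptions made for the example.

```python
def approx_bst(A, B):
    """Build a tree by repeatedly choosing the most weight-balancing root.

    A[k] is the weight of key k for k = 1..n (A[0] is padding);
    B[k] is the weight of the gap after key k for k = 0..n.
    Returns nested (left, root, right) tuples, with None for an empty subtree.
    """
    n = len(A) - 1
    S = [0] * (n + 1)  # S[t] = sum of A[k] + B[k] for k = 1..t
    for t in range(1, n + 1):
        S[t] = S[t - 1] + A[t] + B[t]

    def weight(i, j):
        # Total weight of keys i..j and the gaps around them.
        return B[i - 1] + S[j] - S[i - 1]

    def diff(i, j, r):
        # Left-subtree weight minus right-subtree weight if r is the root.
        return weight(i, r - 1) - weight(r + 1, j)

    def build(i, j):
        if i > j:
            return None
        # diff is non-decreasing in r, so binary-search its sign change.
        lo, hi = i, j
        while lo < hi:
            mid = (lo + hi) // 2
            if diff(i, j, mid) >= 0:
                hi = mid
            else:
                lo = mid + 1
        r = lo
        # The neighbour just before the sign change may balance better.
        if r > i and abs(diff(i, j, r - 1)) < abs(diff(i, j, r)):
            r -= 1
        return (build(i, r - 1), r, build(r + 1, j))

    return build(1, n)


print(approx_bst([0, 4, 2, 6], [1, 2, 3, 2]))
# ((None, 1, None), 2, (None, 3, None))
```

For this small input the greedy choice happens to coincide with the optimal tree; in general it is only guaranteed to be close in the sense of the bound above.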

Hu–Tucker and Garsia–Wachs algorithms

In the special case that all of the $A_i$ values (the probabilities of searching for the stored elements themselves) are zero, the optimal tree can be found in $O(n \log n)$ time. This was first proved by T. C. Hu and Alan Tucker in a paper that they published in 1971. A later simplification by Garsia and Wachs, the Garsia–Wachs algorithm, performs the same comparisons in the same order. The algorithm works by using a greedy algorithm to build a tree that has the optimal height for each leaf, but is out of order, and then constructing another binary search tree with the same heights.

Dynamic optimality

Definition

There are several different definitions of dynamic optimality, all of which are effectively equivalent to within a constant factor in terms of running-time. The problem was first introduced implicitly by Sleator and Tarjan in their paper on splay trees, but Demaine et al. give a very good formal statement of it.
In the dynamic optimality problem, we are given a sequence of accesses $x_1, \ldots, x_m$ on the keys $1, \ldots, n$. For each access, we are given a pointer to the root of our BST and can use the pointer to perform any of the following operations:
  1. Move the pointer to the left child of the current node.
  2. Move the pointer to the right child of the current node.
  3. Move the pointer to the parent of the current node.
  4. Perform a single rotation on the current node and its parent.
Our BST algorithm can perform any sequence of the above operations as long as the pointer eventually ends up on the node containing the target value $x_i$. The time it takes a given dynamic BST algorithm to perform a sequence of accesses is the total number of such operations performed during that sequence. Given any sequence of accesses on any set of elements, there is some BST algorithm which performs all accesses using the fewest total operations.
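The cost model above can be illustrated with a small simulation. The sketch below counts one unit per pointer move and one unit per rotation, and compares a tree that is never restructured with a simple move-to-root heuristic (rotating the accessed node up to the root). The move-to-root rule is used here only as an easy example of a BST algorithm in this model; it is not claimed to be optimal, and the names `Node`, `insert`, `rotate`, `access`, and `total_cost` are illustrative.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = self.parent = None


def insert(root, key):
    """Plain BST insertion, used only to set up the initial tree."""
    node = Node(key)
    if root is None:
        return node
    cur = root
    while True:
        side = "left" if key < cur.key else "right"
        child = getattr(cur, side)
        if child is None:
            setattr(cur, side, node)
            node.parent = cur
            return root
        cur = child


def rotate(x):
    """Operation 4: rotate x above its parent, preserving key order."""
    p, g = x.parent, x.parent.parent
    if p.left is x:
        p.left = x.right
        if x.right is not None:
            x.right.parent = p
        x.right = p
    else:
        p.right = x.left
        if x.left is not None:
            x.left.parent = p
        x.left = p
    p.parent, x.parent = x, g
    if g is not None:
        if g.left is p:
            g.left = x
        else:
            g.right = x


def access(root, key, move_to_root):
    """Serve one access starting at the root; return (new_root, cost)."""
    cur, cost = root, 0
    while cur.key != key:            # operations 1 and 2 (key assumed present)
        cur = cur.left if key < cur.key else cur.right
        cost += 1
    if move_to_root:
        while cur.parent is not None:
            rotate(cur)              # operation 4
            cost += 1
        root = cur
    return root, cost


def total_cost(keys, accesses, move_to_root):
    root = None
    for k in keys:
        root = insert(root, k)
    total = 0
    for x in accesses:
        root, c = access(root, x, move_to_root)
        total += c
    return total


# Keys inserted in increasing order form a right-leaning chain.
print(total_cost(range(1, 8), [7, 7, 7, 7], move_to_root=False))  # 24
print(total_cost(range(1, 8), [7, 7, 7, 7], move_to_root=True))   # 12
```

On this repetitive access sequence, paying for rotations once makes every later access free, which is the kind of saving a static tree cannot achieve.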
This model defines the fastest possible tree for a given sequence of accesses, but calculating the optimal tree in this sense requires foreknowledge of exactly what the access sequence will be. If we let $OPT(X)$ be the number of operations performed by the strictly optimal tree for an access sequence $X$, we can say that a tree is dynamically optimal as long as, for any $X$, it performs $X$ in time $O(OPT(X))$.
There are several data structures conjectured to have this property, but none proven. It is an open problem whether there exists a dynamically optimal data structure in this model.

Splay trees

The splay tree is a form of binary search tree invented in 1985 by Daniel Sleator and Robert Tarjan on which the standard search tree operations run in $O(\log n)$ amortized time. It is conjectured to be dynamically optimal in the required sense. That is, a splay tree is believed to perform any sufficiently long access sequence $X$ in time $O(OPT(X))$.

Tango trees

The tango tree is a data structure proposed in 2004 by Erik Demaine and others which has been proven to perform any sufficiently long access sequence $X$ in time $O(\log \log n \cdot OPT(X))$. While this is not dynamically optimal, the competitive ratio of $\log \log n$ is still very small for reasonable values of $n$.

Other results

In 2013, John Iacono published a paper which uses the geometry of binary search trees to provide an algorithm which is dynamically optimal if any binary search tree algorithm is dynamically optimal. Nodes are interpreted as points in two dimensions, and the optimal access sequence is the smallest arborally satisfied superset of those points.
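The notion of an arborally satisfied point set can be made concrete with a brute-force check. The sketch assumes the usual definition from the geometric view of binary search trees, which is not spelled out in the text above: each point is a pair (key, time), and a set is arborally satisfied if, for every two points that differ in both coordinates, the axis-aligned rectangle they span contains at least one other point of the set (boundary included). The function name is hypothetical, and the cubic-time test is meant only for small examples.

```python
from itertools import combinations


def is_arborally_satisfied(points):
    """Check the rectangle condition for every pair of points (brute force)."""
    pts = set(points)
    for (x1, y1), (x2, y2) in combinations(pts, 2):
        if x1 == x2 or y1 == y2:
            continue  # pairs on a common horizontal or vertical line are exempt
        x_lo, x_hi = min(x1, x2), max(x1, x2)
        y_lo, y_hi = min(y1, y2), max(y1, y2)
        has_witness = any(
            x_lo <= x <= x_hi and y_lo <= y <= y_hi
            for (x, y) in pts
            if (x, y) != (x1, y1) and (x, y) != (x2, y2)
        )
        if not has_witness:
            return False
    return True


# Access sequence 2, 1, 3 plotted as (key, time) points.
accesses = {(2, 1), (1, 2), (3, 3)}
print(is_arborally_satisfied(accesses))                     # False

# Adding the points where key 2 is also touched at times 2 and 3.
print(is_arborally_satisfied(accesses | {(2, 2), (2, 3)}))  # True
```

In this view, the extra points correspond to nodes touched on the way to each accessed key, so a smaller satisfied superset corresponds to a cheaper execution.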
The interleave lower bound is an asymptotic lower bound on dynamic optimality.