Arrays with gaps

In the comp.programming newsgroup I presented a challenge to devise a way to handle arrays with gaps. There were a couple of responses and then I presented two further methods.

The challenge

Here is a problem that comes up now and then in real life. Suppose we have an array that has gaps, i.e., there are entries for some indices and not for others. Let A be the array. The array can be empty, sparsely filled, half filled, almost completely full, or completely full.

What we need to be able to do is, given an index i, find the smallest index j >= i such that A[j] has an entry, if such an index exists. Obviously we can do a linear search starting at i, but this could be an O(n) search if the array is sparse.
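The baseline linear scan is worth pinning down, since the methods below are all refinements of it. A minimal sketch in Python, assuming empty slots hold None (the function name is mine):

```python
# Baseline: linear scan for the smallest occupied index j >= i.
# Empty slots are represented by None (an assumption of this sketch).
def next_occupied_linear(A, i):
    for j in range(i, len(A)):
        if A[j] is not None:
            return j
    return None  # no occupied slot at or after i

A = [None, "a", None, None, "b", None]
print(next_occupied_linear(A, 2))  # -> 4
```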

What is needed is some kind of data structure and algorithm that is space and time efficient so that we can find entries efficiently. Extra points if the method is dynamic, i.e., if it can handle inserts and deletes into the array.

I’ve come up with one way to do it where searches, inserts, and deletes are O(log g), where g is the average gap size; I’ll post it later.

This is a problem you might have to handle in your code; how would you do it?

Responses

Ben Pfaff suggested using a radix tree. He pointed to two URLs on his web site,
http://cvs.savannah.gnu.org/viewvc/pspp/src/libpspp/sparse-array.h?revision=1.4&root=pspp&view=markup
http://cvs.savannah.gnu.org/viewvc/pspp/src/libpspp/sparse-array.c?revision=1.4&root=pspp&view=markup

He pointed out that his implementation included handling sparse arrays, and provided the following quotes from the source code documentation:

```
/* Sparse array data structure.
Implements a dictionary that associates a "unsigned long int"
key with fixed-size values (elements).
The implementation allocates elements in groups of moderate
size, so it achieves maximum space efficiency when elements
are clustered into groups of consecutive keys.  For the same
reason, elements should be kept relatively small, perhaps a
few pointer elements in size.
The implementation is slightly more efficient both in time and
space when indexes are kept small.  Thus, for example, if the
indexes in use start from some fixed base value, consider
using the offset from that base as the index value. */
...
/* Sparse array data structure.
The sparse array is implemented in terms of a "radix tree", a
multiway tree in which a set of bits drawn from the key
determine the child chosen at each level during a search.  The
most-significant bits determine a child of the root, the next
bits determine a child of that child, and so on, until the
least-significant bits determine a leaf node.
In this implementation, the branching factor at each level is
held constant at 2**BITS_PER_LEVEL.  The tree is only made as
tall as need be for the currently largest key, and nodes that
would be entirely empty are not allocated at all.  The
elements are stored in the leaf nodes. */
```
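This is not the pspp code: as a rough illustration of the structure those comments describe, here is a toy fixed-depth radix tree in Python, extended with a find-smallest-key >= k search. BITS, LEVELS, and the rt_* names are inventions for this sketch, and nodes are plain dicts:

```python
# Toy fixed-depth radix tree: interior nodes and leaves are dicts,
# keyed by BITS-wide slices of the key, most-significant slice first.
BITS = 4            # bits consumed per level (branching factor 16)
LEVELS = 4          # handles keys 0 .. 2**(BITS * LEVELS) - 1
MASK = (1 << BITS) - 1

def rt_insert(root, key, value):
    # Walk down, creating interior nodes; the leaf dict maps the
    # least-significant bits of the key to the value.
    node = root
    for level in range(LEVELS - 1, 0, -1):
        node = node.setdefault((key >> (level * BITS)) & MASK, {})
    node[key & MASK] = value

def rt_next(node, key, level=LEVELS - 1, prefix=0):
    # Smallest stored key >= `key` within this subtree, or None.
    slot = (key >> (level * BITS)) & MASK
    for s in sorted(node):
        if s < slot:
            continue
        full = (prefix << BITS) | s
        if level == 0:
            return full
        # In the slot equal to key's own, keep honouring key's low
        # bits; in any later slot, every entry qualifies (search from 0).
        found = rt_next(node[s], key if s == slot else 0, level - 1, full)
        if found is not None:
            return found
    return None
```

Because empty subtrees are simply absent from the dicts, the search skips whole gaps a level at a time, which is the point of the radix-tree representation.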
Gene Ressler suggested using a tree of index ranges. He wrote:
Anyway, you could keep a balanced tree of index ranges not in use (or in use). Then adds to unused elements require log N to update the tree (one node delete and zero to two adds) where N is the number of not-in-use (or in-use) ranges. Deletes are the reverse – merging ranges. The search algorithm is pretty obvious and is also log N.
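Gene's scheme can be sketched as follows, with Python's bisect module over a sorted list of in-use ranges standing in for the balanced tree (a real implementation would use a self-balancing tree to make updates O(log N) as he describes). The class and method names are mine, and deletes are omitted:

```python
import bisect

# Sorted, disjoint in-use ranges [start, end), kept in parallel lists.
# bisect stands in for a balanced tree in this sketch.
class RangeIndex:
    def __init__(self):
        self.starts = []  # sorted range start points
        self.ends = []    # parallel list of (exclusive) end points

    def add(self, i):
        # Mark index i as in use, merging with adjacent ranges.
        k = bisect.bisect_right(self.starts, i) - 1
        if k >= 0 and i < self.ends[k]:
            return  # already in use
        merge_left = k >= 0 and self.ends[k] == i
        merge_right = k + 1 < len(self.starts) and self.starts[k + 1] == i + 1
        if merge_left and merge_right:      # i joins two ranges into one
            self.ends[k] = self.ends[k + 1]
            del self.starts[k + 1]; del self.ends[k + 1]
        elif merge_left:
            self.ends[k] = i + 1
        elif merge_right:
            self.starts[k + 1] = i
        else:                               # i starts a new range
            self.starts.insert(k + 1, i)
            self.ends.insert(k + 1, i + 1)

    def next_occupied(self, i):
        # Smallest in-use index j >= i, or None.
        k = bisect.bisect_right(self.starts, i) - 1
        if k >= 0 and i < self.ends[k]:
            return i  # i falls inside an in-use range
        if k + 1 < len(self.starts):
            return self.starts[k + 1]
        return None
```

A delete would be the mirror image, splitting a range in two when an interior element empties.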
I added two further choices as follows:

Method A: Tree of bitmaps

This is a solution that they might have come up with fifty years ago. Despite that, it may well be the best solution, space and time wise. The data structure is a tree of bitmaps. The first tier is an array of 32-bit words, each bit set to 1 or 0 depending on whether the corresponding bin has content or is empty. Tier 2 is an array of 32-bit words, each bit corresponding to a word of tier 1; the bit is 1 if the corresponding word is non-zero and 0 if it is zero. And so on.

The procedure in outline is as follows: we first look at the desired bin; if it has content we are done. If not, we look at the corresponding word in tier 1 and mask off the low-order bits up through the bit corresponding to our desired bin. If the result is non-zero, we find the lowest 1 bit in the masked word; it corresponds to the bin we want.

If it is zero, we go up to tier 2, find the lowest 1 bit in the corresponding word (again masking off the bits at and below the one we came from), select it, go down to tier 1, find the lowest 1 bit in that word, and select it to get the desired bin. And so on.

Spacewise this is a good solution because the additional storage is only N/31 bits. Timewise it is also a good solution. If the array is very sparse the cost is O(log N) with a small constant factor, the number of iterations being at most (log2 N)/5. If the array is not sparse the cost is O(1) because most searches stay within tier 1.
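The procedure above can be sketched in Python, with 32-bit words per tier as described. The function names are mine, and the tier-building pass is written for clarity rather than speed:

```python
# Sketch of Method A: a tree of bitmaps over n bins (assumes 0 <= i < n).
WORD = 32

def build_tiers(occupied, n):
    # tiers[0][w] has bit b set iff bin w*32 + b has content;
    # tiers[t][w] has bit b set iff word b of tier t-1 is non-zero.
    tiers, bits = [], [1 if i in occupied else 0 for i in range(n)]
    while True:
        words = []
        for w in range(0, len(bits), WORD):
            word = 0
            for b, v in enumerate(bits[w:w + WORD]):
                word |= v << b
            words.append(word)
        tiers.append(words)
        if len(words) <= 1:
            return tiers
        bits = [1 if word else 0 for word in words]

def next_occupied(tiers, i):
    # Ascend until a masked word is non-zero, then descend taking the
    # lowest set bit at each tier.
    t, w, b = 0, i // WORD, i % WORD
    while t < len(tiers):
        word = tiers[t][w] & ~((1 << b) - 1)      # keep only bits >= b
        if word:
            break
        t, b, w = t + 1, w % WORD + 1, w // WORD  # one tier up, one bit right
    else:
        return None  # ran off the top: nothing at or after bin i
    while t > 0:
        w = w * WORD + ((word & -word).bit_length() - 1)
        t -= 1
        word = tiers[t][w]
    return w * WORD + ((word & -word).bit_length() - 1)
```

For example, with bins 13, 40, and 100 occupied out of 128, a query at 14 finds its tier-1 word empty from bit 14 up, climbs a tier, and descends to 40.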

Method B: An ordered heap with marks

An ordered heap is a binary heap that (with a min root) has the additional property that for each node everything in the left subtree is less than everything in the right subtree. Our first step is to arrange the indices 0..N-1 in a balanced ordered heap. For example, if N=15, the ordered heap looks like this:

```
                    0
         ___________|____________
        |                        |
        1                        8
    ____|____                ____|____
   |         |              |         |
   2         5              9        12
  _|_       _|_            _|_       _|_
 |   |     |   |          |   |     |   |
 3   4     6   7         10  11    13  14
```
The second step is to mark the bins; we mark a bin if it or any of its descendants has content. Since the descendants of a node all have larger indices than the node, we know that if a bin is marked, the best fit bin is either that bin itself or among its descendants.

Example: Suppose all of the bins are empty except for bin 13; then bins 0, 8, 12, and 13 will be marked.

There is a corresponding unmark algorithm – if a bin empties we check it and its children (if any). If the bin itself is empty and neither child is marked, we unmark the bin and go up to its parent; otherwise we are done. And so on.

Now for searching. If the index we start with is marked then we search down the subtrees, always checking first the left branch and then the right, always going down the first branch that has a mark and terminating the search when we find one with content. Call this the descent search.

If the bin index we start with is unmarked we are in an ascent search. If we are at a left node we go to the right sibling and look for a mark. If we are at a right node we go up the tree through our ancestors until we find one that is a left node; then we go to its right sibling. (For performance we store the node to go to next as an additional field in the node.)

Example: Again, suppose all bins are empty except 13 and that we are looking for 6 or better. We check 6 – unmarked, so we go to 7. It’s unmarked, so we look for an ancestor that is a left node. 5 is a right node; 5’s parent is 1, a left node, so we go to 1’s sibling, 8. We then successively check 9, 12, and 13.
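Putting the marking and the two searches together, here is a sketch of Method B in Python. It assumes N = 2**k - 1 bins and represents the ordered heap implicitly: the node labelled lo covers the index range [lo, hi], its left child covers [lo+1, mid] and its right child [mid+1, hi]. The class and method names are mine:

```python
# Sketch of Method B: an implicit ordered heap with mark bits.
class OrderedHeap:
    def __init__(self, k):
        self.N = 2 ** k - 1
        self.content = set()  # bins that currently have content
        self.marked = set()   # bin is marked if it or a descendant has content

    def _children(self, lo, hi):
        if lo == hi:
            return []                      # leaf
        mid = (lo + hi + 1) // 2
        return [(lo + 1, mid), (mid + 1, hi)]

    def _path(self, i):
        # Nodes (lo, hi) from the root down to the node labelled i.
        path, (lo, hi) = [], (0, self.N - 1)
        while True:
            path.append((lo, hi))
            if lo == i:
                return path
            left, right = self._children(lo, hi)
            lo, hi = left if i <= left[1] else right

    def insert(self, i):
        self.content.add(i)
        for lo, _ in self._path(i):        # mark i and all its ancestors
            self.marked.add(lo)

    def remove(self, i):
        # Unmark upward while a node is empty and has no marked child.
        self.content.discard(i)
        for lo, hi in reversed(self._path(i)):
            if lo in self.content or any(
                    c in self.marked for c, _ in self._children(lo, hi)):
                return
            self.marked.discard(lo)

    def _descend(self, lo, hi):
        # Smallest bin with content in a marked subtree: left first.
        while lo not in self.content:
            for child in self._children(lo, hi):
                if child[0] in self.marked:
                    lo, hi = child
                    break
        return lo

    def search(self, i):
        # Smallest j >= i with content, or None.
        path = self._path(i)
        if i in self.marked:               # descent search from i itself
            return self._descend(*path[-1])
        for d in range(len(path) - 1, 0, -1):   # ascent search
            left, right = self._children(*path[d - 1])
            if path[d] == left and right[0] in self.marked:
                return self._descend(*right)
        return None
```

With only bin 13 occupied this marks 0, 8, 12, and 13, and search(6) ascends past 7 into 8's subtree and returns 13, matching the examples above.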

Like method A, this algorithm is O(log N) if the array is sparse and O(1) if it is not. Unlike method A, it is not space efficient; the space requirement is O(N).

Well after the fact I realized that there is a major improvement to method B that I will call method C; instead of the marks being bits, each mark should be the minimum occupied index in the node’s subtree. This eliminates the need for a descent search.
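Method C is a small change to the Method B sketch above, under the same assumptions (implicit tree over N = 2**k - 1 bins; class and method names are mine): each node stores the smallest occupied index in its subtree, so a search that reaches a non-empty right sibling can stop immediately.

```python
# Sketch of Method C: per-node subtree minima instead of mark bits.
class MinIndexHeap:
    def __init__(self, k):
        self.N = 2 ** k - 1
        self.content = set()
        self.minimum = {}  # node label -> smallest occupied index in its subtree

    def _children(self, lo, hi):
        if lo == hi:
            return []
        mid = (lo + hi + 1) // 2
        return [(lo + 1, mid), (mid + 1, hi)]

    def _path(self, i):
        path, (lo, hi) = [], (0, self.N - 1)
        while True:
            path.append((lo, hi))
            if lo == i:
                return path
            left, right = self._children(lo, hi)
            lo, hi = left if i <= left[1] else right

    def insert(self, i):
        self.content.add(i)
        for lo, _ in self._path(i):
            self.minimum[lo] = min(self.minimum.get(lo, i), i)

    def remove(self, i):
        # Recompute each ancestor's minimum from its own content and
        # its children's minima, bottom up.
        self.content.discard(i)
        for lo, hi in reversed(self._path(i)):
            vals = [lo] if lo in self.content else []
            vals += [self.minimum[c] for c, _ in self._children(lo, hi)
                     if c in self.minimum]
            if vals:
                self.minimum[lo] = min(vals)
            else:
                self.minimum.pop(lo, None)

    def search(self, i):
        # Smallest j >= i with content, with no descent phase at all.
        path = self._path(i)
        if i in self.minimum:
            return self.minimum[i]  # everything in i's subtree is >= i
        for d in range(len(path) - 1, 0, -1):
            left, right = self._children(*path[d - 1])
            if path[d] == left and right[0] in self.minimum:
                return self.minimum[right[0]]
        return None
```

The ascent stops at the first non-empty right sibling and reads the answer straight out of its stored minimum.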

Concluding remarks

It isn’t particularly profitable to discuss the O() properties of these methods – they are all O(log(X)) for some parameter roughly related to the array size and the number of gaps. It seems likely that the “best” algorithm depends upon factors such as ease of coding and cache friendliness.

The key feature of both method A and method B is that they have an ascending search and a descending search, whereas methods such as radix trees and gap trees are descent only. The advantage of starting at the target and ascending first is that the best fit bin will be found quickly if it is near the target.

Method C is interesting because it is an ascent only algorithm.

Methods A and B presuppose that the array size is fixed and that the number of entries with content is variable. For this reason they aren’t really suitable for sparse arrays where the array size can vary widely.