Hyde County, South Dakota is the Pin Tail Duck Capital of the world. Visit scenic Highmore, SD in 2007!
Richard Harter's World
Site map
Mathcomp
April 2007
email

A (possibly) fast, novel search technique

0. Introduction
1. Statement of the task
2. The discrimination set tree structure
3. Building an discrimination set tree
4. Basic operations
5. A forest of trees
6. Computational costs

0.Introduction

Some time ago Richard Heathfield enquired in the comp.programming newsgroup about effective implementations for a "uniq" operation, i.e., given a suite of data items of varying sizes, some very large, how to identify the duplicates. Various parties replied; for the most part all suggested some variant of hashing the data items and then using either a hash table or some kind of tree to distinguish between the duplicates and the non-duplicates. At the time I commented that one could do that but that there was a much more efficient technique. I said in part:
"The general idea is that we record for each item (of the same length) a byte position where that item differs from some other item. When we insert a new item we check each recorded byte position to see the current item matches at that position. When it is established that there is no conflict between the current item and some existing item then and only then do we do a full comparison."
I didn't go into details at the time and no one followed up on the thought. It turns out that there is an interesting data structure involved, one that I am not familiar with. Let me generalize a wee bit.

1. Statement of the task

Suppose we have a sequence E of data elements e1,e2, ... en and that we wish to determine a uniqueness sequence U for E as follows: For each k, k=1,...,n, u[k] = j where j is the index of the first occurence of ek in E. For example, suppose that E = {a,b,a,c}. Then U = {1,2,1,4}.

The uniqueness sequence (or some variant of it) is useful for a variety of elementary operations, e.g., uniq and diff, as well as building symbol tables.

Now suppose we have a simple function f(p,e) on a parameter space P (character position in the original context) and E such that

(1)  f(p,ei) .ne. f(p,ej) => ei .ne. ej.
and such that f(p,e) is O(1), i.e., it is independent of the length of elements.

For example, if the data elements are character arrays then f(p,ei) could be the p'th character in element ei. Thus, if e1 = "abcd" then f(3,e1) = "c". The important things about f are:

  • It is O(1).
  • It may distinguish between unequal elements but is not guaranteed to do so.
  • It makes different distinctions for different values of p.

Suppose we also have a function g(ei,ej) such that g returns a value p such that if ei .ne. ej then f(p,ei) .ne. f(p,ej). Again, as an example, g could be the first character position at which ei and ej differ. Unlike function f which is O(1), the cost of function g is data dependent. When the two elements are the same, the cost of g is the common length; otherwise it depends on the data. Commonly, however, it will also be O(1).

2. The discrimination set tree structure

With the aid of the functions f and g we can define an unusual data structure that I shall call an discrimination set tree. (This is the third name I have come up with for the structure, the previous ones being cross-linked sets and H trees.) If this type of tree already eists in the literature I would appreciate references.

A node Si has two principal components, a parameter value pi, and a set of records. (In an implementation the set might be implemented as an array, a linked list, a binary tree, or even a trie as might be convenient. Formally, however it is a set.) A record has three components:

  • A reference (pointer) to a data element ej.
  • The value of f(pi,ej)
  • A reference (pointer) to a sub node, if any.

A sub node has the properties that:

  • Its parameter value is distinct from that of any of its ancestral nodes.
  • It's set of records includes the data element from its parent record.
  • It ordinarily has at least two records in its set of records. However deletions can created single element sets.

The root node of an H-tree looks like sub nodes except that it may contain less than two records as a special case; if there are less than two the parameter value is not set.

There are at least two things that make an H-tree unusual. The first is that in effect we have an alternation of generations, i.e., nodes pointing to sets pointing to nodes, etc. The second is that references to elements can appear more than once in a tree; in effect every parent is its own child.

3. Building an discrimination set tree

Now let's look at the basic build algorithm. What we are building is sort of a tree that is sort of a trie. We walk through the sequence of data elements. We start the tree with e1 and e2. We apply g(e1,e2) to get our first parameter p1. (In context we compare e1 and e2 byte by byte to find the first index where they differ.)

For people who think in terms of lists we create a list that looks something like this:

    (p1, (ei, f(p1,e1)), (e2, f(p1,e2)))
Here is a more compact and perhaps more useful notation:
    p1:  1 -- 2
Now let's examine the third element. We determine f(p1,e3). There are two cases. Either it is distinct from the first two values or it is the same as one of them. If it is distinct we extend the list, i.e.,
    p1:  1 -- 2 -- 3
If it matches one of previous values, say that for e2, then we apply g(e,e) again to find a p2 that differentiates e2 and e3. Now our data structure looks like:
    p1:  1 -- 2
              |
    p2:       2 -- 3
After a few more entries we might have a "tree" that looks like this:
    p1:  1 -- 2 -- 5
         |    |
    p2:  |    2 -- 3
         |         |
    p3:  1 -- 4    |
                   |
    p4:            3 -- 6

4. Basic operations

There are three basic operations, searching for elements, deleting elements, and adding new elements. Searches are the simplest and most efficient operations. The plan is to follow down the tree until either no match is found (the element is not there) or a terminal node is reached. If a terminal node is reached the element is possibly there; in that case a full comparison has to be made to determine whether the element referenced in the tree is the same as the one being searched for. This cost is unavoidable; all (completely reliable) algorithms have to do a full verification, either explictly, or implicitly during the search process.

In summary: A failed search is very fast; a successful search is slightly slower because of the need for verification.

Adding an element uses the same logic as a search; however the terminal actions are different. If no match is found, i.e., the value of f(node parameter, element) does not match any value in a nodes set of records a new entry is created in the set is created. If a leaf node is reached and a potential match is found we enter the verification step. If there is a match the add routine reports that a duplicate has been found. On the other hand is a mismatch is found a new sub-node is created going down from the candidate element. The new sub node contains the candidate and the element being added.

Summary: Finding a duplicate has the same cost as a failed search; a successful search has the same cost as a successful search plus the cost of creating a new node.

Deleting elements is the most complicated operation because an element can appear more than once in the tree. The general plan is to replace each instance of the deleted element by a sibling and promote the sibling up in place of the deleted element. The details of this process will be described later. Most element deletions (at least half) are simple deletions, i.e., the only occurrence of the element is in the set of a leaf node. In the worst case the cost is proportional to the height of the tree.

The complexity of the deletion process depends upon whether an element can easily be identified as an element in the tree. In many applications in which elements are deleted this will be the case. For example, a data element may well have the structure:

        struct data_element {
                int length;
                char * data;
                struct usage_record *rec;
        };
In this case the usage record pointer can be used to identify the first occurence of the element to be deleted. If there is no such identification a standard search must be made and a back track process must be used to find the first occurrence.

In summary: Deletion can be complicated; however most deletions will cost about the same as a failed search.

5. A forest of trees

In most applications it is convenient and profitable to use a forest of trees, i.e, to distinguish between data element lengths and to have a separate tree for each distinct list. The advantage lies in the fact that lengths are integers; for which there are many special techniques such as table lookups that are quite fast.

6. Computational costs

Are discrimination set trees actually more computationally efficient than the customary hash tables? The answer is yes, but there are issues. Suppose we have N data elements with an average length of M, and let D be the number of duplicates. If the data elements are to be (mostly) distinct M must be greater than log2(N). Typically M is much larger log2(N). As an example, consider a file of source code. An average line might have 30 bytes, i.e., 240 bits. Few programs have 2**240 lines of code. Let K be the factor by which M is larger, i.e., M = K*log2(N).

If we hash the data elements we need M reads per element, a cost r per read and a hashing cost h per read. The cost per element for the hashing is (r+h)*M. The amortized cost of checking duplicates is r*(D/N)*M which is small compared to the cost of hashing. The hash table cost should also be small compared to the cost of hashing.

Determining the costs of a discrimination set tree based algorithm is not so simple because it is data dependent. In the worst case every entry requires a call to g(e,e) with a cost 2*r*M. This is better than hashing but not by much. However the average case is much better, the number of reads per element being log2(N)/log2(L) where L is the average length of the lists. Therefore the ratio of access costs (not counting hashing) is

(2) K*log2(L)
This is not the entire story; we also have to take into the cost of searching lists. Here we have a space/time tradeoff. If we put the lists in binary search trees the list search time will be O(log2(L)); if we are willing to invest some space for a table lookup the list search cost will be O(1). The choice is an implementation consideration.

A major factor in the cost of implementations is the cost of storage management. There are three potential problem points; these are getting new nodes, getting initial set space for each new node, and increasing the set space as sets increase in size. If a naive storage allocation strategy used allocation/deallocation costs will dominate the algorithm costs; however there are standard techniques for minimizing such costs so that most allocations cost no more than removing an item from a free list.

I have implemented an initial implementation; the preliminary results are quite promising; H tree searches are faster than hash table searches for the test data sets. However there is much work to do be done before the empirical results are satisfactory.


This page was last updated April 24, 2007.

Richard Harter's World
Site map
Mathcomp
April 2007
email
Hyde County, South Dakota is the Pin Tail Duck Capital of the world. Visit scenic Highmore, SD in 2007!