A (possibly) fast, novel search technique

Contents:
1. Statement of the task
2. The discrimination set tree structure
3. Building a discrimination set tree
4. Basic operations
5. A forest of trees
6. Computational costs
“The general idea is that we record for each item (of the same length) a byte position where that item differs from some other item. When we insert a new item we check each recorded byte position to see whether the current item matches at that position. When it is established that there is no conflict between the current item and some existing item, then and only then do we do a full comparison.”

I didn't go into details at the time and no one followed up on the thought. It turns out that there is an interesting data structure involved, one that I am not familiar with. Let me generalize a wee bit.

1. Statement of the task

Suppose we have a sequence E of data elements e1, e2, ..., en and that we wish to determine a uniqueness sequence U for E as follows: for each k, k = 1, ..., n, u[k] = j where j is the index of the first occurrence of ek in E. For example, suppose that E = {a,b,a,c}. Then U = {1,2,1,4}.

The uniqueness sequence (or some variant of it) is useful for a variety of elementary operations, e.g., uniq and diff, as well as for building symbol tables.

Now suppose we have a simple function f(p,e) on a parameter space P (character position in the original context) and E such that

    (1) f(p,ei) .ne. f(p,ej) => ei .ne. ej

and such that f(p,e) is O(1), i.e., it is independent of the length of elements. For example, if the data elements are character arrays then f(p,ei) could be the p'th character in element ei. Thus, if e1 = "abcd" then f(3,e1) = "c". The important things about f are that it is cheap to compute and that distinct values of f imply distinct elements (equal values imply nothing).
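As a concrete (and entirely hypothetical) sketch in C, suppose elements are stored as length-counted byte arrays; then f can simply return the p'th byte. The type and names here are mine, not part of any fixed interface:

    /* A data element: a length-counted byte array (hypothetical layout). */
    struct element {
        int   length;
        char *data;
    };

    /* f(p,e): the p'th byte of e, O(1).  Distinct values imply
       distinct elements; equal values imply nothing. */
    char f(int p, const struct element *e)
    {
        return e->data[p];   /* caller guarantees p < e->length */
    }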
Suppose we also have a function g(ei,ej) that returns a value p such that if ei .ne. ej then f(p,ei) .ne. f(p,ej). Again, as an example, g could be the first character position at which ei and ej differ. Unlike function f, which is O(1), the cost of function g is data dependent. When the two elements are the same, the cost of g is the common length; otherwise it depends on the data. Commonly, however, it will also be O(1).
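Continuing the same hypothetical sketch, g can scan for the first differing position, returning a sentinel when the elements are identical:

    /* g(ei,ej): first byte position at which ei and ej differ,
       or -1 if they are equal.  Elements are assumed to have equal
       length.  Cost is the length of the common prefix, often O(1). */
    int g(const struct element *ei, const struct element *ej)
    {
        for (int p = 0; p < ei->length; p++)
            if (ei->data[p] != ej->data[p])
                return p;
        return -1;           /* identical elements */
    }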
2. The discrimination set tree structure

A node Si has two principal components, a parameter value pi and a set of records. (In an implementation the set might be implemented as an array, a linked list, a binary tree, or even a trie, as might be convenient. Formally, however, it is a set.) A record has three components: a discriminating value f(pi,e), a reference to the element e, and an optional reference to a sub node.
A sub node has the properties that its parameter value is set, that its set contains at least two records, and that the values f(p,e) of the records in its set are pairwise distinct.
The root node of an H-tree looks like a sub node except that, as a special case, it may contain fewer than two records; if there are fewer than two the parameter value is not set.
There are at least two things that make an H-tree unusual. The first is that in
effect we have an alternation of generations, i.e., nodes pointing to sets pointing
to nodes, etc. The second is that references to elements can appear more than once
in a tree; in effect every parent is its own child.
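In C, one plausible shape for the structure is the following sketch; the set is shown as a plain dynamic array (a linked list, binary tree, or trie would do as well) and all the field names are mine:

    /* One record in a node's set: the discriminating value f(p,e),
       a reference to the element, and an optional sub node. */
    struct record {
        char            value;   /* f(node->p, elem)               */
        struct element *elem;    /* element this record stands for */
        struct node    *child;   /* sub node, or NULL              */
    };

    /* A node: a parameter value p and a set of records.  Every
       record's value is distinct within a node. */
    struct node {
        int            p;        /* byte position; unset in a root
                                    with fewer than two records    */
        int            count;    /* number of records in the set   */
        struct record *records;  /* the set, here a dynamic array  */
    };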
3. Building a discrimination set tree

To build a tree we begin with the first two (distinct) elements, using g(e1,e2) to find a position p1 at which they differ. For people who think in terms of lists we create a list that looks something like this:

    (p1, (e1, f(p1,e1)), (e2, f(p1,e2)))

Here is a more compact and perhaps more useful notation:

    p1: 1 -- 2

Now let's examine the third element. We determine f(p1,e3). There are two cases: either it is distinct from the first two values or it is the same as one of them. If it is distinct we extend the list, i.e.,

    p1: 1 -- 2 -- 3

If it matches one of the previous values, say that for e2, then we apply g(e2,e3) to find a p2 that differentiates e2 and e3. Now our data structure looks like:

    p1: 1 -- 2
             |
    p2:      2 -- 3

After a few more entries we might have a "tree" that looks like this:

    p1: 1 -- 2 -- 5
        |    |
    p2: |    2 -- 3
        |         |
    p3: 1 -- 4    |
                  |
    p4:           3 -- 6

4. Basic operations

There are three basic operations: searching for elements, adding new elements, and deleting elements.

Searches are the simplest and most efficient operations. The plan is to follow down the tree until either no match is found (the element is not there) or a terminal node is reached. If a terminal node is reached the element is possibly there; in that case a full comparison has to be made to determine whether the element referenced in the tree is the same as the one being searched for. This cost is unavoidable; all (completely reliable) algorithms have to do a full verification, either explicitly or implicitly during the search process.

In summary: A failed search is very fast; a successful search is slightly slower because of the need for verification.

Adding an element uses the same logic as a search; however the terminal actions are different. If no match is found, i.e., the value of f(node parameter, element) does not match any value in a node's set of records, a new entry is created in the set. If a leaf node is reached and a potential match is found we enter the verification step. If there is a match the add routine reports that a duplicate has been found. On the other hand, if a mismatch is found a new sub node is created going down from the candidate element. The new sub node contains the candidate and the element being added.

Summary: Finding a duplicate has the same cost as a successful search; adding a new element has the same cost as a failed search plus the cost of creating a new entry or sub node.

Deleting elements is the most complicated operation because an element can appear more than once in the tree. The general plan is to replace each instance of the deleted element by a sibling and promote the sibling up in place of the deleted element. The details of this process will be described later. Most element deletions (at least half) are simple deletions, i.e., the only occurrence of the element is in the set of a leaf node. In the worst case the cost is proportional to the height of the tree.

The complexity of the deletion process depends upon whether an element can easily be identified as an element in the tree. In many applications in which elements are deleted this will be the case. For example, a data element may well have the structure:

    struct data_element {
        int length;
        char *data;
        struct usage_record *rec;
    };

In this case the usage record pointer can be used to identify the first occurrence of the element to be deleted. If there is no such identification a standard search must be made and a backtracking process must be used to find the first occurrence.
In summary: Deletion can be complicated; however most deletions will cost about the
same as a failed search.
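Here is a sketch of the search logic, using the hypothetical types and the f and g of the earlier sketches; a real implementation would pick a better set representation than a linear scan:

    /* Search for an element.  Returns the matching element stored in
       the tree, or NULL if there is no match.  A full comparison is
       made only when a terminal record is reached. */
    struct element *search(struct node *node, struct element *elem)
    {
        while (node != NULL) {
            if (node->count == 0)
                return NULL;                       /* empty root           */
            if (node->count == 1)                  /* root special case:   */
                return g(node->records[0].elem, elem) == -1
                           ? node->records[0].elem : NULL;
            char v = f(node->p, elem);             /* O(1) discriminant    */
            struct record *match = NULL;
            for (int i = 0; i < node->count; i++)  /* scan the node's set  */
                if (node->records[i].value == v) {
                    match = &node->records[i];
                    break;
                }
            if (match == NULL)
                return NULL;                       /* failed search: fast  */
            if (match->child == NULL)              /* terminal record:     */
                return g(match->elem, elem) == -1  /* verify in full       */
                           ? match->elem : NULL;
            node = match->child;                   /* descend a generation */
        }
        return NULL;
    }

Adding an element follows the same path; the terminal actions (appending a record to a set, or creating a new sub node after a failed verification) replace the returns above.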
6. Computational costs

If we hash the data elements we need M reads per element, with a cost r per read and a hashing cost h per read. The cost per element for the hashing is (r+h)*M. The amortized cost of checking duplicates is r*(D/N)*M (where N is the number of elements and D the number of duplicates), which is small compared to the cost of hashing. The hash table cost should also be small compared to the cost of hashing.

Determining the costs of a discrimination set tree based algorithm is not so simple because it is data dependent. In the worst case every entry requires a call to g with a cost 2*r*M. This is better than hashing but not by much. However the average case is much better, the number of reads per element being log2(N)/log2(L), where L is the average length of the lists. Therefore the ratio of access costs (not counting hashing) is

    (2) K * log2(L)

where K = M/log2(N).

This is not the entire story; we also have to take into account the cost of searching lists. Here we have a space/time tradeoff. If we put the lists in binary search trees the list search time will be O(log2(L)); if we are willing to invest some space for a table lookup the list search cost will be O(1). The choice is an implementation consideration.

A major factor in the cost of implementations is the cost of storage management. There are three potential problem points; these are getting new nodes, getting initial set space for each new node, and increasing the set space as sets increase in size. If a naive storage allocation strategy is used, allocation/deallocation costs will dominate the algorithm costs; however there are standard techniques for minimizing such costs so that most allocations cost no more than removing an item from a free list.

I have written an initial implementation; the preliminary results are quite promising; H-tree searches are faster than hash table searches for the test data sets. However there is much work to be done before the empirical results are satisfactory.

This page was last updated April 24, 2007.