Note:  I have changed the name of the data structure from 
H tree to discrimination set tree.  However the code for
this implementation uses the term "H tree".  The concepts
for the data structure are explained in 
http://richardhartersworld.com/cri/2007/cls.html.


INTERFACE NOTES:

The interface to the H tree package is described in the header 
file, htree.h.  There are three main entry points for 
manipulating data in the tree.  They are:

htree_add:      Adds one data element to the tree
htree_del:      Deletes one data element from the tree
htree_find:     Searches the tree for a data element

In addition to the main entry points there are two administrative 
entry points, htree_init and htree_close, and two print routines 
that can be used to print out the contents of an H tree.  Finally 
there is a utility routine call ht_verify that tests if two char 
arrays are equal.

htree_init:     Creates an H tree; returns the root.
htree_close:    Deletes an H tree corresponding to a root.
htree_verify:   Compares two data elements and reports whether
                they are equal or not.
                
htree_print_forest: Prints the entire H tree "forest" with a 
                    separate tree for each distinct data length.
htree_print_tree:   Prints an individual tree corresponding to
                    a specified data length.                    

Data elements are specified by a structure called an HT_DATA 
structure.  It has three elements, a length of type int, a 
pointer to a data item of type unsigned char, and an ID of type 
int.  The ID type is specified in htree.h.

In this implementation data lengths are limited to being no more 
than 255; however a BST could be added to ht_init to support 
lengths of arbitrary size.  

The search routine, htree_find, searches an H tree for the 
presence of a data element.  If the element is not present 
htree_find returns 0.  If it is present the ID field in the input 
data structure is set to the value stored in the H tree.  (Note: 
no check is made to ensure that each item in the tree has a 
unique ID.)  There are two arguments, a pointer to the tree root, 
and a pointer to an HT_DATA structure describing the datum being 
searched for.

The add routine, htree_add, adds an element to an H tree.  There 
are three arguments, a pointer to the root, a pointer to the 
HT_DATA structure describing the element, and the element ID.  
The htree_add routine returns 0 if the element is already 
present, and the pointer to the structure describing the element.  
The ID field in the HT_DATA structure is updated.

The delete routine, htree_del, deletes an element from an H tree. 
There are two arguments, a pointer to the tree root, and a 
pointer to an HT_DATA structure describing the element to be 
deleted.  The delete routine returns 0 (false) if the element 
couldn't be deleted and 1 (true).  The input data structure must 
have the ID field properly set.
    
Note: The htree package does not guarantee that the
data elements are nul terminated strings.

PERFORMANCE:

Timing studies suggest that H trees may be one of the
faster key storage and retrieval data structures in some 
contexts.  The major advantages are:

(a) For each distinct data length there is a separate subtree.  
For small and intermediate lengths this is advantageous because 
the data is partitioned in several bins each of which can be 
accessed directly.

(b) The node structure uses short arrays which can be searched
using a linear sentinel search that is significantly faster than
binary searches for short arrays.  Features (a) and (b) imply that 
H trees will be wide and shallow with fast searches on the width.

(c) Only a small number of bytes of a data element are accessed 
during a search.  Unsuccessful searches are very fast.  
Successful searches quickly lead to a terminal node; the search 
is then validated by a string compare function that is faster 
than an equivalent continued tree search.

(d) Only a small number of bytes of a data element are accessed 
when adding the element.  Basically the costs are the cost of an 
unsuccessful search plus the cost of either adding the item to an 
existing node (typically 75% of the time) or adding a new node
(about 25% of the time).

At this point I don't have timing studies comparing H trees 
with ternary trees or patricia trees.  H trees appear to be 
faster than hash tables for data elements that are intermediate 
in length and longer.  The great advantage of H trees (and 
patricia trees) over hash tables lies in the fact that the basic 
operations do not require scanning the entire data element.  Only 
a fragment of the element is accessed on an needed basis.  Hash 
tables may be faster for very small data elements, small being 
approximately less than ten.  For lengths greater than 100 H 
trees are significantly faster than hash tables.

The number of elements, N, in the table/tree is significant.  
Hash tables are relatively insensitive to N, as long as the load 
factor is not too high.  Trees are sensitive to N; access times 
increase as log(N) as N increases.  

Here are some timing results.  All times are in microseconds per
element access.  The hash times are the times to compute a simple 
hash function plus the time to confirm that equal hashes are 
actually equal keys.  The time to insert/delete/find items in the 
hash table is not included.

           1    2    3    4    5    6    7    8    9
H add    .284 .224 .153 .152 .125 .353 .188 .220 .241
H find   .296 .182 .090 .102 .083 .231 .177 .195 .204
hash     .232 .228 .250 .205 .148 .451 .977 .135 .301

Data set 1: 10201 lines of HTML 
Data set 2:  4814 lines of HTML
Data set 3:  1760 lines of HTML
Data set 4:   642 lines of C code
Data set 5:   262 lines of C code
Data set 6:  4500 records of length 100
Data set 7:   900 records of length 240
Data set 8:  4500 records of length 20
Data set 9:  9100 records of lengths 10-78

Timings were done on a PC running windows XP; execution 
files were build using djgpp (gcc for windows) with 
optimization -O3.  They do not reflect the latest version
of the program.

The timings suggest that H trees are not a universal panacea.
Applications where they may be particularly preferable include
processing code (including HTML) and very large data items.

PROGRAMMING NOTES

As noted above, the current version of the program does not 
support data elements longer than 255 bytes.  I plan to remove 
this restriction in the next release.  

The verify routine is not strictly portable - it assumes that
an int is 4 bytes. If this is not true in your system, modify
the code as needed.

The current version keeps multiple copies of data element
descriptors in the tree.  The program would definitely be more
space efficient and probably more time efficient if this were
changed.

The get_diff_pos routine can probably be improved.

This code is under the BSD license.  This means that you can do
any thing you please with the code as long as you don't omit
the license notice.