Note: I have changed the name of the data structure from H tree to discrimination set tree. However the code for this implementation uses the term "H tree". The concepts for the data structure are explained in http://richardhartersworld.com/cri/2007/cls.html. INTERFACE NOTES: The interface to the H tree package is described in the header file, htree.h. There are three main entry points for manipulating data in the tree. They are: htree_add: Adds one data element to the tree htree_del: Deletes one data element from the tree htree_find: Searches the tree for a data element In addition to the main entry points there are two administrative entry points, htree_init and htree_close, and two print routines that can be used to print out the contents of an H tree. Finally there is a utility routine call ht_verify that tests if two char arrays are equal. htree_init: Creates an H tree; returns the root. htree_close: Deletes an H tree corresponding to a root. htree_verify: Compares two data elements and reports whether they are equal or not. htree_print_forest: Prints the entire H tree "forest" with a separate tree for each distinct data length. htree_print_tree: Prints an individual tree corresponding to a specified data length. Data elements are specified by a structure called an HT_DATA structure. It has three elements, a length of type int, a pointer to a data item of type unsigned char, and an ID of type int. The ID type is specified in htree.h. In this implementation data lengths are limited to being no more than 255; however a BST could be added to ht_init to support lengths of arbitrary size. The search routine, htree_find, searches an H tree for the presence of a data element. If the element is not present htree_find returns 0. If it is present the ID field in the input data structure is set to the value stored in the H tree. (Note: no check is made to ensure that each item in the tree has a unique ID.) There are two arguments, a pointer to the tree root, and a pointer to an HT_DATA structure describing the datum being searched for. The add routine, htree_add, adds an element to an H tree. There are three arguments, a pointer to the root, a pointer to the HT_DATA structure describing the element, and the element ID. The htree_add routine returns 0 if the element is already present, and the pointer to the structure describing the element. The ID field in the HT_DATA structure is updated. The delete routine, htree_del, deletes an element from an H tree. There are two arguments, a pointer to the tree root, and a pointer to an HT_DATA structure describing the element to be deleted. The delete routine returns 0 (false) if the element couldn't be deleted and 1 (true). The input data structure must have the ID field properly set. Note: The htree package does not guarantee that the data elements are nul terminated strings. PERFORMANCE: Timing studies suggest that H trees may be one of the faster key storage and retrieval data structures in some contexts. The major advantages are: (a) For each distinct data length there is a separate subtree. For small and intermediate lengths this is advantageous because the data is partitioned in several bins each of which can be accessed directly. (b) The node structure uses short arrays which can be searched using a linear sentinel search that is significantly faster than binary searches for short arrays. Features (a) and (b) imply that H trees will be wide and shallow with fast searches on the width. (c) Only a small number of bytes of a data element are accessed during a search. Unsuccessful searches are very fast. Successful searches quickly lead to a terminal node; the search is then validated by a string compare function that is faster than an equivalent continued tree search. (d) Only a small number of bytes of a data element are accessed when adding the element. Basically the costs are the cost of an unsuccessful search plus the cost of either adding the item to an existing node (typically 75% of the time) or adding a new node (about 25% of the time). At this point I don't have timing studies comparing H trees with ternary trees or patricia trees. H trees appear to be faster than hash tables for data elements that are intermediate in length and longer. The great advantage of H trees (and patricia trees) over hash tables lies in the fact that the basic operations do not require scanning the entire data element. Only a fragment of the element is accessed on an needed basis. Hash tables may be faster for very small data elements, small being approximately less than ten. For lengths greater than 100 H trees are significantly faster than hash tables. The number of elements, N, in the table/tree is significant. Hash tables are relatively insensitive to N, as long as the load factor is not too high. Trees are sensitive to N; access times increase as log(N) as N increases. Here are some timing results. All times are in microseconds per element access. The hash times are the times to compute a simple hash function plus the time to confirm that equal hashes are actually equal keys. The time to insert/delete/find items in the hash table is not included. 1 2 3 4 5 6 7 8 9 H add .284 .224 .153 .152 .125 .353 .188 .220 .241 H find .296 .182 .090 .102 .083 .231 .177 .195 .204 hash .232 .228 .250 .205 .148 .451 .977 .135 .301 Data set 1: 10201 lines of HTML Data set 2: 4814 lines of HTML Data set 3: 1760 lines of HTML Data set 4: 642 lines of C code Data set 5: 262 lines of C code Data set 6: 4500 records of length 100 Data set 7: 900 records of length 240 Data set 8: 4500 records of length 20 Data set 9: 9100 records of lengths 10-78 Timings were done on a PC running windows XP; execution files were build using djgpp (gcc for windows) with optimization -O3. They do not reflect the latest version of the program. The timings suggest that H trees are not a universal panacea. Applications where they may be particularly preferable include processing code (including HTML) and very large data items. PROGRAMMING NOTES As noted above, the current version of the program does not support data elements longer than 255 bytes. I plan to remove this restriction in the next release. The verify routine is not strictly portable - it assumes that an int is 4 bytes. If this is not true in your system, modify the code as needed. The current version keeps multiple copies of data element descriptors in the tree. The program would definitely be more space efficient and probably more time efficient if this were changed. The get_diff_pos routine can probably be improved. This code is under the BSD license. This means that you can do any thing you please with the code as long as you don't omit the license notice.