Getting the minimum of a dynamic set
1. INITIAL FORMULATION

We are given a set S of integers that changes dynamically over time, with the following operations on the set:

insert: An integer that is not a duplicate of any current member is added to S.
delete: A current member of the set is deleted.
getmin: The value of the smallest integer in the set is returned.

There is some number N such that the size of S is guaranteed never to exceed N. Beyond that, inserts and deletes are equally probable. It is desired that the information about the set be organized so that the amortized cost of inserts, deletes, and getmins is O(1).

2. ADDITIONAL ASSUMPTIONS

(1) The integers in S are machine representable, i.e., there is an IMAX such that all members of S are less than IMAX.

(2) The integers inserted are all randomly drawn from the same probability distribution. This is a stronger assumption than is necessary, but it simplifies analysis.

(3) All members of the set are equally likely to be deleted. This is a very important assumption; if, for example, the smallest element is always deleted, we get a completely different task.

(4) The size of S over time has a mean value n, n < N. Again, this is a stronger assumption than necessary, but it simplifies analysis.

(5) The process has a finite duration, bounded by TMAX inserts. The duration is "long", i.e., TMAX >> N.

3. REFINEMENT OF REQUIREMENTS

(1) The worst case cost of an operation need not be O(1). If O(1) worst case cost were required, the best that could be done would be some combination of O(log n) and O(1), e.g., insert and delete costs of O(log n) and a getmin cost of O(1).

(2) There should be no significant likelihood that the worst case will occur.

(3) The amortized cost of each operation should be minimized, i.e., we are looking for a (nearly) optimal solution.

4. DATA STRUCTURES FOR S

There are at least three obvious methods for storing the set: a hash table, a trie, and a bit vector.

Hash tables are convenient, reasonably economical in storage, and easily programmed. The disadvantages are (a) they have a bad worst case search cost, and (b) the cost of computing hash indices can be significant.

Tries (convert the integer into a byte sequence) are also easily programmed and have a fixed cost for inserts and deletes. However, they tend to be relatively expensive in terms of storage.

Using a bit vector is the simplest and most efficient method; however, it can only be used if IMAX is "reasonable". For example, if IMAX is 2**16 the bit vector consists of 2048 32-bit words; if IMAX is 2**32, the bit vector is a 512 megabyte monster. Which method to use is an implementation issue.

5. THE NAIVE GETMIN ALGORITHM

The naive algorithm is to store the current minimum value. If a value smaller than the current minimum is inserted, it becomes the new current minimum. If the minimum value is deleted, the entire set is scanned to determine the minimum of the remaining values.

Evidently the getmin cost is O(1). The cost of determining the minimum after a delete of the current minimum is O(n); however, only about 1/n of the deletes remove the current minimum, so the amortized cost of deletes is O(1).

The naive getmin algorithm meets the initial requirements; however, it would be desirable if the rescanning could be either eliminated or made unlikely. There are two obvious alternatives for reducing the likelihood of rescanning: using a trie, and keeping a set of small values.
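Before turning to those alternatives, here is a minimal sketch of the naive scheme in Python, assuming the built-in set type as the general store (the class name and storage choice are illustrative, not part of the original description):

    class NaiveMinSet:
        """Naive scheme: cache the minimum; rescan only when it is deleted."""

        def __init__(self):
            self.members = set()   # general storage; a hash table in this sketch
            self.cur_min = None    # cached minimum, None when the set is empty

        def insert(self, x):
            self.members.add(x)
            if self.cur_min is None or x < self.cur_min:
                self.cur_min = x   # O(1) update on insert

        def delete(self, x):
            self.members.remove(x)
            if x == self.cur_min:  # happens in only about 1/n of deletes
                self.cur_min = min(self.members) if self.members else None  # O(n) rescan

        def getmin(self):
            return self.cur_min    # O(1)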
Finding the minimum when the values are stored in a trie has a bounded cost. A trie may well be the best choice if guaranteed bounded costs are required. It will not be considered further here.

6. THE SUBSET OF SMALL VALUES ALGORITHM

The root idea here is to keep a subset S' of the smallest values of S; to get the minimum of S we need only find the minimum of S'. There are several variations of this root idea; the simplest is to keep at most M values for some fixed M. As long as there are fewer than M values in the subset we can insert elements into it. If the subset is full (has M elements), we discard the largest when a "small" value is inserted. Conversely, if the subset becomes empty we scan the entire set to find the new subset of the M smallest values. The thought is that if M is large enough, it will be unlikely that we ever need to scan the entire set.

There remains the question of determining whether a newly inserted value should be added to S'. The method considered here is to record the maximum, S'_MAX, of S' when it is full (has M items) and use it as a test value, adding to S' only items that are smaller than S'_MAX. The algorithm then is:

Initialization of S': Insert and delete items in S' until it becomes full (has M entries). Record S'_MAX.

Thereafter:

Inserts: Check whether the item is less than the current S'_MAX. If not, store it in general storage. If it is, add it to S' and increment the size, m, of S'. If m > M, move the current maximum of S' to general storage and scan S' for a new S'_MAX. Also, if the new item is less than the current minimum, it becomes the current minimum.

Deletes: Check whether the item is in S'. If it is not, simply delete it from general storage. If it is in S' and is not the current minimum, delete it from S' and decrement the size of S'. If it is the current minimum and m, the size of S', is greater than 1, delete it, decrement the size of S', and scan S' for the new minimum. Finally, if m = 1, i.e., we are deleting the last member of S', scan all of S to get the M smallest entries and use them to form a new S'.

Under the assumption that the inserted items are drawn from a well defined probability distribution, there will be some value M_TEST such that, when S holds n items, on average M of them are smaller than M_TEST. S'_MAX is an estimate of M_TEST.

7. ANALYSIS OF PERFORMANCE

The amortized costs of inserts, deletes, and getmins are all O(1), with a worst case for deletes of O(n). There remains the question of how likely it is that the entire set S must be rescanned, given a subset size M.

Suppose that at a given moment there are k members in S', k <= M, and n members in S. Let t(k) be the expected time that S' is unchanged, where time is measured in inserts (the unit in which TMAX is given). During a course of 2n events (n inserts, n deletes) we can expect about M insertions into S' and k deletions from S', i.e., about M+k changes to S' per n inserts. Then we have

t(k) = n/(M+k)

Over time the number of elements in S' will go up and down until it either overflows or underflows. Let T(k) be the total time (on average) during which there are k members in S'. We have

T(k) = t(k) + (M/(M+k-1))*T(k-1) + ((k+1)/(M+k+1))*T(k+1)    (1 < k < M)
T(M) = t(M) + (M/(2M-1))*T(M-1)
T(1) = t(1) + (2/(M+2))*T(2)

The average lifetime of S' before an underflow or overflow is SUM = sum(T(k), k=1..M). Most "failures" will be overflows, which have a cost of at most M.
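The equations above form a tridiagonal linear system, one equation per k from 1 to M. Here is a sketch in Python of solving it with the Thomas algorithm (the function name and the normalization n = 1 are illustrative choices, not part of the original); for M = 2 and M = 3 it reproduces the 1.39 and 2.77 entries of the table below to within rounding:

    def lifetimes(M, n=1.0):
        """Solve for T(1..M); returns (T, SUM) with T[k-1] = T(k).
        Time is measured in inserts, so t(k) = n/(M+k)."""
        t = [n / (M + k) for k in range(1, M + 1)]
        sub = [-M / (M + k - 1) for k in range(1, M + 1)]        # coeff. of T(k-1); unused at k = 1
        sup = [-(k + 1) / (M + k + 1) for k in range(1, M + 1)]  # coeff. of T(k+1); unused at k = M
        diag = [1.0] * M
        rhs = t[:]
        for i in range(1, M):                 # forward elimination
            w = sub[i] / diag[i - 1]
            diag[i] -= w * sup[i - 1]
            rhs[i] -= w * rhs[i - 1]
        T = [0.0] * M
        T[M - 1] = rhs[M - 1] / diag[M - 1]   # back substitution
        for i in range(M - 2, -1, -1):
            T[i] = (rhs[i] - sup[i] * T[i + 1]) / diag[i]
        return T, sum(T)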
The average lifetime of S' before an underflow is:

T_U = SUM*(T(M)/2 + T(1)/(M+1))/(T(1)/(M+1))

Here T(M)/2 weights overflows (at k = M half of the changes to S' are inserts) and T(1)/(M+1) weights underflows (at k = 1 a fraction 1/(M+1) of the changes delete the last member of S'); T_U is the lifetime SUM divided by the fraction of failures that are underflows.

The recurrence equations are easily solved. The following table lists SUM/n and T_U/n for various values of M.

       M     SUM/n           T_U/n
       2      1.39            3.55
       3      2.77            9.48
       4      4.56           22.24
       5      6.60           48.02
       6      8.77           96.70
       7     10.97          181.75
       8     13.16          318.50
       9     15.36          521.29
      10     17.58          801.59
      15     28.97         3664.51
      20     41.06         9875.70
      25     53.76        20893.85
      30     66.99        38266.69
      40     94.74        98550.09
      50    123.90       204157.76
      60    154.21       369176.35
      70    185.50       608220.03
      80    217.65       936344.59
      90    250.54      1368985.88
     100    284.11      1921913.55
     200    646.78     17718911.07
     500   1893.83    326620455.63
    1000   4225.63   2922059700.00

If we want M to be large enough that an underflow is unlikely, we must pick it so that T_U > TMAX. Given how rapidly T_U grows with M, M does not need to be very large; a routine for making the choice is sketched below.
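As a sketch of that choice, reusing the lifetimes routine above (the function names are illustrative; T_U is computed exactly as in the formula above):

    def underflow_lifetime(M, n=1.0):
        """T_U = SUM*(T(M)/2 + T(1)/(M+1))/(T(1)/(M+1)); time unit = inserts."""
        T, SUM = lifetimes(M, n)
        w_over = T[M - 1] / 2        # overflow weight at k = M
        w_under = T[0] / (M + 1)     # underflow weight at k = 1
        return SUM * (w_over + w_under) / w_under

    def smallest_safe_M(tmax, n):
        """Smallest M whose expected time to underflow exceeds TMAX inserts."""
        M = 2
        while underflow_lifetime(M, n) <= tmax:
            M += 1                   # T_U grows rapidly with M, so this terminates quickly
        return M

For example, with n = 1000 and TMAX = 10**6, the table brackets the result between M = 10 (T_U about 8.0*10**5 inserts) and M = 15 (T_U about 3.7*10**6 inserts).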