home
table of contents
Comp Sci
January 2004
email

Getting the minimum of a dynamic set

1. INITIAL FORMULATION

We are given a set S of integers that changes dynamically over 
time, with the following operations on the set: 

        insert:  An integer that is not a duplicate of any current
                 member is added to S.
                 
        delete:  A current member of the set is deleted.
        
        getmin:  The value of the smallest integer in the set is returned.
        
There is some number N such that it is guaranteed that the size 
of S will never be greater than N.  Other than that the 
probability of inserts and deletes is equal.  It is desired that 
the information about the set be organized so that: 

(a) the amortized cost of inserts, deletes, and getmins is O(1) 

2.      ADDITIONAL ASSUMPTIONS

(1)  The integers in S are machine representable, i.e., there is 
an IMAX such that all members of S are less than IMAX.

(2)  The integers inserted are all randomly drawn from the same 
probability distribution.  This is a stronger assumption than is 
necessary, but it simplifies analysis.

(3)  All members of the set are equally likely to be deleted.  
This is a very important assumption; if, for example, the 
smallest element is always deleted, we get a completely different 
task.

(4)  The size of S over time has a mean value n, n<N.  Again, 
this is a stronger assumption than necessary, but it simplifies 
analysis.

(5)  The process has a finite duration, bounded by TMAX inserts.  
The duration is "long", i.e. TMAX >> N.

3.      REFINEMENT OF REQUIREMENTS

(1)     The worst case cost of an operation need not be O(1).  If 
this is not the case then the best that can be done is some 
combination of O(log n) and O(1), e.g., insert and delete costs 
are O(log n) and getmin cost is O(1).

(2)     There is no significant likelihood that the worst case 
will occur.  

(3)     The amortized cost of each operation should be minimized, 
i.e., we are looking for a (nearly) optimal solution.

4.      DATA STRUCTURES FOR S

There are at least three obvious methods for storing the set - 
using a hash table, using a trie, and using a bit vector.  Hash 
tables are convenient, reasonably economical in storage usage, 
and are easily programmed.  The disadvantages are (a) they have a 
bad worst case search cost, and (b) the cost of computing hash 
indices can be significant.  Tries (convert the integer into a 
byte sequence) are also easily programmed and have a fixed cost 
for inserts and deletes.  However they tend to be relatively 
expensive in terms of storage cost.  Using a bit vector is the 
simplest and most efficient method; however it can only be used 
if IMAX is "reasonable".  For example, if IMAX is 2**16 then the 
bit vector would consist of 2048 32 bit words; if IMAX is 2**32, 
then the bit vector would be a 2 Gigabyte monster.  Which method 
to use is an implementation issue.

5.      THE NAIVE GETMIN ALGORITHM

The naive algorithm is to store the current minimum value.  If a 
value smaller than the current minimum is inserted then it 
becomes the new current minimum value.  If the minimum value is 
deleted then the entire set is scanned to determine the minimum 
of the remaining values.  Evidently the getmin cost is O(1).  

The cost of determining the minimum after a delete of the current 
minimum is O(n).  However this deletion only occurs in 1/n of the 
deletes, so the amortized cost of deletes is O(1).

The naive getmin algorithm meets the initial requirements; 
however it would be desirable if the rescanning could be either 
eliminated or be made unlikely.  There are two obvious 
alternatives for reducing the likelihood of rescanning: using a 
trie, and keeping a set of small values.  

Finding the minimum when the values are stored in a trie is a 
bounded cost.  A trie may well be the best choice if guaranteed 
bounded costs are required.  It will not considered further here.  

6.      THE SUBSET OF SMALL VALUES ALGORITHM

The root idea here is to keep a subset S' of the smallest values 
of S; to get the minimum of S we need only to find the minimum of 
S'.  There are several variations of this root idea; the simplest 
is to keep at most M values for some fixed M.  
  
As long as there are less than m values in our subset we 
can insert elements into it.  If the subset is full (has M 
elements) then we discard the largest when there is an insertion 
of a "small" value.   

Conversely, if the subset becomes empty we scan the entire suite 
of elements to find the new subset of M smallest values.  The 
thought is that it will be unlikely that we ever need to scan the 
entire suite if m is large enough.  

There remains the question of determining whether a newly 
inserted value should be added to S'.  The method considered here 
is to record the maximum, S'_MAX, of S' when it is full (has M 
items) and use it as a test value, only adding items to S' that 
are smaller than S'_MAX.  The algorithm then is:

Initialization of S':  Initially insert and delete items into S' 
until it becomes full (has M entries).  Record S'_MAX.  
Thereafter:

Inserts:  Check whether the item is less the current S'_MAX.  If 
not, store it in general storage.  If it is add it to S' and 
increment the size, m, of S'.  If m>M then we move the current
maximum of S' to general storage, and scan S' for a new S'_MAX. 
Also, if the new item is less than the current minimum then it 
replaces the current minimum.

Deletes:  Check whether the item is in S'; if it is not, simply 
delete it.  If it is in S' and it is not the current minimum, 
simply delete it from S' and decrement the size of S'.  If it is 
the current minimum and m, the size of S', is greater than 1, 
delete it, decrement the size of S', and scan S' for the new 
minimum.  Finally, if m=1, i.e., we are deleting the last member 
of S', then we scan all of S to get the m smallest entries and
use them to form a new S'.

Under the assumption that the distribution of inserted items 
comes from a well defined probability distribution, there will be 
some value M_TEST such that if there are n items in S, then there 
will be M items in S' on average.  S'_MAX is an estimate of 
M_TEST.

7.      ANALYSIS OF PERFORMANCE

The amortized costs of inserts, deletes, and getmins are all 
O(1) with a worst case for deletes of O(n).  There remains the 
question of how likely it is for it to be necessary to rescan the 
entire set, S, given a subset size, M.  


Suppose that at a given moment in time there are k members in S', 
k <= M.  Suppose further that there are n members in S.  Let t(k) 
be the expected time (where time is measured as the number of 
inserts and deletes) that S' is unchanged.  During a course of 2n 
events (n inserts, n deletes) we can expect M inserts and k 
deletes.  Then we have t(k) = 2n/(M+k).

Over time the number of elements in S' will go up or down until 
it either overflows or underflows.  Let T(k) be the total time 
(on average) in which there are k members in S'.  We have

        T(k) = t(k) + (M/(M+k-1))*T(k-1) + ((k+1)/(M+k+1))*T(k+1)
        (2 < k < M)
        
        T(M) = t(M) + (M/(2M-1))*T(M-1)
        T(1) = t(2) + (2/(M+2))*T(2)
        
The average lifetime of S' before an underflow or overflow is 
SUM = sum(T(k), K=1...M).  Most "failures" will be overflows 
which have a cost of at most M.  The average lifetime of S' 
before an underflow is:

T_U = SUM*(T(M)/2 + T(1)/(M+1))/(T(1)/(M+1))

The recurrence equations are easily solved.  The following table 
lists SUM/n and T_U/n for various values of M.


           M      SUM/n                  T_U/n
           2       1.39                   3.55
           3       2.77                   9.48
           4       4.56                  22.24
           5       6.60                  48.02
           6       8.77                  96.70
           7      10.97                 181.75
           8      13.16                 318.50
           9      15.36                 521.29
          10      17.58                 801.59
          15      28.97                3664.51
          20      41.06                9875.70
          25      53.76               20893.85
          30      66.99               38266.69
          40      94.74               98550.09
          50     123.90              204157.76
          60     154.21              369176.35
          70     185.50              608220.03              
          80     217.65              936344.59
          90     250.54             1368985.88
         100     284.11             1921913.55
         200     646.78            17718911.07
         500    1893.83           326620455.63
        1000    4225.63          2922059700.00
        
If we desire M to be large enough so that it is unlikely that 
there be an underflow then we need to pick it so that T_U > TMAX.  
Given the rate of growth of T_U as a function of M, it doesn't 
need to be much larger.


This page was last updated January 1, 2004.

home
table of contents
Comp Sci
January 2004
email