The median of a moving window – II
In 2004 I wrote an article outlining an algorithm for tracking the median of a moving window as it passed over a data stream. The article claims that the algorithm is an O(log(log(k)) algorithm where k is the window size. It is not entirely clear that the algorithm is that efficient; the details of reheaping are not spelled out. I have never implemented the algorithm and I do not know of an implementation. In view of the Gil-Werman bound cited by Perreault and H´ebert it is unlikely that the argument is correct.
Regardless of whether the 2004 algorithm is viable I feel that it is worthwhile revisiting the problem. Median filters are important; they are widely used in digital processing of picture data and in the smoothing of time series data. It turns out that these are two separate and distinct problems. Digital picture data is two dimensional. More importantly, the possible data values are a relatively small number of predefined distinct integers, i.e., pixel values. Time series data, on the other hand, typically is one dimensional, and the possible data values are either real numbers or a large number of distinct numbers.
When the data values consist of a small number of distinct numbers the best approach is to represent the window with a histogram. The reason is quite simple; updating a histogram and extracting the median are both O(1) operations. The key point here is that all of the values in a bin are the same.
The situation in time series filtering is quite different. Constructing efficient histograms when the possible data values are variable is problematic. Even if a histogram is constructed the values within a bin are distinct, so the entire bin contents must be searched to find the median. The upshot is that value comparison techniques are the methods of choice. There are three major choices, using a double heap, using a search tree, and maintaining a sorted array. Update costs for a sorted array of values within a window is O(nw) where nw is the window size. However the update is simple and efficient. For this reason it is the method of choice for small windows. The double heap and the search tree methods are potentially O(log(nw)).
A troublesome implementation problem is that there are two data structures involved that must be coordinated. One is a circular buffer that holds, directly or indirectly, the values in the window. The other is a data structure from which the median is extracted; this may be an array, a linked list, a pair of heaps, a search tree, or something more exotic.
Using an array
A simple implementation keeps two copies of the window values, one in a sorted array, and one in a circular buffer. When an update is done we do a binary search on the sorted array to find the value being replace, and then a slide loop to enter the new value. As a programming note, the slide loop will be considerably faster if the loop uses sentinels.
Using a double heap
The idea here is to maintain two heaps, one containing all window values <= the median, and the other values > the median. An advantage of a heap is that it can be kept in an array; sifting is done by address calculation so no auxilliary links are needed. A link to a C implementation is given below. There are various choices for connecting the two data structures. The implementation given has a circular buffer of nodes where a node has three items, the data value, the index of the corresponding entry in the heap, and a flag saying which heap the item is in. (The flag can be omitted, but having it simplifies the code.) In turn the heaps contain pointers to the nodes. Locating the item in the heap is immediate. Sifting is straightforward. The major complication comes when an old item is deleted from one heap and the new item is inserted in the other.
The double heap algorithm has three major advantages. The performance is guaranteed to be O(log(nw)), the data structures are simple, and the logic is straightforward.
Using a binary search tree (BST)
In principle using a binary search tree should be quite simple. There is a single data structure, a tree of nodes, where a node has five items, a parent link, left and right child links, a link for the circular buffer, and the node value. When the window is moved, the node containing the trailing edge value is deleted from the tree, the value is set to the leading edge value, and the altered node is inserted into the tree. If the old and new values are both less than or greater than the median, the median remains unchanged. If they are not, the median is replaced by its successor/predecessor, depending on whether the new value is greater/lesser than the old value. (Deleting the median or inserting a copy are special cases,)
The mechanics of tracking the median are independent of the choice of BST; the performance is not. In practice the process of continually deleting the trailing edge and inserting the leading edge is likely to generate unbalanced trees, both for naive algorithms and for algorithms that have amortized O(log(nw)) performance. (The issue is that windows typically are dominated by trends.) What is wanted is one of the tree algorithms that is guaranteed to be O(log(nw)) performance, e.g., AVL or Red/Black trees.
In general, BST algorithms are more cumbersome than heap algorithms and are correspondingly slower. However this really is a quality of implementation issue.
An obvious alternative is to use a skip list. Another alternative is to use a T tree, which potentially combines the advantages of arrays and trees.
Well written array implementations are preferable for small windows. Judging from similar problems, the breakpoint is probably for window sizes of around 20-30. For larger window sizes the double heap algorithm is simple and has guaranteed performance. There may be better alternatives but establishing them will require significant amounts of testing and analysis.
ReferencesMedian Filtering in Constant Time, Simon Perreault and Patrick H´ebert
A two heap implementation of the median of a moving window
This page was last updated February 6, 2011.