# Optimizing the merging of sorted arrays

In 2001 I dashed off a page about optimizing the merging of sorted arrays. In 2009 Bron C. Nelson called my attention to the page and pointed out that it would be better to compare the last elements of the arrays. Discussion between us led to several improvements. This article summarizes the results.

We are given two sorted arrays, x and y, that we wish to merge into an array z. (This is the inner loop operation in merge sorts.) The general plan is to compare the initial elements of each array and move the smaller into z. This is done until in a loop that continues until one of the two arrays is exhausted. The elements of the loop body are:

```        compare current x to the current y
copy the smaller  into the current z
advance the position in the array holding the smaller
```

In addition we need a loop termination test. The obvious choice is to check whether we have exhausted the array that held the smaller. (The obvious is not necessarily best.) Finally we need to append to z the remaining elements of the array that was not exhausted. Here is some simple C code to do that:

```    CODE SAMPLE I
while ((x<after_x) && (y<after_y)) {
*z++ = *x <= *y ? *x++ : *y++;
}
if (x == after_x) memcpy(z,y,(after_y-y)*sizeof *y);
else              memcpy(z,x,(after_x-x)*sizeof *x);
```
Although this code is compact (and compact is good) it has a notable inefficiencies. Adding a point to z requires three comparisons, x<last_x, y<last_y, and *x <= *y. Of these, the latter is the most expensive because it requires at least one indirect fetch. Still, the first two have costs, and it turns out that this cost can be eliminated.

An obvious way to eliminate one of the tests is to move the termination test inside the loop. To do this we get code that looks like this.

```    CODE SAMPLE II
for (;;) {
if (*x <= *y) {
*z++ = *x++;
if (x == after_x) {
memcpy(z,y,(after_y-y)*sizeof *y);
break;
}
} else {
*z++ = *y++;
if (y == after_y) {
memcpy(z,x,(after_x-x)*sizeof *x);
break;
}
}
}
```
Code sample II only has two tests per output point, a value comparison and a pointer comparison. However the loop body has become is more cumbersome with nested conditionals. It is open to question whether it would run any faster than code sample I. Can we improve on code sample I? Yes. After all, we can determine in advance which array will be exhausted first by looking at the last elements of the arrays. We can take advantage of that knowledge and hoist the termination condition outside the loop. This gives us code sample III.
```    CODE SAMPLE III
if (*(after_x-1) <= *(after_y-1)) {
while (x < after_x) *z++ = *x <= *y ? *x++ : *y++;
memcpy(z,y,(after_y-y)*sizeof *y);
} else {
while (y < after_y) *z++ = *x <= *y ? *x++ : *y++;
memcpy(z,x,(after_x-x)*sizeof *x);
}
```
Code sample III appears to be clearer and more efficient than either code sample I or code sample II. However can we do still better? Yes we can. In each of these three versions the loop cannot be unrolled by the compiler. If we could unroll the loops we would eliminate most of the loop termination calls.

In loop unrolling we make several copies of the loop body; this saves the cost of making most of the loop termination tests. This saving is generally not significant unless the loop body is tight, which, however, is the case here. In order to unroll loops effectively we need to know how many times we will go through the loop. (It can also be done if we have a good lower bound on that number.) An obvious choice is to use the length of the merged array except for the length of the “tail”. Unfortunately we don’t know the length of the tail in advance. We can find it with a binary search but the cost of the search may be greater than our savings. Fortunately there is an alternative number that we do know – the length of the array that will first be exhausted. The plan is to only increment (decrement) the loop counter when an item is moved from that array into z. To do this we will need an inner loop for moving items from the other array into x. This gives us code sample IV.

```    CODE SAMPLE IV
if (*(after_x-1) <= *(after_y-1)) {
for (index = after_x-x;index--;) {
while (*y < *x) *z++ = *y++;
*z++ = *x++;
}
memcpy(z,y,(after_y-y)*sizeof *y);
} else {
for (index = after_y -y;index--;) {
while (*x <= *y) *z++ = *x++;
*z++ = *y++;
}
memcpy(z,x,(after_x-x)*sizeof *x);
}
```
In code sample IV the outer loops are simple count down loops that can be automatically unrolled by the compiler. Since there is both an inner and outer loop it may seem as though we’ve lost ground. However the inner loop termination test is the comparison that must be done in any event. Further more the number of transfers required for the inner loop are the same as the number of transfers required in the conditional assignment form.

Note that the comparison in the first inner loop is *y < *x, whereas the comparison in the second inner loop is *x <= *y. This is required for stability in merge sorts.

Can we do even better? In some cases, yes. In the sample codes the “tail” is simply copied. Similarly there will be a “head”. That is, one of the arrays will have a segment of items, all of which are smaller than the initial element of the other array. (The size of the head segment may be zero.) We could simply copy the “head” into z if we knew its length. Unfortunately we don’t. However we can find the end of the segment by using a binary search. The cost of the search would be c1*log2(n) where c1 is some constant and n is the size of array with the smallest initial value. The cost of processing the head is c2*m where c2 is a constant and m is the length of the head. The costs will be the same when m =(c1/c2)*log2(n). Ergo we can check whether it would pay to do a binary search by examining the m’th item. The implementation is left as an exercise for the reader.