Choosing k random lines from a file - Richard Harter's World

home

table of contents

Computer Science

July 2004

email

Choosing k random lines from a file

The task is to randomly select K values from a suite of N values. It turns out that there are two distinct cases, that in which the suite of values is an existing array that may be altered and that in which the suite may not be altered, either because it is not an array (e.g. it is specified by a function) or it is an array that may not be altered.
The first case is easy. What we do is make the first K values of the array our selection. The trick is, for each successive selection, to swap the value in the current position with a random selected value from the slice [current:end]. This gives us code that looks like this:
for (k=0;k<K;k++) { j = randinrange(k,N-1); temp = array[k]; array[k] = array[j]; array[j] = temp; }
where randinrange returns a random number between k through N-1.
We could reduce the second case to the first case by creating an auxilliary array of length N and initialize it to contain the integers 0:N-1. We then apply the case I algorithm to get an array of random indices that can be used to select the desired elements.
There are two major objections to this scheme: The first is that it is an O(N) algorithm whereas the case I algorithm is O(K). The second is that requires O(N) auxilliary space, whereas as we shall see, at most O(K) auxilliary space is needed.
Notice that when we use the auxilliary index array scheme at most 2K elements in the array are altered, these being the first K elements and the selected elements beyond the first K. This suggests that we don’t actually need to construct the auxilliary array; it suffices to keep track of the altered entries. Here is a simple example with N=1000 and K=4 (1 based indexing):
Iteration 1: randinrange(1,N) returned 909 swap array[1], array[909] Iteration 2: randinrange(2,N) returned 117 swap array[2], array[117] Iteration 3: randinrange(3,N) returned 264 swap array[3], array[264] Iteration 4: randinrange(4,N) returned 117 swap array[4], array[117]
Now look at what happens to the contents of the array. Initially it looks like this:
index: 1 2 3 4 .... N value: 1 2 3 4 .... N
After the first iteration it looks like this:
Index: 1 2 ... 908 909 910 ... N Value: 909 2 ... 908 1 910 ... N
and after the third like this:
Index: 1 2 3 ... 117 ... 264 ... 909 ... N Value: 909 117 264 ... 2 ... 3 ... 1 ... N
On the fourth iteration we swap with a location that has already been altered to get:
Index: 1 2 3 4 ... 117 ... 264 ... 909 ... N Value: 909 117 264 2 ... 4 ... 3 ... 1 ... N
It’s easy enough to see what is going on here but how do we organize it. Let’s start by writing the table like this:
Index: 1 2 3 4 Value: 909 117 264 2 Content: 1 4 3 117
Evidently what we want is an array of two element records. The first element is the selection, and the second is the content of the array at the selection. The algorithm, then, looks like this:
loop k = 1 thru K candidate = randinrange(k,N) is candidate an existing selection ? yes: let R be the record associated with the candidate and let C be the content field of R. Add a new record (C,candidate) and replace the content field of R by k. no: Add a new record (candidate,k) end loop
Whether this is an O(K**2), an O(K*log(K)), or a O(K) algorithm depends on how we store the records. If we use an array it is O(K**2), for a tree it is O(K*log(K)), and it is O(K) if we use a hash table.
This is an updated web version of a small paper I did, to wit:
A Note on Random Sampling, Information Processing Letters, Volume 29, Number 5, November 1988, with Marek Chrobak

This page was last updated July 1, 2004.

home

table of contents

Computer Science

July 2004

email