IRC Quote (1)

Here’s a small puzzle:

There’s a file with an unknown number of lines. From that file, I want to read and display 1 line. The line needs to be picked completely at random, so that every line has an equal chance to be selected.

What’s the fastest algorithm you can come up with?

This entry was posted in Code Poetry and tagged , , . Bookmark the permalink.

7 Responses to IRC Quote (1)

  1. Adhemar says:

    When one instructs a computer to pick anything at random, one uses a random number generator. Usually, that random number generator is used to generate the index of the chosen item.

    However, for this puzzle, you could generate a random number for each line. As you go over the file (assuming it has at least 1 line), you call random() each time a new line is started. If the new random number is the smallest number yet, you keep the line in memory, as well as its random number, dropping (of course) the previous line with smallest random number.

    The advantages are that you have to go over the file only once, keeping only 1 number and 1 line in memory.

    The disadvantages are that you have to generate a lot of random numbers; and that if the range of possible outputs of that function is limited, there is a bias towards either the early lines (if you compare with <) or the latest lines (if you compare with <=).

  2. David says:

    Hey Adhemar,

    A fine solution indeed! I’m currently using a very similar algorithm that also only needs to read the whole file once. It’s interesting to see how comparison of n random numbers yields equal chances of P(i) = 1/n for every i from 0 to n. This could be very useful, since that algorithm I’m using requires a modulo, a relatively expensive operation, as well as a random number for every line. I’ll detail it in a next blogpost, though.

    Emperical data suggests the bias to be minimal, which is to be expected looking at the range of a random number (on my platform GCC gives a RAND_MAX of 2147483647). Either way, the bias could be an advantage for the problem we want to solve.

    You don’t happen to have (a link to) a proof for this algorithm (the equal chances part)? Your comment spawned quite a discussion about its correctness on our IRC-channel ;) .

  3. Adhemar says:

    If all calls to random() are independent and identically-distributed, every line has the same change of having the/a minimal valued number. The idea is no different from the “participant with highest roll of the die (or dice), starts the game” principle in many party games (= gezelschapsspelletjes).

    The bias I talk about, occurs when there is more than one line with the minimal random value. The change of this happening decreases when RAND_MAX gets big, and increases when the number of lines gets big. Again, the traditional party game convention could be used: keep all the lines with the smallest random number in memory (or in a separate temporary file when there are too many), and repeat the process among them if there (or on that temporary file) is more than 1.

    On average, 1/RAND_MAX lines will be kept in memory for the next round. Best case is 1, worst case is all of them.

    BTW, I am interested in the IRC log of your discussion.

  4. Adhemar says:

    It just occurred to me that if you want to eliminate the bias (happening when there is more than one line with the minimal random value) by using the iterative process I described above, you can actually do that in one pass over the file.

    Instead of keeping 1 line and 1 number (the minimum so far) in memory, keep 1 line and an array of numbers in memory. The length of the array is the maximum number of iterations, it can also be an unbounded linked list. Don’t expect many iterations to be necessary, especially if RAND_MAX is big.

    So, when random() returns the same value as the minimal value so far, call random() twice again. Store the minimum of the two values in the second item of the array. If the first call returns the smallest of the two values, keep the old line. If the second call returns the smallest of the two values, keep the new one. If both are the same, call random() twice again, and store it in the third value of the array. And so on.

    For later lines, continue to compare the random() return value with the first item in the array first. When it is smaller, keep that line and drop all the further values in the array.

  5. David says:

    Very nice, Adhemar, thanks! The party games analogy makes it even more simple to understand :) .

    I’ll mail you the irc log one of these days, but I’m real busy this week with Zeus so it might take a while.

  6. Pingback: IRC Quote (2) at Crowdway

  7. payday loan says:

    znfkajamhr finthluic jskbwbumli eomnovkrnag thbqhuilk velegsfucxl fjqsiocjc zhgdzlavf

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" line="" escaped="" highlight="">