I found out by accident that the Rice code works for the squozen sort problem. My intuition has barely caught up. This page rationalizes why and how to get there. See Recap for the summary.
In coding theory terms, we've got a message of one million symbols taken from an alphabet of 2^32 possible symbols (the gap widths from 0 to 2^32-1). We need to encode the individual gaps as codewords, bit strings that we write and read serially in memory. A gap "width" s is the unencoded number, and the corresponding codeword has a "length" in bits, which we'll call nbits( s ).
Throughout this explanation I'll stick with the constants 2^32-1 and 1,000,000 rather than giving them names. One weak reason is that were these parameters to change too much, or too much in relation to each other, some of the approximations I develop might stop working. Mainly I like having concrete examples and numbers to look at.
In typical coding problems, you want a code that minimizes the expected codeword length given a probability distribution for the possible symbols (the alphabet, A) to be encoded. That is, minimize
    Sum( P(s) * nbits(s) )
    s in A

given P(s) is the probability or frequency with which we expect symbol s to appear. If the symbols in a message are independent (or if you're calculating as if they were), the expected length of a message is the total of the expected lengths of the individual codewords.
If you know the probability of each symbol, then a code where
    nbits( s ) = -log2( P(s) )

will have the best possible expected message length. Whether or not that equation holds, both of the following need to be true:
    Sum( P(s) ) = 1.0             "probabilities sum to one"
    s in A

    Sum( 2^-nbits( s ) ) = 1.0    "codewords fit"
    s in A
Our problem is like many coding problems in that narrower gaps will be encoded with shorter codewords. The average gap is 2^32/1,000,000, about 4296. So while we have to accommodate all 2^32 possible gap widths, hopefully most gaps will take fewer than 16 bits or we'll go over our 2,000,000 byte quota.
But the problem is unusual in that, instead of the expected length, here we're trying to minimize the maximum size of our message. And instead of a probability distribution on the gap widths, we're given a limit on their total, but it's clear that we need to exploit that somehow. These differences throw my usual instincts and tricks out the window. Time for some new ones.
(Because (as mentioned above) the codeword lengths need to vary, it didn't occur to me that the message length should be fixed. After all, how could both be true? Maybe with that hint you can design the code.)
What's our N? If you take a list of a million 32-bit numbers and sort them, you have removed the order information, and what you have left is a million things, with possible duplicates, taken from a set of 2^32 possibilities. There's a formula for how many ways that can be done:
        ( 1,000,000 + 2^32-1 )!
    N = -----------------------  ~=  2^13,511,283.12
        1,000,000! ( 2^32-1 )!

(The factorial of 2^32 is a 131-billion-bit number; I used the lgamma function to calculate the log directly; see log2_factorial() here.) Now we have a target for our compressed message size:
    log2( N ) < 13,511,284 bits < 1,688,911 bytes.

But what does that say about encoding individual gaps? How can we encode them so that the message comes out that size? There are two ways to go from here: by concentrating on message and codeword lengths, or by finding a substitute for probability.
Here is one way to develop the nbits() function. First, the gap width zero must have the shortest codeword length, nbits(0). So the message length, T, can be divided into a fixed "overhead" plus the total of the variable parts of all the codewords (but that total, V, is also constant):
    T = 1000000 * nbits(0) + V

Next imagine stuffing a message with as many of some large gap as are allowed, filling the rest of the message with zeroes. For instance, only one gap of 2^32-1 can fit, or two gaps of 2^31-1. If we focus on gap widths that divide 2^32-1, then the maximum number of occurrences of a large gap s is
    maxn(s) = (2^32-1) / s

If s is the only nonzero gap in the message, then maxn(s) instances of s must fill the variable part of the message, if our message is to have a fixed size:
    ( nbits( s ) - nbits(0) ) * maxn( s ) = V
      nbits( s ) - nbits(0)               = V / maxn( s )
                                          = V / ( (2^32-1) / s )
      nbits( s ) = nbits(0) + s * V / (2^32-1)

Another way to think about the nbits() relationship is to imagine a complete message, whose gaps total 2^32-1, and which fills the buffer. Now increase one gap and decrease another by the same amount to maintain the total gap width. Because all messages are ideally the same length, the two codewords that changed must have kept the same total length:
    nbits( a + delta ) + nbits( b - delta ) = nbits(a) + nbits(b)
    nbits( a + delta ) - nbits(a)           = nbits(b) - nbits( b - delta )

which means that the slope of the nbits() function is the same everywhere.
Both of these point to a linear nbits() function. That hint's enough for us to derive the actual function, in the section after next.
Now for the substitute for probability. The probability that a message starts with a symbol s is just the fraction of possible messages that start with s. That's the ratio of the number of possible continuations after s to the number of all messages. When s is removed from a message, the number of gaps decreases by one, and the total of the gap widths decreases by s, so
           (999999 + 2^32-1-s)!       (1000000 + 2^32-1)!
    P(s) = --------------------   /   -------------------
           999999! (2^32-1-s)!        1000000! (2^32-1)!

This works out (derivation here) to
               1000000       (999999 + 2^32-1-s)! (2^32-1  )!
    P(s) = --------------- * ---------------------------------
           1000000+2^32-1    (999999 + 2^32-1  )! (2^32-1-s)!

This is the probability that a message will start with s. If we were to follow the pattern as the message develops, we'd have a perfect model of the "probabilities", and if we plugged them into a perfect arithmetic coder, we could encode every possible message in 13,511,284 bits. Instead, let's guess this is an approximately correct symbol distribution across the whole message.
The first factor is constant, and when s = 0, the second factor is one, so

               1000000
    P(0) = ---------------
           1000000 + 2^32-1

Until s gets close to 2^32 (see this chart), the second factor is very nearly
    P(s)          ( 2^32-1 )^s
    ----  ~=  ----------------------  =  ( 1 - P(0) )^s
    P(0)      ( 1000000 + 2^32-1 )^s

so
    P(s) ~= P(0) * ( 1 - P(0) )^s

In other words, this is (close to) a decaying exponential, representing a geometric distribution.
The approximations we've made in this section are to treat the probability of the first symbol as if it held at every position in the message, and to replace the ratio of factorials with a pure exponential.
The linear nbits() function explains how the individual codeword lengths can vary and yet their total stays constant: the message size is a linear function of the number of gaps and their total. When both of those are fixed, the message size is fixed. Increasing one gap means others must shrink; the codeword lengths follow suit linearly.
The sections above suggested some numbers, but what we'll do now is take the idea of a geometric distribution and use it to optimize a code for the constraints of the problem. First, if
    P(s) = P(0) * q^s

Then the "probabilities sum to one" property sets the sum of a geometric series:
    1.0 = P(0) + P(0)q + P(0)q^2 + P(0)q^3 + ...
        = P(0) + 1.0 * q

    q = 1 - P(0)

    P(s) = P(0) * ( 1 - P(0) )^s

    nbits(s) = -log2( P(0) ) - s * log2( 1 - P(0) )

Because nbits() is linear, the length of a message with 1000000 gaps totaling 2^32-1 is
    nbits(M) = -1000000 * log2( P(0) ) - (2^32-1) * log2( 1 - P(0) )

We can adjust P(0) to minimize the message length by setting the derivative with respect to P(0) to zero:
    d nbits(M)     -1000000 / P(0)  +  (2^32-1) / ( 1 - P(0) )
    ----------  =  -------------------------------------------  =  0
      d P(0)                        ln(2)

    -1000000 * ( 1 - P(0) )  +  (2^32-1) * P(0)  =  0

                 1000000
    P(0)  =  ---------------  ~=  1/4296
             1000000 + 2^32-1

which is the same number we got from the combinations formula. Using this we get the message size:
    nbits(M) ~= 13511294.41

Eleven bits longer than our target message size. That's how far the geometric distribution is, efficiency-wise, from ideal for this problem. But so far we've been assuming we can use fractional bits. Let's get real... I mean get int.
Consider integer codeword lengths of the form

    ibits( s ) = int( s/a + b )

where int() rounds down to an integer. The simplest case is where a is a power of two and b is an integer. That gives a group of a codewords of length b, followed by a of length b+1, then a of length b+2, etc. The "codewords fit" property says that
    a * 2^-b  +  a * 2^-(b+1)  +  a * 2^-(b+2)  +  ...  =  1.0

but you can see that the sum of that series is a * 2^-(b-1), which means that
    a * 2^-(b-1) = 1.0

    a = 2^(b-1)

For our problem,
    nbits(s) = -log2( P(0) ) - s * log2( 1 - P(0) )

    -log2( P(0) ) ~= -log2( 1/4296 ) ~= 12.07

We can guess that we'll have to round up the b in our ibits() function, so
    ibits(s) = int( s/4096 + 13 )

Here's the beginning of the code we get by pouring bits into this mold:
    gap widths              length  codewords
    0 .. 4095                 13    0 000000000000 .. 0 111111111111
    4096 .. 8191              14    10 000000000000 .. 10 111111111111
    2x4096 .. 3x4096 - 1      15    110 000000000000 .. 110 111111111111
    3x4096 .. 4x4096 - 1      16    1110 000000000000 .. 1110 111111111111
An important (if obvious) property of ibits() is
    ibits(s) = int( s/a + b )  <=  s/a + b

and so we have a guaranteed message length limit:
    ibits(M) <= sum( s in M ) / a + len(M) * b
             <= (2^32-1)/4096 + 1000000 * 13
             <= 14048576

which is half a million bits, or almost 4%, more than the message size assuming arithmetic coding. That's a loss of just over 1/2 bit per codeword.
The Golomb codes are a larger family of codes that includes the Rice codes as a subset. A Rice code is the case of a Golomb code where the a in the ibits() function (called M in the Wikipedia article) is a whole power of two.
Golomb codes are slightly more efficient for distributions that don't exactly match those of a Rice code. But it turns out, for reasons I'll leave as an exercise for the reader, that more-general Golomb codes don't help in our application; the maximum message size for a non-Rice Golomb code is actually greater than that for the best nearby Rice code, even though the expected size is smaller.
The sort_n.c program picks b = 13 (in the program, nb = 12) by iterating through all the possible b's (using a = 2^(b-1)) and picking the best; around b = 13 the minimum isn't very sharp:
    2^32/2^11 + 1,000,000*12 = 14,097,152 bits
    2^32/2^12 + 1,000,000*13 = 14,048,576 bits
    2^32/2^13 + 1,000,000*14 = 14,524,288 bits
We compared the problem to what coding theory says about minimizing the expected, or average, size of a message, and the relationship between nbits(s), the number of bits to encode a symbol s, and P(s), the symbol's probability. Then we tried to reconcile the mismatch between our problem and the classic coding problem.
We observed that there's a finite set of possible messages, so that in theory we could encode each message with a single, huge number with a fixed number of bits, which would be the smallest possible maximum message size... which we calculated as a target to shoot for. Then we tried two ways to see how that theoretical message size could be built out of separate codewords.
One way was to do some thought experiments, imagining extreme messages and related messages, and asking what must be true for all those messages to have the same size. We broke nbits(s) down into a fixed minimum and a variable part, and found that the variable part of the codeword length must be proportional to s itself, in other words that nbits(s) is a linear function of s.
The second way was to try to find something that would work like the probability in the coding-theory way, yet fit the problem we're actually trying to solve. That turned out to be the fraction of possible messages that have a given symbol at a given place, or how much of the problem goes away once you encode that symbol. By way of a page full of combinatorial expressions and some fudging, we decided that P(s) is close to a geometric distribution, which also means that nbits() is a linear function.
Next we joined the two branches, took the idea of a linear nbits(), and derived the optimum linear nbits() function, which would hit very close to the theoretical target-- if we could use fractional bits.
Finally we designed a similar code, which turns out to be a Rice code, that uses a whole number of bits per codeword, at a slight loss of efficiency, but good enough for the original problem.
...which is not at all the path I originally took. I was trying to calculate the maximum message sizes for various plausible codes and having a hard time, so I thought, why don't I work out an example where the maximum is easy to calculate, even if it's with a code that's obviously inappropriate, like unary code or a Rice code. And of course the answer was under 16 million bits. There may be a whole different or very simplified lesson in that.
Up to my home page.
(You may have got here on a digression from Squozen Sort).