I found out by accident that the Rice code works for the squozen sort problem. My intuition has barely caught up. This page rationalizes why and how to get there. See Recap for the summary.

In coding theory terms, we've got a message of one
million symbols
taken from an alphabet
of 2^{32} possible symbols (the gap widths from 0
to 2^{32}-1). We need to encode the individual gaps as
**codewords**, bit strings that we write and read serially in
memory. A gap "width" *s* is the unencoded number, and the corresponding
codeword has a "length" in bits, which we'll call **nbits( s )**.

Throughout this explanation I'll stick with the constants
2^{32}-1 and 1,000,000 rather than giving them
names. One weak reason is that were these parameters to
change too much, or too much in relation to each other,
some of the approximations I develop might stop working.
Mainly I like having concrete examples and numbers to
look at.

In typical coding problems, you want a code that
minimizes
the *expected* codeword length given a probability
distribution for the possible symbols (the alphabet, A) to be encoded.
That is, minimize

Sum( P(s) * nbits(s) )    for s in A

If you know the probability of each symbol, then a code where

nbits( s ) = -log_2( P(s) )

will have the best possible expected message length. Whether or not that equation holds, both of the following need to be true:

Sum( P(s) ) = 1.0             for s in A    "probabilities sum to one"
Sum( 2^{-nbits(s)} ) = 1.0    for s in A    "codewords fit"

Our problem is like many coding problems in that narrower gaps will be
encoded with shorter codewords. The average gap is 2^{32}/1,000,000, about 4295. So while we have to accommodate all 2^{32}
possible gap widths, hopefully most gaps will take fewer than 16 bits or we'll
go over our 2,000,000 byte quota.

But the problem is unusual in that, instead of the expected length, here we're trying to minimize the *maximum* size of our message.
And instead of a probability distribution on the gap widths,
we're only given a limit on their total,
but it's clear that we need to exploit that limit somehow.
These differences throw my usual instincts and
tricks out the window. Time for some new ones.

(Because, as mentioned above, the codeword lengths need to vary, it didn't occur to me that the message length should be fixed. After all, how could both be true? Maybe with that hint you can design the code.)

What's our N? If you take a list of a million 32-bit numbers and sort them,
you have removed
the order information, and what you have left is a million things, with
possible duplicates, taken from a set of 2^{32}
possibilities. There's a
formula for how many ways that can be done:

N = ( 1,000,000 + 2^{32}-1 )! / ( 1,000,000! ( 2^{32}-1 )! ) ~= 2^{13,511,283.12}

so, in theory, a single fixed-width number identifying the message would need only

log_2( N ) < 13,511,284 bits < 1,688,911 bytes.

But what does that say about encoding individual gaps, or how can we encode individual gaps to make the message come out that size? There are two ways to go from here: by concentrating on message and codeword lengths, or by finding a substitute for probability.
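(That 13,511,283.12 figure is easy to double-check. Here's a little sketch of mine, not part of sort_n.c, that computes log_2(N) with lgamma() from <math.h> so the huge factorials are never actually formed:)

```c
/* Sketch: check log2(N) for N = (1,000,000 + 2^32-1)! / (1,000,000! (2^32-1)!)
   using lgamma() so the factorials are never formed.  Not part of sort_n.c. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double gaps  = 1000000.0;        /* gaps per message                 */
    double total = 4294967295.0;     /* 2^32 - 1, the maximum gap total  */
    double log2_n = (lgamma(gaps + total + 1.0)
                     - lgamma(gaps + 1.0)
                     - lgamma(total + 1.0)) / log(2.0);
    printf("log2(N) ~= %.2f bits\n", log2_n);   /* about 13,511,283.12 */
    return 0;
}
```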

For example, here's a tiny seven-symbol alphabet with a set of codeword lengths whose "codewords fit" sum comes out exactly one:

| s     | nbits(s) | 2^{-nbits(s)} |
|-------|----------|---------------|
| 0     | 1        | 1/2           |
| 1     | 3        | 1/8           |
| 2     | 3        | 1/8           |
| 3     | 4        | 1/16          |
| 4     | 4        | 1/16          |
| 5     | 4        | 1/16          |
| 6     | 4        | 1/16          |
| total |          | 1             |

To derive a code, sort the rows by nbits (they already are in this example), and just assign each row the next available bit pattern:

| s | nbits(s) | codeword |
|---|----------|----------|
| 0 | 1        | 0        |
| 1 | 3        | 100      |
| 2 | 3        | 101      |
| 3 | 4        | 1100     |
| 4 | 4        | 1101     |
| 5 | 4        | 1110     |
| 6 | 4        | 1111     |
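Here's my own little illustration of that "next available bit pattern" rule (it's not code from sort_n.c), using the lengths from the table above:

```c
/* My illustration (not from sort_n.c) of "assign each row the next available
   bit pattern", using the lengths from the little table above. */
#include <stdio.h>

int main(void)
{
    int nbits[] = { 1, 3, 3, 4, 4, 4, 4 };   /* lengths, sorted shortest first */
    int n = sizeof nbits / sizeof nbits[0];
    unsigned code = 0;                        /* next available pattern */

    for (int s = 0; s < n; s++) {
        if (s > 0)                            /* count up, then pad out to the */
            code = (code + 1) << (nbits[s] - nbits[s - 1]);  /* new length     */
        printf("s=%d  nbits=%d  codeword=", s, nbits[s]);
        for (int bit = nbits[s] - 1; bit >= 0; bit--)
            putchar(((code >> bit) & 1) ? '1' : '0');
        putchar('\n');
    }
    return 0;
}
```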

This easy way to derive the code itself from nbits() means we can concentrate on nbits() for now, and leave the code per se to the end. At first it's convenient to allow ourselves fractional nbits() values, but finally we'll arrange to use "whole bits".

Here is one way to develop the nbits() function. First, the gap-width zero must have the shortest codeword length, nbits(0). So the message length can be divided into a fixed "overhead" plus the total of the variable parts of all the codewords (but that total, V, is also constant):

T = 1000000 * nbits(0) + V

Next imagine stuffing a message with as many of some large gap as are allowed, filling the rest of the message with zeroes. For instance, only one gap of 2^{32}-1 fits, two gaps of width (2^{32}-1)/2 fit, and in general the number of gaps of width s that fit is

maxn( s ) = (2^{32}-1) / s

If all these stuffed messages are to have the same length T, the variable parts of their codewords must total V:

( nbits( s ) - nbits(0) ) * maxn( s ) = V
( nbits( s ) - nbits(0) ) = V / maxn( s ) = V / ( (2^{32}-1) / s )
nbits( s ) = nbits(0) + s * V / (2^{32}-1)

Another way to think about the nbits() relationship is to imagine a complete message, whose gaps total 2^{32}-1, and a second message made from it by adding some amount delta to one gap a and subtracting delta from another gap b. Both messages must have the same length, so

nbits( a + delta ) + nbits( b - delta ) = nbits(a) + nbits(b)
nbits( a + delta ) - nbits(a) = nbits(b) - nbits( b - delta )

which means that the slope of the nbits() function is the same everywhere.

Both of these point to a linear nbits() function. That hint's enough for us to derive the actual function, in the section after next.

Suppose all N possible messages are equally likely. Then the probability that a message starts with a symbol *s*
is just the fraction of possible messages that start with *s*.
That's the ratio of the number of
possible continuations after *s* to the number of all messages.
When *s* is removed from a message,
the number of gaps decreases by one, and the
total of the gap widths decreases by *s*, so

P(s) = ( (999999 + 2^{32}-1-s)! / ( 999999! (2^{32}-1-s)! ) ) / ( (1000000 + 2^{32}-1)! / ( 1000000! (2^{32}-1)! ) )

This works out (derivation here) to

P(s) = ( 1000000 / (1000000 + 2^{32}-1) ) * ( (999999 + 2^{32}-1-s)! (2^{32}-1)! ) / ( (999999 + 2^{32}-1)! (2^{32}-1-s)! )

This is the probability that a message will start with a gap of width *s*.

The first factor is constant, and when *s* = 0, the second factor is one,
so

P(0) = 1000000 / ( 1000000 + 2^{32}-1 )

Until *s* gets near 2^{32}-1, each of the *s* factors in that second factor stays close to (2^{32}-1) / (1000000 + 2^{32}-1), so

P(s) / P(0) ~= ( 2^{32}-1 )^{s} / ( 1000000 + 2^{32}-1 )^{s} = ( 1 - P(0) )^{s}

so

P(s) ~= P(0) * ( 1 - P(0) )^{s}

In other words, this is (close to) a decaying exponential, representing a geometric distribution.
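To see how close it is, here's a sketch of mine (not part of sort_n.c) that computes the exact P(s) from the factorial ratio above, as a running product, and compares it with P(0) * (1 - P(0))^{s}:

```c
/* My sketch (not part of sort_n.c): compare the exact P(s) -- the fraction of
   possible messages starting with gap width s -- to the geometric
   approximation P(0) * (1 - P(0))^s. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double T  = 4294967295.0;               /* 2^32 - 1, total of all gaps */
    double p0 = 1000000.0 / (1000000.0 + T);
    int samples[] = { 0, 1, 4096, 65536 };

    for (int k = 0; k < (int)(sizeof samples / sizeof samples[0]); k++) {
        int s = samples[k];
        /* exact P(s)/P(0) = product over i of (T - i) / (999999 + T - i) */
        double ratio = 1.0;
        for (int i = 0; i < s; i++)
            ratio *= (T - i) / (999999.0 + T - i);
        printf("s=%6d  exact=%.6e  geometric=%.6e\n",
               s, p0 * ratio, p0 * pow(1.0 - p0, (double)s));
    }
    return 0;
}
```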

The approximations we've made in this section are to

- pretend every symbol should be coded like the first,
- ignore the nonlinear part of the curve, and
- forget that the distribution is over a finite, not infinite alphabet.

The linear nbits() function explains how the individual codeword widths can vary and yet their total stays constant: the message size is a linear function of the number of gaps and their total. When both of those are fixed, the message size is fixed. Increasing one gap means others must shrink; the codeword lengths follow suit linearly.
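Here's a toy-scale illustration of that point. The constants are made up (13 bits plus a slope of 0.10 bits per unit of gap width, five gaps totaling 100), not the real problem's, but the effect is the same: two gap lists with the same count and the same total always total the same number of bits.

```c
/* Toy-scale illustration (made-up constants, not the real problem): with a
   linear nbits(), two gap lists with the same count and the same total always
   get the same total length. */
#include <stdio.h>

static double nbits(double s) { return 13.0 + s * 0.10; }  /* toy linear code */

int main(void)
{
    double msg1[] = {  0,  0,  5, 10, 85 };   /* 5 gaps, total 100 */
    double msg2[] = { 20, 20, 20, 20, 20 };   /* 5 gaps, total 100 */
    double t1 = 0.0, t2 = 0.0;
    for (int i = 0; i < 5; i++) {
        t1 += nbits(msg1[i]);
        t2 += nbits(msg2[i]);
    }
    printf("%.1f bits vs %.1f bits\n", t1, t2);   /* both come out 75.0 */
    return 0;
}
```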

The sections above suggested some numbers, but what we'll do now is take the idea of a geometric distribution and use it to optimize a code for the constraints of the problem. First, if

P(s) = P(0) q^{s}

Then the "probabilities sum to one" property sets the sum of a geometric series:

1.0 = P(0) + P(0)q + P(0)q^{2} + P(0)q^{3} + ... = P(0) + 1.0 * q
q = 1 - P(0)
P(s) = P(0) ( 1 - P(0) )^{s}
nbits(s) = -log_2( P(0) ) - s * log_2( 1 - P(0) )

Because nbits() is linear, the length of a message with 1000000 gaps totaling 2^{32}-1 is

nbits(M) = -1000000 * log_2( P(0) ) - (2^{32}-1) * log_2( 1 - P(0) )

We can adjust P(0) to minimize the message length by setting the derivative with respect to P(0) to zero:

d nbits(M) / d P(0) = ( -1000000 / P(0) + (2^{32}-1) / ( 1 - P(0) ) ) / ln(2) = 0
-1000000 * ( 1 - P(0) ) + (2^{32}-1) * P(0) = 0
P(0) = 1000000 / ( 1000000 + 2^{32}-1 ) ~= 1/4296

which is the same number we got from the combinations formula. Using this we get the message size:

nbits(M) ~= 13,511,294.41

Eleven bits longer than our target message size. That's how far the geometric distribution is, efficiency-wise, from ideal for this problem. But so far we've been assuming we can use fractional bits. Let's get real...I mean get int.
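(Before we do, here's a quick check of that 13,511,294.41 figure; again a sketch of mine, not part of sort_n.c:)

```c
/* My sketch (not part of sort_n.c): plug the optimal P(0) into the
   fractional-bit message length. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double gaps  = 1000000.0;
    double total = 4294967295.0;              /* 2^32 - 1 */
    double p0    = gaps / (gaps + total);     /* optimal P(0), about 1/4296 */
    double bits  = -gaps * log2(p0) - total * log2(1.0 - p0);
    printf("nbits(M) ~= %.2f\n", bits);       /* about 13,511,294.41 */
    return 0;
}
```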

Suppose we round the linear nbits() function down to a whole number of bits:

ibits( s ) = int( s/a + b )

where int() rounds down to an integer. The simplest case is where a and b are whole numbers, so that gap widths 0 through a-1 get b bits, the next a widths get b+1 bits, and so on. Then the "codewords fit" sum becomes

2^{-b} a + 2^{-(b+1)} a + 2^{-(b+2)} a + ... = 1.0

The real alphabet is finite, so the series should stop eventually, but you can see that the sum of that series is 2^{-(b-1)} a, so

2^{-(b-1)} a = 1.0
a = 2^{b-1}

For our problem,

nbits(s) = -log_2( P(0) ) - s * log_2( 1 - P(0) )
-log_2( P(0) ) ~= -log_2( 1/4296 ) ~= 12.07

We can guess that we'll have to round up the 12.07 to b = 13, which makes a = 2^{b-1} = 4096:

ibits(s) = int( s/4096 + 13 )

Here's the beginning of the code we get by pouring bits into this mold:

| s                    | ibits(s) | codewords                               |
|----------------------|----------|-----------------------------------------|
| 0 .. 4095            | 13       | 0 000000000000 .. 0 111111111111        |
| 4096 .. 8191         | 14       | 10 000000000000 .. 10 111111111111      |
| 2x4096 .. 3x4096 - 1 | 15       | 110 000000000000 .. 110 111111111111    |
| 3x4096 .. 4x4096 - 1 | 16       | 1110 000000000000 .. 1110 111111111111  |

This is the Rice code that's used in the sort_n.c program.
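For concreteness, here's a bit-at-a-time sketch of an encoder and decoder for that code: a unary quotient (some ones and a zero) followed by a 12-bit remainder. It's my own illustration, and the little bitstream helpers are hypothetical; sort_n.c's actual bit I/O may be organized differently.

```c
/* My bit-at-a-time sketch of the code in the table: a unary quotient
   (q ones and a zero) followed by a 12-bit remainder.  Written for clarity;
   sort_n.c's own bit I/O may be organized differently. */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint8_t *buf; size_t pos; } bitstream;   /* MSB-first stream */

static void put_bit(bitstream *b, int bit)
{
    if (bit) b->buf[b->pos >> 3] |=  (uint8_t)(0x80 >> (b->pos & 7));
    else     b->buf[b->pos >> 3] &= (uint8_t)~(0x80 >> (b->pos & 7));
    b->pos++;
}

static int get_bit(bitstream *b)
{
    int bit = (b->buf[b->pos >> 3] >> (7 - (b->pos & 7))) & 1;
    b->pos++;
    return bit;
}

/* Encode gap s as (s / 4096) ones, a zero, then the low 12 bits of s. */
static void rice_encode(bitstream *b, uint32_t s)
{
    for (uint32_t q = s >> 12; q > 0; q--)
        put_bit(b, 1);
    put_bit(b, 0);
    for (int i = 11; i >= 0; i--)
        put_bit(b, (s >> i) & 1);
}

static uint32_t rice_decode(bitstream *b)
{
    uint32_t q = 0, r = 0;
    while (get_bit(b))          /* unary quotient */
        q++;
    for (int i = 0; i < 12; i++)
        r = (r << 1) | (uint32_t)get_bit(b);
    return (q << 12) | r;
}

int main(void)
{
    static uint8_t buf[64];                       /* zero-filled scratch space */
    bitstream w = { buf, 0 }, rd = { buf, 0 };
    uint32_t gaps[] = { 0, 1, 4095, 4096, 20000 };

    for (int i = 0; i < 5; i++)
        rice_encode(&w, gaps[i]);
    for (int i = 0; i < 5; i++)
        printf("%u ", (unsigned)rice_decode(&rd));
    printf("  (%zu bits total)\n", w.pos);        /* 13+13+13+14+17 = 70 bits */
    return 0;
}
```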

An important (if obvious) property of ibits() is

ibits(s) = int( s/a + b ) <= s/a + b

and so we have a guaranteed message length limit:

ibits(M) <= sum( s in M ) / a + len(M) * b <= (2^{32}-1)/4096 + 1000000 * 13 <= 14,048,576

which is half a million bits, or almost 4%, more than the message size assuming arithmetic coding. That's a loss of just over 1/2 bit per codeword.

The Golomb codes are a larger family of codes that includes the Rice codes
as a subset. A Rice code is the case of a Golomb
code where the *a* in the ibits() function (called
*M* in the Wikipedia article) is a whole power of
two.

Golomb codes are slightly more efficient for distributions that don't exactly match those of a Rice code. But it turns out, for reasons I'll leave as an exercise for the reader, that more-general Golomb codes don't help in our application; the maximum message size for a non-Rice Golomb code is actually greater than that for the best nearby Rice code, even though the expected size is smaller.

The sort_n.c program picks b
= 13 (in the program,
nb = 12) by iterating through all the possible *b*'s
(using *a* = 2^{b-1})
and picking the best; around b = 13 the minimum isn't very sharp:

2^{32}/2^{11} + 1,000,000*12 = 14,097,152 bits
2^{32}/2^{12} + 1,000,000*13 = 14,048,576 bits
2^{32}/2^{13} + 1,000,000*14 = 14,524,288 bits
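Here's my sketch of that scan (not the actual loop from sort_n.c): the guaranteed worst-case message size for each candidate b, with a = 2^{b-1}.

```c
/* My sketch of that scan (not the actual loop in sort_n.c): worst-case
   message size for each candidate b, with a = 2^(b-1). */
#include <stdio.h>

int main(void)
{
    double total = 4294967295.0;                  /* 2^32 - 1 */
    for (int b = 10; b <= 16; b++) {
        double a = (double)(1u << (b - 1));
        printf("b=%2d  a=%5.0f  worst case ~= %8.0f bits\n",
               b, a, total / a + 1000000.0 * b);  /* minimum lands at b = 13 */
    }
    return 0;
}
```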

We compared the problem to what coding theory says about minimizing the
*expected*, or average, size of a message, and the relationship between
nbits(*s*), the number of bits to encode a symbol *s*,
and P(*s*), the symbol's probability. Then we tried
to reconcile the mismatch between our problem
and the classic coding problem.

We observed that there's a finite set of possible messages, so that in theory we could encode each message with a single, huge number with a fixed number of bits, which would be the smallest possible maximum message size... which we calculated as a target to shoot for. Then we tried two ways to see how that theoretical message size could be built out of separate codewords.

One way was to do some thought experiments,
imagining extreme messages and related messages, and asking what must
be true for all those messages to have the same size. We broke
nbits(*s*) down into a fixed minimum and a variable part,
and found that the variable
part of the codeword length must be proportional to *s* itself,
in other words that nbits(*s*) is a linear function of s.

The second way was to try to find something that would work like the
probability in the coding-theory way, yet fit the problem we're actually
trying to solve. That turned out to be the fraction of possible messages
that have a given symbol at a given place, or how much of the problem
goes away once you encode that symbol. By way of a page full of
combinatorial expressions and some fudging, we decided that P(*s*)
is close to a geometric distribution, which also means that
nbits() is a linear function.

Next we joined the two branches, took the idea of a linear nbits(), and derived the optimum linear nbits() function, which would hit very close to the theoretical target-- if we could use fractional bits.

Finally we designed a similar code, which turns out to be a Rice code, that uses a whole number of bits per codeword, at a slight loss of efficiency, but good enough for the original problem.

...which is not at all the path I originally took. I was trying to calculate the maximum message sizes for various plausible codes and having a hard time, so I thought, why don't I work out an example where the maximum is easy to calculate, even if it's with a code that's obviously inappropriate, like unary code or a Rice code. And of course the answer was under 16 million bits. There may be a whole different or very simplified lesson in that.

--

(You may have got here on a digression from Squozen Sort).

--Steve Witham

Up to my home page.