## Huffman codes, language, and mythical creatures

Let’s say you’re a telegraph operator, and you can send bits of 0 or 1 down the line one at a time. Each bit has a cost associated with it; the longer the message the more you have to pay. To simplify things, let’s say the alphabet only consists of A, B, C, and D and that they occur with frequencies:

 A 50% B 20% C 15% D 15%

While these numbers are of course made up, it is true that different letters of the alphabet occur with different frequencies.

How then should we encode our messages? A naive method would be to use two bits for each letter, since we have four letters total and 2² = 4. Then we would have perhaps

 A 00 B 01 C 10 D 11

But remember, it costs us a little to send each bit. Can we do better?

We can, using what’s called Huffman coding. The general procedure is to take the most common letter (in this case A), and give it a code that we can unambiguously terminate on. So we’ll give it the code 0, so if we ever get a message that starts with 0, we immediately know it’s an A and we can start afresh figuring out the next letter (in our previous naive encoding, if we received a 0 we wouldn’t know if it were an A or a B before getting the next bit).

I won’t go into the algorithm to generate Huffman codes (see the Wikipedia page or an algorithms textbook for the full details; it’s not complicated) and will just give the results:

 A 0 B 10 C 110 D 111

On the face of it this doesn’t look promising. Whereas before we only ever had to send two digits, now we sometimes have to send three! But remember that those are the least frequent letters in our alphabet. Can we get an actual measure of how long a “typical” message will be? The answer is yes, by multiply the frequencies of each letter by their length in bits:

$\displaystyle 0.5 \times 1 \text{ bit} + 0.2 \times 2\text{ bits} + 2(0.15 \times 3\text{ bits}) = 1.8\text{ bits}$

Which is better than our naive encoding of 2 bits. Try decoding the following message (solution at the end of the post):

010101100111

The general philosophy behind Huffman coding is to make the common short and the uncommon long. If we were designing a language, we would want the most common words to be the shortest, and to some extent that’s how English operates! The most common words are almost all one syllable long. Just imagine if instead of “the” we used “Brobdingnagian” and instead of “a” we used “floccinaucinihilipilification.” It would take forever to say anything at all.

All this is to bring me to the observation that led me to write today’s post: why are the names of some mythical creatures so high up in English’s Huffman coding? Elf, dwarf, troll, ghost, wight… or if we allow ourselves two syllables dragon, ogre, werewolf, vampire… I think it says something colorful about the English-speaking peoples that certain mythical creatures are so common as to warrant short names!

And other peoples of course. The inspiration for all of this was that I saw that the Japanese term for dragon was “ryuu” and thought “That’s an awfully short word for a made-up creature.”