Fri, 13 Dec 2013 00:58:50 GMT

Again, a talk for the best mathematics club of the multiverse, for the pupils who do not yet study mathematics. This talk has very low requirements and is not thought for mathematicians, but for people who may become mathematicians. It is about basic coding theory, nothing too deep, but hopefully interesting anyway.

It happens often that one needs to type in a number somewhere. Several things can go wrong. To prevent this, one often adds some redundancy to these numbers, so it is less probable to make a mistake without noticing.

A famous example are ISBNs. They have ten digits where is usually denoted as . The rule for such an ISBN is now that , which can always be achieved by choosing correctly. The most common errors are single wrong digits, and digit-twists.

**Proposition:** Two distinct ISBNs differ in at least two digits.

**Proof:** Assume not. Consider the two ISBNs and
such that except for exactly one
. Then ,
but notice that this number cannot be divisible by , since
all of its factors are smaller than and is a prime
number. But the difference of -divisible numbers must be
-divisible. Contradiction. --

Thus, if we got only one digit wrong, we can recover the original code. Of course, if more than one digit is wrong, this does not help. For twists of two adjacent digits, we have

**Proposition:** Two distinct ISBNs cannot differ only in two adjacent
digits.

**Proof:** Again, consider the two ISBNs and
such that for all except for
, for which we have and vice
versa. Then again,
and that can again not be divisible by 11. Contradiction. --

While it is pretty easy to include such an error correcting code, most numbers humans use directly in real life do not use one, probably because people do not know that they exist. (At least I doubt that coding theory is a part of most engineering studies.) While ISBN Codes are widely used, this is of course only a minor feature, and for example, you usually have the name of the book given.

One example from technology that is widely used are the RAID-Systems, Redundant Arrays of Inexpensive Disks. The problem with hard disks is that they have a lot of pretty small technology, and their lifetime is limited. While on personal computers, it is sufficient to make regular backups, in data centers you need more reliability: You do not want to have long pauses or a complicated procedure if one of your disks fails.

The RAID-Systems work on a very low level of data, below the file system and partition table: RAID-Systems work with the bit-sequence, the "zeroes and ones", that are saved on the disk. Therefore, as one drawback, the disks must be of equal size. In practice, they are usually identical in construction.

The easiest way to do this is to just "mirror" a disk: You work with two disks that always save exactly the same thing, and if one disk fails, you just replace it and copy the data from the others. So at every time, you have two bit sequences and , where is the number of bits that can be saved on that disks, and where . If, say, the -disk breaks, the system continues to function, only using the -disk, and as soon as a new -disk is inserted, the whole content of is copied. This is called RAID 1. You need twice as much disks to gain security.

But we can do better, using *parity checks*. Let us say you have three
disks , and again with the bit sequences
. We now define a "virtual"
bit sequence such that and for all , thus
"merging" the bit sequences and . The
computer works on this bit sequence , thus seeing a
virtual hard disk twice the size of one of the three disks we
used. For the third disk, we define that it has to satisfy . The computer
maintains this invariant. Now, if breaks, it can just be
recalculated from and . The more interesting part is
when one of the other disks breaks. Assume breaks, but
and still work. As long as we want to read
, we can just read . If we want to read
, we have to recover the value of . But we know
that , therefore . Since , this is the same as
, and therefore, we can recalculate , as long as we have and
. Compared to simply mirroring, where we need twice as much
memory, here we only need times as much memory, but of
course, only one out of your three disks may fail. is called
the *parity disk*, because it saves the parities of the sums of the
bits of the other disks. In the same way, you can use more than three
disks and recover transparently if exactly one of them fails. Which
one you choose depends on how much you actually trust your disks. RAID
Systems are a way of making data loss less probable, but not
impossible.

Parity checks are simple, but they have one major disadvantage: You
have to know which of the disks is broken. This is the fact when some
essential part of a disk is broken. However, you might as well just
have parts of a disk returning wrong data. You can then use
*Hamming-codes*, which scale for large bit numbers, but not for small
ones. Therefore, in practice, they are rarely used for RAID systems,
since there are better possibilities in this particular use case of
only few disks on which you have random access. The practical use-case
of Hamming-codes is for sending data over an unreliable channel, for
example through atmospheric disturbances. The central idea is to cover
subsets of the data bits by multiple parity bits. Assume we have
bits . We look at the binary
representation of the indices , from to
. Every such that has exactly one
in its binary representation is used as a parity bit, thus,
we have the parity bits . All the other
bits are data bits, so we have an overhead of
. Now the parity bit describes the
parity of the sum of all bits with the binary digit that is set in
is also set. For example, is the
sum of ,,,,,,, all the indices have the second-least
binary digits set. Therefore, by knowing which of the parity bits are
wrong, we know which bit is wrong, assuming at most one bit flipped:
If only one parity bit is wrong, the bit itself must be wrong. If more
than one parity bit wrong, we can restore the binary representation of
the index of the flipped bit. Of course, the same works for other bit
numbers. Hamming codes also work for higher bit numbers. In general,
by the same procedure, we can recognize a single bit failure in a
chunk of bits by using bits as parity bits, and
bits as data bits.

While having error correction might sound desirable, in cryptographic
applications it is often desirable to have error-detection, but no
possibility of error-correction. A common problem on webservers is
that it must store the users' passwords in some secure way. On the
other hand, no server is absolutely secure, and if data is leaked, the
hacker can read the users' passwords. On the one hand, this is bad
because users tend to use the same passwords for many services, on the
other hand, if such leaks are not noticed, the thief can use the
cleartext passwords and use the usual logins. In general,
*Cryptographic checksums* of the passwords should be used. Famous
examples are md5 (which is obsolete for this purpose) and sha512.

Most data we work with contains a lot of redundancy. From a purely
theoretical point of view, for every machine model and every chunk of
data, there exists a minimal representation (see for example
Kolmogorov Complexity). However,
this minimal representation can usually not be calculated. In
practice, approximations are used. For some special purposes, there
can be *lossy* compression, for example, ogg vorbis and theora for
audio and video compression. We will only look at *lossless*
compression: The data can be recovered as it was before.

The easiest such compression method is *run-length encoding*, RLE. In
its simplest form, it is basically saving repetitive sequences of
characters only once with their multiplicity. For example, in many
graphics, you have a lot of singly-colored background, that is, a lot
of equally colored pixels. As a simple example, say we have sequences
of bytes (values from to ), in which long
subsequences consisting of zeroes occur. We can then say that instead
of saving the sequence of zeroes, we save only one zero and let the
following byte denote the number of zeroes that follow this one. That
is, if we have , we can instead save
. We then save a sequence of consecutive zeroes
with only two bytes. On the other hand, if we have only one zero to
save, we need two zeroes instead of just one. For example, the
sequence becomes ,
which needs more memory. A simple pigeonhole argument shows that this
must be the case in **every** lossless compression format: There
**must** be datasets that need more memory after compression. It is up
to the engineer to decide which format to use.

Things get a bit more complicated when we allow *backreferences*: It
is often the case that a text contains repetitions. In many
programming languages, for example, in the beginning, there are a lot
of *import* or *include* commands. In addition to saving your bytes,
you can also save a backreference, which is a pair , where
is the distance, the number of bytes we have to overjump
until we have the beginning of the repeated string, and is
the length of the backreference. For example, let's say we have the
string (we do not care about the actual representation of that file "on
the metal", since this is a more sophisticated problem and highly
depends on the underlying system). Then we could save this as . Of course, as before,
we need reserved bytes to denote these backreferences, and therefore,
there are situations in which this form of compression is even
bad. However, in practice, this method is very useful and often gains
a lot of compression. Especially, modern data formats like
XML employ a lot of redundancy, which
has some advantages when working with them locally, but is a waste of
bandwidth when sending them, and here it can be very efficient. Of
course, for every string, there exists a minimal representation with
backreferences. However, it is not practical to calculate it, even for
short strings, but the intuitive "linear" algorithm, just reading the
string from left to right and maintaining a dictionary usually
provides sufficient results.

While RLE tries to remove duplicate substrings, it is also often the
case that only a small amount of actual possible bytes are used, or at
least some bytes are used very seldom, compared to others. Therefore
it is plausible to increase the amount of memory needed for the less
frequently used bytes, and in exchange decrease the amount for the
more frequently used. There are several ways to do this. The one we
want to look at here is the use of *Huffman codings*. Firstly, we are
given some *alphabet*. Even in practice, this alphabet needs not to
fit into a byte, it may contain more than characters. So
say we have an alphabet , and
a sequence of characters of length
. Now let be the number of
times the character occurs in . As we want to
save our string as sequence of bits, we have to map every character on
a unique bit sequence, in a way that when reading the whole bit
sequence from left to right, no ambiguousity occurs. The code must be
*prefix free*: No sequence of one character is allowed to be the
prefix of another one, that is, we can form a binary tree such that
each branch is mapped to a character. If denotes the
(not yet known) lengths of the huffman codes, we want to minimize the
number of bits required to save our string, and this number is
. The following procedure
creates a tree which tries to do this:

- Step 1: Begin with the set of pairs of leaf nodes labelled with characters, and their frequency.
- Step 2: If contains only one element, return it: It is your huffman tree. If contains at least two elements, let be the two pairs with the lowest frequencies , and set , thus, replacing the two trees by a new higher tree.
- Step 3: Proceed at Step 1 with the new .

It is easy to see that if the frequencies are powers of two, this produces an optimal code.

Huffman codes are not unique, as we can "mirror" subtrees of the code. However, with some additional restrictions, we can make it unique. Shorter codes must lexicographically precede longer codes. Codes of the same length must have the same lexicographic ordering as the characters they order. And the whole coding must be lexicographically dense, and contain a string that only contains zeroes. For example, consider the huffman code . is shorter than , but does not lexicographically precede it. lexicographically precedes , however, does not precede . The corresponding canonical Huffman code, satisfying all the properties, is . It is easy to show (and left as an exercise) that each huffman code can be canonicalized.