
Foreword

Again, a talk for the best mathematics club of the multiverse, for pupils who do not yet study mathematics. This talk has very low prerequisites and is not intended for mathematicians, but for people who may become mathematicians. It is about basic coding theory, nothing too deep, but hopefully interesting anyway.

Checksums

It often happens that one needs to type in a number somewhere, and several things can go wrong when doing so. To prevent this, one often adds some redundancy to these numbers, so that it becomes less probable that a mistake goes unnoticed.

A famous example is the ISBN. It has ten digits x_1,\ldots,x_{10}, where the value x_{10}=10 is usually denoted by x_{10}=X. The rule for such an ISBN is that \sum\limits_{i=1}^{10} (11-i)x_i \equiv 0 \mod 11, which can always be achieved by choosing x_{10} correctly. The most common errors are single wrong digits and twists of two adjacent digits.
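
Just to make the rule concrete, here is a small sketch in Python (the helper names are mine, not part of any library): it checks the divisibility condition and computes the missing check digit.

  # An ISBN is given as a list of ten values 0..10, where 10 stands for X.

  def isbn10_weighted_sum(digits):
      return sum((11 - i) * x for i, x in enumerate(digits, start=1))

  def is_valid_isbn10(digits):
      # The ISBN rule: the weighted sum must be divisible by 11.
      return isbn10_weighted_sum(digits) % 11 == 0

  def check_digit(first_nine):
      # Choose x_10 such that the weighted sum becomes 0 modulo 11.
      s = sum((11 - i) * x for i, x in enumerate(first_nine, start=1))
      return (-s) % 11   # a value from 0..10, where 10 is printed as X

  # For example, for the nine digits 3,4,9,9,1,3,5,9,9 the check digit is 10,
  # so the full ISBN ends in X.
  assert check_digit([3, 4, 9, 9, 1, 3, 5, 9, 9]) == 10
  assert is_valid_isbn10([3, 4, 9, 9, 1, 3, 5, 9, 9, 10])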

Proposition: Two distinct ISBNs differ in at least two digits.

Proof: Assume not, and consider two ISBNs (x_i)_i and (y_i)_i such that x_i = y_i except for exactly one index i=k. Then (\sum\limits_{i=1}^{10} (11-i)x_i)-(\sum\limits_{i=1}^{10} (11-i)y_i) = (11-k)(x_k-y_k). Notice that this number cannot be divisible by 11: both factors are nonzero and have absolute value smaller than 11, and 11 is a prime number. But the difference of two 11-divisible numbers must itself be divisible by 11. Contradiction. --

Thus, if exactly one digit is wrong, the checksum no longer matches and we notice the error (and if we additionally know which digit is wrong, say because it is illegible, we can even recover it). Of course, if more than one digit is wrong, this does not help. For twists of two adjacent digits, we have

Proposition: A valid ISBN cannot be turned into a different valid ISBN by twisting two adjacent digits.

Proof: Again, consider two ISBNs (x_i)_i and (y_i)_i such that x_i = y_i for all i except for i\in\{k,k+1\}, for which we have y_k = x_{k+1} and y_{k+1} = x_k. Then again, |(\sum\limits_{i=1}^{10} (11-i)x_i)-(\sum\limits_{i=1}^{10} (11-i)y_i)| = |(11-k) x_k + (11-k-1) x_{k+1} - (11-k-1) x_k - (11 - k) x_{k+1}| = |x_{k+1} - x_k|. Since the two ISBNs are distinct, x_k \neq x_{k+1}, so this difference is nonzero and smaller than 11, and can therefore not be divisible by 11. Contradiction. --

While it is pretty easy to include such an error-detecting code, most numbers humans use directly in real life do not have one, probably because people do not know that such codes exist. (At least I doubt that coding theory is a part of most engineering curricula.) And while ISBNs are widely used, the check digit is of course only a minor feature; for example, you usually also have the name of the book given.

One widely used example from technology are RAID systems, Redundant Arrays of Inexpensive Disks. The problem with hard disks is that they contain a lot of delicate technology, and their lifetime is limited. While on personal computers it is sufficient to make regular backups, in data centers you need more reliability: You do not want long pauses or a complicated recovery procedure if one of your disks fails.

RAID systems work on a very low level of data, below the file system and the partition table: they work with the bit sequences, the "zeroes and ones", that are saved on the disks. One drawback of this is that the disks must be of equal size; in practice, they are usually identical in construction.

The easiest way to do this is to just "mirror" a disk: You work with two disks that always save exactly the same thing, and if one disk fails, you just replace it and copy the data from the other. So at every moment, you have two bit sequences (a_i)_i,(b_i)_i\in\{0,1\}^m, where m is the number of bits that can be saved on each disk, and where a_i = b_i for all i. If, say, the a-disk breaks, the system continues to function using only the b-disk, and as soon as a new a-disk is inserted, the whole content of b is copied over. This is called RAID 1. You need twice as many disks to gain security.
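
As a toy model (nothing like a real RAID controller, of course, just to illustrate the invariant), mirroring could be pictured in Python like this:

  # Toy model of RAID 1: every write goes to both disks,
  # a read can be served by whichever disk still works.
  DISK_SIZE = 16
  a = [0] * DISK_SIZE
  b = [0] * DISK_SIZE

  def write(i, bit):
      a[i] = bit
      b[i] = bit            # maintain the invariant a_i = b_i

  def read(i, a_broken=False):
      return b[i] if a_broken else a[i]

  write(3, 1)
  assert read(3) == read(3, a_broken=True) == 1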

But we can do better, using parity checks. Let us say you have three disks a,b,c, again with bit sequences (a_i)_i,(b_i)_i,(c_i)_i\in\{0,1\}^m. We now define a "virtual" bit sequence (s_i)_i\in\{0,1\}^{2m} such that s_{2i} = a_i and s_{2i+1}=b_i for all i, thus "merging" the bit sequences (a_i)_i and (b_i)_i. The computer works on this bit sequence (s_i)_i, thus seeing a virtual hard disk twice the size of one of the three disks we used. For the third disk, we require that it satisfies c_i = a_i + b_i \mod 2 = \left\{\begin{array}{cl} 0 & \mbox{ for } a_i = b_i \\ 1 & \mbox{ for } a_i \neq b_i \end{array}\right., and the computer maintains this invariant. Now, if c breaks, it can just be recalculated from a and b. The more interesting part is when one of the other disks breaks. Assume a breaks, but b and c still work. Whenever we want to read s_{2i+1}, we can just read b_i. If we want to read s_{2i}, we have to recover the value of a_i. But we know that a_i + b_i \equiv c_i \mod 2, and therefore a_i \equiv c_i - b_i \mod 2. Since 1 \equiv -1 \mod 2, this is the same as b_i + c_i \mod 2, and therefore we can recalculate a_i = b_i + c_i \mod 2, as long as we still have b and c. Compared to simple mirroring, where we need twice as much storage, here we only need 1.5 times as much, but of course, only one of the three disks may fail. c is called the parity disk, because it saves the parities of the sums of the corresponding bits of the other disks. In the same way, you can use more than three disks and recover transparently if exactly one of them fails. Which scheme you choose depends on how much you actually trust your disks. RAID systems are a way of making data loss less probable, but not impossible.
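
Again as a toy model in Python (lists of bits instead of real disks), the parity idea and the recovery of a from b and c look like this:

  # Toy model of the parity scheme: c_i = a_i + b_i mod 2 (i.e. XOR).
  a = [1, 0, 1, 1, 0]
  b = [0, 0, 1, 0, 1]
  c = [ai ^ bi for ai, bi in zip(a, b)]        # the parity disk

  # The "virtual" disk interleaves a and b: s = a_0, b_0, a_1, b_1, ...
  s = [bit for pair in zip(a, b) for bit in pair]

  # Now assume disk a is lost: it can be recovered bit by bit from b and c,
  # because a_i = b_i + c_i mod 2.
  recovered_a = [bi ^ ci for bi, ci in zip(b, c)]
  assert recovered_a == a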

Parity checks are simple, but they have one major disadvantage: You have to know which of the disks is broken. This is the case when some essential part of a disk fails entirely. However, a disk might as well just return wrong data for parts of its surface. In that case one can use Hamming codes, which scale well for large numbers of bits, but not for small ones. Therefore, in practice, they are rarely used for RAID systems, since there are better possibilities for this particular use case of only a few disks with random access. The practical use case of Hamming codes is sending data over an unreliable channel, for example through atmospheric disturbances.

The central idea is to cover subsets of the data bits by multiple parity bits. Assume we have 15 bits (b_i)_{0 < i \le 15}. We look at the binary representations of the indices i, from 0001_{(2)} to 1111_{(2)}. Every b_i such that i has exactly one 1 in its binary representation is used as a parity bit; thus, we have the 4 parity bits b_1,b_2,b_4,b_8. All the other 11 bits are data bits, so we have an overhead of \frac{4}{15}. The parity bit b_i now stores the parity of the sum of all bits b_j such that the binary digit that is set in i is also set in j. For example, b_2=b_{0010_{(2)}} is the parity of the sum of b_{0011_{(2)}}=b_3, b_{0110_{(2)}}=b_6,b_{0111_{(2)}}=b_7,b_{1010_{(2)}}=b_{10},b_{1011_{(2)}}=b_{11},b_{1110_{(2)}}=b_{14},b_{1111_{(2)}}=b_{15}: all these indices have the second-least significant binary digit set. Therefore, assuming at most one bit flipped, knowing which of the parity bits are wrong tells us which bit is wrong: If only one parity bit is wrong, then that parity bit itself must have flipped. If more than one parity bit is wrong, the positions of the wrong parity bits add up to the binary representation of the index of the flipped bit. The same works for other bit numbers: In general, by the same procedure, we can recognize and correct a single bit failure in a chunk of 2^m-1 bits by using m bits as parity bits and 2^m-m-1 bits as data bits.
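
Here is a sketch of the 15-bit scheme just described, in Python (the helper names and the example data are mine):

  # Bits are indexed 1..15; index 0 is unused. Positions 1, 2, 4, 8 are
  # parity bits, all other positions carry data.
  PARITY = (1, 2, 4, 8)

  def fill_parities(bits):
      # Each parity bit b_p is the parity of the sum of all other bits
      # whose index has the binary digit p set.
      for p in PARITY:
          bits[p] = sum(bits[i] for i in range(1, 16) if (i & p) and i != p) % 2

  def find_error(bits):
      # Recompute which parity groups are inconsistent; the positions of the
      # failing parity bits add up to the index of the flipped bit (0 = none).
      return sum(p for p in PARITY
                 if sum(bits[i] for i in range(1, 16) if i & p) % 2 != 0)

  # Usage: set the 11 data bits, compute the parity bits, flip one bit, find it.
  bits = [0] * 16
  data_positions = [i for i in range(1, 16) if i not in PARITY]
  for pos, bit in zip(data_positions, [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0]):
      bits[pos] = bit
  fill_parities(bits)
  bits[11] ^= 1                    # simulate a single bit flip
  assert find_error(bits) == 11
  bits[find_error(bits)] ^= 1      # and correct it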

While error correction might sound desirable, in cryptographic applications it is often desirable to have error detection, but no possibility of error correction. A common problem for webservers is that they must store the users' passwords in some secure way. On the other hand, no server is absolutely secure, and if the data is leaked, an attacker can read the users' passwords. This is bad because users tend to use the same passwords for many services, and if such a leak goes unnoticed, the thief can use the cleartext passwords with the usual logins. Therefore, only cryptographic checksums of the passwords should be stored. Famous examples of such checksums are md5 (which is obsolete for this purpose) and sha512.
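
For illustration, in Python this could look as follows (a bare sha512 as shown here is a simplification; real systems additionally use salts and deliberately slow hash functions, but the principle is the same):

  import hashlib

  def hash_password(password):
      # Store only this value, never the cleartext password.
      return hashlib.sha512(password.encode("utf-8")).hexdigest()

  def check_password(password, stored_hash):
      # The server can verify a login attempt without being able to
      # reconstruct the password from the stored hash.
      return hash_password(password) == stored_hash

  stored = hash_password("correct horse battery staple")
  assert check_password("correct horse battery staple", stored)
  assert not check_password("12345", stored)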

Compression

Most data we work with contains a lot of redundancy. From a purely theoretical point of view, for every machine model and every chunk of data there exists a minimal representation (see for example Kolmogorov complexity). However, this minimal representation can usually not be calculated. In practice, approximations are used. For some special purposes, lossy compression can be used, for example Ogg Vorbis and Theora for audio and video compression. We will only look at lossless compression here: The data can be recovered exactly as it was before.

The easiest such compression method is run-length encoding, RLE. In its simplest form, it saves repeated sequences of characters only once, together with their multiplicity. For example, many graphics contain a lot of single-colored background, that is, a lot of equally colored pixels. As a simple example, say we have sequences of bytes (values from 0 to 255) in which long subsequences consisting of zeroes occur. We can then decide that instead of saving a run of zeroes, we save only one zero and let the following byte denote the number of additional zeroes that follow it. That is, if we have 0,0,0,0,0,0,0,0,0,0,0, we can instead save 0,10. This way, we save a sequence of 256 consecutive zeroes with only two bytes. On the other hand, if we have only one zero to save, we need two zeroes instead of just one. For example, the sequence 0,1,0,1,0,1,0 becomes 0,0,1,0,0,1,0,0,1,0,0, which needs more memory. A simple pigeonhole argument shows that this must be the case in every lossless compression format: There must be datasets that need more memory after compression. It is up to the engineer to decide which format to use.
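
The zero-run scheme from this example can be written down directly; here is a sketch in Python (real RLE formats differ in details such as the maximal run length):

  def rle_encode(data):
      # A zero is saved as 0 followed by the number of *additional*
      # zeroes (0..255) that belong to the same run.
      out, i = [], 0
      while i < len(data):
          if data[i] == 0:
              run = 1
              while i + run < len(data) and data[i + run] == 0 and run < 256:
                  run += 1
              out += [0, run - 1]
              i += run
          else:
              out.append(data[i])
              i += 1
      return out

  def rle_decode(data):
      out, i = [], 0
      while i < len(data):
          if data[i] == 0:
              out += [0] * (data[i + 1] + 1)
              i += 2
          else:
              out.append(data[i])
              i += 1
      return out

  assert rle_encode([0] * 11) == [0, 10]
  assert rle_encode([0, 1, 0, 1, 0, 1, 0]) == [0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
  assert rle_decode(rle_encode([0] * 256)) == [0] * 256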

Things get a bit more complicated when we allow backreferences: It is often the case that a text contains repetitions. In many programming languages, for example, there are a lot of import or include commands at the beginning of a file. In addition to saving literal bytes, you can also save a backreference, which is a pair (d,l), where d is the distance, the number of bytes we have to jump back to reach the beginning of the repeated string, and l is the length of the backreference. For example, say we have the string 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 1, 2, 3, 4, 7, 8, 1 (we do not care about the actual representation of that file "on the metal", since this is a more sophisticated problem and highly depends on the underlying system). Then we could save this as 1, 2, 3, 4, 5, 6, 7, 8, (8, 5), (5, 4), (11, 3). Of course, as before, we need reserved bytes to denote these backreferences, and therefore there are situations in which this form of compression even hurts. In practice, however, this method is very useful and often gains a lot of compression. In particular, modern data formats like XML contain a lot of redundancy, which has some advantages when working with them locally, but is a waste of bandwidth when sending them, and here this method can be very efficient. Of course, for every string there exists a minimal representation with backreferences. However, it is not practical to calculate it, even for short strings; but the intuitive "linear" algorithm, just reading the string from left to right and maintaining a dictionary of the strings seen so far, usually provides sufficient results.
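
Here is a sketch of such a left-to-right encoder in Python; it is greedy, only uses backreferences of length at least 3 (an arbitrary choice of mine), and certainly does not find the minimal representation, but it reproduces the example above:

  def encode(data, min_len=3):
      # Greedy scan: at each position, look for the longest earlier match.
      out, i = [], 0
      while i < len(data):
          best_d, best_l = 0, 0
          for d in range(1, i + 1):
              l = 0
              while i + l < len(data) and data[i + l - d] == data[i + l]:
                  l += 1
              if l > best_l:
                  best_d, best_l = d, l
          if best_l >= min_len:
              out.append((best_d, best_l))   # a backreference (d, l)
              i += best_l
          else:
              out.append(data[i])            # a literal byte
              i += 1
      return out

  def decode(code):
      out = []
      for item in code:
          if isinstance(item, tuple):        # a backreference
              d, l = item
              for _ in range(l):
                  out.append(out[-d])
          else:                              # a literal byte
              out.append(item)
      return out

  s = [1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 1, 2, 3, 4, 7, 8, 1]
  assert encode(s) == [1, 2, 3, 4, 5, 6, 7, 8, (8, 5), (5, 4), (11, 3)]
  assert decode(encode(s)) == s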

While RLE and backreferences exploit repetitions, it is also often the case that only a small subset of the possible bytes is used at all, or at least some bytes are used very seldom compared to others. Therefore it is plausible to increase the amount of memory needed for the less frequently used bytes, and in exchange decrease the amount for the more frequently used ones. There are several ways to do this. The one we want to look at here is the use of Huffman codes. Firstly, we are given some alphabet. Even in practice, this alphabet need not fit into a byte; it may contain more than 256 characters. So say we have an alphabet {\cal A}=\{a_0, \ldots, a_{m-1}\}, and a sequence of characters (b_i)_i \in {\cal A}^k of length k. Now let \#(a_i):=|\{j\mid b_j=a_i\}| be the number of times the character a_i occurs in (b_i)_i. As we want to save our string as a sequence of bits, we have to map every character to a unique bit sequence, in such a way that no ambiguity occurs when reading the whole bit sequence from left to right. The code must be prefix free: The code of one character is not allowed to be a prefix of the code of another one; that is, we can form a binary tree such that each leaf is mapped to a character. If l(a_i) denotes the (not yet known) lengths of the Huffman codes, we want to minimize the number of bits required to save our string, and this number is \sum\limits_i (l(a_i)\cdot\#(a_i)). The following procedure creates a tree which tries to do this:

  • Step 1: Begin with the set {\cal S} = \{(\left<a_i\right>, \#(a_i))\}_i of pairs of leaf nodes labelled with characters, and their frequencies.
  • Step 2: If \cal S contains only one element, return it: It is your Huffman tree. If \cal S contains at least two elements, let (q,h_q),(r,h_r) be the two pairs with the lowest frequencies h_q,h_r, and set {\cal S} = ({\cal S} \backslash \{(q,h_q),(r,h_r)\}) \cup \{(\left<q,r\right>,h_q+h_r)\}, thus replacing the two trees by a new, taller tree.
  • Step 3: Proceed at Step 2 with the new \cal S.

It is easy to see that if the frequencies are powers of two, this produces an optimal code.
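
Here is a sketch of this procedure in Python, representing a leaf by its character and an inner node \left<q,r\right> by a pair (the sorted list instead of a proper priority queue is just for brevity):

  from collections import Counter

  def huffman_tree(text):
      # S is a list of pairs (tree, frequency); a leaf is just a character,
      # an inner node is a pair (left subtree, right subtree).
      S = list(Counter(text).items())
      while len(S) > 1:
          S.sort(key=lambda pair: pair[1])
          (q, hq), (r, hr) = S[0], S[1]        # the two lowest frequencies
          S = S[2:] + [((q, r), hq + hr)]      # replace them by a taller tree
      return S[0][0]

  def code_table(tree, prefix=""):
      # Walk the tree: going left appends a 0, going right appends a 1.
      if isinstance(tree, tuple):
          left, right = tree
          table = code_table(left, prefix + "0")
          table.update(code_table(right, prefix + "1"))
          return table
      return {tree: prefix or "0"}

  codes = code_table(huffman_tree("abracadabra"))
  # The most frequent character gets the shortest code.
  assert all(len(codes["a"]) <= len(codes[c]) for c in "brcd")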

Huffman codes are not unique, as we can "mirror" subtrees of the code. However, with some additional restrictions, we can make them unique: Shorter codes must lexicographically precede longer codes. Codes of the same length must have the same lexicographic ordering as the characters they encode. And the whole coding must be lexicographically dense and contain a codeword that consists only of zeroes. For example, consider the Huffman code A \rightarrow 00, B\rightarrow 1, C\rightarrow 011, D\rightarrow 010. Here 1 is shorter than 00, but does not lexicographically precede it. And 010 lexicographically precedes 011, but D does not precede C. The corresponding canonical Huffman code, satisfying all the properties, is A\rightarrow 10, B\rightarrow 0, C\rightarrow 110, D\rightarrow 111. It is easy to show (and left as an exercise) that every Huffman code can be canonicalized.
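
As a sketch of that exercise, the usual way to canonicalize is to keep only the code lengths and then hand out codewords in order; in Python (the helper name is mine, and I use alphabetical order for the characters):

  def canonicalize(code):
      # code: dictionary character -> bit string. Keep only the lengths,
      # then reassign codewords in order: shorter codes first, characters
      # of equal code length in their own (alphabetical) order.
      symbols = sorted(code, key=lambda c: (len(code[c]), c))
      canonical, value, prev_len = {}, 0, len(code[symbols[0]])
      for c in symbols:
          value <<= len(code[c]) - prev_len    # append zeroes if the length grows
          canonical[c] = format(value, "0{}b".format(len(code[c])))
          prev_len = len(code[c])
          value += 1
      return canonical

  code = {"A": "00", "B": "1", "C": "011", "D": "010"}
  assert canonicalize(code) == {"A": "10", "B": "0", "C": "110", "D": "111"}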