Again, a talk for the best mathematics club of the multiverse, for the pupils who do not yet study mathematics. This talk has very low requirements and is not intended for mathematicians, but for people who may become mathematicians. It is about basic coding theory, nothing too deep, but hopefully interesting anyway.
It happens often that one needs to type in a number somewhere. Several things can go wrong. To prevent this, one often adds some redundancy to these numbers, so it is less probable to make a mistake without noticing.
A famous example is the ISBN. ISBNs have ten digits $x_1, \dots, x_{10}$, where $x_{10} = 10$ is usually denoted as $X$. The rule for such an ISBN is now that $\sum_{i=1}^{10} i \cdot x_i \equiv 0 \pmod{11}$, which can always be achieved by choosing $x_{10} \in \{0, \dots, 10\}$ correctly. The most common errors are single wrong digits, and digit-twists.
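As a minimal sketch, the check rule can be stated in Python (the function name is ours; the digit $X$ stands for $10$):

```python
def isbn10_valid(isbn):
    """Check the ISBN-10 rule: the sum of i * x_i (i = 1..10) must be divisible by 11."""
    digits = [10 if ch == "X" else int(ch) for ch in isbn if ch not in "- "]
    return len(digits) == 10 and sum(i * x for i, x in enumerate(digits, 1)) % 11 == 0
```

For example, `isbn10_valid("0-306-40615-2")` returns `True`, while altering any single digit makes it return `False`.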
Proposition: Two distinct ISBNs differ in at least two digits.
Proof: Assume not. Consider two ISBNs $x_1, \dots, x_{10}$ and $y_1, \dots, y_{10}$ such that $x_i = y_i$ except for exactly one index $j$. Then $\sum_{i=1}^{10} i \cdot x_i - \sum_{i=1}^{10} i \cdot y_i = j \cdot (x_j - y_j)$, but notice that this number cannot be divisible by $11$, since all of its factors are smaller than $11$ and $11$ is a prime number. But the difference of $11$-divisible numbers must be $11$-divisible. Contradiction. --
Thus, if we got only one digit wrong, we can recover the original code. Of course, if more than one digit is wrong, this does not help. For twists of two adjacent digits, we have
Proposition: Two distinct ISBNs cannot differ only in two adjacent digits.
Proof: Again, consider two ISBNs $x_1, \dots, x_{10}$ and $y_1, \dots, y_{10}$ such that $x_i = y_i$ for all $i$ except for $i \in \{j, j+1\}$, for which we have $x_j = y_{j+1}$ and vice versa. Then again, $\sum_{i=1}^{10} i \cdot x_i - \sum_{i=1}^{10} i \cdot y_i = j \cdot x_j + (j+1) \cdot x_{j+1} - j \cdot x_{j+1} - (j+1) \cdot x_j = x_{j+1} - x_j$, and that can again not be divisible by $11$, since the ISBNs are distinct and $|x_{j+1} - x_j| \le 10$. Contradiction. --
While it is pretty easy to include such an error-correcting code, most numbers humans use directly in real life do not use one, probably because people do not know that such codes exist. (At least I doubt that coding theory is a part of most engineering curricula.) While ISBN codes are widely used, this is of course only a minor feature, and for example, you usually have the name of the book given as well.
One example from technology that is widely used are the RAID-Systems, Redundant Arrays of Inexpensive Disks. The problem with hard disks is that they have a lot of pretty small technology, and their lifetime is limited. While on personal computers, it is sufficient to make regular backups, in data centers you need more reliability: You do not want to have long pauses or a complicated procedure if one of your disks fails.
The RAID-Systems work on a very low level of data, below the file system and partition table: RAID-Systems work with the bit-sequence, the "zeroes and ones", that are saved on the disk. Therefore, as one drawback, the disks must be of equal size. In practice, they are usually identical in construction.
The easiest way to do this is to just "mirror" a disk: You work with two disks that always save exactly the same thing, and if one disk fails, you just replace it and copy the data from the other. So at every time, you have two bit sequences $(a_i)_{i<n}$ and $(b_i)_{i<n}$, where $n$ is the number of bits that can be saved on such a disk, and where $a_i = b_i$ for all $i$. If, say, the $a$-disk breaks, the system continues to function, only using the $b$-disk, and as soon as a new $a$-disk is inserted, the whole content of $(b_i)_{i<n}$ is copied. This is called RAID 1. You need twice as many disks to gain security.
But we can do better, using parity checks. Let us say you have three disks $A$, $B$ and $C$, again with the bit sequences $(a_i)_{i<n}$, $(b_i)_{i<n}$ and $(c_i)_{i<n}$. We now define a "virtual" bit sequence $(v_i)_{i<2n}$ such that $v_{2i} = a_i$ and $v_{2i+1} = b_i$ for all $i < n$, thus "merging" the bit sequences $(a_i)_i$ and $(b_i)_i$. The computer works on this bit sequence $(v_i)_{i<2n}$, thus seeing a virtual hard disk twice the size of one of the three disks we used. For the third disk, we define that it has to satisfy $c_i \equiv a_i + b_i \pmod 2$. The computer maintains this invariant. Now, if $C$ breaks, it can just be recalculated from $A$ and $B$. The more interesting part is when one of the other disks breaks. Assume $A$ breaks, but $B$ and $C$ still work. As long as we want to read some $b_i$, we can just read it. If we want to read some $a_i$, we have to recover its value. But we know that $c_i \equiv a_i + b_i \pmod 2$, therefore $a_i \equiv c_i - b_i \pmod 2$. Since $-b_i \equiv b_i \pmod 2$, this is the same as $a_i \equiv c_i + b_i \pmod 2$, and therefore, we can recalculate $a_i$, as long as we have $b_i$ and $c_i$. Compared to simply mirroring, where we need twice as much memory, here we only need $1.5$ times as much memory, but of course, only one out of your three disks may fail. $C$ is called the parity disk, because it saves the parities of the sums of the corresponding bits of the other disks. In the same way, you can use more than three
disks and recover transparently if exactly one of them fails. Which
one you choose depends on how much you actually trust your disks. RAID
Systems are a way of making data loss less probable, but not
impossible.
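The parity trick can be sketched with plain Python lists standing in for the three disks (the variable names $a$, $b$, $c$ mirror the text; addition modulo $2$ is XOR):

```python
# Contents of the two data disks; c is the parity disk, c[i] = a[i] XOR b[i].
a = [1, 0, 1, 1, 0]
b = [0, 0, 1, 0, 1]
c = [ai ^ bi for ai, bi in zip(a, b)]

# Disk A breaks: every a[i] is recovered as c[i] XOR b[i],
# using that -b[i] = b[i] modulo 2.
recovered = [ci ^ bi for ci, bi in zip(c, b)]
assert recovered == a
```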
Parity checks are simple, but they have one major disadvantage: You
have to know which of the disks is broken. This is the case when some
essential part of a disk is broken. However, you might as well just
have parts of a disk returning wrong data. You can then use
Hamming-codes, which scale for large bit numbers, but not for small
ones. Therefore, in practice, they are rarely used for RAID systems,
since there are better possibilities in this particular use case of
only a few disks on which you have random access. The practical use-case
of Hamming-codes is for sending data over an unreliable channel, for
example through atmospheric disturbances. The central idea is to cover
subsets of the data bits by multiple parity bits. Assume we have $2^m - 1$ bits $x_1, x_2, \dots, x_{2^m - 1}$. We look at the binary representation of the indices $i$, from $1$ to $2^m - 1$. Every $x_i$ such that $i$ has exactly one $1$ in its binary representation is used as a parity bit, thus, we have the $m$ parity bits $x_1, x_2, x_4, \dots, x_{2^{m-1}}$. All the other bits are data bits, so we have an overhead of $m$ out of $2^m - 1$ bits. Now the parity bit $x_{2^k}$ describes the parity of the sum of all bits $x_i$ such that the binary digit that is set in $2^k$ is also set in $i$. For example, for $m = 4$, $x_2$ is the parity of the sum of $x_3$, $x_6$, $x_7$, $x_{10}$, $x_{11}$, $x_{14}$, $x_{15}$: all these indices have the second-least binary digit set. Therefore, by knowing which of the parity bits are wrong, we know which bit is wrong, assuming at most one bit flipped: If only one parity bit is wrong, that parity bit itself must be the flipped one. If more than one parity bit is wrong, we can restore the binary representation of the index of the flipped bit. Of course, the same works for other values of $m$. In general, by the same procedure, we can recognize and correct a single bit failure in a chunk of $2^m - 1$ bits by using $m$ bits as parity bits, and $2^m - m - 1$ bits as data bits.
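As a sketch (the function names are ours), note that the parity conditions above are equivalent to requiring that the XOR of the indices of all $1$-bits of the code word is zero; this makes encoding and correcting a $15$-bit word ($m = 4$) short:

```python
def hamming_encode(data_bits, m=4):
    """Place 2^m - m - 1 data bits at the non-power-of-two positions of a
    (2^m - 1)-bit word, then set the m parity bits so that the XOR of the
    indices of all 1-bits becomes 0."""
    n = 2**m - 1
    word = [0] * (n + 1)           # 1-indexed; word[0] is unused
    data = iter(data_bits)
    for i in range(1, n + 1):
        if i & (i - 1) != 0:       # i is not a power of two: data position
            word[i] = next(data)
    syndrome = 0
    for i in range(1, n + 1):
        if word[i]:
            syndrome ^= i
    for k in range(m):             # set parity bits x_1, x_2, x_4, ...
        word[2**k] = (syndrome >> k) & 1
    return word

def hamming_correct(word):
    """Locate and flip a single flipped bit, if any."""
    syndrome = 0
    for i in range(1, len(word)):
        if word[i]:
            syndrome ^= i          # nonzero syndrome = index of the flipped bit
    if syndrome:
        word[syndrome] ^= 1
    return word
```

Flipping any one of the fifteen bits of an encoded word is detected and undone by `hamming_correct`.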
While error correction might sound desirable, in cryptographic applications one often wants error-detection, but no possibility of error-correction. A common problem on webservers is that they must store the users' passwords in some secure way. On the other hand, no server is absolutely secure, and if data is leaked, the attacker can read the users' passwords. This is bad on the one hand because users tend to use the same passwords for many services; on the other hand, if such leaks are not noticed, the thief can use the cleartext passwords with the usual logins. In general, cryptographic checksums of the passwords should be stored instead. Famous examples are md5 (which is obsolete for this purpose) and sha512.
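A minimal sketch with Python's hashlib (the helper name is ours; real systems additionally add a per-user salt):

```python
import hashlib

def password_checksum(password):
    """Store this hex digest instead of the cleartext password."""
    return hashlib.sha512(password.encode("utf-8")).hexdigest()

stored = password_checksum("correct horse battery staple")
# At login, recompute the checksum of the entered password and compare:
assert password_checksum("correct horse battery staple") == stored
assert password_checksum("wrong guess") != stored
```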
Most data we work with contains a lot of redundancy. From a purely theoretical point of view, for every machine model and every chunk of data, there exists a minimal representation (see for example Kolmogorov Complexity). However, this minimal representation can usually not be calculated. In practice, approximations are used. For some special purposes, there can be lossy compression, for example, ogg vorbis and theora for audio and video compression. We will only look at lossless compression: The data can be recovered as it was before.
The easiest such compression method is run-length encoding, RLE. In
its simplest form, it is basically saving repetitive sequences of
characters only once with their multiplicity. For example, in many
graphics, you have a lot of singly-colored background, that is, a lot
of equally colored pixels. As a simple example, say we have sequences
of bytes (values from $0$ to $255$), in which long subsequences consisting of zeroes occur. We can then say that instead of saving the sequence of zeroes, we save only one zero and let the following byte denote the number of zeroes that follow this one. That is, if we have $0,0,0,0,0$, we can instead save $0,4$. We then save a sequence of up to $256$ consecutive zeroes with only two bytes. On the other hand, if we have only one zero to save, we need two zeroes instead of just one. For example, the sequence $1,0,1$ becomes $1,0,0,1$,
which needs more memory. A simple pigeonhole argument shows that this
must be the case in every lossless compression format: There
must be datasets that need more memory after compression. It is up
to the engineer to decide which format to use.
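A sketch of this zero-run encoding (the function names are ours):

```python
def rle_encode(data):
    """Encode runs of zero bytes as the pair (0, run_length - 1);
    all other bytes pass through unchanged."""
    out, i = [], 0
    while i < len(data):
        if data[i] == 0:
            run = 1
            while i + run < len(data) and data[i + run] == 0 and run < 256:
                run += 1           # a count byte holds at most 255
            out += [0, run - 1]
            i += run
        else:
            out.append(data[i])
            i += 1
    return out

def rle_decode(data):
    out, i = [], 0
    while i < len(data):
        if data[i] == 0:
            out += [0] * (data[i + 1] + 1)
            i += 2
        else:
            out.append(data[i])
            i += 1
    return out
```

For example, `rle_encode([1, 0, 1])` yields `[1, 0, 0, 1]`, the worst case mentioned above.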
Things get a bit more complicated when we allow backreferences: It
is often the case that a text contains repetitions. In many
programming languages, for example, in the beginning, there are a lot
of import or include commands. In addition to saving your bytes,
you can also save a backreference, which is a pair $(d, l)$, where $d$ is the distance, the number of bytes we have to jump back to reach the beginning of the repeated string, and $l$ is the length of the backreference. For example, let's say we have the string "This is an example. This is an example." (we do not care about the actual representation of that file "on the metal", since this is a more sophisticated problem and highly depends on the underlying system). Then we could save this as "This is an example. " followed by the backreference $(20, 19)$. Of course, as before,
we need reserved bytes to denote these backreferences, and therefore,
there are situations in which this form of compression is even
bad. However, in practice, this method is very useful and often gains
a lot of compression. Especially, modern data formats like
XML employ a lot of redundancy, which
has some advantages when working with them locally, but is a waste of
bandwidth when sending them, and here it can be very efficient. Of
course, for every string, there exists a minimal representation with
backreferences. However, it is not practical to calculate it, even for
short strings, but the intuitive "linear" algorithm, just reading the
string from left to right and maintaining a dictionary usually
provides sufficient results.
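The "linear" left-to-right scheme can be sketched as a greedy search over a bounded window (all names and the parameters min_len and max_dist are our assumptions; real formats such as DEFLATE differ in many details):

```python
def compress(text, min_len=4, max_dist=255):
    """Greedy left-to-right backreference compression.
    Output is a list of characters and (distance, length) pairs."""
    out, i = [], 0
    while i < len(text):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - max_dist), i):
            l = 0
            while i + l < len(text) and text[j + l] == text[i + l]:
                l += 1             # matches may run past i (overlapping)
            if l > best_len:
                best_len, best_dist = l, i - j
        if best_len >= min_len:
            out.append((best_dist, best_len))
            i += best_len
        else:
            out.append(text[i])
            i += 1
    return out

def decompress(tokens):
    out = []
    for t in tokens:
        if isinstance(t, tuple):
            d, l = t
            for _ in range(l):
                out.append(out[-d])  # copy byte d positions back
        else:
            out.append(t)
    return "".join(out)
```

On the example string above, the repetition is emitted as the single token `(20, 19)`.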
While RLE and backreferences try to remove duplicate substrings, it is also often the case that only a small amount of the possible byte values is actually used, or at least some bytes are used very seldom, compared to others. Therefore it is plausible to increase the amount of memory needed for the less frequently used bytes, and in exchange decrease the amount for the more frequently used ones. There are several ways to do this. The one we want to look at here is the use of Huffman codings. Firstly, we are given some alphabet. Even in practice, this alphabet need not fit into a byte, it may contain more than $256$ characters. So say we have an alphabet $\Sigma = \{c_1, \dots, c_m\}$, and a sequence of characters $s = s_1 \dots s_n$ of length $n$. Now let $f_i$ be the number of times the character $c_i$ occurs in $s$. As we want to save our string as a sequence of bits, we have to map every character to a unique bit sequence, in a way that when reading the whole bit sequence from left to right, no ambiguity occurs. The code must be prefix-free: No code of one character is allowed to be the prefix of another one, that is, we can form a binary tree in which every character corresponds to a leaf. If $l_1, \dots, l_m$ denote the (not yet known) lengths of the Huffman codes, we want to minimize the number of bits required to save our string, and this number is $\sum_{i=1}^{m} f_i \cdot l_i$. The following procedure creates a tree which tries to do this: Begin with one single-node tree for each character, weighted with its frequency $f_i$. Then, as long as more than one tree is left, merge the two trees of smallest weight under a new root, which is weighted with the sum of their weights. The code of a character is the sequence of branching decisions (say, $0$ for left, $1$ for right) on the path from the root to its leaf.
It is easy to see that if the frequencies are powers of two, this produces an optimal code.
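This merging procedure can be sketched in Python; instead of explicit trees, the sketch keeps, for each partial tree, the set of characters in it and prepends one decision bit at every merge (the function name is ours):

```python
import heapq

def huffman_code(freqs):
    """Build a Huffman code from a {character: frequency} map.
    Returns a {character: bit string} map."""
    code = {c: "" for c in freqs}
    # Each heap entry is (total weight, tie-breaker, characters in this tree).
    heap = [(f, i, [c]) for i, (c, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # the two trees of smallest weight
        f2, _, right = heapq.heappop(heap)
        for c in left:                      # left branch gets a leading 0
            code[c] = "0" + code[c]
        for c in right:                     # right branch gets a leading 1
            code[c] = "1" + code[c]
        heapq.heappush(heap, (f1 + f2, tie, left + right))
        tie += 1
    return code
```

For the power-of-two frequencies $4, 2, 1, 1$ this yields code lengths $1, 2, 3, 3$, which is optimal.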
Huffman codes are not unique, as we can "mirror" subtrees of the code. However, with some additional restrictions, we can make them unique. Shorter codes must lexicographically precede longer codes. Codes of the same length must have the same lexicographic ordering as the characters they encode. And the whole coding must be lexicographically dense, and contain a string that only contains zeroes. For example, consider the Huffman code $a \mapsto 11$, $b \mapsto 0$, $c \mapsto 101$, $d \mapsto 100$. The code $11$ is shorter than $101$, but does not lexicographically precede it. $c$ lexicographically precedes $d$; however, $101$ does not precede $100$. The corresponding canonical Huffman code, satisfying all the properties, is $b \mapsto 0$, $a \mapsto 10$, $c \mapsto 110$, $d \mapsto 111$. It is easy to show (and left as an exercise) that each Huffman code can be canonicalized.
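A sketch of the canonicalization (the function name and the alphabetical tie-break among equal lengths are our assumptions): only the code lengths are kept, and the codes are reassigned in order of (length, character) by counting upward in binary.

```python
def canonicalize(code):
    """Rebuild a {character: bit string} code as a canonical Huffman code."""
    # Sort characters by code length, ties broken by the character ordering.
    items = sorted(code, key=lambda c: (len(code[c]), c))
    canonical = {}
    value, prev_len = 0, len(code[items[0]])
    for c in items:
        value <<= len(code[c]) - prev_len   # append zeroes when the length grows
        prev_len = len(code[c])
        canonical[c] = format(value, "0{}b".format(prev_len))
        value += 1                          # next code: count up by one
    return canonical
```

Applied to the example above, this turns $a \mapsto 11$, $b \mapsto 0$, $c \mapsto 101$, $d \mapsto 100$ into $b \mapsto 0$, $a \mapsto 10$, $c \mapsto 110$, $d \mapsto 111$.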