Never may you sink so low as to drink
the cocoa through which you are dragged.
(Erich Kästner)

People learning Haskell frequently run into the problem of "understanding monads". This is not surprising, as monads are a distinguishing feature of the language: Haskell code is often written in terms of them. I want to give an introduction to monads and their main application, state monads, for people who are practically inclined, but explicitly not a practical introduction: my objective is to make these people understand monads, not to argue about whether monads are useful. Before I even start explaining monads, I will explain a lot of secondary material which is, in my opinion, a source of confusion about monads and functional programming in general. And of course, this introduction reflects my personal opinions and experiences at the time of writing. They might change, and they might not be correct. As always, I am thankful for suggestions and criticism.

This is not an introduction to functional programming. Basic knowledge of writing simple programs is required, specifically, what λ means, and what a recursive function is. I also assume good familiarity with imperative languages. If you think you are a good programmer and you still don't understand most of the things explained here, you should probably think again. If something isn't clear, Wikipedia and DuckDuckGo are your friends.

Different ways of thinking

C's records are called "structs". They are more low-level than the records of most other languages, but conceptually the same. A point on a plane can be defined via

struct point {
  float x;
  float y;
};

An approximation for the distance between two points can be defined as

float distance (a, b)
 struct point* a;
 struct point* b;
{ union {
   float f;
   int l; } x_;
  float y_;
  int z;
  x_.f = b->x - a->x;
  y_ = b->y - a->y;
  x_.f = 1/(x_.f*x_.f + y_*y_);         /* 1 / d^2 */
  y_ = x_.f * 0.5f;
  x_.l = 0x5f3759df - (x_.l >> 1);      /* fast inverse square root of 1/d^2, i.e. ~d */
  for (z = 0; z < 10; ++z)
   x_.f *= (1.5f - (y_ * x_.f * x_.f)); /* Newton iterations to refine the estimate */
  return x_.f;
}

Structs can also contain arrays, but only when they have fixed length; otherwise, one has to use pointers:

struct point_7d {
 float coordinates[7];
};

There is a language extension which allows the use of zero-length arrays under certain conditions. A person without prior C knowledge may wonder about the use case of an array of length zero. Well, in C, array access is unchecked: b[n] is just an abbreviation for *(b+n). Therefore, declaring an array of length zero while knowing that more elements are there is perfectly legal, and it often occurs that some file format has a header, which can be written as a struct, and which indicates the length of the rest of the file. The struct itself then only contains the header information, but the content of the file can still be accessed through this struct. As an example, the following program generates a bitmap file, using a struct with a zero-length array:

#include <stdint.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <strings.h>

#pragma pack(push,1)
struct bmp {
  uint16_t magic;
  uint32_t filesize;
  uint16_t reserved1, reserved2;
  uint32_t pixel_offset, dib_hdr_size;
  int32_t width, height;
  uint16_t planes, bits_per_pixel;
  uint32_t compression;
  uint32_t image_size;
  uint32_t xppm, yppm, coltab, important_colors;
  uint32_t imagedata[0];
};
#pragma pack(pop)

int mmapfile (const char* filename, off_t s, void** r) {
  int ret = open(filename, O_CREAT|O_RDWR, S_IRWXU);
  if (ret < 0) exit(EXIT_FAILURE);
  if (ftruncate(ret, s) < 0) exit(EXIT_FAILURE);
  *r = mmap(NULL, s, PROT_READ|PROT_WRITE, MAP_SHARED, ret, 0);
  if (*r == MAP_FAILED) exit(EXIT_FAILURE);
  return ret;
}

struct bmp* bmp_file (const char* filename,
              uint32_t width, uint32_t height) {
  uint32_t imgs = width*height*4,
    fs = sizeof(struct bmp)+imgs;
  struct bmp* b;

  mmapfile(filename, fs, (void**) &b);

  b->magic = 19778;
  b->filesize = fs;
  b->reserved1 = b->reserved2 =
    b->compression = b->xppm = b->yppm =
    b->coltab = b->important_colors = 0;
  b->pixel_offset = sizeof(struct bmp);
  b->dib_hdr_size = 40;
  b->width = width;
  b->height = height;
  b->planes = 1;
  b->bits_per_pixel = 32;
  b->image_size = imgs;
  return b;
}

int main (int argc, char** argv) {
  int i;
  if (argc < 2) exit(EXIT_FAILURE);
  struct bmp* b = bmp_file (argv[1], 100, 100);
  bzero(b->imagedata, b->image_size);
  for (i = 0; i < 100; ++i)
    b->imagedata[i + 100*i] = ~0; /* draw a diagonal */
  return 0;
}

This code is not really portable, and it is arguable whether it is beautiful. I consider this use of array elements outside of the declared bounds a hack, and as such, it is beautiful: it is a creative solution to a common problem. However, it is not the kind of code one should be writing these days, except when you really need to squeeze out every processor cycle. Still, it illustrates a way of thinking: C's strength is that it does not enforce its abstractions. It lets you do whatever you want, without warning, so you get very efficient code. By today's measures, on the continuum of the abstraction-optimization tradeoff, C sits at the far right.

Haskell, on the other hand, is on the far left side: Abstraction is extremely valuable, and should only be broken if there is no alternative, and even then some overhead will be accepted to keep it behaving in a clean way. It is a radically different way of thinking. Scott Meyers said that everything in C++ has a reason. The same holds for Haskell. It's just that Haskell sets other priorities. One can argue whether this is better or worse, whether it is practically relevant or purely academic; but this is not the topic of this post. To understand why monads were invented, you need to think this way; if you're not willing to think this way, you will not be able to understand monads.

The problem

Functions in Haskell should not have side-effects. Some of them do, for example, Debug.Trace.trace, but it is only meant for debugging. The same Haskell expression should always yield the same result. This way, Haskell can utilize lazy evaluation (or call-by-need): an expression is evaluated only when its result is actually used. Therefore, Haskell can handle infinite structures in a nice way. For example, the following is a standard motivating example: it defines the infinite list of Fibonacci numbers, plus a function that returns the n-th Fibonacci number.

fibs = 0 : 1 : addHd fibs
  where addHd (x : r@(y : _)) = (x + y) : addHd r

fib n = fibs !! n

From an old-school software engineer's point of view, this might look horrible. And in fact, most programs written this way are slower than their highly optimized counterparts. On the other hand, the definition is easy to read and to write (once you have gotten to know Haskell): relying on lazy evaluation can make writing and understanding software easier. And today's software problems are seldom about efficiency; they are about maintainability. A further nice property is that we do not need a special case for constructs like if-then-else. We can define

ifthenelse :: Bool -> a -> a -> a
ifthenelse True  a _ = a
ifthenelse False _ b = b

A seemingly equivalent C function would be

enum BOOL { FALSE, TRUE };

void* ifthenelse (b, yes, no)
     enum BOOL b;
     void *yes, *no;
{ switch (b) {
  case FALSE:
    return no;
  case TRUE:
    return yes;
  }
}

The difference is that C has eager evaluation: Every argument is evaluated before a function is called. In Haskell,

ifthenelse True 2 (mod 1 0)

will yield 2, while in C,

ifthenelse (TRUE, (void*)2, (void*)(1 % 0));

will yield a division-by-zero error (well, actually, most modern compilers will at least warn you about this, or optimize it out through dead-code elimination, but you get the idea). Notice that most other functional programming languages allow side effects, also use eager evaluation, and run into the same problem as C does in this case. Common Lisp (and generally most Lisp dialects) has a nice way of defining things like ifthenelse as macros. Anyway, Haskell does lazy evaluation, with all its benefits and drawbacks.

And one major drawback is that the presence of side-effects breaks lazy evaluation. For example, the following C program cannot be implemented in Haskell directly:

#include <stdio.h>

int i = 0;

int inc() {
  return i++;
}

int main (void) {
  int j;
  for (j = 0; j < 10; ++j)
    printf ("%d\n", inc());
  return 0;
}

Subsequent calls to inc return different results; the results depend on the hidden state i. Assume we had such a function in Haskell, and we would define

incs = inc : incs

Then incs !! n would yield the n-th element of the list to be evaluated. The contents of the list would depend on the order of the calls to (!!). There might be scenarios where this is desirable, but in most cases, this just breaks stuff. As I wrote previously, it is possible to destructively modify values in Haskell: on the one hand, Haskell has an FFI which lets us call C functions; on the other hand, Haskell has "unsafe" functions like unsafePerformIO. Their use is discouraged, but there are situations in which it is useful, for example for debugging, or for developing abstractions which only use mutable state in the background, where it cannot be accessed through the frontend, like DiffArrays.

The absence of side-effects makes certain programs impossible. For example, Quicksort, and in-situ sorting algorithms in general, cannot be realized without destructive modifications of an underlying array. However, there are good purely functional substitutes for these algorithms. More problematic is the absence of I/O: accessing and modifying files is stateful. But not having any kind of I/O limits the usability of a programming language. This is the problem state monads try to solve.

Properties of pure functions

For this section, let us talk only about pure functions, that is, functions that do not have side-effects. Let A, B, C be some types and f : A → B, g : B → C; then we can define f ↣ g := λx. g(f(x)), that is, (f ↣ g)(x) = g(f(x)). It is easy to show that ↣ is associative. Notice that the same does not hold for functions with side effects (I'll leave finding a counterexample as an exercise).

A common optimization, which came from but is not limited to functional programming, is tail call optimization (TCO). Functional programs use recursion a lot more than imperative programs. One possible definition of the Gaussian sum is

gsum 0 = 0
gsum n = n + gsum (n - 1)

The problem with recursion is that it consumes stack space. Therefore, for large values of n, this definition may exhaust the stack. However, if the last expression is the unmodified recursive call, the compiler can reuse the current stack frame instead of allocating a new one. The above function would have to be rewritten to

gsum_tr_ 0 r = r
gsum_tr_ n r = gsum_tr_ (n - 1) (n + r)
gsum_tr n = gsum_tr_ n 0

The function gsum_tr_ would be called tail recursive, and the recursive call does not consume stack space. In some functional languages, this is still the method of choice. In Haskell, it is less relevant, because in many cases, rewriting a recursive function into a tail recursive function is straightforward, and Haskell applies this optimization. However, let us stick to the example gsum_tr_ for the moment. A definition in Java would look like

public static int gsum_tr_ (int n, int r) {
    if (n == 0) return r;
    return gsum_tr_ (n-1, n+r);
}

The corresponding JVM bytecode (output of javap -c, with comments by me; I chose Java bytecode because it is stack-based) is:

public static int gsum_tr_(int, int);
   0: iload_0  //load first argument (n)
   1: ifne 6   //if n == 0 jump to byte 6
   4: iload_1  //load second argument (r)
   5: ireturn
   6: iload_0  //load first argument (n)
   7: iconst_1
   8: isub     // => n - 1
   9: iload_0
  10: iload_1
  11: iadd     // => n + r
  12: invokestatic  #2 // stack: n - 1, n + r
  15: ireturn

Before the recursive call, the arguments are pushed to the stack, just to be popped again directly by the called function. Then the return value is taken from the recursive call directly. We might as well just modify the values directly, in-place:

public static int gsum_tr__ (int n, int r) {
    while (true) {
        if (n == 0) return r;
        r = n + r;
        n = n - 1;
    }
}

And get

public static int gsum_tr__(int, int);
   0: iload_0
   1: ifne 6
   4: iload_1
   5: ireturn
   6: iload_0
   7: iload_1
   8: iadd
   9: istore_1
  10: iload_0
  11: iconst_1
  12: isub
  13: istore_0
  14: goto 0

Which is almost identical: an imperative loop with destructive updates. That is, under certain conditions, we can have destructive updates inside purely functional code. Whenever we know that the computation cannot depend on the former value, we can destructively overwrite it. More specifically, a sufficient condition is that the value occurs in the function body exactly once. This can be trivially checked at compile time, and in fact, this condition is reflected in so-called uniqueness types in some functional languages (but not in Haskell).

Putting the pieces together

I hope that by now you get an impression of how much thought has been put into the concepts of functional programming. The concepts are not trivial, but neither are the concepts of higher imperative programming languages. Make sure you understood everything so far.

We now want to embed stateful operations into purely functional code, not with uniqueness types, but with state monads. Firstly, let us define a type Ref a, which is a reference to some object of type a.

data Ref a = Pto a

Creating a new reference is so simple that defining a function for it hardly seems worthwhile. However, for instructional purposes, we define it:

ret :: a -> Ref a
ret x = Pto x

This definition is polymorphic. Let us look at the case a = Int. Now consider the following functions:

inc :: Int -> Ref Int
inc x = Pto $ x + 1

twice :: Int -> Ref Int
twice x = Pto $ 2 * x

These functions just read our reference, and return a reference to the increment/double of the former value. We furthermore need

(>=>) :: (a -> Ref b) -> (b -> Ref c) -> a -> Ref c
(>=>) f g x = case f x of { Pto y -> g y }

run :: Ref a -> a
run (Pto x) = x

Now we can write the following:

run ((inc >=> twice >=> twice >=> twice >=> inc >=> ret) 0)

Before you read on, try to figure out what this expression returns. Answer: it returns 9. The nice thing is that the reference itself is handed over to the next function directly as its first argument. Therefore, there is no reason to ever have more than this one reference; an optimizing compiler can avoid stack space entirely and treat the operations on the reference as destructive, the same way as with TCO. Notice that inside this definition, except for the first function call, we never actually talk about the reference, we only say what to do with it.

Make sure you have understood this so far. If not, I failed at explaining, but it also makes no sense to read on.

The (>=>) is somewhat similar to our previous ↣. In particular, one can show that it is associative. Furthermore, the function ret is a neutral element on both sides. These properties make our (>=>) a Kleisli combinator, and Ref a monad with respect to this Kleisli combinator. Any structure with a combinator and a return function that satisfies these axioms is a monad. Monads are mathematical structures, much like groups, rings, fields and vector spaces: they are collections of objects which satisfy certain axioms. And like fields, which can be used in cryptography, and vector spaces, which can be used for computer graphics, monads can be used to describe and solve specific problems, and one such problem is the description of stateful operations in a purely functional context. But as for groups, fields and vector spaces, there are many different structures that obey the monad axioms.

Let us for a moment look at the group axioms: A group is a triple (G, ∘, e), where ∘ : G → G → G and e : G, such that

  • ∘ is associative: a∘(b∘c)=(a∘b)∘c.
  • e is a left-neutral element of ∘: e∘a=a.
  • Every a:G has a left-inverse a⁻¹:G, such that a⁻¹∘a=e.

The monad axioms for a Kleisli combinator are quite similar:

  • return >=> g = g
  • f >=> return = f
  • (f >=> g) >=> h = f >=> (g >=> h)

We have a neutral element (left and right neutral), and the combinator is associative.

I chose the Kleisli combinator because I think that it is easier to understand in the case of state monads: In our case, the only thing it does is passing the value of the returned reference of the left function as a first argument to the right function, making sure that the rest of the computation does not depend on the former value, so it can be optimized out. However, using the Kleisli combinator has several disadvantages for actual programming. The usual definition uses the bind combinator (>>=). We can define the bind combinator in terms of the Kleisli combinator:

(>>=) :: Ref a -> (a -> Ref b) -> Ref b
(>>=) ma g = (id >=> g) ma

But for the moment, let us stick with the Kleisli combinator. So far, we did not do anything except writing definitions and exploiting certain optimizations.

Haskell's IO monad

Now we want to write something to a file. The main function of a Haskell program has the type IO (), which is a monadic type. This monad references the "outside world" – and it must be threaded linearly, since there can only be one such world. This outside world is stateful. In C, we can write

#include <stdio.h>

void wr (FILE* d) {
  fprintf(d, "Hello!\n");
}

int main (int argc, char** argv) {
  FILE* d = fopen(argv[1], "a");
  wr(d);
  printf("%ld\n", ftell(d)); // 7
  wr(d);
  printf("%ld\n", ftell(d)); // 14
}

The problem is obvious: Every call to fprintf(3) advances the file pointer, so the ftell(3) calls return different file positions – and violate referential transparency. We could, however, rewrite this program in the following way:

#include <stdio.h>

FILE* wr (FILE* d) {
  fprintf(d, "Hello!\n");
  return d;
}

int main (int argc, char** argv) {
  FILE* d1 = fopen(argv[1], "a");
  FILE* d2 = wr(d1);
  printf("%ld\n", ftell(d2)); // 7
  FILE* d3 = wr(d2);
  printf("%ld\n", ftell(d3)); // 14
}

This is how it would be done with uniqueness types: every modification of the file consumes a file handle, but produces a new one. Notice that we would have to encapsulate the printf(3) statements as well – they change the "world". Therefore, we can define a WORLD* type which is passed around instead, and think of this WORLD* type as a map from handles to resources:

#include <stdio.h>
#include <stdarg.h>

typedef struct world WORLD; // dummy; never defined, we only pass pointers

struct wfret_k { WORLD* w; void* t; };

struct wfret_k wfopen (WORLD* w, const char* name, const char* mode) {
  struct wfret_k x;
  x.w = w;
  x.t = (void*) fopen(name, mode);
  return x;
}

struct wfret_k wfprintf (WORLD* w, FILE* d, const char* format, ...) {
  struct wfret_k x;
  va_list aq;
  va_start(aq, format);
  vfprintf(d, format, aq);
  va_end(aq);
  x.w = w;
  x.t = NULL;
  return x;
}

struct wfret_k wftell (WORLD* w, FILE* d) {
  struct wfret_k wt;
  wt.w = w;
  wt.t = (void*) ftell(d);
  return wt;
}

struct wfret_k wprintf (WORLD* w, const char* format, ...) {
  struct wfret_k wt;
  va_list aq;
  va_start(aq, format);
  vprintf(format, aq);
  va_end(aq);
  wt.w = w;
  wt.t = NULL;
  return wt;
}

struct wfret_k wr (WORLD* w, FILE* d) {
  return wfprintf (w, d, "Hello!\n");
}

int main (int argc, char** argv) {
  WORLD* w0; //assume this was "outside"

  struct wfret_k o1 = wfopen(w0, argv[1], "a");
  WORLD* w1 = o1.w;
  FILE* f = o1.t;

  struct wfret_k o2 = wr(w1, f);
  WORLD* w2 = o2.w;
  struct wfret_k o3 = wftell(w2, f);
  WORLD* w3 = o3.w;

  struct wfret_k o4 = wprintf(w3, "%ld\n", (long) o3.t);
  WORLD* w4 = o4.w;

  struct wfret_k o5 = wr(w4, f);
  WORLD* w5 = o5.w;

  struct wfret_k o6 = wftell(w5, f);
  WORLD* w6 = o6.w;
  struct wfret_k o7 = wprintf(w6, "%ld\n", (long) o6.t);
}

Now, this is very complicated to do in plain C. Functional languages usually allow shadowing of variable names, and when working with uniqueness types, this is usually how it is done. At first sight, this WORLD* object does not contain anything, and is therefore useless. However, under uniqueness typing, every WORLD* can be passed to a function only once. Then it is gone: if we tried to access it a second time, the code would not compile. Even though this WORLD* object does not contain any data, it prevents some code from being compiled: it is a formal guarantee that can be checked at compile time. At runtime, we can just optimize it out – which is fine, since we checked the correctness at compile time.

We have now reached the limits of C; in particular, we need currying from now on. We will continue in pseudocode to define a Kleisli combinator. Our wrapper functions return a polymorphic pair wfret_k (which is realized via void*, because C does not have polymorphism), consisting of the world itself and some return value.

We can now define a "modified" kleisli combinator (notice that this is pseudocode, an unholy mixture of C and Haskell):

m_kleisli :: (WORLD* → A → (WORLD*, B)) → (WORLD* → B → (WORLD*, C))
                                        → (WORLD* → A → (WORLD*, C))
m_kleisli f g w0 a =
  let (w1, b) = f w0 a
  in  g w1 b

Notice that except for the calls into the libc, we never do anything but pass the worlds around – which was useful to assure linearity. However, if we take the Kleisli combinator as the given method of doing I/O, passing the world explicitly is no longer necessary, because every function that does I/O will be a combination of smaller functions via the Kleisli combinator, and will sooner or later decompose into calls to our elementary wf-functions. Instead, we assure linearity by defining a type constructor IO a, which replaces our former (WORLD*, a):

(>=>) :: (a → IO b) → (b → IO c) → (a → IO c)

We would still have to prove the monad laws, but this is a good exercise for you.

As mentioned earlier, it is possible to define the bind-operator

(>>=) :: IO a → (a → IO b) → IO b

which is more common. In Haskell, we can therefore write

import System.IO
import System.Environment

main :: IO ()
main = getArgs >>=
       \ args ->
           openFile (args !! 0) AppendMode >>=
           \ handle -> hPutStr handle "Hello!\n" >>=
           \ _ -> hPutStr handle "Hello!\n" >>=
           \ _ -> hClose handle

There is a shorthand notation that makes these things (sometimes) easier:

import System.IO
import System.Environment

main :: IO ()
main = do args <- getArgs
          handle <- openFile (args !! 0) AppendMode
          hPutStr handle "Hello!\n"
          hPutStr handle "Hello!\n"
          hClose handle