GHC performance over time

TL;DR: There’s no significant difference between the GHC versions 7.* and 8.0.1-alpha regarding the Haskell code used in the benchmarks game, but there might be an improvement in the performance of idiomatic Haskell code from GHC 7.0.1 to GHC 8.0.1-alpha.

I’m currently building a new benchmarking tool called temci as part of my bachelor’s thesis.

In search of good programs to benchmark for the evaluation part, I stumbled across the benchmarks game. This blog post is about using parts of the benchmarks game and temci to measure and compare the performance of the following GHC versions: 7.0.1, 7.2.1, 7.4.1, 7.6.1, 7.8.1, 7.10.1 and 8.0.0 (the version from mid January, called 8.0.1 here for brevity), using different optimization levels („-O“, „-O2“ and „-Odph“, see the Haskell documentation).

Related work

Johannes Waldman also compared several GHC versions (including some 6.* versions) using the nofib benchmarking suite:

(I didn’t include the 6.* versions because I was unable to install them. It would be cool if someone who knows how to install GHC versions older than 7.0.1 on Ubuntu could show me how to do it, so that a later version of this blog post could include them.)

Temci and the benchmarks game

Temci is essentially a framework wrapping other benchmarking tools (perf stat, SPEC or time). It adds some cool features like assembler randomization and cpuset management, helps to benchmark programs easily and to compare the results afterwards, and is designed to be extended and built upon with ease. The benchmarks in this blog post are produced with a script that is modeled after the original benchmarks game and uses temci to do the actual measurements.

The benchmarked Haskell code comes from the benchmarks game (with some small modifications). Not all Haskell files are used, as many didn’t compile properly in all GHC versions or ran almost indefinitely (even with their smallest input). I fixed simple errors, but learning advanced Haskell just to fix the non-trivial bugs is not in the scope of my bachelor’s thesis.

The aim is to provide a tool to compare different implementations of the same language by compiling several programs in several categories and running them with different inputs in a reproducible manner.


The benchmarking results were obtained on Ubuntu 15.10 (with ecryptfs as its file system and Unity as its desktop environment), running with root privileges on a computer with 8 GB DDR3 RAM, a 120 GB SSD (mostly empty), all temporary files (compilation results, …) kept in RAM (via tmpfs) and an AMD FX™ 4100 x86_64 CPU. The CPU has 4 cores, 3 of which were available almost exclusively to the benchmarked program. This computer is used solely for benchmarking purposes.

The whole benchmarking took around 4 days (with 20 iterations each).

What’s a benchmarked program? It’s the term used here for the command that is actually benchmarked. It depends on the compiler used, the actual Haskell code and the input for the resulting program (if the resulting program’s performance is what matters).

Important note: As the benchmarked code comes from the benchmarks game, it might not be written in the best functional style and it doesn’t make much use of lazy evaluation. The benchmarking results can therefore only give an impression of GHC’s performance. If you distrust my benchmarking results, make the measurements yourself and add other algorithms to benchmark. The code is on GitHub; please open an issue there if you encounter any problems.

Used temci plugins

The following temci plugins were used to improve the benchmarking results (i.e. to reduce the variance and increase the reproducibility).

Discard blocks

The first 5 benchmarking runs (one benchmarking block) are discarded to pre-fill the CPU and file system caches.

Nice and other_nice

The process priority of the benchmarked process is increased and the priority of most other processes is decreased.

Stop Start

Many processes that are not vital to the system are stopped via a Unix signal, especially the SSH processes and most child processes of the window manager (e.g. Firefox, …).

CPU sets

The benchmarked process has its own cpu set and can use 3 CPUs almost exclusively.
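
To give a rough idea of what these plugins do conceptually, here is a minimal Python sketch. It is not temci’s actual implementation; the helper names, the CPU list and the niceness value are made up for illustration:

```python
import os
import signal
import subprocess

def run_benchmarked_command(cmd, cpus="1-3", niceness=-15):
    """Run `cmd` pinned to a fixed set of CPUs with a raised priority.

    Rough conceptual analogue of the 'nice' and 'CPU sets' plugins;
    temci itself uses real cpusets and more careful bookkeeping.
    """
    # taskset pins the process to the given CPU list,
    # nice raises its priority (negative values require root).
    full_cmd = ["nice", "-n", str(niceness), "taskset", "-c", cpus] + cmd
    return subprocess.run(full_cmd, check=True)

def stop_process(pid):
    """Temporarily stop a non-vital process (analogue of 'Stop Start')."""
    os.kill(pid, signal.SIGSTOP)

def resume_process(pid):
    """Resume it again after benchmarking."""
    os.kill(pid, signal.SIGCONT)

# Example: benchmark a compiled Haskell binary with one input.
# run_benchmarked_command(["./fannkuchredux", "12"])
```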


Mean scores (and other statistical stuff)

The following discussion of the benchmarking results uses several statistical terms that are explained here to prevent misunderstandings.

Std or standard deviation

A statistical property that describes the variation (spread) of a given set of values. It’s an important quality measure for the benchmarking. To quote Gernot Heiser’s list of systems benchmarking crimes:

Always do several runs, and check the standard deviation. Watch out for abnormal variance. In the sort of measurements we do, standard deviations are normally expected to be less than 0.1%. If you see >1% this should ring alarm bells.

Geometric standard deviation

The standard deviation corresponding to the geometric mean. Subtract 1 from it to get a rough equivalent of the „normal“ standard deviation.

Mean score or mean / best mean

This property is the mean of the timings of a benchmarked program divided by the lowest (best) mean of that program over all GHC versions. It gives the performance of each implementation relative to a „perfect“ GHC compiler.

Mean score relative to the first or mean / mean of the first implementation

This property is the mean of the timings of a benchmarked program for a specific implementation divided by the mean of the first implementation (GHC 7.0.1 here).
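
To make these two scores concrete, here is a minimal Python sketch; the timing numbers are invented and serve only as an illustration:

```python
# Hypothetical mean execution times (in seconds) of one benchmarked
# program, one entry per GHC version (the numbers are invented).
means = {"7.0.1": 4.2, "7.2.1": 4.0, "7.4.1": 4.5, "7.6.1": 4.4,
         "7.8.1": 3.9, "7.10.1": 3.8, "8.0.1": 3.7}

best = min(means.values())   # mean of the best ("perfect") GHC version
first = means["7.0.1"]       # mean of the first implementation

mean_score = {v: m / best for v, m in means.items()}
mean_score_rel_first = {v: m / first for v, m in means.items()}
```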

Most plots in this blog post show the mean score distributions. The same conclusions drawn from the plots and the corresponding tables can also be drawn from the mean score relative to the first distributions.

Mean (score) differences

It’s important to take standard deviations into account when comparing different benchmarking scores, because as Gernot Heiser points out:

  • Don’t believe any effect that is less than a standard deviation
  • Be highly suspicious if it is less than two standard deviations

Geometric mean

To summarize several mean scores the geometric mean is used. Using the normal (arithmetic) mean with these normalized values would be like lying, according to the paper by Fleming and Wallace:

Using the arithmetic mean to summarize normalized benchmark results leads to mistaken conclusions that can be avoided by using the preferred method: the geometric mean.

Arithmetic mean

To summarize several mean scores relative to the first, the arithmetic mean is used. Although these scores are also normalized, the arithmetic mean is acceptable here, as the reference implementation doesn’t change between categories.
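
As a small sketch of how these summaries can be computed (the helper functions and score values below are mine, invented for illustration, and not taken from temci):

```python
import math
import statistics

def geom_mean(xs):
    """Geometric mean: the preferred way to summarize mean scores
    that are normalized by the best mean."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def geom_std(xs):
    """Geometric standard deviation; subtracting 1 gives a rough
    equivalent of the "normal" standard deviation."""
    mu = geom_mean(xs)
    return math.exp(math.sqrt(sum(math.log(x / mu) ** 2 for x in xs) / len(xs)))

# Invented mean scores of one GHC version over several benchmarked programs.
scores = [1.00, 1.12, 1.31, 1.05, 1.20]            # mean / best mean
scores_rel_first = [1.00, 0.95, 1.10, 0.90, 1.02]  # mean / mean of first impl

summary = geom_mean(scores)                      # summarized mean score
spread = geom_std(scores) - 1                    # rough analogue of the std
summary_rel = statistics.mean(scores_rel_first)  # arithmetic mean is fine here
```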

Benchmarking results

The benchmarking results with lots of plots and tables can be found here:

And the raw data (the YAML result files produced by temci) can be found here.


I discuss the benchmarking results by giving some of my conclusions and explaining them below (with plots and everything).

On compile times

  1. There are no significant differences between the different GHC versions and optimization levels (probably due to the fact that compiling takes only between 1 and 1.5 seconds).

Conclusion 1

As you can see in the following graphs, there’s really no (significant) difference between the different GHC versions. The following is the plot and the summary for the set of all mean scores for each GHC version using „-Odph“:


[Table: mean score (gmean(mean / best mean)) | mean score std (gmean std(mean / best mean)) | mean rel std (gmean(std / mean))]

The right column is the important one: it shows that the standard deviation is around 30% for each single benchmarking. As every benchmarked program is measured only 20 times, there is really no significance in the data. The results for the other optimization levels are nearly the same.

This result is a bit counterintuitive if you only see the plot above. But as this plot only tells us about the variance of the relative means over all categories and subprograms, the real deviation is hidden. If I only showed the plot, you would probably believe that the newer GHC versions are faster than the older ones. It’s so easy to lie with statistics. Therefore: Distrust every (!) mean or median value without at least a proper standard deviation (or variation) figure.

The following are the compile time mean score relative to the first distributions for all (tested) GHC versions and optimization levels (grouped by the latter). The table is omitted for brevity, as the standard deviations are not very different from the example above. The whole report can be seen here:

On execution times

  1. There’s no significant difference in the results between using „-Odph“ and „-O2“.
  2. There’s a significant difference between using „-Odph“ (or „-O2“) and „-O“. This is contrary to the Haskell documentation, which tells us: „At the moment, -O2 is unlikely to produce better code than -O.“
  3. The variance of the geometric mean over all individual mean scores is significantly higher with GHC 7.4.1 and GHC 7.6.1 than with the other versions (regarding „-Odph“ and „-O2“).
  4. The variance plotted over time has the form of a parabola.
  5. The same can be said about the geometric mean of all mean scores for each implementation.
  6. The differences between these geometric means aren’t significant (and are mostly lower than one standard deviation).
  7. There seems to be a roughly significant improvement from 7.0.1 to 8.0.1 regarding „-O“.

Conclusion 1 & 2

The following are the plots for the set of all mean scores per GHC version and optimization level (first „-O“, then „-O2“ and „-Odph“):

[Plots: fig__program_category_haskell -O, fig__program_category_haskell -O2, fig__program_category_haskell -Odph]

It’s visually clear that „-O2“ and „-Odph“ are almost identical and that they differ from „-O“. But as I wrote before, the tabular data with the actual standard deviations is important; it is given in the following, in the same order:

[Table for „-O“: mean score (gmean(mean / best mean)) | mean score std (gmean std(mean / best mean)) | mean rel std (gmean(std / mean))]

[Table for „-O2“: mean score (gmean(mean / best mean)) | mean score std (gmean std(mean / best mean)) | mean rel std (gmean(std / mean))]

[Table for „-Odph“: mean score (gmean(mean / best mean)) | mean score std (gmean std(mean / best mean)) | mean rel std (gmean(std / mean))]

Another way to see this conclusion is by comparing the mean score relative to the first distributions for all combinations of GHC version and optimization level, grouped by the latter:

Yet another way to look at the benchmarked data is to simply subtract each measured value for one optimization level from the corresponding value for another. This data can then be plotted in the same way as above (but with the divisor for the mean scores set to one, so the values remain comparable). The following are first the plot summarizing „-O“ – „-O2“ and then the plot summarizing „-O2“ – „-Odph“:

The conclusion should be obvious, as the mean over all differences shown in the plots is many times higher for „-O“ – „-O2“ than for „-O2“ – „-Odph“ (just look at the x axis).
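
The subtraction itself is trivial; here is a tiny sketch with invented run times (the variable names are mine):

```python
# Paired run times (in seconds) of the same benchmarked program,
# measured with two optimization levels (the numbers are invented).
runs_O = [4.1, 4.3, 4.0, 4.2]
runs_O2 = [3.2, 3.4, 3.1, 3.3]

# Per-run differences; they are plotted like the mean scores,
# but with the divisor set to one so the absolute values stay comparable.
differences = [a - b for a, b in zip(runs_O, runs_O2)]
mean_difference = sum(differences) / len(differences)
```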

The difference between „-O“ and „-O2“ is (especially with older GHC versions) kind of significant, but with newer versions this effect diminishes. The following is the corresponding table:

[Table: mean score (mean(mean / mean of first impl)) | mean score std (std(mean / mean of first impl)) | mean rel std (mean(std / mean))]

For more plots, take a look at the actual report.


Conclusion 3 & 4 & 5

It’s easy to see from this data that the variance of the geometric mean over all individual mean scores is significantly higher with GHC 7.4.1 and GHC 7.6.1 than with the other versions (regarding „-Odph“ and „-O2“), and that the variances and means, plotted over time, have some sort of parabolic form.

The same conclusions can be drawn from the mean score relative to the first distributions:

[Table: mean score (mean(mean / mean of first impl)) | mean score std (std(mean / mean of first impl)) | mean rel std (mean(std / mean))]

Conclusion 6

The data also shows that for „-O2“ and „-Odph“ no difference between the mean scores of different GHC versions is really significant (i.e. bigger than the maximum of the standard deviations of the two plus the maximum of the mean standard deviations).

Conclusion 7

As the plot at the beginning and the following tables show, there is a slightly significant difference between GHC 7 and GHC 8 (regarding „-O“), regarding both the mean scores (first table) and the mean scores relative to the first (second table):

[First table: mean score (gmean(mean / best mean)) | mean score std (gmean std(mean / best mean)) | mean rel std (gmean(std / mean))]
[Second table: mean score (mean(mean / mean of first impl)) | mean score std (std(mean / mean of first impl)) | mean rel std (mean(std / mean))]

But it’s only slightly significant, as the maximum of the standard deviations between the two is around 13% and the difference is only around 10%. This difference is far higher (relative to the standard deviations) than with „-O2“ or „-Odph“.

On execution times on specific program categories

The above just discussed the performance over all program categories (like binarytrees or fannkuchredux). The following are observations on specific program categories.


The fannkuchredux category is the category with the longest running programs and can therefore show bigger (and more significant) differences. It consists of two benchmarked programs: 1 and 3.

The first program consists of around 40 lines of code. It’s written in a good style (without fancy Haskell features), as far as I can tell.

The second program consists of around 90 lines of code. It uses the Control.Concurrent and Foreign modules to do some optimizations. The Haskell code is, contrary to the first program, pretty hard for me to understand.

If we look at the performance of these two programs for the biggest input (12) and the „-Odph“ optimization level (there is no big difference with other inputs or optimization levels), we see that GHC 7.0.1 is better with the second (optimized) program and 8.0.1 is better with the first (simple) program (the plots show the set of execution times in milliseconds for every GHC version):


Plot for the first program


Plot for the second program

This difference might suggest that

  • the second program was optimized for a specific GHC version or
  • GHC got better at optimizing idiomatic Haskell programs (like the first one)
  • (maybe that’s the reason why there’s no significant difference between the GHC versions regarding all programs and inputs)
  • manually optimizing Haskell code is a worthwhile endeavour: in this case it reduces the execution time from 5 – 9 minutes to 50 – 60 seconds.

Although the significance is not high (and it’s some sort of reading tea leaves at these significance levels), looking at the mean score for the upper half and the lower half of programs per category (split by their entropy) supports the second conclusion.

The following two tables show the mean score distribution properties for these two halves for the optimization level „-O2“ (it’s nearly the same for „-Odph“, but different from „-O“). The entropy of a program is the length of its gzip-compressed source code (see the short sketch after the tables); programs with a lower entropy tend to be simpler and shorter. If a category has an odd number of programs, then one program belongs to both the lower and the upper half.

[Two tables (one per half): mean score (gmean(mean / best mean)) | mean score std (gmean std(mean / best mean)) | mean rel std (gmean(std / mean))]

And the same for mean scores relative to the first:

[Two tables (one per half): mean score (mean(mean / mean of first impl)) | mean score std (std(mean / mean of first impl)) | mean rel std (mean(std / mean))]

Important note: I should emphasize one more time that this is like reading tea leaves and only suggests an area to investigate further.
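
By the way, the entropy measure used above can be reproduced with a few lines of Python (a sketch; the file name is only an example):

```python
import gzip

def entropy(source_file):
    """Length of the gzip-compressed source code, used as a rough
    proxy for how simple and short a program is."""
    with open(source_file, "rb") as f:
        return len(gzip.compress(f.read()))

# e.g. entropy("fannkuchredux.hs")
```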


Further work

  • Use the assembler randomization feature of temci to see whether or not I just observed some cache effects (easy; it only costs computing time on my benchmark computer and a slight modification of the script)
  • Fix the other benchmarks game code that I didn’t benchmark because it contained bugs
  • Benchmark some older GHC versions

Thanks to Joachim Breitner for the idea for this blog post and his Haskell knowledge.


It feels somehow strange when you think you are alone with your post-pubescent problems and then realize that many people had the same problems and asked themselves the same questions decades ago. The man I happened to talk to today while birdwatching (and reading Stanislav Lem) was well over 70.

Music of the week – Plumes Ensemble

In this category of my blog, I briefly introduce a band each week whose music I listened to the most during that week.

The Plumes Ensemble plays a beautiful mixture of classical music and melodic indie pop. Like its lead singer Veronica Charnley, it comes from Montréal, Canada.

Here is the music video for the current album Plumes.

The ensemble played at the Nun Cafe in Karlsruhe a few weeks ago. That concert is how I became aware of them in the first place, but unfortunately I only learned about it on short notice and therefore couldn’t be there.

The Pumping Lemma – Poem by Harry Mairson

Any regular language L has a magic number p
And any long-enough word in L has the following property:
Amongst its first p symbols is a segment you can find
Whose repetition or omission leaves x amongst its kind.

So if you find a language L which fails this acid test,
And some long word you pump becomes distinct from all the rest,
By contradiction you have shown that language L is not
A regular guy, resilient to the damage you have wrought.

But if, upon the other hand, x stays within its L,
Then either L is regular, or else you chose not well.
For w is xyz, and y cannot be null,
And y must come before p symbols have been read in full.

As mathematical postscript, an addendum to the wise:
The basic proof we outlined here does certainly generalize.
So there is a pumping lemma for all languages context-free,
Although we do not have the same for those that are r.e.

There are some other poems by Harry Mairson. I found this poem while reading a discussion at The Old Joel on Software Forum.

Real life practice in lottery scheduling

I’m currently taking a course on operating systems at university. In this course we learn several scheduling algorithms. An operating system needs this kind of algorithm to determine which process to run at which time, so that the processes appear to execute „simultaneously“ (from the user’s point of view).

Good scheduling algorithms normally have some very nice properties:

  • They avoid starvation (where one process can’t run at all and therefore makes no progress).
  • All processes run for the appropriate percentage of time (defined by their priority) within each larger time interval.

But maybe they are useful not only for operating systems but also for people like me: perhaps they could help me improve the scheduling of my learning.

An algorithm that seems well suited for this purpose is called lottery scheduling (pdf). It gives each process some lottery tickets (their number represents the priority of the process). Every time the algorithm has to choose a new process to run, it simply picks a lottery ticket at random and returns the process that owns the ticket. A useful addition (in scenarios with only a few tickets) is to temporarily remove the picked tickets from the lottery pot and to put them back again when the pot is empty (see the sketch below).

But how could I use this algorithm in practice?

I take a stack of blank cards, each card worth around 2 hours of learning time, and assign them to my courses. E.g. the operating systems course gets 3 of the 25 cards that now make up my week.

Now, whenever I have time to learn, I pick a card from the shuffled stack, which then tells me to learn for about 2 hours, e.g. for my theoretical computer science course.

This „practical exercise“ in operating systems will hopefully boost my learning – and of course it’s a fun way to learn, as I never know what I’m going to learn next. Later, I could probably also use it to ensure that I blog more regularly here and make progress with my compiler project.
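
Putting the algorithm and the card stack together, a minimal Python sketch might look like this (the course names and ticket counts just mirror the example above; it is not meant as a proper scheduler implementation):

```python
import random

class LotteryScheduler:
    """Pick the next task proportionally to its number of tickets,
    drawing without replacement until the pot is empty, then refilling it."""

    def __init__(self, tickets):
        self.tickets = dict(tickets)  # task -> number of tickets (its priority)
        self.pot = []

    def _refill(self):
        self.pot = [task for task, n in self.tickets.items() for _ in range(n)]
        random.shuffle(self.pot)

    def next_task(self):
        if not self.pot:
            self._refill()
        return self.pot.pop()

# 25 cards per week, each worth about 2 hours of learning time.
scheduler = LotteryScheduler({
    "operating systems": 3,
    "theoretical computer science": 4,
    "other courses": 18,
})
print(scheduler.next_task())  # which course to learn for next
```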

Pasta with Asian-style sauce

Based on the (not yet published) basic recipe (but be sure to leave out the wine or vinegar).

Instead of two, use 4 freshly grated garlic cloves and sauté them in the pan with the onions right from the start. Add half of the chopped tomatoes to the sautéed onions and grated garlic. Heat this up strongly and add two tablespoons of coconut cream, two teaspoons of harissa spice, a heaped teaspoon of finely grated ginger and some lemon zest, together with the standard spices (pepper, salt, nutmeg). Cook this for about 2 minutes while stirring constantly. Then add the remaining chopped tomatoes, remove the lemon zest and let everything simmer over medium heat for 5 minutes.

When the pasta is done, add it to the sauce and let it rest over low heat for 2 minutes. Done.

Enjoy your meal.

P.S.: While eating this dish, I was surprised that it actually tastes good. It was another one of those „I’m craving something Asian today“ dishes that come about through improvisation.

P.P.S.: Don’t be afraid of the garlic – it loses its terror through the sautéing.

The current status of POOL

The POOL programming language is a cool project – it’s fun to write your own programming language that does everything the way you like. I’ll give a short status update for this project in this article, as I’m going to delay it.

The problem with this project, and the reason why I’m delaying it (maybe for some months or years), is that I realized I lack the knowledge needed to design and implement programming languages. I’ve never developed a real Turing-complete programming language before, and writing the real POOL implementation is such a huge project that it’s probably better to gain experience in this field before attempting it again. It’s a pity that I can’t implement it within the next couple of months, as I have been developing POOL in my mind for about half a year.

What I have programmed so far for the POOL project are two failed attempts at writing a grammar and a rough draft of a bytecode interpreter (aka virtual machine). I’ve spent dozens of days on these attempts and drafts – but there are no real results. And that’s the real reason why I’m delaying this project: I’m working hours and hours without any substantial progress – which is really demotivating – and the probability that the project therefore fails after hundreds of hours of development is high.

But what am I going to do next? I’m addicted to writing compilers, parsers and interpreters for unknown languages, but instead of starting another huge project like POOL, I’m going to start many little projects that don’t last that long.

One of these projects is OrangeLang, a cool project about a minimal language with a focus on the compiler. I’ll post an article on this blog when the project has more than a rough compiler draft. (That was also a difficulty with POOL: I wrote several articles and talked with people about it without having anything to show.)



I haven’t written anything substantial on this blog for quite a while. That doesn’t mean I have nothing to write about – quite the contrary: once I have some time, I’ll finally write about POOL again and publish a few recipes. The problem right now is really that I have very little time, because I have to study for an important exam in two weeks, after having studied for another one in the preceding weeks. When …



Me and my scooter at night

Me and my scooter, with the full moon reflected in its left rear-view mirror

Last night I took photographs on a hill in the beautiful Kraichgau. Equipped with a tripod, a camera and my scooter, I photographed under the full moon and experimented a bit. I will publish the results of this (and certainly of future) photo tours on this blog in the near future. By the way, my equipment is a Casio EX F1, a simple compact camera from Canon, and a simple tripod.