Suppose I ran a benchmark. Being thorough, I ran it several hundred times, but that just produces a big pile of numbers and we need a way to summarize the results concisely.
The naive way would be to report the average run time and perhaps the variance or standard deviation. This would allow us to assert that the benchmark takes, say, 10.5 ± 1.2 seconds. A reader would feel confident that the bulk of the benchmark runs would take somewhere between 9.3 and 11.7 seconds, and that far more runs would take close to 10.5 seconds than close to either extreme of that range. The mean (average) and the standard deviation are the characteristic statistics of a normal (Gaussian) distribution. Rather than give the results of hundreds of runs, I summarize by saying that the distribution is Gaussian and giving the characteristic statistics of the best-fitting Gaussian distribution.
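For concreteness, that naive summary is nothing more than the arithmetic mean and standard deviation of the raw times. A minimal sketch in Lisp (the function names and sample data are my own):

    ;; Sketch: the "naive" summary, arithmetic mean and standard deviation
    ;; (population form, for simplicity).
    (defun mean (samples)
      (/ (reduce #'+ samples) (length samples)))

    (defun standard-deviation (samples)
      (let ((m (mean samples)))
        (sqrt (/ (reduce #'+ samples
                         :key (lambda (x) (expt (- x m) 2)))
                 (length samples)))))

    ;; (mean '(9.3 10.5 11.7))               => 10.5
    ;; (standard-deviation '(9.3 10.5 11.7)) => ~0.98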
But what if the distribution isn't Gaussian? By reporting the statistics of the best-fitting Gaussian, at best I've reported something irrelevant; at worst, I've misled the reader about the nature of the distribution.
It turns out that program run times do not typically follow a Gaussian distribution. A number of different factors influence the run time of a program, and they combine multiplicatively. The resulting distribution is heavily skewed: most of the mass sits at the shorter times, but there is a long tail of longer run times. It will look a bit like this:
But a curious thing is going on here. If we plot the exact same curve but use a logarithmic scale for the X axis, we get this:
Our old friend the Gaussian distribution appears. Run times of computer programs typically follow a Gaussian distribution if you plot them on a logarithmic scale. This is called a log-normal distribution and it is one of the most frequently encountered distributions (after the Gaussian).
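To get a rough feel for how multiplicative factors produce a log-normal shape, here is a small simulation sketch. The base time, the number of factors, and their range are made-up parameters, not measurements; the point is only that the arithmetic mean of the simulated times tends to land above the median, which is the signature of that long right tail.

    ;; Sketch: each simulated run time is a base time multiplied by several
    ;; independent random factors (all parameters are illustrative only).
    (defun simulated-run-time (&optional (base 10.0) (n-factors 8))
      ;; multiply the base time by N-FACTORS factors, each uniform in [0.7, 1.4)
      (let ((time base))
        (loop repeat n-factors
              do (setf time (* time (+ 0.7 (random 0.7)))))
        time))

    (defun simulate (n-runs)
      (let* ((times  (sort (loop repeat n-runs collect (simulated-run-time)) #'<))
             (mean   (/ (reduce #'+ times) n-runs))
             (median (nth (floor n-runs 2) times)))
        ;; mean > median: the bulk of the runs is fast, the tail is long
        (values mean median)))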
When reporting a log-normal distribution, it makes sense to report the mean and standard deviation of the Gaussian that appears after you transform to a logarithmic scale. If you transform and take the mean, you end up with the geometric mean. If you transform and take the standard deviation, you end up with something called the geometric standard deviation. The latter is just a little tricky: instead of a margin of error that you add to or subtract from the mean, you have a factor by which you multiply or divide the geometric mean. For example, I might assert that a particular benchmark takes 10.5 ×/÷ 1.2 seconds, and you can interpret this as meaning that the majority of runs took between (/ 10.5 1.2) = 8.75 seconds and (* 10.5 1.2) = 12.6 seconds.
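Computing these is just the log-transform recipe described above: take the ordinary mean and standard deviation of the logarithms, then exponentiate. A minimal sketch (the function names are mine):

    ;; Geometric mean: exp of the arithmetic mean of the logs.
    (defun geometric-mean (samples)
      (exp (/ (reduce #'+ samples :key #'log)
              (length samples))))

    ;; Geometric standard deviation: exp of the standard deviation of the logs.
    ;; It is a dimensionless factor, reported as geometric-mean ×/÷ this value.
    (defun geometric-standard-deviation (samples)
      (let* ((n (length samples))
             (log-mean (/ (reduce #'+ samples :key #'log) n)))
        (exp (sqrt (/ (reduce #'+ samples
                              :key (lambda (x) (expt (- (log x) log-mean) 2)))
                      n)))))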
When you see a paper reporting the mean and standard deviation of some benchmark runs, you should be instantly suspicious. Benchmark runs don't usually follow a Gaussian distribution and it is likely that the author is unaware of this fact and is mindlessly applying irrelevant statistics. But if you see a paper where the geometric mean and geometric standard deviation are reported, you can have much more confidence that the author thought about what he or she was trying to say.
I find it particularly annoying when I see a benchmark harness that will run a benchmark several times and report the mean and standard deviation of the runs. The author of the harness is trying to be helpful by reporting some statistics, but the harness is likely reporting the wrong ones.
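If you happen to be writing such a harness, the fix is small: report the geometric statistics instead. Here is a sketch of what that might look like, using the functions defined above (the output format is just one possibility):

    ;; Sketch of a harness summary: geometric mean ×/÷ geometric standard deviation.
    (defun report-runs (times)
      (format t "~,2F ×/÷ ~,2F seconds over ~D runs~%"
              (geometric-mean times)
              (geometric-standard-deviation times)
              (length times)))

    ;; (report-runs '(9.1 9.8 10.2 10.6 11.4 13.9))
    ;; prints something like: 10.73 ×/÷ 1.14 seconds over 6 runs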