Playing with criterion

As a part of my Google Summer of Code project, I should take a look at criterion, a benchmark library for Haskell. As I’d have to use it, and it seems to be a useful library for a lot of people, I packaged it to Debian, with all its dependencies. The haskell-criterion package is waiting on the NEW queue.

Ok, so now I had to test it with something. I came up with a test related to my reading about the Haskell execution model. I noticed that I never have played with multi-threaded Haskell code to achieve better performance, but only to make things that should be in parallel work. So I thought this was a good oportunity to try both. I wrote this code:

import Control.Applicative ((<$>), (<*>))
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (evaluate)
import Criterion.Main (Benchmark, bench, bgroup, defaultMain, whnf, whnfIO)
import GHC.Conc (forkIO, par)
import System.Environment (getArgs)

value :: Int
value = 32

threadsL :: [Int]
threadsL = [1, 2, 3, 4, 8, 16, 32, 64]

main :: IO ()
main
= defaultMain
[bgroup "fib" $ [bench "fib" $ whnf fib value]
, bgroup "pfib" $ map pbench threadsL
, bgroup "ifib" $ map ibench threadsL]

pbench :: Int -> Benchmark
pbench i = bench (show i) $ whnf (pfib i) value

ibench :: Int -> Benchmark
ibench i = bench (show i) $ whnfIO $ ifib i value

fib :: Int -> Int
fib n
| n < 2 = n
| otherwise = fib (n – 1) + fib (n – 2)

pfib :: Int -> Int -> Int
pfib threads n
| n < 2 = n
| threads < 2 = fib n
| otherwise = par n1 $ seq n2 $ n1 + n2
where
n1 = pfib (threads – 1) $ n – 1
n2 = fib $ n – 2

ifib :: Int -> Int -> IO Int
ifib threads n
| n < 2 = evaluate n
| threads < 2 = evaluate $ fib n
| otherwise
= do
f1 <- newEmptyMVar
f2 <- newEmptyMVar
forkIO $ ifib (threads – 1) (n – 1) >>= putMVar f1
(+) <$> evaluate (fib $ n – 2) <*> takeMVar f1

My point is to benchmark the naïve implementation of the Fibonacci function. I’ve done three implementations:

fib
which is the simple sequential algorithm
pfib
which call par threads times
ifib
which call forkIO trheads times

I have chosen the value to be found to 32, because it took long enought to make the algorithms show a difference, and short enough to allow me to wait for the benchmarks. I built with:

ghc --make -threaded fib

And runned with:

./fib --summary=fib.csv +RTS -N

The results were surprising for me:

Name,Mean,MeanLB,MeanUB,Stddev,StddevLB,StddevUB
"fib/fib",1.7197746428132055,1.5104790505051613,2.031735854971409,1.2921598378638277,0.9514102096387,1.7188536150257019
"pfib/1",1.246479478704929,1.1805571945786482,1.3496261318802836,0.41570319371618925,0.29446316107198395,0.5823241217769125
"pfib/2",0.709402781355381,0.6850755818963056,0.7553855475068096,0.1650893842889147,0.10136189180677138,0.26853312094492815
"pfib/3",0.6876571162819864,0.6694172366738321,0.7332831247925761,0.136836636420341,5.5142145935712775e-2,0.2581805596992922
"pfib/4",0.6940202649712561,0.6684220250725745,0.746035416948795,0.17779646229098522,9.15862482237111e-2,0.2804203635281818
"pfib/8",0.5887631162285805,0.5625207790017128,0.637939236986637,0.17967127569396815,0.10516397607491562,0.2680488710546616
"pfib/16",0.6578513725876809,0.6107277926087382,0.7372924216866497,0.3070809683902726,0.20679305873839168,0.44571891526087154
"pfib/32",0.8900135167717932,0.8123333438515664,0.9942556460976597,0.45942103747727,0.36818132854060703,0.6233813540278427
"pfib/64",0.6894888337731359,0.6386897596001626,0.7766118057847021,0.33329115700425754,0.22490082647959764,0.5323965317477072
"ifib/1",1.3290104445099833,1.2501559790253634,1.4700443919777875,0.5277373265779313,0.35208278898795436,0.9179652474421286
"ifib/2",0.7626163347840308,0.7239342554688452,0.8381309326767921,0.26785654248385327,0.1679981089775395,0.44853454282169086
"ifib/3",0.7440206035256386,0.6907905062317847,0.8396386941552162,0.3549675945489581,0.22865300516283724,0.536883746220313
"ifib/4",0.6503480085015301,0.6130640967965126,0.711405030119419,0.239522235814317,0.1712396398224678,0.3616636557596486
"ifib/8",0.6300301631569865,0.6085795434594155,0.699028070795536,0.17776228649867923,6.103918021418795e-2,0.39508078831864085
"ifib/16",0.6735970553040505,0.6393461784005164,0.7405966480851172,0.2366561170051713,0.13725890735483465,0.38273830961617655
"ifib/32",0.6162510784745214,0.5983195599198344,0.6736035832047466,0.14678572822644564,5.335959472427259e-2,0.3231734735693349
"ifib/64",0.7217859562516212,0.6820346101403237,0.7857911285042766,0.2534368199920431,0.18073576275895306,0.37636526159654315

I’m only analyzing the first column.

How come pfib/1 and ifib/1, that is, pfib and ifib called with just one thread, got faster than fib/fib? All they do is calling fib! There’s certainly something wrong here. So I reran the tests in a single user environment (recovery-mode on GRUB). And also learned that I should not trust blindly these results in my usual user environment. The new tests were also surprising:

Name,Mean,MeanLB,MeanUB,Stddev,StddevLB,StddevUB
"fib/fib",0.4979443445852824,0.4978961411169597,0.49802434593949996,3.1186229071682967e-4,2.1763919616793833e-4,4.8569259617552764e-4
"pfib/1",0.4980979338339395,0.4980578795126507,0.49816361099992496,2.58930038115918e-4,1.7011790805380493e-4,4.0161943492611696e-4
"pfib/2",0.30187882572923386,0.3018276229551859,0.3019332328489849,2.7088498930449747e-4,2.3482715228451473e-4,3.21989842698397e-4
"pfib/3",0.3025411692312786,0.3024852743795938,0.3026018276861736,2.975260476849551e-4,2.6907713057828055e-4,3.3200184242397705e-4
"pfib/4",0.3045588889769145,0.3024707928350993,0.31463312775407504,2.0236423610294995e-2,3.5108499279216634e-4,4.8255009076253054e-2
"pfib/8",0.27702674538408,0.27158569247041425,0.2846977105787821,3.2790225595837955e-2,2.5215205278321072e-2,4.6554023411818295e-2
"pfib/16",0.26909642369065967,0.2650135436705178,0.2758063927343913,2.615724515115467e-2,1.8039283932648264e-2,4.0956608339390285e-2
"pfib/32",0.2726816359213419,0.26952718407426557,0.27636309534822184,1.7444576031246224e-2,1.4227057669027803e-2,2.2935079362453085e-2
"pfib/64",0.28010295302186694,0.27646669537339885,0.28658743054185587,2.4301648337333875e-2,1.3360828345387364e-2,3.9546300072208655e-2
"ifib/1",0.4984497705153056,0.4984218493155071,0.4984962406805583,1.8040276986757106e-4,1.242720342516858e-4,2.870577387900989e-4
"ifib/2",0.3603615990332194,0.34868544012818997,0.3741170039824078,6.496311767153333e-2,5.652406802832432e-2,7.448312251646773e-2
"ifib/3",0.31676718384538366,0.3109451881102153,0.32481522709642136,3.4887028192202074e-2,2.744363202200484e-2,4.8060797483303504e-2
"ifib/4",0.31541500956330976,0.30635046393190096,0.32643843323503213,5.109215651574664e-2,4.339074296796736e-2,5.995395740457435e-2
"ifib/8",0.32055945069108693,0.3132960501364299,0.3302640953711101,4.2886292109838166e-2,3.422708481199234e-2,5.4196317284934535e-2
"ifib/16",0.32268204123292654,0.31607536703859057,0.33141186625276287,3.858016286486453e-2,3.0959817731033304e-2,4.801572113400926e-2
"ifib/32",0.32020341069017116,0.3151469627073833,0.32664997250352584,2.92266723352992e-2,2.3352577028246314e-2,3.5913110786596904e-2
"ifib/64",0.3183670273474284,0.3136345234564373,0.3248856058767864,2.834147907711714e-2,2.1803978149062524e-2,3.8088053144371074e-2

I didn’t thought the improvement in performance on a single-user would be so dramatic. The execution of the program in a single-user environment used 41% of the time to run in my usual environment. About the comparative results, they were more close to what I’d expect than the old ones. fib/fib, pfib/1 and ifib/1 had a very similar execution time. What surprised me was that I was expecting to have the best result in the benchmarks with 2 threads, since I ran the tests in a Core 2 Duo T5750 @ 2.00 GHz with 2 cores. But in pfib the execution time has reduced from 4 threads to 8, and in ifib, from 2 to 3. I can’t think of where these improvements come from.

Unfortunately the creation of charts is currently broken in GHC 6.12, so no graphics by now.

Uma resposta to “Playing with criterion”

  1. Twan van Laarhoven Says:

    You are missing a “-O2” in your invocation of Ghc, so these tests are run without optimizations.

    The reason why more than 2 threads could be helpful is because the workloads are not balanced. With two threads, one does 1.618 times as much work as the others, since one is calculating “fib 30” and the other “fib 31”. With 4 threads one core will do only 1.118 times as much work as the other, it does “fib 28” and “fib 30”, while the other does “fib 29” twice.

Deixe uma resposta

Preencha os seus dados abaixo ou clique em um ícone para log in:

Logotipo do WordPress.com

Você está comentando utilizando sua conta WordPress.com. Sair / Alterar )

Imagem do Twitter

Você está comentando utilizando sua conta Twitter. Sair / Alterar )

Foto do Facebook

Você está comentando utilizando sua conta Facebook. Sair / Alterar )

Foto do Google+

Você está comentando utilizando sua conta Google+. Sair / Alterar )

Conectando a %s


%d blogueiros gostam disto: