São Bento sitiado

28/6/2010 (segunda-feira)

http://renatinharenault.blogspot.com/2010/06/bairro-sao-bento-em-estado-de-sitio.html

Anúncios

Immix on GHC Summer of Code weekly report #6

18/6/2010 (sexta-feira)

My project.

This post assumes that the reader has read my last post.

I started the change in the line representation. Instead of a single linked list of free lines, it’s better to work with line groups, to avoid fragmentation inside lines in allocation. I’ve create a struct to represent this line group, with a pointer to the next group, and the size, in lines, of the group. It is stored in includes/rts/storage/GC.h currently, but I’m not sure if this is the best place to put it, and I’m thinking about changing it latter.

#include "rts/OSThreads.h"

typedef struct line_ {
    struct line_ *next;
    StgWord size;
} line;

/* -----------------------------------------------------------------------------

Then we need to change the code of liberation of lines in rts/sm/Sweep.c.

            else if(!(bd->flags & BF_MEDIUM)) {
                StgBool sequence;
                sequence = 0;

                for(i = 1; i < BLOCK_SIZE_W / BITS_IN(W_); i++) {
                    StgPtr start;
                    start = bd->start + BITS_IN(W_) * i;
                    if(bd->u.bitmap[i] == 0 && bd->u.bitmap[i - 1] == 0 &&
                       start + BITS_IN(W_) <= bd->free) {
                        printf("DEBUG: line_found(%p)\n", start); fflush(stdout);
                        if(gen->first_line == NULL) {
                            gen->first_line = (line *) start;
                        }
                        if(sequence) {
                            last_line->size++;
                        }
                        else {
                            if(last_line != NULL) {
                                last_line->next = (line *) start;
                            }
                            last_line = (line *) start;
                            last_line->next = NULL;
                            last_line->size = 1;
                        }
                        sequence = 1;
                    }
                    else {
                        sequence = 0;
                    }

I’ve tested this code using the same technique as before, checking the produced list trasversing it after it’s done. The list obtained showed the correct lines.

Now I’m back to the allocation code, which never worked. The changes in rts/sm/Evac.c are straightfoward.

        if (gen->first_line != NULL &&
            size <= BITS_IN(W_) * gen->first_line->size) {
            ws->todo_free = (StgPtr) gen->first_line;
            ws->todo_lim = ws->todo_free + BITS_IN(W_) * gen->first_line->size;
            gen->first_line = gen->first_line->next;
            to = ws->todo_free;
        }

This cause the same kind of errors I was getting before. I should go back to debugging. My co-supervisor in my Oriented Project in Computer Science suggested me using valgrind. I’ll try it.


Last week I forgot to mention I’ve presented my final presentation about my Oriented Project in Computer Science. This week I finished writing my monograph.

Summer of Code weekly report #5

15/6/2010 (terça-feira)

My project.

This post assumes that the reader has read my last post.

This was a week full of segfaults, failed assertions, changes in the user data and other crazy stuff. I started working on the allocation of memory on the freed lines. In the first glance, it seemed to be much harder than the small changes I’ve done to free memory in lines, because of the a never-seen-before-by-me datatypes being dealed with in alloc_for_copy(), from rts/sm/Evac.c: gen_workspace. But in a short time the code of alloc_for_copy() has shown itself to be simple, and it was easy to do an initial implementation. I went in the same path I was going on before, trying the simplest solution that works. For instance, the line representation used is not ideal, because groups of lines are not considered, but it was the simplest at that time, and I plan to improve it latter.

The gen_workspace data type contains a pointer to an area of a block that is not being used (todo_free) and a pointer to the end of that area (todo_lim). When a space for an object is requested, it’s allocated in this area, and todo_free is adjusted. If the area is full, a new block is requested. My intended change was to return the first free line of the generation when the object is smaller or equal to the size of a line, and use the current approach otherwise. This is the simplest way I could think, but has the problem that the only one object is allocated per line. This was a known issue.

    if(size <= BITS_IN(W_) && gen->first_line != NULL) {
        to = gen->first_line;
        gen->first_line = (StgPtr) *gen->first_line;
        return to;
    }

    ws = &gct->gens[gen->no];

After that I got the first round of segfaults. One problem I could spot so far was in the code in rts/sm/Sweep.c, related to the liberation of the lines. The block is allocated by need, and there’s a pointer to the first free byte, free. This free space is used by alloc_for_copy(). So, we should only think of lines in the region already allocated in the block, that is, where the address is smaller than free.

                    if(bd->u.bitmap[i] == 0 && bd->u.bitmap[i - 1] == 0 &&
                       start + BITS_IN(W_) <= bd->free)

But the segfaults, memory corruptions, etc, kept going on. I tried some restrictions, like allocating in lines only objects with the exact size of a line, or very small objects (size == 2), or allocating just one object for generation, or only one object at all. I also tried checking for the type of the allocated object, to see a correlation with the problems I was having. Nothing helped by now.

So I thought I should work on other things, so that maybe my mind gets clearer and I can spot the problem. I followed the suggestion of using todo_free and todo_lim to make it possible to allocate more than one object in a line. I liked this change, since I had the impression that it fits better with the rest of the code than the initial implementation, and it will be easier to addapt it when I improve the representation of the free lines. As I said, I wanted something to work while I think in how I’ll solve the segfault problems. The bad side of this choice is that I’ll not be able to test it, since it’s just a reimplementation of the same idea in the allocation code, and is not expected to work. The good size is that it may work, and solve the problems I was having by accident.

    if (ws->todo_free > ws->todo_lim) {
        if (size <= BITS_IN(W_) && gen->first_line != NULL) {
            ws->todo_free = gen->first_line;
            gen->first_line = (StgPtr) *gen->first_line;
            to = ws->todo_free;
            ws->todo_lim = ws->todo_free + BITS_IN(W_);
        } else {
            to = todo_block_full(size, ws);
        }
    }

I got the same results as with the older implementation, and no ideas about how to solve it (yet). So I thought about another thing to work on, which I’m talking about all through this post: improve the free line representation. This has the advantage of being testable, since it’s unrelated to the allocation code, and may give me an idea about how I can fix the segfaults. The bad side is that it probably won’t fix my problems directly. I’m working on it right now.


Following a suggestion, I’ve started using gdb to see what was happening in the GC instead of including printfs everywhere, and it’s been useful. I’ve noticed that sometimes printf is more effective, but sometimes gdb is much better.

Licença maternidade na Prefeitura de Belo Horizonte

11/6/2010 (sexta-feira)

Meu nome é Marco Túlio e passei por uma situação com a Prefeitura de Belo Horizonte que gostaria de relatar. Minha filha nasceu no dia 12 de janeiro de 2010 e minha esposa, professora da rede municipal, teria direito a 4 meses de licença. Tendo em vista que a Organização Mundial de Saúde (OMS) aconselha que a criança se alimente apenas do leite materno até os 6 meses, minha esposa entrou com um pedido na Prefeitura, para que sua licença fosse extendida para 6 meses. Naquele momento, o empregador poderia legalmente escolher se daria à funcionária 4 ou 6 meses de licença. A Prefeitura negou o pedido. Minha esposa, então, entrou com um processo junto ao sindicato contra a Prefeitura, demandando a licença de 6 meses. O juiz concedeu a liminar, e ficamos tranquilos com a notícia. Quando soubemos que, no dia 8 de março, Dia Internacional da Mulher, o Prefeito Márcio Lacerda assinou um projeto de lei que ampliava a licença para 6 meses, ficamos ainda mais tranquilos.

Apesar disso, no dia 9 de junho de 2010, recebemos um telefonema de uma funcionária da Regional, dizendo que minha esposa deveria retornar às suas atividades imediatamente. Como achávamos que ela voltaria ao trabalho somente a partir do dia 12 de julho, não começamos ainda a dar outros alimentos à nossa filha. Estamos alimentando-a apenas com o leite materno, como recomendado pela OMS.

A adaptação para que a criança comece a ingerir outros alimentos leva um certo tempo, e é muito difícil fazer isso de um dia para o outro. Levando isso em consideração e com um atestado médico que pedia 30 dias de licença para que minha esposa pudesse fazer essa adaptação, fomos à junta médica da Prefeitura, pedindo uma licença para acompanhamento. Os funcionários da Prefeitura tem direito a 30 dias dessa licença por ano, para acompanhar familiares doentes. O pedido foi negado, com a justificativa de que essa licença só se aplica em casos de doença. Nessa ida à junta médica, me impressionou quantas cópias haviam da notícia do projeto de lei assinado pelo Prefeito Márcio Lacerda, pregadas nas paredes em todos os lugares.

Além do problema da amamentação, temos que providenciar um lugar para deixar nossa filha, de 5 meses, de uma hora para outra. Estamos procurando alternativas, mas a creche com berçário mais próxima, a Unidade Municipal de Educação Infantil Alaíde Lisboa, localizada na UFMG, não tem vagas. Uma funcionária disse que, para atender toda a fila de espera, seriam necessárias 3 escolas.

O que mais me incomodou nessa história foi a decisão de recorrer da liminar que extendeu a licença para 6 meses. Se o próprio Prefeito Márcio Lacerda já tinha ampliado a licença, por que recorrer contra quem conseguiu essa ampliação na justiça? Tentando tirar essa dúvida, liguei para o gabinete do Promotor Saulo Converso Lara (3277-4029), da Procuradoria Geral do Município, mas não consegui falar com ele. Expliquei o que havia acontecido para uma servidora municipal que atendeu o telefone, e ela disse que o Promotor Saulo Converso Lara age em interesse do município, independentemente do contexto. Quando perguntei qual era o interesse do município nesse caso específico, a servidora disse que precisava desligar o telefone. Tentei perguntar novamente, dizendo que ela só estava desligando porque não conseguia responder essa pergunta. Ela me disse que não era obrigada a dar satisfações e desligou na minha cara. Infelizmente, não perguntei o nome da servidora. Tentei ligar novamente, para falar diretamente com o Promotor Saulo Converso Lara mas, ao ouvir a minha voz, a servidora desligou na minha cara novamente.

Summer of Code weekly report #4

4/6/2010 (sexta-feira)

My project.

I’m publishing my report earlier this week, because there was a lot to talk about. This week I started to make changes in the code going in the direction of what I want to do. I haven’t started a final implementation, but I’m studying about how what I want to do will affect the rest of the Garbage Collection (GC). I noticed the code in rts/sm/Sweep.c was simple and similar to what I’m planning to do, so I started changing how it works.

Sweep in the Glasgow Haskell Compiler (GHC) is done by a bitmap, which contains a bit for each word in a memory block, and is set to 1 when there’s an object starting in the mapped area and 0 otherwise. When there’s a block with no objects starting at it, that is, all bits of the bitmap are set to 0, the block is freed.

        if (resid == 0)
        {
            freed++;
            gen->n_old_blocks--;
            if (prev == NULL) {
                gen->old_blocks = next;
            } else {
                prev->link = next;
            }
            freeGroup(bd);
        }

The bits are analyzed in a group of BITS_IN(W_), where BITS_IN(W_) is the number of bits in a word.

        for (i = 0; i < BLOCK_SIZE_W / BITS_IN(W_); i++)
        {
            if (bd->u.bitmap[i] != 0) resid++;
        }

If more than ¼ of the groups are completely set to 0, the block is considered fragmented.

            if (resid < (BLOCK_SIZE_W * 3) / (BITS_IN(W_) * 4)) {
                fragd++;
                bd->flags |= BF_FRAGMENTED;
            }

Immix, the GC algorithm I plan to implement in GHC, divides the blocks of memory in lines. My initial plan was to identify free lines. I decided to consider a the size of a line fixed in BITS_IN(W_) words, because it will map to a word in the bitmap, and the code was already analyzing in groups of BITS_IN(W_) words. This was very easy with the current code.

            if (bd->u.bitmap[i] != 0) resid++;
            else printf("DEBUG: line_found(%p)\n", bd->start + BITS_IN(W_) * i);

This worked, and showed some free lines. I’m sure there are other ways of logging in GHC, but printf was the simplest way I could thought of. I measured the occurrence of free lines using the bernouilli program from the NoFib benchmark suite, calling it with 500 +RTS -w, to make it uses sweep. In 782 calls to GarbageCollect(), sweep() was called 171 times, for 41704 blocks to be swept and found 230461 free lines. This gives us about 5.5 free lines per block, from the 8 lines in each block (on 64 bits systems).

The problem is that the bitmap is marked only in the start of the objects allocated, so even in a line that all bits are marked with 0 we can’t assume that it’s completely free, because there may be an object that starts in the previous line that is using the space of the line. Checking only for the previous line doesn’t work either, because a big object can span several lines. What we can do here is a variation of conservative marking, as proposed in the Immix paper, checking only the previous line and working only with objects smaller than a line.

To make sure I was working only with objects smaller than a line, I had to mark the blocks that contains medium objects and avoid them when seeking free lines. The block flags are defined in includes/rts/storage/Block.h, so I included another flag in this file, BF_MEDIUM.

/* Block contains objects evacuated during this GC */
#define BF_EVACUATED 1
/* Block is a large object */
#define BF_LARGE     2
/* Block is pinned */
#define BF_PINNED    4
/* Block is to be marked, not copied */
#define BF_MARKED    8
/* Block is free, and on the free list  (TODO: is this used?) */
#define BF_FREE      16
/* Block is executable */
#define BF_EXEC	     32
/* Block contains only a small amount of live data */
#define BF_FRAGMENTED 64
/* we know about this block (for finding leaks) */
#define BF_KNOWN     128
/* Block contains objects larger than a line */
#define BF_MEDIUM    256

The GHC GC is generational, there is, objects are allocated in a generation and, after a time, the ones that are still being used are moved the next generation. This idea assumes the death probability of younger objects is higher, so few objects are moved to the next generation. Sweep and Immix work only in the last generation so, to mark blocks with medium objects we have to check the size of the objects that are moved to the next generation.

This is done in the copy_tag function of rts/sm/Evac.c. I inserted a code that checks for the object size and marks the block when it’s bigger than BITS_IN(W_).

STATIC_INLINE GNUC_ATTR_HOT void
copy_tag(StgClosure **p, const StgInfoTable *info, 
         StgClosure *src, nat size, generation *gen, StgWord tag)
{
    StgPtr to, from;
    nat i;

    to = alloc_for_copy(size,gen);

    if(size > 8) {
        Bdescr(to)->flags |= BF_MEDIUM;
    }

So I updated the code in rts/sm/Sweep.c to only inspect for free lines in blocks without BF_MEDIUM mark.

            if (bd->u.bitmap[i] != 0) resid++;
            else if(!(bd->flags & BF_MEDIUM)) {
                printf("DEBUG: line_found(%p)\n", bd->start + BITS_IN(W_) * i);
            }

This also worked. Now, in the 32012 blocks there were 189015 free lines, found in the same number of GCs, making about 5.9 free lines per block. We considered only blocks with small objects, but we didn’t ignore the first line of each group of free lines. This can be achieved by checking if the previous line was also free.

            if (bd->u.bitmap[i] != 0) resid++;
            else if(!(bd->flags & BF_MEDIUM) && i > 0 && bd->u.bitmap[i] == 0) {
                printf("DEBUG: line_found(%p)\n", bd->start + BITS_IN(W_) * i);
            }

Now, from the 32239 blocks, 165547 free lines were found, giving 5.1 free lines per block. But there are more things to improve. If the whole block is free, we want to free it, instead of marking it’s lines as free. So it’s better to mark the lines after we know that the blocks are not completely free. So I left the code that checks the bitmap as it was, and included a line check only for blocks that are not completely free. At this point, I also associated the fragmentation test with blocks with medium objects, because in blocks of small objects we plan to allocate in free lines, so fragmentation is not a (big) issue.

            if (resid < (BLOCK_SIZE_W * 3) / (BITS_IN(W_) * 4) &&
                (bd->flags & BF_MEDIUM)) {
                fragd++;
                printf("DEBUG: BF_FRAGMENTED\n");
                bd->flags |= BF_FRAGMENTED;
            }
            else if(!(bd->flags & BF_MEDIUM)) {
                for(i = 1; i < BLOCK_SIZE_W / BITS_IN(W_); i++)
                {
                    if(bd->u.bitmap[i] == 0 && bd->u.bitmap[i - 1] == 0) {
                        printf("DEBUG: line_found(%p)\n", bd->start + BITS_IN(W_) * i);
                    }
                }
            }

The total ammount of blocks increased dramatically: the blocks that become fragmented and were not called again in sweep made a huge difference. From the 345143 blocks, 1633268 free lines were found, or about 4.7 free lines per block. 9434 blocks were free, so, from the remaining blocks, we have about 4.9 free lines per block.

Something we’ll need then is a way to access these lines latter. The simplest way I thought to achieve it is constructing a list of lines, in each the first word of each free line is a pointer to the next free line, and the first word of the last free line is 0. It’s useful to keep reporting the lines to stdout, so that we can then follow the list and check if we went to the same lines.

                    if(bd->u.bitmap[i] == 0 && bd->u.bitmap[i - 1] == 0) {
                        StgPtr start = bd->start + BITS_IN(W_) * i;
                        printf("DEBUG: line_found(%p)\n", start);
                        if(line_first == NULL) {
                            line_first = start;
                        }
                        if(line_last != NULL) {
                            *line_last = (StgWord) start;
                        }
                        line_last = start;
                        *line_last = 0;
                    }
                }
            }
        }
    }

    for(line_last = line_first; line_last; line_last = (StgPtr) *line_last) {
        fprintf(stderr, "DEBUG: line_found(%p)\n", line_last);
    }

I printed the inclusion of the lines on the list to stdout, and the walking on the list in stderr, so that it’d be easy to diff. There was no difference between the lists.

There’re another improvements that can be made, like using a list of groups of free lines, but I think it’ll be better to think about this after studying how the allocation in the free lines will be done. That’s where I’m going to now.


There are some minor things I learned, and thought they worth blogging. The current GHC uses three strategies for collecting the last generation: copying, mark-compact and mark-sweep. Copying is the default until the memory reaches 30% of the maximum heap size; after that, mark-compact is used. Sweep can be chosen by a Real Time System (RTS) flag, -w. To use mark-compact always, the flag is -c.

I’ve been submitting small patches to the cvs-ghc mailling list, mostly about outdated comments. Most of them were accepted, except for one which contained a lot of commentary, and that indeed was not completely correct. I corrected it and resend to the list, but the message is waiting for approval because the message header matched a filter rule. I believe this is because I replied the message generated from darcs.

There’s a very useful ghc option, specially for testing the compiler, because in this case you need to rebuild the source, even when there’re no changes in it. It’s -fforce-recomp, and it makes only sense when used with --make.

Summer of Code weekly report #3

31/5/2010 (segunda-feira)

My project

This week I followed an advice of my mentor, Simon Marlow, and started to play with the code including print statements in interesting places in order to understand it’s flow. It’s indeed much better than just reading the source code and the documentation on the wiki. As usual, I had a lot of doubts about some parts of the code, and noted them down. But now, I’m seeing that I’m resolving most of them just by playing more with the code, without need to ask in the mailing list. There are still some left, and I’m planning to send them to the list in a next oportunity.

The first thing I noticed was the high number of objects, from my perspective, that are created and collected in:

main = return ()

I measured how much times evacuate() was called. This function moves an object to the next generation of the Garbage Collector. It’s called 775 times, in 1 call to GarbageCollect(), with 479 different objects (with different pointers to their Info Tables). Also, there were pinned objects and large objects in this count. This was surprising to me.

I also made a small draw of the memory allocation in lines using diagrams. The draw:

The source code:

import Prelude hiding (tan)
import Control.Monad
import Data.Colour.RGBSpace
import Data.List
import Graphics.Rendering.Diagrams
import System.Random

objectMaxLength :: Double
objectMaxLength = 50

objectMinLength :: Double
objectMinLength = 10

nObjects :: Int
nObjects = 10

main :: IO ()
main
= do
— objects <- replicateM nObjects $ randomRIO (objectMinLength, objectMaxLength)
let objects = [18,50,12,46,16,41,14,30,49,23]
renderAs PNG "mem.png" Auto
$ vsep
10
[text 40 "(1)" <> continuous objects
, text 40 "(2)" <> lineByLine objects
, text 40 "(3)" <> allocated]

continuous :: [Double] -> Diagram
continuous = memory colors

lineByLine :: [Double] -> Diagram
lineByLine = memory (intersperse white colors ++ [white]) . objectsLine

allocated :: Diagram
allocated
= memory
(intersperse white colors ++ [white])
[18, 82, 50, 0, 12, 0, 46, 0, 16, 0, 41, 35, 14, 0, 30, 0, 49, 7, 23]

objectsLine :: [Double] -> [Double]
objectsLine [] = []
objectsLine (x : xs) = x : objectMaxLength – x : objectsLine xs

memory :: Color color => [color] -> [Double] -> Diagram
memory colours objects
= unionA
top
left
[hcat
$ map (uncurry fc)
$ zip (cycle colours)
$ map (lw 0 . flip rect objectMaxLength) objects
, lines_]

lines_ :: Diagram
lines_ = hcat $ replicate nObjects $ rect objectMaxLength 50

colors :: (Ord colour, Floating colour) => [Colour colour]
colors
= [red, green, blue, yellow, magenta, cyan
, darkred, darkgreen, darkblue, darkmagenta, darkcyan]

Playing with criterion

23/5/2010 (domingo)

As a part of my Google Summer of Code project, I should take a look at criterion, a benchmark library for Haskell. As I’d have to use it, and it seems to be a useful library for a lot of people, I packaged it to Debian, with all its dependencies. The haskell-criterion package is waiting on the NEW queue.

Ok, so now I had to test it with something. I came up with a test related to my reading about the Haskell execution model. I noticed that I never have played with multi-threaded Haskell code to achieve better performance, but only to make things that should be in parallel work. So I thought this was a good oportunity to try both. I wrote this code:

import Control.Applicative ((<$>), (<*>))
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (evaluate)
import Criterion.Main (Benchmark, bench, bgroup, defaultMain, whnf, whnfIO)
import GHC.Conc (forkIO, par)
import System.Environment (getArgs)

value :: Int
value = 32

threadsL :: [Int]
threadsL = [1, 2, 3, 4, 8, 16, 32, 64]

main :: IO ()
main
= defaultMain
[bgroup "fib" $ [bench "fib" $ whnf fib value]
, bgroup "pfib" $ map pbench threadsL
, bgroup "ifib" $ map ibench threadsL]

pbench :: Int -> Benchmark
pbench i = bench (show i) $ whnf (pfib i) value

ibench :: Int -> Benchmark
ibench i = bench (show i) $ whnfIO $ ifib i value

fib :: Int -> Int
fib n
| n < 2 = n
| otherwise = fib (n – 1) + fib (n – 2)

pfib :: Int -> Int -> Int
pfib threads n
| n < 2 = n
| threads < 2 = fib n
| otherwise = par n1 $ seq n2 $ n1 + n2
where
n1 = pfib (threads – 1) $ n – 1
n2 = fib $ n – 2

ifib :: Int -> Int -> IO Int
ifib threads n
| n < 2 = evaluate n
| threads < 2 = evaluate $ fib n
| otherwise
= do
f1 <- newEmptyMVar
f2 <- newEmptyMVar
forkIO $ ifib (threads – 1) (n – 1) >>= putMVar f1
(+) <$> evaluate (fib $ n – 2) <*> takeMVar f1

My point is to benchmark the naïve implementation of the Fibonacci function. I’ve done three implementations:

fib
which is the simple sequential algorithm
pfib
which call par threads times
ifib
which call forkIO trheads times

I have chosen the value to be found to 32, because it took long enought to make the algorithms show a difference, and short enough to allow me to wait for the benchmarks. I built with:

ghc --make -threaded fib

And runned with:

./fib --summary=fib.csv +RTS -N

The results were surprising for me:

Name,Mean,MeanLB,MeanUB,Stddev,StddevLB,StddevUB
"fib/fib",1.7197746428132055,1.5104790505051613,2.031735854971409,1.2921598378638277,0.9514102096387,1.7188536150257019
"pfib/1",1.246479478704929,1.1805571945786482,1.3496261318802836,0.41570319371618925,0.29446316107198395,0.5823241217769125
"pfib/2",0.709402781355381,0.6850755818963056,0.7553855475068096,0.1650893842889147,0.10136189180677138,0.26853312094492815
"pfib/3",0.6876571162819864,0.6694172366738321,0.7332831247925761,0.136836636420341,5.5142145935712775e-2,0.2581805596992922
"pfib/4",0.6940202649712561,0.6684220250725745,0.746035416948795,0.17779646229098522,9.15862482237111e-2,0.2804203635281818
"pfib/8",0.5887631162285805,0.5625207790017128,0.637939236986637,0.17967127569396815,0.10516397607491562,0.2680488710546616
"pfib/16",0.6578513725876809,0.6107277926087382,0.7372924216866497,0.3070809683902726,0.20679305873839168,0.44571891526087154
"pfib/32",0.8900135167717932,0.8123333438515664,0.9942556460976597,0.45942103747727,0.36818132854060703,0.6233813540278427
"pfib/64",0.6894888337731359,0.6386897596001626,0.7766118057847021,0.33329115700425754,0.22490082647959764,0.5323965317477072
"ifib/1",1.3290104445099833,1.2501559790253634,1.4700443919777875,0.5277373265779313,0.35208278898795436,0.9179652474421286
"ifib/2",0.7626163347840308,0.7239342554688452,0.8381309326767921,0.26785654248385327,0.1679981089775395,0.44853454282169086
"ifib/3",0.7440206035256386,0.6907905062317847,0.8396386941552162,0.3549675945489581,0.22865300516283724,0.536883746220313
"ifib/4",0.6503480085015301,0.6130640967965126,0.711405030119419,0.239522235814317,0.1712396398224678,0.3616636557596486
"ifib/8",0.6300301631569865,0.6085795434594155,0.699028070795536,0.17776228649867923,6.103918021418795e-2,0.39508078831864085
"ifib/16",0.6735970553040505,0.6393461784005164,0.7405966480851172,0.2366561170051713,0.13725890735483465,0.38273830961617655
"ifib/32",0.6162510784745214,0.5983195599198344,0.6736035832047466,0.14678572822644564,5.335959472427259e-2,0.3231734735693349
"ifib/64",0.7217859562516212,0.6820346101403237,0.7857911285042766,0.2534368199920431,0.18073576275895306,0.37636526159654315

I’m only analyzing the first column.

How come pfib/1 and ifib/1, that is, pfib and ifib called with just one thread, got faster than fib/fib? All they do is calling fib! There’s certainly something wrong here. So I reran the tests in a single user environment (recovery-mode on GRUB). And also learned that I should not trust blindly these results in my usual user environment. The new tests were also surprising:

Name,Mean,MeanLB,MeanUB,Stddev,StddevLB,StddevUB
"fib/fib",0.4979443445852824,0.4978961411169597,0.49802434593949996,3.1186229071682967e-4,2.1763919616793833e-4,4.8569259617552764e-4
"pfib/1",0.4980979338339395,0.4980578795126507,0.49816361099992496,2.58930038115918e-4,1.7011790805380493e-4,4.0161943492611696e-4
"pfib/2",0.30187882572923386,0.3018276229551859,0.3019332328489849,2.7088498930449747e-4,2.3482715228451473e-4,3.21989842698397e-4
"pfib/3",0.3025411692312786,0.3024852743795938,0.3026018276861736,2.975260476849551e-4,2.6907713057828055e-4,3.3200184242397705e-4
"pfib/4",0.3045588889769145,0.3024707928350993,0.31463312775407504,2.0236423610294995e-2,3.5108499279216634e-4,4.8255009076253054e-2
"pfib/8",0.27702674538408,0.27158569247041425,0.2846977105787821,3.2790225595837955e-2,2.5215205278321072e-2,4.6554023411818295e-2
"pfib/16",0.26909642369065967,0.2650135436705178,0.2758063927343913,2.615724515115467e-2,1.8039283932648264e-2,4.0956608339390285e-2
"pfib/32",0.2726816359213419,0.26952718407426557,0.27636309534822184,1.7444576031246224e-2,1.4227057669027803e-2,2.2935079362453085e-2
"pfib/64",0.28010295302186694,0.27646669537339885,0.28658743054185587,2.4301648337333875e-2,1.3360828345387364e-2,3.9546300072208655e-2
"ifib/1",0.4984497705153056,0.4984218493155071,0.4984962406805583,1.8040276986757106e-4,1.242720342516858e-4,2.870577387900989e-4
"ifib/2",0.3603615990332194,0.34868544012818997,0.3741170039824078,6.496311767153333e-2,5.652406802832432e-2,7.448312251646773e-2
"ifib/3",0.31676718384538366,0.3109451881102153,0.32481522709642136,3.4887028192202074e-2,2.744363202200484e-2,4.8060797483303504e-2
"ifib/4",0.31541500956330976,0.30635046393190096,0.32643843323503213,5.109215651574664e-2,4.339074296796736e-2,5.995395740457435e-2
"ifib/8",0.32055945069108693,0.3132960501364299,0.3302640953711101,4.2886292109838166e-2,3.422708481199234e-2,5.4196317284934535e-2
"ifib/16",0.32268204123292654,0.31607536703859057,0.33141186625276287,3.858016286486453e-2,3.0959817731033304e-2,4.801572113400926e-2
"ifib/32",0.32020341069017116,0.3151469627073833,0.32664997250352584,2.92266723352992e-2,2.3352577028246314e-2,3.5913110786596904e-2
"ifib/64",0.3183670273474284,0.3136345234564373,0.3248856058767864,2.834147907711714e-2,2.1803978149062524e-2,3.8088053144371074e-2

I didn’t thought the improvement in performance on a single-user would be so dramatic. The execution of the program in a single-user environment used 41% of the time to run in my usual environment. About the comparative results, they were more close to what I’d expect than the old ones. fib/fib, pfib/1 and ifib/1 had a very similar execution time. What surprised me was that I was expecting to have the best result in the benchmarks with 2 threads, since I ran the tests in a Core 2 Duo T5750 @ 2.00 GHz with 2 cores. But in pfib the execution time has reduced from 4 threads to 8, and in ifib, from 2 to 3. I can’t think of where these improvements come from.

Unfortunately the creation of charts is currently broken in GHC 6.12, so no graphics by now.

Google Summer of Code weekly report #1

17/5/2010 (segunda-feira)

Some students of the Google Summer of Code are required to make a weekly report describing their activities.  As far as I know, I’m not, but I think it’s a good idea in general, to comunicate to my mentor and other interested folks what I’ve been doing.  So I’ll start to do them here.

My project is Implementing the Immix Garbage Collector Algorithm on GHC.

Last week I was most of the time involved in my school activities, so I didn’t have much time to work on the project.  I went to the bureaucratic services of my university to get a Proof of Enrollment in English, print/sign/scanned the Foreign Certification Form and sent them to Google on Friday afternoon.  I asked them to confirm the receival, which they didn’t. I’m planning to ask again for a confirmal today in the afternoon, if I still get no reply.

I also tried to build the current version of GHC, from the darcs repository. I was getting an error with my old (and updated) repository, so I donwnloaded a new one and could build nicely. Then I noticed that with make distclean I could also build with my older one, so I just removed the new. I downloaded the nofib benchmark suite and made an initial benchmark.

While I was reading the documentation of the build process of GHC, I noticed a bug, and submitted a patch to the cvs-ghc mailing list. The patch was applied, with some changes.

This week I plan to focus more on the project, and spend the weekly 36 hours working on it. I’m planning to start reading the Immix Technical Report to see if there’s anything new there with regard to the article. After that I’ll read the wiki documentation on the GHC current GC, together with the GHC source code. I’ll also experiment with Criterion.

While I was writing this post I noticed that the GC benchmark in nofib is failing:

==nofib== fibheaps: time to run fibheaps follows...
/usr/bin/time ../../runstdtest/runstdtest ./fibheaps  -o1 fibheaps.stdout -o1 fibheaps.stdout  -ghc-timing     300000;  /usr/bin/time ../../runstdtest/runstdtest ./fibheaps  -o1 fibheaps.stdout -o1 fibheaps.stdout  -ghc-timing     300000;  /usr/bin/time ../../runstdtest/runstdtest ./fibheaps  -o1 fibheaps.stdout -o1 fibheaps.stdout  -ghc-timing     300000;  /usr/bin/time ../../runstdtest/runstdtest ./fibheaps  -o1 fibheaps.stdout -o1 fibheaps.stdout  -ghc-timing     300000;  /usr/bin/time ../../runstdtest/runstdtest ./fibheaps  -o1 fibheaps.stdout -o1 fibheaps.stdout  -ghc-timing     300000;
Command exited with non-zero status 2
0.72user 0.05system 0:00.77elapsed 99%CPU (0avgtext+0avgdata 277664maxresident)k
0inputs+32outputs (0major+17695minor)pagefaults 0swaps
././fibheaps 300000 < /dev/null
**** expected exit status 0 not seen ; got 2
././fibheaps 300000 < /dev/null
expected stderr not matched by reality
--- /tmp/no_stderr3559  2010-05-17 07:35:09.000000000 -0300
+++ /tmp/runtest3559.2  2010-05-17 07:35:10.000000000 -0300
@@ -0,0 +1,2 @@
+Stack space overflow: current size 8388608 bytes.
+Use `+RTS -Ksize -RTS' to increase it.
Command exited with non-zero status 1
0.72user 0.06system 0:00.85elapsed 92%CPU (0avgtext+0avgdata 277664maxresident)k
368inputs+48outputs (1major+22085minor)pagefaults 0swaps
Command exited with non-zero status 2
0.71user 0.04system 0:00.78elapsed 97%CPU (0avgtext+0avgdata 277664maxresident)k
0inputs+32outputs (0major+17694minor)pagefaults 0swaps
././fibheaps 300000 < /dev/null
**** expected exit status 0 not seen ; got 2
././fibheaps 300000 < /dev/null
expected stderr not matched by reality
--- /tmp/no_stderr3578  2010-05-17 07:35:10.000000000 -0300
+++ /tmp/runtest3578.2  2010-05-17 07:35:11.000000000 -0300
@@ -0,0 +1,2 @@
+Stack space overflow: current size 8388608 bytes.
+Use `+RTS -Ksize -RTS' to increase it.
Command exited with non-zero status 1
0.72user 0.06system 0:00.80elapsed 98%CPU (0avgtext+0avgdata 277664maxresident)k
0inputs+48outputs (0major+22093minor)pagefaults 0swaps
Command exited with non-zero status 2
0.72user 0.06system 0:00.80elapsed 98%CPU (0avgtext+0avgdata 277680maxresident)k
0inputs+32outputs (0major+17696minor)pagefaults 0swaps
././fibheaps 300000 < /dev/null
**** expected exit status 0 not seen ; got 2
././fibheaps 300000 < /dev/null
expected stderr not matched by reality
--- /tmp/no_stderr3597  2010-05-17 07:35:11.000000000 -0300
+++ /tmp/runtest3597.2  2010-05-17 07:35:12.000000000 -0300
@@ -0,0 +1,2 @@
+Stack space overflow: current size 8388608 bytes.
+Use `+RTS -Ksize -RTS' to increase it.
Command exited with non-zero status 1
0.72user 0.07system 0:00.83elapsed 96%CPU (0avgtext+0avgdata 277680maxresident)k
0inputs+48outputs (0major+22095minor)pagefaults 0swaps
Command exited with non-zero status 2
0.68user 0.08system 0:00.78elapsed 97%CPU (0avgtext+0avgdata 277680maxresident)k
0inputs+32outputs (0major+17695minor)pagefaults 0swaps
././fibheaps 300000 < /dev/null
**** expected exit status 0 not seen ; got 2
././fibheaps 300000 < /dev/null
expected stderr not matched by reality
--- /tmp/no_stderr3616  2010-05-17 07:35:12.000000000 -0300
+++ /tmp/runtest3616.2  2010-05-17 07:35:12.000000000 -0300
@@ -0,0 +1,2 @@
+Stack space overflow: current size 8388608 bytes.
+Use `+RTS -Ksize -RTS' to increase it.
Command exited with non-zero status 1
0.70user 0.09system 0:00.80elapsed 98%CPU (0avgtext+0avgdata 277680maxresident)k
0inputs+48outputs (0major+22089minor)pagefaults 0swaps
Command exited with non-zero status 2
0.71user 0.06system 0:00.93elapsed 82%CPU (0avgtext+0avgdata 277680maxresident)k
0inputs+32outputs (0major+17695minor)pagefaults 0swaps
././fibheaps 300000 < /dev/null
**** expected exit status 0 not seen ; got 2
././fibheaps 300000 < /dev/null
expected stderr not matched by reality
--- /tmp/no_stderr3635  2010-05-17 07:35:12.000000000 -0300
+++ /tmp/runtest3635.2  2010-05-17 07:35:13.000000000 -0300
@@ -0,0 +1,2 @@
+Stack space overflow: current size 8388608 bytes.
+Use `+RTS -Ksize -RTS' to increase it.
Command exited with non-zero status 1
0.73user 0.06system 0:00.95elapsed 83%CPU (0avgtext+0avgdata 277680maxresident)k
0inputs+48outputs (0major+22089minor)pagefaults 0swaps
make[1]: ** [runtests] Erro 1
Failed making all in fibheaps: 1
make: ** [all] Erro 1

Erro is error in Portuguese. I’ll investigate on that.

Google Summer of Code

3/5/2010 (segunda-feira)

It’s kind of late now to talk about this, but I was accepted as a Student in the Google Summer of Code student. My project is about making the binaries produced by GHC faster by improving its Garbage Collector algorithm. I’ll implement the techniques described in the Immix Garbage Collector.  It’ll be great to work on GHC, specially in the RTS, and to produce code that can contribute to a wide number of users, like the users of darcs, XMonad, pandoc and other softwares built with GHC.

I’ll be blogging here about the status of my project.  Coding starts at May 24th, until then I’ll read the GHC source code documentation and experimenting with performance measurements using Criterion in a system with no daemons enabled.

Good Summer to everyone.

Another interesting code using Template Haskell

14/3/2010 (domingo)

While writing the post about performance, I got interested in Template Haskell, specially in the $(lift) construction.  Investigating a little bit about it, I got to the following code:

{-# LANGUAGE TemplateHaskell #-}

import Language.Haskell.TH.Syntax
import X

main :: IO ()
main = print $(lift $ x)

With X.hs:

module X where

import System.IO.Unsafe

x :: Int
x = read $ unsafePerformIO $ putStr "Number: " >> getLine

The result was a surprise to me. It asked for the number in compilation time.

$ ghc --make main
[2 of 2] Compiling Main             ( main.hs, main.o )
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package array-0.3.0.0 ... linking ... done.
Loading package containers-0.3.0.0 ... linking ... done.
Loading package pretty-1.0.1.1 ... linking ... done.
Loading package template-haskell ... linking ... done.
Number: 5
Linking main ...
marcot@zezinho:~/codigo/haskell$ ./main
5