Archive for the ‘PlanetHaskell’ Category

Immix on GHC Summer of Code final report

20/8/2010 (Friday)

My project.

This part of this post assumes that the reader has read my last post.

The Summer of Code is over. It was great to spend time working on GHC and getting paid for it. Although the implementation is not mature enough to be included in the repository, I’m happy with the state it is in now. I think it’s a good start, and I plan to keep working on it. It’s good to see how my motivation increased now that the program is over and I’m free not to work on it if I don’t want to. So I’m going to take a look at it again, as soon as I do the stuff I was postponing during the program. I’ve created a wiki page for it, and my plans are to implement and measure the “Remove partial lists” patch, and then to debug “Allocate in lines in minor GCs”. Any help is very welcome.


Immix on GHC Summer of Code report #12 / DebConf / Debian Day BH

13/8/2010 (Friday)

My project.

This part of this post assumes that the reader has read my last post.

This post contains embarrassing news: in my effort to organize the patches so that a script could benchmark each change and show how it affects performance, I made a very simple mistake that mostly disabled Immix. It was made in the first patch of the stack, so all patches were compromised. I noticed this while attending to Simon Marlow’s request to measure how many lines are being freed and allocated, when I saw that there were none. My error was simply to define two different constants with the same value: BF_MEDIUM and BF_SWEPT. After the correction, I noticed two more errors that were introduced in this period and that went uncorrected because, as Immix was not activated, they were not causing any harm. Now everything is back at the stage I was at before: a memory corruption that happens occasionally, and only in the nofib/real/maillist benchmark. As described in this link, I’ve already tried a lot of different techniques to debug this one, but it’s proving hard. In the meantime, I’m measuring the stack of patches again: in a lab using the complete nofib (except for nofib/real/maillist) and nofib/gc, and on my laptop using only fibheaps, fulsom and constraints, from nofib/gc, which were used in my last post.
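To make the mistake concrete, here is a minimal sketch of block flags that each own a distinct bit, with a compile-time guard against the collision described above. The `DEMO_` names and the numeric values are assumptions made for illustration; the real flags are defined in the RTS headers and may have different values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag values, for illustration only. Each flag must own
 * a distinct bit, otherwise testing one flag also tests the other. */
#define DEMO_BF_MARKED     (1u << 3)
#define DEMO_BF_FRAGMENTED (1u << 6)
#define DEMO_BF_SWEPT      (1u << 8)
#define DEMO_BF_MEDIUM     (1u << 9)  /* the bug: same value as the SWEPT flag */

/* Compile-time guard (C11): the build fails if the two flags overlap. */
_Static_assert((DEMO_BF_SWEPT & DEMO_BF_MEDIUM) == 0,
               "swept and medium flags must not share a bit");

/* Nonzero when the descriptor flags mark a swept block. */
static int demo_is_swept(uint16_t flags)
{
    return (flags & DEMO_BF_SWEPT) != 0;
}

/* Nonzero when the descriptor flags mark a medium block. */
static int demo_is_medium(uint16_t flags)
{
    return (flags & DEMO_BF_MEDIUM) != 0;
}
```

With both constants set to the same value, every swept block would also test as medium, which is the kind of silent misclassification that can disable a code path without any crash.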


DebConf was great. I stayed in New York with my family doing tourism for some days before it, and New York City is fantastic. A very diverse architecture and people from all around the world. It’s really impressive. I arrived at DebCamp on Friday, and it was not easy for me to find where to check in and where to go after unpacking my stuff. I assume I hadn’t paid much attention to the e-mails explaining this, but I tried to search for them when I got there, without success. It was good, on the other hand, to walk around Columbia searching: it’s a very nice campus, not comparable to anything we have here in Brazil. During DebConf I tried to help the organization as much as I could, being Talk-meister in some talks, helping a little bit with food organization and participating in the video team. All of this was great fun. There were some talks that I couldn’t attend, and I watched the videos, or part of them, here on Monday: Making Debian rule again, FreeBSD, Anti-features… I still want to watch some, like the Analysis of Debian mailing lists using machine learning, and there are others that I wanted to see but that were not recorded, like the one about notmuch. I got my key signed by a lot of people, which certainly was not the case before, and the GPG best practices BoF was great. And there was the proposal to host DebConf12 in Brazil, maybe in my city, Belo Horizonte. Which reminds me:

Tomorrow Betim will host the Debian Day BH, in parallel to another event: FSL-BH. A BoF related to the organization of DebConf12 in Belo Horizonte is scheduled. See you there.

Immix on GHC Summer of Code report #11

10/8/2010 (Tuesday)

My project.

This post assumes that the reader has read my last post.

I’ve been trying for some time to find a reasonable benchmarking method (which computer to use, single user or multi user), and now an e-mail from Simon Marlow has calmed me down. He suggests that at this stage it’s not necessary to run a long benchmark, only a short one, which will be enough to give an idea of what each change does to performance.

For some time I was struggling to get reasonable results from the benchmarks. More than once I started to write a blog post analysing the results I’d got, only to notice in the middle of the post that the results were not consistent. This usually happens when there’s a big distortion in benchmarks that shouldn’t show any change. For instance, for a patch that only changes code that is executed with +RTS -w -RTS, there was a difference of 50% in execution time in default mode, that is, without +RTS -w -RTS.

I think that just disabling CPU scaling and running a small number of tests, which makes it easier to reproduce a possibly inconsistent result, will be enough to get good data.

I’ve made a stack of patches, splitting out each part of my work. The stack does not reflect my preferences among the patches, that is, the complete stack applied is not what I think is the best change to GHC. It’s only a set of patches applied linearly, so that it’s easy to see the performance impact of each change. The patches are:

Don’t check for swept blocks in -DS.

The checkHeap function assumed the allocated part of the block contained only live objects and slop. This was not true for blocks that are collected using mark sweep. The code in this patch skips the test for this kind of block.

This patch is only needed to make it possible to run the binary with the parameters +RTS -w -DS -RTS. I’ve already explained this patch. I didn’t measure the difference in performance for this patch, since it only changes code that is not executed in the benchmarks: none of them are run with -DS.

Immix: allocate and free memory on lines.

This is the main patch with the initial implementation of Immix. As it is the initial implementation, it has its problems, which will be treated in the following patches. The comparison was done using 3 programs from nofib/gc: fibheaps, fulsom and constraints, as suggested by Simon Marlow. Two tests were made for each version of the compiler code: one using the default GC strategy (copy-collection, with mark-compact if there’s little heap space left), and another using mark-sweep (or Immix after my changes). When comparing the default GC strategy before and after this change, the program got 0.4% slower, with the collection of generation 0 being 0.1% slower and the collection of generation 1 0.4% slower. Comparing mark-sweep with Immix, it got 13.7% slower, 15% in GC0 and 7.2% in GC1. The change in memory used is insignificant.

Chose between allocating in line or blocks in todo_block_full

The first improvement I’ve made is to change the place where the allocation code searches for free lines. Most of the changes had been made in the function alloc_for_copy(), in rts/sm/Evac.c. This is bad because this function is called very often and should be kept fast. So I moved them to todo_block_full() in rts/sm/GCUtils.c. It got about 9.9% and 9.6% faster in default and sweep, which makes it 1% faster and 2.8% slower than the original code, respectively.

Improvements in sweep()

I was bothered that the code I’d written to free memory in lines was messy, and I thought it could be simplified and maybe even made faster. The overall execution time increased 0.4%, but the GC1 time, which is the only one that should be affected, was reduced by 0.6%. In the default GC strategy there was also a 0.2% increase in time, so I guess this is within the margin of error. I have to investigate this one further, but it seems to have slightly improved the performance.

Line before inscresing block size.

In a conversation with Simon Marlow, we thought about two options: one is to search for a line after trying to increase the block size, which is what I did in Chose between allocating in line or blocks in todo_block_full; the other is to search for a line first, and only increase the block size if a block (and not a line) was being used for allocation and there is no line available. With this patch I’ve changed it to use the second one. Using sweep, the GC time got 0.5% slower, and with default it got 0.1% slower.

rts/sm/Sweep.c: Mark all BF_MEDIUM blocks as BF_FRAGMENTED

In my first implementation, the blocks that contain objects bigger than a line (of medium size) are marked with BF_MEDIUM, and are treated as usual blocks in mark-sweep: if the block is empty, it’s freed; if it’s very fragmented, it’s marked to be collected using copy-collection. In this patch, instead of making it follow the rules of mark-sweep, I just mark it as fragmented so that it is collected by copy-collection. This reduced the GC0 time by 3%, but caused an increase of 509.7% in the GC1 time. This was also the first time the memory used was reduced, by 25.4%. The results in the default mode are insignificant. It seems to me that the copy-collection code is more efficient in memory usage, but also slower. Marking blocks as BF_FRAGMENTED makes this algorithm be used even for full blocks.

rts/sm/Sweep.c: Don’t use 3/4 heuristics to mark as BF_FRAGMENTED

The mark-sweep algorithm considered a block fragmented if more than 1/4 of its word groups were completely unmarked. I had kept using this heuristic, and this change removes it, making more use of Immix and less use of copy-collection. This didn’t change the memory used, and made the code 10.4% slower in sweep. In default GC mode there was no relevant difference, as expected.
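As a rough illustration of the heuristic as it is worded in this post (not a copy of the code in rts/sm/Sweep.c), the sketch below counts the bitmap words that are completely unmarked and reports the block as fragmented when they are more than a quarter of the total. The function name and the plain `size_t` bitmap are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: `bitmap` holds one mark word per word group of
 * the block, and `n_words` is the number of such words. Returns nonzero
 * when more than 1/4 of the word groups are completely unmarked. */
static int demo_is_fragmented(const size_t *bitmap, size_t n_words)
{
    size_t unmarked = 0;

    assert(bitmap != NULL);
    for (size_t i = 0; i < n_words; i++) {
        if (bitmap[i] == 0) {
            unmarked++;   /* no live object starts in this word group */
        }
    }
    /* unmarked / n_words > 1/4, written without division */
    return unmarked * 4 > n_words;
}
```

Removing the heuristic means this test is no longer consulted, so such blocks stay under Immix instead of being handed to copy-collection.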

Allocate in lines in minor GCs.

In the initial implementation I was allocating in lines only in major GCs, because I needed the mark stack, which was only available in this kind of collection. In this change, I created the mark stack in all GCs and used the allocation in lines. The results for the default mode are insignificant, and the code got 9.0% faster in sweep, using the same memory.

Selection

I’ve made a selection of the presumably better patches, which are: Don’t check for swept blocks in -DS, Immix: allocate and free memory on lines, Chose between allocating in line or blocks in todo_block_full, Improvements in sweep() and Allocate in lines in minor GCs. This is an attempt to achieve the best possible set of patches, to see how it improved the original code. Comparing the default strategy for this selection and the original code, it got 3.7% slower. Comparing both with sweep, it got 3.9% slower. Comparing the original default with the final sweep, it got 4.2% slower and uses 4.7% more memory. There’s a lot of room for improvement, and I’m willing to hear suggestions of what I could change in the code to achieve this improvement.


The results are available. Every time I had doubts about a comparison, I ran both versions again to check. This is why there are some backup files. The data presented here is not always from the most recent version of each measurement, but from the ones that I thought were most similar in conditions.

Immix on GHC Summer of Code weekly report #10

21/7/2010 (Wednesday)

My project.

This post assumes that the reader has read my last post.

This weekly report is late, because I was too interested in the project to stop working on it to write. Next week’s will have to be earlier, since I’m going to New York on Sunday and will have to take some time off from the project.

First of all, I need to say that I thought all problems were gone in my last post, but I noticed after a lot of testing that there is a segfault (or memory corruption, etc.) that happens about one in five times I run the real/maillist benchmark. I tried a lot of combinations of RTS parameters (-C0 -A4k -i0 -DS and others) to make it behave deterministically, and it did, but without segfaults. I could not reproduce the segfault deterministically, because all of these parameters made it disappear. My old debugging technique, printing a log to see where the program had gone before the segfault happened, also makes the segfault disappear, so this one is much harder to debug than the last ones. I’m still looking at it.

At one point this week I decided to pull the latest patches from GHC, which required rebuilding the whole system. I probably rebuilt with a different mk/build.mk, so I got some very distorted benchmark results when comparing executions from before this build with those from after it. Because this distortion only showed up in a benchmark where I had introduced some improvements I was very optimistic about, I naively thought that the -46% results were real. After some reasonable disbelief from my mentor, I remembered the complete rebuild and thought that was the cause. I reran the pre-rebuild benchmarks, and got the updated results.

This whole week I kept on benchmarking. At first I was using my usual system, but I noticed some very unexpected results, so I decided to run the benchmarks in single-user mode, running nothing in parallel. In this condition, I believe the distortions are minimized. I organized my changes in sequential patches to the repository, to make them easier to measure. It takes a long time to run nofib, especially gc/gc_bench. While I organized the code, I noticed the segfault was gone. I still don’t know what was causing it, but I’m not worrying very much, since it’s gone.

I’m planning to write a bigger report next week with the complete results of the benchmarks.

Immix on GHC Summer of Code weekly report #9

7/7/2010 (Wednesday)

My project.

This post assumes that the reader has read my last post.

I’m posting this weekly report earlier this week because there are already too many things to tell. I’ve found the reason behind the segfault I had been hunting for so long. The last problem, which is the only one for which I know exactly when it was fixed, because the programs started working, was related to allocating twice in the same region of memory. This happened because in each major GC the list of free line groups is generated again, but my old code kept allocating in the same line group from the previously generated list. So the last part of that line group, which was not yet used, would become part of a line group generated in the new collection, and it would be used for allocation twice: once in the allocation of the current line group, and again when the new line group starts being allocated.

The implementation of the allocation of memory in lines is not very complicated, but it has some details that need attention, which were the cause of most of the trouble in the last weeks and still need improvement. Initially I was allocating one object per line, just to see if it would work. As it didn’t, I kept on improving the approach until I could find the problem. The next attempt was setting ws->todo_free and ws->todo_lim in alloc_for_copy() in rts/sm/Evac.c. I think this is not ideal, because I didn’t want the code to become too inconsistent with the way memory was allocated using these pointers before my changes. So I created new variables, line_free and line_lim, at first in the gen_workspace ws, the same place where todo_free and todo_lim are, but because of the problem I described in the previous paragraph I moved them to generation. I’m still not sure about where to place these pointers; this is something that can be improved.
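The double allocation described above can be avoided by dropping the allocation cursor whenever the free list is rebuilt. The sketch below is one way to express that idea, not necessarily what the patch does: `demo_gen`, `demo_group` and both functions are hypothetical, and the group header is kept outside the free region (with an explicit `start` pointer) to keep the example simple.

```c
#include <assert.h>
#include <stddef.h>

typedef size_t word;   /* stand-in for an RTS word */

/* Hypothetical free line group: a region of `size_w` free words. */
typedef struct demo_group_ {
    struct demo_group_ *next;
    word *start;
    size_t size_w;
} demo_group;

/* Hypothetical per-generation state: free list plus allocation cursor. */
typedef struct {
    demo_group *first_line;
    word *line_free;   /* next free word in the current group, or NULL */
    word *line_lim;    /* one past the end of the current group */
} demo_gen;

/* Called when a collection has rebuilt the list of free line groups.
 * Resetting the cursor is the important part: the old group may now be
 * covered by a group in the fresh list. */
static void demo_install_free_list(demo_gen *gen, demo_group *fresh)
{
    gen->first_line = fresh;
    gen->line_free = NULL;
    gen->line_lim = NULL;
}

/* Bump-allocate `size` words from the current group, moving to the next
 * group when the current one is exhausted. Returns NULL when no group
 * is large enough. */
static word *demo_line_alloc(demo_gen *gen, size_t size)
{
    while (gen->line_free == NULL ||
           size > (size_t)(gen->line_lim - gen->line_free)) {
        demo_group *g = gen->first_line;
        if (g == NULL) {
            return NULL;
        }
        gen->first_line = g->next;
        gen->line_free = g->start;
        gen->line_lim = g->start + g->size_w;
    }
    word *to = gen->line_free;
    gen->line_free += size;
    assert(gen->line_free <= gen->line_lim);
    return to;
}
```

Without the reset in `demo_install_free_list`, the cursor would keep handing out words from the tail of the old group while the fresh list hands out the same words again.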

Another problem that took me a long time to understand is that the object needs to be scavenged after being allocated. When it was allocated in todo_free, it was scavenged by scavenge_block() in rts/sm/Scav.c, because the block it is in, todo_bd, is scavenged by this function. As I didn’t want the whole block containing the free line group to be scavenged again, I didn’t want to send it to this function. So I thought about creating a way to scavenge only part of a block, that is, the space in the free line group that was allocated. This is still a valid idea, but I noticed that it was easier to use the mark stack. So I mark the object that is allocated in the line and push it onto the mark stack. The main problem with this approach is that it’s only possible to allocate in lines during major GCs, since the mark stack is only active in this kind of GC. This is certainly the place where I can make the most improvement.

The patch with these changes is available, along with another one for the sanity checking explained in the last post.
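The "mark it and push it" step can be pictured with a toy mark bitmap and stack. This is only a sketch under assumptions: `demo_block`, its sizes and the use of word offsets instead of pointers are made up for the example, and the real RTS helpers (is_marked(), mark(), push_mark_stack()) work on block descriptors and closures.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEMO_WORD_BITS   (sizeof(size_t) * CHAR_BIT)
#define DEMO_BLOCK_WORDS 512   /* hypothetical block size, in words */
#define DEMO_STACK_MAX   64    /* hypothetical mark stack capacity */

typedef struct {
    size_t bitmap[DEMO_BLOCK_WORDS / DEMO_WORD_BITS]; /* one mark bit per word */
    size_t stack[DEMO_STACK_MAX];  /* offsets of objects still to be scanned */
    size_t sp;                     /* number of entries on the stack */
} demo_block;

/* Returns 1 when the object starting at word offset `off` is marked. */
static int demo_is_marked(const demo_block *b, size_t off)
{
    return (int)((b->bitmap[off / DEMO_WORD_BITS] >> (off % DEMO_WORD_BITS)) & 1u);
}

/* Mark the object and queue it for scanning, unless already marked.
 * An object placed in a free line goes through this path instead of
 * waiting for its whole block to be scavenged. */
static void demo_mark_and_push(demo_block *b, size_t off)
{
    assert(off < DEMO_BLOCK_WORDS);
    if (demo_is_marked(b, off)) {
        return;
    }
    assert(b->sp < DEMO_STACK_MAX);
    b->bitmap[off / DEMO_WORD_BITS] |= (size_t)1 << (off % DEMO_WORD_BITS);
    b->stack[b->sp++] = off;
}
```

Because the marked check comes first, pushing the same object twice is harmless, which is what makes the stack a convenient to-do list for the collector.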

I’m now benchmarking these changes with nofib, to see how much it affects the performance.

Immix on GHC Summer of Code weekly report #8

5/7/2010 (Monday)

My project.

This post assumes that the reader has read my last post.

This week I could get my focus on the project again, since most of my classes are already over. I’ve investigated the segfault that was happening in GHC with +RTS -w -DS, and I noticed that the code in rts/sm/Sanity.c assumed that all objects in the allocated area of the block are being used.

    for (; bd != NULL; bd = bd->link) {
	p = bd->start;
	while (p < bd->free) {
	    nat size = checkClosure((StgClosure *)p);
	    /* This is the smallest size of closure that can live in the heap */
	    ASSERT( size >= MIN_PAYLOAD_SIZE + sizeofW(StgHeader) );
	    p += size;
	    
	    /* skip over slop */
	    while (p < bd->free &&
		   (*p < 0x1000 || !LOOKS_LIKE_INFO_PTR(*p))) { p++; } 
	}
    }

Since this is true for copy collection and mark compact, the error happened only with mark sweep. The only way I have found so far to make the segfault disappear was to mark the swept blocks with a new flag, and avoid running this code on them.

So I went back to my old problem with the allocation of memory in lines. I noticed that (one of) the problem(s) may be that the object that is allocated in the free line is not scavenged after the evacuation. When the object is allocated in a block, using the current allocation method, it will eventually be scavenged, because all blocks that were being used for allocation are scavenged. I’m planning to implement a list of lines that need to be scavenged, and code to scavenge the lines in this list and the current line. It’ll be very similar to the code that does this with blocks.
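A minimal sketch of that workaround, assuming a hypothetical `DEMO_BF_SWEPT` flag value and a stripped-down block descriptor: the heap walk simply skips blocks carrying the flag, since they may contain dead objects that the closure checker cannot interpret.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_BF_SWEPT 0x100u   /* hypothetical flag value */

/* Stripped-down block descriptor: only what the sketch needs. */
typedef struct demo_bdescr_ {
    struct demo_bdescr_ *link;
    unsigned flags;
} demo_bdescr;

/* Walks a chain of blocks and returns how many were actually checked.
 * The real code would walk the closures of each checked block. */
static size_t demo_check_heap(const demo_bdescr *bd)
{
    size_t checked = 0;

    for (; bd != NULL; bd = bd->link) {
        if (bd->flags & DEMO_BF_SWEPT) {
            continue;   /* swept block: may hold dead objects, skip it */
        }
        checked++;
    }
    return checked;
}
```

The price of this approach is that swept blocks get no sanity checking at all, so it hides the problem rather than teaching the checker about dead objects.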

Immix on GHC Summer of Code weekly report #7

29/6/2010 (Tuesday)

My project.

This post assumes that the reader has read my last post.

This week I had less time to work on my project than in the previous ones, because it was more or less the last week of the semester for some courses. This was unfortunate from one point of view, but good from another, because now I’ve finished most of my courses and can focus more. This is also the reason why the weekly report is slightly delayed.

Most of my work this week was continuing the investigation of the segfault caused by my implementation of allocation in lines. The most interesting thing I discovered came after a suggestion from my mentor: to use +RTS -DS to turn on the sanity checker. I ran the code with my allocation implementation and the sanity checker and got a segfault. I ran it again with only the code to free memory in lines and also got a segfault. Then I ran it without any of my patches, using sweep, and I got the segfault. So it seems that there is something wrong with sweep, and now I’m investigating this new segfault.

I’m using the bernouilli program from nofib to test, running with 148 as a parameter and, naturally, +RTS -w -DS passed to the RunTime System. The output of gdb:

Current directory is /home/marcot/trabalho/livre/ghc/nofib/imaginary/bernouilli/
GNU gdb (GDB) 7.1-debian
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/marcot/trabalho/livre/ghc/nofib/imaginary/bernouilli/Main...done.
(gdb) r 148 +RTS -w -DS
Starting program: /home/marcot/trabalho/livre/ghc/nofib/imaginary/bernouilli/Main 148 +RTS -w -DS
[Thread debugging using libthread_db enabled]

Program received signal SIGSEGV, Segmentation fault.
0x00000000006309cb in LOOKS_LIKE_INFO_PTR_NOT_NULL (p=12297829382473034410) at includes/rts/storage/ClosureMacros.h:225
(gdb) where
#0  0x00000000006309cb in LOOKS_LIKE_INFO_PTR_NOT_NULL (p=12297829382473034410) at includes/rts/storage/ClosureMacros.h:225
#1  0x0000000000630a16 in LOOKS_LIKE_INFO_PTR (p=12297829382473034410) at includes/rts/storage/ClosureMacros.h:230
#2  0x0000000000630a4b in LOOKS_LIKE_CLOSURE_PTR (p=0x7ffff6c84062) at includes/rts/storage/ClosureMacros.h:235
#3  0x0000000000631427 in checkClosure (p=0x7ffff6853a08) at rts/sm/Sanity.c:320
#4  0x00000000006319ef in checkHeap (bd=0x7ffff68014c0) at rts/sm/Sanity.c:479
#5  0x000000000063222f in checkSanity (check_heap=rtsTrue) at rts/sm/Sanity.c:686
#6  0x000000000062dfdb in GarbageCollect (force_major_gc=rtsFalse, gc_type=0, cap=0x8d2ec0) at rts/sm/GC.c:768
#7  0x0000000000620431 in scheduleDoGC (cap=0x8d2ec0, task=0x8f5080, force_major=rtsFalse) at rts/Schedule.c:1420
#8  0x000000000061fa2c in schedule (initialCapability=0x8d2ec0, task=0x8f5080) at rts/Schedule.c:539
#9  0x0000000000620c77 in scheduleWaitThread (tso=0x7ffff6c80000, ret=0x0, cap=0x8d2ec0) at rts/Schedule.c:1902
#10 0x000000000065762b in rts_evalLazyIO (cap=0x8d2ec0, p=0x89d8b0, ret=0x0) at rts/RtsAPI.c:495
#11 0x000000000061d3db in real_main () at rts/RtsMain.c:66
#12 0x000000000061d4ca in hs_main (argc=5, argv=0x7fffffffe6d8, main_init=0x406558 <__stginit_ZCMain>, main_closure=0x89d8b0) at rts/RtsMain.c:115
#13 0x00007ffff6fbcabd in __libc_start_main (main=<value optimized out>, argc=<value optimized out>, ubp_av=<value optimized out>, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fffffffe6c8) at libc-start.c:222
#14 0x0000000000403a69 in _start ()

I started the investigation with the most obvious test: to remove the call to sweep() in rts/sm/GC.c.

  if (major_gc && oldest_gen->mark) {
      if (oldest_gen->compact) 
          compact(gct->scavenged_static_objects);
      // else
      //     sweep(oldest_gen);

The result was the same: the same segfault in the same place. So I decided to go a little further and stop the blocks from getting the BF_MARKED flag, so that they are not even marked.

                if (!(bd->flags & BF_FRAGMENTED)) {
                    // bd->flags |= BF_MARKED;
                }

This one worked. The code in sweep() is not commented out, but that’s irrelevant, since it ignores blocks that don’t have this flag. My third try was to comment out the places where the BF_MARKED flag is read, to see which one was causing the segfault. I got a list of places to look at with grep, and there weren’t a lot of them.

$ grep BF_MARKED rts/sm/*.c
rts/sm/Compact.c:       if (bd->flags & BF_MARKED)
rts/sm/Evac.c:  if ((bd->flags & (BF_LARGE | BF_MARKED | BF_EVACUATED)) != 0) {
rts/sm/Evac.c:        if (bd->flags & BF_MARKED) {
rts/sm/GCAux.c:    if ((bd->flags & BF_MARKED) && is_marked((P_)q,bd)) {
rts/sm/GC.c:                    if (!(bd->flags & BF_MARKED))
rts/sm/GC.c:                        // time, so reset the BF_MARKED flags.
rts/sm/GC.c:                        // compact.  (search for BF_MARKED above).
rts/sm/GC.c:                        bd->flags &= ~BF_MARKED;
rts/sm/GC.c:                // Also at this point we set the BF_MARKED flag
rts/sm/GC.c:                // BF_MARKED is always unset, except during GC
rts/sm/GC.c:                    bd->flags |= BF_MARKED;
rts/sm/Sweep.c:        if (!(bd->flags & BF_MARKED)) { 

The first one is in rts/sm/Compact.c, so it’s not relevant to the use with -w. The second one, in rts/sm/Evac.c, is a bit indirect.

  if ((bd->flags & (BF_LARGE | BF_MARKED | BF_EVACUATED)) != 0) {

      // pointer into to-space: just return it.  It might be a pointer
      // into a generation that we aren't collecting (> N), or it
      // might just be a pointer into to-space.  The latter doesn't
      // happen often, but allowing it makes certain things a bit
      // easier; e.g. scavenging an object is idempotent, so it's OK to
      // have an object on the mutable list multiple times.
      if (bd->flags & BF_EVACUATED) {
          // We aren't copying this object, so we have to check
          // whether it is already in the target generation.  (this is
          // the write barrier).
	  if (bd->gen < gct->evac_gen) {
	      gct->failed_to_evac = rtsTrue;
	      TICK_GC_FAILED_PROMOTION();
	  }
	  return;
      }

      /* evacuate large objects by re-linking them onto a different list.
       */
      if (bd->flags & BF_LARGE) {
	  info = get_itbl(q);
	  if (info->type == TSO && 
	      ((StgTSO *)q)->what_next == ThreadRelocated) {
	      q = (StgClosure *)((StgTSO *)q)->_link;
              *p = q;
	      goto loop;
	  }
	  evacuate_large((P_)q);
	  return;
      }
      
      /* If the object is in a gen that we're compacting, then we
       * need to use an alternative evacuate procedure.
       */
      if (!is_marked((P_)q,bd)) {
          mark((P_)q,bd);
          push_mark_stack((P_)q);
      }
      return;
  }

The first if, in line 466, is executed if any of the three flags is present: BF_LARGE, BF_MARKED or BF_EVACUATED. The second if, in line 474, checks for BF_EVACUATED, and returns. The third if, in line 487, checks for BF_LARGE and returns. The code in lines 502-505 is only executed if BF_MARKED is present, and not the other ones. I tried commenting out this code, and got an assertion failure in the user code, so I think this is not a good path to follow.

The second occurrence of BF_MARKED in the same file is the following.

        if (bd->flags & BF_MARKED) {
            // must call evacuate() to mark this closure if evac==rtsTrue
            *q = (StgClosure *)p;
            if (evac) evacuate(q);
            unchain_thunk_selectors(prev_thunk_selector, (StgClosure *)p);
            return;
        }

Commenting it out, with or without the call to sweep() commented out, causes the same segfault. So I tend to think this part of the code is unrelated to the issue.

The next occurrence is in rts/sm/GCAux.c. I tried commenting it out in all four combinations of the other two being commented out or not, and all resulted in the segfault in the same place.

There’s another place to check in rts/sm/GC.c. Again, commenting it out made no difference. The last one is part of sweep(), so it’s avoided anyway when the call to this function is commented out.

Immix on GHC Summer of Code weekly report #6

18/6/2010 (Friday)

My project.

This post assumes that the reader has read my last post.

I started changing the line representation. Instead of a singly linked list of free lines, it’s better to work with line groups, to avoid fragmentation inside lines during allocation. I’ve created a struct to represent a line group, with a pointer to the next group and the size of the group, in lines. It is currently stored in includes/rts/storage/GC.h, but I’m not sure this is the best place for it, and I’m thinking about moving it later.

#include "rts/OSThreads.h"

typedef struct line_ {
    struct line_ *next;
    StgWord size;
} line;

/* -----------------------------------------------------------------------------

Then we need to change the code that frees lines in rts/sm/Sweep.c.

            else if(!(bd->flags & BF_MEDIUM)) {
                StgBool sequence;
                sequence = 0;

                for(i = 1; i < BLOCK_SIZE_W / BITS_IN(W_); i++) {
                    StgPtr start;
                    start = bd->start + BITS_IN(W_) * i;
                    if(bd->u.bitmap[i] == 0 && bd->u.bitmap[i - 1] == 0 &&
                       start + BITS_IN(W_) <= bd->free) {
                        printf("DEBUG: line_found(%p)\n", start); fflush(stdout);
                        if(gen->first_line == NULL) {
                            gen->first_line = (line *) start;
                        }
                        if(sequence) {
                            last_line->size++;
                        }
                        else {
                            if(last_line != NULL) {
                                last_line->next = (line *) start;
                            }
                            last_line = (line *) start;
                            last_line->next = NULL;
                            last_line->size = 1;
                        }
                        sequence = 1;
                    }
                    else {
                        sequence = 0;
                    }

I’ve tested this code using the same technique as before, traversing the produced list after it’s built. The list obtained showed the correct lines.

Now I’m back to the allocation code, which never worked. The changes in rts/sm/Evac.c are straightforward.
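The traversal check can be sketched as a small helper that walks the list and totals the groups and lines, so the result can be compared with what the bitmap says should be free. `demo_line` mirrors the struct shown earlier in this post (with `size_t` in place of StgWord), and `demo_count_lines` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified copy of the line group header from this post. */
typedef struct demo_line_ {
    struct demo_line_ *next;
    size_t size;               /* length of the group, in lines */
} demo_line;

/* Walks the list starting at `first`. Returns the total number of free
 * lines and stores the number of groups in *groups. */
static size_t demo_count_lines(const demo_line *first, size_t *groups)
{
    size_t lines = 0;

    *groups = 0;
    for (const demo_line *l = first; l != NULL; l = l->next) {
        assert(l->size > 0);   /* an empty group should never be linked */
        lines += l->size;
        (*groups)++;
    }
    return lines;
}
```

A check like this only validates the shape of the list; it says nothing about whether the lines it names are really free, which is what the later segfaults were about.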

        if (gen->first_line != NULL &&
            size <= BITS_IN(W_) * gen->first_line->size) {
            ws->todo_free = (StgPtr) gen->first_line;
            ws->todo_lim = ws->todo_free + BITS_IN(W_) * gen->first_line->size;
            gen->first_line = gen->first_line->next;
            to = ws->todo_free;
        }

This causes the same kind of errors I was getting before. I should go back to debugging. My co-supervisor in my Oriented Project in Computer Science suggested using valgrind. I’ll try it.


Last week I forgot to mention that I gave the final presentation of my Oriented Project in Computer Science. This week I finished writing my monograph.

Summer of Code weekly report #5

15/6/2010 (Tuesday)

My project.

This post assumes that the reader has read my last post.

This was a week full of segfaults, failed assertions, changes in the user data and other crazy stuff. I started working on the allocation of memory in the freed lines. At first glance, it seemed to be much harder than the small changes I’d made to free memory in lines, because of a datatype I had never seen before being dealt with in alloc_for_copy(), from rts/sm/Evac.c: gen_workspace. But in a short time the code of alloc_for_copy() turned out to be simple, and it was easy to do an initial implementation. I followed the same path as before, trying the simplest solution that works. For instance, the line representation used is not ideal, because groups of lines are not considered, but it was the simplest at that time, and I plan to improve it later.

The gen_workspace data type contains a pointer to an area of a block that is not being used (todo_free) and a pointer to the end of that area (todo_lim). When space for an object is requested, it’s allocated in this area, and todo_free is adjusted. If the area is full, a new block is requested. My intended change was to return the first free line of the generation when the object is smaller than or equal to the size of a line, and use the current approach otherwise. This is the simplest way I could think of, but it has the problem that only one object is allocated per line. This was a known issue.
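The todo_free / todo_lim scheme is plain bump allocation. Here is a minimal sketch of it, assuming a hypothetical `demo_workspace` that holds only those two pointers; the real gen_workspace has more fields, and on overflow the RTS requests a new block instead of returning NULL.

```c
#include <assert.h>
#include <stddef.h>

typedef size_t word;   /* stand-in for an RTS word */

/* Hypothetical, simplified workspace with just the two pointers. */
typedef struct {
    word *todo_free;   /* first unused word of the current area */
    word *todo_lim;    /* one past the last usable word */
} demo_workspace;

/* Bump-allocates `size` words. Returns NULL when the area is too small,
 * which is where the real code would ask for a new block. */
static word *demo_alloc(demo_workspace *ws, size_t size)
{
    assert(ws->todo_free <= ws->todo_lim);
    if (size > (size_t)(ws->todo_lim - ws->todo_free)) {
        return NULL;
    }
    word *to = ws->todo_free;
    ws->todo_free += size;   /* the next object starts right after this one */
    return to;
}
```

Pointing the same pair of pointers at a free line instead of a fresh block is what later allows more than one object per line.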

    if(size <= BITS_IN(W_) && gen->first_line != NULL) {
        to = gen->first_line;
        gen->first_line = (StgPtr) *gen->first_line;
        return to;
    }

    ws = &gct->gens[gen->no];

After that I got the first round of segfaults. One problem I could spot so far was in the code in rts/sm/Sweep.c, related to the freeing of lines. The block is filled as needed, and there’s a pointer to the first free byte, free. This free space is used by alloc_for_copy(). So we should only look for lines in the region of the block that is already allocated, that is, where the address is smaller than free.

                    if(bd->u.bitmap[i] == 0 && bd->u.bitmap[i - 1] == 0 &&
                       start + BITS_IN(W_) <= bd->free)

But the segfaults, memory corruptions, etc., kept going on. I tried some restrictions, like allocating in lines only objects with the exact size of a line, or very small objects (size == 2), or allocating just one object per generation, or only one object at all. I also tried checking the type of the allocated object, to look for a correlation with the problems I was having. Nothing has helped so far.

So I thought I should work on other things, so that maybe my mind gets clearer and I can spot the problem. I followed the suggestion of using todo_free and todo_lim to make it possible to allocate more than one object in a line. I liked this change, since I had the impression that it fits better with the rest of the code than the initial implementation, and it will be easier to adapt when I improve the representation of the free lines. As I said, I wanted something to work on while I think about how I’ll solve the segfault problems. The bad side of this choice is that I’ll not be able to test it, since it’s just a reimplementation of the same idea in the allocation code, and is not expected to work. The good side is that it may work, and solve by accident the problems I was having.

    if (ws->todo_free > ws->todo_lim) {
        if (size <= BITS_IN(W_) && gen->first_line != NULL) {
            ws->todo_free = gen->first_line;
            gen->first_line = (StgPtr) *gen->first_line;
            to = ws->todo_free;
            ws->todo_lim = ws->todo_free + BITS_IN(W_);
        } else {
            to = todo_block_full(size, ws);
        }
    }
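
To make the idea of several objects per line concrete, here is a small standalone model. It is only a sketch under assumptions, not the RTS code: toy_ws and alloc_words are hypothetical names, the free-line list is kept in the same struct for brevity, and a line is assumed to be 64 words, as on a 64-bit machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_WORDS 64   /* assumed: BITS_IN(W_) on a 64-bit system */

typedef uintptr_t word;

/* Hypothetical stand-in for the workspace plus the generation's line list. */
typedef struct {
    word *first_line;   /* free lines, linked through each line's first word */
    word *todo_free;    /* next unused word of the current line, or NULL */
    word *todo_lim;     /* one past the end of the current line */
} toy_ws;

/* Bump-allocate `size` words in the current line. When the object does not
 * fit, move on to the next free line. Returns NULL when the caller would
 * have to fall back to the ordinary block allocation. */
word *alloc_words(toy_ws *ws, size_t size)
{
    assert(size > 0);
    if (ws->todo_free == NULL ||
        (size_t) (ws->todo_lim - ws->todo_free) < size) {
        if (size > LINE_WORDS || ws->first_line == NULL) {
            return NULL;
        }
        ws->todo_free = ws->first_line;
        ws->first_line = (word *) *ws->first_line;   /* unlink the line */
        ws->todo_lim = ws->todo_free + LINE_WORDS;
    }
    word *to = ws->todo_free;
    ws->todo_free += size;
    return to;
}
```

In this model, whatever is left of a line is simply abandoned when the next object does not fit in it.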

I got the same results as with the older implementation, and no ideas about how to solve it (yet). So I thought about another thing to work on, which I’ve been mentioning throughout this post: improving the free line representation. This has the advantage of being testable, since it’s unrelated to the allocation code, and may give me an idea about how I can fix the segfaults. The bad side is that it probably won’t fix my problems directly. I’m working on it right now.


Following a suggestion, I’ve started using gdb to see what was happening in the GC instead of including printfs everywhere, and it’s been useful. I’ve noticed that sometimes printf is more effective, but sometimes gdb is much better.

Summer of Code weekly report #4

4/6/2010 (Friday)

My project.

I’m publishing my report earlier this week, because there was a lot to talk about. This week I started to make changes in the code going in the direction of what I want to do. I haven’t started a final implementation, but I’m studying how what I want to do will affect the rest of the Garbage Collection (GC). I noticed the code in rts/sm/Sweep.c was simple and similar to what I’m planning to do, so I started changing how it works.

Sweep in the Glasgow Haskell Compiler (GHC) is done with a bitmap, which contains a bit for each word in a memory block; a bit is set to 1 when an object starts at the mapped word and to 0 otherwise. When no object starts in a block, that is, all bits of its bitmap are set to 0, the block is freed.

        if (resid == 0)
        {
            freed++;
            gen->n_old_blocks--;
            if (prev == NULL) {
                gen->old_blocks = next;
            } else {
                prev->link = next;
            }
            freeGroup(bd);
        }

The bits are analyzed in groups of BITS_IN(W_), where BITS_IN(W_) is the number of bits in a word.

        for (i = 0; i < BLOCK_SIZE_W / BITS_IN(W_); i++)
        {
            if (bd->u.bitmap[i] != 0) resid++;
        }

If more than ¼ of the groups are completely set to 0, the block is considered fragmented.

            if (resid < (BLOCK_SIZE_W * 3) / (BITS_IN(W_) * 4)) {
                fragd++;
                bd->flags |= BF_FRAGMENTED;
            }
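
The three outcomes of the sweep can be put together in a small standalone model. This is a sketch under assumptions, not the code in rts/sm/Sweep.c: classify_block and the block_state names are hypothetical, and the constants assume a 64-bit system with 4096-byte blocks (512 words per block, 64 bits per word).

```c
#include <assert.h>
#include <stdint.h>

#define WORD_BITS   64                        /* assumed BITS_IN(W_) */
#define BLOCK_WORDS 512                       /* assumed BLOCK_SIZE_W */
#define GROUPS      (BLOCK_WORDS / WORD_BITS) /* bitmap words per block */

typedef enum { BLOCK_FREE, BLOCK_FRAGMENTED, BLOCK_DENSE } block_state;

/* Count the bitmap words that have at least one bit set, then apply the
 * same thresholds as the snippets above. */
block_state classify_block(const uint64_t bitmap[GROUPS])
{
    int resid = 0;
    for (int i = 0; i < GROUPS; i++) {
        if (bitmap[i] != 0) resid++;
    }
    if (resid == 0) {
        return BLOCK_FREE;          /* no object starts in this block */
    }
    if (resid < (BLOCK_WORDS * 3) / (WORD_BITS * 4)) {
        return BLOCK_FRAGMENTED;    /* more than 1/4 of the groups are empty */
    }
    return BLOCK_DENSE;
}
```

With these constants the threshold is 6, so a block with at most 5 non-empty groups out of 8 counts as fragmented.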

Immix, the GC algorithm I plan to implement in GHC, divides the blocks of memory into lines. My initial plan was to identify free lines. I decided to fix the size of a line at BITS_IN(W_) words, because a line then maps to one word in the bitmap, and the code was already analyzing the bitmap in groups of BITS_IN(W_) bits. This was very easy with the current code.

            if (bd->u.bitmap[i] != 0) resid++;
            else printf("DEBUG: line_found(%p)\n", bd->start + BITS_IN(W_) * i);

This worked, and showed some free lines. I’m sure there are other ways of logging in GHC, but printf was the simplest way I could think of. I measured the occurrence of free lines using the bernouilli program from the NoFib benchmark suite, calling it with 500 +RTS -w to make it use sweep. In 782 calls to GarbageCollect(), sweep() was called 171 times, swept 41704 blocks and found 230461 free lines. This gives us about 5.5 free lines per block, out of the 8 lines in each block (on 64-bit systems).
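
The arithmetic behind these figures can be checked with a few lines of C. This is only a sketch: lines_per_block and average_free_lines are hypothetical helper names, and the 4096-byte block size is an assumption that matches the 8 lines per block quoted above.

```c
#include <assert.h>

#define BLOCK_BYTES 4096   /* assumed block size */

/* A line is BITS_IN(W_) words, so the number of lines per block is
 * (words per block) / (bits per word). */
int lines_per_block(int word_bytes)
{
    int word_bits = word_bytes * 8;
    int block_words = BLOCK_BYTES / word_bytes;
    return block_words / word_bits;
}

/* Average number of free lines found per swept block. */
double average_free_lines(long free_lines, long blocks)
{
    assert(blocks > 0);
    return (double) free_lines / (double) blocks;
}
```

For 8-byte words this gives 512 / 64 = 8 lines per block, and 230461 free lines over 41704 blocks comes to roughly 5.5 per block.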

The problem is that the bitmap is marked only at the start of each allocated object, so even when all bits of a line are 0 we can’t assume that the line is completely free, because an object that starts in the previous line may be using its space. Checking only the previous line doesn’t work either, because a big object can span several lines. What we can do here is a variation of conservative marking, as proposed in the Immix paper: check only the previous line and work only with objects smaller than a line.

To make sure I was working only with objects smaller than a line, I had to mark the blocks that contain medium objects and avoid them when seeking free lines. The block flags are defined in includes/rts/storage/Block.h, so I included another flag in this file, BF_MEDIUM.
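
A toy model shows why the previous line matters. It is a sketch under assumptions: mark_object_start and count_usable_lines are hypothetical names, the block has 8 lines of 64 words, and every object is assumed to be no larger than a line.

```c
#include <assert.h>
#include <stdint.h>

#define WORD_BITS 64   /* assumed BITS_IN(W_), also the line size in words */
#define LINES     8    /* assumed lines per block on a 64-bit system */

/* Set the bit for an object that starts at word `word_index` of the block. */
void mark_object_start(uint64_t bitmap[LINES], int word_index)
{
    assert(word_index >= 0 && word_index < LINES * WORD_BITS);
    bitmap[word_index / WORD_BITS] |= (uint64_t) 1 << (word_index % WORD_BITS);
}

/* A line is treated as usable only when no object starts in it and no
 * object starts in the line before it, since such an object could spill
 * over. Line 0 has no previous line in the block and is never counted. */
int count_usable_lines(const uint64_t bitmap[LINES])
{
    int usable = 0;
    for (int i = 1; i < LINES; i++) {
        if (bitmap[i] == 0 && bitmap[i - 1] == 0) usable++;
    }
    return usable;
}
```

For example, an object starting at the last word of line 2 leaves the bitmap word of line 3 at 0, yet line 3 is skipped because the object may extend into it.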

/* Block contains objects evacuated during this GC */
#define BF_EVACUATED 1
/* Block is a large object */
#define BF_LARGE     2
/* Block is pinned */
#define BF_PINNED    4
/* Block is to be marked, not copied */
#define BF_MARKED    8
/* Block is free, and on the free list  (TODO: is this used?) */
#define BF_FREE      16
/* Block is executable */
#define BF_EXEC	     32
/* Block contains only a small amount of live data */
#define BF_FRAGMENTED 64
/* we know about this block (for finding leaks) */
#define BF_KNOWN     128
/* Block contains objects larger than a line */
#define BF_MEDIUM    256

The GHC GC is generational, that is, objects are allocated in a generation and, after a time, the ones that are still being used are moved to the next generation. This idea assumes the death probability of younger objects is higher, so few objects are moved to the next generation. Sweep and Immix work only in the last generation so, to mark blocks with medium objects, we have to check the size of the objects that are moved to the next generation.

This is done in the copy_tag function of rts/sm/Evac.c. I inserted code that checks the object size and marks the block when it’s bigger than BITS_IN(W_).

STATIC_INLINE GNUC_ATTR_HOT void
copy_tag(StgClosure **p, const StgInfoTable *info, 
         StgClosure *src, nat size, generation *gen, StgWord tag)
{
    StgPtr to, from;
    nat i;

    to = alloc_for_copy(size,gen);

    if(size > 8) {
        Bdescr(to)->flags |= BF_MEDIUM;
    }

So I updated the code in rts/sm/Sweep.c to look for free lines only in blocks without the BF_MEDIUM mark.

            if (bd->u.bitmap[i] != 0) resid++;
            else if(!(bd->flags & BF_MEDIUM)) {
                printf("DEBUG: line_found(%p)\n", bd->start + BITS_IN(W_) * i);
            }

This also worked. Now, in the 32012 blocks there were 189015 free lines, found in the same number of GCs, making about 5.9 free lines per block. We considered only blocks with small objects, but we didn’t ignore the first line of each group of free lines. This can be achieved by checking if the previous line was also free.

            if (bd->u.bitmap[i] != 0) resid++;
            else if(!(bd->flags & BF_MEDIUM) && i > 0 && bd->u.bitmap[i - 1] == 0) {
                printf("DEBUG: line_found(%p)\n", bd->start + BITS_IN(W_) * i);
            }

Now, from the 32239 blocks, 165547 free lines were found, giving 5.1 free lines per block. But there are more things to improve. If the whole block is free, we want to free it, instead of marking its lines as free. So it’s better to mark the lines after we know that the blocks are not completely free. So I left the code that checks the bitmap as it was, and included a line check only for blocks that are not completely free. At this point, I also associated the fragmentation test with blocks with medium objects, because in blocks of small objects we plan to allocate in free lines, so fragmentation is not a (big) issue.

            if (resid < (BLOCK_SIZE_W * 3) / (BITS_IN(W_) * 4) &&
                (bd->flags & BF_MEDIUM)) {
                fragd++;
                printf("DEBUG: BF_FRAGMENTED\n");
                bd->flags |= BF_FRAGMENTED;
            }
            else if(!(bd->flags & BF_MEDIUM)) {
                for(i = 1; i < BLOCK_SIZE_W / BITS_IN(W_); i++)
                {
                    if(bd->u.bitmap[i] == 0 && bd->u.bitmap[i - 1] == 0) {
                        printf("DEBUG: line_found(%p)\n", bd->start + BITS_IN(W_) * i);
                    }
                }
            }

The total amount of blocks increased dramatically: the blocks that used to become fragmented, and were then not passed to sweep again, made a huge difference. From the 345143 blocks, 1633268 free lines were found, or about 4.7 free lines per block. 9434 blocks were free, so, from the remaining blocks, we have about 4.9 free lines per block.

Something we’ll need then is a way to access these lines later. The simplest way I thought of to achieve it is constructing a list of lines, in which the first word of each free line is a pointer to the next free line, and the first word of the last free line is 0. It’s useful to keep reporting the lines to stdout, so that we can then follow the list and check if we went to the same lines.

                    if(bd->u.bitmap[i] == 0 && bd->u.bitmap[i - 1] == 0) {
                        StgPtr start = bd->start + BITS_IN(W_) * i;
                        printf("DEBUG: line_found(%p)\n", start);
                        if(line_first == NULL) {
                            line_first = start;
                        }
                        if(line_last != NULL) {
                            *line_last = (StgWord) start;
                        }
                        line_last = start;
                        *line_last = 0;
                    }
                }
            }
        }
    }

    for(line_last = line_first; line_last; line_last = (StgPtr) *line_last) {
        fprintf(stderr, "DEBUG: line_found(%p)\n", line_last);
    }

I printed each line to stdout as it was added to the list, and to stderr while walking the list, so that the two would be easy to diff. There was no difference between the lists.

There are other improvements that can be made, like using a list of groups of free lines, but I think it’ll be better to think about this after studying how the allocation in the free lines will be done. That’s what I’m going to do now.
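
The list construction can also be tried outside the RTS. The following is a minimal sketch with hypothetical names (line_list, push_line, list_length), using a plain array of words in place of a heap block. It only mirrors the idea of linking free lines through their first word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t word;

/* Head and tail of a list threaded through the free lines themselves. */
typedef struct {
    word *first;
    word *last;
} line_list;

/* Append the free line that starts at `start`: the previous tail now points
 * to it, and its own first word becomes the 0 terminator. */
void push_line(line_list *list, word *start)
{
    if (list->first == NULL) {
        list->first = start;
    }
    if (list->last != NULL) {
        *list->last = (word) start;
    }
    list->last = start;
    *start = 0;
}

/* Walk the list by following the first word of each line. */
size_t list_length(const line_list *list)
{
    size_t n = 0;
    for (const word *p = list->first; p != NULL; p = (const word *) *p) {
        n++;
    }
    return n;
}
```

No extra memory is needed for the list, since the links live in words that are known to be unused.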


There are some minor things I learned that I thought were worth blogging about. The current GHC uses three strategies for collecting the last generation: copying, mark-compact and mark-sweep. Copying is the default until the memory reaches 30% of the maximum heap size; after that, mark-compact is used. Sweep can be chosen with a runtime system (RTS) flag, -w. To always use mark-compact, the flag is -c.

I’ve been submitting small patches to the cvs-ghc mailing list, mostly about outdated comments. Most of them were accepted, except for one which contained a lot of commentary, and that indeed was not completely correct. I corrected it and resent it to the list, but the message is waiting for approval because the message header matched a filter rule. I believe this is because I replied to the message generated by darcs.

There’s a very useful ghc option, especially for testing the compiler, because in this case you need to rebuild the source even when there are no changes in it. It’s -fforce-recomp, and it only makes sense when used with --make.