Immix on GHC Summer of Code weekly report #7

My project.

This post assumes that the reader has read my last post.

This week I had less time to work on my project than in previous weeks, because it was effectively the last week of the semester for some of my courses. That was unfortunate in one way but good in another: I’ve now finished most of my courses and can focus more on the project. It is also why this weekly report is slightly delayed.

Most of my work this week was continuing the investigation of the segfault caused by my implementation of allocation in lines. The most interesting discovery came after a suggestion from my mentor: to use +RTS -DS to turn on the sanity checker. I ran the code with my allocation implementation and the sanity checker and got a segfault. I ran it again with only the code to free memory in lines, and also got a segfault. Then I ran it without any of my patches, using sweep, and still got the segfault. So it seems that there is something wrong with sweep, and now I’m investigating this new segfault.

I’m using the bernouilli program from nofib to test, running it with 148 as a parameter and, naturally, +RTS -w -DS passed to the runtime system. This is the gdb output:

Current directory is /home/marcot/trabalho/livre/ghc/nofib/imaginary/bernouilli/
GNU gdb (GDB) 7.1-debian
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/marcot/trabalho/livre/ghc/nofib/imaginary/bernouilli/Main...done.
(gdb) r 148 +RTS -w -DS
Starting program: /home/marcot/trabalho/livre/ghc/nofib/imaginary/bernouilli/Main 148 +RTS -w -DS
[Thread debugging using libthread_db enabled]

Program received signal SIGSEGV, Segmentation fault.
0x00000000006309cb in LOOKS_LIKE_INFO_PTR_NOT_NULL (p=12297829382473034410) at includes/rts/storage/ClosureMacros.h:225
(gdb) where
#0  0x00000000006309cb in LOOKS_LIKE_INFO_PTR_NOT_NULL (p=12297829382473034410) at includes/rts/storage/ClosureMacros.h:225
#1  0x0000000000630a16 in LOOKS_LIKE_INFO_PTR (p=12297829382473034410) at includes/rts/storage/ClosureMacros.h:230
#2  0x0000000000630a4b in LOOKS_LIKE_CLOSURE_PTR (p=0x7ffff6c84062) at includes/rts/storage/ClosureMacros.h:235
#3  0x0000000000631427 in checkClosure (p=0x7ffff6853a08) at rts/sm/Sanity.c:320
#4  0x00000000006319ef in checkHeap (bd=0x7ffff68014c0) at rts/sm/Sanity.c:479
#5  0x000000000063222f in checkSanity (check_heap=rtsTrue) at rts/sm/Sanity.c:686
#6  0x000000000062dfdb in GarbageCollect (force_major_gc=rtsFalse, gc_type=0, cap=0x8d2ec0) at rts/sm/GC.c:768
#7  0x0000000000620431 in scheduleDoGC (cap=0x8d2ec0, task=0x8f5080, force_major=rtsFalse) at rts/Schedule.c:1420
#8  0x000000000061fa2c in schedule (initialCapability=0x8d2ec0, task=0x8f5080) at rts/Schedule.c:539
#9  0x0000000000620c77 in scheduleWaitThread (tso=0x7ffff6c80000, ret=0x0, cap=0x8d2ec0) at rts/Schedule.c:1902
#10 0x000000000065762b in rts_evalLazyIO (cap=0x8d2ec0, p=0x89d8b0, ret=0x0) at rts/RtsAPI.c:495
#11 0x000000000061d3db in real_main () at rts/RtsMain.c:66
#12 0x000000000061d4ca in hs_main (argc=5, argv=0x7fffffffe6d8, main_init=0x406558 <__stginit_ZCMain>, main_closure=0x89d8b0) at rts/RtsMain.c:115
#13 0x00007ffff6fbcabd in __libc_start_main (main=<value optimized out>, argc=<value optimized out>, ubp_av=<value optimized out>, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fffffffe6c8) at libc-start.c:222
#14 0x0000000000403a69 in _start ()

I started the investigation with the most obvious test: removing the call to sweep() in rts/sm/GC.c.

  if (major_gc && oldest_gen->mark) {
      if (oldest_gen->compact) 
          compact(gct->scavenged_static_objects);
      // else
      //     sweep(oldest_gen);

The result was the same: the same segfault in the same place. So I went a step further and stopped the blocks from getting the BF_MARKED flag at all, so that they are never marked.

                if (!(bd->flags & BF_FRAGMENTED)) {
                    // bd->flags |= BF_MARKED;
                }

This one worked. The code in sweep() is not commented out, but that’s irrelevant, since it ignores blocks that don’t have this flag. My third attempt was to comment out the places where the BF_MARKED flag is read, to see which one was causing the segfault. I got the list of places to check with grep, and there weren’t many of them.

$ grep BF_MARKED rts/sm/*.c
rts/sm/Compact.c:       if (bd->flags & BF_MARKED)
rts/sm/Evac.c:  if ((bd->flags & (BF_LARGE | BF_MARKED | BF_EVACUATED)) != 0) {
rts/sm/Evac.c:        if (bd->flags & BF_MARKED) {
rts/sm/GCAux.c:    if ((bd->flags & BF_MARKED) && is_marked((P_)q,bd)) {
rts/sm/GC.c:                    if (!(bd->flags & BF_MARKED))
rts/sm/GC.c:                        // time, so reset the BF_MARKED flags.
rts/sm/GC.c:                        // compact.  (search for BF_MARKED above).
rts/sm/GC.c:                        bd->flags &= ~BF_MARKED;
rts/sm/GC.c:                // Also at this point we set the BF_MARKED flag
rts/sm/GC.c:                // BF_MARKED is always unset, except during GC
rts/sm/GC.c:                    bd->flags |= BF_MARKED;
rts/sm/Sweep.c:        if (!(bd->flags & BF_MARKED)) { 

The first one is in rts/sm/Compact.c, so it’s not relevant when running with -w. The second one, in rts/sm/Evac.c, is a bit indirect.

  if ((bd->flags & (BF_LARGE | BF_MARKED | BF_EVACUATED)) != 0) {

      // pointer into to-space: just return it.  It might be a pointer
      // into a generation that we aren't collecting (> N), or it
      // might just be a pointer into to-space.  The latter doesn't
      // happen often, but allowing it makes certain things a bit
      // easier; e.g. scavenging an object is idempotent, so it's OK to
      // have an object on the mutable list multiple times.
      if (bd->flags & BF_EVACUATED) {
          // We aren't copying this object, so we have to check
          // whether it is already in the target generation.  (this is
          // the write barrier).
	  if (bd->gen < gct->evac_gen) {
	      gct->failed_to_evac = rtsTrue;
	      TICK_GC_FAILED_PROMOTION();
	  }
	  return;
      }

      /* evacuate large objects by re-linking them onto a different list.
       */
      if (bd->flags & BF_LARGE) {
	  info = get_itbl(q);
	  if (info->type == TSO && 
	      ((StgTSO *)q)->what_next == ThreadRelocated) {
	      q = (StgClosure *)((StgTSO *)q)->_link;
              *p = q;
	      goto loop;
	  }
	  evacuate_large((P_)q);
	  return;
      }
      
      /* If the object is in a gen that we're compacting, then we
       * need to use an alternative evacuate procedure.
       */
      if (!is_marked((P_)q,bd)) {
          mark((P_)q,bd);
          push_mark_stack((P_)q);
      }
      return;
  }

The first if, on line 466 of rts/sm/Evac.c, is executed if any of the three flags is present: BF_LARGE, BF_MARKED or BF_EVACUATED. The second if, on line 474, checks for BF_EVACUATED and returns. The third if, on line 487, checks for BF_LARGE and returns. So the code on lines 502-505 is only executed if BF_MARKED is present and the other two are not. I tried commenting out this code and got an assertion failure in the user code, so I think this is not a good path to follow.

The second occurrence of BF_MARKED is in the same file.

        if (bd->flags & BF_MARKED) {
            // must call evacuate() to mark this closure if evac==rtsTrue
            *q = (StgClosure *)p;
            if (evac) evacuate(q);
            unchain_thunk_selectors(prev_thunk_selector, (StgClosure *)p);
            return;
        }

Commenting it out, with or without the call to sweep() commented out, causes the same segfault. So I tend to think this part of the code is unrelated to the issue.

The third occurrence is in rts/sm/GCAux.c. I tried commenting it out in all four combinations of the other two being commented out or not, and every one resulted in the segfault in the same place.

There’s another place to check in rts/sm/GC.c. Again, commenting it out made no difference. The last one is part of sweep(), so it’s skipped anyway when the call to this function is commented out.


6 Responses to “Immix on GHC Summer of Code weekly report #7”

  1. Logan Says:

    I don’t know how well this would work for something like GHC, but have you tried throwing valgrind at the problem? SEGFAULTs are the sort of thing it can be really good at tracking down and you won’t have to mess with the code so much (assuming it works), possibly hiding the problem.

  2. marcotmarcot Says:

    Valgrind is only displaying the read error that precedes the segfault. Nothing really useful.

  3. Mate Says:

    Try valgrind. It should help a LOT with this.

  4. Sebastian Says:

    I recommend you stop the guesswork, and look at what’s actually going on. Debugging is detective work, not a lottery.

    You have a debugger, and a repro case, this really should be a matter of hours, not weeks. Why does it segfault? Where did that pointer come from? When did that get set (data breakpoints help here so you can check the specific pointer that crashes, if you know it, each time it’s set).

    Really, blindly commenting things out and hoping to find the problem that way is not the way to debug C. You need to find out what’s *actually* going on by looking at the state of the running program and tracking down where it first diverges from what you expect. The debugger is crucial to this (don’t listen to people who tell you not to use a debugger, they are idiots).

  5. marcotmarcot Says:

    Hi Sebastian. I agree with almost everything you said. You should consider that the code base is complex and big, and I’m not very familiar with it. I added these comments not to fix the segfault, but to try to limit the scope of where I should look for problems. I believe this is part of the detective work you mentioned. This may not be the best technique, I admit, but I’ve tried others before, and this one came to mind in the course of the investigation, which I described in detail. It was interesting to learn that the problem was not in sweep(), but in tagging blocks as BF_MARKED, just to cite one conclusion that I reached with this technique.

    I’m using a debugger, and I’m trying to answer the questions you asked. I was looking at what is going on as best as I could before and after the activities I described in this blog post. I just focused on describing how I tried to limit the scope of the search.

  6. Immix on GHC Summer of Code weekly report #8 « Blog do Marcot Says:

    [...] Blog do Marcot « Immix on GHC Summer of Code weekly report #7 [...]
