Archive for agosto \25\UTC 2010
This part of this post assumes that the reader has read my last post.
The Summer of Code is over. It was great to spend time working on GHC and getting money for it. Although the implementation is not mature enough to be included in the repository, I’m happy with the state it is now. I think it’s a good start, and I plan to keep working on it. It’s good to see how my motivation increased now that the program is over, and that I’m free to not work on it if I want to. So I’m going to take a look at it again, as soon as I do the stuff I was postergating during the program. I’ve created a wiki page for it, and my plans are to implement and measure the “Remove partial lists” patch, and then to debug “Allocate in lines in minor GCs”. Any help is very welcome.
I’ve decided to try using local DNS caching again, using the simple dnsmasq. The problem is that, even including
prepend domain-name-servers 127.0.0.1;
in /etc/dhcp3/dhclient.conf, NetworkManager will overwrite /etc/resolv.conf, and remove
from there. I found out in a comment here, that I could create a script in /etc/NetworkManager/dispatcher.d/ to make this line be included in the file. The script is simple
#!/bin/sh set -e cp /etc/resolv.conf /etc/resolv.conf.tmp echo nameserver 127.0.0.1 > /etc/resolv.conf cat /etc/resolv.conf.tmp >> /etc/resolv.conf rm /etc/resolv.conf.tmp
and it’s working
This part of this post assumes that the reader has read my last post.
This post contains embarassing news: In my effort to organize the patches to make it easier to run a script that benchmarks each change to see how it affects performance, I made a very simple mistake, which mostly would disable Immix. This was done in the first patch of the stack, so all patches were compromised. I noticed this while I was attending Simon Marlow’s request of measuring how many lines are being freed and allocated, and notice that it was none. My error was simply to set two different constants with the same value: BF_MEDIUM and BF_SWEPT. After the correction, I noticed two more errors that were introduced in this period, and that were not corrected because, as Immix was not activated, they were not causing any harm. Now, with everything in the stage I was before: a memory corruption that happens ocasionaly only in the nofib/real/maillist benchmark. As described in this link, I’ve tried already a lot of different techniques to debug this one, but it’s being hard. In the mean time, I’m measuring again the stack of patches, in a lab using the complete nofib (except for nofib/real/maillist) and nofib/gc, and in my laptop only fibheaps, fulsom and constraints, from nofib/gc, which were used in my last post.
Debconf was great. I’ve stayed in New York with my family doing tourism for some days before it, and New York city is fantastic. A very diverse architecture and people from all around the world. It’s really impressive. I arrived in DebCamp on Friday and it was not easy for me to find where to check-in, and where to go after unpacking my stuff. I assume I haven’t paid much attention to the e-mails explaining this, but I tried to search them when I got there, without success. It was good, on the other hand, to walk on Columbia searching: it’s a very nice campus, not comparable to anything we have here in Brazil. During DebConf I tried to help the organization as much as I could, being Talk-meister in some talks, helping a little bit with food organization and participating in the video team. All of this was very fun. There were some talks that I couldn’t attend to, and I watched the videos, or part of them, here on Monday: Making Debian rule again, FreeBSD, Anti-features… I still want to watch some, like the Analysis of Debian mailling lists using machine learning, and there are others that I wanted to see but are not recorded, like the one about notmuch. I got my key signed by a lot of people, which certainly was not the case before, and it the GPG best practices BoF was great. And there was the proposal to host DebConf12 in Brazil, maybe in my city, Belo Horizonte. Which reminds me:
This post assumes that the reader has read my last post.
I’ve been trying to find a reasonable method (which computer to use, single user or multi user) to benchmark for some time, and now an e-mail from Simon Marlow calmed me down. He suggests that in this stage it’s not necessary to run a long benchmark, but only a short one, that will be enought to have an idea about what each change is causing to performance.
For some time I was struggling with the benchmarks to get a reasonable result. I’ve started more than once to write a blog post analysing the results I’ve got, until in the middle of the post I notice the results are not consistent. This usually happens when there’s a big distortion in benchmarks that shouldn’t present no change. For instance, in a patch that only changes code that is executed with +RTS -w -RTS, there was a difference of 50% on execution time in default mode, that is, without +RTS -w -RTS.
I think that only disabling CPU scaling and running a small number of tests, which make it easier to reproduce a possibly inconsistent result, will be enough to get a good data.
I’ve made a stack of patches, splitting each part of my work. The stack does not reflect my preferences between the patches, that is, the complete stack applied is not what I think that is the best change to GHC. It’s only a set of patches applied linearly, in a way that it’s easy to see the impact in performance of each change. The patches are:
Don’t check for swept blocks in -DS.
The checkHeap function assumed the allocated part of the block contained only alive objects and slops. This was not true for blocks that are collected using mark sweep. The code in this patch skip the test for this kind of blocks.
This patch is only needed to make it possible to run the binary with the parameters +RTS -w -DS -RTS. I’ve already explained this patch. I didn’t measure the difference in performance before this patch, since it only changes code that is not going to be executed in the benchmarks, since none of them are runned with -DS.
Immix: allocate and free memory on lines.
This is the main patch with the initial implementation of Immix. As it is the initial implementation, it has it’s problems, that will be treated in the following patches. The comparison was done using 3 programs from nofib/gc: fibheaps, fulsom and constraints, as suggested by Simon Marlow. Two tests were made for each version of the compiler code: one using the default GC strategie (copy-collection with mark-compact if there’s few heap space left), another one using mark-sweep (or Immix after my changes). When comparing the default GC strategie between before and after this change, the program got 0.4% slower, with the collection of the generation 0 being 0.1% slower and the collection of the generation 1 0.4% slower. Comparing the mark-sweep with Immix, it got 13.7% slower, 15% in GC0 and 7.2% in GC1. The change in memory used is irrelevant.
Chose between allocating in line or blocks in todo_block_full
The first improvement I’ve made is to change the place where the allocation was searching for free lines. Most of the changes were done in the function alloc_for_copy(), in rts/sm/Evac.c. This is bad because this function is called a lot of times, and should be kept fast. So I moved them to ctodo_block_full() in rts/sm/GCUtils.c. It got about 9.9% and 9.6% faster in default and sweep, which makes it 1% faster and 2.8% slower than in the original code.
Improvements in sweep()
I was bothered that the code I’ve written to free memory in lines was messy, and I thought it could be simplified and maybe even turn faster. The general execution time increased 0.4%, but the GC1 time, which is the only one that should be affected, was reduced by 0.6%. In the default GC strategie, there was also a 0.2% increase in time, so I guess this is inside the margin of change. I have to investigate better on this one, but it seems to have improved slightly the performance.
Line before inscresing block size.
In a conversatio with Simon Marlow, we thought about these two options: one is to search for a line after trying to increase the block size, and that’s what I did in Chose between allocating in line or blocks in todo_block_full; the other is trying to search a line first, and just increase the block size if a block (and not a line) was being used for allocation and there is no line available. With this patch I’ve changed it to use the second one. Using sweep, the GC time got 0.5% slower, and with default it got 0.1%.
rts/sm/Sweep.c: Mark all BF_MEDIUM blocks as BF_FRAGMENTED
In my first implementation, the blocks that contain objects bigger than a line (of medium size) are marked with BF_MEDIUM, and are treated as the usual block in mark-sweep: if the block is empty, it’s freed; if it’s very fragmented, it’s marked to be collected using copy-collection. In this patch, instead of making it use the rules of mark-sweep, I just mark it as fragmented and make it be collected by copy-collection. This made the GC0 time be reduced by 3%, but caused an increase of 509.7% in the GC1 time. This was also the first time the memory used was reduced, by 25.4%. The results in the default mode are insignificant. What it seems to me is that the copy-collection code is more efficient with memory usage, but also more slow. Marking objects as BF_FRAGMENTED make this algorithm be used even for full objects.
rts/sm/Sweep.c: Don’t use 3/4 heuristics to mark as BF_FRAGMENTED
The mark-sweep algorithm considered that an object was fragmented if more than 1/4 of the word groups are completely unmarked. I kept using this heuristic, and this change removes it, making more use of the Immix, and less use of copy-collection. This didn’t change the memory used, and made the code 10.4% slower in sweep. In default GC mode, there was no relevant difference, as expected.
Allocate in lines in minor GCs.
In the initial implementation I was only allocating on lines only in major GCs, because I needed the mark stack, which was only available in these kind of allocations. In this change, I created the mark stack on all GCs, and used the allocation on lines. The results for the default mode are insignificant, and the code got 9.0% faster in sweep, using the same memory.
I’ve made a selection with the presumably better patches, which are: Don’t check for swept blocks in -DS, Immix: allocate and free memory on lines, Chose between allocating in line or blocks in todo_block_full, Improvements in sweep() and Allocate in lines in minor GCs. This is an attempt to achieve the best set of patches possible, to see how it improved the original code. Comparing the default strategie for this selection and the original code, it got 3.7% slower. Comparing both sweep, it got 3.9% slower. Comparing the original default with the final sweep, it got 4.2% slower and uses 4.7% more memory. There’s a lot of room for improvement, and I’m willing to hear suggestions of what I could change in the code to achieve this improvement.
The results are available. Every time I had doubt about a comparison, I’d run both versions again to check. This is way there are some backup files. The data presented here is not from all the most recent versions of each measurement, but from the ones that I thought were more similar in conditions.