Concurrent Garbage Collection (on Windows)

One of the most annoying properties of the current garbage collector used by the D programming language is that it has to "stop the world" while running a collection. An application that has allocated a few hundred MB of GC-managed memory can experience pause times of several seconds, even on modern hardware.

The last couple of days I experimented with some ideas on how to realize concurrent garbage collection on Windows (as implemented by Leandro Lucarella for Linux [1]). The Linux version uses "fork" to create an exact snapshot of the running process, which can then be used to find objects that were unreferenced at the time of the snapshot while the process continues to mutate memory (relying on the copy-on-write page protection provided by the virtual memory management of the OS).

Fork on Windows?

Unfortunately, Windows does not have "fork". The closest match is using VirtualProtect with page protection PAGE_WRITECOPY. This only works on a file mapping object created with CreateFileMapping [3], which also allows sharing the memory with another process using MapViewOfFile [4]. You don't have to back the mapping with an actual file; you can use the paging file instead. As far as I can tell, no file I/O happens as long as there is enough physical memory.
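For illustration, here is a minimal sketch of such a pagefile-backed pool, following the combination of APIs described above (the function names and the 16 MB pool size are made up for the example; error handling is omitted):

    import core.sys.windows.windows;

    enum uint POOLSIZE = 16 * 1024 * 1024; // arbitrary example size

    // create GC pool memory backed by the paging file; the returned
    // mapping handle can be used to share the memory with another process
    void* createSharedPool(out HANDLE hMap)
    {
        // INVALID_HANDLE_VALUE: no real file, the paging file backs the mapping
        hMap = CreateFileMappingW(INVALID_HANDLE_VALUE, null, PAGE_READWRITE,
                                  0, POOLSIZE, null);
        return MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, POOLSIZE);
    }

    // at the start of a collection: subsequent writes create private page
    // copies, while the mapping object keeps the snapshot
    void protectPoolForCollection(void* pool)
    {
        DWORD oldProtect;
        VirtualProtect(pool, POOLSIZE, PAGE_WRITECOPY, &oldProtect);
    }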

So here is a plan: create a child process and allocate GC-managed memory using CreateFileMapping, so it can be shared with the child process. When starting a collection, use VirtualProtect to set the memory to copy-on-write. This works as expected for the first collection, but afterwards some pages have been duplicated by COW in the application process. How do you get the state of these pages back into the shared memory view? There seems to be no efficient way to do this using the virtual memory manager.

One slow way out could be to map the shared memory a second time and copy the modified pages into this view. But there might still be references into the modified pages, so you cannot release them. In the long run this ends up with all the memory having a non-shared copy in the application process.

What you really want is the reverse behaviour: make a copy of the pages before the modification and use this copy during collection in the other process. When the collection is done these copies can just be thrown away without affecting the application.

[Update 2014/6/15] Dmitry Olshansky proposed a way out of the dilemma: instead of using multiple processes that have to copy memory with the slow operations ReadProcessMemory/WriteProcessMemory, the collection can be done by a thread in the same process, with the benefit of fast memory access. When starting a collection, the GC-managed memory is mapped to a secondary shared view within the same process, and the primary view is set to copy-on-write by VirtualProtect. When the collection is done, modified pages are copied back into the shared view. The COW view is then released and replaced by a mapping of the shared memory at the very same virtual address using MapViewOfFileEx. Finally, the secondary view used by the GC can be released. [/Update]
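A rough sketch of this scheme follows; the globals and the list of modified pages are illustrative, and a real implementation would have to pause the mutator threads briefly around the view swap so no thread accesses the heap while it is unmapped:

    import core.sys.windows.windows;
    import core.stdc.string : memcpy;

    enum uint POOLSIZE = 16 * 1024 * 1024;
    enum size_t PAGESIZE = 4096;

    __gshared HANDLE gcMapping;   // mapping object backing the GC heap
    __gshared ubyte* gcHeap;      // view the application reads and writes
    __gshared ubyte* gcView;      // secondary view used by the collector

    void beginCollection()
    {
        // second view of the same memory; it keeps showing the snapshot
        // because mutator writes below only touch private COW copies
        gcView = cast(ubyte*) MapViewOfFile(gcMapping, FILE_MAP_ALL_ACCESS,
                                            0, 0, POOLSIZE);
        DWORD old;
        VirtualProtect(gcHeap, POOLSIZE, PAGE_WRITECOPY, &old);
    }

    void endCollection(size_t[] modifiedPages)
    {
        // transfer the mutator's private copies back into the shared memory
        foreach (i; modifiedPages)
            memcpy(gcView + i * PAGESIZE, gcHeap + i * PAGESIZE, PAGESIZE);

        // drop the COW copies and remap the shared memory at the same address
        UnmapViewOfFile(gcHeap);
        gcHeap = cast(ubyte*) MapViewOfFileEx(gcMapping, FILE_MAP_ALL_ACCESS,
                                              0, 0, POOLSIZE, gcHeap);
        UnmapViewOfFile(gcView);
    }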

Another issue is how to detect modified pages: the function GetWriteWatch does exactly that in combination with ResetWriteWatch, but it only works on memory allocated with VirtualAlloc using the MEM_WRITE_WATCH flag. It cannot be applied later to shared memory created with CreateFileMapping. Using QueryWorkingSet to detect whether a page is still shared seems to work, though you have to walk all the pages yourself. [Update 2014/6/15] Another option is VirtualQuery, which returns regions of identical page protection, but it seems slightly slower than QueryWorkingSet. [/Update]
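Used on memory from VirtualAlloc, write watching boils down to something like the following sketch (the constants and the prototype are given explicitly with their winnt.h/winbase.h values, in case the Windows bindings at hand don't declare them):

    import core.sys.windows.windows;

    enum DWORD MEM_WRITE_WATCH = 0x200000;
    enum DWORD WRITE_WATCH_FLAG_RESET = 1;
    extern(Windows) UINT GetWriteWatch(DWORD flags, PVOID base, SIZE_T size,
                                       PVOID* addresses, ULONG_PTR* count,
                                       PDWORD granularity);

    enum uint POOLSIZE = 16 * 1024 * 1024;

    // write watching must be requested when the memory is allocated
    void* allocWatchedPool()
    {
        return VirtualAlloc(null, POOLSIZE,
                            MEM_RESERVE | MEM_COMMIT | MEM_WRITE_WATCH,
                            PAGE_READWRITE);
    }

    // returns the base addresses of all pages written since the last reset;
    // the RESET flag clears the watch state in the same call
    void*[] modifiedPages(void* pool, void*[] buffer)
    {
        ULONG_PTR count = buffer.length;
        DWORD granularity;
        GetWriteWatch(WRITE_WATCH_FLAG_RESET, pool, POOLSIZE,
                      buffer.ptr, &count, &granularity);
        return buffer[0 .. count];
    }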

COW on Windows?

So let's get some benchmarks first to estimate the cost of copy-on-write. Ignoring initial cache effects, here are the results for the first write access to a page (the code can be found here [7]):

page allocated with VirtualAlloc: 1800 cycles
page allocated with VirtualAlloc+MEM_WRITE_WATCH: 1800 cycles
page allocated with VirtualAlloc+ResetWriteWatch: 1100 cycles
page allocated with CreateFileMapping+MapViewOfFile: 2200 cycles
page allocated with CreateFileMapping+MapViewOfFile+COW: 4600 cycles

[These numbers were measured on Win 8.1/64 with a mobile Intel processor (i7-2670QM @ 2.2 GHz). Getting reliable numbers on it is not easy: I use a power scheme with minimum and maximum processor state set to 99% to keep the processor from changing core frequency, using Turbo Boost, or powering down too many components, but averaged values still vary quite a bit from run to run. The values shown are typical rounded minimum results over 1024 pages; the observed averages are usually 10-20% above the minimum.]

The values don't seem to vary much between 32-bit and 64-bit processes. Coincidentally, Linus Torvalds very recently measured the time for a page fault on Linux [6], with a result of a bit more than 1000 cycles.
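The full benchmark code is at [7]; stripped down, the measurement presumably amounts to touching each page once and bracketing the access with timestamp reads, roughly like this (helper names are mine; rdtsc via the DMD inline assembler, x86/x86-64 only):

    ulong rdtsc()
    {
        version (D_InlineAsm_X86_64)
            asm { naked; rdtsc; shl RDX, 32; or RAX, RDX; ret; }
        else
            asm { naked; rdtsc; ret; } // Win32: ulong is returned in EDX:EAX
    }

    void measureFirstWrite(ubyte* mem, size_t pages, ulong[] cycles)
    {
        foreach (i; 0 .. pages)
        {
            auto t0 = rdtsc();
            mem[i * 4096] = 1;          // first write to the page faults
            cycles[i] = rdtsc() - t0;   // includes the commit/COW handling
        }
    }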

Seeing quite a penalty for copy-on-write done by the OS, I considered doing the copy-on-write in a user exception handler for writes to a read-only page as well (ignoring the fact that this will ruin the debugging experience). This would allow customizing the behaviour as mentioned above. Capturing writes to protected pages should be handled before going through any other exception handlers, so hooking the user-mode callback for exceptions seems the fastest feasible way.
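To make the idea concrete, here is roughly what such a first-chance write handler could look like, sketched with a vectored exception handler (this is not the user-mode callback hook itself, just the most accessible documented way to intercept the fault before other handlers run; the continuation constants from excpt.h are declared explicitly in case the bindings lack them):

    import core.sys.windows.windows;

    enum : LONG { EXCEPTION_CONTINUE_SEARCH = 0, EXCEPTION_CONTINUE_EXECUTION = -1 }

    enum size_t PAGESIZE = 4096;

    extern(Windows) LONG cowHandler(EXCEPTION_POINTERS* ep)
    {
        auto rec = ep.ExceptionRecord;
        // ExceptionInformation[0] == 1: the access violation was a write
        if (rec.ExceptionCode == EXCEPTION_ACCESS_VIOLATION &&
            rec.ExceptionInformation[0] == 1)
        {
            auto page = cast(void*)(rec.ExceptionInformation[1] & ~(PAGESIZE - 1));
            // a real handler would copy the page aside for the collector here,
            // then make it writable and retry the faulting instruction
            DWORD old;
            VirtualProtect(page, PAGESIZE, PAGE_READWRITE, &old);
            return EXCEPTION_CONTINUE_EXECUTION;
        }
        return EXCEPTION_CONTINUE_SEARCH;
    }

    void installHandler()
    {
        AddVectoredExceptionHandler(1 /* call before other handlers */, &cowHandler);
    }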

Here are the benchmark results for different kinds of exception handlers (these don't actually do a full copy of the page but just change the address register to point to writable memory; according to the measurements above, copying the page takes more than 2000 cycles when done by the OS):

                                                        Win32           Win64
User mode callback for page fault exceptions            28000 cycles    7000 cycles
Page fault handled by the first SEH exception handler   30000 cycles    not measured
Page fault handled by catch(Throwable)                  4000000 cycles  unavailable
A standard D exception using try { throw } catch        18500 cycles    4200 cycles

Please note that the dmd exception handler for Win64 uses a non-native format that avoids the OS exception mechanism. Therefore, capturing page faults does not work there.

Concurrency in the GC

While dealing with the COW issues inside the process itself looks acceptable for Win64, the overhead for 32-bit processes is considerable. Writing a kernel driver for Win32 that deals with these issues could probably reach performance similar to the OS implementation, but having to install a driver with every application written in D is inappropriate.

A slightly different approach is inspired by the existence of GetWriteWatch and its quite good performance: when a collection is needed (e.g. no more free memory is available),

- stop the world, take a snapshot of the roots (stacks, registers and global data), reset the write watch on the GC memory, and resume the application,
- mark referenced objects in a background thread while the application keeps running,
- stop the world again, query the pages modified since the first snapshot, take a new snapshot of the roots, and continue marking from these,
- resume the application and sweep unmarked objects.

This results in objects staying alive that were referenced when the first or the second snapshot was taken.

Current GC

In a nutshell, this is how the current D runtime GC works:

Concurrency in the GC

With the building blocks of the current implementation, this is a more detailed description of a concurrent GC:

Shared data structures

The GC calls by the user are still synchronized with a global lock, so all user threads can be considered a single main thread. While the background thread is running a mark phase, the main thread still modifies pool memory and data structures. Consistent reading and writing of these structures from different threads has to be guaranteed without additional synchronization.
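One building block for this (illustrative only, not necessarily what the implementation does) is to make contended updates single atomic operations, e.g. for mark bits, using D's core.atomic:

    import core.atomic;

    // one mark bit per 16-byte granule of a pool; hypothetical layout
    shared(size_t)[] markBits;

    void setMark(size_t biti)
    {
        enum bitsPerWord = size_t.sizeof * 8;
        // a single atomic read-modify-write: safe against concurrent
        // accesses to neighbouring bits from the other thread
        atomicOp!"|="(markBits[biti / bitsPerWord],
                      cast(size_t) 1 << (biti % bitsPerWord));
    }

    bool isMarked(size_t biti)
    {
        enum bitsPerWord = size_t.sizeof * 8;
        return (atomicLoad(markBits[biti / bitsPerWord])
                & (cast(size_t) 1 << (biti % bitsPerWord))) != 0;
    }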

Implementation details: [Update 2014/6/15]

[/Update]

Interaction with GC calls on a user thread during collection

The user API to the GC has the following operations:

Concurrent operations not marked "OK" still have to be analyzed.

Implementation of a concurrent GC

You can find an implementation of this concurrent GC here: https://github.com/rainers/druntime/tree/concurrent_gc2. It has some configuration switches enabled through the environment variable D_GC and some version identifiers to be defined when compiling the GC.

A benchmark suite has recently been added to druntime; these are the results for Win64 (I have disabled tree2.d as it seems sensitive to false pointers, which can skew results unpredictably):

D_GC=profile=1:
Standard GC
Test           Exec time  Memory   Collections  Pause time  Max pause
bulk.d         4.121 s    105 MB   19           974 ms      166 ms
resize.d       1.861 s    9 MB     3            1 ms        1 ms
string.d       0.075 s    9 MB     3            2 ms        2 ms
conalloc.d     3.120 s    4 MB     6442         982 ms      0 ms
conappend.d    0.426 s    4 MB     19           4 ms        0 ms
concpu.d       3.171 s    9 MB     2134         924 ms      0 ms
dlist.d        4.113 s    16 MB    102          1840 ms     19 ms
huge_single.d  0.033 s    1501 MB  3            2 ms        1 ms
rand_large.d   2.461 s    81 MB    5187         972 ms      0 ms
rand_small.d   1.016 s    16 MB    1168         419 ms      0 ms
slist.d        4.286 s    16 MB    102          2036 ms     23 ms
testgc3.d      3.441 s    196 MB   17           1498 ms     262 ms
tree1.d        1.886 s    16 MB    149          1199 ms     14 ms
words.d        1.658 s    379 MB   7            89 ms       57 ms
Average/Max    2.262 s    168 MB                781 ms      262 ms

There is a gratuitous collection running at the very end of the application; let's take this out. We also segregate pools into memory that might contain references and plain data (see the sketch after the table):

D_GC=profile=1 finalCollect=0, version=POOL_NOSCAN:
Standard GC
Test           Exec time  Memory   Collections  Pause time  Max pause
bulk.d         4.573 s    105 MB   19           934 ms      158 ms
resize.d       1.871 s    9 MB     3            1 ms        0 ms
string.d       0.058 s    18 MB    4            2 ms        1 ms
conalloc.d     2.651 s    9 MB     2469         424 ms      0 ms
conappend.d    0.041 s    9 MB     8            1 ms        0 ms
concpu.d       3.785 s    16 MB    2459         1227 ms     0 ms
dlist.d        5.159 s    38 MB    54           2900 ms     64 ms
huge_single.d  0.031 s    1504 MB  3            1 ms        1 ms
rand_large.d   1.741 s    102 MB   3371         842 ms      0 ms
rand_small.d   1.033 s    16 MB    1156         472 ms      0 ms
slist.d        5.458 s    18 MB    102          3207 ms     33 ms
testgc3.d      3.130 s    185 MB   16           1220 ms     212 ms
tree1.d        1.496 s    16 MB    81           799 ms      11 ms
words.d        1.659 s    467 MB   9            66 ms       38 ms
Average/Max    2.334 s    179 MB                864 ms      212 ms
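The user-visible counterpart of this segregation is the NO_SCAN block attribute from core.memory; the POOL_NOSCAN version groups such blocks into pools that the mark phase can skip entirely. A minimal usage example (the size is arbitrary):

    import core.memory;

    void main()
    {
        // a block guaranteed to contain no pointers: the GC never scans it,
        // and with POOL_NOSCAN it lands in a data-only pool
        enum size = 1024 * 1024;
        auto data = cast(ubyte[]) GC.malloc(size, GC.BlkAttr.NO_SCAN)[0 .. size];
        data[] = 0;
    }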

This is a run with the GC disabled. It will still collect if it doesn't get more memory from the OS (I have disabled paging).

D_GC=profile=1 finalCollect=0 disable=1, version=POOL_NOSCAN:
Standard GC
Test           Exec time  Memory   Collections  Pause time  Max pause
bulk.d         3.405 s    741 MB   0            0 ms        0 ms
resize.d       2.283 s    16 MB    0            0 ms        0 ms
string.d       0.060 s    25 MB    0            0 ms        0 ms
conalloc.d     3.368 s    4096 MB  0            0 ms        0 ms
conappend.d    0.044 s    25 MB    0            0 ms        0 ms
concpu.d       2.784 s    4096 MB  0            0 ms        0 ms
dlist.d        2.415 s    512 MB   0            0 ms        0 ms
huge_single.d  0.031 s    1504 MB  1            1 ms        1 ms
rand_large.d   0.505 s    5664 MB  11           73 ms       6 ms
rand_small.d   4.102 s    5376 MB  0            0 ms        0 ms
slist.d        2.423 s    512 MB   0            0 ms        0 ms
testgc3.d      2.472 s    704 MB   0            0 ms        0 ms
tree1.d        0.844 s    416 MB   0            0 ms        0 ms
words.d        1.609 s    470 MB   0            0 ms        0 ms
Average/Max    1.881 s    1725 MB               5 ms        6 ms

This is the concurrent GC with write watches:

D_GC=profile=1 finalCollect=0 concurrent=1, version=POOL_NOSCAN:
Concurrent GC
Test           Exec time  Memory   Collections  Pause time  Max pause
bulk.d         4.476 s    293 MB   15           941 ms      159 ms
resize.d       2.295 s    9 MB     2            0 ms        0 ms
string.d       0.584 s    25 MB    4            3 ms        1 ms
conalloc.d     2.584 s    576 MB   25           15 ms       14 ms
conappend.d    0.154 s    25 MB    4            1 ms        0 ms
concpu.d       2.581 s    576 MB   25           49 ms       14 ms
dlist.d        3.065 s    225 MB   11           522 ms      133 ms
huge_single.d  0.133 s    1504 MB  1            1 ms        0 ms
rand_large.d   0.529 s    2144 MB  74           95 ms       1 ms
rand_small.d   1.963 s    576 MB   25           108 ms      15 ms
slist.d        3.013 s    169 MB   10           603 ms      132 ms
testgc3.d      2.898 s    288 MB   12           687 ms      156 ms
tree1.d        1.019 s    121 MB   10           243 ms      69 ms
words.d        1.770 s    470 MB   8            87 ms       47 ms
Average/Max    1.933 s    500 MB                239 ms      159 ms

This is the concurrent GC with copy-on-write:

D_GC=profile=1 finalCollect=0 concurrent=1, version=POOL_NOSCAN version=COW:
Concurrent GC
Test           Exec time  Memory   Collections  Pause time  Max pause
bulk.d         4.812 s    261 MB   14           502 ms      85 ms
resize.d       1.987 s    9 MB     2            2 ms        1 ms
string.d       0.592 s    25 MB    4            6 ms        1 ms
conalloc.d     2.508 s    704 MB   29           54 ms       2 ms
conappend.d    0.145 s    25 MB    3            4 ms        1 ms
concpu.d       3.048 s    544 MB   24           575 ms      60 ms
dlist.d        2.743 s    225 MB   10           198 ms      59 ms
huge_single.d  0.126 s    1504 MB  1            2 ms        1 ms
rand_large.d   1.232 s    2144 MB  73           361 ms      6 ms
rand_small.d   1.531 s    576 MB   25           196 ms      13 ms
slist.d        2.734 s    225 MB   10           255 ms      84 ms
testgc3.d      2.988 s    320 MB   12           451 ms      124 ms
tree1.d        1.099 s    121 MB   10           320 ms      89 ms
words.d        2.482 s    470 MB   8            48 ms       26 ms
Average/Max    2.001 s    510 MB                212 ms      124 ms

Most of the applications run faster with the concurrent GC than with the traditional implementation, mostly due to shorter pause times, i.e. the time the application threads are stopped by the GC. This comes at the cost of much higher memory usage for applications that allocate faster than the GC can find unused memory. Some of the performance gain also results from fewer collections being run.

Links: