Concurrent Garbage Collection (on Windows)
One of the most annoying properties of the current garbage collector used by the D programming language is that it needs to "stop the world" while running a collection. An application that has allocated a few hundred MB of GC managed memory will experience pause times of several seconds, even on modern hardware.
Over the last couple of days I have experimented with some ideas on how to realize concurrent garbage collection (as implemented by Leandro Lucarella for Linux [1]) on Windows. The Linux version uses "fork" to create an exact snapshot of the running process, which can then be used to find objects that were unreferenced at the time of the snapshot while the process continues to mutate memory (relying on the copy-on-write page protection provided by the virtual memory management of the OS).
Fork on Windows?
Unfortunately, Windows does not have "fork". The closest match is using VirtualProtect [2] with page protection PAGE_WRITECOPY. This only works on a file mapping object created with CreateFileMapping [3], which also allows sharing the memory with another process using MapViewOfFile [4]. You don't have to back it with an actual file; you can use the paging file instead. As far as I can tell, no file I/O happens if you have enough memory.
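To make this concrete, here is a minimal D sketch of a paging-file-backed mapping that is later switched to copy-on-write with VirtualProtect; the size is arbitrary and error handling is reduced to asserts:

```d
import core.sys.windows.windows;

enum size_t mapSize = 16 * 4096;    // 16 pages, just for illustration

void main()
{
    // create a file mapping backed by the paging file (no actual file needed)
    HANDLE hMap = CreateFileMappingW(INVALID_HANDLE_VALUE, null, PAGE_READWRITE,
                                     0, cast(DWORD) mapSize, null);
    assert(hMap !is null);

    // map a read/write view; another process could map the same object to
    // share this memory
    void* view = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, mapSize);
    assert(view !is null);

    (cast(ubyte*) view)[0] = 1;     // ordinary write, goes to the shared pages

    // switch the view to copy-on-write: the next write to a page creates a
    // private copy for this process, the shared pages remain untouched
    DWORD old;
    BOOL ok = VirtualProtect(view, mapSize, PAGE_WRITECOPY, &old);
    assert(ok);

    (cast(ubyte*) view)[0] = 2;     // triggers COW for the first page

    UnmapViewOfFile(view);
    CloseHandle(hMap);
}
```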
So here is a plan: create a child process and allocate GC managed memory with CreateFileMapping so that it can be shared with the child process. When starting a collection, use VirtualProtect to set the memory to copy-on-write. This works as expected for the first collection, but afterwards some pages have been duplicated by COW in the application process. How do you get the state of these pages back into the shared memory view? There seems to be no efficient way to do this using the virtual memory manager.
One slow way out could be to map the shared memory again, then copy the modified pages into this view. But there might still be references into the modified pages, so you cannot release them. In the long run this ends up with all the memory having a non-shared copy in the other process.
What you really want is the reverse behaviour: make a copy of the pages before they are modified and use this copy during collection in the other process. When the collection is done, these copies can just be thrown away without affecting the application.
[Update 2014/6/15] Dmitry Olshansky proposed a way out of the dilemma: instead of using multiple processes that have to copy memory with the slow ReadProcessMemory/WriteProcessMemory operations, the collection can be done by a thread in the same process, which has fast access to the memory. When starting a collection, the GC managed memory is mapped to a secondary shared view within the same process, and the primary view is set to copy-on-write with VirtualProtect. When the collection is done, modified pages are copied back into the shared view. The COW view is then released and replaced by a mapping of the shared memory at the very same virtual address using MapViewOfFileEx, after which the collector's secondary view of the memory can be released. [/Update]
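A rough sketch of this view juggling, assuming a single pool-sized mapping; error handling is omitted, everything is meant to run with the world stopped, and for brevity the whole region is copied back rather than only the modified pages (the names createPool/startCollection/finishCollection are purely illustrative):

```d
import core.stdc.string : memcpy;
import core.sys.windows.windows;

enum size_t poolSize = 64 * 4096;   // one toy "pool"; the real GC has many

__gshared HANDLE hMap;      // paging-file backed mapping holding the pool memory
__gshared void*  gcView;    // primary view: used by the application for allocations
__gshared void*  scanView;  // secondary view: an unchanged snapshot for the collector

void createPool()
{
    hMap = CreateFileMappingW(INVALID_HANDLE_VALUE, null, PAGE_READWRITE,
                              0, cast(DWORD) poolSize, null);
    gcView = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, poolSize);
}

// called with the world stopped
void startCollection()
{
    // map a second view of the same memory for the collector thread ...
    scanView = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, poolSize);

    // ... and turn the primary view copy-on-write, so that mutator writes go
    // to private page copies while the collector sees an unchanged snapshot
    DWORD old;
    VirtualProtect(gcView, poolSize, PAGE_WRITECOPY, &old);
}

// called with the world stopped, after the mark phase has finished
void finishCollection()
{
    // copy pages modified during the collection back into the shared memory
    // (a real implementation would copy only the COW'd pages, e.g. found via
    // QueryWorkingSet; copying the whole region keeps the sketch short)
    memcpy(scanView, gcView, poolSize);

    // drop the COW'd primary view and remap the shared memory read-write at
    // the very same address, so pointers into the pool stay valid
    UnmapViewOfFile(gcView);
    gcView = MapViewOfFileEx(hMap, FILE_MAP_ALL_ACCESS, 0, 0, poolSize, gcView);

    // the collector's snapshot view is no longer needed
    UnmapViewOfFile(scanView);
    scanView = null;
}

void main()
{
    createPool();
    startCollection();
    (cast(ubyte*) gcView)[0] = 1;   // mutator write: lands in a private COW copy
    finishCollection();             // the write is now back in the shared memory
}
```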
Another issue is how to detect modified pages: there is a function GetWriteWatch that does exactly that in combination with ResetWriteWatch, but it only works on memory allocated with VirtualAlloc using the MEM_WRITE_WATCH flag. It cannot be applied later to shared memory created with CreateFileMapping. Using QueryWorkingSet to detect whether a page is still shared seems to work, though you'll have to walk all the pages yourself. [Update 2014/6/15] Another option is VirtualQuery, which returns regions of identical page protection, but it seems slightly slower than QueryWorkingSet. [/Update]
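For reference, a small D sketch of the write-watch mechanism; the MEM_WRITE_WATCH flag and the GetWriteWatch/ResetWriteWatch prototypes are declared locally (mirroring the documented Win32 declarations) in case the bindings in use do not provide them:

```d
import core.stdc.stdio : printf;
import core.sys.windows.windows;

// write-watch API, declared here in case the bindings do not provide it
enum DWORD MEM_WRITE_WATCH = 0x200000;
extern (Windows) UINT GetWriteWatch(DWORD flags, PVOID baseAddress, SIZE_T regionSize,
                                    PVOID* addresses, ULONG_PTR* count, ULONG* granularity);
extern (Windows) UINT ResetWriteWatch(LPVOID baseAddress, SIZE_T regionSize);

void main()
{
    enum size_t size = 256 * 4096;

    // write watching must be requested at allocation time; it cannot be
    // enabled later, and not at all for CreateFileMapping memory
    void* base = VirtualAlloc(null, size, MEM_RESERVE | MEM_COMMIT | MEM_WRITE_WATCH,
                              PAGE_READWRITE);

    (cast(ubyte*) base)[0]         = 1;  // dirty the first page
    (cast(ubyte*) base)[10 * 4096] = 1;  // dirty the eleventh page

    // ask the OS for the pages written to since allocation (or the last reset)
    PVOID[256] written;
    ULONG_PTR count = written.length;
    ULONG granularity;
    GetWriteWatch(0, base, size, written.ptr, &count, &granularity);

    // written[0 .. count] holds the base addresses of the dirtied pages; a
    // concurrent GC would rescan exactly these pages with the world stopped
    printf("%u pages modified, page granularity %u bytes\n",
           cast(uint) count, cast(uint) granularity);

    ResetWriteWatch(base, size);         // start a new watch interval
    VirtualFree(base, 0, MEM_RELEASE);
}
```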
COW on Windows?
So let's get some benchmarks first to estimate the cost of copy-on-write. Ignoring initial cache effects, here are the results for the first write access to a page (the code can be found here [7]):
Page allocated with | First write access |
---|---|
VirtualAlloc | 1800 cycles |
VirtualAlloc + MEM_WRITE_WATCH | 1800 cycles |
VirtualAlloc + ResetWriteWatch | 1100 cycles |
CreateFileMapping + MapViewOfFile | 2200 cycles |
CreateFileMapping + MapViewOfFile + COW | 4600 cycles |
[These numbers were measured on Win 8.1/64 with a mobile Intel processor (i7-2670QM @ 2.2 GHz). Getting reliable numbers on it doesn't work too well, even though I use a power scheme with minimum and maximum processor state set to 99% to keep the processor from changing core frequency, using Turbo Boost or disabling too many components. Still, averaged values vary quite a bit from run to run. The values shown are typical rounded minimum results over 1024 pages; the observed averages are usually 10-20% above the minimum.]
The values don't seem to vary much between 32-bit and 64-bit processes. Coincidentally, Linus Torvalds very recently measured the time for a page fault on Linux [6], arriving at a bit more than 1000 cycles.
Seeing quite a penalty for copy-on-write done by the OS, I considered doing the copy-on-write in a user exception handler for writes to a read-only page as well (ignoring the fact that this will ruin the debugging experience). This would allow customizing the behaviour as mentioned above. Writes to protected pages should be captured before any other exception handlers run, so hooking the user mode callback for exceptions seems the fastest feasible way.
Here are the benchmark results for different kinds of exception handlers (these don't actually do a full copy of the page but just change the address register to point to writable memory; according to the measurements above, copying the page takes more than 2000 cycles when done by the OS):
Handler | Win32 | Win64 |
---|---|---|
User mode callback for page fault exceptions | 28000 cycles | 7000 cycles |
Page fault handled by the first SEH exception handler | 30000 cycles | not measured |
Page fault handled by catch(Throwable) | 4000000 cycles | unavailable |
A standard D exception using try { throw } catch | 18500 cycles | 4200 cycles |
Please note that the dmd exception handler for Win64 uses a non-native format that avoids the OS exception mechanism. Therefore, capturing page faults does not work there.
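The measurements above hook the user-mode exception callback directly; as a simpler illustration of the same user-level copy-on-write idea, here is a sketch that uses a vectored exception handler to snapshot a page on its first write and then unprotect it. The constants and the AddVectoredExceptionHandler prototype are declared locally in case the bindings lack them; a real implementation would hook the callback instead to get ahead of all other handlers:

```d
import core.sys.windows.windows;

// declared locally in case the bindings do not provide them; they mirror
// the documented Win32 declarations
alias VEH = extern (Windows) LONG function(EXCEPTION_POINTERS*);
extern (Windows) PVOID AddVectoredExceptionHandler(ULONG firstHandler, VEH handler);
enum DWORD EXCEPTION_ACCESS_VIOLATION = 0xC0000005;
enum LONG EXCEPTION_CONTINUE_EXECUTION = -1;
enum LONG EXCEPTION_CONTINUE_SEARCH = 0;
enum size_t pageSize = 4096;

__gshared ubyte* watchedBase;   // page range protected at collection start
__gshared size_t watchedSize;
__gshared ubyte[] shadow;       // copies of pages taken before their first write

extern (Windows) LONG cowHandler(EXCEPTION_POINTERS* ep)
{
    auto rec = ep.ExceptionRecord;
    // only handle write accesses (ExceptionInformation[0] == 1) ...
    if (rec.ExceptionCode != EXCEPTION_ACCESS_VIOLATION || rec.ExceptionInformation[0] != 1)
        return EXCEPTION_CONTINUE_SEARCH;
    // ... that hit the watched range
    auto p = cast(ubyte*) rec.ExceptionInformation[1];
    if (p < watchedBase || p >= watchedBase + watchedSize)
        return EXCEPTION_CONTINUE_SEARCH;

    // snapshot the untouched page, then make it writable and retry the write
    size_t off = cast(size_t)(p - watchedBase) & ~(pageSize - 1);
    shadow[off .. off + pageSize] = watchedBase[off .. off + pageSize];
    DWORD old;
    VirtualProtect(watchedBase + off, pageSize, PAGE_READWRITE, &old);
    return EXCEPTION_CONTINUE_EXECUTION;
}

void main()
{
    watchedSize = 16 * pageSize;
    watchedBase = cast(ubyte*) VirtualAlloc(null, watchedSize,
                                            MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    shadow = new ubyte[watchedSize];

    AddVectoredExceptionHandler(1, &cowHandler);   // 1 = call this handler first

    // "collection start": write-protect the range so the first write per page faults
    DWORD old;
    VirtualProtect(watchedBase, watchedSize, PAGE_READONLY, &old);

    watchedBase[0] = 42;   // faults once; the handler copies the page and unprotects it
    watchedBase[1] = 43;   // no fault, the page is already writable again
}
```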
Concurrency in the GC
While the results suggest that dealing with the COW issues in the process itself is feasible for Win64, the overhead for 32-bit processes is considerable. Writing a Win32 kernel-mode driver that deals with these issues could reach performance similar to the OS, but having to install a driver with every application written in D is inappropriate.
A slightly different approach is inspired by the existence of GetWriteWatch and its quite good performance. When a collection is needed (e.g. because no more free memory is available):
- take a snapshot of the roots and reset the write watches
- trigger a background thread to mark the heap starting from the roots; normal program flow can continue, including allocating new memory
- once the background mark is finished, take a new snapshot of the roots and rescan modified pages; this assumes that only a few new references have been added, so most of the work has already been done
- sweep unreferenced objects
This results in objects staying alive that were referenced when either the first or the second snapshot was taken.
Current GC
In a nutshell, this is how the current D runtime GC works:
- memory is organized in a number of pools, each handling a contiguous memory area. Bit flags are used to describe the allocations inside this memory block:
  - finals : entries that need a finalizer run on them
  - noscan : entries that should not be scanned
  - appendable : entries that are appendable
  - nointerior : interior pointers should be ignored (only implemented for large object pools)
  - mark : entries already scanned, or that should not be scanned
  - scan : entries that still need to be scanned
  - freebits : entries that are on the free list
- there are free-lists for object sizes that are a power of 2. New allocations draw from these lists, create new lists from free pages inside a pool, or trigger a collection if there are no more free pages available
- when an allocation cannot be served, a full "stop-the-world" mark-and-sweep collection is run:
  - stop the world
  - prepare: scan = false, rebuild free from free-lists, mark = free
  - mark roots: for all entries referenced by stacks/registers/roots/ranges: mark-refs()
  - scan heap: if scan { scan = 0, mark-refs() }
  - sweep: if !mark { if finals { finalizer() }, free = true }
  - recover: clear free-lists, if free, add to free-list
  - resume the world
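To make the interplay of the bit sets concrete, here is a toy model of these phases over a single pool of fixed-size slots; it only illustrates the bookkeeping (free-list recovery is omitted) and is not the actual druntime code:

```d
import std.stdio;

enum slots = 8;

struct ToyPool
{
    // each slot holds up to two "references" (indices of other slots, -1 = none)
    int[2][slots] refs;
    bool[slots] mark, scan, freebits;
}

void markRef(ref ToyPool p, int idx)
{
    if (idx < 0 || p.mark[idx]) return;
    p.mark[idx] = true;    // alive
    p.scan[idx] = true;    // and its contents still need to be scanned
}

void collect(ref ToyPool p, int[] roots)
{
    // prepare: scan = false, mark = free (free slots are never swept or scanned)
    p.scan[] = false;
    p.mark[] = p.freebits[];

    // mark roots
    foreach (r; roots) markRef(p, r);

    // scan heap: process slots until no scan bit is left
    bool again = true;
    while (again)
    {
        again = false;
        foreach (i; 0 .. slots)
            if (p.scan[i])
            {
                p.scan[i] = false;
                foreach (r; p.refs[i]) markRef(p, r);
                again = true;
            }
    }

    // sweep: everything unmarked and not already free is reclaimed
    foreach (i; 0 .. slots)
        if (!p.mark[i] && !p.freebits[i])
        {
            p.freebits[i] = true;        // finalizers would run here
            writeln("slot ", i, " freed");
        }
}

void main()
{
    ToyPool p;
    foreach (ref r; p.refs) r = [-1, -1];
    p.refs[0] = [1, -1];       // 0 -> 1
    p.refs[1] = [2, -1];       // 1 -> 2
    p.refs[3] = [4, -1];       // 3 -> 4, but 3 is unreachable
    p.freebits[5 .. 8] = true; // slots 5..7 are on the free list
    collect(p, [0]);           // prints: slot 3 freed, slot 4 freed
}
```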
A concurrent GC in detail
With the building blocks of the current implementation, this is a more detailed description of a concurrent GC:
- create a background thread for the mark phase
- allocate GC managed memory with VirtualAlloc and MEM_WRITE_WATCH
- when a collection seems necessary:
  - if the background GC is still active
    - continue with allocating a new pool
  - else
    - stop the world
    - if the background GC is done with a previous mark
      - if memory watches are used
        - mark all the roots that are active now
        - mark references from pages that have been written to since the last reset
        - scan the heap for new references from new roots/modified pages (in the hope that there are only a few left; if the number of new references exceeds some threshold, this could be delegated to the background thread again)
      - [Update 2014/6/15] if copy-on-write is used
        - copy modified pages from the primary view to the secondary view
        - unmap the primary view, remap the memory at the same address as read-write
        - release the secondary view
      - resume the world
      - sweep unreferenced objects and recover empty pages/pools (this could be deferred into allocations that reuse the memory)
    - else
      - collect all roots: scan stacks/registers/roots/ranges and record them to an array
      - if memory watches are used: reset write watches on the GC managed memory
      - [Update 2014/6/15] if copy-on-write is used:
        - create a secondary view of the GC managed memory
        - protect the primary view by copy-on-write
      - prepare scanning
      - trigger the background thread
      - resume the world
      - continue with what was freed by sweep/recover
- the background thread waits to be triggered, then
  - marks the collected roots
  - scans the heap
Shared data structures
The GC calls by the user are still synchronized with a global lock, so all user threads can be considered a single main thread. While the background thread is running a mark phase, the main thread still modifies pool memory and data structures. Consistent reading and writing of these structures from different threads has to be guaranteed without additional synchronization.
- the GC thread reads, but does not write, the pool table and the freebits and noscan bit sets.
- the main thread and the GC thread never read or write the mark/scan bits simultaneously.
- heap memory is read by the GC thread, but might be written to by the main thread. All modified pages are rescanned in the second heap scan with the world stopped. (Only objects marked alive by the background thread have to be scanned.)
Implementation details: [Update 2014/6/15]
- a snapshot of the pool table is taken at collection start to avoid accessing it while it is being reallocated. Only memory within the snapshot is scanned or collected.
- the freebits flags have to be maintained during a collection to account for new allocations when doing the second heap scan or when sweeping memory. They are now simply kept consistent with whether an entry is linked in any of the free-lists:
  - malloc: clears the free bit
  - free: sets the free bit
  - sweep: sets the free bit for unmarked objects
Interaction with GC calls on a user thread during collection
The user API to the GC has the following operations:
- malloc: clears freebits: OK
- realloc:
- extend:
- free: object added to free-list, sets freebits: OK
- setAttr/clrAttr/getAttr
- enable/disable
- reserve
- addrOf/sizeOf/query: only queries objects: OK
- addRoot/removeRoot: roots transferred synchronized: OK
- addRange/removeRange: roots in ranges transferred synchronized: OK
- fullCollect/minimize
Concurrent operations not marked "OK" still have to be analyzed.
Implementation of a concurrent GC
You can find an implementation of this concurrent GC here: https://github.com/rainers/druntime/tree/concurrent_gc2
It has some configuration switches enabled through the environment variable D_GC and some versions to be defined when compiling the GC.
A benchmark suite has recently been added to druntime; these are the results for Win64 (I have disabled tree2.d as it seems sensitive to false pointers, which can skew results unpredictably):
Test | Exec Time | Memory | Collections | Pause Time | Max Pause |
---|---|---|---|---|---|
bulk.d | 4.121 s | 105 MB | 19 | 974 ms | 166 ms |
resize.d | 1.861 s | 9 MB | 3 | 1 ms | 1 ms |
string.d | 0.075 s | 9 MB | 3 | 2 ms | 2 ms |
conalloc.d | 3.120 s | 4 MB | 6442 | 982 ms | 0 ms |
conappend.d | 0.426 s | 4 MB | 19 | 4 ms | 0 ms |
concpu.d | 3.171 s | 9 MB | 2134 | 924 ms | 0 ms |
dlist.d | 4.113 s | 16 MB | 102 | 1840 ms | 19 ms |
huge_single.d | 0.033 s | 1501 MB | 3 | 2 ms | 1 ms |
rand_large.d | 2.461 s | 81 MB | 5187 | 972 ms | 0 ms |
rand_small.d | 1.016 s | 16 MB | 1168 | 419 ms | 0 ms |
slist.d | 4.286 s | 16 MB | 102 | 2036 ms | 23 ms |
testgc3.d | 3.441 s | 196 MB | 17 | 1498 ms | 262 ms |
tree1.d | 1.886 s | 16 MB | 149 | 1199 ms | 14 ms |
words.d | 1.658 s | 379 MB | 7 | 89 ms | 57 ms |
Average/Max | 2.262 s | 168 MB | | 781 ms | 262 ms |
There is a gratuitous collection run at the very end of the application; let's take this out. Also, pools are now segregated into memory that might contain references and plain data:
Test | Exec Time | Memory | Collections | Pause Time | Max Pause |
---|---|---|---|---|---|
bulk.d | 4.573 s | 105 MB | 19 | 934 ms | 158 ms |
resize.d | 1.871 s | 9 MB | 3 | 1 ms | 0 ms |
string.d | 0.058 s | 18 MB | 4 | 2 ms | 1 ms |
conalloc.d | 2.651 s | 9 MB | 2469 | 424 ms | 0 ms |
conappend.d | 0.041 s | 9 MB | 8 | 1 ms | 0 ms |
concpu.d | 3.785 s | 16 MB | 2459 | 1227 ms | 0 ms |
dlist.d | 5.159 s | 38 MB | 54 | 2900 ms | 64 ms |
huge_single.d | 0.031 s | 1504 MB | 3 | 1 ms | 1 ms |
rand_large.d | 1.741 s | 102 MB | 3371 | 842 ms | 0 ms |
rand_small.d | 1.033 s | 16 MB | 1156 | 472 ms | 0 ms |
slist.d | 5.458 s | 18 MB | 102 | 3207 ms | 33 ms |
testgc3.d | 3.130 s | 185 MB | 16 | 1220 ms | 212 ms |
tree1.d | 1.496 s | 16 MB | 81 | 799 ms | 11 ms |
words.d | 1.659 s | 467 MB | 9 | 66 ms | 38 ms |
Average/Max | 2.334 s | 179 MB | | 864 ms | 212 ms |
This is a run with the GC disabled. It will still collect if it doesn't get more memory from the OS (I have disabled paging).
Test | Exec Time | Memory | Collections | Pause Time | Max Pause |
---|---|---|---|---|---|
bulk.d | 3.405 s | 741 MB | 0 | 0 ms | 0 ms |
resize.d | 2.283 s | 16 MB | 0 | 0 ms | 0 ms |
string.d | 0.060 s | 25 MB | 0 | 0 ms | 0 ms |
conalloc.d | 3.368 s | 4096 MB | 0 | 0 ms | 0 ms |
conappend.d | 0.044 s | 25 MB | 0 | 0 ms | 0 ms |
concpu.d | 2.784 s | 4096 MB | 0 | 0 ms | 0 ms |
dlist.d | 2.415 s | 512 MB | 0 | 0 ms | 0 ms |
huge_single.d | 0.031 s | 1504 MB | 1 | 1 ms | 1 ms |
rand_large.d | 0.505 s | 5664 MB | 11 | 73 ms | 6 ms |
rand_small.d | 4.102 s | 5376 MB | 0 | 0 ms | 0 ms |
slist.d | 2.423 s | 512 MB | 0 | 0 ms | 0 ms |
testgc3.d | 2.472 s | 704 MB | 0 | 0 ms | 0 ms |
tree1.d | 0.844 s | 416 MB | 0 | 0 ms | 0 ms |
words.d | 1.609 s | 470 MB | 0 | 0 ms | 0 ms |
Average/Max | 1.881 s | 1725 MB | | 5 ms | 6 ms |
This is the concurrent GC with write watches:
Test | Exec Time | Memory | Collections | Pause Time | Max Pause |
---|---|---|---|---|---|
bulk.d | 4.476 s | 293 MB | 15 | 941 ms | 159 ms |
resize.d | 2.295 s | 9 MB | 2 | 0 ms | 0 ms |
string.d | 0.584 s | 25 MB | 4 | 3 ms | 1 ms |
conalloc.d | 2.584 s | 576 MB | 25 | 15 ms | 14 ms |
conappend.d | 0.154 s | 25 MB | 4 | 1 ms | 0 ms |
concpu.d | 2.581 s | 576 MB | 25 | 49 ms | 14 ms |
dlist.d | 3.065 s | 225 MB | 11 | 522 ms | 133 ms |
huge_single.d | 0.133 s | 1504 MB | 1 | 1 ms | 0 ms |
rand_large.d | 0.529 s | 2144 MB | 74 | 95 ms | 1 ms |
rand_small.d | 1.963 s | 576 MB | 25 | 108 ms | 15 ms |
slist.d | 3.013 s | 169 MB | 10 | 603 ms | 132 ms |
testgc3.d | 2.898 s | 288 MB | 12 | 687 ms | 156 ms |
tree1.d | 1.019 s | 121 MB | 10 | 243 ms | 69 ms |
words.d | 1.770 s | 470 MB | 8 | 87 ms | 47 ms |
Average/Max | 1.933 s | 500 MB | | 239 ms | 159 ms |
This is the concurrent GC with copy-on-write:
Test | Exec Time | Memory | Collections | Pause Time | Max Pause |
---|---|---|---|---|---|
bulk.d | 4.812 s | 261 MB | 14 | 502 ms | 85 ms |
resize.d | 1.987 s | 9 MB | 2 | 2 ms | 1 ms |
string.d | 0.592 s | 25 MB | 4 | 6 ms | 1 ms |
conalloc.d | 2.508 s | 704 MB | 29 | 54 ms | 2 ms |
conappend.d | 0.145 s | 25 MB | 3 | 4 ms | 1 ms |
concpu.d | 3.048 s | 544 MB | 24 | 575 ms | 60 ms |
dlist.d | 2.743 s | 225 MB | 10 | 198 ms | 59 ms |
huge_single.d | 0.126 s | 1504 MB | 1 | 2 ms | 1 ms |
rand_large.d | 1.232 s | 2144 MB | 73 | 361 ms | 6 ms |
rand_small.d | 1.531 s | 576 MB | 25 | 196 ms | 13 ms |
slist.d | 2.734 s | 225 MB | 10 | 255 ms | 84 ms |
testgc3.d | 2.988 s | 320 MB | 12 | 451 ms | 124 ms |
tree1.d | 1.099 s | 121 MB | 10 | 320 ms | 89 ms |
words.d | 2.482 s | 470 MB | 8 | 48 ms | 26 ms |
Average/Max | 2.001 s | 510 MB | | 212 ms | 124 ms |
Most of the applications are faster with the concurrent GC than with the traditional implementation, mostly due to shorter pause times, i.e. the time the application threads are stopped by the GC. This comes at the cost of much higher memory usage for applications that allocate faster than the GC can find unused memory. Some of the performance gain is also the result of fewer collections being run.
Links:
- [1] http://llucax.com.ar/proj/dgc/index.html
- [2] http://msdn.microsoft.com/en-us/library/windows/desktop/aa366898%28v=vs.85%29.aspx
- [3] http://msdn.microsoft.com/en-us/library/windows/desktop/aa366537%28v=vs.85%29.aspx
- [4] http://msdn.microsoft.com/en-us/library/windows/desktop/aa366761%28v=vs.85%29.aspx
- [5] http://msdn.microsoft.com/en-us/library/windows/desktop/ms684946%28v=vs.85%29.aspx
- [6] https://plus.google.com/+LinusTorvalds/posts/YDKRFDwHwr6
- [7] https://github.com/rainers/druntime/blob/concurrent_gc2/benchmark/gcbench/cowbench.d