Concurrent Garbage Collection (on Windows)
One of the most annoying properties of the current garbage collector used by the D programming language is that it needs to "stop the world" while running a collection. An application that has allocated a few hundred MB of GC managed memory will experience pause times of several seconds, even on modern hardware.
Over the last couple of days I have experimented with some ideas on how to realize concurrent garbage collection (as implemented by Leandro Lucarella for Linux [1]) on Windows. The Linux version uses "fork" to create an exact snapshot of the running process, which can then be used to find objects that were unreferenced at the time of the snapshot while the process continues to mutate memory (relying on the copy-on-write page protection provided by the virtual memory management of the OS).
Fork on Windows?
Unfortunately, Windows does not have "fork". The closest match is using VirtualProtect [2] with page protection PAGE_WRITECOPY. This only works on a file mapping object created with CreateFileMapping [3], which also allows sharing the memory with another process using MapViewOfFile [4]. You don't have to back it with an actual file; you can use the paging file instead. As far as I can tell, no file I/O happens if you have enough memory.
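To make this concrete, here is a minimal D sketch of a paging-file-backed mapping that is later switched to copy-on-write with VirtualProtect; the size is arbitrary and error handling is reduced to asserts:

```d
import core.sys.windows.windows;

enum size_t mapSize = 16 * 4096;    // 16 pages, just for illustration

void main()
{
    // create a file mapping backed by the paging file (no actual file needed)
    HANDLE hMap = CreateFileMappingW(INVALID_HANDLE_VALUE, null, PAGE_READWRITE,
                                     0, cast(DWORD) mapSize, null);
    assert(hMap !is null);

    // map a read/write view; another process could map the same object to
    // share this memory
    void* view = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, mapSize);
    assert(view !is null);

    (cast(ubyte*) view)[0] = 1;     // ordinary write, goes to the shared pages

    // switch the view to copy-on-write: the next write to a page creates a
    // private copy for this process, the shared pages remain untouched
    DWORD old;
    BOOL ok = VirtualProtect(view, mapSize, PAGE_WRITECOPY, &old);
    assert(ok);

    (cast(ubyte*) view)[0] = 2;     // triggers COW for the first page

    UnmapViewOfFile(view);
    CloseHandle(hMap);
}
```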
So here is a plan: create a child process and allocate GC managed memory with CreateFileMapping so that it can be shared with the child process. When starting a collection, use VirtualProtect to set the memory to copy-on-write. This works as expected for the first collection, but afterwards some pages have been duplicated by COW in the application process. How do you get the state of these pages back into the shared memory view? There seems to be no efficient way to do this using the virtual memory manager.
One slow way out could be to map the shared memory again, then copy the modified pages into this view. But there might still be references into the modified pages, so you cannot release them. In the long run this ends up with all the memory having a non-shared copy in the other process.
What you really want is the reverse behaviour: make a copy of the pages before they are modified and use this copy during collection in the other process. When the collection is done, these copies can just be thrown away without affecting the application.
[Update 2014/6/15] Dmitry Olshansky proposed a way out of the dilemma: instead of using multiple processes that have to copy memory with the slow ReadProcessMemory/WriteProcessMemory operations, the collection can be done by a thread in the same process, which has fast access to the memory. When starting a collection, the GC managed memory is mapped to a secondary shared view within the same process, and the primary view is set to copy-on-write with VirtualProtect. When the collection is done, modified pages are copied back into the shared view. The COW view is then released and replaced by a mapping of the shared memory at the very same virtual address using MapViewOfFileEx, after which the collector's secondary view of the memory can be released. [/Update]
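A rough sketch of this view juggling, assuming a single pool-sized mapping; error handling is omitted, everything is meant to run with the world stopped, and for brevity the whole region is copied back rather than only the modified pages (the names createPool/startCollection/finishCollection are purely illustrative):

```d
import core.stdc.string : memcpy;
import core.sys.windows.windows;

enum size_t poolSize = 64 * 4096;   // one toy "pool"; the real GC has many

__gshared HANDLE hMap;      // paging-file backed mapping holding the pool memory
__gshared void*  gcView;    // primary view: used by the application for allocations
__gshared void*  scanView;  // secondary view: an unchanged snapshot for the collector

void createPool()
{
    hMap = CreateFileMappingW(INVALID_HANDLE_VALUE, null, PAGE_READWRITE,
                              0, cast(DWORD) poolSize, null);
    gcView = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, poolSize);
}

// called with the world stopped
void startCollection()
{
    // map a second view of the same memory for the collector thread ...
    scanView = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, poolSize);

    // ... and turn the primary view copy-on-write, so that mutator writes go
    // to private page copies while the collector sees an unchanged snapshot
    DWORD old;
    VirtualProtect(gcView, poolSize, PAGE_WRITECOPY, &old);
}

// called with the world stopped, after the mark phase has finished
void finishCollection()
{
    // copy pages modified during the collection back into the shared memory
    // (a real implementation would copy only the COW'd pages, e.g. found via
    // QueryWorkingSet; copying the whole region keeps the sketch short)
    memcpy(scanView, gcView, poolSize);

    // drop the COW'd primary view and remap the shared memory read-write at
    // the very same address, so pointers into the pool stay valid
    UnmapViewOfFile(gcView);
    gcView = MapViewOfFileEx(hMap, FILE_MAP_ALL_ACCESS, 0, 0, poolSize, gcView);

    // the collector's snapshot view is no longer needed
    UnmapViewOfFile(scanView);
    scanView = null;
}

void main()
{
    createPool();
    startCollection();
    (cast(ubyte*) gcView)[0] = 1;   // mutator write: lands in a private COW copy
    finishCollection();             // the write is now back in the shared memory
}
```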
Another issue is how to detect modified pages: there is a function GetWriteWatch that does exactly that in combination with ResetWriteWatch, but it only works on memory allocated with VirtualAlloc using the MEM_WRITE_WATCH flag. It cannot be applied later to shared memory created with CreateFileMapping. Using QueryWorkingSet to detect whether a page is still shared seems to work, though you'll have to walk all the pages yourself. [Update 2014/6/15] Another option is VirtualQuery, which returns regions of identical page protection, but it seems slightly slower than QueryWorkingSet. [/Update]
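For reference, a small D sketch of the write-watch mechanism; the MEM_WRITE_WATCH flag and the GetWriteWatch/ResetWriteWatch prototypes are declared locally (mirroring the documented Win32 declarations) in case the bindings in use do not provide them:

```d
import core.stdc.stdio : printf;
import core.sys.windows.windows;

// write-watch API, declared here in case the bindings do not provide it
enum DWORD MEM_WRITE_WATCH = 0x200000;
extern (Windows) UINT GetWriteWatch(DWORD flags, PVOID baseAddress, SIZE_T regionSize,
                                    PVOID* addresses, ULONG_PTR* count, ULONG* granularity);
extern (Windows) UINT ResetWriteWatch(LPVOID baseAddress, SIZE_T regionSize);

void main()
{
    enum size_t size = 256 * 4096;

    // write watching must be requested at allocation time; it cannot be
    // enabled later, and not at all for CreateFileMapping memory
    void* base = VirtualAlloc(null, size, MEM_RESERVE | MEM_COMMIT | MEM_WRITE_WATCH,
                              PAGE_READWRITE);

    (cast(ubyte*) base)[0]         = 1;  // dirty the first page
    (cast(ubyte*) base)[10 * 4096] = 1;  // dirty the eleventh page

    // ask the OS for the pages written to since allocation (or the last reset)
    PVOID[256] written;
    ULONG_PTR count = written.length;
    ULONG granularity;
    GetWriteWatch(0, base, size, written.ptr, &count, &granularity);

    // written[0 .. count] holds the base addresses of the dirtied pages; a
    // concurrent GC would rescan exactly these pages with the world stopped
    printf("%u pages modified, page granularity %u bytes\n",
           cast(uint) count, cast(uint) granularity);

    ResetWriteWatch(base, size);         // start a new watch interval
    VirtualFree(base, 0, MEM_RELEASE);
}
```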
COW on Windows?
So let's get some benchmarks first to estimate the cost of copy-on-write. Ignoring initial cache effects, here are the results for the first write access to a page (the code can be found here [7]):
Page allocated with | First write access |
---|---|
VirtualAlloc | 1800 cycles |
VirtualAlloc + MEM_WRITE_WATCH | 1800 cycles |
VirtualAlloc + ResetWriteWatch | 1100 cycles |
CreateFileMapping + MapViewOfFile | 2200 cycles |
CreateFileMapping + MapViewOfFile + COW | 4600 cycles |
[These numbers were measured on Win 8.1/64 with a mobile Intel processor (i7-2670QM @ 2.2 GHz). Getting reliable numbers on it doesn't work too well, even though I use a power scheme with minimum and maximum processor state set to 99% to keep the processor from changing core frequency, using Turbo Boost or disabling too many components. Still, averaged values vary quite a bit from run to run. The values shown are typical rounded minimum results over 1024 pages; the observed averages are usually 10-20% above the minimum.]
The values don't seem to vary much between 32-bit and 64-bit processes. Coincidentally, Linus Torvalds very recently measured the time for a page fault on Linux [6], arriving at a bit more than 1000 cycles.
Seeing quite a penalty for copy-on-write done by the OS, I considered doing the copy-on-write in a user exception handler for writes to a read-only page as well (ignoring the fact that this will ruin the debugging experience). This would allow customizing the behaviour as mentioned above. Writes to protected pages should be captured before any other exception handlers run, so hooking the user mode callback for exceptions seems the fastest feasible way.
Here are the benchmark results for different kinds of exception handlers (these don't actually do a full copy of the page but just change the address register to point to writable memory; according to the measurements above, copying the page takes more than 2000 cycles when done by the OS):
Handler | Win32 | Win64 |
---|---|---|
User mode callback for page fault exceptions | 28000 cycles | 7000 cycles |
Page fault handled by the first SEH exception handler | 30000 cycles | not measured |
Page fault handled by catch(Throwable) | 4000000 cycles | unavailable |
A standard D exception using try { throw } catch | 18500 cycles | 4200 cycles |
Please note that the dmd exception handler for Win64 uses a non-native format that avoids the OS exception mechanism. Therefore, capturing page faults does not work there.
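The measurements above hook the user-mode exception callback directly; as a simpler illustration of the same user-level copy-on-write idea, here is a sketch that uses a vectored exception handler to snapshot a page on its first write and then unprotect it. The constants and the AddVectoredExceptionHandler prototype are declared locally in case the bindings lack them; a real implementation would hook the callback instead to get ahead of all other handlers:

```d
import core.sys.windows.windows;

// declared locally in case the bindings do not provide them; they mirror
// the documented Win32 declarations
alias VEH = extern (Windows) LONG function(EXCEPTION_POINTERS*);
extern (Windows) PVOID AddVectoredExceptionHandler(ULONG firstHandler, VEH handler);
enum DWORD EXCEPTION_ACCESS_VIOLATION = 0xC0000005;
enum LONG EXCEPTION_CONTINUE_EXECUTION = -1;
enum LONG EXCEPTION_CONTINUE_SEARCH = 0;
enum size_t pageSize = 4096;

__gshared ubyte* watchedBase;   // page range protected at collection start
__gshared size_t watchedSize;
__gshared ubyte[] shadow;       // copies of pages taken before their first write

extern (Windows) LONG cowHandler(EXCEPTION_POINTERS* ep)
{
    auto rec = ep.ExceptionRecord;
    // only handle write accesses (ExceptionInformation[0] == 1) ...
    if (rec.ExceptionCode != EXCEPTION_ACCESS_VIOLATION || rec.ExceptionInformation[0] != 1)
        return EXCEPTION_CONTINUE_SEARCH;
    // ... that hit the watched range
    auto p = cast(ubyte*) rec.ExceptionInformation[1];
    if (p < watchedBase || p >= watchedBase + watchedSize)
        return EXCEPTION_CONTINUE_SEARCH;

    // snapshot the untouched page, then make it writable and retry the write
    size_t off = cast(size_t)(p - watchedBase) & ~(pageSize - 1);
    shadow[off .. off + pageSize] = watchedBase[off .. off + pageSize];
    DWORD old;
    VirtualProtect(watchedBase + off, pageSize, PAGE_READWRITE, &old);
    return EXCEPTION_CONTINUE_EXECUTION;
}

void main()
{
    watchedSize = 16 * pageSize;
    watchedBase = cast(ubyte*) VirtualAlloc(null, watchedSize,
                                            MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    shadow = new ubyte[watchedSize];

    AddVectoredExceptionHandler(1, &cowHandler);   // 1 = call this handler first

    // "collection start": write-protect the range so the first write per page faults
    DWORD old;
    VirtualProtect(watchedBase, watchedSize, PAGE_READONLY, &old);

    watchedBase[0] = 42;   // faults once; the handler copies the page and unprotects it
    watchedBase[1] = 43;   // no fault, the page is already writable again
}
```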
Concurrency in the GC
While the results suggest that dealing with the COW issues in the process itself is feasible for Win64, the overhead for 32-bit processes is considerable. Writing a Win32 kernel-mode driver that deals with these issues could reach performance similar to the OS, but having to install a driver with every application written in D is inappropriate.
A slightly different approach is inspired by the existence of GetWriteWatch and its quite good performance. When a collection is needed (e.g. because no more free memory is available):
- take a snapshot of the roots and reset the write watches
- trigger a background thread to mark the heap starting from the roots; normal program flow can continue, including allocating new memory
- once the background mark is finished, take a new snapshot of the roots and rescan modified pages; this assumes that only a few new references have been added, so most of the work has already been done
- sweep unreferenced objects
This results in objects staying alive that were referenced when either the first or the second snapshot was taken.
Current GC
In a nutshell, this is how the current D runtime GC works:
- memory is organized in a number of pools, each handling a contiguous memory area. Bit flags are used to describe the allocations inside this memory block:
  - finals : entries that need a finalizer run on them
  - noscan : entries that should not be scanned
  - appendable : entries that are appendable
  - nointerior : interior pointers should be ignored (only implemented for large object pools)
  - mark : entries already scanned, or that should not be scanned
  - scan : entries that still need to be scanned
  - freebits : entries that are on the free list
- there are free-lists for object sizes that are a power of 2. New allocations draw from these lists, create new lists from free pages inside a pool, or trigger a collection if there are no more free pages available
- when an allocation cannot be served, a full "stop-the-world" mark-and-sweep collection is run:
  - stop the world
  - prepare: scan = false, rebuild free from free-lists, mark = free
  - mark roots: for all entries referenced by stacks/registers/roots/ranges: mark-refs()
  - scan heap: if scan { scan = 0, mark-refs() }
  - sweep: if !mark { if finals { finalizer() }, free = true }
  - recover: clear free-lists, if free, add to free-list
  - resume the world
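To make the interplay of the bit sets concrete, here is a toy model of these phases over a single pool of fixed-size slots; it only illustrates the bookkeeping (free-list recovery is omitted) and is not the actual druntime code:

```d
import std.stdio;

enum slots = 8;

struct ToyPool
{
    // each slot holds up to two "references" (indices of other slots, -1 = none)
    int[2][slots] refs;
    bool[slots] mark, scan, freebits;
}

void markRef(ref ToyPool p, int idx)
{
    if (idx < 0 || p.mark[idx]) return;
    p.mark[idx] = true;    // alive
    p.scan[idx] = true;    // and its contents still need to be scanned
}

void collect(ref ToyPool p, int[] roots)
{
    // prepare: scan = false, mark = free (free slots are never swept or scanned)
    p.scan[] = false;
    p.mark[] = p.freebits[];

    // mark roots
    foreach (r; roots) markRef(p, r);

    // scan heap: process slots until no scan bit is left
    bool again = true;
    while (again)
    {
        again = false;
        foreach (i; 0 .. slots)
            if (p.scan[i])
            {
                p.scan[i] = false;
                foreach (r; p.refs[i]) markRef(p, r);
                again = true;
            }
    }

    // sweep: everything unmarked and not already free is reclaimed
    foreach (i; 0 .. slots)
        if (!p.mark[i] && !p.freebits[i])
        {
            p.freebits[i] = true;        // finalizers would run here
            writeln("slot ", i, " freed");
        }
}

void main()
{
    ToyPool p;
    foreach (ref r; p.refs) r = [-1, -1];
    p.refs[0] = [1, -1];       // 0 -> 1
    p.refs[1] = [2, -1];       // 1 -> 2
    p.refs[3] = [4, -1];       // 3 -> 4, but 3 is unreachable
    p.freebits[5 .. 8] = true; // slots 5..7 are on the free list
    collect(p, [0]);           // prints: slot 3 freed, slot 4 freed
}
```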
A concurrent GC in detail
With the building blocks of the current implementation, this is a more detailed description of a concurrent GC:
- create a background thread for the mark phase
- allocate GC managed memory with VirtualAlloc and MEM_WRITE_WATCH
- when a collection seems necessary:
  - if the background GC is still active
    - continue with allocating a new pool
  - else
    - stop the world
    - if the background GC is done with a previous mark
      - if memory watches are used
        - mark all the roots that are active now
        - mark references from pages that have been written to since the last reset
        - scan the heap for new references from new roots/modified pages (in the hope that there are only a few left; if the number of new references exceeds some threshold, this could be delegated to the background thread again)
      - [Update 2014/6/15] if copy-on-write is used
        - copy modified pages from the primary view to the secondary view
        - unmap the primary view, remap the memory at the same address as read-write
        - release the secondary view
      - resume the world
      - sweep unreferenced objects and recover empty pages/pools (this could be deferred into allocations that reuse the memory)
    - else
      - collect all roots: scan stacks/registers/roots/ranges and record them to an array
      - if memory watches are used: reset write watches on the GC managed memory
      - [Update 2014/6/15] if copy-on-write is used:
        - create a secondary view of the GC managed memory
        - protect the primary view by copy-on-write
      - prepare scanning
      - trigger the background thread
      - resume the world
      - continue with what was freed by sweep/recover
- the background thread waits to be triggered, then
  - marks the collected roots
  - scans the heap
Shared data structures
The GC calls by the user are still synchronized with a global lock, so all user threads can be considered a single main thread. While the background thread is running a mark phase, the main thread still modifies pool memory and data structures. Consistent reading and writing of these structures from different threads has to be guaranteed without additional synchronization.
- the GC thread reads, but does not write, the pool table and the freebits and noscan bit sets.
- the main thread and the GC thread never read or write the mark/scan bits simultaneously.
- heap memory is read by the GC thread, but might be written to by the main thread. All modified pages are rescanned in the second heap scan with the world stopped. (Only objects marked alive by the background thread have to be scanned.)
Implementation details: [Update 2014/6/15]
- a snapshot of the pool table is taken at collection start to avoid accessing it while it is being reallocated. Only memory within the snapshot is scanned or collected.
- the freebits flags have to be maintained during a collection to account for new allocations when doing the second heap scan or when sweeping memory. They are now simply kept consistent with whether an entry is linked in any of the free-lists:
  - malloc: clears the free bit
  - free: sets the free bit
  - sweep: sets the free bit for unmarked objects
Interaction with GC calls on a user thread during collection
The user API to the GC has the following operations:
- malloc: clears freebits: OK
- realloc:
- extend:
- free: object added to free-list, sets freebits: OK
- setAttr/clrAttr/getAttr
- enable/disable
- reserve
- addrOf/sizeOf/query: only queries objects: OK
- addRoot/removeRoot: roots transferred synchronized: OK
- addRange/removeRange: roots in ranges transferred synchronized: OK
- fullCollect/minimize
Concurrent operations not marked "OK" still have to be analyzed.
Implementation of a concurrent GC
You can find an implementation of this concurrent GC here: https://github.com/rainers/druntime/tree/concurrent_gc2
It has some configuration switches enabled through the environment variable D_GC and some versions to be defined when compiling the GC.
A benchmark suite has recently been added to druntime; these are the results for Win64 (I have disabled tree2.d as it seems sensitive to false pointers, which can skew results unpredictably):
Test | Exec Time | Memory | Collections | Pause Time | Max Pause |
---|---|---|---|---|---|
bulk.d | 4.121 s | 105 MB | 19 | 974 ms | 166 ms |
resize.d | 1.861 s | 9 MB | 3 | 1 ms | 1 ms |
string.d | 0.075 s | 9 MB | 3 | 2 ms | 2 ms |
conalloc.d | 3.120 s | 4 MB | 6442 | 982 ms | 0 ms |
conappend.d | 0.426 s | 4 MB | 19 | 4 ms | 0 ms |
concpu.d | 3.171 s | 9 MB | 2134 | 924 ms | 0 ms |
dlist.d | 4.113 s | 16 MB | 102 | 1840 ms | 19 ms |
huge_single.d | 0.033 s | 1501 MB | 3 | 2 ms | 1 ms |
rand_large.d | 2.461 s | 81 MB | 5187 | 972 ms | 0 ms |
rand_small.d | 1.016 s | 16 MB | 1168 | 419 ms | 0 ms |
slist.d | 4.286 s | 16 MB | 102 | 2036 ms | 23 ms |
testgc3.d | 3.441 s | 196 MB | 17 | 1498 ms | 262 ms |
tree1.d | 1.886 s | 16 MB | 149 | 1199 ms | 14 ms |
words.d | 1.658 s | 379 MB | 7 | 89 ms | 57 ms |
Average/Max | 2.262 s | 168 MB | | 781 ms | 262 ms |
There is a gratuitous collection run at the very end of the application; let's take this out. Also, pools are now segregated into memory that might contain references and plain data:
Test | Exec Time | Memory | Collections | Pause Time | Max Pause |
---|---|---|---|---|---|
bulk.d | 4.573 s | 105 MB | 19 | 934 ms | 158 ms |
resize.d | 1.871 s | 9 MB | 3 | 1 ms | 0 ms |
string.d | 0.058 s | 18 MB | 4 | 2 ms | 1 ms |
conalloc.d | 2.651 s | 9 MB | 2469 | 424 ms | 0 ms |
conappend.d | 0.041 s | 9 MB | 8 | 1 ms | 0 ms |
concpu.d | 3.785 s | 16 MB | 2459 | 1227 ms | 0 ms |
dlist.d | 5.159 s | 38 MB | 54 | 2900 ms | 64 ms |
huge_single.d | 0.031 s | 1504 MB | 3 | 1 ms | 1 ms |
rand_large.d | 1.741 s | 102 MB | 3371 | 842 ms | 0 ms |
rand_small.d | 1.033 s | 16 MB | 1156 | 472 ms | 0 ms |
slist.d | 5.458 s | 18 MB | 102 | 3207 ms | 33 ms |
testgc3.d | 3.130 s | 185 MB | 16 | 1220 ms | 212 ms |
tree1.d | 1.496 s | 16 MB | 81 | 799 ms | 11 ms |
words.d | 1.659 s | 467 MB | 9 | 66 ms | 38 ms |
Average/Max | 2.334 s | 179 MB | | 864 ms | 212 ms |
This is a run with the GC disabled. It will still collect if it doesn't get more memory from the OS (I have disabled paging).
Test | Exec Time | Memory | Collections | Pause Time | Max Pause |
---|---|---|---|---|---|
bulk.d | 3.405 s | 741 MB | 0 | 0 ms | 0 ms |
resize.d | 2.283 s | 16 MB | 0 | 0 ms | 0 ms |
string.d | 0.060 s | 25 MB | 0 | 0 ms | 0 ms |
conalloc.d | 3.368 s | 4096 MB | 0 | 0 ms | 0 ms |
conappend.d | 0.044 s | 25 MB | 0 | 0 ms | 0 ms |
concpu.d | 2.784 s | 4096 MB | 0 | 0 ms | 0 ms |
dlist.d | 2.415 s | 512 MB | 0 | 0 ms | 0 ms |
huge_single.d | 0.031 s | 1504 MB | 1 | 1 ms | 1 ms |
rand_large.d | 0.505 s | 5664 MB | 11 | 73 ms | 6 ms |
rand_small.d | 4.102 s | 5376 MB | 0 | 0 ms | 0 ms |
slist.d | 2.423 s | 512 MB | 0 | 0 ms | 0 ms |
testgc3.d | 2.472 s | 704 MB | 0 | 0 ms | 0 ms |
tree1.d | 0.844 s | 416 MB | 0 | 0 ms | 0 ms |
words.d | 1.609 s | 470 MB | 0 | 0 ms | 0 ms |
Average/Max | 1.881 s | 1725 MB | | 5 ms | 6 ms |
This is the concurrent GC with write watches:
Test | Exec Time | Memory | Collections | Pause Time | Max Pause |
---|---|---|---|---|---|
bulk.d | 4.476 s | 293 MB | 15 | 941 ms | 159 ms |
resize.d | 2.295 s | 9 MB | 2 | 0 ms | 0 ms |
string.d | 0.584 s | 25 MB | 4 | 3 ms | 1 ms |
conalloc.d | 2.584 s | 576 MB | 25 | 15 ms | 14 ms |
conappend.d | 0.154 s | 25 MB | 4 | 1 ms | 0 ms |
concpu.d | 2.581 s | 576 MB | 25 | 49 ms | 14 ms |
dlist.d | 3.065 s | 225 MB | 11 | 522 ms | 133 ms |
huge_single.d | 0.133 s | 1504 MB | 1 | 1 ms | 0 ms |
rand_large.d | 0.529 s | 2144 MB | 74 | 95 ms | 1 ms |
rand_small.d | 1.963 s | 576 MB | 25 | 108 ms | 15 ms |
slist.d | 3.013 s | 169 MB | 10 | 603 ms | 132 ms |
testgc3.d | 2.898 s | 288 MB | 12 | 687 ms | 156 ms |
tree1.d | 1.019 s | 121 MB | 10 | 243 ms | 69 ms |
words.d | 1.770 s | 470 MB | 8 | 87 ms | 47 ms |
Average/Max | 1.933 s | 500 MB | | 239 ms | 159 ms |
This is the concurrent GC with copy-on-write:
Test | Exec Time | Memory | Collections | Pause Time | Max Pause |
---|---|---|---|---|---|
bulk.d | 4.812 s | 261 MB | 14 | 502 ms | 85 ms |
resize.d | 1.987 s | 9 MB | 2 | 2 ms | 1 ms |
string.d | 0.592 s | 25 MB | 4 | 6 ms | 1 ms |
conalloc.d | 2.508 s | 704 MB | 29 | 54 ms | 2 ms |
conappend.d | 0.145 s | 25 MB | 3 | 4 ms | 1 ms |
concpu.d | 3.048 s | 544 MB | 24 | 575 ms | 60 ms |
dlist.d | 2.743 s | 225 MB | 10 | 198 ms | 59 ms |
huge_single.d | 0.126 s | 1504 MB | 1 | 2 ms | 1 ms |
rand_large.d | 1.232 s | 2144 MB | 73 | 361 ms | 6 ms |
rand_small.d | 1.531 s | 576 MB | 25 | 196 ms | 13 ms |
slist.d | 2.734 s | 225 MB | 10 | 255 ms | 84 ms |
testgc3.d | 2.988 s | 320 MB | 12 | 451 ms | 124 ms |
tree1.d | 1.099 s | 121 MB | 10 | 320 ms | 89 ms |
words.d | 2.482 s | 470 MB | 8 | 48 ms | 26 ms |
Average/Max | 2.001 s | 510 MB | | 212 ms | 124 ms |
Most of the applications are faster with the concurrent GC than with the traditional implementation, mostly due to shorter pause times, i.e. the time the application threads are stopped by the GC. This comes at the cost of much higher memory usage for applications that allocate faster than the GC can find unused memory. Some of the performance gain is also the result of fewer collections being run.
Links:
- [1] http://llucax.com.ar/proj/dgc/index.html
- [2] http://msdn.microsoft.com/en-us/library/windows/desktop/aa366898%28v=vs.85%29.aspx
- [3] http://msdn.microsoft.com/en-us/library/windows/desktop/aa366537%28v=vs.85%29.aspx
- [4] http://msdn.microsoft.com/en-us/library/windows/desktop/aa366761%28v=vs.85%29.aspx
- [5] http://msdn.microsoft.com/en-us/library/windows/desktop/ms684946%28v=vs.85%29.aspx
- [6] https://plus.google.com/+LinusTorvalds/posts/YDKRFDwHwr6
- [7] https://github.com/rainers/druntime/blob/concurrent_gc2/benchmark/gcbench/cowbench.d