Turbo Haskell
0.1
|
Operational Thesis:
I want a heap:
1.) which doesn't have to use header words per object
2.) which doesn't stop the world
3.) which doesn't have to pay for full semispace storage overhead to improve locality
4.) which doesn't pay for forwarding pointers for unique data, which let's me generalize Wadler's space leak fix to do much broader forms of opportunistic evaluation
5.) which lets me determine which case I'm evaluating on the other side of the prefetch overhead of reading the target of a pointer.
It'd be nice to get benefits in terms of the underlying evaluation model to avoid megamorphic jumps to unknown addresses, while I'm at it.
Some relevant papers:
*
? We'd potentially have to 'self-heal' twice, but any data constructor can live in multiple places. If that is (almost) all our non-indirection based garbage we could have a very very cheap garbage collector indeed.No one of these solutions in isolation is the answer, so let's throw things into a blender.
e.g. How do we eliminate indirections? We can shove indirections into special pages that use copy-collection, and use the target type to eliminate them once they become data constructors.
Working model:
Use a loaded value barrier as a read-barrier. Modify it to track uniqueness. Unlike Wise/Friedman, don't bother to recover uniqueness during mark for now, we can't stop the world. We can recover uniqueness locally by stepping through an indirection. Investigate later if we can recover uniqueness during GC.
Observation: If an indirection points to an 8 byte block of memory that we will replace with a blackhole with an ironclad guarantee that nobody else will enter the thunk while we evaluate it, then we can store a unique pointer in a non-unique indirection and retain that property. During evaluation of it we remain locally unique. In compressor terms, if we copy the indirections to to-space they become require-update pages.
Wadler's trick becomes two smaller tricks: a unique selector referencing a data constructor can forward itself. a non-unique selector referencing a data constructor is a chain of an indirection referencing a selector, referencing the data constructor. we can evaluate this with a variant of the blackholing process, either by inserting a blackhole as usual into the indirection, then copying the value from the selected field, or by directly compare-and-swapping the contents of the field in for the indirection.
We need to rendezvous between threads occasionally. We could a.) actually rendezvous which requires a full stop, killing worker productivity or b.) do a baton pass, everybody says they've confirmed the transition, then we switch modes when we know everyone has signed off.
One nice trick in the C4 barrier was the reservation of space 0 for non-heap pointers. Done here it means we can treat a 0-extended 32-bit integer as a valid 'heap' pointer, just by being careful when we trace. This makes it much easier to opportunistically evaluate additions and the like on 32-bit or smaller integers without wasted memory accesses.
I'm generally proposing that holding a reference from an older (or more global) generation to a newer generation pins the object in place. We could memory map a copy of that page in the oldest generation, copy out all the other garbage, and since this is the same page just mapped twice mutations that happen to one happen to both, but if our data was inert we could just copy the darn thing over without ceremony, so long as we had a way to track the forwarding of the page to fix up old references.
Working thoughts:
We seem to have a few different kinds of things when we break apart the cases. Each of them is suitable to a different scheme.
Most data is local. Each HEC can use whatever scheme it wants to manage local heaps. Concurrent? Stop-the-world? It makes sense to be able to vary the strategy, then we can have the user interface thread or the one running OpenGL, etc. to be locally concurrent. It'd be nice to be able to schedule a concurrent global-style GC for local pages even, by using hyperthreading to keep it on the same core, but its hard to schedule that accurately. Can we use thread_affinity to set this sort of game up / pair mutator threads up with collector threads in general for the global heap as well?