• If you are citizen of an European Union member nation, you may not use this service unless you are at least 16 years old.

  • You already know Dokkio is an AI-powered assistant to organize & manage your digital files & messages. Very soon, Dokkio will support Outlook as well as One Drive. Check it out today!


Block-based virtual machine

Page history last edited by Michael van der Gulik 9 years, 1 month ago

Design for a block-based virtual machine.



This is a design for a virtual machine that:

  • Scales to terabyte or larger images.
  • Has completely concurrent garbage collection (GC).
  • Stores most of the image on disk with a cache in memory. Garbage collection does not need to access the entire image.
  • Potentially could support transactions.
  • Potentially could support cheap snapshots and rolling back of an image to any particular snapshot.
  • Potentially could run on a cluster of machines.


This is achieved by: 

  • The virtual machine (VM) is divided into blocks. The size of the blocks can be decided by the implementation. Blocks are arrays of bytes or words.
  • Two types of pointers are used: NearRefs and FarRefs.
    • NearRefs are direct references to memory locations within a block. The nature of these is up to the implementor of the VM; 16 bits and 32 bits are obvious choices. 8-bits is possible.
    • FarRefs refer to objects in other blocks and are described below.
  • Two garbage collection algorithms are used:
    1. A simple single-threaded mark/sweep or compacting GC algorithm is used for intra-block GCs (intraGC).
    2. A backtracing GC algorithm is used for inter-block GCs (interGC).



The memory of the virtual machine consists of several components (not including implied components such as the Interpreter):


(include diagram here).



The blocks in the VM are arrays of bytes or words (depending on implementation) and store objects much like the Squeak VM currently does. Blocks are referred to by some mechanism that is up to the implementor.


Blocks have a number of "In-pointers" which are objects which must be able to be referenced from outside the block both before and after garbage collection. This could be implemented in a number of ways. These in-pointers form the root set for intra-block garbage collection. In-pointers contain a NearRef to the object they point to. Usually in-pointers would be in a fixed place in the block, such as at the end and growing downwards; they do not move. In-pointers each have a collection of backreferences to the out-pointers that reference them. Backreferences can be kept in another block to save address space.



A BlockManager is a service that manages blocks:

  • If a block is requested that is on disk or another host, the block manager will locate and read that block into memory to make it available.
  • If a block has not been accessed for some time and memory is becoming full, that block is eventually written to disk and removed from memory.
  • Blocks may be stored elsewhere; for example, they might be stored or accessed on a remote computer over a network, or on separate disks and partitions.
  • If supported, the BlockManager is responsible for performing snapshots of system state: it marks all blocks as "copy-on-write" so that the next time the block is modified, a new copy of it is made.



Processes are Smalltalk objects run by actual OS processes or threads. This VM is designed to be very concurrent.


Generationalish Garbage Collection

Blocks are roughly ordered by age. The details of this are up to the implementor. One possible implementation would be to have all blocks be in a linked list from newest to oldest by creation or last-write date. This ordering is shared by all Processes.


Eden space

Every process has a special block called the "Eden space". This is the only place that new objects can be made. When full of objects, the Eden space is intra-GCed; if still mostly full, it becomes a regular block and a new Eden-space block is made for that thread.



A NearRef is a direct pointer to another object in the same block. It could be an 8-, 16- or 32-bit pointer depending on the implementer's preference. I suspect that a 16-bit pointer would allow many more objects to fit into a CPU's caches meaning that overall performance would be faster, but some CPUs such as ARM processors have poor support for 16-bit pointers. A NearRef is a pointer to a memory location in the same block (indexed starting from the beginning of the block) to an object header.


If a "read-block" is supported (see below) then part of the address space of a NearRef could potentially point into that block instead.



A FarRef is a distant pointer that points to objects in another block.


FarRefs are implemented as standard objects inside blocks. When used, that object would contain a NearRef (16 or 32 bits) like it usually does, except rather than pointing to the object it expects to, it will point to a FarRef object in the same block. That FarRef object then contains the actual pointer to the entry in the object index table. This way, objects can assume that their contents are always the same size, and a FarRef can contain a large 64-bit pointer.


The object index table entry would contain the block ID and the object number (a NearRef) in that block of that distant object.


Alternative FarRef implementation: A FarRef object contains a NearRef to a BlockRef object, and an InPointer reference in that other block. I.e.


  block ----------------> BlockRef
  inPointerIndex            64-bit block ID.


Object Index Table

The virtual machine would have a large Object Index Table. Each entry in this table refers to a particular object in a particular block. This table is used by every FarRef to locate objects in blocks other than their own.


Because of it's possible large size, this object index table might have its data stored in special blocks so that the index is stored mostly on disk.


Each entry would have the following properties:

  • An object index table ID, which is the same number stored in FarRefs to refer to objects referred to by this table. If implemented as an array, this ID could be implied by the position of this entry in this table.
  • The block number of the object this entry refers to.
  • The NearRef in that block that this entry refers to (maybe?).
  • A list of "backreferences" which enable a backreferencing garbage collector to be used. This allows cheap navigation from any object to every other object that refers to it. Each entry in this backreference list would be a object index table ID. Alternatively, each entry could be a block ID and every access would require searching the block for that reference.
  • Optionally, information about the last read or write time of this object may be kept for optimising the locations of objects in this VM. One implementation is a simple timestamp. Another implementation is to store an updating count, mean and standard deviation of the last read (and write?) operation on this object.


There needs to be a method of cheaply finding all table entries for a particular block. This is used as a root set of objects for intra-GC. One way of achieving this is to build a separate data structure that forms an "index" (in the relational database sense) over the block number column.


Alternative implementation: FarRefs could refer to a BlockRef object that contains the block ID.



Multiple OS processes/threads could run interpreters at the same time. Each would have their own block as their "eden space" for creating new objects.


Multiple interpreters can perform most of their operations without problems:

  • Reading an object in any block works fine without protection provided that block isn't undergoing GC.
  • Writing to an object's instance variable can be done without protection if the actual write operation is atomic and that block isn't undergoing GC.
  • Making a new object is always done in a private block which is an Eden space, so no protection is needed. 


The object index table would need to be protected for concurrent writes.


Intra-block garbage collection only affects a single block, provided that no other processes are trying to read or write objects in that block. This means that any number of intra-block garbage collections can be done in parallel.


Inter-block garbage collection is also safely parallel, as described below.


Reading and writing objects in another process's eden space could be safe, although I need to think this through. New objects are only created by the process that owns that Eden space, and making a new object involves finding space for that object, writing the object's contents (writing a valid object header and "nil" as contents for each instvar), updating the "next free space" pointer in the block and then writing a nearref to that object where it is required. This is thread-safe; all written values are owned by the process that owns that eden block. Every thing else in that Eden block can be read or written by other processes safely as long as GC isn't running. References to objects in that block will require the use of an "In-pointer", which again will need to be added in a thread-safe manner.


Intra-block garbage collection (Intra-GC)

Intra-block garbage collection is done using any standard available GC algorithm. It could for example be a compacting mark-sweep collector or a copying collector. If Squeak is used as a base for a block-based VM, the garbage collection algorithm could be retained as the intra-GC algorithm, although it is more complex than required.


The root set of an Intra-GC are the in-pointers.


If the Intra-GC removes an out-pointer as garbage, it must also remove a backreference at the associated in-pointer at the far end. This operation might cascade, causing more garbage collection elsewhere. An Intra-GC algorithm must process out-pointers that are collected; this could mean that a finalisation sweep over all reclaimed objects in the block needs to be done after GC.


At some stages during execution, various blocks might be marked for Intra-GC when there is suspected garbage. One or more background collector processes could keep a list of blocks that need to be eventually collected, and eventually GC those blocks.


Refererence-counting GC could be used. Assuming a 256-entry block, a uint8[256] could be used to store reference counts (ditto, a uint8[65536] for a 16-bit block; if ref counts hit 255 then they can be considered maxed out and will never decrement - they will wait for a compacting GC to be collected). If the block is full, run a compacting GC on the block and potentially promote it. Compacting GC can also do several optimisations; it can relocate all FarRefs to a well-known place in the block so that they can quickly be iterated over (when doing backreferencing GC). Any promoted block has FarRefs in that place as a quickly iterable list.


Interpreters running in a block might need to also have their pointers updated if the objects they point to are moved around.


BackReferencing Garbage Collection (Inter-block GC)


This garbage collection algorithm is very similar to reference counting GC. Instead of keeping a count of the number of references to a particular object, a list of objects referring to this object is kept. This algorithm is often used in distributed systems but not standard virtual machines because of the overhead. My hope is that this overhead is able to be managed by having the vast majority of pointers be simple NearRefs and the BackReferencing garbage collection only applies to FarRefs.


The main advantage of this algorithm is that disk accesses (or remote network access) can be minimised by only collecting objects or blocks known to be garbage. Other algorithms such as Mark/Sweep often require the entire on-disk heap to be traversed.


Popular objects (e.g. classes) would have very long backreference lists. They could be promoted to immortal objects and not have backreference lists, or we could just put up with very long lists and hide them away on disk.


Possible implementation notes:

  • Back-references could be kept in a separate large BTree structure. 
  • Back-references could be stored as objects in their blocks.
  • BackReferencing GC could apply to blocks rather than objects.
  • References to other blocks can be found in BlockRef objects (see above). Some way of finding these in a block is needed, e.g. link them together.
  • BackReference lists would then be a list of blocks refering to this block. These too could be linked together in a list or tree for traversing. (although farrefs would be shared by nearrefs).
  • back-references could be kept in an adjacent block, or in an adjacent B-Tree if there are a lot of them.


Possible optimisation: most of the action happens in eden blocks. It might be worthwhile delaying writing to a backreference list until we know that the reference will be long lived. For example, we could add a "BackReferenceRequest" object to eden space that gets processed when the block is no longer an eden space block.


The algorithm is simply that, by having the rest of the VM maintain the list of references (a.k.a. backreferences) to a particular object, it can be determined that an object is garbage when its list of backreferences becomes empty. That object index table entry can then be removed and recycled. Empty entries can form a linked list to each other so that finding an unused object index table entry is a cheap operation.


If the very last in-pointer is removed from a block, the entire block can be discarded (after the out-pointers have been finalized).


One problem is that objects could have cyclic references to each other, which is exactly the same problem that occurs in reference counting GC implementations.


One cycle-detecting algorithm is to have one of the backreferences be the "primary" backreference. Primary backreferences form an acyclic tree up to the root set such that primary backreferences can be traversed in a direct path to an object/block in the root set. When the primary backreference is removed in a GC operation, other backreferences must be searched to reconstitute the acyclic tree. This happens recursively until we find another backreference in the tree. If we find ourselves (object or block) while traversing the tree to the root, there is a loop; the penultimate object/block pointing to us must attempt to also reconstitute its acyclic tree. Use the tortoise/hare algorithm for loop detection. If this fails, we back down the traverse path until some object/block can reconstitute its tree. Objects on the traversal path that cannot do this are thus garbage. Any object found to be garbage, including ourselves, must inform all objects/blocks that it points to (with standard references) that it can no longer be a backreference for them, triggering the same operation recursively.


When an in-pointer is removed, then the block of that in-pointer must also be marked for an eventual intra-GC. This might cascade; nearrefs might be collected, leading to more FarRefs which are collected, which leads to more in-pointers being collected.


The search for another primary backreference to remake the acyclic tree could be running concurrently with other searches.


As a backreference list could become potentially very large when a large number of objects reference a particular object, a suitable data structure needs to be chosen for this list. One optimisation would be for when many objects in one particular block start refer to the same object index table entry; the backreferences to all of those objects could be replaced with one backreference entry that represents the whole block and no particular object in that block; to find the backreferences, that entire block would need to be searched. This might be too slow.


One option is to only store the block number in the backreference. A backreference points to an out-pointer, which might move around the block when GC happens. Several out-pointers in a block which point to the same in-pointer can share the same backreference (although they should be consolidated into one reference, e.g. during intra-GC). If only the block number is stored, then traversal of backreferences to find a primary backreference (and thus prove that the given object is not part of a disconnected loop) must be done in a difficult manner: traveral within a block can only be done from in-pointer to out-pointer. To find the in-pointers that connect to a given out-pointer requires searching from the in-pointer to find the out-pointer; this can be done by marking each node during the search as "being searched" or as "does not connect" during backtracking.


Another idea would be to assume that an object with a massive number of backreferences will never be garbage, and so the entire backreference list could be discarded and that object/block marked as immortal. If the VM stores blocks on disk and that object really is garbage, then it will be stored on disk where storage is large and cheap meaning that the object is unlikely to cause problems again. A massive month-long mark/sweep collection over the entire address space could potentially collect these objects.


The backreference list could be used to find all instances of a class. A class's backreference list would contain every instance of that class.


If a back-reference list is part of the root tree, then it has a primary backreference. If it does not have a primary backreference, then an inter-GC process is currently searching for one. This means that either another thread is running an inter-GC on this backreference list, or we are (in a loop). In either case, we should back up one step and try the next entry in the backreference list to see if it can be part of the root tree.


Storing backreferences

To re-cap:

  • A near-ref is just a local pointer within a block.
  • A far-ref is a special object in a block containing either a (block ID, in-pointer index) tuple, or a reference to an object table.
  • Each block has In-pointers in a well-known location (e.g.start of block). In-pointers stay put; an in-pointer index will point to an in-pointer in a block. In-pointers contain a near-ref to the object they point to as well as a backreference list.


Backreference lists are owned by in-pointers. Backreference lists can be huge but will commonly only have one entry. You don't want to store them in the same block as their in-pointers because of their size; they can be far bigger than one block. Backreference lists shouldn't contain duplicates (is this possible?). A backreference can be either a (block ID, far-ref index) tuple (if far-refs can be indexed) or it can just have the block ID which gets searched for the relevant far-ref. In each backreference list, there is a primary backreference which forms the root tree.


You can store backreferences in an adjacent block. The in-pointers can refer to entries in this special block. The back-references can be linked lists using the free space in that block. Unallocated space in that block can form an unallocated space list. These special blocks should not be garbage collected by intra-gc because we know when backreferences are garbage. If the block becomes full, backreference lists can contain a special entry of (block ID, object index) pointing to another special backreference block. Backreference blocks aren't intra-GCed, meaning that backreference lists stay put and we don't need in-pointers.


Thus, an in-pointer can be a (pointer, backreference pointer) tuple. The backreference pointer is assumed to point into the backreference block, where the backreference list object is a (pointer, block ID, primary backreference) tuple. The pointer points to the end of the list (which is where new elements are added, and where the list can be iterated over to the start); the block ID is the block where the pointer points to (because if the list grows bigger than the block, then the end of the list will be in another block) and the primary backreference is a near-ref to the acyclic root tree. Each entry in the list is a block ID which refers to a far-ref in that block... somewhere, somehow.


Reference-Counting GC

At this stage, I'm wondering if reference counting could be used instead. Each object table index entry has a reference counting counter added to it. A small (8? 16?) array of block back-references could be used to try to detect cycles:  after a reference count has been decremented, it can be scheduled for cleaning in VM idle time where the block back-reference list is scanned for any cyclic references.


Alternatively, the cyclic references could be moved around by an object optimiser until they are in the same block - and then they get GCed by the intraGC. This won't happen if the cycle won't all fit into one block.


Alternatively, they could just be left alone. Eventually, they'll sink down the generations until they get stored on disk and forgotten about. Then a regular (weekly? monthly?) mark-sweep GC of the entire image can collect these cycles. This would be like checking your filesystem for corruption.


If the block back-reference list becomes full or the reference count gets very large, the object can be marked as "permanent" and simply won't ever be GCed. Example permanent objects are true, false, nil and often used classes; these objects are referenced by so many other objects that they will always exist in the image.


Implementation Details

Every time an FarRef is made, we also assume that the VM adds this new value to the backreferences of the refererred object. Every time we create possible garbage in a non-Eden block, we mark that block for GC.


If an FarRef is ever modified to refer to another object in the same block, the reference should be made a NearRef.



  • If code is traversing objects in another block, significant overhead will happen with creating/deleting FarRefs.
    • That other block could be made into the Eden block - but then, objects in the current Eden might need FarRefs to them.
    • That other block could be merged in to Eden if there is space.
  • If an object is written to, it is usually an object reference from somewhere. If those references are in another block, again, merging blocks might be an optimisation.
  • Splitting blocks optimally requires finding some chokepoint in an object graph (i.e. group into two set of objects with minimal references between the sets).
    • Splitting blocks might make them even bigger with extra FarRefs. FarRefs could be the same size as the objects they refer to.
  • Some inconvenient objects are classes and methods. An object's class might be in a different block.
  • SmallIntegers can just be copied, as can any immutable object. 


Intra-block Garbage Collection

This would happen when Eden space is full and on a block when the removable of all backreferences to an object in that block cause it to be removed. The latter might cause a cascading GC between lots of blocks.

  1. If supporting snapshotting, copy the block for writing:
    1. lock the index entries for all InPointers for writing.
    2. copy the block.
    3. update all index entries for InPointers to point to the new block (note: keep the old index entries for rollbacks?)
    4. unlock the index entries.
  2. If not supporting snapshotting, just lock the block for writing.
  3. Do a simple mark-sweep GC on the NearRefs in the block. Use the entries for this block in the object index table as the root set.
    1. If any FarRefs are GCed because they are not referenced from inside the block, remove their entries from the backreference lists.


Possible implementation detail:

  • InPointers could be a resizable array at the start of the block.
  • FarRefs could be linked together in a linked list or tree so they can be traversed.


Reading an object

If the object is in Eden, just do it.


A NearRef or a FarRef can be traversed to find the remote object.


If the object is in a remote block and already has an object index entry to point to it, then the object can simply be read:

  1. The object would have a NearRef which pointed to a FarRef in the same block (eden space).
  2. The FarRef would contain an index into the object index table.
  3. The object index table would contain a block number and object number in that block.
  4. Read that block into memory (using a block manager service or something) if it is not already there.
  5. Refer to the object in that block at the particular object number. If we are using the address space trick below, make that block the new "read block", removing an old one if necessary.


If the object is the result of a message send to another remote object, and was referenced by that other object using a NearRef, then either:

  1. (label:readFarRef) A new object index table entry needs to be made to that object so that it can be remotely accessed, or
  2. (label:joinAddressSpace) Some magic could be done with the address space within the Eden space to also temporarily include the read block into the same address space.
  3. The current context and locally available variables could be moved into the remote object's block to continue execution, so that most references are NearRefs.


The problem with option (readFarRef) above is that iterating over a collection in another block would incur significant overhead, as an FarRefs needs to be made for every object accessed.


Read-blocks in the same address space as Eden

Option 2, (joinAddressSpace) above involves reserving some of the address space in each block for accessing another block directly.

  • Eden space, could for example, be addressed in the upper half of the NearRef address space.
  • The current "read-block" could be address in the lower half of the NearRef address space. Another way of thinking about this is that the highest bit of the address space determines whether the address refers to an object in Eden or an object in the "read-block" space.


To join address spaces, first we separate any current block occupying the read space, and then make the new block the new read space.


To separate address spaces, there are two options that I can see:

  1. Keep track of any pointers in Eden that refer to the read-block, and convert them into FarRefs when separating the address spaces. It may or may not be beneficial to do a garbage collection first depending on how often blocks become the "read-block".
  2. Do a intra-GC in Eden; while doing this intra-GC, convert any surviving references to the read-block into FarRefs.


Perhaps another portion of the address space could be reserved for CompiledMethods and classes as well? This depends on whether looking up a CompiledMethod incurs significant overhead if that CompiledMethod is referred to using an FarRefs. This may be the case, as every MethodContext and BlockContext would refer to a CompiledMethod meaning that backreferences need to be rapidly made and destroyed for every new context.


Writing to an object instvar

If the object is in Eden, just do it.

  1. If the referred value is a FarRef obtained from a method invocation on an existing FarRef, then make a FarRef and update backreferences.


(How to move an object between blocks:)

If the object is in a remote block, one idea is to move that object to Eden for writing:

(TODO: check this algorithm)

  1. If the block is on disk, read it into memory.
  2. Copy that object to Eden.
    1. Make new FarRefs for every instvar in that object. All instance variables will need to be FarRefs because none of them will refer to anything in Eden.
    2. For FarRefs referring to the original block, create new object index table entries for references which were not already FarRefs . Add backreferences
    3. For FarRefs referring to other blocks, use the original FarRefs to locate the InPointers.
    4. All instance variables in that object will refer to new FarRefs (TODO: what if the remote block gets full?).
  3. Create an object index table entry for that object in Eden (TODO: lock anything?).
  4. In the original block, create an FarRefs for the old object at the location where the original entry was (effectively, do a becomeForward: to an FarRefs ). This should point to the index entry, and a backreference should be added.
  5. Mark that block for GC.
  6. Modify the object index to point to the new InPointer in Eden.
  7. Unlock the index entry.
  8. Edit the object.


It might be simpler just to write directly to the block the object is in:

  1. If the block is on disk, read it into memory.
  2. Lock the block for writing.
  3. Write to the object.
    1. If the referred object is in the same block (via an FarRefs and InPointer) then refer to the block directly and remove the backreference.
    2. If the new pointer is a FarRef and the old pointer is a NearRef, make a new FarRef in that block.
      1. If the block becomes full, move the object to Eden and put an FarRefs in its place and update backreferences. Then, it would be possible to use a direct reference within Eden instead.
    3. If the new pointer is a FarRef or NearRef and the old pointer is a FarRef, then just modify it.
    4. Mark that block for garbage collection because the old value might have left garbage.
  4. Unlock the block for writing.


(or, you know, merge the blocks)


Creating a new object

New objects are always created in Eden, which can be considered permanently locked for writing.


If Eden becomes full, do an intra-GC. If it is still full, then it becomes a new generation and a new empty Eden space is made. The new generation might need extra space in it for possible FarRefs to be made. TODO: but it shouldn't become completely empty!


Perhaps if the address space is cut into at least two areas (see "Reading an object" above): Eden and a read block, then Eden could be compactly mark-sweeped into the read block.


Sending a message 

(TODO: review this)

Usually in Smalltalk, a message send requires looking up the object's class, search the method dictionary, and then searching method dictionaries of all superclasses. To optimise this slow process, a message dispatch table is used. This table needs to be stored somewhere.


One idea is to manage classes differently: have another type of pointer like a FarRef called perhaps a "ClassRef" which points to a particular class.


The CompiledMethods that are to be executed also need to be available locally.


The result of a message send on a FarRef might mean that another FarRef needs to be created. Alternatively, some clever scheme could be used to make the remote block part of the local address space for a while.


One possibility is for each thread to have an auxilliary block that contains read-only replicas of CompiledMethods. This auxilliary block can also be used by the garbage collector for storing state. Potentially this block could also hold the current stack.


Another possibility is to have a special object which contains a CompiledMethod's code, but is read-only and transient, and it is this object which gets copied into a thread's current block. The garbage collector will simply remove this object when GC occurs as it is only a copy of the original CompiledMethod. This special object would contain a single link to the original CompiledMethod where the literals can be found. This copying may have significant overhead.


Another possibility is to have the interpreter be able to simply execute code which is in another block. This would make a hardware implementation need to understand OutReferences and InReferences. In this occasion, a hardware implementation with cores having their own static memory would mean that the block containing the method would need to be copied to the local memory - for multiple methods in different blocks this can be significant overhead. The overhead can be minimised by putting serially executed methods in the same block, and perhaps including a long-lived message dispatch table with that block as well.


Making a new Object Index table entry

A new entry can simply be made with the relevant backreference.


Removing an Object Index table entry

The index entry should also be removed. The block mAn index entry and backreference list for that new InPointer also needs to be made.ust be marked for intra-GC.


Making a new FarRef

Before creation, it should be checked whether the referred object is in the same block. If so, just make a direct reference instead.

Also, check to see if there is already an FarRefs for that object in this block. This can be done by searching the backreferences of the referred object.

A new backreference must be added to the referred InPointer for this FarRef.


Modifying an FarRef

This is functionally equivalent to removing the old FarRef and adding a new FarRef, with all NearRefs to the old FarRef be redirected to the new FarRef.

  1. Remove the backreference for this FarRef.
  2. Change the value of the FarRef to another value. If the new value is in the same block, replace the reference with a NearRef.
  3. Add a new backreference for this FarRef .


Removing an FarRef

FarRefs can be GCed by the intra-GC. When this happens, the backreference for it must also be removed.


Removing a backreference

If the last backreference is removed, then the InPointer that it refers to can also be removed and the relevant block marked for GC.


Block is below half-full

If a block becomes empty, it can be recycled or removed. This assumes blocks are a fixed size.


If a block becomes less than half-full, then it would be worth merging that block with another. Perhaps a list of half-empty blocks should be kept so that a block of similar age can quickly be found. The block with which it merges should be related; perhaps FarRefs or object index table entries should be searched for another half-full block? Perhaps blocks should only be merged if two subsequently requested blocks (by a single thread) are found to be below half-full? Block merging can be done in a separate thread.


The block possibly should not simply be emptied into Eden space because that would cause object churn: generations below Eden space will often empty out quickly, causing older objects to continually be churned between Eden and the generations just below it. Whether this is actually an issue needs to be investigated.


Any operation involving moving objects should tend towards linked objects becoming closer to each other.


Another idea is to move individual objects around as they are used. When an object is accessed and its block isn't within "ideal" sizing requirements (i.e. too full or empty), objects are moved between blocks as they are accessed to optimise their layout.


Block is full 

If 16-bit pointers are used, a block might run out of space. If 32-bit pointers are used and blocks are a flexible size then this isn't an issue.


A block that is nearly full runs the risk of not having enough space for InPointers and FarRefs that need to be created in that block. InPointers might be stored in the object index instead. If a block is written to, then it's likely that the object pointer that is written will refer to an object outside that block, meaning that a FarRef must be made.


Firstly, of course, if a GC is pending on that block then a GC should be run.


The best approach may be to cut the block in two. This requires an efficient graph-theory algorithm to find a division between the objects with the least number of connections between the two sets (or at least a good minimal division). One approach is to do profiling to determine which instance variables of a class are used most (storing the information with the class?), and then creating a (depth-first?) tree following the most common links first from that stopping at half or one third of the way.


Another approach is to give each object a counter (in a new block?), then starting with any object, increment all objects it points to by one, and then follow the link to the object with the highest value until half the objects have been reached.


Another approach is just to move some individual objects out of that block, assuming that this operation won't make the block bigger rather than smaller.




A snapshot can be taken of the running image by making a copy of the block table and object index table, and making all blocks copy-on-write.



Writing a block to disk


Every block has an age associated with it. One way of doing this is to have blocks in order (somehow) and the oldest blocks (that have not been read or written for a long time) are written to disk. This is only necessary when memory is full.


Occasionally (every 60 seconds?), a snapshot is made where all blocks in memory, including the most recently created ones, are written to disk. This is done by marking those blocks as copy-on-write and synchronising the originals to disk in a background thread. Only dirty blocks need to be written; a block is marked as dirty when it is locked for writing.


Because the block table can have 2^64 entries in it, it too will need to be predominantly stored on the disk. The block table will be stored last, so that recovery is possible if the write to disk fails part way through. Blocks on the disk are not overwritten; the disk image is only appended to with newer versions of blocks.


One way of preventing ever increasing disk usage would be to keep only several snapshots, e.g. 1 minute ago, 1 hour, 1 day, 1 week, 1 month. These would roll over - the minutely one gets overwritten every minute. Once an hour, the hourly one is replaced by the minutely one. Every day, the daily one is replaced by the hourly one, etc. The rolling over happens from oldest to youngest so that the latest minutely snapshot doesn't overwrite everything! Snapshots refer to objects in each other, so they must all be present. Replacing one snapshot by another may just be a case of updating some pointer somewhere.


One way of thinking about snapshots is that they are like generational garbage collection. The oldest objects are in the oldest snapshot.


Writing to an object in an old snapshot would involve reading the relevant block from that snapshot into memory, writing to it, and when the snapshotting occurs it now becomes part of the latest snapshot.


Moving objects between blocks 

(also, see above in "Writing to an object instvar")


For optimization, it is sometimes preferable to move objects between blocks to minimize the number of far references. This is done as follows:


  1. A GC is performed on the old block to remove unnecessary references to this object.
  2. Both blocks are locked for writing by this thread.
  3. The object is copied and a new InPointer is made for it in the new block. If the object referred to objects in the old block, FarRefs are made. The local block is checked for to see if any of these FarRefs already exist and can be re-used (maybe?).
  4. If any objects in the old block refer to this object, a new FarRefs is made in the old block referring to this object's new location. Those objects are updated to point to that FarRef.
  5. The back references for the old InPointer are traversed and updated. If any of these back references are actually in the new block, then substitute them for a local reference.


FarRefs in a block can be used by multiple objects in that block. To find if an FarRef to an object already exists in a block, either the backref list of the target can be searched, or all FarRefs in a block can be searched.


Writing to an object, old ideas:

An optimisation is to make a thread hold a write lock on a block, releasing it only when asked to. When another thread gains the write lock on a thread's current block, that block's thread is blocked until it regains its lock. The owner thread cannot make any new objects meaning that it cannot create any MethodContexts until it regains its lock. This means that a thread might relinquish its lock on its current block at every context switch. An optimisation would be to make the owner of a block make a new block when another thread wants to access its current block. This makes it effectively impossible for one thread to access another thread's current MethodContext, meaning the debugger would have a difficult job. 


Alternatively, blocks can be copy-on-write. A thread will do all its work on a copy of the block. When the thread has finished, the block is "committed" back to be part of the main memory. TODO: think about how this affects concurrent synchronisation.


The copy-on-write method may make it possible to have a transacted VM with rollback possibilities. If blocks are copied on write, then a snapshot could be made by keeping the old versions of the blocks and keeping an old copy of the block table.


Say that to write to a block, a thread obtains a copy of that block for writing to, and when finished the block is inserted back into the image. This means:

  • Any other thread wanting to write to that block will need to wait for the existing block to finish. This means that the block needs to record who has it locked anyway. This only happens when writing to existing objects; new objects will be created in thread's private blocks.
  • I suspect that every time a Semaphore is signalled, the block will need to be written back before the involved thread is resumed to ensure that concurrent synchronisation still works. I'm not sure if there are other cases, such as waiting on a semaphore.
  • Other threads can still read from the old copy and so no blocking is needed. They only block on writing.
  • This approach may or may not have significant memory copying overheads. An initial write to a block means it needs to be copied.


Distributed computing

The block-based VM could potentially run across a cluster. Blocks then could be accessed in any of these manners:

  • If a block is only being read, it can be copied from a remote host.
  • If a block has lots of writes but not many reads, write and read requests can be sent to an authoritave remote host. The block should migrate to the host doing the most writes.
  • If a block has lots of writes and lots of reads, it could be replicated with writes being broadcast.


More considerations:

Some objects could be special, such as SharedQueues or Semaphores with behaviour tuned for distributed computing.

Transactional semantics might be needed.




Lessons learned from working on large scale server software (originally http://ferd.ca/lessons-learned-while-working-on-large-scale-server-software.html):


  • Plan for the worst possible failure first. Make sure you can recover from it. Every other failure recovery is an optimisation over this.
  • http://en.wikipedia.org/wiki/CAP_theorem
  • https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
  • Handle overload conditions well.
  • Make bugs transparent. If something can corrupt, crash the node rather than continue in an unknown state. Make everything transparent (e.g network traffic).
  • Consider how to upgrade a cluster with as little downtime as possible. Maybe start a new cluster incrementally, transferring (and translating) state over from an old version?
  • Consider how to resurrect a dead or corrupted cluster. 
  • Use a real network. loopback will protect you from lots of errors - latency, packet ordering/loss, corruption, kernel bugs, hardware bugs.


Random Obsolete Ideas




A pointer within a block is a direct 16-bit pointer. These are fast and the intention is that these local pointers help absorb the overhead generated by the rest of this proposal.


A pointer from a block to an object is another block has three levels of indirection. Firstly, a 16-bit pointer points to a larger "FarRef" object (e.g. 64 bits?) in the same block. That larger pointer points to an entry in a global object mapping table. That table entry then has the location of the external block and the "InPointer" index of the object in that block.


Blocks contain objects, InPointers and FarRef. FarRef point (indirectly) to InPointers in other blocks. InPointers in a block are referred to by index (0, 1, 2... etc) that doesn't change if the local object the InPointer refers to is moved around the block.


An InPointer consists of: [localObject, backRefs]. localObject is a 16-bit direct pointer to the object in this block. "backRefs" is a list of back references to the out pointers in other blocks which refer to this pointer. This list can be implemented in several ways. One way is to have each FarRef have a nextPointer, so that they form a linked list of all FarRef pointing to a particular InPointer (needs more thinking - how does the object table fit in?).


An FarRef consists of a pointer (64 bits?) to an entry into the object mapping table. It may also contain a link to the next FarRef in a linked list of back references if back references are implemented in this way.


* Including linked lists in the FarRef will cause a lot of CPU cache misses as it jumps around memory. Use a local array instead.


* Rather than using an object lookup table, just a block lookup table is needed. An FarRef then is a (block#, inPtr#) tuple. An InPointer is then a (localRef, backRefs) tuple with both entries being direct 16-bit pointers.  Unfortunately, this limits the number of back references to what can fit in a 64KB block.



A backtracing GC algorithm uses the back references to determine how many references there are to a particular object. When this list is empty, the object can be GCed.


The backtracing GC algorithm would mean that an image could be mostly stored on disk. The GC algorithm does not trace through all of memory, meaning that objects on the disk can remain untouched by a GC.



The backtracing GC algorithm also is very concurrent; a single object can be reclaimed by it's own private thread of execution independently of any other GCs which are also happening.


Each block can be garbage collected, compacted, etc independently of any other block.


Objects can be migrated between blocks to optimize object access. Preferably, most pointers are local 16-bit pointers which are much faster than inter-block pointers.


This type of virtual machine would make it possible to use the SSEs on the cell processor. It would make very good use of multi-cored CPUs.


Custom hardware could be made that executes Smalltalk bytecodes directly. GC and far reference management would then be implemented in bytecodes rather than by hardware. Also, a cell-processor style local memory could be used for each core with a master core doing transfers to/from main memory and disk.


Disadvantages: can't have large (>64KB) arrays.


This could also use a partition on a disk rather than files on a filesystem. Both options should be available. RAID would be used to make an image span multiple disks.


Implementing this


A Squeak image is a block! It isn't a 16-bit one, but it can be a block.


A Squeak image can have in pointers and out pointers. In pointers would be some global collection in that image (OrderedCollection maybe? or an Array?). Out pointers would be pointers to other images.


Inter-image communication could be done using perhaps the HydraVM primitives, or maybe a networking protocol such as MPI? Voila... Squeak on a cluster!


This could all be written in Squeak, with special primitives to "accelerate" bytecode interpretation. Bytecode accelerators would return to the interpreter when they need to handle a FarRef. All block handling could be done in standard Squeak.




This is a prime candidate for implementing in Slang.


An initial implementation could be written in Smalltalk and run on a VM. The basic functionality it needs is for blocks (ByteArrays), processes (Process), display output, disk access and probably network access. Unfortunately, a Squeak implementation would not be optimal because Squeak does not support multiple processes. An initial prototype could verify that the design works.


Can Smalltalk (rather than Squeak's Slang) be compiled to C?


A class MemoryBlock could be made with an artificial limit on the values each location can contain. In this way, both 16-bit and 32-bit NearRefs can be played with. The MemoryBlock would be an array of the size of each block that can be written to disk. By making the implementation flexibile (wrt size of NearRefs and block size), some experimentation can take place to see which gets the best performance.


These MemoryBlocks would be written to a real disk, which will show the exact same behaviour as a compiled VM.


Multi-threaded implementations can be simulated using Squeak Processes. These would become pthreads in a compiled version.


The Squeak interpreter as it stands could remain largely the same. The ObjectMemory could be replaced with a BlockMemory. Most methods would stay the same; no special block handling functionality would need to be added at the Interpreter other than "interrupts" when the block is full which causes a call to bytecodes which then handle that case. The BlockMemory objects would also be visible from within the image so that GC can happen in bytecodes (if desired - slow).


Ideally, the Slang compiler could be modified to be able to read and write real objects as well. These "real" objects would be compiled into a bunch of C functions that take a pointer to the memory location of that object as their first argument. These functions would have an implementation of the guessed class of that object (where perhaps profiling information would be useful?), and first tests to see if that class is correct. If that guessed class is wrong, the primitive would fail and the Interpreter would resume by trying to find another implementation of that method. Several tests like this could be chained together if the method is a common one. Alternatively, the compiled VM could keep a special method lookup table which referred to compiled versions of these methods. (TODO: how does Exupery do this?).


Prototype Implementation

An implementation could be made in Squeak using plain objects to test whether the algorithms work and how well it performs.

  • Blocks could contain an OrderedCollection of InPointers and an OrderedCollection of other objects.
  • Objects would just be Arrays of integers. Element 1 could be metadata (a special object). Element 2 could be the class.
  • The object index would be a Dictionary mapping integers to (block, InPointer index, backreference list) tuples.
  • CompiledMethods could stay the same.
  • The interpreter from the Debugger could be ripped out and used.




Targetted Platforms

  • Multi-cored computers (dual-core / quad-core).
  • Very multi-cored computers, e.g. Sun T2000.
  • Cell processor (6 cores, local memory only).
  • Platforms with little memory - most data held on disk.
  • Custom hardware.


Cores must be able to be added and removed at run-time to facilitate:

  • Powering on and off the various hardware cores.
  • Reassigning hardware cores to other tasks (e.g. primitives, reconfiguring FPGA for other task, reconfiguring Rapport Kilocore core for another task).
  • testing, tuning.


Ideally, the word size for inter-block references should be customisable - 16 bit, 32 bit. This allows this VM to run efficiently on an architecture where 64kb is too fine a granularity.


Custom hardware implementation


This VM could be designed with an eventual hardware implementation in mind. A standard computer could have Smalltalk co-processors added to it, much like the Cell architecture has a PowerPC core with custom SSE added to it. The main CPU would handle moving data around the co-processors and doing GC and block management. Alternatively, this could also be done by a Smalltalk-based CPU which would just be one of many cores.


Each "core" would have 64KB (or maybe more if multiple blocks need to be loaded?) of local static RAM, and would execute Smalltalk bytecodes. The following would be implemented in hardware:

  • Bytecode execution (stack (push, pop), jump, send, return).
  • Primitives:
    • SmallInteger, Integer operations.
    • Floating point operations (done by shared co-processor?)
    • Object allocation (basicNew).
    • BitBlt and display operations on a shared display device?
  • Maybe intra-block garbage collection? Can this be done in hardware?
  • Block primitives - copying a block back to main memory, requesting a block from main memory, locking a block for writing. These would be callable using special bytecode primitives.
  • Message dispatch table


The following would be implemented using bytecodes because they would be too complex to be implemented directly in hardware:

  • Inter-block reference management.
  • Inter-block garbage collection.
  • Maybe intra-block GC if it can't be done in hardware.
  • Block management - moving blocks between disk, main memory and each core's private memory.
  • Method lookup (if it is not in the message dispatch table) and CompiledMethod loading from other blocks.


Example: code in a CompiledMethod wants to write to an object referred to using a FarRef. The hardware would detect that the object being written to is an FarRef rather than an actual object and would jump to a special location that has bytecodes for dealing with FarRef (i.e. cause an "interrupt"). That special code would load the block in the FarRef into the local memory (maybe unloading other blocks in the process to make room), lock it for writing, write the value, unlock the block for writing and return from this "interrupt".


A message send would require the CompiledMethod to be locally available. This is discussed below.


Block-based VM OS


Clients should be fast booting. Clients should only need minimal hardware resources; provided that they are reactive, all the grunt can sit on the server.

Anything that runs across an administered network should get it's full configuration from that network.


If an OS was going to be made out of this VM:


  • It should be easy to install - an installation of a server would be done using a remote management client. Clients could be either simple PC applications or thin clients.
  • It should be installable over a network connection and using nothing but a serial console or telnet session if need be.
  • Installation involves the absolute bare minimum. If anything can be configured in a running system, then that's where it will be configured rather than during installation.
  • Like ZFS, all storage vehicles (disks, CF cards, other devices) should be able to be added and removed from the "pool".
  • There should be no file system except as an optional device.
  • Object storage should sit directly on top of disk partitions.
  • Object storage can sit on a RAID array.
  • All management of a server will be remote over a network connection. It will be assumed that a server will never have a screen or keyboard.
  • Clients will boot in less than 2 seconds. Clients will shut down in less than 2 seconds. This may be simulated using a battery to keep DRAM refreshes going, or by doing rapid loading of memory directly from a reserved part of storage. Servers can take as long as they want to boot.


Booting a server:

  • The server boots over the network using DHCP, tftp or whatever. The network card does this, and the BIOS (or whatever) is configured to do this by default.
  • The boot server should inform the network administrator that a new server is asking for boot information and ask what to do. From this point onwards, the administrator has complete control over the new server.
  • Logging is done to a log server once the kernel has loaded (or before, if possible!). Ditto alarms, management. Infact, why not just keep a TCP connection to a management server for logging, alarms and management commands?
  • The server reads it's disk configuration from the network - i.e. which partitions on which disks it should start up.
  • It boots.


If it is a client, then it boots the traditional way by reading a boot block from storage.


On disk, the latest snapshot contains a record of what was last in memory. This is loaded back into memory. It would be possible to boot from an earlier snapshot - on a server, this would involve using management commands and on a client using a pre-boot menu of some sort.


Debugging made awesome


Debugging would be awesome if you could not only go forwards, but also backwards in time. This could be implemented by making snapshots, and recording the exact times when external resources affect the image. By recording the exact time (to the currently executing bytecode) of every external influence on the VM, the VM can be made completely deterministic allowing the exact sequence of events leading up to a bug occuring to be recreated.


This would require recording the exact time to the bytecode and the device I/O of:

  • Context switches (i.e. switching between processes) occur.
  • Accessed timers.
  • All network traffic coming in.
  • Keyboard, mouse events.
  • file I/O times and latency.
  • other devices.


Given a snapshot and the above events that occur, the VM can then recreate the exact circumstances under which an error occured. This allows the debugger to go back to a particular snapshot and recreate events up to a certain point in time with the user having full visibility over what happens.


The recorded information should be minimal. If the overhead is not to high, it may even be possible to keep, say, the last few minutes since the last snapshot of this information in memory and dump it to log files when an unhandled exception occurs (allowing the programmer full visiblility over what happened).


An added benefit is that this information acts as a log file which can be replayed when the VM is restored.


Implementation details


Pooling across partitions.


The object store sits on top of the block manager. The block manager keeps track of which blocks are stored on which physical disks (or if they are only in memory). The block table stores this information.


Would it be worth making the block manager able to keep "backup blocks" on separate devices, ala software RAID? If a disk is going to be decommissioned or replaced, it should be taken out of the pool meaning all blocks / objects need to be moved off it onto other disks. A disk can be marked as "very reliable" (i.e. RAID) to "unreliable" (i.e. IDE cables made of cellotape) and replication algorithms can be made depending on the reliability of disks. You would also need to specify that partitions are actually on the same disk, meaning that replication isn't useful.


Accelerated InterpreterSimulator


Currently, Squeak has an InterpreterSimulator written in Slang (a Smalltalk subset which compiles to C code). This can be interpreted using a Squeak VM or compiled to C to make a new Squeak VM. There is also another class that interprets bytecodes slowly, which is used by the debugger (Interpreter?).


If access to the raw memory of the image was available to code within the image, it would become possible for just one Process in an image to be run in the InterpreterSimulator, concurrently with other Processes being run by the native C interpreter. This would require that the VM is capable of proper multithreading and good ObjectMemory locking semantics.


This could also be how the debugger could be used. The debugger could use a modified InterpreterSimulator to simulate a Process running in an image, and modify the underlying image memory directly.



What advantages would this approach have over the existing approach of simulating the entirety of a Squeak VM? Code would run faster, and only a very small part of the image is actually needed.


Some form of image protection would be needed. If a bug or accident occurs, there needs to be safeguards to make sure the development image is not scrapped. Perhaps only a particular block might be simulated?




Reusable API for various object memories. We assume a fixed field size, probably something like 64 bits, which can hold multiple references. We assume that fields can be moved within a block for compacting GC.


We assume that objects might span multiple fields. TODO: it's easier just to let the user define objects of various sizes and ask the user which parts of that are references.


This API would probably be provided as C code so the compiler can optimise stuff. Most of these methods are best inlined. Can a compiler inline methods from a library?


block = createBlock(transaction);

block = retrieveBlock(blockPtr, &retrieveDoneCallback());

commitBlock(block, commitLevel, transaction); // commitLevel in {eventual, toCache, toDisk}

discardBlock(block, transaction);

// The block is a pointer into memory. There will be a BLOCK_SIZE which can be added to it to find the end of the block. The user has complete control over the contents of the block, but must respect some callbacks to traverse, merge, split and move fields within the block.


// InPtr management.

it = getInPtrIterator(block);
it = inPtrIteratorNext(it);
setInPtr(*inPtr, nearRef); // When a field is moved.
nearRef = getInPtr(&inPtr);

// For compacting GC

// TODO: this API is not sufficient to update all references in a block. We'd need to keep a translation array of where fields have been moved to, and translate all references in a second pass (?).

nextFreeMemory = callbackCompactObject(block, nearRef, nextFreeMemory);

// Called when a block is full:

garbageCollect(block, transaction); // Can also be invoked by the block manager.
    // Called for each object; the user implements this.
    callbackGarbageCollectFindReferences(blockPtr, nearRef);
        // the above must call these function for each reference found.
        garbageCollectDoTraverseNearRef(blockPtr, nearRef);
        garbageCollectDoTraverseFarRef(blockPtr, inPtr); // User must make his own far refs.

percentUsed = callbackBlockPercentUsed(block);
callbackMergeBlocks(block, block);
block, block = callbackSplitBlock(block);


// How do you move an object from one block to another? TODO

// Heck, by this stage we may as well just get the API to define object formats and manage it all ourselves.

// e.g. (revised version of the above GC callback methods)

*obj = createObject(*block, sizeInWords); // aka malloc();


     // Which must call:

     traverseNearRef(&obj, offset);

     // The block manager manages nearRefs. Users should never touch them.

     // We can automatically identify and manage farRefs.

setNearRef(block, targetNearRef, offset, newValue); // or something??? *(targetNearRef+offset) = newValue.


// Replication?
setBlockReplicationAlgorithm(block, algorithm);
setBlockFaultTolerance(block, numReplicas);
// also getters.

// Transactions?
tx = createTransaction(level); // level in {serializable, repeatableReads, readCommitted, readUncommitted}


// Users want arrays

*array = createBigArray(type, numFields); // Can span multiple blocks




This looks eerily familiar: http://www.merlintec.com/lsi/persist.html

What every programmer should know about memory: http://people.redhat.com/drepper/cpumemory.pdf

Somebody made something very similar: http://portal.acm.org/citation.cfm?id=1640134.1640149

Comments (0)

You don't have permission to comment on this page.