Design for a block-based virtual machine.
Introduction
This is a design for a virtual machine that:
- Scales to terabyte or larger images.
- Has completely concurrent garbage collection (GC).
- Stores most of the image on disk with a cache in memory. Garbage collection does not need to access then entire image.
- Potentially could support transactions.
- Potentially could support cheap snapshots and rolling back of an image to any particular snapshot.
This is achieved by:
- The virtual machine (VM) is divided into blocks. The size of the blocks can be decided by the implementation. Blocks are arrays of bytes or words.
- Two types of pointers are used: NearRefs and FarRefs.
- NearRefs are direct references to memory locations within a block. The nature of these is up to the implementor of the VM; 16 bits and 32 bits are obvious choices.
- FarRefs refer to objects in other blocks and are described below.
- Two garbage collection algorithms are used:
- A simple single-threaded mark/sweep GC algorithm is used for intra-block GCs (intraGC).
- A backtracing GC algorithm is used for inter-block GCs (interGC).
Design
The memory of the virtual machine consists of several components (not including implied components such as the Interpreter):
(include diagram here).
Blocks
The blocks in the VM are arrays of bytes or words (depending on implementation) and store objects much like the Squeak VM currently does. Blocks are referred to by some mechanism that is up to the implementor.
Blocks have a number of "In-pointers" which are objects which must be able to be referenced from outside the block both before and after garbage collection. This could be implemented in a number of ways. These in-pointers form the root set for intra-block garbage collection.
BlockManager
A BlockManager is a service that manages blocks:
- If a block is requested that is on disk, the block manager will locate and read that block into memory to make it available.
- If a block has not been accessed for some time and memory is becoming full, that block is eventually written to disk and removed from memory.
- Blocks may be stored elsewhere; for example, they might be stored or accessed on a remote computer over a network, or on separate disks and partitions.
- If supported, the BlockManager is responsible for performing snapshots of system state: it marks all blocks as "copy-on-write" so that the next time the block is modified, a new copy of it is made.
Processes
Processes are Smalltalk objects run by actual OS processes or threads. This VM is designed to be very concurrent.
Generationalish Garbage Collection
Blocks are roughly ordered by age. The details of this are up to the implementor. One possible implementation would be to have all blocks be in a linked list from newest to oldest by creation or last-write date. This ordering is shared by all Processes.
Eden space
Every process has a special block called the "Eden space". This is the only place that new objects can be made. When full of objects, the Eden space is intra-GCed; if still mostly full, it becomes a regular block and a new Eden-space block is made for that thread.
NearRef
A NearRef is a direct pointer to another object in the same block. It could be a 16- or 32-bit pointer depending on the implementer's preference. I suspect that a 16-bit pointer would allow many more objects to fit into a CPU's caches meaning that overall performance would be faster, but some CPUs such as ARM processors have poor support for 16-bit pointers. A NearRef is a pointer to a memory location in the same block (indexed starting from the beginning of the block) to an object header.
If a "read-block" is supported (see below) then part of the address space of a NearRef could potentially point into that block instead.
FarRef (also called OutPointers)
TODO: search/replace "OutPointers", replace with "FarRefs".
A FarRef is a distant pointer that points to objects in another block. It is a standard Smalltalk object containing a pointer to an entry in the object index table (see below) which is handled in special ways by the Interpreter.
An object referring to an object in another block would have a normal NearRef to a FarRef object in the same block. That FarRef would refer to an entry in the object index table. The object index table entry would contain the block ID and the object number (a NearRef) in that block of that distant object.
Object Index Table
The virtual machine would have a large Object Index Table. Each entry in this table refers to a particular object in a particular block. This table is used by every FarRef to locate objects in blocks other than their own.
Because of it's possible large size, this object index table might have its data stored in special blocks so that the index is stored mostly on disk.
Each entry would have the following properties:
- An object index table ID, which is the same number stored in FarRefs to refer to objects referred to by this table. If implemented as an array, this ID could be implied by the position of this entry in this table.
- The block number of the object this entry refers to.
- The NearRef in that block that this entry refers to.
- A list of "backreferences" which enable a backreferencing garbage collector to be used. This allows cheap navigation from any object to every other object that refers to it. Each entry in this backreference list would be a object index table ID. Alternatively, each entry could be a block ID and every access would require searching the block for that reference.
- Optionally, information about the last read or write time of this object may be kept for optimising the locations of objects in this VM. One implementation is a simple timestamp. Another implementation is to store an updating count, mean and standard deviation of the last read (and write?) operation on this object.
There needs to be a method of cheaply finding all table entries for a particular block. This is used as a root set of objects for intra-GC. One way of achieving this is to build a separate data structure that forms an "index" (in the relational database sense) over the block number column.
Concurrency
Multiple OS processes/threads could run interpreters at the same time. Each would have their own block as their "eden space" for creating new objects.
Multiple interpreters can perform most of their operations without problems:
- Reading an object works fine without protection.
- Writing to an object's instance variable can be done without protection if the actual write is atomic.
- Making a new object is always done in a private block which is an Eden space, so no protection is needed.
The object index table would need to be protected for concurrent writes.
Intra-block garbage collection only affects a single block, so any number of blocks could be concurrently intra-GCed.
Inter-block garbage collection is also safely parallel.
TODO: what happens when OS threads want to access objects in another thread's Eden space?
Intra-block garbage collection (Intra-GC)
Intra-block garbage collection is done using any standard available GC algorithm. It could for example be a compacting mark-sweep collector. If Squeak is used as a base for a block-based VM, the garbage collection algorithm could be retained as the intra-GC algorithm.
The root set of an Intra-GC is the collection of all object index table entries that refer to this block. For this reason, all objects in a particular block need to be cheaply accessable somehow from the object index table.
If the Intra-GC removes an FarRef as garbage, it must also remove a backreference from the FarRef's object index table entry.
At some stages during execution, various blocks might be marked for Intra-GC when there is suspected garbage. One or more background collector processes could keep a list of blocks that need to be eventually collected, and eventually GC those blocks.
Interpreters running in a block might need to also have their pointers updated if the objects they point to are moved around.
BackReferencing Garbage Collection (Inter-block GC)
TODO: why not just use reference counting? Why did I decide that backreferencing GC is needed, other than to detect cycles?
This garbage collection algorithm is very similar to reference counting GC. Instead of keeping a count of the number of references to a particular object, a list of objects referring to this object is kept. This algorithm is often used in distributed systems but not standard virtual machines because of the overhead. My hope is that this overhead is able to be managed by having the vast majority of pointers be simple NearRefs.
The algorithm is simply that, by having the rest of the VM maintain the list of references (a.k.a. backreferences) to a particular object (details are described below), it can be determined that an object is garbage when its list of backreferences becomes empty. That object index table entry can then be removed and recycled. Empty entries can form a linked list to each other so that finding an unused object index table entry is a cheap operation.
One problem is that objects could have cyclic references to each other, which is exactly the same problem that occurs in reference counting GC implementations.
One cycle-detecting algorithm would be to have the first backreference in the backreference list be the the "primary backreference", such that all primary references in all objects form a non-cyclic tree from the root set to every single object (also including NearRefs, although their backreferences are not explicitely recorded). When the primary reference is removed from an object's backreference list, that object must begin a breadth-first search through its backreferences searching for another primary backreference on another object that can be used to reconstitute the non-cyclic tree from the root set to this object. This search will need to mark objects as it goes to make sure it does not loop back on itself indefinitely. If the search fails, the object and all objects involved in the search are garbage.
When table entries are removed from the object index table, then the block of that entry must also be marked for an eventual intra-GC. The table entries form the root sets for intra-GCs on each block, and the removal of each entry might mean that the block in question might now contain more garbage. In this way, garbage collections might cascade over several blocks.
The search for another primary backreference to remake the acyclic tree could be running concurrently with other searches. When marking objects, the mark must be unique to this particular instance of the GC algorithm. If the search encounters a mark that is our own, then a loop has been found and the search must continue elsewhere. If the search finds another search algorithm's mark, then the algorithm can asynchonously request the eventual result from that other algorithm and continue searching elsewhere.
As a backreference list could become potentially very large when a large number of objects reference a particular reference, a suitable data structure needs to be chosen for this list. One optimisation would be for when many objects in one particular block start refer to the same object index table entry; the backreferences to all of those objects could be replaced with one backreference entry that represents the whole block and no particular object in that block; to find the backreferences, that entire block would need to be searched.
Reference-Counting GC
At this stage, I'm wondering if reference counting could be used instead. Each object table index entry has a reference counting counter added to it. A small (8? 16?) array of block back-references could be used to try to detect cycles: after a reference count has been decremented, it can be scheduled for cleaning in VM idle time where the block back-reference list is scanned for any cyclic references.
Alternatively, the cyclic references could be moved around by an object optimiser until they are in the same block - and then they get GCed by the intraGC. This won't happen if the cycle won't all fit into one block.
Alternatively, they could just be left alone. Eventually, they'll sink down the generations until they get stored on disk and forgotten about. Then a regular (weekly? monthly?) mark-sweep GC of the entire image can collect these cycles. This would be like checking your filesystem for corruption.
If the block back-reference list becomes full or the reference count gets very large, the object can be marked as "permanent" and simply won't ever be GCed. Example permanent objects are true, false, nil and often used classes; these objects are referenced by so many other objects that they will always exist in the image.
Implementation Details
Every time an FarRef is made, we also assume that the VM adds this new value to the backreferences of the referered object. Every time we create possible garbage in a non-Eden block, we mark that block for GC.
If an FarRef is ever modified to refer to another object in the same block, the reference should be made a NearRef.
Intra-block Garbage Collection
This would happen when Eden space is full and on a block when the removable of all backreferences to an object in that block cause it to be removed. The latter might cause a cascading GC between lots of blocks.
- If supporting snapshotting, copy the block for writing:
- lock the index entries for all InPointers for writing.
- copy the block.
- update all index entries for InPointers to point to the new block (note: keep the old index entries for rollbacks?)
- unlock the index entries.
- If not supporting snapshotting, just lock the block for writing.
- Do a simple mark-sweep GC on the NearRefs in the block. Use the entries for this block in the object index table as the root set.
- If any FarRefs are GCed because they are not referenced from inside the block, remove their entries from the backreference lists.
Reading an object
If the object is in Eden, just do it.
A NearRef or a FarRef can be traversed to find the remote object.
If the object is in a remote block and already has an object index entry to point to it, then the object can simply be read:
- The object would have a NearRef which pointed to a FarRef in the same block (eden space).
- The FarRef would contain an index into the object index table.
- The object index table would contain a block number and object number in that block.
- Read that block into memory (using a block manager service or something) if it is not already there.
- Refer to the object in that block at the particular object number. If we are using the address space trick below, make that block the new "read block", removing an old one if necessary.
If the object is the result of a message send to another remote object, and was referenced by that other object using a NearRef, then either:
- (label:readFarRef) A new object index table entry needs to be made to that object so that it can be remotely accessed, or
- (label:joinAddressSpace) Some magic could be done with the address space within the Eden space to also temporarily include the read block into the same address space.
The problem with option (readFarRef) above is that iterating over a collection in another block would incur significant overhead, as an FarRefs needs to be made for every object accessed.
Read-blocks in the same address space as Eden
Option 2, (joinAddressSpace) above involves reserving some of the address space in each block for accessing another block directly.
- Eden space, could for example, be addressed in the upper half of the NearRef address space.
- The current "read-block" could be address in the lower half of the NearRef address space. Another way of thinking about this is that the highest bit of the address space determines whether the address refers to an object in Eden or an object in the "read-block" space.
To join address spaces, first we separate any current block occupying the read space, and then make the new block the new read space.
To separate address spaces, there are two options that I can see:
- Keep track of any pointers in Eden that refer to the read-block, and convert them into FarRefs when separating the address spaces. It may or may not be beneficial to do a garbage collection first depending on how often blocks become the "read-block".
- Do a intra-GC in Eden; while doing this intra-GC, convert any surviving references to the read-block into FarRefs.
Perhaps another portion of the address space could be reserved for CompiledMethods and classes as well? This depends on whether looking up a CompiledMethod incurs significant overhead if that CompiledMethod is referred to using an FarRefs. This may be the case, as every MethodContext and BlockContext would refer to a CompiledMethod meaning that backreferences need to be rapidly made and destroyed for every new context.
Writing to an object instvar
If the object is in Eden, just do it.
- If the referred value is a FarRef obtained from a method invocation on an existing FarRef, then make a FarRef and update backreferences.
(How to move an object between blocks:)
If the object is in a remote block, one idea is to move that object to Eden for writing:
(TODO: check this algorithm)
- If the block is on disk, read it into memory.
- Copy that object to Eden.
- Make new FarRefs for every instvar in that object. All instance variables will need to be OutPointers because none of them will refer to anything in Eden.
- For FarRefs referring to the original block, create new object index table entries for references which were not already FarRefs . Add backreferences
- For FarRefs referring to other blocks, use the original FarRefs to locate the InPointers.
- All instance variables in that object will refer to new FarRefs (TODO: what if the remote block gets full?).
- Create an object index table entry for that object in Eden (TODO: lock anything?).
- In the original block, create an FarRefs for the old object at the location where the original entry was (effectively, do a becomeForward: to an FarRefs ). This should point to the index entry, and a backreference should be added.
- Mark that block for GC.
- Modify the object index to point to the new InPointer in Eden.
- Unlock the index entry.
- Edit the object.
It might be simpler just to write directly to the block the object is in:
- If the block is on disk, read it into memory.
- Lock the block for writing.
- Write to the object.
- If the referred object is in the same block (via an FarRefs and InPointer) then refer to the block directly and remove the backreference.
- If the new pointer is a FarRef and the old pointer is a NearRef, make a new FarRef in that block.
- If the block becomes full, move the object to Eden and put an FarRefs in its place and update backreferences. Then, it would be possible to use a direct reference within Eden instead.
- If the new pointer is a FarRef or NearRef and the old pointer is a FarRef, then just modify it.
- Mark that block for garbage collection because the old value might have left garbage.
- Unlock the block for writing.
Creating a new object
New objects are always created in Eden, which can be considered permanently locked for writing.
If Eden becomes full, do an intra-GC. If it is still full, then it becomes a new generation and a new empty Eden space is made. The new generation might need extra space in it for possible FarRefs to be made. TODO: but it shouldn't become completely empty!
Perhaps if the address space is cut into at least two areas (see "Reading an object" above): Eden and a read block, then Eden could be compactly mark-sweeped into the read block.
Sending a message
(TODO: review this)
Usually in Smalltalk, a message send requires looking up the object's class, search the method dictionary, and then searching method dictionaries of all superclasses. To optimise this slow process, a message dispatch table is used. This table needs to be stored somewhere.
The CompiledMethods that are to be executed also need to be available locally.
The result of a message send on a FarRef might mean that another FarRef needs to be created. Alternatively, some clever scheme could be used to make the remote block part of the local address space for a while.
One possibility is for each thread to have an auxilliary block that contains read-only replicas of CompiledMethods. This auxilliary block can also be used by the garbage collector for storing state. Potentially this block could also hold the current stack.
Another possibility is to have a special object which contains a CompiledMethod's code, but is read-only and transient, and it is this object which gets copied into a thread's current block. The garbage collector will simply remove this object when GC occurs as it is only a copy of the original CompiledMethod. This special object would contain a single link to the original CompiledMethod where the literals can be found. This copying may have significant overhead.
Another possibility is to have the interpreter be able to simply execute code which is in another block. This would make a hardware implementation need to understand OutReferences and InReferences. In this occasion, a hardware implementation with cores having their own static memory would mean that the block containing the method would need to be copied to the local memory - for multiple methods in different blocks this can be significant overhead. The overhead can be minimised by putting serially executed methods in the same block, and perhaps including a long-lived message dispatch table with that block as well.
Making a new Object Index table entry
A new entry can simply be made with the relevant backreference.
Removing an Object Index table entry
The index entry should also be removed. The block mAn index entry and backreference list for that new InPointer also needs to be made.ust be marked for intra-GC.
Making a new FarRef
Before creation, it should be checked whether the referred object is in the same block. If so, just make a direct reference instead.
Also, check to see if there is already an FarRefs for that object in this block. This can be done by searching the backreferences of the referred object.
A new backreference must be added to the referred InPointer for this FarRef.
Modifying an FarRef
This is functionally equivalent to removing the old FarRef and adding a new FarRef, with all NearRefs to the old FarRef be redirected to the new FarRef.
- Remove the backreference for this FarRef.
- Change the value of the FarRef to another value. If the new value is in the same block, replace the reference with a NearRef.
- Add a new backreference for this FarRef .
Removing an FarRef
FarRefs can be GCed by the intra-GC. When this happens, the backreference for it must also be removed.
Removing a backreference
If the last backreference is removed, then the InPointer that it refers to can also be removed and the relevant block marked for GC.
Block is below half-full
If a block becomes empty, it can be recycled or removed.
If a block becomes less than half-full, then it would be worth merging that block with another. Perhaps a list of half-empty blocks should be kept so that a block of similar age can quickly be found. The block with which it merges should be related; perhaps FarRefs or object index table entries should be searched for another half-full block? Perhaps blocks should only be merged if two subsequently requested blocks (by a single thread) are found to be below half-full? Block merging can be done in a separate thread.
The block should not simply be emptied into Eden space because that would cause object churn: generations below Eden space will often empty out quickly, causing older objects to continually be churned between Eden and the generations just below it.
Block is full
A block that is nearly full runs the risk of not having enough space for InPointers that need to be created.
The best approach may be to cut the block in two. This requires an efficient graph-theory algorithm to find a division between the objects with the least number of connections between the two sets (or at least a good minimal division). One approach is to do profiling to determine which instance variables of a class are used most (storing the information with the class?), and then creating a (depth-first?) tree following the most common links first from that stopping at half or one third of the way.
Another approach is to give each object a counter (in a new block?), then starting with any object, increment all objects it points to by one, and then follow the link to the object with the highest value until half the objects have been reached.
Snapshot
A snapshot can be taken of the running image by making a copy of the block table and object index table, and making all blocks copy-on-write.
(TODO)
Writing a block to disk
Every block has an age associated with it. One way of doing this is to have blocks in order (somehow) and the oldest blocks (that have not been read or written for a long time) are written to disk. This is only necessary when memory is full.
Occasionally (every 60 seconds?), a snapshot is made where all blocks in memory, including the most recently created ones are written to disk. This is done by marking those blocks as copy-on-write and synchronising the originals to disk in a background thread. Only dirty blocks need to be written; a block is marked as dirty when it is locked for writing.
Because the block table can have 2^64 entries in it, it too will need to be predominantly stored on the disk. The block table will be stored last, so that recovery is possible if the write to disk fails part way through. Blocks on the disk are not overwritten; the disk image is only appended to with newer versions of blocks.
One way of preventing ever increasing disk usage would be to keep only several snapshots, e.g. 1 minute ago, 1 hour, 1 day, 1 week, 1 month. These would roll over - the minutely one gets overwritten every minute. Once an hour, the hourly one is replaced by the minutely one. Every day, the daily one is replaced by the hourly one, etc. The rolling over happens from oldest to youngest so that the latest minutely snapshot doesn't overwrite everything! Snapshots refer to objects in each other, so they must all be present. Replacing one snapshot by another may just be a case of updating some pointer somewhere.
One way of thinking about snapshots is that they are like generational garbage collection. The oldest objects are in the oldest snapshot.
Writing to an object in an old snapshot would involve reading the relevant block from that snapshot into memory, writing to it, and when the snapshotting occurs it now becomes part of the latest snapshot.
Moving objects between blocks
(also, see above in "Writing to an object instvar")
For optimization, it is sometimes preferable to move objects between blocks to minimize the number of far references. This is done as follows:
- A GC is performed on the old block to remove unnecessary references to this object.
- Both blocks are locked for writing by this thread.
- The object is copied and a new InPointer is made for it in the new block. If the object referred to objects in the old block, FarRefs are made. The local block is checked for to see if any of these FarRefs already exist and can be re-used (maybe?).
- If any objects in the old block refer to this object, a new FarRefs is made in the old block referring to this object's new location. Those objects are updated to point to that OutPointer.
- The back references for the old InPointer are traversed and updated. If any of these back references are actually in the new block, then substitute them for a local reference.
OutPointers in a block can be used by multiple objects in that block. To find if an OutPointer to an object already exists in a block, either the backref list of the target can be searched, or all OutPointers in a block can be searched.
Writing to an object, old ideas:
An optimisation is to make a thread hold a write lock on a block, releasing it only when asked to. When another thread gains the write lock on a thread's current block, that block's thread is blocked until it regains its lock. The owner thread cannot make any new objects meaning that it cannot create any MethodContexts until it regains its lock. This means that a thread might relinquish its lock on its current block at every context switch. An optimisation would be to make the owner of a block make a new block when another thread wants to access its current block. This makes it effectively impossible for one thread to access another thread's current MethodContext, meaning the debugger would have a difficult job.
Alternatively, blocks can be copy-on-write. A thread will do all its work on a copy of the block. When the thread has finished, the block is "committed" back to be part of the main memory. TODO: think about how this affects concurrent synchronisation.
The copy-on-write method may make it possible to have a transacted VM with rollback possibilities. If blocks are copied on write, then a snapshot could be made by keeping the old versions of the blocks and keeping an old copy of the block table.
Say that to write to a block, a thread obtains a copy of that block for writing to, and when finished the block is inserted back into the image. This means:
- Any other thread wanting to write to that block will need to wait for the existing block to finish. This means that the block needs to record who has it locked anyway. This only happens when writing to existing objects; new objects will be created in thread's private blocks.
- I suspect that every time a Semaphore is signalled, the block will need to be written back before the involved thread is resumed to ensure that concurrent synchronisation still works. I'm not sure if there are other cases, such as waiting on a semaphore.
- Other threads can still read from the old copy and so no blocking is needed. They only block on writing.
- This approach may or may not have significant memory copying overheads. An initial write to a block means it needs to be copied.
Random Obsolete Ideas
A pointer within a block is a direct 16-bit pointer. These are fast and the intention is that these local pointers help absorb the overhead generated by the rest of this proposal.
A pointer from a block to an object is another block has three levels of indirection. Firstly, a 16-bit pointer points to a larger "OutPointer" object (e.g. 64 bits?) in the same block. That larger pointer points to an entry in a global object mapping table. That table entry then has the location of the external block and the "InPointer" index of the object in that block.
Blocks contain objects, InPointers and OutPointers. OutPointers point (indirectly) to InPointers in other blocks. InPointers in a block are referred to by index (0, 1, 2... etc) that doesn't change if the local object the InPointer refers to is moved around the block.
An InPointer consists of: [localObject, backRefs]. localObject is a 16-bit direct pointer to the object in this block. "backRefs" is a list of back references to the out pointers in other blocks which refer to this pointer. This list can be implemented in several ways. One way is to have each OutPointer have a nextPointer, so that they form a linked list of all OutPointers pointing to a particular InPointer (needs more thinking - how does the object table fit in?).
An OutPointer consists of a pointer (64 bits?) to an entry into the object mapping table. It may also contain a link to the next OutPointer in a linked list of back references if back references are implemented in this way.
* Including linked lists in the outpointers will cause a lot of CPU cache misses as it jumps around memory. Use a local array instead.
* Rather than using an object lookup table, just a block lookup table is needed. An OutPointer then is a (block#, inPtr#) tuple. An InPointer is then a (localRef, backRefs) tuple with both entries being direct 16-bit pointers. Unfortunately, this limits the number of back references to what can fit in a 64KB block.
A backtracing GC algorithm uses the back references to determine how many references there are to a particular object. When this list is empty, the object can be GCed.
The backtracing GC algorithm would mean that an image could be mostly stored on disk. The GC algorithm does not trace through all of memory, meaning that objects on the disk can remain untouched by a GC.
The backtracing GC algorithm also is very concurrent; a single object can be reclaimed by it's own private thread of execution independently of any other GCs which are also happening.
Each block can be garbage collected, compacted, etc independently of any other block.
Objects can be migrated between blocks to optimize object access. Preferably, most pointers are local 16-bit pointers which are much faster than inter-block pointers.
This type of virtual machine would make it possible to use the SSEs on the cell processor. It would make very good use of multi-cored CPUs.
Custom hardware could be made that executes Smalltalk bytecodes directly. GC and far reference management would then be implemented in bytecodes rather than by hardware. Also, a cell-processor style local memory could be used for each core with a master core doing transfers to/from main memory and disk.
Disadvantages: can't have large (>64KB) arrays.
This could also use a partition on a disk rather than files on a filesystem. Both options should be available. RAID would be used to make an image span multiple disks.
Implementing this
A Squeak image is a block! It isn't a 16-bit one, but it can be a block.
A Squeak image can have in pointers and out pointers. In pointers would be some global collection in that image (OrderedCollection maybe? or an Array?). Out pointers would be pointers to other images.
Inter-image communication could be done using perhaps the HydraVM primitives, or maybe a networking protocol such as MPI? Voila... Squeak on a cluster!
Implementation
This is a prime candidate for implementing in Slang.
An initial implementation could be written in Smalltalk and run on a VM. The basic functionality it needs is for blocks (ByteArrays), processes (Process), display output, disk access and probably network access. Unfortunately, a Squeak implementation would not be optimal because Squeak does not support multiple processes. An initial prototype could verify that the design works.
Can Smalltalk (rather than Squeak's Slang) be compiled to C?
A class MemoryBlock could be made with an artificial limit on the values each location can contain. In this way, both 16-bit and 32-bit NearRefs can be played with. The MemoryBlock would be an array of the size of each block that can be written to disk. By making the implementation flexibile (wrt size of NearRefs and block size), some experimentation can take place to see which gets the best performance.
These MemoryBlocks would be written to a real disk, which will show the exact same behaviour as a compiled VM.
Multi-threaded implementations can be simulated using Squeak Processes. These would become pthreads in a compiled version.
The Squeak interpreter as it stands could remain largely the same. The ObjectMemory could be replaced with a BlockMemory. Most methods would stay the same; no special block handling functionality would need to be added at the Interpreter other than "interrupts" when the block is full which causes a call to bytecodes which then handle that case. The BlockMemory objects would also be visible from within the image so that GC can happen in bytecodes (if desired - slow).
Ideally, the Slang compiler could be modified to be able to read and write real objects as well. These "real" objects would be compiled into a bunch of C functions that take a pointer to the memory location of that object as their first argument. These functions would have an implementation of the guessed class of that object (where perhaps profiling information would be useful?), and first tests to see if that class is correct. If that guessed class is wrong, the primitive would fail and the Interpreter would resume by trying to find another implementation of that method. Several tests like this could be chained together if the method is a common one. Alternatively, the compiled VM could keep a special method lookup table which referred to compiled versions of these methods. (TODO: how does Exupery do this?).
Prototype Implementation
An implementation could be made in Squeak using plain objects to test whether the algorithms work and how well it performs.
- Blocks could contain an OrderedCollection of InPointers and an OrderedCollection of other objects.
- Objects would just be Arrays of integers. Element 1 could be metadata (a special object). Element 2 could be the class.
- The object index would be a Dictionary mapping integers to (block, InPointer index, backreference list) tuples.
- CompiledMethods could stay the same.
- The interpreter from the Debugger could be ripped out and used.
Targetted Platforms
- Multi-cored computers (dual-core / quad-core).
- Very multi-cored computers, e.g. Sun T2000.
- Cell processor (6 cores, local memory only).
- Platforms with little memory - most data held on disk.
- Custom hardware.
Cores must be able to be added and removed at run-time to facilitate:
- Powering on and off the various hardware cores.
- Reassigning hardware cores to other tasks (e.g. primitives, reconfiguring FPGA for other task, reconfiguring Rapport Kilocore core for another task).
- testing, tuning.
Ideally, the word size for inter-block references should be customisable - 16 bit, 32 bit. This allows this VM to run efficiently on an architecture where 64kb is too fine a granularity.
Custom hardware implementation
This VM could be designed with an eventual hardware implementation in mind. A standard computer could have Smalltalk co-processors added to it, much like the Cell architecture has a PowerPC core with custom SSE added to it. The main CPU would handle moving data around the co-processors and doing GC and block management. Alternatively, this could also be done by a Smalltalk-based CPU which would just be one of many cores.
Each "core" would have 64KB (or maybe more if multiple blocks need to be loaded?) of local static RAM, and would execute Smalltalk bytecodes. The following would be implemented in hardware:
- Bytecode execution (stack (push, pop), jump, send, return).
- Primitives:
- SmallInteger, Integer operations.
- Floating point operations (done by shared co-processor?)
- Object allocation (basicNew).
- BitBlt and display operations on a shared display device?
- Maybe intra-block garbage collection? Can this be done in hardware?
- Block primitives - copying a block back to main memory, requesting a block from main memory, locking a block for writing. These would be callable using special bytecode primitives.
- Message dispatch table
The following would be implemented using bytecodes because they would be too complex to be implemented directly in hardware:
- Inter-block reference management.
- Inter-block garbage collection.
- Maybe intra-block GC if it can't be done in hardware.
- Block management - moving blocks between disk, main memory and each core's private memory.
- Method lookup (if it is not in the message dispatch table) and CompiledMethod loading from other blocks.
Example: code in a CompiledMethod wants to write to an object referred to using a OutPointer. The hardware would detect that the object being written to is an OutPointer rather than an actual object and would jump to a special location that has bytecodes for dealing with OutPointers (i.e. cause an "interrupt"). That special code would load the block in the OutPointer into the local memory (maybe unloading other blocks in the process to make room), lock it for writing, write the value, unlock the block for writing and return from this "interrupt".
A message send would require the CompiledMethod to be locally available. This is discussed below.
Block-based VM OS
Clients should be fast booting. Clients should only need minimal hardware resources; provided that they are reactive, all the grunt can sit on the server.
Anything that runs across an administered network should get it's full configuration from that network.
If an OS was going to be made out of this VM:
- It should be easy to install - an installation of a server would be done using a remote management client. Clients could be either simple PC applications or thin clients.
- It should be installable over a network connection and using nothing but a serial console or telnet session if need be.
- Installation involves the absolute bare minimum. If anything can be configured in a running system, then that's where it will be configured rather than during installation.
- Like ZFS, all storage vehicles (disks, CF cards, other devices) should be able to be added and removed from the "pool".
- There should be no file system except as an optional device.
- Object storage should sit directly on top of disk partitions.
- Object storage can sit on a RAID array.
- All management of a server will be remote over a network connection. It will be assumed that a server will never have a screen or keyboard.
- Clients will boot in less than 2 seconds. Clients will shut down in less than 2 seconds. This may be simulated using a battery to keep DRAM refreshes going, or by doing rapid loading of memory directly from a reserved part of storage. Servers can take as long as they want to boot.
Booting a server:
- The server boots over the network using DHCP, tftp or whatever. The network card does this, and the BIOS (or whatever) is configured to do this by default.
- The boot server should inform the network administrator that a new server is asking for boot information and ask what to do. From this point onwards, the administrator has complete control over the new server.
- Logging is done to a log server once the kernel has loaded (or before, if possible!). Ditto alarms, management. Infact, why not just keep a TCP connection to a management server for logging, alarms and management commands?
- The server reads it's disk configuration from the network - i.e. which partitions on which disks it should start up.
- It boots.
If it is a client, then it boots the traditional way by reading a boot block from storage.
On disk, the latest snapshot contains a record of what was last in memory. This is loaded back into memory. It would be possible to boot from an earlier snapshot - on a server, this would involve using management commands and on a client using a pre-boot menu of some sort.
Debugging made awesome
Debugging would be awesome if you could not only go forwards, but also backwards in time. This could be implemented by making snapshots, and recording the exact times when external resources affect the image. By recording the exact time (to the currently executing bytecode) of every external influence on the VM, the VM can be made completely deterministic allowing the exact sequence of events leading up to a bug occuring to be recreated.
This would require recording the exact time to the bytecode and the device I/O of:
- Context switches (i.e. switching between processes) occur.
- Accessed timers.
- All network traffic coming in.
- Keyboard, mouse events.
- file I/O times and latency.
- other devices.
Given a snapshot and the above events that occur, the VM can then recreate the exact circumstances under which an error occured. This allows the debugger to go back to a particular snapshot and recreate events up to a certain point in time with the user having full visibility over what happens.
The recorded information should be minimal. If the overhead is not to high, it may even be possible to keep, say, the last few minutes since the last snapshot of this information in memory and dump it to log files when an unhandled exception occurs (allowing the programmer full visiblility over what happened).
An added benefit is that this information acts as a log file which can be replayed when the VM is restored.
Implementation details
Pooling across partitions.
The object store sits on top of the block manager. The block manager keeps track of which blocks are stored on which physical disks (or if they are only in memory). The block table stores this information.
Would it be worth making the block manager able to keep "backup blocks" on separate devices, ala software RAID? If a disk is going to be decommissioned or replaced, it should be taken out of the pool meaning all blocks / objects need to be moved off it onto other disks. A disk can be marked as "very reliable" (i.e. RAID) to "unreliable" (i.e. IDE cables made of cellotape) and replication algorithms can be made depending on the reliability of disks. You would also need to specify that partitions are actually on the same disk, meaning that replication isn't useful.
Accelerated InterpreterSimulator
Currently, Squeak has an InterpreterSimulator written in Slang (a Smalltalk subset which compiles to C code). This can be interpreted using a Squeak VM or compiled to C to make a new Squeak VM. There is also another class that interprets bytecodes slowly, which is used by the debugger (Interpreter?).
If access to the raw memory of the image was available to code within the image, it would become possible for just one Process in an image to be run in the InterpreterSimulator, concurrently with other Processes being run by the native C interpreter. This would require that the VM is capable of proper multithreading and good ObjectMemory locking semantics.
This could also be how the debugger could be used. The debugger could use a modified InterpreterSimulator to simulate a Process running in an image, and modify the underlying image memory directly.
What advantages would this approach have over the existing approach of simulating the entirety of a Squeak VM? Code would run faster, and only a very small part of the image is actually needed.
Some form of image protection would be needed. If a bug or accident occurs, there needs to be safeguards to make sure the development image is not scrapped. Perhaps only a particular block might be simulated?
Links
This looks eerily familiar: http://www.merlintec.com/lsi/persist.html
What every programmer should know about memory: http://people.redhat.com/drepper/cpumemory.pdf
Somebody made something very similar: http://portal.acm.org/citation.cfm?id=1640134.1640149
Comments (0)
You don't have permission to comment on this page.