gulik

 

Parallel processing

Page history last edited by Michael van der Gulik 8 mos ago

Parallel processing

 

Until recently, CPUs have been getting faster by increasing the megahertz of a CPU and using clever techniques such as pipelines, branch prediction and out-of-order execution. Recently the emphasis has changed to focus on parallelism - putting more cores into a CPU:

 

  • You can now buy a 6-core x86 CPU from your local computer shop. On a motherboard with four sockets, this means 24 cores in one box. Alternatively, four quad-core Xeons with hyperthreading gives you 32 hardware threads in a consumer-level PC. Caveat: very expensive.
  • The XBox 360 has three Power-PC based cores.
  • The Cell processor in the Playstation 3  has a simple Power-PC based core with 6 "SPE"s. Each SPE is a RISC core in its own right with 256KB of memory and each is 4-way SIMD (all instructions do their operations on 4 words simultaniously).
  • Sun's UltraSPARC T1 has 8 cores each running 4 threads each. Many of Sun's processors and computers are parallel.
  • NVIDIA's GeForce 8 Series video cards have 128 "stream processors" that may be programmable for parallel processing.
  • Intel has announced an 80-core CPU... at some stage in the future.
  • Then there's the Kilocore CPU...
  • The J-Machine: http://cva.stanford.edu/projects/j-machine/

 

What this means is that programmers will have to learn how to do parallel processing. Using multi-threading in conventional languages works if you don't mind race conditions and deadlocks. Java has good support for threading, and Squeak/Smalltalk supports it but currently doesn't take advantage of the extra hardware available.

 

Erlang apparently supports trivial multithreading. I'm not sure about Mozart/OZ.

 

Prolog, or a better logic-based language, should be parallisable. Prolog is essentially a search algorithm for a solution over logical clauses. Search algorithms are usually easily parallised.

 

In Smalltalk

 

Squeak, and Smalltalk in general, has quite good functionality in terms of multithreading, but unfortunately programmers often don't make use of it. As a result, Squeak feels very single-threaded. Running a command typically freezes up the entire UI.

 

There are a few patterns that can be used in Smalltalk to make use of parallel processing.

 

  • Any block can be forked: [ anObject doSomething ] fork. This creates a new thread.
  • Collections can be parallallised. "aCollection do: [ :each | each doSomething ]" could fork off a lot of threads. Ditto for other collection operations.
  • Futures can be used. This is similar to forking. If you don't need the result of something immediately, you could do "result := Future doing: [ anObject doSomething ].". When this code is complete, result contains the result of that block. If you try to use result before the block has finished, it will wait until the block returns.
  • Smalltalk has a notion of updating objects using Object>>addDependent:, Object>>changed and Object>>update (and friends). A small change here could make the VM use more threads; most updates don't need to be strictly in order.
  • The user interface could certainly be more multithreaded. The drawing thread should be high-priority and strictly non-blocking. There could be multiple threads updating widgets. As a result, multiple threads could be updating the screen at any one point.

 

One idea is to add a "percentageComplete" variable to a thread. When a thread is assigned to a particular task, it is assigned the name of that task and updates "percentageComplete" to reflect the progress being made. The user interface can then show a progress bar for any thread that it is waiting for.

 

VM optimisation idea: making a new process could be made effectively as expensive as a block invocation. When "[...] fork" is encountered, the VM doesn't actually create a new process but just flags the current call stack as having forked, and continues with execution in the forked block... if the forked block returns to a flagged thread, then execution continues with the parent process. When a context switch occurs, the VM will check if the current context is flagged for forking and if so, then actually makes the new process then. One possible bug is if a fork happens on an already flagged context - perhaps the flag needs to be an integer that gets incremented with each fork and decremented down to zero during the context switch? This optimisation would mean that forked processes would have very little overhead if it returns before the next context switch.

 

Parallel abstractions

 

 

Parallel programming is often considered as being difficult. One way of minimising bugs in parallel applications is to encapsulate and abstract parallel behaviour to contain bugs. A well-tested parallel abstraction will have less bugs than custom written code.

 

Trivially parallel

 

Often it is easiest just to write a parallel algorithm using forks and semaphores. I personally don't consider forking processes and synchronising on semaphores too low level to write parallel applications.

 

Semaphore can be used in many ways. A Mutex is a type of Semaphore that is already signalled; it can be used to protect critical sections of code. A Multex has been signalled n times, it can be used to make sure no more than n processes execute a critical section of code at the same time. Semaphores can be used to join processes .

 

Unrelatedly parallel / Embarrassingly parallel

 

Obviously, two unrelated sections of code that never share objects could run in parallel. A good example of this is a user interface that lets applications run in their own threads rather than block if one application blocks. Another example is initialisation of an application; two unrelated parts of an application could initialise at the same time if they share no state.

 

Parallel collections

 

Many of the methods on collections could be parallellised. For example, "someCollection do: [:each ...]" could easily fork off a thread for each element in that collection. There is a lot of scope for parallelisation here.

 

The main caveat to parallizing collections is to watch out for side effects. If a number of threads are performing an operation on a collection, then it is important that the block doesn't alter any external state in an unsafe way.

 

Using this abstraction, a parallel collection library could be built up. Compare Java's java.util.concurrent package.

 

Enumeration operations on a collection in Smalltalk include:

  • do:
  • collect:
  • select:
  • detect:
  • inject:into:
  • reject:

These could all have parallelised implementations made, except for >>inject:into: where each result depends on the previous.

 

Another operation could be added:

  • summerize:

Which takes a block as an argument and can make a summary of a collection. For example, to add up all numbers in an array of numbers:

 

sum := array summarize: [ :each :every | each + every ].

(noting of course that a fast implementation of Array>>sum already exists).

 

This does a very similar job to >>inject:into: except that it is run in parallel: firstly, the collection is paired up and the block is applied to each pair. Then the block is applied to pairs of results, and then again to pairs of results, until a final summary is made.

 

This would mean that the Smalltalk equivalent of Map-Reduce is Collect-Summarize:

 

result := collection collect: [ :each | ... ] summarize: [ :each :every | ... ].

 

This firstly applies collect: to the collection, and then applies summarize to the resulting elements. The implementation could be very concurrent.

 

Many other collection routines could be made concurrent. Here are a few more examples:

  • Collection>>+, -, /, *, sum, etc. These operations could also use available SIMD instructions as well as being multi-threaded.
  • SequenceableCollection>>copyFrom:to: and friends.
  • SequenceableCollection>>reverse
  • Collection>>asArray and other converting methods.
  • Collection>>addAll:
  • ... and so forth.

 

Protected data types

 

Rather than include a mutex with every object you intend to protect in your code, it is less effort to have a library of thread-safe classes. A protected integer, for example, would be very useful if it had such methods as "increment" or "waitUntilGreaterThan:". A protected data type would contain its own locks so that it's user doesn't have to have a separate lock for protecting it.

 

This would be very useful for collections.

 

Job management

 

A library could be written to manage "jobs" and job queues. Jobs can be submitted to a job queue, which are then processed by a limited set of threads. This could implement, e.g. scatter/gather type jobs.

 

There are different types of job queues that could be made - compare the queuing features in Java's java.util.concurrent package.

 

Smalltalk is awesome; jobs could be blocks. For example, if you have one or more worker threads, jobs could be easily submitted as blocks of code:

 

jobQueue add: [ some long running task ].

 

Or

 

j := Job doing: [ :progress | some long running task. ] description: 'Some long-running task'.

j whenCompleteDo: [ ... ].

j progress. "Returns the percentage done; this would be showen to the user. "

j suspend. j continue. j terminate.

j timeToComplete. "Returns a time span, automatically calculated using progress in the last few minutes"

j description. "Show something to the user."

 

This allows a UI widget to show the user why it is currently busy and how long it will take to finish.

 

The task could be eventually performed by a worker thread or set of worker threads. Compare EventQueue and SwingWorker in Java.

 

Futures

 

A future allows a program to continue executing while a variable's value is being calculated. They're described above somewhere.

 

Barrier

 

A barrier could be turned into a reusable object. A barrier makes sure that all threads have reached a particular rendevous point before continuing.

 

Queues

 

Various types of queues could be used. Synchronised queues are used in concurrent program for applications such as producer/consumer problems.

 

Synchronisation helper routines

 

Most of the complexity of parallel programming is involved with synchronising multiple threads. There is some scope for simplifying operations here.

 

Semaphores could have extra support for stuff mentioned in the Little Book of Semaphores (I can't remember what these are right now).

 

XXX I had a great idea for this, but I forgot it. It was something to do with a "[] runAfter: aBlock" methods.

 

You could have "[xxx] runWhen: [condition]" which will poll the condition but not run it until the condition evaluates to true.

 

Data flow algorithms

 

A streaming architecture could be made, where the program represents a production line. A bunch of Stream objects are plugged into each other, plugging outputs from some Stream objects into inputs in others. Buffers may or may not be used depending on whether they improve or inhibit performance. Each Stream object does some processing on the elements it receives, and then passes the output to the next Stream object.

 

There are many ways of parallelising this:

  • Each Stream object could have a single thread or set of worker threads which read a FIFO queue (which elements are deposited into from other Streams), perform processing and send output to other Streams. This limits the parallelism to the number of worker threads, and necessitates a context switch to read elements - if a queue is used, a lot of objects can be processed in the same context switch, but latency would be higher than other implementations.
  • Semaphores could be used for synchronisation. Threads would continue executing though the entire chain of Stream objects, one thread would be forked for each element that passes through the chain. This has lower latency and is likely more efficient. An object could also pass through the entire chain in one context switch.
  • New processes could be forked to handle incoming objects at every point in the chain. With a VM optimisation (where new processes are only created when process scheduling occurs), the cost of forking could be reduced to that of a method invocation.

 

This abstraction allows for use of SIMD-like parallelism. If using a FIFO queue, objects could be queued until there are enough objects to run a parallel instruction on all of them in one go. One example would be 3-D graphics transformations, multimedia instructions, or hand-optimised primitives.

 

Objects could also be passed between streams in efficient bulk operations.

 

SIMD Array operations

 

SIMD instructions on CPUs (e.g. Altivec, SSE, Cell SPE) could be used for Array bulk operations such as arithmetic with two arrays or scalars, searching, summarising methods (such as max, min, sum) and so forth. Operations such as matrix/vector operations and signal processing could also have parallel implementations.

 

Debugging Concurrent code

The bugs which can occur in concurrent code are:

  • Deadlocks. This happens when locks are acquired by two or more mutual threads.
  • Race conditions. I believe this is the result of not protecting shared state.

 

Finding deadlocks after they occur is easy: look at the threads which are waiting on a semaphore. Finding deadlocks before they occur is more difficult.

 

An example of a deadlock:

 

methodOne

    sema1 critical: [ sema2 critical: [ ... ] ].

 

methodTwo

     sema2 critical: [ sema1 critical: [ ... ] ]

 

[ self methodOne ] fork.

[ self methodTwo ] fork.

 

Locks should be aquired in a particular order. Alternatively, only aquire one lock at a time. TODO: what do locks look like in Smalltalk? Semaphores? Perhaps Semaphores could have an order that they must be aquired in?

 

Shared state should always be protected:

  • Avoid having transient state in objects which are shared between threads. For example, if an object requires multiple method calls to perform a particular action, these method calls should be in a critical section. Alternatively, give that object a block of code which it should execute atomically.
  • Try to keep your locks private in an object. This encapsulates any possible bugs.

 

Example:

 

increment

    a := a + 1.

 

This is not an atomic operation. Here, 1 is added to a, then the assignment occurs. In theory, another thread could read a, increment it, be suspended while another thread runs this method, and then overwrite that other method's result with the older value of a.

 

Parallel programming techniques

 

Parallel algorithms can get confusing. Here are some tips which can make parallel code easier to write:

  • Wherever possible, encapsulate the parallellism.
  • Keep locked regions of code as small as possible.
  • When suitable, use the fact that writing an object reference is an atomic operation. Rather than holding a lock, make a shallow copy of the object in question, make the changes, and then write the copy back to the original.
  • Learn how to use Semaphores well! They can do a whole bunch of tricks. See the Little Book of Semaphores: http://www.greenteapress.com/semaphores/
  • Keep side-effects minimal. For example, a forked block (aka Process) shouldn't change the state of unrelated objects unless it really has to.
  • Don't require that operations on your API need multiple method invocations (e.g. Canvas>>setColor:, followed by Canvas>>drawLine:).
  • When writing concurrent code, always keep in mind that multiple Processes may enter that method.
  • Aquire locks in a particular order. TODO: Is there a way we can automate this? For example, by assigning each lock a number, and locks need to be aquired from lowest to highest number or the Process will cause an error?
  • Use a code verifier to find concurrency bugs? Such a tool probably needs to first be written.

 

Tools

 

Deterministic VM

 

A deterministic VM would do context switches in exactly predictable locations. This would allow the programmer to repeat the conditions leading up to a bug in concurrent code. This could be implemented by:

  • Giving the scheduler a seed value (like a random number seed) which then makes that scheduler schedule threads in an exact order with context switches at exact bytecodes until some non-determinism creeps into the system.
  • Recording the exact locations of context switches and replaying them.

 

Pedantic VM

 

A Pedantic VM is one that will try as hard as it can to trigger concurrency bugs. This VM will schedule context switches at times that make it most likely to trigger bugs.

 

One good place to switch contexts is just before storing to variables. This is to encourage race conditions. The VM could check to see if the next bytecode is a store operation, and if so then cycle through all other possible processes before performing that store. This makes it much more likely to trigger bugs where code has made assumptions about the atomicity of operations: atomic operations are typically ones that read something, process it and store it.

 

The Pedantic VM could switch processes at any variable store to an instance variable. Method and block variable stores (assuming that closures are implemented) would not need to be stored if there is no way that the current process has been able to fork a new process (i.e. if there are any #fork message sends in the current method).

 

Another good place to switch contexts is just before the process signals a semaphore. This greatly increases the chances of triggering any deadlocks in the program: by making signals lazy, as many processes as possible are waiting on semaphores. This may conflict with trying to trigger race conditions above so perhaps the developer could choose between trying to set off deadlocks or triggering race conditions.

 

The scheduler of the Pedantic VM could make an initial run through the process table. After the initial cycle through all processes at that runlevel, every process should be either about to perform a write to a potentially shared variable, or signal a waiting process. The next process to run could be chosen by a deterministic random number generator that could be seeded with a particular number by the developer. This would make the conditions that lead to the bug reproducable and the behaviour deterministic. The scheduler should have an equal likelihood of rescheduling the currently running process as well to make sure bugs are not hidden like that.

 

TODO: is it possible to make bugs that this pedantic VM would never detect? Perhaps it's possible to indirectly cause a race condition or deadlock?

 

There are a large number of possible permutations of the order to schedule processes to run. Ideally, all of them would be tried, but it would be expensive to do all of them. Perhaps method metadata could be useful:

  • The number of threads running a particular method? If a method has a lot of different threads calling it, then it is a prime candidate for bug hunting and so the next process chosen would be one that historically has run that method.

 

ProcessorScheduler>>yield could be turned into a noop to make sure that no code relies on its particular behaviour.

 

Performance tuning

 

Reasons for poor performance in highly parallel environments:

  • Cache misses.
  • CPU pipeline not well used.
  • Waiting on other threads.
  • Waiting on network.
  • Waiting on disk (swapping).
  • Waiting on anything else.
  • Bad algorithms.
  • Garbage collection overhead.
  • Scheduler overhead or poor scheduler choices.
  • (On architectures with shared CPU components such as the T2000) resource contention, possibly because of poor scheduler choices.
  • (On replicated architectures e.g. DPON) Poor RepAlg object distribution and load balancing.

 

You need to be able to measure these for performance tuning.

 

How do you measure cache misses?

 

A VM such as Squeak could be modified to measure the amount of time a thread spends waiting.

 

 

Scheduling

 

A process has several scheduling requirements:

  • real_time: It may be real-time or not (e.g. UI tasks, reacting to network). These are usually I/O bound and spend a lot of time waiting.
  • batch: It may be a batch task; temporary CPU starvation is no problem.
  • A process may have a minimum amount of CPU cycles per second that it requires, e.g. a sound or video player. If it cannot achieve this, it cannot run on the current computer.
  • For testing purposes, a scheduler should also support a maximum number of CPU cycles per second to simulate slower computers or high-load machines.

 

Real-time tasks have an ordering by priority - some I/O activities are more important to service than others.

 

In a dominioned system, CPU usage is scheduled per dominion which in turn shares its allocation across the processes in itself. CPU usage would also be billed/accounted according to dominion. It should be difficult for one dominion to deny resources to another, and it should be impossible for low-priority dominions to deny resources to higher-priority dominions.

 

 

Making Squeak multi-threaded

 

If I were working on a concurrent VM, I'd just barge ahead and parallelise the Squeak VM:

 

* I'd refactor the VM so that multiple copies of the interpreter state could be held in the heap. I'd design it so that each pthread would run one instance of the standard Squeak interpreter, and so that each interpreter runs multiple green theads as it does now. This makes forking a very cheap operation and allows for having as many green threads as would fit in memory.

** Each interpreter has it's own scheduler.

** When a pthread goes idle, it can grab jobs off a busy scheduler.

* I'd then see what needs fixing in the VM / plug-ins / image to make it work. I'd probably start with the simple approach: an evil but simple global VM lock for memory writing operations, and no changes to GC. Later, we can revisit this to make it scale better.

* I'd then spend the next decade finding and fixing concurrency bugs.

 

The only operations I can think of off the top of my head that would have concurrency problems are:

* Creating new objects.

* GC.

* Some plug-ins.

 

If writing an instance variable to an object can be done in a single machine code instruction, that would not need and locking or synchronisation.

 

Parallel garbage collection would take a long time to implement correctly. I'd leave that and use a global VM lock.

 

Creating new objects can be made faster by using a separate Eden space for each thread. For simplicity though, a global VM lock would do the same job with less programmer effort.

 

What I'm actually going to do is:

 

* Modify the Squeak scheduler to simulate concurrency better and bring up lots of concurrency bugs in the image.

* Spend a decade fixing concurrency bugs in the image.

* Write enough concurrent code to make it worth having a concurrent VM.

* Sell that code, make a lot of money.

* Buy the rights to Gemstone or some other high-performance, concurrent VM.

* Port my code to that instead.

* Make a lot more money.

 

Currently, there isn't really enough concurrent code to make a parallel VM worthwhile.

 

 

Links

 

http://citeseer.ist.psu.edu/182287.html

 

Jecel's wiki:

http://www.merlintec.com:8080/software

http://www.merlintec.com:8080/hardware

http://www.merlintec.com:8080/hardware/26

 

The plurion:

http://www.merlintec.com/merlin6/e_main.html

 

 

D. Lea. Concurrent Programming in Java: Design Principles and Patterns. Addison-Wesley, Reading

MA, 1997.

 

D. C. Schmidt, M. Stal, H. Rohnert, and F. Buschmann. Pattern-Oriented Software Architecture -

Patterns for Concurrent and Networked Objects. Wiley, 2000.

 

Reading material from Jason Johnson about alternative concurrency architectures:

http://research.microsoft.com/~simonpj/papers/stm/#beautiful

(Simon-Peton-Jones, Beautiful concurrency)

http://research.microsoft.com/~simonpj/papers/stm/#composable (SPJ and

others, Composable memory transactions)

http://www.drjava.de/e-presentation/html-english/img0.html (Stefan

Reich, Escape from multithread hell)

http://Erights.org  (various regarding the E language and the futures

technology)

http://armstrongonsoftware.blogspot.com/2006/09/why-i-dont-like-shared-memory.html

 (Joe Armstrong - explanations why shared memory is bad.  His blog

contains various examples of how things are done in Erlang in the

Actor model)

 

http://www.eros-os.org/pipermail/e-lang/2001-July/005410.html

 

Comments (0)

You don't have permission to comment on this page.