Friday, March 25, 2011

Distributed and Multithreaded Applications

  A problem that maps very well to multiple cores and is easily parallelized is the computation of the Mandelbrot set.
  The following graph is the result of an experiment where I varied the number of cores used to render each frame of a deep zoom to the limit of ''double'' floating point precision. When I run this algorithm as a traditionally single threaded application it takes up to 800 seconds to render a 1024x1024 grid from 1.0 to 1 x 10^-16. However when I start adding threads I see the best speedup when I use the same number of threads as there are hard processors (non-hyperthreaded). The performance increase nears it's maximum 8 times increase for an Intel Corei7-920 when I approach a thread/line of 512 threads.

  As you can see from the graph, we benefit more from a massive number of threads - as long as they are independent. The Mandelbrot calculation however it not homogeneous - computing the central set requires a lot more iteration than outlying areas. This is why each parallel algorithm must be fine tuned to the problem it is solving. If you look at the screen captures of performance during the runs with various thread counts you will see what I mean. The processor is not being exercised at it's maximum capacity when the ''bands'' assigned to particular threads are finished before other threads that are performing more calculations than their peers. If we increase the number of bands - we distribute the unbalanced load among the cores more evenly - at a slight expense of thread coordination/creation/destruction.

Multicore Rendering of Mandelbrot Set
  The following runs are on a 1024x1024 grid starting form 1.0 to 0.0000000000000001 that take from 800 to 67 seconds depending on the number of threads used concurrently. Notice that I have a temporary issue with shared variable access between threads - as some of the pixel coloring is off.
  As you can see - the processor usage goes from 12% for a single core, through 50% for 8 cores - to 99% for 128+ cores. (we need to leave some CPU cycles to the system so our mouse functions)
  Why do we need so many threads? If even one thread takes longer than any other ones that are already completed their work unit - the entire computation is held up. We therefore use more work units than there are threads.
  A better algorithm would be to distribute work units asynchronously instead in the current MapReduce synchronous way we currently use. When a thread is finished, it can work on part of the image that is still waiting processing. We would need to distribute work units more like packets in this case.

  1 thread on an 8-core i7-920 takes 778 sec

  2 threads on an 8-core i7-920 takes 466 sec

 16 threads on an 8-core i7-920 takes 138 sec

  128 threads on an 8-core i7-920 takes 114 sec

Thread Contention for Shared Resources:
  For our multithreaded Mandelbrot application - which currently is not @ThreadSafe - we encounter resource contention specific to the Graphics context. This type of contention is the same for any shared resource such as a database. The issue is that setting a pixel on the screen is not an atomic operation - it consists of setting the current color and then drawing the pixel (The Java2D API may require multiple internal rendering steps as well). The result of this is that another thread may change the color of the graphics context before the current thread actually writes the pixel - resulting in noise - or more accurately - '''Data Corruption'''.

  Note: that no noise or data corruption occurs when we run a single thread. We only get a problem when we run multiple threads concurrently.
color = Mandelbrot.getCurrentColors().get(iterations);
 color2 = color;

 // these 2 lines need to be executed atomically - however we do not control the shared graphics context
 synchronized (color) { // this does not help us with drawRect()
    // drawRect is not atomic, the color of the context may change before the pixel is written by another thread
 if(color2 != mandelbrotManager.getgContext().getColor()) {
    System.out.println("_Thread contention: color was changed mid-function: (thread,x,y) " + threadIndex + "," + x + "," + y);
    // The solution may be to rewrite the pixel until the color is no longer modified or only allow a host thread to write to the GUI

_Thread contention: color was changed mid-function: (thread,x,y) 2,298,22
_Thread contention: color was changed mid-function: (thread,x,y) 15,140,155
_Thread contention: color was changed mid-function: (thread,x,y) 15,140,156
_Thread contention: color was changed mid-function: (thread,x,y) 15,140,157
_Thread contention: color was changed mid-function: (thread,x,y) 15,141,151
_Thread contention: color was changed mid-function: (thread,x,y) 2,307,25
_Thread contention: color was changed mid-function: (thread,x,y) 15,143,154
_Thread contention: color was changed mid-function: (thread,x,y) 15,144,152
_Thread contention: color was changed mid-function: (thread,x,y) 13,0,130
_Thread contention: color was changed mid-function: (thread,x,y) 11,0,110

  The better solution would be designate a host thread that coordinates all the unit of work threads and acts as a single proxy to the GUI - only one thread should update AWT or Swing UI elements - as most of them are not thread safe by design. Multithreaded distributed applications need to be very careful when using GUI elements. For example if I do not introduce at least a 1ms sleep between GUI frames - the entire machine may lock up when 100% of the CPU is given to the calculating threads.