Parallel map: more JRuby concurrency mischief

Like my last post this is more for my future benefit but if anybody else finds it useful then that's cool too. Unlike the last one the fruits of my tinkering yielded a nice linear speedup.

Ok, let's parallelize Array#map. We'll break down the task as follows:

  1. Split the array into chunks
  2. Execute the chunks in asynchronously, in parallel, waiting for them all to complete
  3. Merge the chunks into a new array and return it

How many chunks is optimal? There's no definitive answer; In the past I've opted for a very large number of small sub-arrays, e.g. for concurrent divide & conquer reductions where the minimal array length was some low power of the number of processors (I've played with associative reduction algorithms in the past). For our #pmap method I'm just going to split the original array up into as many chunks as there are logical cores on my machine. How do you find that out in JRuby? Java to the rescue again:

Now we need a pool of workers to assign tasks to. As parallel mapping is strictly CPU bound, a thread pool with fixed thread count but an unbounded work queue is probably most appropriate:

That's our ExecutorService up & running, we just need to do a bit of housekeeping before we can write our Array#pmap method. This is where Java's baroque-complexity boilerplate rears its ugly head (wouldn't it be nice if Executors could take lambdas as arguments for mass invocation?). Basically we implement the Callable interface- I instantiate my Task implementation with a block which the executor calls when it executes:

So now we're good to go. Array#pmap here takes an executor as a first argument because I didn't want the class to be responsible for starting / shutting down the work queue, but that's just an implementation decision.

Once the original array is mapped to a set of Task classes, they can be handed to the executor. I call ExecutorService#invokeAll because it blocks until all the submitted work is done. It returns an array of FutureTasks which can be dereferenced immediately (we want the method to return something!):

So does all that tomfoolery actually buy you any more performance? Time for a highly unscientific benchmark, incrementing an array of the first million Fixnums:

Not bad- I ran it on a Sandy Bridge i5 with 2 cores and 4 CPU threads.

More important than a cheesy parallel mapping imeplementation, I've learned (or is that re-learned?) two axioms about playing with concurrency in JVM hosted languages:

  1. As in the previous post, executors expect some kind of anonymous inner class as an argument in the absence of closures. You need to be aware of the cost of converting a ruby closure to a Java Runnable, Callable, Future etc; think of I/O-bound problems where allocating extra objects for each request/event is almost certainly not a good thing. You'll definitly save time writing in straight Java as you won't have to worry whether or not the closure you just used is going to throw all your performance gains out the window.
  2. Don't discount the JVM's ability to make a fool of your benchmarks. The one above isn't worth the screen real-estate it occupies due to its simplicity and duration relative to startup / shutdown time of the VM. Both JIT and Server modes take time to hit a 'quiescent state' where performance stabilises. Don't just look at wall-time, profile your code properly. Lots has been said about this elsewhere so I won't rehash what others have articulated better, but be aware HotSpot has its own performance quirks.

The example in its entirety is here.

Filed under  //   concurrency   jruby   ruby  

Comments [0]

Filed under  //   concurrency   stm  

Comments [0]

Installing Grand Central Dispatch on Linux

I've been curious about getting libdispatch and the blocks runtime compiling in Ubuntu for some time but naively guessed that it wouldn't be for the faint-hearted since the only useful result yielded by Google was this thread on SO. How wrong I was!

For this exercise I pulled the iso for natty server from the ubuntu site and did a fresh install in a VirtualBox VM. The only extra I added during install was an SSH daemon so I could use the terminal on my Mac.

Libdispatch needs llvm/clang, libkqueue and the blocks runtime which are already available through apt-get in natty so let's install them.

You'll also need libpthread-workqueue0. I had to download the .deb packages from oneiric here but they installed without any hassle:

I compiled libdispatch itself from source. Don't grab it from MacOSforge, download the tarball used to make the .deb package for oneiric. The installation will also need make, autoconf, autogen and libtool. Just to save yourself any hassle trying to solve missing header issues, install the build-essential package and gcc-multilib while you're at it:

You should be set to compile libdispatch now. There should already be a configure file in the 'libdispatch' root folder so let's install:

Make sure you force clang as the compiler, not gcc. The end of the install message should tell you where the libs were installed, for me it was /usr/local/lib. Ubuntu rather stupidly ignores erases the LD_LIBRARY_PATH variable if you set it through the shell, so add a file in /etc/ld.so.conf.d/* pointing to this location if you don't have one already.

You should have Grand Central set up by now. Let's write a hello world programme and see if it works:

And to compile:

And if all is well you should get the message printed to the screen. I haven't tried anything computationally intensive yet and I wouldn't be surprised if there's a performance discrepancy between FreeBSD / OS X, since libkqueue is just a wrapper around epoll(). But hello world works, and that's half the battle, right?

 

Filed under  //   concurrency   libdispatch   linux  

Comments [0]

CMFunctionalAdditions- multicore ruby-like utilities for Objective-C

Hot on the heels of the last post, I've decided say a bit more about CMFunctionalAdditions. As I mentioned last time the project is the result of functional and syntatic-sugar widthdrawl symptoms. In a nutshell, CMFunctionalAdditions is my take on being able to call the following in Objective-C without altering the receiver:

  • Map
  • Map with index (each_with_index.map...)
  • Reduce (inject, fold, whatever)
  • Filter (select)
  • Remove (reject, delete, whatever)
  • Partition (the ruby flavour, not the clojure flavour)
  • Unique
  • Flatten
  • Take (take_while)
  • tbc

Additionally many of the linear-time methods above run in O(n ∕ P) time or better with Grand Central Dispatch. I want concurrency to be transparent to the end user wherever possible. A picture speaks a thousand words so to rip the examples from the CLI demo...

The framework currently requires Snow Leopard or better but Lion might become a prerequisite (see last post). iOS 4 should work but I haven't tried it. I'm also attempting compilation on Linux as this is being written. As the only dependencies are Foundation and libdispatch (compiled with llvm) it'd be a nice bonus.

Proper documentation is in the pipeline. Any feedback would be appreciated.

Filed under  //   concurrency   libdispatch   objc  

Comments [0]

Parallel reduction in Objective-C

Ruby and Clojure have spoilt me and coming back to Objective-C has starved me of a heap of nice ways to manipulate collections.

Born out of this frustration and a drive to use GCD and blocks for something useful, I've started creating my own framework of categories to add into NSArray, NSDictionary and who knows what else right now (that's for another post).

After toiling for an afternoon and realising that you can now create explicitly concurrent dispatch queues in Lion, I managed to work out a divide & conquer reduce algorithm for associative functions:

I mentioned OS X Lion; specifically look at line 18:

Since we assume the user has provided an associative function, we don't care about the order in which our recursive calls return. This means we can use the shiny new DISPATCH_QUEUE_CONCURRENT constant in 10.7 to send work off in a parallel manner (thanks for the tip, Kazuki).

For this to work you need a means of splitting the array into sub-arrays of ever decreasing size, until they hit a limit where it's practical to reduce them serially:

Right now I'm naïvely assuming that the smallest unit of work is when a sub-array satisfies the inequality:

l ≤ n ∕ P² ; {l, n, P} ⊂ ℤ ∖ {0}

Where l the base case, n is the length of original NSArray to be conqured and P is the number of processors.

Given the fact that n will typically be vastly greater than P on most personal computers and mobile devices you'd intuitively think that mandating a relatively large base case is the sensible thing to do- the smaller the job, the more time is spent recursively splitting arrays and dispatching jobs rather than doing any real work.

I'm purposefully omitting any benchmarking because you know what Disraeli said, but from my somewhat confined set of  experiments I'm seeing a ≃1.4x increase in execution speed of associative functions. This isn't production quality and won't be making its way into CMFunctionalAdditions but it illustrates the potential of higher-order functions and Grand Central Dispatch in Objective-C.

Filed under  //   algorithms   concurrency   libdispatch   objc  

Comments [0]