
Performance hit with Clang/GCD/Blocks

twood (Registered Member, Posts: 17, Karma: 0)
Dear Eigen forum,
I've just ported some projects to use Eigen for the matrix maths, and I'm very impressed with the speed increase. These projects process the pixels of some images independently, so I've made the code parallel using GCD/Blocks (I'm on a Mac). A bit bizarrely, in the most complicated of these projects, processing takes about 0.5 seconds per pixel with a simple for loop, but after making the loop parallel (by converting the loop body to a block and using dispatch_apply), the time per pixel roughly doubles to about 1 second. The block simply sets up some data for the pixel and then calls a processing function. This speed hit does not seem to happen with the other projects.
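For context, a minimal sketch of that conversion (processPixel and nPixels are placeholder names, not my actual code):

```cpp
#include <dispatch/dispatch.h>
#include <cstddef>

void processPixel(size_t i);   // stands in for the real per-pixel work

void processAllPixels(size_t nPixels) {
    // Serial version:
    //     for (size_t i = 0; i < nPixels; ++i) processPixel(i);

    // Parallel version: dispatch_apply runs one block per pixel on a
    // concurrent global queue and returns when all of them have finished.
    dispatch_queue_t queue =
        dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_apply(nPixels, queue, ^(size_t i) {
        processPixel(i);
    });
}
```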

This is still faster overall, as I have 8 cores on this machine, but I'm a bit mystified as to where this performance hit is coming from. The processing is pretty complicated, so I can't really post all of my code, but I'm using the following capabilities of Eigen:
.setRandom() on a 1000 by 7 matrix.
2x2 fixed-size matrix multiplies, PartialPivLU solve, and the matrix exponential.
6x6 fixed-size matrix multiplies, PartialPivLU solve, and the matrix exponential.
Would there be any reason for any of these to be slower in a threaded environment? Some hidden mutex/lock that I'm unaware of?

I've used Apple's instruments profiler and this indicates that almost all (over 80%) of the run-time is spent inside my processing function. This amount is roughly the same for both parallel and non-parallel versions of my code, so I don't think the overhead involved in launching the threads is to blame.

Does anyone else have any experience of using Eigen with GCD?
ggael (Moderator, Posts: 3447, Karma: 19)
Perhaps setRandom() is to blame: it uses the std::rand() function, which might be slow in multi-threaded applications. You can easily check this by precomputing a single random matrix and making all threads simply copy from it. Then you can use your own random generator through the unaryExpr(...) functor (see the sketch below). Other tips:
- the big 1000x7 matrix should have Dynamic sizes, not fixed sizes, because it is not good to have such a big object allocated on the stack.
- check in your profiler that there is no dynamic memory allocation (well, not much, because the previous matrix should be on the heap).
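For instance, something along these lines (untested; the C++11 <random> engine is just an example standing in for "your own generator"):

```cpp
#include <Eigen/Dense>
#include <random>

// Check 1: precompute one random matrix and let every thread copy it,
// so that no thread calls std::rand() at all.
const Eigen::MatrixXd sharedRandom = Eigen::MatrixXd::Random(1000, 7);
// inside each thread:  Eigen::MatrixXd samples = sharedRandom;

// Check 2: fill the matrix through unaryExpr() with your own generator
// instead of std::rand().
Eigen::MatrixXd ownRandom(std::mt19937& gen) {
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    return Eigen::MatrixXd::Zero(1000, 7)
        .unaryExpr([&](double) { return dist(gen); });
}
```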
twood (Registered Member, Posts: 17, Karma: 0)
Thanks ggael. I too suspected the random number generator. However, from the research I did, this should not be the case on Mac, as rand() is only thread-safe on Linux. While the lack of locking means it should be speedy, it also means that you might end up with the same random numbers in different threads. Because of this, I ended up using the double-precision Mersenne Twister from http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/, which is thread-safe (the generating functions take a pointer to a state structure that I initialise inside each thread). This ended up making pretty much no difference to the speed. I'm now leaning towards this being a Hyper-Threading issue on my laptop, but I don't have anything more than a gut feeling to base this on, and it doesn't look like you can turn HT off on OS X Lion and above.
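Roughly what I'm doing, with std::mt19937_64 standing in for the SFMT generator and the per-thread seeding scheme simplified:

```cpp
#include <dispatch/dispatch.h>
#include <random>
#include <cstdint>
#include <cstddef>

void processAllPixels(size_t nPixels, uint64_t seed) {
    dispatch_queue_t queue =
        dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_apply(nPixels, queue, ^(size_t i) {
        // One generator state per iteration: nothing shared, no lock,
        // and a distinct seed so threads don't repeat each other.
        std::mt19937_64 gen(seed + i);
        std::uniform_real_distribution<double> dist(-1.0, 1.0);
        // ... set up the pixel's data using dist(gen) and process it ...
    });
}
```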

The big 1000x7 matrix is dynamic, not fixed-size (declared as MatrixXd samples(nS, nP), where nS and nP are determined at run-time).

I have the opportunity to test this out on a Linux box with 24 cores. Once I've done so I'll report back.
twood (Registered Member, Posts: 17, Karma: 0)
Yes, this is looking like a Hyper-Threading issue. I added a semaphore to my code so I can manually limit the number of threads that are active at any one time (sketched at the end of this post). Running on my 8-core machine, the per-pixel processing times are:
Single thread (normal for loop): 1.77 s
Basic GCD (dispatch_apply): 3.83 s
GCD with semaphore: 2.1 s

This doesn't surprise me that much. The processing is very, very CPU bound. If I'm reading Apple's profiler correctly, well over 90% of the runtime is spent inside Eigen functions, particularly the PartialPivLU solver. And unfortunately my matrices have no special properties, so I can't use the faster decompositions!
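In case it's useful to anyone, the throttling looks roughly like this (maxConcurrent and processPixel are placeholders):

```cpp
#include <dispatch/dispatch.h>
#include <cstddef>

void processPixel(size_t i);   // the CPU-bound per-pixel work

void processAllPixels(size_t nPixels, long maxConcurrent) {
    // Counting semaphore initialised to the number of blocks allowed
    // to run at once, e.g. the number of physical (not logical) cores.
    dispatch_semaphore_t sem = dispatch_semaphore_create(maxConcurrent);
    dispatch_queue_t queue =
        dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_apply(nPixels, queue, ^(size_t i) {
        dispatch_semaphore_wait(sem, DISPATCH_TIME_FOREVER);
        processPixel(i);
        dispatch_semaphore_signal(sem);
    });
    dispatch_release(sem);   // needed without ARC
}
```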
ggael (Moderator, Posts: 3447, Karma: 19)
For the 2x2 case, don't use PartialPivLU but compute the explicit inverse with .inverse().
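That is, roughly:

```cpp
#include <Eigen/Dense>

// For tiny fixed sizes Eigen computes .inverse() by cofactors
// (specialised up to 4x4), which is cheaper than an LU factorisation.
Eigen::Vector2d solve2x2(const Eigen::Matrix2d& A, const Eigen::Vector2d& b) {
    return A.inverse() * b;
    // instead of: Eigen::PartialPivLU<Eigen::Matrix2d>(A).solve(b)
}
```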

