Registered Member
|
Greetings,
We developed an application that uses Eigen and we are now attempting to parallelize it. There are cases where we have multiple threads using Eigen to perform operations that are independent of each other, so we expect a speedup close to linear in the number of cores. And yet, we are unable to achieve such a speedup. Here is an MWE:
We ran this program on a machine with the following specs:
- Two Intel Xeon E5-2630 v2 @ 2.60GHz processors (each with 6 cores and 12 threads)
- 256GB of memory

...and got the following result:
Serial: 60s
Parallel (24 threads): 28s

While the operations performed are independent, the speedup we get from running on 24 threads is only about 2. On other programs that do not use Eigen, we get speedups of 12, as expected. The documentation ( http://eigen.tuxfamily.org/dox/TopicMultiThreading.html ) doesn't appear to have more details on the matter. We would appreciate it if someone could shed some light on this and possibly suggest a solution.

Best regards,
-Pedro Moreira |
Registered Member
|
I'm only a beginner Eigen user, but I think that Eigen's random function uses the standard C random function.
The C random function tends to use a locking mechanism (I think it's because of determinism). Therefore, in my opinion, you didn't get optimal performance because the random function locks. A possible solution is to use a custom random generator that does not lock. |
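If the lock in the C library generator is indeed the bottleneck (an assumption, since the MWE is not shown here), one way to avoid it is to give every task its own generator. The sketch below uses only the C++ standard library; `fill_random` and `make_random_blocks` are hypothetical names, and plain `std::vector<double>` buffers stand in for the Eigen matrices. The OpenMP pragma is ignored when compiling without OpenMP support, in which case the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper (not part of Eigen): fill `out` with uniform values
// in [-1, 1) drawn from a generator owned by the caller, so that no
// generator state is shared between threads.
void fill_random(std::vector<double>& out, std::mt19937_64& gen) {
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    for (double& x : out) x = dist(gen);
}

// Hypothetical driver: every task builds its own engine, seeded from
// (base_seed, task index), so results do not depend on thread scheduling.
std::vector<std::vector<double>> make_random_blocks(std::size_t n_tasks,
                                                    std::size_t block_size,
                                                    std::uint64_t base_seed) {
    assert(block_size > 0);
    std::vector<std::vector<double>> blocks(n_tasks,
                                            std::vector<double>(block_size));
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(n_tasks); ++i) {
        // seed_seq mixes the base seed with the task index.
        std::seed_seq seq{base_seed, static_cast<std::uint64_t>(i)};
        std::mt19937_64 gen(seq);  // local to this iteration, never shared
        fill_random(blocks[i], gen);
    }
    return blocks;
}
```

The generated values could then be copied into an Eigen matrix (for example through `Eigen::Map` over the buffer) instead of calling the built-in random initializer inside the parallel region.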
Registered Member
|
Also note that you have many memory allocations going on.
Reusing the allocations for b and r with the OpenMP firstprivate clause could help a lot. |
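A minimal sketch of the firstprivate idea, assuming the loop body overwrites two scratch buffers on every iteration. The original MWE is not shown here, so `std::vector<double>` stands in for the Eigen objects `b` and `r`, and `sum_of_scaled_rows` is a hypothetical workload. The buffers are allocated once before the loop; `firstprivate` then copy-constructs one private copy per thread, and the iterations reuse that copy instead of allocating again.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical workload standing in for the per-iteration Eigen work.
double sum_of_scaled_rows(std::size_t n_iters, std::size_t n) {
    assert(n > 0);
    std::vector<double> b(n);  // scratch buffers, allocated once up front
    std::vector<double> r(n);
    double total = 0.0;

    // firstprivate(b, r): each thread receives its own copy of b and r,
    // made once when the thread enters the region. The loop body only
    // overwrites existing storage, so no allocation happens per iteration.
    #pragma omp parallel for firstprivate(b, r) reduction(+ : total)
    for (long it = 0; it < static_cast<long>(n_iters); ++it) {
        for (std::size_t k = 0; k < n; ++k) b[k] = static_cast<double>(it + 1);
        for (std::size_t k = 0; k < n; ++k) r[k] = 2.0 * b[k];
        total += r[0];  // contributes 2 * (it + 1)
    }
    return total;
}
```

With Eigen types the same pattern applies as long as the loop body assigns into the existing objects without changing their size, since a resize would trigger a new allocation.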
Registered Member
|
My system: Xeon E5-2609 v2 @ 2.50 GHz (4 cores, 4 threads), 16 GB RAM
And here is what I get:
total = 97421; parallel time = 1.02181 (the number of threads could be only 3 here, I didn't check it)
total = 98817; serial time = 3.45789
I don't think you need to enable internal multi-threading in Eigen when using OpenMP. I am wondering how it could take 30s on your system. |
Moderator
|
Your machine has only 12 physical cores, so tell OpenMP to use only 12 threads. The reason is that matrix-matrix products are highly optimized and keep the ALUs 100% busy, so using hyperthreading is counterproductive.
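A sketch of how the thread count could be capped from code, assuming the number of physical cores is known and passed in by hand (OpenMP itself only reports logical processors through `omp_get_num_procs()`). The helper names `worker_count` and `run_with_physical_cores` are hypothetical. Setting the environment variable `OMP_NUM_THREADS=12` before launching the program is the alternative that needs no code change.

```cpp
#include <algorithm>
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: never ask for more workers than physical cores,
// and never more than the logical CPUs the runtime reports.
int worker_count(int physical_cores, int logical_cpus) {
    assert(physical_cores > 0 && logical_cpus > 0);
    return std::min(physical_cores, logical_cpus);
}

// Opens a parallel region limited to the chosen count and returns the
// number of threads the runtime actually provided (1 without OpenMP).
int run_with_physical_cores(int physical_cores) {
    int used = 1;
#ifdef _OPENMP
    const int n = worker_count(physical_cores, omp_get_num_procs());
    #pragma omp parallel num_threads(n)
    {
        // Only one thread writes the result, so there is no data race.
        #pragma omp single
        used = omp_get_num_threads();
    }
#else
    (void)physical_cores;  // serial build: the argument is unused
#endif
    return used;
}
```

On the machine described above this would be called as `run_with_physical_cores(12)`, with the real work placed inside the parallel region.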
|