Getting speedup in a multithreaded application


pedromoreira (Registered Member)
Tue Aug 18, 2015 3:53 pm

Greetings,

We developed an application that uses Eigen, and we are now attempting to parallelize it. In some cases multiple threads use Eigen to perform operations that are independent of each other, so we expect a speedup close to linear in the number of cores. Yet we are unable to achieve such a speedup. Here is an MWE:

Code:
    #include <iostream>
    #include <omp.h>
    #include <Eigen/Dense>
    #include <Eigen/Core>

    using namespace Eigen;
    using namespace std;

    #define SIZE 30
    #define ITERATIONS 100000

    int main()
    {
        omp_set_num_threads(24);   // one OpenMP thread per hardware thread
        Eigen::setNbThreads(1);    // disable Eigen's internal multithreading
        Eigen::initParallel();

        MatrixXd a = MatrixXd::Random(SIZE,SIZE);
        double total = 0.0;        // double, so r.sum() is not truncated

        // Each iteration is independent; partial sums are combined by the reduction.
        #pragma omp parallel for reduction(+ : total)
        for(unsigned int i = 0; i < ITERATIONS; ++i)
        {
            MatrixXd b = MatrixXd::Random(SIZE,SIZE);
            MatrixXd r = b * a * b;
            total += r.sum();
        }
        cout << total << endl;
    }

We ran this program on a machine with the following specs:
- Two Intel Xeon E5-2630 v2 @ 2.60GHz processors (each with 6 cores and 12 threads)
- 256GB of memory

and got the following results:
Serial: 60s
Parallel (24 threads): 28s

Although the operations are independent, the speedup from running on 24 threads is only about 2. On other programs that do not use Eigen, we get speedups of 12, as expected.

The documentation ( http://eigen.tuxfamily.org/dox/TopicMultiThreading.html ) doesn't appear to have more details on the matter. We would appreciate it if someone could shed some light on this and possibly suggest a solution.

Best regards,
-Pedro Moreira

Tal (Registered Member)
Wed Aug 19, 2015 2:57 pm

I'm only a beginner with Eigen, but I think Eigen's Random function uses the standard C random function. The C random function tends to use a locking mechanism (I think for the sake of determinism). Therefore, in my opinion, you didn't get optimal performance because the random calls lock. A possible solution is to use a custom random generator that does not lock.
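
As a rough illustration of the "custom random generator that does not lock" idea, here is a minimal sketch that gives every thread its own generator. To stay self-contained it fills a plain std::vector rather than an Eigen matrix, and the helper names (thread_local_uniform, random_buffer, parallel_random_sum) are made up for this example; they are not part of Eigen. Whether the random calls are really the bottleneck in the MWE is an assumption that would need profiling.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not an Eigen API): returns a uniform value in [-1, 1).
// Each thread owns its generator, so calls never contend on shared state.
inline double thread_local_uniform() {
    thread_local std::mt19937_64 gen{std::random_device{}()};
    thread_local std::uniform_real_distribution<double> dist(-1.0, 1.0);
    return dist(gen);
}

// Fills a buffer of n values; stands in for MatrixXd::Random(SIZE, SIZE).
inline std::vector<double> random_buffer(std::size_t n) {
    std::vector<double> values(n);
    for (double& v : values) v = thread_local_uniform();
    return values;
}

// Sums `iterations` independent random buffers, in parallel when the
// compiler has OpenMP enabled (the pragma is ignored otherwise).
inline double parallel_random_sum(int iterations, std::size_t n) {
    double total = 0.0;
    #pragma omp parallel for reduction(+ : total)
    for (int i = 0; i < iterations; ++i) {
        for (double v : random_buffer(n)) total += v;
    }
    return total;
}
```

With Eigen, a generator like this could presumably be plugged in through MatrixXd::NullaryExpr instead of MatrixXd::Random, though that variant is not tested here.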

twithaar (Registered Member)
Wed Aug 19, 2015 4:29 pm

Also note that you have many memory allocations going on. Reusing the allocations for b and r with the OpenMP firstprivate clause could help a lot.
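
The firstprivate suggestion can be sketched as follows. This is only an illustration of the clause, using a std::vector scratch buffer in place of the Eigen matrices b and r; the function name and the loop body are invented for the example. firstprivate copy-constructs one private copy of the buffer per thread, so the allocation happens once per thread instead of once per iteration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the loop in the MWE: every iteration overwrites
// a scratch buffer in place and adds its contents to the running total.
inline double accumulate_with_scratch(int iterations, std::size_t n) {
    std::vector<double> scratch(n, 0.0);  // allocated once, before the loop
    double total = 0.0;

    // firstprivate gives each thread its own copy of `scratch`, which is
    // then reused across all iterations that thread executes.
    #pragma omp parallel for firstprivate(scratch) reduction(+ : total)
    for (int i = 0; i < iterations; ++i) {
        for (std::size_t k = 0; k < scratch.size(); ++k) {
            scratch[k] = static_cast<double>(i) + static_cast<double>(k);
        }
        for (double v : scratch) total += v;
    }
    return total;
}
```

The same clause should work for MatrixXd objects declared before the loop, since it relies only on the copy constructor, but that is an assumption rather than something measured on the MWE.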

tienhung (Registered Member)
Wed Aug 19, 2015 6:04 pm

My system: Xeon E5-2609 v2 @2.50 GHz (4 cores, 4 threads), 16 GB RAM

Code:
    #include <cstdio>
    #include <iostream>
    #include "Eigen/Dense"
    #include "Eigen/Core"
    #include "bench/BenchTimer.h"

    #define SIZE 30
    #define ITERATIONS 100000

    int main()
    {
        Eigen::MatrixXd a = Eigen::MatrixXd::Random(SIZE, SIZE);
        Eigen::BenchTimer timer;

        // Parallel run
        timer.start();
        auto total = 0;  // deduces int: each r.sum() is truncated on accumulation
        #pragma omp parallel for reduction(+ : total)
        for (int i = 0; i < ITERATIONS; ++i)
        {
            Eigen::MatrixXd b = Eigen::MatrixXd::Random(SIZE, SIZE);
            Eigen::MatrixXd r = b * a * b;
            total += r.sum();
        }
        timer.stop();
        std::cout << "total = " << total << "; parallel time = " << timer.value() << std::endl;

        // Serial run
        timer.start();
        total = 0;
        for (int i = 0; i < ITERATIONS; ++i)
        {
            Eigen::MatrixXd b = Eigen::MatrixXd::Random(SIZE, SIZE);
            Eigen::MatrixXd r = b * a * b;
            total += r.sum();
        }
        timer.stop();
        std::cout << "total = " << total << "; serial time = " << timer.value() << std::endl;

        getchar();  // keep the console window open
        return 0;
    }

And here is what I get:
total = 97421; parallel time = 1.02181 (the number of threads could be only 3 here, I didn't check it)
total = 98817; serial time = 3.45789

I don't think you need to enable internal multi-threading in Eigen when using OpenMP. I wonder how it could take 30s on your system.

ggael (Moderator)
Tue Sep 01, 2015 7:47 am

Your machine has only 12 physical cores, so tell OpenMP to use only 12 threads. The reason is that matrix-matrix products are highly optimized and occupy 100% of the ALU, so using hyperthreading is counterproductive.
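
One way to cap the thread count, shown here as a small sketch, is the num_threads clause on the parallel region. The value 12 is an assumption taken from the machine described in the first post, and the function name is made up for the example; the loop body is a placeholder for the matrix products.

```cpp
// Assumption: 12 physical cores, as on the dual Xeon E5-2630 v2 machine
// from the first post. Adjust for other hardware.
constexpr int kPhysicalCores = 12;

// Hypothetical placeholder workload: counts iterations using at most
// `threads` OpenMP threads (the pragma is ignored without OpenMP).
inline long parallel_count(int iterations, int threads = kPhysicalCores) {
    long done = 0;
    #pragma omp parallel for num_threads(threads) reduction(+ : done)
    for (int i = 0; i < iterations; ++i) {
        done += 1;  // the real loop would do the b * a * b product here
    }
    return done;
}
```

Calling omp_set_num_threads(12) in the MWE, or setting the OMP_NUM_THREADS environment variable to 12 before launching the program, should have the same effect without touching the loop.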