![]() Registered Member ![]()
|
Hi,
I am trying to enable the parallel functionality of the Eigen (version 3.3.4) in a nested OpenMP parallel region. Here is the pseudocode together with the numerical result: ============================== // Consider the LU decomposition of a n-by-n matrix, say M, where n=5000 (then, M requires 288MB of memory). MatrixXd M = MatrixXd::Random(n, n); // First, I performed the LU decomposition of the matrix M with a single thread. omp_set_num_threads(1); LU_with_Eigen(M); // Note that the matrix M is copied to this function, and thus not destroyed. The result was 2.37 second on my PC with 8 physical cores. // Next, I used 4 threads. omp_set_num_threads(4); LU_with_Eigen(M); The result was 1.21 second. The scalability is not so good, but we can see that the Eigen actually parallelised the decomposition. // Then, I tested if the Eigen was parallelised or not in an OpenMP parallel region. omp_set_nested(true): // Enable the nesting of OpenMP. omp_set_num_threads(2); // Use two threads (say TH0 and TH1) in the next parallel region. #pragma omp parallel { // Let each thread use 4 threads; the total number of threads is 8 and less than the number of CPU cores. omp_set_num_threads(4); LU_with_Eigen(M); // Unfortunately the result was 2.91 sec for TH0 and 2.92 secfor TH1, although I had expected to obtain the same timing result as the previous one, that is, 1.21 sec. } ============================== In this test, I used the compiler of g++ 6.3.1 (on Fedora 24). The amount of RAM is 8GB, which is enough to store the underlying matrix M as well as its copy. I am assuming that the Eigen can be parallelised in an OpenMP parallel reigion. Is this correct? If correct, am I doing something wrong? Best, Toru Takahashi |
![]() Registered Member ![]()
|
Hi,
I forgot to mention two things: (1) The compile options are "-O3 -mtune=native -march=native -fopenmp". (2) The function LU_by_Eigen() is implemented as void LU_with_Eigen(MatrixXd M) { timerType *t; createTimer(&t); startTimer(t); PartialPivLU<MatrixXd> LU = M.lu(); stopTimer(t); printTimer(t); freeTimer(t); } where "timer" is my utility for timing. Toru |
![]() Registered Member ![]()
|
Hi,
When I replaced "LU_with_Eigen()" with "MatMatProduct_with_Eigen()", which computes the square of a given matrix, the behaviour of Eigen did not change --- Eigen was not parallelised in the OpenMP parallel region. For this regard, I have found the reason in the source code "Eigen/src/Core/products/Parallelizer.h", that is,
Clearly, the value of omp_get_num_threads() is greater than 1 in an OpenMP parallel region. In fact, when I assigned one thread to the OpenMP parallel region of my code, the Eigen was parallelised with the number of threads specified by omp_set_num_threads() inside the parallel region. I am wondering why Eigen assumes "omp_get_num_threads()>1" to parallelize the mat-mat product and, perhaps, LU decomposition. Best, Toru Takahashi |
![]() Moderator ![]()
|
That's on purpose, because if the caller code is already multithreaded then we assume that all other cores are fully used, and thus multithreading the matrix-matrix product would be counter productive (it can even be orders of magnitude slower).
Of course if you have 8 cores, and a parallel section with 4 threads with only one doing matrix-matrix products (or LU), you might still want Eigen to use 5 threads. We need to find a simple mechanism for that. |
![]() Registered Member ![]()
|
Hi Gael,
Thank you for your reply. I understand that, on purpose, the current Eigen cannot parallelise matrix operations in a parallel region. I guess this is because the managing cache sizes is not obvious.... I hope this nested-parallelisation is included in the Eigen's todo list. Thank you, Toru |
![]() Registered Member ![]()
|
Dear ggael,
has this been addressed by any chance? We have a use case in which having nested threading would be extremely useful. Cheers, Bartek |
Registered users: Bing [Bot], Evergrowing, Google [Bot], rblackwell