torutakahashi
Registered Member

Hi,
I am trying to enable the parallel functionality of the Eigen (version 3.3.4) in a nested OpenMP parallel region. Here is the pseudocode together with the numerical result: ============================== // Consider the LU decomposition of a nbyn matrix, say M, where n=5000 (then, M requires 288MB of memory). MatrixXd M = MatrixXd::Random(n, n); // First, I performed the LU decomposition of the matrix M with a single thread. omp_set_num_threads(1); LU_with_Eigen(M); // Note that the matrix M is copied to this function, and thus not destroyed. The result was 2.37 second on my PC with 8 physical cores. // Next, I used 4 threads. omp_set_num_threads(4); LU_with_Eigen(M); The result was 1.21 second. The scalability is not so good, but we can see that the Eigen actually parallelised the decomposition. // Then, I tested if the Eigen was parallelised or not in an OpenMP parallel region. omp_set_nested(true): // Enable the nesting of OpenMP. omp_set_num_threads(2); // Use two threads (say TH0 and TH1) in the next parallel region. #pragma omp parallel { // Let each thread use 4 threads; the total number of threads is 8 and less than the number of CPU cores. omp_set_num_threads(4); LU_with_Eigen(M); // Unfortunately the result was 2.91 sec for TH0 and 2.92 secfor TH1, although I had expected to obtain the same timing result as the previous one, that is, 1.21 sec. } ============================== In this test, I used the compiler of g++ 6.3.1 (on Fedora 24). The amount of RAM is 8GB, which is enough to store the underlying matrix M as well as its copy. I am assuming that the Eigen can be parallelised in an OpenMP parallel reigion. Is this correct? If correct, am I doing something wrong? Best, Toru Takahashi 
torutakahashi
Registered Member

Hi,
I forgot to mention two things: (1) The compile options are "O3 mtune=native march=native fopenmp". (2) The function LU_by_Eigen() is implemented as void LU_with_Eigen(MatrixXd M) { timerType *t; createTimer(&t); startTimer(t); PartialPivLU<MatrixXd> LU = M.lu(); stopTimer(t); printTimer(t); freeTimer(t); } where "timer" is my utility for timing. Toru 
torutakahashi
Registered Member

Hi,
When I replaced "LU_with_Eigen()" with "MatMatProduct_with_Eigen()", which computes the square of a given matrix, the behaviour of Eigen did not change  Eigen was not parallelised in the OpenMP parallel region. For this regard, I have found the reason in the source code "Eigen/src/Core/products/Parallelizer.h", that is,
Clearly, the value of omp_get_num_threads() is greater than 1 in an OpenMP parallel region. In fact, when I assigned one thread to the OpenMP parallel region of my code, the Eigen was parallelised with the number of threads specified by omp_set_num_threads() inside the parallel region. I am wondering why Eigen assumes "omp_get_num_threads()>1" to parallelize the matmat product and, perhaps, LU decomposition. Best, Toru Takahashi 
ggael
Moderator

That's on purpose, because if the caller code is already multithreaded then we assume that all other cores are fully used, and thus multithreading the matrixmatrix product would be counter productive (it can even be orders of magnitude slower).
Of course if you have 8 cores, and a parallel section with 4 threads with only one doing matrixmatrix products (or LU), you might still want Eigen to use 5 threads. We need to find a simple mechanism for that. 
torutakahashi
Registered Member

Hi Gael,
Thank you for your reply. I understand that, on purpose, the current Eigen cannot parallelise matrix operations in a parallel region. I guess this is because the managing cache sizes is not obvious.... I hope this nestedparallelisation is included in the Eigen's todo list. Thank you, Toru 
Registered users: adamatepsilon, Baidu [Spider], Bing [Bot], cylverbak, eshahnazi, eveanned, Exabot [Bot], Google [Bot]