Reply to topic

Eigen in an OpenMP parallel region

torutakahashi
Registered Member
Posts
4
Karma
0
OS

Eigen in an OpenMP parallel region

Thu Nov 16, 2017 9:43 am
Hi,

I am trying to enable the parallel functionality of the Eigen (version 3.3.4) in a nested OpenMP parallel region. Here is the pseudocode together with the numerical result:
==============================
// Consider the LU decomposition of a n-by-n matrix, say M, where n=5000 (then, M requires 288MB of memory).
MatrixXd M = MatrixXd::Random(n, n);

// First, I performed the LU decomposition of the matrix M with a single thread.
omp_set_num_threads(1);
LU_with_Eigen(M); // Note that the matrix M is copied to this function, and thus not destroyed.
The result was 2.37 second on my PC with 8 physical cores.

// Next, I used 4 threads.
omp_set_num_threads(4);
LU_with_Eigen(M);
The result was 1.21 second. The scalability is not so good, but we can see that the Eigen actually parallelised the decomposition.

// Then, I tested if the Eigen was parallelised or not in an OpenMP parallel region.
omp_set_nested(true): // Enable the nesting of OpenMP.
omp_set_num_threads(2); // Use two threads (say TH0 and TH1) in the next parallel region.
#pragma omp parallel
{
// Let each thread use 4 threads; the total number of threads is 8 and less than the number of CPU cores.
omp_set_num_threads(4);
LU_with_Eigen(M); //
Unfortunately the result was 2.91 sec for TH0 and 2.92 secfor TH1, although I had expected to obtain the same timing result as the previous one, that is, 1.21 sec.
}
==============================
In this test, I used the compiler of g++ 6.3.1 (on Fedora 24). The amount of RAM is 8GB, which is enough to store the underlying matrix M as well as its copy.

I am assuming that the Eigen can be parallelised in an OpenMP parallel reigion. Is this correct? If correct, am I doing something wrong?

Best,
Toru Takahashi
torutakahashi
Registered Member
Posts
4
Karma
0
OS
Hi,

I forgot to mention two things:

(1) The compile options are "-O3 -mtune=native -march=native -fopenmp".
(2) The function LU_by_Eigen() is implemented as

void LU_with_Eigen(MatrixXd M)
{
timerType *t;
createTimer(&t);
startTimer(t);
PartialPivLU<MatrixXd> LU = M.lu();
stopTimer(t);
printTimer(t);
freeTimer(t);
}

where "timer" is my utility for timing.

Toru
torutakahashi
Registered Member
Posts
4
Karma
0
OS
Hi,

When I replaced "LU_with_Eigen()" with "MatMatProduct_with_Eigen()", which computes the square of a given matrix, the behaviour of Eigen did not change --- Eigen was not parallelised in the OpenMP parallel region. For this regard, I have found the reason in the source code "Eigen/src/Core/products/Parallelizer.h", that is,


// if multi-threading is explicitely disabled, not useful, or if we already are in a parallel session,
// then abort multi-threading
// FIXME omp_get_num_threads()>1 only works for openmp, what if the user does not use openmp?
if((!Condition) || (threads==1) || (omp_get_num_threads()>1))
return func(0,rows, 0,cols);


Clearly, the value of omp_get_num_threads() is greater than 1 in an OpenMP parallel region. In fact, when I assigned one thread to the OpenMP parallel region of my code, the Eigen was parallelised with the number of threads specified by omp_set_num_threads() inside the parallel region.

I am wondering why Eigen assumes "omp_get_num_threads()>1" to parallelize the mat-mat product and, perhaps, LU decomposition.

Best,
Toru Takahashi
User avatar ggael
Moderator
Posts
3428
Karma
19
OS
That's on purpose, because if the caller code is already multithreaded then we assume that all other cores are fully used, and thus multithreading the matrix-matrix product would be counter productive (it can even be orders of magnitude slower).

Of course if you have 8 cores, and a parallel section with 4 threads with only one doing matrix-matrix products (or LU), you might still want Eigen to use 5 threads. We need to find a simple mechanism for that.
torutakahashi
Registered Member
Posts
4
Karma
0
OS
Hi Gael,

Thank you for your reply. I understand that, on purpose, the current Eigen cannot parallelise matrix operations in a parallel region. I guess this is because the managing cache sizes is not obvious.... I hope this nested-parallelisation is included in the Eigen's todo list.

Thank you,
Toru

 
Reply to topic

Bookmarks



Who is online

Registered users: Bing [Bot], Exabot [Bot], Google [Bot], jackdinn, kde-cfeck, La Ninje, lueck, ramdanifajar, slumpgod, Sogou [Bot], Yahoo [Bot]