This forum has been archived. All content is frozen. Please use KDE Discuss instead.

Eigen in an OpenMP parallel region

torutakahashi (Registered Member)

Thu Nov 16, 2017 9:43 am
Hi,

I am trying to enable the parallel functionality of Eigen (version 3.3.4) inside a nested OpenMP parallel region. Here is the pseudocode together with the timing results:
==============================
// Consider the LU decomposition of an n-by-n matrix, say M, where n=5000 (M then requires 200MB of memory).
MatrixXd M = MatrixXd::Random(n, n);

// First, I performed the LU decomposition of the matrix M with a single thread.
omp_set_num_threads(1);
LU_with_Eigen(M); // Note that the matrix M is copied to this function, and thus not destroyed.
The result was 2.37 seconds on my PC with 8 physical cores.

// Next, I used 4 threads.
omp_set_num_threads(4);
LU_with_Eigen(M);
The result was 1.21 seconds. The scalability is not great, but we can see that Eigen did parallelise the decomposition.

// Then, I tested if the Eigen was parallelised or not in an OpenMP parallel region.
omp_set_nested(true); // Enable the nesting of OpenMP.
omp_set_num_threads(2); // Use two threads (say TH0 and TH1) in the next parallel region.
#pragma omp parallel
{
// Let each thread use 4 threads; the total number of threads is 8 and less than the number of CPU cores.
omp_set_num_threads(4);
LU_with_Eigen(M); //
Unfortunately the result was 2.91 sec for TH0 and 2.92 sec for TH1, although I had expected to obtain the same timing as before, that is, 1.21 sec.
}
==============================
In this test, I used g++ 6.3.1 (on Fedora 24) as the compiler. The amount of RAM is 8GB, which is enough to store the matrix M as well as its copy.

I am assuming that Eigen can be parallelised inside an OpenMP parallel region. Is this correct? If so, am I doing something wrong?

Best,
Toru Takahashi
torutakahashi (Registered Member)
Hi,

I forgot to mention two things:

(1) The compile options are "-O3 -mtune=native -march=native -fopenmp".
(2) The function LU_with_Eigen() is implemented as

void LU_with_Eigen(MatrixXd M)  // pass by value: the caller's M is copied
{
    timerType *t;
    createTimer(&t);
    startTimer(t);
    PartialPivLU<MatrixXd> LU = M.lu();  // partial-pivoting LU of the copy
    stopTimer(t);
    printTimer(t);
    freeTimer(t);
}

where "timer" is my utility for timing.

Toru
torutakahashi (Registered Member)
Hi,

When I replaced "LU_with_Eigen()" with "MatMatProduct_with_Eigen()", which computes the square of a given matrix, the behaviour did not change: Eigen was still not parallelised inside the OpenMP parallel region. Regarding this, I found the reason in the source file "Eigen/src/Core/products/Parallelizer.h", namely,


// if multi-threading is explicitely disabled, not useful, or if we already are in a parallel session,
// then abort multi-threading
// FIXME omp_get_num_threads()>1 only works for openmp, what if the user does not use openmp?
if((!Condition) || (threads==1) || (omp_get_num_threads()>1))
return func(0,rows, 0,cols);


Clearly, the value of omp_get_num_threads() is greater than 1 inside an OpenMP parallel region. In fact, when I assigned only one thread to the parallel region of my code, Eigen was parallelised with the number of threads specified by omp_set_num_threads() inside that region.

I am wondering why Eigen treats "omp_get_num_threads()>1" as a reason not to parallelise the mat-mat product and, presumably, the LU decomposition.

Best,
Toru Takahashi
ggael (Moderator)
That's on purpose: if the caller code is already multithreaded, then we assume that all other cores are fully used, and thus multithreading the matrix-matrix product would be counterproductive (it can even be orders of magnitude slower).

Of course if you have 8 cores, and a parallel section with 4 threads with only one doing matrix-matrix products (or LU), you might still want Eigen to use 5 threads. We need to find a simple mechanism for that.
torutakahashi (Registered Member)
Hi Gael,

Thank you for your reply. I understand that the current Eigen deliberately does not parallelise matrix operations inside a parallel region. I guess this is because managing cache sizes there is not obvious... I hope this nested parallelisation makes it onto Eigen's todo list.

Thank you,
Toru
Tschew (Registered Member)
Dear ggael,

Has this been addressed, by any chance? We have a use case in which nested threading would be extremely useful.

Cheers,
Bartek

