This forum has been archived. All content is frozen. Please use KDE Discuss instead.

use of multicore

katakombi (Registered Member, Posts: 18, Karma: 0)

use of multicore

Thu Mar 03, 2011 8:37 pm
Hi,

I am using Qt / Eigen3 to write some neural-network implementation, and I like the way the code stays readable and short.
However, I'd like to speed it up, which only seems possible by somehow parallelizing the vector X matrix products for my medium-sized matrices (around 1000x1000).

These are the possibilities I see so far:

1) QThreads + Eigen code
- Is it thread-safe?
- Is it recommended, or too much overhead?

2) blas in Eigen
- Is it possible in std Eigen syntax, or do I have to use Fortran then?
- Is there some example / tutorial for using blas with Eigen?

3) dunno??
- Will automatic parallelization for such simple operations be a
feature in the standard Eigen some day in the future?

Another thing that puzzled me is that a colleague implemented the very same code in BLAS, and he claimed it ran 3x faster than with Eigen.

First I thought it was because he was using row-major instead of column-major data order, and I switched it in my code, too.
But it made no difference in speed at all.

Could it be that my Eigen does not use SSE? Note that I passed the compiler flag, but the speed also stayed the same when I didn't pass it!

thanks a lot in advance, any hint will be appreciated :)

Stefan
ggael (Moderator, Posts: 3447, Karma: 19)

Re: use of multicore

Fri Mar 04, 2011 2:26 pm
katakombi wrote: Hi,

1) QThreads + Eigen code
- Is it thread-safe?
- Is it recommended, or too much overhead?


In the meantime you could simply write an OpenMP for loop yourself. If your matrix is column major:
// split the rows of mat into nbthreads horizontal blocks and compute
// each block * vec product in its own thread
const int bs = int(mat.rows()) / nbthreads;  // base block size
#pragma omp parallel for
for (int k = 0; k < nbthreads; ++k)
{
    // the last block also absorbs the remainder rows
    const int actual_bs = (k == nbthreads - 1) ? int(mat.rows()) - k * bs : bs;
    res.segment(k * bs, actual_bs) = mat.middleRows(k * bs, actual_bs) * vec;
}

^^ this is just to give you an idea, maybe it contains bugs...

2) blas in Eigen
- Is it possible in std Eigen syntax, or do I have to use Fortran then?
- Is there some example / tutorial for using blas with Eigen?


That's easy; here is an example, assuming float scalars, column-major storage, and res = mat * vec:

// Fortran BLAS routine: y = alpha*A*x + beta*y
extern "C" void sgemv_(const char* trans, const int* m, const int* n,
                       const float* alpha, const float* A, const int* lda,
                       const float* x, const int* incx,
                       const float* beta, float* y, const int* incy);

int lda = mat.outerStride();
int m = mat.rows();
int n = mat.cols();
float fzero = 0.0f;  // beta: overwrite res
float fone  = 1.0f;  // alpha: plain product
int incres = res.innerStride();
int incvec = vec.innerStride();
sgemv_("N", &m, &n, &fone, mat.data(), &lda, vec.data(), &incvec, &fzero, res.data(), &incres);

In the future we plan to allow falling back to a BLAS library.

3) dunno??
- Will automatic parallelization for such simple operations be a
feature in the standard Eigen some day in the future?


This is already the case for the matrix-matrix product, and yes, in the near future this will also be the case for matrix-vector products.

Another thing that puzzled me is that a colleague implemented the very same code in BLAS, and he claimed it ran 3x faster than with Eigen.


3x? Strange. What is your scalar type? What sizes?
First I thought it was because he was using row-major instead of column-major data order, and I switched it in my code, too.
But it made no difference in speed at all.

Could it be that my Eigen does not use SSE? Note that I passed the compiler flag, but the speed also stayed the same when I didn't pass it!

What is your system? 32 or 64 bits? compiler version?
katakombi (Registered Member, Posts: 18, Karma: 0)

Re: use of multicore

Fri Mar 04, 2011 5:02 pm
Hi ggael!

thanks for the quick and helpful response!

I am using the Eigen3 beta4 with g++ 4.4.5.
I tried g++ 4.5.1 and it seems to give slightly better performance (about 7%).
I use only doubles and build binaries for both 64 and 32 bit Qt.
If it helped I could use floats, but some internal conversions to doubles would be needed in that case.

As mentioned, my colleague "claimed" this, but I don't fully trust him.
However, it nags me to make sure I get the performance I can expect from Eigen ;)

I didn't know that g++ has support for OpenMP. That actually looks quite cool - is it stable in 4.5 and for 64-bit?
Will it play nicely with Qt threads?

thanks
Stefan
katakombi (Registered Member, Posts: 18, Karma: 0)

Re: use of multicore

Sat Mar 05, 2011 1:17 pm
I tried OpenMP.
Apart from eating all available CPU cycles, I see no speedup; for smaller matrices it is even slower...
johnm1019 (Registered Member, Posts: 46, Karma: 0)

Re: use of multicore

Fri Mar 11, 2011 2:56 pm
katakombi wrote: I tried OpenMP.
Apart from eating all available CPU cycles, I see no speedup; for smaller matrices it is even slower...


This is to be expected, katakombi -- parallelization requires some overhead (although a very small one), so there is certainly some problem size below which the parallelization overhead is greater than the work being parallelized.

If your code requires a lot of synchronization, or if you have <=2 cores, then "parallelizing" probably won't help much.
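
For instance, a minimal sketch (sizes and the file name are arbitrary) where the thread start-up cost can easily exceed the work itself:

// For a deliberately small workload, the parallel loop is often slower
// than a serial one because creating/waking the thread team dominates.
// Compile with: g++ -O2 -fopenmp small_overhead.cpp
#include <omp.h>
#include <cstdio>
#include <vector>

int main()
{
    const int n = 1000;                      // small on purpose
    std::vector<double> v(n, 1.0);

    double s = 0.0;
    double t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; ++i)
        s += v[i];
    double t1 = omp_get_wtime();

    std::printf("sum = %g, parallel loop took %g s\n", s, t1 - t0);
    return 0;
}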
ggael (Moderator, Posts: 3447, Karma: 19)

Re: use of multicore

Tue Mar 15, 2011 10:34 am
katakombi wrote: I tried OpenMP.
Apart from eating all available CPU cycles, I see no speedup; for smaller matrices it is even slower...


I confirm that even with quite large matrices and vectors, it does not seem so easy to get any benefit from parallelizing the matrix-vector product... I don't know why yet.
mpmcl (Registered Member, Posts: 19, Karma: 0)

Re: use of multicore

Wed Mar 16, 2011 4:43 pm
My two cents:

I use Eigen v2 on Mac OS X (Xcode with GCC 4.2) for a simulation that requires millions of dynamic matrices and vectors in sometimes elaborate infix combinations.

My experience is that I get *negative* scaling with an increasing number of threads, i.e., more threads (on more CPUs) give slower execution! The reason turned out to be malloc contention.

My eventual solution was to replace threads with parallel (sub)tasks. In this case, each task (process) gets its own, protected memory space and there is no contention. Thus, with 4 CPUs + hyperthreading --> 8 virtual CPUs, I can spawn 8 subtasks and get the full 8x speed improvement. The downside is that tasks are much more trouble to set up and a *lot* more trouble to communicate with. Nevertheless, with Mac OS X (Cocoa), it is possible (and probably easy for an expert).
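
A minimal POSIX sketch of the subtask idea (an illustration only; the real Cocoa setup is more involved):

// Each fork()ed child gets its own private heap, so the workers'
// allocations never contend on a shared malloc lock.
#include <sys/wait.h>
#include <unistd.h>

int main()
{
    const int ntasks = 8;                // e.g. one per virtual CPU
    for (int k = 0; k < ntasks; ++k) {
        if (fork() == 0) {
            // child: process the k-th slice of the data here...
            _exit(0);                    // ...then exit without returning
        }
    }
    while (wait(0) > 0) {}               // parent: reap all children
    return 0;
}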

BTW, my code is SIMD, parallelized over the data. I do not try to parallelize individual matrices.

Also, I am using CPUs only. The plunge into OpenCL is a learning curve I have not yet attempted.
MarkusS (Registered Member, Posts: 9, Karma: 0)

Re: use of multicore

Fri Mar 18, 2011 6:29 pm
@mpmcl: If you ever port your application to Eigen v3, you could try my private allocator patch (viewtopic.php?f=74&t=92934&p=186527&hilit=allocator#p186546) and see if it helps your malloc contention... If your threads use dynamic matrices of the same size, a private (but thread-global!) allocator should reduce the contention quite a bit.

Hope this helps,

Markus
katakombi (Registered Member, Posts: 18, Karma: 0)

Re: use of multicore

Sun Mar 20, 2011 8:38 am
Dear community!

First of all, thanks for all your hints, and for the new stable Eigen.
I've switched to Eigen3 in the meantime; it looks very nice ;)

floats vs. doubles: I switched to floats, now mat X vec is about 2-3x faster!

@mpmcl: I attempted a QThread implementation with threads running in parallel (invoked only once at startup), but gave it up after several pitfalls with QWaitCondition.
Anyway, that solution would have resulted in some not-too-nice, hard-to-read code...


A pal recommended slightly modifying the forward/backward pass in my training, such that I'd be able to use matrix X matrix (by batch-processing several steps at once).

Can you roughly tell me what is the speed of that operation compared to mat X vector? Can it use parallelization?
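
The idea would be something like this (a rough sketch with hypothetical names, assuming float matrices):

#include <Eigen/Dense>

// W: weight matrix (n_out x n_in); X: batch of B input columns (n_in x B).
// One W * X product then replaces B separate W * x products.
Eigen::MatrixXf forwardBatch(const Eigen::MatrixXf& W, const Eigen::MatrixXf& X)
{
    return W * X;
}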


Math functions in Eigen:
I was using an exp approximation which is really faster than anything else I could find (with doubles). With floats I'm not sure now, but I wonder whether Eigen3 has some fast exp/log implementation suitable for vectors.
I used the approximation because the libc implementation is horribly slow, but it can eventually cause instabilities. Furthermore, it is a macro, and I believe it can't be applied to Eigen3 vectors/matrices...

UPDATE: ok, I found that .array().exp() and a normalization could do this, I'll try!
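Something like this, I guess (a rough sketch with a hypothetical helper, untested):

#include <Eigen/Dense>

// Element-wise exp over the whole vector, then normalize so the entries
// sum to 1. With float, Eigen vectorizes the .exp() with SIMD.
Eigen::VectorXf softmaxish(const Eigen::VectorXf& z)
{
    Eigen::VectorXf e = z.array().exp().matrix();
    return e / e.sum();
}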

thanks again
StefanK
ggael (Moderator, Posts: 3447, Karma: 19)

Re: use of multicore

Mon Mar 21, 2011 11:36 pm
Can you roughly tell me what is the speed of that operation compared to mat X vector? Can it use parallelization?


If you can do that, then it's really worth it: you can expect a 2x speedup, and maybe even more, even on a single core. Moreover, matrix-matrix products are natively parallelized by Eigen with quite good scaling; simply compile with -fopenmp ;)

Regarding array().exp(), with float this operation is vectorized and highly optimized.
katakombi (Registered Member, Posts: 18, Karma: 0)

Re: use of multicore

Thu Jun 16, 2011 8:33 am
Ok! Better late than never, but let me finally post some benchmarks:

Recurrent NN with 500 hidden neurons/1000 output neurons (Eigen3):

        |MatXVec | MatXMat
 ----------------------------------------------------------
 #seq   |  2^0   | 2^3   2^4   2^5   2^6   2^7   2^8   2^9
 #w/sec |  2300  | 3000  3700  5600  6200  7000  7100  7000
 ----------------------------------------------------------
 OpenMP |   -    |  -     -    6000   ?      ?   8500   ?


#seq is the length of the sequence I am processing in one propagation
#w/sec is avg. samples propagated per second

My ATLAS implementation ran 4x-8x slower -- so I still can't believe it is properly configured ;)

BTW: when I defined the macro for row-major order and transposed the matrices, speed went down by 20-30%!

It may seem like OpenMP does not help much on a dual core, but there is actually still a lot of other overhead: the sole MatXMat call now takes maybe only 60-70% of the CPU time.
mfiers (Registered Member, Posts: 8, Karma: 0)

Re: use of multicore

Tue Apr 03, 2012 1:34 pm
ggael wrote: Moreover, matrix-matrix products are natively parallelized by Eigen with quite good scaling; simply compile with -fopenmp ;)



So to summarize, in the near future this will be implemented for matrix-vector as well. That's great! Is that for both dense and sparse matrices?

Furthermore, if you compile with -fopenmp, how do you know the compiler correctly picked this up? We have a very complex build process, that's why I ask.

In another thread I read this:
OMP_NUM_THREADS=number_of_real_cores ./my_app

Is there a way to control which operations are parallelized and which are not? Some parts of the code won't need parallelized instructions, but the 'hot' code will. When I used OpenMP, there was some function you could call in C++ that sets the number of threads to use: omp_set_num_threads or something like that.

Thanks,
Martin

P.S. I didn't find a tutorial / any information about parallelization in the Eigen3 documentation. Maybe I missed it? It would be nice to have half a page about how to use OpenMP. Apart from that: great work from Eigen, this library is really great!
ggael (Moderator, Posts: 3447, Karma: 19)

Re: use of multicore

Tue Apr 03, 2012 9:35 pm
To be sure Eigen is generating parallel code, you can check, after including an Eigen header, that EIGEN_PARALLELIZE is really defined.

To control Eigen's multithreading at runtime, call Eigen::setNbThreads(n); you can use OpenMP's API too.
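
For example (a minimal sketch, assuming Eigen 3 compiled with -fopenmp; sizes are arbitrary):

#include <Eigen/Dense>
#include <cstdio>

int main()
{
    Eigen::setNbThreads(2);     // limit Eigen's parallel products to 2 threads
    std::printf("Eigen uses %d threads\n", Eigen::nbThreads());

    Eigen::MatrixXf A = Eigen::MatrixXf::Random(1000, 1000);
    Eigen::MatrixXf B = Eigen::MatrixXf::Random(1000, 1000);
    Eigen::MatrixXf C = A * B;  // the matrix-matrix product runs in parallel
    std::printf("C(0,0) = %f\n", C(0, 0));
    return 0;
}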

Sparse matrix * vector products will be parallelized very soon, but only for row-major matrices. For dense matrix * vector, it seems to be more complicated to get real benefits.
mfiers (Registered Member, Posts: 8, Karma: 0)

Re: use of multicore

Wed Apr 04, 2012 9:44 am
ggael wrote: To be sure Eigen is generating parallel code, you can check, after including an Eigen header, that EIGEN_PARALLELIZE is really defined.

To control Eigen's multithreading at runtime, call Eigen::setNbThreads(n); you can use OpenMP's API too.

Sparse matrix * vector products will be parallelized very soon, but only for row-major matrices. For dense matrix * vector, it seems to be more complicated to get real benefits.


That looks very strange to me: a dense matrix-vector product looks like it could just be split up into pieces that are calculated separately (just like row-major sparse matrices)? Or is it more an issue of caching/...?

So to check that, you could do, somewhere in your code:
#include <cstdio>

#ifdef EIGEN_PARALLELIZE
printf("Eigen parallelization is on\n");
#else
printf("Eigen parallelization is off\n");
#endif

Kind regards,
Martin
ggael (Moderator, Posts: 3447, Karma: 19)

Re: use of multicore

Wed Apr 04, 2012 12:24 pm
mfiers wrote: That looks very strange to me: a dense matrix-vector product looks like it could just be split up into pieces that are calculated separately (just like row-major sparse matrices)? Or is it more an issue of caching/...?


Yes, I'm puzzled too. A main difference is that in the dense case the algorithm is much more efficient in terms of GFLOPS (vectorization, pipelining, caching, etc.), and so the pressures on the CPU are very different. Nevertheless, there must be a way to get good scaling with dense*vector; it just needs more investigation...

