Registered Member
Hi,
I am using Qt / Eigen3 to write a neural-network implementation, and I like the way the code stays readable and short. However, I'd like to speed it up, which only seems possible by parallelizing the vector x matrix products for medium-sized matrices (around 1000x1000). These are the possibilities I see so far:

1) QThreads + Eigen code
- Is Eigen thread-safe?
- Is this recommended, or is the overhead too large?

2) BLAS in Eigen
- Is it possible with the standard Eigen syntax, or do I have to use Fortran for that?
- Is there some example / tutorial for using BLAS with Eigen?

3) dunno??
- Will automatic parallelization of such simple operations be a feature of standard Eigen some day?

Another thing that puzzled me: a colleague implemented the very same code directly in BLAS and claimed it ran 3x faster than with Eigen. First I thought it was because he was using row-major instead of column-major storage order, so I switched my code too, but it made no difference in speed at all. Could it be that my Eigen build does not use SSE? Note that I did pass the compiler flag, but when I don't pass it the speed stays exactly the same as well!

Thanks a lot in advance, any hint will be appreciated.
Stefan
Moderator
In the meantime you could simply write an OpenMP for loop. If your matrix is column major:

int nbthreads = omp_get_max_threads();
int bs = mat.rows()/nbthreads;   // rows handled by each thread
#pragma omp parallel for
for(int k=0; k<nbthreads; ++k)
{
  // the last thread also takes the remainder when rows is not a multiple of nbthreads
  int actual_bs = (k==nbthreads-1) ? mat.rows()-k*bs : bs;
  res.segment(k*bs, actual_bs) = mat.middleRows(k*bs, actual_bs) * vec;
}

^^ this is just to give you an idea, maybe it still contains bugs...
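For reference, a self-contained version of the same idea, compiled with g++ -O2 -fopenmp (the sizes and the float scalar type are just assumptions for illustration):

#include <Eigen/Dense>
#include <omp.h>
#include <iostream>

int main()
{
  using namespace Eigen;
  initParallel();                          // recommended when calling Eigen from several threads

  const int n = 1000;                      // medium-sized problem, as in the question
  MatrixXf mat = MatrixXf::Random(n, n);
  VectorXf vec = VectorXf::Random(n);
  VectorXf res(n);

  int nbthreads = omp_get_max_threads();
  int bs = mat.rows() / nbthreads;         // rows handled by each thread
  #pragma omp parallel for
  for (int k = 0; k < nbthreads; ++k)
  {
    // the last thread also takes the remainder rows
    int actual_bs = (k == nbthreads - 1) ? mat.rows() - k * bs : bs;
    res.segment(k * bs, actual_bs) = mat.middleRows(k * bs, actual_bs) * vec;
  }

  // sanity check against the single-threaded product, should print something close to 0
  std::cout << (res - mat * vec).norm() << std::endl;
  return 0;
}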
That's easy, here is an example (for a float, column-major matrix):

int lda = mat.outerStride();
int m = mat.rows();
int n = mat.cols();
float fzero = 0.0;   // beta
float fone  = 1.0;   // alpha
int incres = res.innerStride();
int incvec = vec.innerStride();
// res = alpha * mat * vec + beta * res
sgemv_("N", &m, &n, &fone, mat.data(), &lda, vec.data(), &incvec, &fzero, res.data(), &incres);

In the future we plan to allow falling back to a BLAS library.
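For anyone copying the snippet above: it assumes you declare the Fortran BLAS routine yourself and link against a BLAS library (e.g. -lblas). A minimal declaration, assuming the common trailing-underscore naming convention, would look like:

extern "C" void sgemv_(const char* trans,
                       const int* m, const int* n,
                       const float* alpha, const float* a, const int* lda,
                       const float* x, const int* incx,
                       const float* beta, float* y, const int* incy);

The exact symbol name can differ between BLAS builds (sgemv vs. sgemv_), so check your library.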
This is already the case for matrix-matrix products, and yes, in the near future this will also be the case for matrix-vector products.
3x? Strange. What is your scalar type? Sizes?
What is your system, 32 or 64 bit? Compiler version?
Registered Member
Hi ggael!
Thanks for the quick and helpful response! I am using Eigen3 beta4 with g++ 4.4.5. I tried g++ 4.5.1 and it seems to give slightly better performance (about 7%). I use only doubles and build binaries for both 64 and 32 bit Qt. If it helped I could switch to floats, but some internal conversions to doubles would then be needed.

As mentioned, my colleague "claimed" this, and I don't fully trust him. Still, it is nagging me, and I want to make sure that I get the performance I can expect from Eigen.

I didn't know that g++ has support for OpenMP. That actually looks quite cool - is it stable in 4.5 and for 64 bit? Will it play nice with Qt threads?

thanks
Stefan
Registered Member
I tried OpenMP.
Apart from eating all available CPU cycles, it gives me no speedup; for smaller matrices it is even slower...
Registered Member
This is to be expected, katakombi: parallelization requires some overhead (although very small), so there is certainly a size below which the parallelization overhead is greater than the work being parallelized. A 1000x1000 matrix-vector product is only about 2 million floating-point operations and is largely limited by memory bandwidth rather than by the CPU, so the potential gain is small to begin with. If your code requires a lot of synchronization, or if you have <=2 cores, then "parallelizing" probably won't help much.
Moderator
I confirm that even with quite large matrices and vectors it does not seem easy to get any benefit from parallelization... I don't know why yet.
Registered Member
My two cents:
I use Eigen v2 on Mac OS X (Xcode with GCC 4.2) for a simulation that requires millions of dynamic matrices and vectors in sometimes elaborate infix combinations. My experience is that I get *negative* scaling with an increasing number of threads, i.e., more threads (with more CPUs) give slower execution! The reason turned out to be malloc contention.

My eventual solution was to replace threads with parallel (sub)tasks. In this case, each task (process) gets its own, protected memory space and there is no contention. Thus, with 4 CPUs + hyperthreading --> 8 virtual CPUs, I can spawn 8 subtasks and get the full 8x speed improvement. The downside is that tasks are much more trouble to set up and a *lot* more trouble to communicate with. Nevertheless, with Mac OS X (Cocoa) it is possible (and probably easy for an expert).

BTW, my code is SIMD with the data parallelized; I do not try to parallelize individual matrices. Also, I am using CPUs only. The plunge into OpenCL is a learning curve I have not yet attempted.
Registered Member
@mpmpi: If you ever port your application to Eigen v3, you could try my private allocator patch (viewtopic.php?f=74&t=92934&p=186527&hilit=allocator#p186546) and see if it helps with your malloc contention... If your threads use dynamic matrices of the same sizes, a private (but thread-global!) allocator should reduce the contention quite a bit.
Hope this helps,
Markus
Registered Member
Dear community!
First of all, thanks for all your hints, and for the new stable Eigen. I've switched to Eigen3 in the meantime and it looks very nice.

Floats vs. doubles: I switched to floats, and now mat x vec is about 2-3x faster!

@mpmcl: I attempted a QThread implementation with threads running in parallel (only invoked once at startup) but gave it up after several pitfalls with QWaitCondition. Anyway, that solution would result in some not-too-nice, hard-to-read code...

A colleague recommended slightly modifying the forward/backward pass in my training so that I can use matrix x matrix products (by batch-processing several steps at once). Can you roughly tell me how the speed of that operation compares to mat x vector? Can it use parallelization?

Math functions in Eigen: I was using an exp approximation which is really faster than anything else I could find (with doubles). With floats I'm not sure any more, so I wonder whether Eigen3 has some fast exp/log implementation suitable for vectors. I used the approximation because the libc implementation is horribly slow, but it can eventually cause instabilities. Furthermore, it is a macro and I believe it can't be applied to Eigen3 vectors/matrices... UPDATE: ok, I found that .array().exp() and normalize could do this, I'll try!

thanks again
StefanK
Moderator
If you can do that, then it's really worth it: you can expect a 2x speedup and maybe even more, even on a single core. Moreover, matrix-matrix products are natively parallelized by Eigen with quite good scaling; simply compile with -fopenmp.

Regarding array().exp(): with float this operation is vectorized and highly optimized.
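To make the batching idea concrete, here is a rough sketch (the layer sizes, batch size, function names and the logistic activation are just assumptions about your use case, not something prescribed by Eigen):

#include <Eigen/Dense>

using namespace Eigen;

// One sample at a time: out = f(W * in), with the logistic f = 1/(1+exp(-x))
// applied element-wise via array().exp().
VectorXf forwardOne(const MatrixXf& W, const VectorXf& in)
{
  return ((-(W * in)).array().exp() + 1.0f).inverse().matrix();
}

// Batched version: B input vectors stacked as the columns of inBatch, so the
// core operation is a single matrix-matrix product, which Eigen blocks for the
// cache and parallelizes when compiled with -fopenmp.
MatrixXf forwardBatch(const MatrixXf& W, const MatrixXf& inBatch)
{
  return ((-(W * inBatch)).array().exp() + 1.0f).inverse().matrix();
}

int main()
{
  MatrixXf W = MatrixXf::Random(500, 1000);     // e.g. 500 hidden neurons, 1000 inputs
  MatrixXf batch = MatrixXf::Random(1000, 64);  // 64 samples per batch
  MatrixXf out = forwardBatch(W, batch);        // one mat*mat call instead of 64 mat*vec calls
  return out.size() > 0 ? 0 : 1;
}

The win comes from the fact that W is read once per batch instead of once per sample, and that the matrix-matrix kernel can reuse blocks of W from the cache.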
Registered Member
Ok! Better late than never, but let me finally post some benchmarks:
Recurrent NN with 500 hidden neurons/1000 output neurons (Eigen3):
#seq is the length of the sequence I am processing in one propagation; #w/sec is the average number of samples propagated per second.

My ATLAS implementation ran 4x-8x slower; I still can't believe that it is properly configured, then.

BTW: when I defined the macro for row-major order and transposed the matrices, speed went down by 20-30%!

It may seem like OpenMP does not help much on a dual core, but there is actually still a large overhead, and the mat x mat call alone now takes maybe 60-70% of the CPU time.
Registered Member
So to summarize, in the near future this will be implemented for matrix-vector as well. That's great! Is that for both dense and sparse matrices?

Furthermore, if you compile with -fopenmp, how do you know that the compiler correctly picked this up? We have a very complex build process, that's why I ask. In another thread I read this:

OMP_NUM_THREADS=number_of_real_cores ./my_app

Is there a way to control which operations are parallelized and which are not? Some parts of the code don't need parallelized instructions, but the 'hot' code does. When I used OpenMP directly, there was some function you could call in C++ that sets the number of threads to use, omp_set_num_threads() or something like that.

Thanks,
Martin

P.S. I didn't find a tutorial or any information about parallelization in the Eigen3 documentation. Maybe I missed it? It would be nice to have half a page about how to use OpenMP. Apart from that: great work, this library is really great!
Moderator
To be sure Eigen is generating parallel code, you can check that EIGEN_PARALLELIZE is really defined after including an Eigen header.

Sparse matrix * vector will be parallelized very soon, but only for row-major matrices. For dense*vector it seems more complicated to get a real benefit.

To control Eigen's multithreading at runtime: Eigen::setNbThreads(n); You can also use OpenMP's API directly.
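A minimal sketch of how this fits together (assuming Eigen 3 built with -fopenmp; the sizes are arbitrary):

#include <Eigen/Dense>
#include <iostream>

int main()
{
  // Tell Eigen how many threads to use for its parallelized products.
  Eigen::setNbThreads(4);
  std::cout << "Eigen will use " << Eigen::nbThreads() << " threads" << std::endl;

  Eigen::MatrixXf A = Eigen::MatrixXf::Random(1000, 1000);
  Eigen::MatrixXf B = Eigen::MatrixXf::Random(1000, 1000);

  // The matrix-matrix product below runs in parallel when compiled with -fopenmp;
  // calling Eigen::setNbThreads(1) beforehand would keep it sequential.
  Eigen::MatrixXf C = A * B;

  std::cout << C(0, 0) << std::endl;
  return 0;
}

If you never call setNbThreads(), Eigen falls back to OpenMP's default, which you can control through OMP_NUM_THREADS as mentioned above.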
Registered Member
That looks very strange to me: a dense matrix-vector product looks like it could just be split up into pieces that are calculated separately (just like row-major sparse matrices)? Or is it more an issue of caching/...?

So to check that, you could do, somewhere in your code:

#ifdef EIGEN_PARALLELIZE
printf("Eigen parallelization is on");
#else
printf("Eigen parallelization is off");
#endif

Kind regards,
Martin
Moderator
Yes, I'm puzzled too. A main difference is that in the dense case the algorithm is much more efficient in terms of GFLOPS (vectorization, pipelining, caching, etc.), so the pressure on the CPU is very different. Nevertheless, there must be a way to get good scaling with dense*vector; it just needs more investigation...