Registered Member
Hello,
I have a problem with the performance of Eigen's matrix multiplication. I need to multiply two big matrices (N ~ 4000) and it takes about 60 seconds to complete (Intel(R) Core(TM)2 Duo CPU L7500 @ 1.60GHz). This is quite slow: the same matrix multiplication on the same machine under Matlab takes around 9 seconds or less. (Is Eigen using a naive O(n^3) approach? I am not sure what Matlab uses, but it's better.)

Similar performance differences show up when I try to solve a regression problem of similar dimensions (A.lu().solve(b, &x), where A is an NxN matrix and b an Nx3 one; in Matlab: x = A\b;).

How can I improve the performance of this multiplication (or of the solving)? Is there any trick which improves Eigen's performance, or am I doing something very awkward? I was using g++ 4.3.4, -O2, gentoo-linux and eigen-2.0.11.

In fact, I tried the following snippet of code (matrix multiplication). Don't be confused by the smart pointer; it doesn't really make a difference whether the matrix variable is created on the stack or not, and the source of the problem is not the multiplication of the matrix with itself either, I tested that. Beware: it's memory intensive.
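(The snippet itself did not survive here, but it was essentially along the following lines. This is a reconstruction, so the auto_ptr choice, the Random() fill and the clock() timing are illustrative, not the literal original code.)

    #include <Eigen/Core>
    #include <ctime>
    #include <iostream>
    #include <memory>

    using namespace Eigen;

    int main()
    {
        const int N = 4000;

        // heap-allocate behind a smart pointer, as in the original post
        // (stack vs. heap makes no measurable difference here);
        // std::auto_ptr was the era-appropriate choice, today use std::unique_ptr
        std::auto_ptr<MatrixXd> x(new MatrixXd(MatrixXd::Random(N, N)));
        MatrixXd r(N, N);

        std::clock_t start = std::clock();
        r = (*x) * x->transpose();   // the slow product in question
        double secs = double(std::clock() - start) / CLOCKS_PER_SEC;

        std::cout << "N = " << N << ": product took " << secs << " s" << std::endl;
        return 0;
    }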
I hope for your help.

Kind regards, schorfi
Moderator
2^3 = 8, even. (The three 2x factors listed further down multiply up to roughly the 60 s vs. 9 s gap.)
Registered Member
Aaaaaaargh!!!

I wanted to reply to Gael's post; instead I edited it! Sorry guys, I'll see how to restore Gael's post.
Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list!
Registered Member
Sorry guys, I killed Gael's post, and the kde-forum staff tell me that it really can't be undone.
Gael was giving you several reasons why Eigen could be slower:
- First, are you using 2.0? The devel branch can be 2x faster.
- Second, Matlab is probably using your 2 CPU cores, while Eigen uses only one (for now). Again a 2x factor.
- ... and are you sure you have enabled SSE2? g++ -O2 -msse2 (a quick check is sketched below).
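A quick sanity check for that last point (a minimal sketch; EIGEN_VECTORIZE is the macro Eigen defines whenever its SIMD vectorization is actually enabled):

    #include <Eigen/Core>
    #include <iostream>

    int main()
    {
    #ifdef EIGEN_VECTORIZE
        std::cout << "Eigen vectorization: enabled" << std::endl;
    #else
        std::cout << "Eigen vectorization: DISABLED (missing -msse2?)" << std::endl;
    #endif
        return 0;
    }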
Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list!
Moderator
Yes. The other reason is that when you compute r = x * x.transpose(), only half of the result has to be computed, and Matlab is likely to detect that, so one more factor of 2.
With the devel branch, on my 2GHz computer I got 20 sec for r = x * x.transpose(), which corresponds to 6 GFLOPS/sec, i.e. 3 operations per cycle, which is very good.

Still with the devel branch, you can compute r = x * x.transpose() even more efficiently:

    r.setZero();
    r.selfadjointView<Upper>().rankUpdate(x);

and if you want the full matrix r, then do:

    r.triangularView<StrictlyLower>() = r.transpose();

but usually you only need to store one half of a symmetric matrix. Here that took 10 sec (for 4k x 4k matrices of doubles). Also, the solvers are much, much faster in the devel branch.
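Put together, a self-contained version of that pattern looks like this (a sketch against the devel-branch API; the Random() fill and the size are just for illustration):

    #include <Eigen/Dense>
    #include <iostream>

    using namespace Eigen;

    int main()
    {
        const int N = 4000;
        MatrixXd x = MatrixXd::Random(N, N);
        MatrixXd r(N, N);

        // compute only the upper triangle of r = x * x^T
        r.setZero();
        r.selfadjointView<Upper>().rankUpdate(x);

        // fill in the lower half only if the full matrix is really needed
        r.triangularView<StrictlyLower>() = r.transpose();

        std::cout << "r(0,0) = " << r(0, 0) << std::endl;
        return 0;
    }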
Registered Member
hey,
thank you for the reply. I'll try the dev branch and report my results soon. It should also cover another need of mine, the product of all coefficients of a matrix/vector, which I had rolled by hand up to now.

Matlab is not using parallel computing by default, and/or it was using just one core for my experiments; I observed that easily through gkrellm. (If my memory is correct, an extra toolbox was needed for parallel computing...) I tried OpenMP to parallelise another Eigen-based computation, but failed to use both cores; it seems to start multiple threads, but they all run on a single core. Strange, probably my fault, but that's another topic.

I am using -march=native, which I thought includes SSE2; I will state it explicitly for a test. I wasn't aware of the symmetric-matrix point, but true, that could be a reason. Thank you, I will check.

Kind regards, schorfi
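(For what it's worth, the coefficient product no longer needs a hand-rolled loop in the devel branch; it is exposed there as the .prod() reduction. A minimal sketch:)

    #include <Eigen/Dense>
    #include <iostream>

    using namespace Eigen;

    int main()
    {
        VectorXd v(4);
        v << 1.0, 2.0, 3.0, 4.0;

        // product of all coefficients: 1 * 2 * 3 * 4 = 24
        std::cout << v.prod() << std::endl;
        return 0;
    }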
Moderator
Sorry, but that cannot be true. Let me explain. Your product requires 168e9 flops. If we assume that only half of the result is computed, 84e9 flops remain. If Matlab performs it in 9 secs, that means 9.3 GFLOPS/sec, while the peak performance of your 1.6GHz Core 2 is 1.6 * 2 (SSE) * 2 (one mul and one add per cycle) = 6.4 GFLOPS. So Matlab must have used your two CPU cores. The other explanation is that you used single-precision floating point matrices. (I guess the default is double precision, right?)
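The same back-of-envelope arithmetic as a few lines of code (the 168e9 flop count and the 9 s timing are taken from the posts above):

    #include <cstdio>

    int main()
    {
        const double flops    = 168e9;        // full product
        const double half     = flops / 2.0;  // only one triangle computed
        const double matlabS  = 9.0;          // reported Matlab time
        const double achieved = half / matlabS / 1e9;  // ~9.3 GFLOPS
        const double peak     = 1.6 * 2 * 2;  // GHz * SSE2 width * (mul+add)

        std::printf("achieved %.1f GFLOPS vs. single-core peak %.1f GFLOPS\n",
                    achieved, peak);
        return 0;
    }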
Registered Member
Hi,
over here, Matlab is definitely using two cores, and it does not use or require the Parallel Computing Toolbox for that (you need the toolbox for things like parfor). I verified via 'license inuse'. Looking at the threads consuming most of the time (via Process Explorer), one can observe that libguide40.dll!_kmp_launch_worker is hard at work (at least over here), which means OpenMP is being used. The run-time is the same, 9 secs.

Regards, Hauke
Registered Member
hey,
you are right, it is using two cores now. True, I am not sure what I observed before; my fault, sorry :/
Registered Member
Sorry for the mismatch; I think I mixed up my own recollections. My model uses mex, which runs on a single core, and due to a lack of memory on my laptop I cannot run "big" experiments that produce matrices of size 4000+. So what I saw was quasi single-core behaviour: one core was busy and the other mostly idle, with only little peaks, which seem to be related to the matrix multiplication and solving. The explanation is that the Matlab model spends most of its time generating random data (the matrix multiplication and solving are only a small fraction), and that part seems to occupy only one core; I had forgotten that fact.

However, the dev branch is indeed faster: the "usual" matrix multiplication (matrix * matrix.transpose()) now takes around 35 seconds, and the symmetric half-version takes 17 seconds. That's a good improvement; I hope parallel computing comes soon.

Best regards, schorfi