
performance of matrix multiplication

realschorfi (Registered Member)
Hello,
I have a problem with the performance of Eigen's matrix multiplication.

I need to multiply two big matrices (N ~ 4000), and it takes about 60 seconds to complete (Intel(R) Core(TM)2 Duo CPU L7500 @ 1.60GHz).

This is quite slow: the same matrix multiplication on the same machine under Matlab takes around 9 seconds or less to complete. (Is Eigen using a naive O(n^3) approach? I am not sure what Matlab uses, but it is better :( )

Similar performance differences show up when I try to solve a regression problem of similar matrix dimensions
(A.lu().solve(b, &x), where A is an NxN matrix and b is Nx3; in Matlab: x = A\b;)

How can I improve the performance of this multiplication (or of the solving)?
Is there any trick that improves Eigen's performance, or am I doing something very awkward?

I was using g++ 4.3.4 with -O2, on Gentoo Linux, with Eigen 2.0.11.

In fact, I tried the following snippet of code (matrix multiplication).
(Don't be confused by the smart pointers; it doesn't really make a difference whether the matrix variable is created on the stack or not,
and the source of the problem is not the multiplication of the matrix with itself, I tested that;
beware: it's memory intensive.)

Code:
#include <Eigen/Core>
#include <Eigen/Array>
#include <boost/shared_ptr.hpp>
#include <boost/scoped_ptr.hpp>
#include <boost/progress.hpp>
#include <cstdlib>   /* EXIT_SUCCESS */
#include <iostream>

int main()
{
   /* timer; boost::progress_timer prints the elapsed seconds on destruction */
   boost::scoped_ptr<boost::progress_timer> timer_progress;
   /* create a 4096 x 5000 matrix */
   boost::shared_ptr<Eigen::MatrixXd> testMatrix(new Eigen::MatrixXd(4096, 5000));
   /* randomize matrix */
   testMatrix->setRandom();
   /* start timer */
   timer_progress.reset(new boost::progress_timer);
   /* this takes quite long // slow multiplication? */
   Eigen::MatrixXd result = (*testMatrix) * testMatrix->transpose();
   /* the timer is destroyed last and prints the elapsed time */
   std::cout << "multiplication took: ";
   return EXIT_SUCCESS;
}


I hope you can help :)

kind regards
schorfi
ggael (Moderator)
So overall we have a theoretical factor of 6, which is what you observed.


2^3=8, even ;)
bjacob (Registered Member)
aaaaaaargh!!!

I wanted to reply to Gael's post; instead I edited it!!!

Sorry guys, I'll see how to restore Gael's post.


Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list!
bjacob (Registered Member)
Sorry guys, I killed Gael's post, and the kde-forum staff tell me that it really can't be undone.

Gael was giving you several reasons why Eigen could be slower:
- first, are you using 2.0? The devel branch can be 2x faster.
- second, Matlab is probably using your 2 CPU cores, while Eigen uses only one (for now). Again, a 2x factor.
- ...

Are you sure you have enabled SSE2?

g++ -O2 -msse2


ggael (Moderator)
Yes, and the other reason is that when you compute r = x * x.transpose(), only half of the result has to be computed, and Matlab is likely to detect that, so one more factor of 2.

With the devel branch, on my 2GHz computer, r = x * x.transpose() took 20 seconds, which corresponds to 6 GFLOPS and 3 operations per cycle, which is very good.

Still with the devel branch, you can compute r = x * x.transpose() even more efficiently:

r.setZero();
r.selfadjointView<Upper>().rankUpdate(x);

and if you want the full matrix r, then do:

r.triangularView<StrictlyLower>() = r.transpose();

but usually you only need to store one half of a symmetric matrix. Here that took 10sec (for 4k x 4k matrices of doubles).
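
(Putting the pieces together: a minimal, self-contained sketch of the rank-update approach, assuming the Eigen 3 / devel-branch API, with sizes shrunk so it runs quickly:)

Code:
#include <Eigen/Dense>
#include <iostream>

int main()
{
   const int n = 1000;
   /* x plays the role of testMatrix from the first post, just smaller */
   Eigen::MatrixXd x = Eigen::MatrixXd::Random(n, n + 200);

   /* compute only the upper half of r = x * x.transpose() */
   Eigen::MatrixXd r = Eigen::MatrixXd::Zero(n, n);
   r.selfadjointView<Eigen::Upper>().rankUpdate(x);

   /* if the full matrix is needed, mirror the upper half into the lower one */
   r.triangularView<Eigen::StrictlyLower>() = r.transpose();

   std::cout << "r(0,0) = " << r(0, 0) << std::endl;
   return 0;
}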

Also, the solvers are much much faster in the devel branch.
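
(For the lu().solve() question from the first post, the devel-branch call would look roughly like this; a sketch assuming the Eigen 3 decomposition API:)

Code:
#include <Eigen/Dense>
#include <iostream>

int main()
{
   const int n = 1000;
   Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
   Eigen::MatrixXd b = Eigen::MatrixXd::Random(n, 3);  /* N x 3 right-hand side */

   /* the devel branch returns the solution by value instead of
      the Eigen 2 style A.lu().solve(b, &x) */
   Eigen::MatrixXd x = A.partialPivLu().solve(b);

   std::cout << "residual: " << (A * x - b).norm() << std::endl;
   return 0;
}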
realschorfi (Registered Member)
hey,
thank you for the reply.
I'll try the devel branch and report my results soon. It should also cover another need of mine, the product of all coefficients of a matrix/vector; I rolled my own for that until now ;) (see the sketch below).
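
(For reference, the devel branch has that reduction built in; a minimal sketch, assuming the Eigen 3 prod() API:)

Code:
#include <Eigen/Dense>
#include <iostream>

int main()
{
   Eigen::Vector3d v(2.0, 3.0, 4.0);
   std::cout << v.prod() << std::endl;  /* product of all coefficients: 24 */
   return 0;
}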

Matlab is not using parallel computing by default, and/or it was using just one core for my experiments, as I could easily observe through gkrellm. (If my memory is correct, an extra toolbox was needed for parallel computing...) I tried OpenMP to parallelise computation elsewhere in my Eigen code, but failed to use both cores; it seems to start multiple threads, but they all run on one core only. Strange, I guess it's my fault, but that's another topic :).

I am using -march=native, which I thought includes SSE2 (see the check below). I will state it explicitly for a test.
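
(One way to double-check which -m flags -march=native actually turns on is to ask the GCC driver to print the expanded compiler invocation; a quick check, assuming a GCC recent enough to support -march=native:)

Code:
g++ -march=native -E -v - </dev/null 2>&1 | grep cc1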

I wasn't aware of the symmetric-matrix optimisation, but true, that could be a reason :)

Thank you, I will check.

kind regards
schorfi
ggael (Moderator)
realschorfi wrote: Matlab is not using parallel computing by default, and/or it was using just one core for my experiments


Sorry, but that cannot be true. Let me explain.

Your product requires 168e9 flops (one multiply and one add for each of the 4096 * 4096 * 5000 coefficient pairs, i.e. 2 * 4096 * 4096 * 5000 ≈ 168e9). If we assume that only half of the result is computed, 84e9 flops remain. If Matlab performs it in 9 seconds, that means 9.3 GFLOPS, while the peak performance of your 1.6GHz Core 2 is 1.6 * 2 (SSE) * 2 (one mul and one add per cycle) = 6.4 GFLOPS.

So Matlab must have used both of your CPU cores.

The other explanation is that you used single-precision floating-point matrices. (I guess the default is double precision, right?)
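
(The same arithmetic, spelled out as a tiny check program; the 4096 x 5000 size is taken from the code in the first post:)

Code:
#include <cstdio>

int main()
{
   double flops = 2.0 * 4096 * 4096 * 5000;  /* one mul + one add per coefficient pair: ~168e9 */
   double half  = flops / 2.0;               /* symmetric result => ~84e9 */
   double rate  = half / 9.0 / 1e9;          /* Matlab's 9-second run => ~9.3 GFLOPS */
   double peak  = 1.6 * 2.0 * 2.0;           /* GHz * SSE width * (mul+add per cycle) = 6.4 */
   std::printf("%g flops, %g halved, %.1f GFLOPS vs %.1f GFLOPS peak\n",
               flops, half, rate, peak);
   return 0;
}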
Hauke (Registered Member)
Hi,

over here, Matlab is definitely using two cores, and it does not use or require the Parallel Computing Toolbox for this (you would need that for things like parfor). I verified via 'license inuse'.

Looking at the threads consuming most of the time (via 'Process Explorer'), one can observe that libguide40.dll!_kmp_launch_worker is (at least over here) hard at work, which means OpenMP is being used. The run-time is the same, 9 seconds.

Regards,
Hauke
realschorfi (Registered Member)
hey,
you are right, it is indeed using two cores;
I am not sure what I observed before.

My fault, sorry :/
realschorfi (Registered Member)
Sorry for the mix-up, I think I confused my own recollections. My model uses MEX, which runs on a single core, and due to lack of memory on my laptop I cannot run "big" experiments producing matrices of size 4000+. So what I observed was quasi-single-core behaviour: one core was busy and the other mostly idle, apart from little peaks that seem to be related to the matrix multiplication and solving. The explanation is that my Matlab model spends most of its time generating random data (matrix multiplication and solving are only a small fraction), and that seems to occupy only one core; I forgot that fact :-/


However, the devel branch is indeed faster: the "usual" matrix multiplication (matrix * matrix.transpose()) now takes around 35 seconds, and computing only half of the symmetric result takes 17 seconds.

That's a good improvement. I hope parallel computing comes soon.

best regards
schorfi

