
Eigen multiplication speed

avishekk
Registered Member
Posts: 4 | Karma: 0

Eigen multiplication speed

Tue Apr 22, 2014 4:04 pm
Hi,

I've been using Eigen and up until now haven't had any real performance issues with my code. But now I'm multiplying quite large matrices (400+ rows/columns) and am getting VERY slow program speeds.

An example piece of code is:
-------------------------
Eigen::MatrixXd Sc;
Sc = Hcl.transpose()*Qblock*Hcl + Hclu.transpose()*Rblock*Hclu + 2*Hcl.transpose()*Nblock*Hclu + Hclnp.transpose()*P*Hclnp;
------------------------

In the example all the matrices (except Sc) are members of a class instance named 'predictionModel'; I just removed the predictionModel. prefix above for brevity.

Hcl is 400x100
Qblock is 400x400
Hclu is 100x100
Rblock is 100x100
Nblock is 400x100
Hclnp is 8x100

In MATLAB the calculation takes 0.004555s; using Eigen it's taking over 0.5s (it's hard to get exact timings; in debug mode it's running at 5s).
At present, in release mode I have O2, SSE2, EIGEN_NO_DEBUG all set.

Should I be grouping this calculation differently?
Qblock and Rblock only have non zero entries on the diagonal, should I make these sparse matrices?

Any help would be much appreciated.

Best regards,

Avi
ggael
Moderator
Posts: 3447 | Karma: 19

Re: Eigen multiplication speed

Tue Apr 22, 2014 8:56 pm
What's your compiler? The following takes 0.004s on my laptop using Eigen 3.2 with a recent clang or gcc:
#include <bench/BenchTimer.h>
#include <iostream>
using namespace Eigen;
int main()
{
  MatrixXd Hcl(400,100);    Hcl.setRandom();
  MatrixXd Qblock(400,400); Qblock.setRandom();
  MatrixXd Hclu(100,100);   Hclu.setRandom();
  MatrixXd Rblock(100,100); Rblock.setRandom();
  MatrixXd Nblock(400,100); Nblock.setRandom();
  MatrixXd Hclnp(8,100);    Hclnp.setRandom();
  MatrixXd P(8,8);          P.setRandom();

  MatrixXd Sc;
  BenchTimer t;
  BENCH(t, 3, 1, Sc.noalias() = Hcl.transpose()*Qblock*Hcl + Hclu.transpose()*Rblock*Hclu + 2*Hcl.transpose()*Nblock*Hclu + Hclnp.transpose()*P*Hclnp);
  std::cout << t.best() << " " << Sc(0,0) << "\n";
}

It also takes only 0.0024s on our AVX branch.
avishekk
Registered Member
Posts: 4 | Karma: 0

Re: Eigen multiplication speed

Wed Apr 23, 2014 8:43 am
Hi ggael,

Just while typing this reply I found that even though I had selected /O2, the optimisations weren't being applied for some reason! My apologies for wasting your time in that regard.

Now the code is much faster, but I think I could still speed things up with my own optimisations. Do you know of any other compiler settings that can slow down Eigen's performance? (I'm using MSVS 2008 with Eigen 3, last updated 29/1/2014.)

My current set up uses:
/Ob2 /Ot /GL /I "C:\Projects\Controllers\RnD\MPCCART3newAPI\Source\..\Infrastructure\Source\Eigen" /D "WIN32" /D "WIN32_LEAN_AND_MEAN" /D "_USE_32BIT_TIME_T" /D "_CRT_SECURE_NO_WARNINGS" /D "NOMINMAX" /D "EIGEN_NO_DEBUG" /D "DNDEBUG" /D "_MBCS" /FD /EHsc /MT /Zp4 /arch:SSE2 /W3 /nologo /c /Zi /TP

Regards,

Avi
ggael
Moderator
Posts: 3447 | Karma: 19

Re: Eigen multiplication speed

Wed Apr 23, 2014 9:18 am
I don't know much about MSVC, but compiling in 64-bit mode helps a lot (more registers are available).
avishekk
Registered Member
Posts: 4 | Karma: 0

Re: Eigen multiplication speed

Wed Apr 23, 2014 11:36 am
Hi ggael,

I've found another bottleneck in my code:

BENCH(t, 3, 1, Lc1.noalias() = 2*((Hcl.transpose()*Qblock*Pcl + Hclu.transpose()*Rblock*Pclu + Hclnp.transpose()*P*Pclnp)*scaledState +
(Hcl.transpose()*Qblock*Pclop + Hclu.transpose()*Rblock*Pcluop + Hclnp.transpose()*P*Pclopnp)*stateTarget +
(Hcl.transpose()*Qblock*Dcl + Hclu.transpose()*Rblock*Dclu + Hclnp.transpose()*P*Dclnp)*deltaDisturbance));

This is a slightly larger version of the previous example where:
stateTarget is 400x1 and deltaDisturbance is 200x1

Currently this is taking about 0.6s (as returned by BENCH) while in MATLAB it's taking 0.006s.

I've separated out the expression and

(Hcl.transpose()*Qblock*Pclop + Hclu.transpose()*Rblock*Pcluop + Hclnp.transpose()*P*Pclopnp)*stateTarget
(Hcl.transpose()*Qblock*Dcl + Hclu.transpose()*Rblock*Dclu + Hclnp.transpose()*P*Dclnp)*deltaDisturbance

seem to be the culprits, though I'm not entirely sure why.

Regards,

Avi
avishekk
Registered Member
Posts: 4 | Karma: 0

Re: Eigen multiplication speed

Wed Apr 23, 2014 1:00 pm
Okay, after a bit more investigation I've found that a slight code revision makes the calculation around 30x faster:
------------------------------------------------------------
// Slow code
Lc1.noalias() = 2*((predictionModel.Hcl.transpose()*predictionModel.Qblock*predictionModel.Pcl +
                    predictionModel.Hclu.transpose()*predictionModel.Rblock*predictionModel.Pclu +
                    Hclnp.transpose()*predictionModel.P*Pclnp)*scaledState +
                   (predictionModel.Hcl.transpose()*predictionModel.Qblock*predictionModel.Pclop +
                    predictionModel.Hclu.transpose()*predictionModel.Rblock*predictionModel.Pcluop +
                    Hclnp.transpose()*predictionModel.P*Pclopnp)*stateTarget +
                   (predictionModel.Hcl.transpose()*predictionModel.Qblock*predictionModel.Dcl +
                    predictionModel.Hclu.transpose()*predictionModel.Rblock*predictionModel.Dclu +
                    Hclnp.transpose()*predictionModel.P*Dclnp)*deltaDisturbance);

// Split Lc calculation for speed
temp1 = predictionModel.Hcl.transpose()*predictionModel.Qblock*predictionModel.Pcl +
        predictionModel.Hclu.transpose()*predictionModel.Rblock*predictionModel.Pclu +
        Hclnp.transpose()*predictionModel.P*Pclnp;
temp1 *= scaledState;

temp2 = predictionModel.Hcl.transpose()*predictionModel.Qblock*predictionModel.Pclop +
        predictionModel.Hclu.transpose()*predictionModel.Rblock*predictionModel.Pcluop +
        Hclnp.transpose()*predictionModel.P*Pclopnp;
temp2 *= stateTarget;

temp3 = predictionModel.Hcl.transpose()*predictionModel.Qblock*predictionModel.Dcl +
        predictionModel.Hclu.transpose()*predictionModel.Rblock*predictionModel.Dclu +
        Hclnp.transpose()*predictionModel.P*Dclnp;
temp3 *= deltaDisturbance;

temp1 += temp2 + temp3;
temp1 *= 2;
------------------------------------------------------------

I don't understand why the second case is faster, would you be able to give me some insight?
ggael
Moderator
Posts: 3447 | Karma: 19

Re: Eigen multiplication speed

Wed Apr 23, 2014 2:44 pm
I don't understand how the second version could be faster, but let me give you some hints. When using vectors (e.g. 400x1), make sure that you are using VectorXd rather than a matrix type. Then you must take care about the associativity of operator*: for instance, A*B*v is much slower than A*(B*v) when A and B are matrices and v is a vector, because the second version involves far fewer operations.

