
Declaration and optimization in Eigen

jr001 (Registered Member):
I am working with a function that is called many times, and depends on efficient computations. I wanted to make sure I was getting the most performance out of Eigen.

The function uses 3x3 matrices to store intermediate results, and to build the large matrix Y.

-Is it more efficient to declare the 3x3 matrices as class variables, so they are not re-instantiated every time the function is called, or can the compiler "optimize away" the instantiations?

- Will the compiler exploit the fact that some of the matrices are constant?

- Can the compiler exploit the special properties of Y, when multiplying with P_map?
(Less than half of the elements of Y are nonzero, and only about half of the non-zero elements change each time the function is called.)



Note: this code is only the computation-intensive part of the function, so some of it may not make sense out of context.


Code: Select all
// this part of the code builds the 3x3 matrices that are needed to construct the 15x15 matrix Y

  // capital letters are for 3x3 matrices, lowercase for 3x1 vectors



  // crossprodmat(v), where v is a 3x1 vector, is the matrix

  // [  0    -v(2)   v(1);

  //    v(2)   0    -v(0);

  //   -v(1)   v(0)   0   ]
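
  // One possible definition of this helper (assumed for context; the real
  // crossprodmat is defined elsewhere in the class):
  //
  //   inline Matrix3d crossprodmat(const Vector3d& v) {
  //     Matrix3d m;
  //     m <<    0., -v(2),  v(1),
  //          v(2),    0., -v(0),
  //         -v(1),  v(0),    0.;
  //     return m;
  //   }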



   Matrix3d A = B * C.transpose();

   Matrix3d D = B * E.transpose();

   Matrix3d F = B * G.transpose();

   Matrix3d H = B * I.transpose();

 

   Matrix3d J = 1./24.*( Matrix3d::Identity() + 4.*A + D);

   Matrix3d K = 1./12.*( Matrix3d::Identity() + 4.*D + F);

   Matrix3d L = 1./6.*( Matrix3d::Identity() + 4.*F + H);


   L = 1./6.*( 4.*K + L);



   Vector3d t1 = 0.5*(a + b) - c;

   Vector3d t2 = a - c;

   Matrix3d T1 = crossprodmat(t1) * B.transpose() * K;

   Matrix3d T2 = crossprodmat(t2) * B.transpose() * L;

   Matrix3d M = 1./6.*( 4.*T2 + T1);


   Vector3d t3 = (0.75*a + 0.25*b) - c;

   Matrix3d T4 = crossprodmat(t3) * B.transpose() * J;

   Matrix3d M_mid = 1./12.*( 4.*T1 + T4);





   Matrix3d N = 1./6.*( 4.*M_mid + M);

   

   Vector3d o = e - f - d - (0.5*g) ;

   Vector3d p = d - d2 - g;





  // here R and Z are matrices whose values are known at compile time

  Matrix3d R;

  R << const1  , const2, 0,

        const3, const4, 0,

        0, 0, 1;

 

  Vector3d z;

  z << 0 , 0, const5;

  Matrix3d Z = crossprodmat(z);





   Matrix3d Q = -R*crossprodmat(o)*B.transpose();

   Matrix3d S = Z*R*crossprodmat(o)*B.transpose() - R*crossprodmat(p)*B.transpose();

            

   Matrix3d T = R + Z*R*Z;

   Matrix3d U = -Z*w*R*Z;

   Matrix3d V = -Z*w*R + R;



   Matrix3d M_e = -Z*R*N + R*M;



   Matrix3d W = -R * B.transpose() * L;

   Matrix3d X =  Z*R*B.transpose()*L - R*B.transpose()* L;



 

   //Matrix<double,15,15> Y ;   //  initialized at the constructor

   Y.block(0,0,3,3) = H.transpose();

   Y.block(0,9,3,3) = -(H.transpose())*( L);

   Y.block(3,0,3,3) = Q;

   Y.block(3,3,3,3) = T;

   Y.block(3,6,3,3) = U;

    Y.block(3,9,3,3) = R*N;

   Y.block(3,12,3,3) = W;

   Y.block(6,0,3,3) = S;

   Y.block(6,3,3,3) = V;

   Y.block(6,9,3,3) = M_e;

   Y.block(6,12,3,3) = X;

  Y.block(9,9,3,3) = Matrix3d::Identity();

  Y.block(12,12,3,3) = Matrix3d::Identity();



   // Q_diag holds the diagonal coefficients of a 15x15 diagonal matrix, whose values are known at compile time

  Matrix<double,15,1> Q_diag;

  Q_diag << const5, const5, const5, 0, 0, 0, const6, const6, const6, const7, const7, const7, const8, const8, const8;



 

   // P_map_buffer is a double*, which is a class member. It stores the elements of the

  // matrix P_map in row-major format, with a stride of LDIM



  Eigen::Map<Matrix<double,15,LDIM,Eigen::RowMajor> > P_map(P_map_buffer,15,LDIM);



  P_map.block<15,15>(0,0) += ( 0.5 * Q_diag.asDiagonal());

  P_map.block<15,15>(0,0) = Y * P_map.block<15,15>(0,0);

  P_map.block<15,15>(0,0) = P_map.block<15,15>(0,0) * Y.transpose();

  P_map.block<15,15>(0,0) += ( 0.5 * Q_diag.asDiagonal());

Thank you for any help.
ggael (Moderator):
jr001 wrote: Is it more efficient to declare the 3x3 matrices as class variables, so they are not re-instantiated every time the function is called, or can the compiler "optimize away" the instantiations?


Since they are fixed-size matrices, they are allocated on the stack, so declaring them as class variables won't help. On the other hand, you can declare them directly as a sub-matrix of Y:
Code: Select all
Block<YType,3,3> Q(Y,3,0);


Likewise, it is faster to rewrite:
Code: Select all
Y.block(3,3,3,3) = T;

as
Code: Select all
Y.block<3,3>(3,3) = T;
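
For instance, here is a minimal sketch of the named-block idea (YType, fillQ and the arguments are only placeholders standing in for your own types and variables):
Code: Select all
#include <Eigen/Core>
using namespace Eigen;

typedef Matrix<double,15,15> YType;

void fillQ(YType& Y, const Matrix3d& R, const Matrix3d& B)
{
    // Q is a named 3x3 view of Y starting at row 3, column 0: assigning to Q
    // writes the result into Y directly, instead of going through a named
    // Matrix3d that is then copied into Y.
    Block<YType,3,3> Q(Y, 3, 0);
    Q = -R * B.transpose();
}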


jr001 wrote: Will the compiler exploit the fact that some of the matrices are constant? Can the compiler exploit the special properties of Y, when multiplying with P_map?
(Less than half of the elements of Y are nonzero, and only about half of the non-zero elements change each time the function is called.)


Unfortunately the answer is no. For your use case, the best would be to store Y in a sparse block matrix of fixed size blocks, but we don't have that in Eigen yet. We do have generic sparse matrices, but in your case I don't think that's worth it (your matrices are too small).

Make sure you compile your program with a recent compiler, with -DNDEBUG, and with optimizations (-O2) and SSE2 (-msse2) enabled. Also, since your matrices are 15x15, you could try to define:
Code: Select all
#define EIGEN_CACHEFRIENDLY_PRODUCT_THRESHOLD 15

just before including Eigen/Core. This constant controls whether a naive or a cache-friendly product implementation is used (the default is 16).
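
For instance, at the top of the translation unit this would look like:
Code: Select all
// must be defined before the first Eigen include
#define EIGEN_CACHEFRIENDLY_PRODUCT_THRESHOLD 15
#include <Eigen/Core>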

Also:
Code: Select all
P_map.block<15,15>(0,0) += ( 0.5 * Q_diag.asDiagonal());

can be optimized as:
Code: Select all
P_map.block<15,15>(0,0).diagonal() += 0.5 * Q_diag.asDiagonal();
bjacob (Registered Member):
ggael wrote: Also:
Code: Select all
P_map.block<15,15>(0,0) += ( 0.5 * Q_diag.asDiagonal());

can be optimized as:
Code: Select all
P_map.block<15,15>(0,0).diagonal() += 0.5 * Q_diag.asDiagonal();


You mean:
Code: Select all
P_map.block<15,15>(0,0).diagonal() += 0.5 * Q_diag;
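
(.diagonal() returns a vector expression here, so you add the plain coefficient vector Q_diag rather than the diagonal matrix built from it.)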


Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list!
jr001 (Registered Member):
Thanks for the suggestions. I have another performance question with a simple example.

I tested this in a loop against a C version using static arrays, and the tvmet library.

The Eigen version always seems much slower. I did notice an improvement when I forced lazy evaluation, though.

Here are some average times:

for 100,000,000 loops:
static: 4 seconds
tvmet: 7 seconds
eigen: 25 seconds

for 1,000,000,000 loops:
static: 36 seconds
tvmet: 1 minute,10 seconds
eigen: 4 minutes, 9 seconds


Is there something else I could do to increase speed?



(All matrices A, B, C, ... are Matrix3d)
Code: Select all
      A = (B * C.transpose()).lazy();

      D = (B * E.transpose() ).lazy();
      F = ( B * G.transpose() ).lazy();
      H = ( B * P.transpose() ).lazy();
      
      I = ( a/24.*( Matrix3d::Identity() + 4.*A + D) ).lazy();
      J = ( a/12.*( Matrix3d::Identity() + 4.*D + F) ).lazy();
      K = ( a/6.*( Matrix3d::Identity() + 4.*F + H) ).lazy();

      L = ( a/6.*(4.*J + K) ).lazy();
ggael (Moderator):
Hm, that's strange. Can you post the static and tvmet versions, as well as the surrounding code? I suspect the compiler detects that some parts can be factored out of the loop (or removed completely), which would make your benchmark meaningless.

Indeed, here you have about 337 floating-point operations per iteration, so 36 s for 1e9 iterations means 337e9 / 36 ≈ 9 GFlops => a 9 GHz CPU! (Well, in some cases current CPUs are able to perform one + and one * per cycle, so let's say this implies a 4.5 GHz CPU.)

So, could you also tell us what your CPU and compiler are?

Here it takes ~10 s for 1e8 iterations on a 2.66 GHz Core2 (only one core used). I checked the assembly and nothing has been hoisted out of the loop. This means ~3.37 GFlops; compared to the theoretical peak of 5.32 GFlops, that is not bad at all.

Also, the .lazy() calls are only needed around the matrix products. The other expressions are always evaluated lazily.
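
In other words, with the names from your snippet, something like this (just a sketch of which calls matter):
Code: Select all
A = (B * C.transpose()).lazy();                  // matrix product: .lazy() avoids the temporary
I = a/24.*( Matrix3d::Identity() + 4.*A + D );   // sums and scalings are already lazy, no .lazy() needed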
ggael (Moderator):
To give you an example, here is a fair benchmark that wraps the key piece of code in a function which is forced not to be inlined:

http://pastebin.com/m12b8aa87

This way you guarantee the compiler won't optimize too much.
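
The general shape is something like the sketch below (only an illustration of the idea, not the exact pastebin code; the function name and GCC's noinline attribute are my own choices here):
Code: Select all
#include <iostream>
#include <Eigen/Core>
using namespace Eigen;

// Keeping the kernel out-of-line stops the compiler from hoisting the work
// out of the timing loop or removing it altogether.
__attribute__((noinline))
void kernel(Matrix3d& A, const Matrix3d& B, const Matrix3d& C)
{
    A = (B * C.transpose()).lazy();   // Eigen 2 style, as in the snippets above
}

int main()
{
    Matrix3d A, B = Matrix3d::Random(), C = Matrix3d::Random();
    for (long i = 0; i < 100000000L; ++i)
        kernel(A, B, C);
    std::cout << A.sum() << std::endl;   // keep the result observable
    return 0;
}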
jr001 (Registered Member):
Sorry, I forgot to mention the compiler. Those estimates were with icpc and an Intel i7 processor, using the -O3 flag. I know it isn't a proper benchmark; I just wanted to roughly recreate what my main program is dealing with.

Recently, I tried the tests with g++ and the results were the opposite: the static C version was slowest, while Eigen and tvmet got faster.

Are there any ways I can improve performance when using icpc?

I will post the rest of the tests soon if that helps.

