
Low efficiency using Eigen...

afhidalgo77
Fri May 14, 2010 11:39 am
Hi
I started using Eigen because I am working on real-time applications. The benchmark at http://eigen.tuxfamily.org/index.php?title=Benchmark encouraged me to give it a try.
The accuracy of the results is fine, but I have run into efficiency problems with the example I am trying to implement. I am writing to ask whether I am using Eigen incorrectly, not setting the best compiler options, or both.

To begin with, I will explain the example I am trying to solve, and then the main compiler options I used to build the program.

Testing example
I have to evaluate the following general expression many times:
A (4,1) = B(4,4) * C(4,1); //where A and C are column vectors and B is a 4x4 matrix.

The expression above is evaluated inside two nested for loops: the outer one runs from 0 to nB and the inner one from 0 to nPoints, where nPoints takes a different value for each of the nB outer iterations, but those values do not change from one call to the next.
Each C vector is a row of a matrix TRP(nRows,4), and the resulting A vector must be stored in the corresponding row of a matrix of the same shape, called TP.

Now, the variables of my problem are:
ATT(4,4) //This is the matrix B shown in the previous general expression
TRP (nRows,4) //Matrix with nRows rows and 4 columns.
TP (nRows,4) //Matrix with nRows rows and 4 columns.

The matrices ATT and TRP and all the other values (nB, nPoints and nRows) are known at all times. Only the TP matrix needs to be computed.

Resolution
To work out the solution I wrote two different functions in the same program. The first solves the problem the classical way, with plain loops, and the second uses Eigen. Both produce identical numerical results in every case, but the first is almost 30% faster than the Eigen version.

Since the arrays ATT and TRP are known at all times, I create the Eigen matrices as views onto them using Map, as below:

Map<Matrix4d, Aligned> A44(AT[0],4,4);
Map<MatrixXd, Aligned> mTNP(TP[0],4,nRows);
Map<MatrixXd, Aligned> mTNRP(TRP[0],4,nRows);
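
A note on storage order, since it explains the transpose() that appears in the Eigen product further down: a C array such as double AT[4][4] is row-major, whereas Eigen's Matrix4d is column-major by default, so a Map over the same 16 doubles sees the transpose of AT. Likewise, if the rows of TRP are contiguous in memory (an assumption), each 4-double row shows up as one column of a 4 x nRows map. The sketch below demonstrates the effect without Eigen; columnMajorAt and rowMajorAt are hypothetical helpers introduced only for illustration.

```cpp
// Element (row, col) of a 4x4 block of 16 doubles when the block is read as
// column-major (the default storage order of Eigen matrices).
double columnMajorAt(const double* data, int row, int col)
{
    return data[col * 4 + row];
}

// Element (row, col) of the same block when read as row-major, which is how
// a C array declared as double AT[4][4] is laid out.
double rowMajorAt(const double* data, int row, int col)
{
    return data[row * 4 + col];
}
```

For any block, columnMajorAt(data, r, c) equals rowMajorAt(data, c, r), which is exactly a transposition.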

A brief summary of the C++ code is copied below.

Compiler settings
I have built and run the program with two toolchains, Visual Studio 2005 and 2008. With Visual Studio 2008 I used both its own compiler and the Intel® Compiler Suite Professional Edition 11.1 for Windows.

The main options I have set in Release mode are:
C/C++
  Optimization
    Optimization: Maximize Speed (/O2)
  Code Generation
    Runtime Library: Multi-threaded DLL (/MD)
    Enable Enhanced Instruction Set: Streaming SIMD Extensions 2 (/arch:SSE2)
Linker
  Debugging
    Generate Debug Info: Yes (/DEBUG)

I have tried setting the last option to No, but although compiling and linking succeed, the program then fails at run time because it lacks debug information.

The code is the following:

Code: Select all
#define EIGEN_NO_MALLOC
#include <Eigen/Eigen>

void simulIterPos(double **TRP, double **TP, double AT[4][4]);
void simulIterPosEigen_B(double **TRP, double **TP, double AT[4][4], Eigen:: Map <Matrix4d, Aligned> A44, Eigen:: Map <MatrixXd, Aligned> mTNP, Eigen:: Map <MatrixXd, Aligned> mTNRP);

int main()
{
   
   int nRows, j;
   double** TRP, **TP;
    double AT[4][4];
   //_________________________________
    //All the following data are known
   //AT(4,4); Matrix
   //TP(nRows, 4); Matrix
   //TRP(nRows, 4); Matrix
   //nRows
   //_________________________________
   
    //Eigen matrices as pointers
   Map<Matrix4d, Aligned> A44(AT[0],4,4);
   Map<MatrixXd, Aligned> mTNP(TP[0],4,nRows);
   Map<MatrixXd, Aligned> mTNRP(TRP[0],4,nRows);
   //___________________________________________
   //########### MAIN BLOCK ##############

      for(j=0; j<20000; j++){
         simulIterPos(TRP, TP, AT); //Classic method (Iterative)
         simulIterPosEigen_B(TRP, TP, AT, A44, mTNP, mTNRP); //Eigen method
      }
   //___________________________________________
         
}


//Function Without EIGEN
void simulIterPos(double **TRP, double **TP, double AT[4][4]){
   
   int i, ii, j;
   int nB;
   int nPoints;
   int nPrev = 0;


   for(i=0; i<nB; i++){
      //nPoints is evaluated on each one of the nB loops
      for (ii=0; ii<4; ii++){
         for (j=0; j<nPoints; j++) {
            TP[nPrev+j][ii] = AT[ii][0] * TRP[nPrev+j][0]+
                              AT[ii][1] * TRP[nPrev+j][1]+
                              AT[ii][2] * TRP[nPrev+j][2]+
                              AT[ii][3] * TRP[nPrev+j][3];
         }
      }
      nPrev += j; //j == nPoints once the inner loop has finished
   }

}

//Function with EIGEN
void simulIterPosEigen_B(double **TRP, double **TP, double AT[4][4], Eigen:: Map <Matrix4d, Aligned> A44, Eigen:: Map <MatrixXd, Aligned> mTNP, Eigen:: Map <MatrixXd, Aligned> mTNRP){
   
   int i, j;
   int nB;
   int nPoints;
   int nPrev = 0;


   for(i=0; i<nB; i++){
      //nPoints is evaluated on each one of the nB loops
      for (j=0; j<nPoints; j++) {
         mTNP.col( nPrev+j)= A44.transpose() * mTNRP.col( nPrev+j);
      }
      nPrev += j;
   }

}



Could anybody help me with this low efficiency problem?
Thanks.
Hauke
Fri May 14, 2010 3:08 pm
I have little time at the moment, but could you try out the following two options?

1) Change your maps to: Map< Matrix<double,4,Dynamic>, Aligned>
Code: Select all
Map<Matrix4d, Aligned> A44(AT[0],4,4);
Map<Matrix<double,4,Dynamic>, Aligned> mTNP(TP[0],4,nRows);
Map<Matrix<double,4,Dynamic>, Aligned> mTNRP(TRP[0],4,nRows);

2) Simply change your product to the following (leave out the for-loops):
Code: Select all
mTNP = A44.transpose() * mTNRP;


HTH,
Hauke

Note: This advice applies to the development branch...
drewm1980
Fri May 14, 2010 4:39 pm
I see that you have automatic vectorization turned on in your compiler, but you might want to put the compiler in verbose mode to find out whether your loops are actually getting vectorized. I know gcc and icc can do this; in fact, I seem to remember icc giving output of this sort by default. In our work we've seen at least one matrix multiplication that ran faster with (auto-vectorized) nested for loops than with calls to OpenCV (which I think was wrapping the precompiled Debian ATLAS BLAS at the time).

Cheers,
Drew
bjacob
Fri May 14, 2010 4:50 pm
For Aligned Map, you absolutely need the development branch of Eigen, aka Eigen3. It does not work in Eigen2.

But then do note that you must guarantee that the pointer you're passing to it is 16-byte aligned. So you must declare your AT array with an alignment attribute. You can use EIGEN_ALIGN16.


Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list!
afhidalgo77
Tue May 18, 2010 11:12 am
First, what is the advantage of using an Aligned Map instead of a plain Map?

All I need is to pass in the arrays AT and TNRP, which have been defined earlier in an external function.

How can I declare the AT array with an alignment attribute? By using the /Zp (Struct Member Alignment) compiler option?
bjacob
Tue May 18, 2010 11:21 am
If you want vectorization (SSE...) of your 4x4 objects to happen, the arrays must be aligned at 16-byte boundaries. This is how SSE and friends work. You can do that in a portable way with:

Code: Select all
EIGEN_ALIGN16 double my_array[size];


So this requires that you have control over the creation of the array in question.

Then you must tell Eigen that it can rely on the pointer being aligned. This is what Aligned does, but it only works correctly in the devel branch.

Otherwise, you can also forget about Aligned altogether. Your code will run safely, just without SSE. The point is that here you have a fixed size. If you had dynamic size, you'd still get vectorization without Aligned (although Aligned would still help).


Hauke
Tue May 18, 2010 11:22 am
afhidalgo77 wrote:First, what is the advantage of using an Aligned Map instead of a plain Map?

All I need is to pass in the arrays AT and TNRP, which have been defined earlier in an external function.

How can I declare the AT array with an alignment attribute? By using the /Zp (Struct Member Alignment) compiler option?


The advantage of Aligned Map is that Eigen knows that in that case it may use aligned loading routines when using SSE.

Aligning stack data is done as follows (you need to test this, since I am not sure it works for two-dimensional array declarations; EIGEN_ALIGN16 double AT[16] works for sure):

EIGEN_ALIGN16 double AT[4][4];
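
One way to test this is to check the addresses at run time. The sketch below uses the standard C++11 alignas(16) specifier as a stand-in for EIGEN_ALIGN16 or __declspec(align(16)); that substitution is an assumption and needs a C++11 compiler, which the MSVC versions discussed here are not. The names isAligned16 and stackArraysAreAligned are hypothetical.

```cpp
#include <cstdint>

// True if p sits on a 16-byte boundary, as aligned SSE loads require.
bool isAligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Declares a 2D and a 1D stack array with an alignment specifier and reports
// whether the pointers that would be handed to Map are 16-byte aligned.
bool stackArraysAreAligned()
{
    alignas(16) double AT[4][4] = {};  // stand-in for EIGEN_ALIGN16 double AT[4][4]
    alignas(16) double VI[4] = {};     // stand-in for EIGEN_ALIGN16 double VI[4]
    return isAligned16(AT[0]) && isAligned16(VI);
}
```

The same isAligned16 check can be applied to any pointer before wrapping it in an Aligned Map, for example inside an assert in debug builds.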

- Hauke
afhidalgo77
Tue May 18, 2010 11:45 am
I have declared:

EIGEN_ALIGN16 double AT[4][4];

..., but there is a compile error C2065: 'EIGEN_ALIGN16' : undeclared identifier.

I have also tried with a double AA[10], just for testing, but the result is the same...
Hauke
Tue May 18, 2010 11:52 am
afhidalgo77 wrote:I have declared:

EIGEN_ALIGN16 double AT[4][4];

..., but there is a compile error C2065: 'EIGEN_ALIGN16' : undeclared identifier.

I have also tried with a double AA[10], just for testing, but the result is the same...


Did you include <Eigen/Core> ? The definition is located in Eigen\src\Core\util\Macros.h.

Alternatively, since you are running on MSVC you can do:

Code: Select all
__declspec(align(16)) double AA[10];


The Eigen declaration is just more portable.

- Hauke
afhidalgo77
Tue May 18, 2010 12:40 pm
After setting Struct Member Alignment (/Zp16) in the compiler options and declaring all variables (AT, TNP, TP) with __declspec(align(16)), the program runs correctly, but it is still slower than the classic method...
Hauke
Tue May 18, 2010 2:47 pm
afhidalgo77 wrote:After setting Struct Member Alignment (/Zp16) in the compiler options and declaring all variables (AT, TNP, TP) with __declspec(align(16)), the program runs correctly, but it is still slower than the classic method...


Hi again,

As soon as you post a self-contained example that allows us to verify your issue, we will be happy to assist you further.

Below you will find a starting point for creating such a self-contained example. For timing your code, please take a look at Eigen/Bench/BenchTimer.h.

Code: Select all
#include <Eigen/Eigen>

using namespace Eigen;

int main()
{
  const int nRows = 1000;

  Eigen::Matrix4d _AT = Eigen::Matrix4d::Random();
  Eigen::MatrixX4d _TP = Eigen::MatrixX4d::Random(nRows, 4);
  Eigen::MatrixX4d _TRP = Eigen::MatrixX4d::Random(nRows, 4);

  double** TRP = new double*[nRows];
  double** TP = new double*[nRows];
  for (int i=0; i<nRows; ++i)
  {
    TRP[i] = new double[4];
    Vector4d::Map(TRP[i]) = _TRP.row(i);

    TP[i] = new double[4];
    Vector4d::Map(TP[i]) = _TP.row(i);
  }

  // needs some more code ...

  for (int i=0; i<nRows; ++i)
  {
    delete [] TRP[i];
    delete [] TP[i];
  }
  // also release the arrays of row pointers
  delete [] TRP;
  delete [] TP;
}
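
If BenchTimer is not at hand, a best-of-N timer can be sketched with the standard library alone. This is a stand-in written for illustration, not Eigen's BenchTimer: bestOfSeconds is a hypothetical name, and std::chrono assumes a C++11 compiler.

```cpp
#include <chrono>
#include <limits>

// Hypothetical best-of-N timer: runs f() `repetitions` times per try and
// returns the duration of the fastest try, in seconds.
template <typename F>
double bestOfSeconds(int tries, int repetitions, F f)
{
    using Clock = std::chrono::steady_clock;
    double best = std::numeric_limits<double>::infinity();
    for (int t = 0; t < tries; ++t) {
        const Clock::time_point start = Clock::now();
        for (int r = 0; r < repetitions; ++r) {
            f();  // code under test
        }
        const std::chrono::duration<double> elapsed = Clock::now() - start;
        if (elapsed.count() < best) {
            best = elapsed.count();
        }
    }
    return best;
}
```

A call such as bestOfSeconds(5, 2000000, [&]{ /* code under test */ }) reports the fastest of five tries. Taking the best try rather than the sum reduces the influence of other processes, and the result of the timed code should be used afterwards (printed, for instance) so the compiler cannot discard the work.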


Regards,
Hauke
afhidalgo77
Tue May 18, 2010 10:58 pm
Hi
Here is a complete, simplified example for testing. I have used both the Intel compiler and VC++ 2008, with /Zp16 for aligned data and SSE2 vectorization. The times were measured with the 'tbb' library; I still need to look more carefully at how to use Eigen's bench timer. With 'EIGEN_METHOD' I still get slower run times than with 'CLASSIC_METHOD'…

Code: Select all
//#define EIGEN_NO_MALLOC
#include <Eigen/Eigen>
#include <iomanip>
#include <iostream>
#include <cstdio>

#define EIGEN_TIMES
//#define OUTPUT_CHEKING //Commandline output cheking
#define CLASSIC_METHOD   
//#define EIGEN_METHOD   
using namespace Eigen;

using std::cout;
using std::endl;
using std::setw;
#ifdef EIGEN_TIMES
   #include "tbb/tick_count.h"
#endif


int main()
{
   FILE *pTimesEigen;
   int ierr;
   int nRows, i, j, jj,ii;
   int tt = 1;
   
#ifdef EIGEN_TIMES
   static double time_step[2] = { 0, 0};
   static tbb::tick_count time[2+1];
#endif

   //Transformation matrix  __declspec(align(16))
   __declspec(align(16)) double AT[4][4]={0.1, 0.2, 0.3, 2.0, 0.2, 0.3, 0.4, 3.0, 0.3, 0.4, 0.5, 4.0, 0.0, 0.0, 0.0, 1.0};
   __declspec(align(16)) double VI[4]={1, 2, 3, 4};
   __declspec(align(16)) double VO[4];
   

#ifdef OUTPUT_CHEKING
   cout<<"___________________________________________________"<<endl;
   cout<<"Transformation matrix"<<endl;
   cout<<"___________________________________________________"<<endl;
   for(i=0; i<4; i++){
      cout<<endl;
      for(j=0; j<4; j++){
         cout<<setw(10)<<AT[i][j]<< " ";
      }
      cout<<endl;
   }
#endif

    //Eigen matrices
   Map<Matrix4d, Aligned> A44(AT[0],4,4);
   Vector4d VOE;
      
//_____________________________________________________________________________________________________________
//########### MAIN BLOCK ##############

   for(jj=0; jj<20000; jj++){ //This simulates a 5s simulation with 4th order Runge-Kutta method using an integration step of 0.001s
      if (tt==0){
         VI[1] += 0.5;
         AT[1][1] += 0.5;
         tt = 1;
      }else{
         VI[1] -= 0.5;
         AT[1][1] -= 0.5;
         tt = 0;
      }
#ifdef EIGEN_TIMES
   time[0] = tbb::tick_count::now();
#endif
#ifdef CLASSIC_METHOD      
          //Block Without EIGEN
      for (ii=0; ii<4; ii++){
            VO[ii] = AT[ii][0] * VI[0]+
                   AT[ii][1] * VI[1]+
                   AT[ii][2] * VI[2]+
                   AT[ii][3] * VI[3];
      }
#endif
#ifdef EIGEN_METHOD      
      Map<Vector4d, Aligned> VIE(VI,4);
      VOE = A44.transpose() * VIE;
#endif      


#ifdef OUTPUT_CHEKING      
   cout<<"___________________________________________________"<<endl;
   cout<<"Option 1"<<endl;
   cout<<"___________________________________________________"<<endl;
   
   cout<<endl;
   for (j=0; j<4; j++){
      cout<<setw(10)<<(VOE[j]-VO[j])<< " "; //Option 1: cheking difference
      //cout<<setw(10)<<Tn_Points[i][j]<< " ";                  //Option 2: cheking Tn_Points, classic method
       //cout<<setw(10)<<Tn_Points1[i][j]<< " ";                 //Option 3: cheking Tn_Points1, BLASEIGEN method
   }
   cout<<endl;
#endif      
//________________________________________________________________________________________________________________


#ifdef EIGEN_TIMES
   time[1] = tbb::tick_count::now();
   time_step[1] += (time[1]-time[0]).seconds();
#endif
   }
    for (i=1; i<2; i++)  cout<<"Step"<<setw(5)<<i<<setw(10)<<time_step[i]<<endl;
//Printing times (Ouput file)
#ifdef EIGEN_TIMES
   /*
   ierr = fopen_s(&pTimesEigen, "EigenTimes.d","w");
   for (i=1; i<2; i++)
      fprintf_s(pTimesEigen, "Step %i %24.18f\n", i, time_step[i]);
   fclose(pTimesEigen);
   */
#endif

}


Regards,
Andrés
Hauke
Wed May 19, 2010 8:41 am
I am seeing a tiny performance impact, too. With 64 bit builds it vanishes though.

Here is my even more simplified test code:
Code: Select all
#include <Eigen/Eigen>
#include <Bench/Benchtimer.h>

#include <iostream>

using namespace Eigen;
using namespace std;

//#define CLASSIC_METHOD

EIGEN_DONT_INLINE void classic(double VO[4], double AT[4][4], double VI[4])
{
  for (int ii=0; ii<4; ii++)
  {
    VO[ii] = AT[ii][0] * VI[0] +
      AT[ii][1] * VI[1] +
      AT[ii][2] * VI[2] +
      AT[ii][3] * VI[3];
  }
};

template <typename OutputType, typename MatrixType, typename VectorType>
EIGEN_DONT_INLINE void modern(MatrixBase<OutputType>& VOE, const MatrixBase<MatrixType>& A44, const MatrixBase<VectorType>& VIE)
{
  VOE.noalias() = A44.transpose()*VIE;
};

int main()
{
  EIGEN_ALIGN16 double AT[4][4] = {0.1, 0.2, 0.3, 2.0, 0.2, 0.3, 0.4, 3.0, 0.3, 0.4, 0.5, 4.0, 0.0, 0.0, 0.0, 1.0};
  EIGEN_ALIGN16 double VI[4] = {1, 2, 3, 4};
  EIGEN_ALIGN16 double VO[4];

  //Eigen matrices
  Matrix4d A44 = Matrix4d::MapAligned(AT[0]);
  Vector4d VIE = Vector4d::MapAligned(VI);
  Vector4d VOE(0,0,0,0);

  BenchTimer timer;

  const int num_tries = 5;
  const int num_repetitions = 2000000;

#ifdef CLASSIC_METHOD
  BENCH(timer, num_tries, num_repetitions, classic(VO, AT, VI));
  std::cout << Vector4d::MapAligned(VO) << std::endl;
#else
  BENCH(timer, num_tries, num_repetitions, modern(VOE, A44, VIE));
  std::cout << VOE << std::endl;
#endif

  double elapsed = timer.best();
  std::cout << "elapsed time: " << elapsed*1000.0 << " ms" << std::endl;
}


For 64-bit builds, the assembly of the classic and modern methods is identical (when I remove the transposition). With the transposition present, Eigen produces even faster code than the hand-written one, though I did not investigate further.

In 32-bit builds there seem to be some issues with MSVC vectorizing properly. There are some movsd/fstp/fld ops hidden between the SSE instructions. I don't really know why and have no time to dig deeper.

Since many users rely on those tiny matrix multiplications, at some point we will definitely try to fix the issue. I filed a small enhancement-report so we don't forget to look into it in the future.

Regards,
Hauke
afhidalgo77
Wed May 19, 2010 10:13 am
Thanks Hauke,
If you don't mind, could you tell me, as a percentage, how much faster your Eigen implementation was than the hand-written code? How many cores does your processor have? And you are using vectorization, aren't you?

I will try moving to 64 bits, but if I want to achieve real-time performance, which are the main points to take care of?

To summarize:
- 64 bits
- Use vectorization
- Work with aligned arrays
- Other compiler settings???

Regards
Andrés
Hauke
Wed May 19, 2010 10:34 am
afhidalgo77 wrote:Thanks Hauke,
If you don't mind, could you tell me, as a percentage, how much faster your Eigen implementation was than the hand-written code? How many cores does your processor have? And you are using vectorization, aren't you?

I will try moving to 64 bits, but if I want to achieve real-time performance, which are the main points to take care of?

To summarize:
- 64 bits
- Use vectorization
- Work with aligned arrays
- Other compiler settings???

Regards
Andrés


What I am going to say applies to MSVC. First, you don't have to take care of alignment anymore when you are compiling in 64-bit mode, and only then. Neither do you need to take care of SSE/vectorization: alignment and SSE vectorization are both turned on by default. That said, it does not hurt to enforce alignment in all cases, since then your code should perform nearly equally well in 32-bit and 64-bit builds.

The number of cores does not matter here, since your tiny product will always be evaluated on a single core and there was no OpenMP/TBB involved.

One of the core features of Eigen is that you can use it for fixed-size as well as dynamic-size problems while getting almost the same performance as hand-written code. In theory the 'almost' should actually be 'exactly the same performance', but different compilers manage to optimize the code at different levels, and we have seen many times now that MSVC is not optimal, in particular for 32-bit builds. Your hand-written loop is trivial for the compiler to vectorize, and thus in this special case you are really expecting Eigen to produce the absolutely optimal SSE code.

If you can reformulate your problem so that you are working on bigger systems, Eigen will most likely outperform any naive hand-written code, since Gael took care of optimal cache utilization. So, going back to your initial problem, where you wanted to apply the 4x4 matrix to a bunch of points rather than a single one, Eigen should be faster than your hand-written loop.

I hope that helps you a bit more...

- Hauke

