This forum has been archived. All content is frozen. Please use KDE Discuss instead.

Performance issue with outer product.

Tags: None
(comma "," separated)
egery
Registered Member
Posts
2
Karma
0
Hello,

First, thanks to all Eigen team for the great work.

I've encountered a performance issue doing a simple outer product of vectors (M = V1 * V2.transpose();)
It's a slower than a hand made loop !
We can make it almost twice faster by explicitly iterating on V2 elements (thus having vector*scalar operation for each col of M). Vectorization issue ?
I've searched the forum for this specific issue, without success.

Tested with Eigen 3.1, gcc 4.4.5, gcc 4.7.1 with options :
-mmmx -msse -msse2 -O3 -msse3 -msse4 -mtune=pentium4 -march=i686 -DNDEBUG -lrt -I~/Eigen31


Results :
| *Dim1* | *Dim2* | *nbIter* | *By hand* | *Eigen* | *Eigen fixed* |
| 1024 | 50 | 3906 | 77 ms | 116 ms | 62 ms |

From this code :
Code: Select all
/*!
* @brief   Benchmark vector * vector.t() outer product.
*
*/

#include <stdio.h>
#include <time.h>
#include "Eigen/Core"

const int DIM_1 = 1024;
const int DIM_2 = 50;

namespace
{
   typedef Eigen::MatrixXf Matrix ;
   typedef Eigen::VectorXf Vector ;

   //Helper, get realtime in micro seconds
   long long GetRealTime_ms()
   {
      timespec currentTime;
      clock_gettime(CLOCK_REALTIME, &currentTime);
      return currentTime.tv_sec*1e6 + currentTime.tv_nsec/1000;
   }

   //Loops by hand.
   void OuterProduct_Ref(
         const Vector& v1,
         const Vector& v2,
         Matrix& o_result)
   {
      const int dim1 = v1.rows() ;
      const int dim2 = v2.rows() ;

      //Caller must have allocated room for result.
      assert(o_result.rows() == dim1);
      assert(o_result.cols() == dim2);

      const float* input  = v1.data() ;
      const float* output = v2.data() ;
      float* result = o_result.data() ;

      for (int j=0; j<dim2; ++j)
      {
         const float beta = output[j];
         const float * in = input ;
         for(int i=0; i<dim1; ++i)
         {
            // result[i+j*dim1] = beta*input[i] ;
            *(result++) = beta * *(in++) ;
         }
      }
   }

   void OuterProduct_Eigen(
         const Vector& v1,
         const Vector& v2,
         Matrix & o_result)
   {
      //Caller must have allocated room for result.
      assert(o_result.rows() == v1.rows());
      assert(o_result.cols() == v2.rows());

      o_result.noalias() = v1 * v2.transpose() ;
   }

   void OuterProduct_Eigen_Fix(
         const Vector& v1,
         const Vector& v2,
         Matrix & o_result)
   {
      //Caller must have allocated room for result.
      assert(o_result.rows() == v1.rows());
      assert(o_result.cols() == v2.rows());

      const int dim2 = v2.rows() ;
      for (int j=0 ; j != dim2 ; ++j)
      {
         o_result.col(j) = v1 * v2(j) ;
      }
   }

   void benchmark_OuterProduct()
   {
      long timeRef, timeEigen, timeEigenFix ;

      const int nbOperations = DIM_1*DIM_2 ;
      const int nbIter = 2e08/nbOperations ;

      Vector V1 (DIM_1);
      Vector V2 (DIM_2);
      V1.setRandom();
      V2.setRandom();

      Matrix resultRef, result ;
      resultRef.resize(DIM_1, DIM_2) ;
      result.resize(DIM_1, DIM_2) ;
      OuterProduct_Ref (V1, V2, resultRef);

      //By hand
      {
         long long timeRef_start = GetRealTime_ms();
         for(int n = 0; n < nbIter; n++)
         {
            OuterProduct_Ref (V1, V2, result);
         }
         timeRef = GetRealTime_ms() - timeRef_start ;
         assert (result.isApprox(resultRef)) ;
      }

      //With Eigen
      {
         long long timeEigen_start = GetRealTime_ms();
         for(int n = 0; n < nbIter; n++)
         {
            OuterProduct_Eigen (V1, V2, result);
         }
         timeEigen = GetRealTime_ms() - timeEigen_start ;
         assert (result.isApprox(resultRef)) ;
      }
     
      //With Eigen, fixed implementation.
      {
         long long timeEigenFix_start = GetRealTime_ms();
         for(int n = 0; n < nbIter; n++)
         {
            OuterProduct_Eigen_Fix (V1, V2, result);
         }
         timeEigenFix = GetRealTime_ms() - timeEigenFix_start ;
         assert (result.isApprox(resultRef)) ;
      }

      // Result //////////////////////////////////////////////////////////////////////
      printf(
            "| *Dim1* | *Dim2* | *nbIter* | *By hand* | *Eigen* | *Eigen fixed* | \n"
            "| %d | %d | %d | %d ms | %d ms | %d ms |\n",
               DIM_1, DIM_2, nbIter,
               timeRef/1000, timeEigen/1000, timeEigenFix/1000) ;
   }
} //anonymous namespace

int main()
{
   benchmark_OuterProduct() ;
   return 0 ;
}



Thanks,

Yves
User avatar
ggael
Moderator
Posts
3447
Karma
19
OS
your "OuterProduct_Eigen_Fix" version is how it is implemented. There might be a recent performance regression here. I'll check.
User avatar
ggael
Moderator
Posts
3447
Karma
19
OS
alright, the issue is that the code is optimized to perform rank-1 updates:

A += v1 * v2.transpose();

If you adjust your OuterProduct_Eigen_Fix to do this operation, then you get the exact same performance. Here the overhead for a simple A = v1 * v2.transpose(); is about 30%, and it is still quite faster than OuterProduct_Ref.
egery
Registered Member
Posts
2
Karma
0
Interesting !
It's true that for update, Eigen and EigenFix give the same performance, which is 1.7x better than hand-made update.
For assignation, I still have to use the "fix" to obtain the best performance, which remains very close to hand-made version.

I've compared 3 operations (assignation, assignation simulated with SetZero() and +=, and update).
BTW, noalias() helps here.


Byhand = : 77ms
Byhand += : 130ms

Eigen = (operator =) : 116ms
Eigen = (SetZero then +=) : 115ms
Eigen += : 77ms

EigenFix = (operator =) : 62ms
EigenFix = (SetZero then +=) : 115ms (no difference with plain Eigen expression)
EigenFix += : 77ms (no difference with plain Eigen expression)

FYI, /proc/cpuinfo gives :

Code: Select all
 processor       : 0
 vendor_id       : GenuineIntel
 cpu family      : 6
 model           : 44
 model name      : Intel(R) Xeon(R) CPU           E5645  @ 2.40GHz
 stepping        : 2
 cpu MHz         : 1596.000
 cache size      : 12288 KB
 physical id     : 1
 siblings        : 12
 core id         : 0
 cpu cores       : 6
 apicid          : 32
 initial apicid  : 32
 fdiv_bug        : no
 hlt_bug         : no
 f00f_bug        : no
 coma_bug        : no
 fpu             : yes
 fpu_exception   : yes
 cpuid level     : 11
 wp              : yes
 flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx pdpe1gb rdtscp lm constant_tsc arch_pe rfmon pebs bts xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat tpr_shadow v nmi flexpriority ept vpid
 bogomips        : 4787.79
 clflush size    : 64
 cache_alignment : 64
 address sizes   : 40 bits physical, 48 bits virtual


Yves
User avatar
ggael
Moderator
Posts
3447
Karma
19
OS
yes that's the expect behavior of the current implementation. I've added an entry there:

http://eigen.tuxfamily.org/bz/show_bug.cgi?id=483

so that we don't forget to optimize this case.


Bookmarks



Who is online

Registered users: Baidu [Spider], Bing [Bot], Google [Bot]