This forum has been archived. All content is frozen. Please use KDE Discuss instead.

Matrix vs. Map efficiency

Tags: None
(comma "," separated)
alejandrocastro
Registered Member
Posts
15
Karma
0
OS

Matrix vs. Map efficiency

Mon May 05, 2014 7:14 am
I am running a slightly modified version of a code posted in the topic "Low efficiency using Eigen..." by Hauke(viewtopic.php?f=74&t=87929&p=158729&hilit=map+efficiency#p158729), though my issue/question is a bit different. In the code by Hauke vectors (matrix) are initialized from C arrays using Maps that then are copied into vectors (matrix) in the initialization as:
Vector4d VIE = Vector4d::MapAligned(VI);

with VI a C vector. I was wondering how efficiency would get affected if instead I did:
Map<Vector4d,Aligned> VIE(VI);

so VIE now is an actual Map. This is very convenient if you are modifying an existing C code and you want to manage your own memory. However I found out the code with Map's to be 20% slower!!!

Is there anything I am missing? Find my version of the code below. I compiled with:
g++ -Ofast -DNDEBUG -DEIGEN_NO_MALLOC -I/path_to_my_Eigen benchmark1.cpp -o benchmark1 -lrt

with gcc version 4.7.2. I used EIGEN_VECTORIZE to verify that indeed vectorization is on. I have an intel i5 processor.

Using maps (#define USE_MAPS at the top) the code runs in 1370 ms.
NOT using maps (commenting out #define USE_MAPS) it runs in 1123 ms.

I would like to add that I'd like to use maps if possible so I can manage the memory, specially because I can use pointers within my code to different location in the data.

Thank you a lot in advance.

This is the code (full and self contained so you can try it out if you can):

Code: Select all

#include <Eigen/Eigen>
#include <bench/BenchTimer.h>

#include <iostream>

using namespace Eigen;
using namespace std;

//#define CLASSIC_METHOD                                                                                                                                                                                                                                                     

#define USE_MAPS

EIGEN_DONT_INLINE void classic(double VO[4], double AT[4][4], double VI[4])
{
  for (int ii=0; ii<4; ii++)
    {
    VO[ii] = AT[ii][0] * VI[0] +
      AT[ii][1] * VI[1] +
      AT[ii][2] * VI[2] +
      AT[ii][3] * VI[3];
    }
};

template <typename OutputType, typename MatrixType, typename VectorType>
EIGEN_DONT_INLINE void modern(MatrixBase<OutputType>& VOE, const MatrixBase<MatrixType>& A44, const MatrixBase<VectorType>& VIE)
{
  VOE.noalias() = A44.transpose()*VIE;
};

int main()
{
  EIGEN_ALIGN16 double AT[4][4] = {0.1, 0.2, 0.3, 2.0, 0.2, 0.3, 0.4, 3.0, 0.3, 0.4, 0.5, 4.0, 0.0, 0.0, 0.0, 1.0};
  EIGEN_ALIGN16 double VI[4] = {1, 2, 3, 4};
  EIGEN_ALIGN16 double VO[4];

  //Eigen matrices                                                                                                                                                                                                                                                           
#ifndef USE_MAPS
  Matrix4d A44 = Matrix4d::MapAligned(AT[0]);
  Vector4d VIE = Vector4d::MapAligned(VI);
  Vector4d VOE(0,0,0,0);
#else
  Map<Matrix4d,Aligned> A44(AT[0]);
  Map<Vector4d,Aligned> VIE(VI);
  Map<Vector4d,Aligned> VOE(VO);

  // Map<Matrix4d> A44(AT[0]);                                                                                                                                                                                                                                               
  // Map<Vector4d> VIE(VI);                                                                                                                                                                                                                                                 
  // Map<Vector4d> VOE(VO);                                                                                                                                                                                                                                                 
#endif

#ifdef EIGEN_VECTORIZE
  cout << "EIGEN_VECTORIZE defined" << endl;
#else
    cout << "EIGEN_VECTORIZE NOT defined" << endl;
#endif

  cout << "VIE:" << endl;
  cout << VIE << endl;

  VI[0] = 3.14;
  cout << "VIE:" << endl;
  cout << VIE << endl;

  BenchTimer timer;

  const int num_tries = 5;
  const int num_repetitions = 200000000;

#ifdef CLASSIC_METHOD
  BENCH(timer, num_tries, num_repetitions, classic(VO, AT, VI));
  std::cout << Vector4d::MapAligned(VO) << std::endl;
#else
  BENCH(timer, num_tries, num_repetitions, modern(VOE, A44, VIE));
  std::cout << VOE << std::endl;
#endif

  double elapsed = timer.best();
  std::cout << "elapsed time: " << elapsed*1000.0 << " ms" << std::endl;

  return 0;
}

alejandrocastro
Registered Member
Posts
15
Karma
0
OS

Re: Matrix vs. Map efficiency

Mon May 05, 2014 8:30 pm
An interesting fact: If the arrays passed to the maps are allocated dynamically (with new) the program runs about 5-7% faster.

A.
User avatar
ggael
Moderator
Posts
3447
Karma
19
OS

Re: Matrix vs. Map efficiency

Mon May 05, 2014 9:24 pm
There is not enough computation to get reliable performance measures. The best in this case is still to at the generated code:

Without Eigen::Map:
Code: Select all
   .globl __Z6modernIN5Eigen6MatrixIdLi4ELi1ELi0ELi4ELi1EEENS1_IdLi4ELi4ELi0ELi4ELi4EEES2_EvRNS0_10MatrixBaseIT_EERKNS4_IT0_EERKNS4_IT1_EE
   .weak_definition __Z6modernIN5Eigen6MatrixIdLi4ELi1ELi0ELi4ELi1EEENS1_IdLi4ELi4ELi0ELi4ELi4EEES2_EvRNS0_10MatrixBaseIT_EERKNS4_IT0_EERKNS4_IT1_EE
__Z6modernIN5Eigen6MatrixIdLi4ELi1ELi0ELi4ELi1EEENS1_IdLi4ELi4ELi0ELi4ELi4EEES2_EvRNS0_10MatrixBaseIT_EERKNS4_IT0_EERKNS4_IT1_EE:
LFB5684:
   movapd   (%rsi), %xmm0
   movapd   16(%rsi), %xmm1
   mulpd   (%rdx), %xmm0
   mulpd   16(%rdx), %xmm1
   addpd   %xmm1, %xmm0
   haddpd   %xmm0, %xmm0
   movlpd   %xmm0, (%rdi)
   movapd   32(%rsi), %xmm0
   movapd   48(%rsi), %xmm1
   mulpd   (%rdx), %xmm0
   mulpd   16(%rdx), %xmm1
   addpd   %xmm1, %xmm0
   haddpd   %xmm0, %xmm0
   movlpd   %xmm0, 8(%rdi)
   movapd   64(%rsi), %xmm0
   movapd   80(%rsi), %xmm1
   mulpd   (%rdx), %xmm0
   mulpd   16(%rdx), %xmm1
   addpd   %xmm1, %xmm0
   haddpd   %xmm0, %xmm0
   movlpd   %xmm0, 16(%rdi)
   movapd   96(%rsi), %xmm0
   movapd   112(%rsi), %xmm1
   mulpd   (%rdx), %xmm0
   mulpd   16(%rdx), %xmm1
   addpd   %xmm1, %xmm0
   haddpd   %xmm0, %xmm0
   movlpd   %xmm0, 24(%rdi)
   ret


With Eigen::Map:
Code: Select all
   .globl __Z6modernIN5Eigen3MapINS0_6MatrixIdLi4ELi1ELi0ELi4ELi1EEELi1ENS0_6StrideILi0ELi0EEEEENS1_INS2_IdLi4ELi4ELi0ELi4ELi4EEELi1ES5_EES6_EvRNS0_10MatrixBaseIT_EERKNS9_IT0_EERKNS9_IT1_EE
   .weak_definition __Z6modernIN5Eigen3MapINS0_6MatrixIdLi4ELi1ELi0ELi4ELi1EEELi1ENS0_6StrideILi0ELi0EEEEENS1_INS2_IdLi4ELi4ELi0ELi4ELi4EEELi1ES5_EES6_EvRNS0_10MatrixBaseIT_EERKNS9_IT0_EERKNS9_IT1_EE
__Z6modernIN5Eigen3MapINS0_6MatrixIdLi4ELi1ELi0ELi4ELi1EEELi1ENS0_6StrideILi0ELi0EEEEENS1_INS2_IdLi4ELi4ELi0ELi4ELi4EEELi1ES5_EES6_EvRNS0_10MatrixBaseIT_EERKNS9_IT0_EERKNS9_IT1_EE:
LFB5658:
   movq   (%rsi), %rcx
   movq   (%rdx), %rax
   movq   (%rdi), %rdx
   movapd   (%rcx), %xmm0
   movapd   16(%rcx), %xmm1
   mulpd   (%rax), %xmm0
   mulpd   16(%rax), %xmm1
   addpd   %xmm1, %xmm0
   haddpd   %xmm0, %xmm0
   movlpd   %xmm0, (%rdx)
   movapd   32(%rcx), %xmm0
   movapd   48(%rcx), %xmm1
   mulpd   (%rax), %xmm0
   mulpd   16(%rax), %xmm1
   addpd   %xmm1, %xmm0
   haddpd   %xmm0, %xmm0
   movlpd   %xmm0, 8(%rdx)
   movapd   64(%rcx), %xmm0
   movapd   80(%rcx), %xmm1
   mulpd   (%rax), %xmm0
   mulpd   16(%rax), %xmm1
   addpd   %xmm1, %xmm0
   haddpd   %xmm0, %xmm0
   movlpd   %xmm0, 16(%rdx)
   movapd   96(%rcx), %xmm0
   movapd   112(%rcx), %xmm1
   mulpd   (%rax), %xmm0
   mulpd   16(%rax), %xmm1
   addpd   %xmm1, %xmm0
   haddpd   %xmm0, %xmm0
   movlpd   %xmm0, 24(%rdx)
   ret


The only differences are the three movq instructions at the beginning because you have to access the data through a pointer. However, in practice everything is inlined and therefore the overhead of this indirection will be amortised over the multiple operations, or even completely removed.

Finally, you can explicitly avoid this indirection by casting your raw data into a Matrix4d or Vector4d through a reinterpret_cast, but that's rather ugly, and again, in real world code I doubt you'll get any speedup by doing though.
alejandrocastro
Registered Member
Posts
15
Karma
0
OS

Re: Matrix vs. Map efficiency

Mon May 05, 2014 11:00 pm
Fist of all thank you so much for taking the time to look into this. You even looked into the assembly code!!

What do you mean with "the overhead of this indirection will be amortised over the multiple operations"? this code is calling the testing routine many times to get enough statistics. Wouldn't that simulate a real application where these operations are performed multiple times? if not, how would you make this test more realistic?

And would you know why Eigen bothers to have a separate class Map to map C arrays instead of just having a class constructor for Matrix doing this? I had the idea this was because Eigen didn't want to loose any efficiency doing this but that was just a very non-educated guess.

Finally, do you think it would be ok then to have in my project all Map's instead of Matrix's without a final impact in my performance?. Having Map's instead of Matrix's would allow me to have C pointers in the background as I want it.

Thank much you again.
User avatar
ggael
Moderator
Posts
3447
Karma
19
OS

Re: Matrix vs. Map efficiency

Mon May 05, 2014 11:28 pm
alejandrocastro wrote:What do you mean with "the overhead of this indirection will be amortised over the multiple operations"? this code is calling the testing routine many times to get enough statistics. Wouldn't that simulate a real application where these operations are performed multiple times? if not, how would you make this test more realistic?


Your test function only does one matrix-vector product, and the Map objects are created outside the function. Is it really what you're going to do in your "real" code?

And would you know why Eigen bothers to have a separate class Map to map C arrays instead of just having a class constructor for Matrix doing this? I had the idea this was because Eigen didn't want to loose any efficiency doing this but that was just a very non-educated guess.


What you suggest would only be possible for dynamically allocated matrices, not for fixed, statically allocated ones.

Finally, do you think it would be ok then to have in my project all Map's instead of Matrix's without a final impact in my performance?. Having Map's instead of Matrix's would allow me to have C pointers in the background as I want it.


Yes, especially if you declare your Map objects as you need them.


Bookmarks



Who is online

Registered users: Bing [Bot], claydoh, Google [Bot], rblackwell, Yahoo [Bot]