Matrix vs. Map efficiency

Board index

Page 1 of 1 (5 posts)

Tags:

alejandrocastro Registered Member Posts 15 Karma 0 OS	Matrix vs. Map efficiency Mon May 05, 2014 7:14 am I am running a slightly modified version of a code posted in the topic "Low efficiency using Eigen..." by Hauke(viewtopic.php?f=74&t=87929&p=158729&hilit=map+efficiency#p158729), though my issue/question is a bit different. In the code by Hauke vectors (matrix) are initialized from C arrays using Maps that then are copied into vectors (matrix) in the initialization as: Vector4d VIE = Vector4d::MapAligned(VI); with VI a C vector. I was wondering how efficiency would get affected if instead I did: Map<Vector4d,Aligned> VIE(VI); so VIE now is an actual Map. This is very convenient if you are modifying an existing C code and you want to manage your own memory. However I found out the code with Map's to be 20% slower!!! Is there anything I am missing? Find my version of the code below. I compiled with: g++ -Ofast -DNDEBUG -DEIGEN_NO_MALLOC -I/path_to_my_Eigen benchmark1.cpp -o benchmark1 -lrt with gcc version 4.7.2. I used EIGEN_VECTORIZE to verify that indeed vectorization is on. I have an intel i5 processor. Using maps (#define USE_MAPS at the top) the code runs in 1370 ms. NOT using maps (commenting out #define USE_MAPS) it runs in 1123 ms. I would like to add that I'd like to use maps if possible so I can manage the memory, specially because I can use pointers within my code to different location in the data. Thank you a lot in advance. This is the code (full and self contained so you can try it out if you can): Code: Select all #include <Eigen/Eigen> #include <bench/BenchTimer.h> #include <iostream> using namespace Eigen; using namespace std; //#define CLASSIC_METHOD #define USE_MAPS EIGEN_DONT_INLINE void classic(double VO[4], double AT[4][4], double VI[4]) { for (int ii=0; ii<4; ii++) { VO[ii] = AT[ii][0] * VI[0] + AT[ii][1] * VI[1] + AT[ii][2] * VI[2] + AT[ii][3] * VI[3]; } }; template <typename OutputType, typename MatrixType, typename VectorType> EIGEN_DONT_INLINE void modern(MatrixBase<OutputType>& VOE, const MatrixBase<MatrixType>& A44, const MatrixBase<VectorType>& VIE) { VOE.noalias() = A44.transpose()VIE; }; int main() { EIGEN_ALIGN16 double AT[4][4] = {0.1, 0.2, 0.3, 2.0, 0.2, 0.3, 0.4, 3.0, 0.3, 0.4, 0.5, 4.0, 0.0, 0.0, 0.0, 1.0}; EIGEN_ALIGN16 double VI[4] = {1, 2, 3, 4}; EIGEN_ALIGN16 double VO[4]; //Eigen matrices #ifndef USE_MAPS Matrix4d A44 = Matrix4d::MapAligned(AT[0]); Vector4d VIE = Vector4d::MapAligned(VI); Vector4d VOE(0,0,0,0); #else Map<Matrix4d,Aligned> A44(AT[0]); Map<Vector4d,Aligned> VIE(VI); Map<Vector4d,Aligned> VOE(VO); // Map<Matrix4d> A44(AT[0]); // Map<Vector4d> VIE(VI); // Map<Vector4d> VOE(VO); #endif #ifdef EIGEN_VECTORIZE cout << "EIGEN_VECTORIZE defined" << endl; #else cout << "EIGEN_VECTORIZE NOT defined" << endl; #endif cout << "VIE:" << endl; cout << VIE << endl; VI[0] = 3.14; cout << "VIE:" << endl; cout << VIE << endl; BenchTimer timer; const int num_tries = 5; const int num_repetitions = 200000000; #ifdef CLASSIC_METHOD BENCH(timer, num_tries, num_repetitions, classic(VO, AT, VI)); std::cout << Vector4d::MapAligned(VO) << std::endl; #else BENCH(timer, num_tries, num_repetitions, modern(VOE, A44, VIE)); std::cout << VOE << std::endl; #endif double elapsed = timer.best(); std::cout << "elapsed time: " << elapsed1000.0 << " ms" << std::endl; return 0; }
alejandrocastro Registered Member Posts 15 Karma 0 OS	Re: Matrix vs. Map efficiency Mon May 05, 2014 8:30 pm An interesting fact: If the arrays passed to the maps are allocated dynamically (with new) the program runs about 5-7% faster. A.
ggael Moderator Posts 3447 Karma 19 OS	Re: Matrix vs. Map efficiency Mon May 05, 2014 9:24 pm There is not enough computation to get reliable performance measures. The best in this case is still to at the generated code: Without Eigen::Map: Code: Select all .globl __Z6modernIN5Eigen6MatrixIdLi4ELi1ELi0ELi4ELi1EEENS1_IdLi4ELi4ELi0ELi4ELi4EEES2_EvRNS0_10MatrixBaseIT_EERKNS4_IT0_EERKNS4_IT1_EE .weak_definition __Z6modernIN5Eigen6MatrixIdLi4ELi1ELi0ELi4ELi1EEENS1_IdLi4ELi4ELi0ELi4ELi4EEES2_EvRNS0_10MatrixBaseIT_EERKNS4_IT0_EERKNS4_IT1_EE __Z6modernIN5Eigen6MatrixIdLi4ELi1ELi0ELi4ELi1EEENS1_IdLi4ELi4ELi0ELi4ELi4EEES2_EvRNS0_10MatrixBaseIT_EERKNS4_IT0_EERKNS4_IT1_EE: LFB5684: movapd (%rsi), %xmm0 movapd 16(%rsi), %xmm1 mulpd (%rdx), %xmm0 mulpd 16(%rdx), %xmm1 addpd %xmm1, %xmm0 haddpd %xmm0, %xmm0 movlpd %xmm0, (%rdi) movapd 32(%rsi), %xmm0 movapd 48(%rsi), %xmm1 mulpd (%rdx), %xmm0 mulpd 16(%rdx), %xmm1 addpd %xmm1, %xmm0 haddpd %xmm0, %xmm0 movlpd %xmm0, 8(%rdi) movapd 64(%rsi), %xmm0 movapd 80(%rsi), %xmm1 mulpd (%rdx), %xmm0 mulpd 16(%rdx), %xmm1 addpd %xmm1, %xmm0 haddpd %xmm0, %xmm0 movlpd %xmm0, 16(%rdi) movapd 96(%rsi), %xmm0 movapd 112(%rsi), %xmm1 mulpd (%rdx), %xmm0 mulpd 16(%rdx), %xmm1 addpd %xmm1, %xmm0 haddpd %xmm0, %xmm0 movlpd %xmm0, 24(%rdi) ret With Eigen::Map: Code: Select all .globl __Z6modernIN5Eigen3MapINS0_6MatrixIdLi4ELi1ELi0ELi4ELi1EEELi1ENS0_6StrideILi0ELi0EEEEENS1_INS2_IdLi4ELi4ELi0ELi4ELi4EEELi1ES5_EES6_EvRNS0_10MatrixBaseIT_EERKNS9_IT0_EERKNS9_IT1_EE .weak_definition __Z6modernIN5Eigen3MapINS0_6MatrixIdLi4ELi1ELi0ELi4ELi1EEELi1ENS0_6StrideILi0ELi0EEEEENS1_INS2_IdLi4ELi4ELi0ELi4ELi4EEELi1ES5_EES6_EvRNS0_10MatrixBaseIT_EERKNS9_IT0_EERKNS9_IT1_EE __Z6modernIN5Eigen3MapINS0_6MatrixIdLi4ELi1ELi0ELi4ELi1EEELi1ENS0_6StrideILi0ELi0EEEEENS1_INS2_IdLi4ELi4ELi0ELi4ELi4EEELi1ES5_EES6_EvRNS0_10MatrixBaseIT_EERKNS9_IT0_EERKNS9_IT1_EE: LFB5658: movq (%rsi), %rcx movq (%rdx), %rax movq (%rdi), %rdx movapd (%rcx), %xmm0 movapd 16(%rcx), %xmm1 mulpd (%rax), %xmm0 mulpd 16(%rax), %xmm1 addpd %xmm1, %xmm0 haddpd %xmm0, %xmm0 movlpd %xmm0, (%rdx) movapd 32(%rcx), %xmm0 movapd 48(%rcx), %xmm1 mulpd (%rax), %xmm0 mulpd 16(%rax), %xmm1 addpd %xmm1, %xmm0 haddpd %xmm0, %xmm0 movlpd %xmm0, 8(%rdx) movapd 64(%rcx), %xmm0 movapd 80(%rcx), %xmm1 mulpd (%rax), %xmm0 mulpd 16(%rax), %xmm1 addpd %xmm1, %xmm0 haddpd %xmm0, %xmm0 movlpd %xmm0, 16(%rdx) movapd 96(%rcx), %xmm0 movapd 112(%rcx), %xmm1 mulpd (%rax), %xmm0 mulpd 16(%rax), %xmm1 addpd %xmm1, %xmm0 haddpd %xmm0, %xmm0 movlpd %xmm0, 24(%rdx) ret The only differences are the three movq instructions at the beginning because you have to access the data through a pointer. However, in practice everything is inlined and therefore the overhead of this indirection will be amortised over the multiple operations, or even completely removed. Finally, you can explicitly avoid this indirection by casting your raw data into a Matrix4d or Vector4d through a reinterpret_cast, but that's rather ugly, and again, in real world code I doubt you'll get any speedup by doing though.
alejandrocastro Registered Member Posts 15 Karma 0 OS	Re: Matrix vs. Map efficiency Mon May 05, 2014 11:00 pm Fist of all thank you so much for taking the time to look into this. You even looked into the assembly code!! What do you mean with "the overhead of this indirection will be amortised over the multiple operations"? this code is calling the testing routine many times to get enough statistics. Wouldn't that simulate a real application where these operations are performed multiple times? if not, how would you make this test more realistic? And would you know why Eigen bothers to have a separate class Map to map C arrays instead of just having a class constructor for Matrix doing this? I had the idea this was because Eigen didn't want to loose any efficiency doing this but that was just a very non-educated guess. Finally, do you think it would be ok then to have in my project all Map's instead of Matrix's without a final impact in my performance?. Having Map's instead of Matrix's would allow me to have C pointers in the background as I want it. Thank much you again.
ggael Moderator Posts 3447 Karma 19 OS	Re: Matrix vs. Map efficiency Mon May 05, 2014 11:28 pm alejandrocastro wrote:What do you mean with "the overhead of this indirection will be amortised over the multiple operations"? this code is calling the testing routine many times to get enough statistics. Wouldn't that simulate a real application where these operations are performed multiple times? if not, how would you make this test more realistic? Your test function only does one matrix-vector product, and the Map objects are created outside the function. Is it really what you're going to do in your "real" code? And would you know why Eigen bothers to have a separate class Map to map C arrays instead of just having a class constructor for Matrix doing this? I had the idea this was because Eigen didn't want to loose any efficiency doing this but that was just a very non-educated guess. What you suggest would only be possible for dynamically allocated matrices, not for fixed, statically allocated ones. Finally, do you think it would be ok then to have in my project all Map's instead of Matrix's without a final impact in my performance?. Having Map's instead of Matrix's would allow me to have C pointers in the background as I want it. Yes, especially if you declare your Map objects as you need them.