Registered Member
|
Hello
I have a speed issue with the elementwise product of two matrices. I compared a simple implementation using loops and raw pointers against two Eigen approaches (see the code below). The loop-based method is 3 to 4 times faster:

Eigen 1 (C = A.cwiseProduct(B)): 40ms
Eigen 2 (C.array() = A.array()*B.array()): 31ms
loops: 11ms

Is this expected? What can I do to get Eigen to run this operation as fast as the loop-based implementation?
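A minimal sketch of the three variants (the scalar type, matrix names, and function names are assumptions, not the original code):

    #include <cstddef>
    #include <Eigen/Dense>
    using Eigen::MatrixXd;

    // Eigen 1: coefficient-wise product via cwiseProduct
    void eigen1(const MatrixXd& A, const MatrixXd& B, MatrixXd& C) {
        C = A.cwiseProduct(B);
    }

    // Eigen 2: the same product via array expressions
    void eigen2(const MatrixXd& A, const MatrixXd& B, MatrixXd& C) {
        C.array() = A.array() * B.array();
    }

    // Loops: raw pointers over the matrices' contiguous storage
    void loops(const MatrixXd& A, const MatrixXd& B, MatrixXd& C) {
        const double* a = A.data();
        const double* b = B.data();
        double* c = C.data();
        const std::ptrdiff_t n = A.size();
        for (std::ptrdiff_t i = 0; i < n; ++i)
            c[i] = a[i] * b[i];
    }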
|
Moderator
|
Make sure you compiled with optimizations enabled.
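For example (typical invocations; the file name bench.cpp is a placeholder). Note that defining NDEBUG also disables Eigen's runtime assertions, which matters a lot for benchmarks:

    g++ -O2 -DNDEBUG bench.cpp -o bench        (gcc)
    cl /O2 /DNDEBUG bench.cpp                  (Visual Studio)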
|
Registered Member
|
I had optimization enabled (Maximize Speed (/O2) in Visual Studio 2010). I recompiled with SSE2 and increased the number of loop iterations to 100000 (100x more). I now get these timings:

Eigen 1 (C = A.cwiseProduct(B)): 1295ms
Eigen 2 (C.array() = A.array()*B.array()): 1653ms
loops: 1046ms

Eigen is still slower than the loop implementation even though it uses SSE2 (the loop implementation does not). I am using Eigen 3.0.3. By the way, why would C.array() = A.array()*B.array() be slower than C = A.cwiseProduct(B)?
|
Registered Member
|
I can confirm the issue. I also find that the loop implementation is faster, by about 20%. But in my case, "Eigen 1" (the one with cwiseProduct) is slightly slower than "Eigen 2" (the one with array multiplication).
This is with gcc 4.5.1, compiler flags "-O2 -DNDEBUG -msse2", on an Intel Core2 Duo E8500 (6M cache, 3.16 GHz), 32-bit Linux. Other optimization flags do not seem to make a difference. However, if SSE2 is turned off, then Eigen is as fast as the loop implementation.
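Side note: for this kind of comparison, Eigen's explicit vectorization can also be disabled in the source, independently of the compiler flags, by defining EIGEN_DONT_VECTORIZE before the first Eigen include:

    // Disables Eigen's explicit SSE code paths; scalar code is generated
    // instead. Must appear before any Eigen header is included.
    #define EIGEN_DONT_VECTORIZE
    #include <Eigen/Dense>
|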
Moderator
|
Again, such expressions cannot benefit much from vectorization: 2 loads and 1 store for a single arithmetic operation means the loop is bound by memory accesses, not arithmetic. Nevertheless, the vectorized code should not be slower!
Here I get similar behavior:

Eigen0: 0.726506ms
Eigen1: 0.730992ms
Loop: 0.627779ms

There is no reason the two Eigen variants should lead to different performance, since they generate the same code. Looking at the assembly generated by gcc for Eigen's version:

L27:
    movq   (%rdi), %r9
    movq   (%rsi), %r8
    movapd (%r9,%rax,8), %xmm0
    mulpd  (%r8,%rax,8), %xmm0
    movq   (%rdx), %r8
    movapd %xmm0, (%r8,%rax,8)
    addq   $2, %rax
    cmpq   %rax, %rcx
    jg     L27

we can see that there are 3 stupid movq instructions which should clearly not be there and probably kill the performance.
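One plausible explanation (a guess, not verified on this setup): inside Eigen's assignment loop the three data pointers live behind the expression objects, and gcc fails to hoist those loads out of the loop, so they are re-read from memory on every iteration. A standalone analogue with hypothetical names:

    // Hypothetical analogue, not Eigen code: reading the data pointers
    // through an extra level of indirection can make gcc reload them on
    // each iteration, producing extra movq instructions like those above.
    void mul_indirect(const double* const* pa, const double* const* pb,
                      double* const* pc, long n) {
        for (long i = 0; i < n; ++i)
            (*pc)[i] = (*pa)[i] * (*pb)[i];
    }

    // Hoisting the loads by hand keeps the pointers in registers and
    // matches what the hand-written loop implementation does.
    void mul_hoisted(const double* const* pa, const double* const* pb,
                     double* const* pc, long n) {
        const double* a = *pa;
        const double* b = *pb;
        double* c = *pc;
        for (long i = 0; i < n; ++i)
            c[i] = a[i] * b[i];
    }
|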
Moderator
|
Hm, that's strange: these stupid movq appear only with double; with float we get a nice:

L47:
    movss  (%rsi,%rax), %xmm0
    addq   $1, %rcx
    mulss  (%r9,%rax), %xmm0
    movss  %xmm0, (%rdx,%rax)
    addq   $4, %rax
    cmpq   %r8, %rcx
    jne    L47

Needless to say, Eigen's code for float and double is exactly the same... very strange.
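To inspect the generated code yourself, gcc's -S flag dumps the assembly (bench.cpp is a placeholder name):

    g++ -O2 -DNDEBUG -msse2 -S bench.cpp -o bench.s
|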
Registered Member
|
This issue is not present with gcc (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1:
with -O2:
with -O1:
with -O0:
|