Registered Member
|
Hello
I have a speed issue with the elementwise product of two matrices. I compared a simple implementation using loops and raw pointers against two Eigen approaches (see the code below). The loop-based method is 3 to 4 times faster:

Eigen 1 (C = A.cwiseProduct(B)): 40ms
Eigen 2 (C.array() = A.array()*B.array()): 31ms
loops: 11ms

Is this expected? What can I do to get Eigen to run this operation as fast as the loop-based implementation?
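A minimal sketch of the three variants (the scalar type, matrix names, and function names are assumptions, not the original code):

    #include <cstddef>
    #include <Eigen/Dense>
    using Eigen::MatrixXd;

    // Eigen 1: coefficient-wise product via cwiseProduct
    void eigen1(const MatrixXd& A, const MatrixXd& B, MatrixXd& C) {
        C = A.cwiseProduct(B);
    }

    // Eigen 2: the same product via array expressions
    void eigen2(const MatrixXd& A, const MatrixXd& B, MatrixXd& C) {
        C.array() = A.array() * B.array();
    }

    // Loops: raw pointers over the matrices' contiguous storage
    void loops(const MatrixXd& A, const MatrixXd& B, MatrixXd& C) {
        const double* a = A.data();
        const double* b = B.data();
        double* c = C.data();
        const std::ptrdiff_t n = A.size();
        for (std::ptrdiff_t i = 0; i < n; ++i)
            c[i] = a[i] * b[i];
    }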
|
Moderator
|
Make sure you compiled with optimizations enabled.
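For example (typical invocations; the file name bench.cpp is a placeholder). Note that defining NDEBUG also disables Eigen's runtime assertions, which matters a lot for benchmarks:

    g++ -O2 -DNDEBUG bench.cpp -o bench        (gcc)
    cl /O2 /DNDEBUG bench.cpp                  (Visual Studio)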
|
Registered Member
|
I had optimization enabled (Maximize Speed (/O2) in Visual Studio 2010). I recompiled with SSE2 and increased the number of loop iterations to 100000 (100x more). I now get these timings:

Eigen 1 (C = A.cwiseProduct(B)): 1295ms
Eigen 2 (C.array() = A.array()*B.array()): 1653ms
loops: 1046ms

Eigen is still slower than the loop implementation even though it uses SSE2 (the loop implementation does not). I am using Eigen 3.0.3. By the way, why would C.array() = A.array()*B.array() be slower than C = A.cwiseProduct(B)?
|
Registered Member
|
I can confirm the issue. I also find that the loop implementation is faster, by about 20%. But in my case, "Eigen 1" (the one with cwiseProduct) is slightly slower than "Eigen 2" (the one with array multiplication).
This is with gcc 4.5.1, compiler flags "-O2 -DNDEBUG -msse2", on an Intel Core2 Duo E8500 (6M cache, 3.16 GHz), 32-bit Linux. Other optimization flags do not seem to make a difference. However, if SSE2 is turned off, then Eigen is as fast as the loop implementation.
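Side note: for this kind of comparison, Eigen's explicit vectorization can also be disabled in the source, independently of the compiler flags, by defining EIGEN_DONT_VECTORIZE before the first Eigen include:

    // Disables Eigen's explicit SSE code paths; scalar code is generated
    // instead. Must appear before any Eigen header is included.
    #define EIGEN_DONT_VECTORIZE
    #include <Eigen/Dense>
|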
Moderator
|
Again, such expressions cannot benefit much from vectorization: 2 loads and 1 store for a single arithmetic operation means the loop is bound by memory accesses, not arithmetic. Nevertheless, the vectorized code should not be slower!
Here I get similar behavior:

Eigen0: 0.726506ms
Eigen1: 0.730992ms
Loop: 0.627779ms

There is no reason the two Eigen variants should lead to different performance, since they generate the same code. Looking at the assembly generated by gcc for Eigen's version:

L27:
    movq   (%rdi), %r9
    movq   (%rsi), %r8
    movapd (%r9,%rax,8), %xmm0
    mulpd  (%r8,%rax,8), %xmm0
    movq   (%rdx), %r8
    movapd %xmm0, (%r8,%rax,8)
    addq   $2, %rax
    cmpq   %rax, %rcx
    jg     L27

we can see that there are 3 stupid movq instructions which should clearly not be there and probably kill the performance.
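One plausible explanation (a guess, not verified on this setup): inside Eigen's assignment loop the three data pointers live behind the expression objects, and gcc fails to hoist those loads out of the loop, so they are re-read from memory on every iteration. A standalone analogue with hypothetical names:

    // Hypothetical analogue, not Eigen code: reading the data pointers
    // through an extra level of indirection can make gcc reload them on
    // each iteration, producing extra movq instructions like those above.
    void mul_indirect(const double* const* pa, const double* const* pb,
                      double* const* pc, long n) {
        for (long i = 0; i < n; ++i)
            (*pc)[i] = (*pa)[i] * (*pb)[i];
    }

    // Hoisting the loads by hand keeps the pointers in registers and
    // matches what the hand-written loop implementation does.
    void mul_hoisted(const double* const* pa, const double* const* pb,
                     double* const* pc, long n) {
        const double* a = *pa;
        const double* b = *pb;
        double* c = *pc;
        for (long i = 0; i < n; ++i)
            c[i] = a[i] * b[i];
    }
|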
Moderator
|
Hm, that's strange: these stupid movq appear only with double; with float we get a nice:

L47:
    movss  (%rsi,%rax), %xmm0
    addq   $1, %rcx
    mulss  (%r9,%rax), %xmm0
    movss  %xmm0, (%rdx,%rax)
    addq   $4, %rax
    cmpq   %r8, %rcx
    jne    L47

Needless to say, Eigen's code for float and double is exactly the same... very strange.
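To inspect the generated code yourself, gcc's -S flag dumps the assembly (bench.cpp is a placeholder name):

    g++ -O2 -DNDEBUG -msse2 -S bench.cpp -o bench.s
|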
Registered Member
|
This issue is not present with gcc (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1:
with -O2:
with -O1:
with -O0:
|