speed issue with elementwise product

martin_IM
Registered Member
Thu Apr 19, 2012 10:11 am
Hello,

I have a speed issue with the elementwise product of two matrices.
I compared a simple implementation using loops and pointers against two Eigen
approaches (see the code below). The loop-based method is 3 to 4 times faster:

C=A.cwiseProduct(B)

Eigen 1: 40ms
Eigen 2: 31ms
loops: 11ms


Is this expected?
What can I do to make Eigen run this operation as fast as the loop-based implementation?


Code: Select all
#include <Eigen/Core>
#include <cstdio>
#include <ctime>
#include <iostream>


int main()
{
   // Fill the inputs so the benchmark does not read uninitialized memory.
   Eigen::MatrixXd A = Eigen::MatrixXd::Random(100,100);
   Eigen::MatrixXd B = Eigen::MatrixXd::Random(100,100);
   Eigen::MatrixXd C(100,100);
   clock_t start;
   double duration;

   std::cout<<"\n";
   std::cout<<"C=A.cwiseProduct(B)\n";


   // Eigen implementation 1: cwiseProduct on the matrices
   start=clock();
   for (int t=0;t<1000;t++)
   {
      C=A.cwiseProduct(B);
   }
   duration=double(clock()-start)/((double)CLOCKS_PER_SEC);
   std::cout<<"   Eigen 1: "<<1000.0*duration<<"ms\n";


   // Eigen implementation 2: operator* on the array views
   start=clock();
   for (int t=0;t<1000;t++)
   {
      C.array()=A.array()*B.array();
   }
   duration=double(clock()-start)/((double)CLOCKS_PER_SEC);
   std::cout<<"   Eigen 2: "<<1000.0*duration<<"ms\n";


   // Loops implementation: raw pointers over the contiguous storage
   start=clock();
   for (int t=0;t<1000;t++)
   {
      const double* A_ptr=A.data();
      const double* B_ptr=B.data();
      double* C_ptr=C.data();
      const int nb=static_cast<int>(A.size());
      for (int k=0;k<nb;k++)
         (*C_ptr++)=(*A_ptr++)*(*B_ptr++);
   }
   duration=double(clock()-start)/((double)CLOCKS_PER_SEC);
   std::cout<<"   loops:   "<<1000.0*duration<<"ms\n";

   getchar();   // keep the console window open
   return 0;
}
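On some platforms clock() has a fairly coarse granularity, which makes runs that last only tens of milliseconds hard to compare. A minimal sketch of an alternative timing harness, assuming a C++11 compiler; `time_ms_per_call` and `cwise_product_loop` are hypothetical helper names and are not part of Eigen:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs f() `repeats` times and returns the mean
// wall-clock time per call in milliseconds.
template <typename F>
double time_ms_per_call(F&& f, int repeats)
{
    assert(repeats > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < repeats; ++t)
        f();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / repeats;
}

// Plain-loop elementwise product, same idea as the "loops" variant above.
void cwise_product_loop(const std::vector<double>& a,
                        const std::vector<double>& b,
                        std::vector<double>& c)
{
    assert(a.size() == b.size() && a.size() == c.size());
    for (std::size_t k = 0; k < a.size(); ++k)
        c[k] = a[k] * b[k];
}
```

It could be called as `time_ms_per_call([&] { cwise_product_loop(a, b, c); }, 100000)`. Reading a value from `c` afterwards helps keep the compiler from discarding the work.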
ggael
Moderator
Make sure you compiled with optimizations enabled.
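One way to check what the build actually enabled is to print a few predefined macros from inside the benchmark. A sketch, assuming the GCC/Clang macros `__OPTIMIZE__` and `__SSE2__` and the MSVC macros `_M_X64` and `_M_IX86_FP`; `build_report` is a hypothetical helper:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: summarizes build settings visible to the preprocessor.
std::string build_report()
{
    std::string r;
#ifdef NDEBUG
    r += "NDEBUG: defined\n";
#else
    r += "NDEBUG: not defined\n";
#endif
#if defined(__OPTIMIZE__)
    r += "__OPTIMIZE__: defined\n";
#else
    // MSVC does not define this macro, so its absence is not conclusive there.
    r += "__OPTIMIZE__: not defined\n";
#endif
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    r += "SSE2: enabled\n";
#else
    r += "SSE2: not enabled\n";
#endif
    return r;
}
```

Printing `build_report()` at the start of the benchmark records the configuration next to the timings.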
martin_IM
Registered Member
ggael wrote: Make sure you compiled with optimizations enabled.

I had optimization enabled (Maximize Speed (/O2) in Visual Studio 2010).
I recompiled with SSE2 enabled and increased the number of iterations to 100000 (100x more).
I now get these timings:


C=A.cwiseProduct(B)
Eigen 1: 1295ms
Eigen 2: 1653ms
loops: 1046ms

Eigen is still slower than the loop implementation even though it uses SSE2 (the loop implementation does not). I am using Eigen 3.0.3.
By the way, why would C.array()=A.array()*B.array() be slower than C=A.cwiseProduct(B)?
jitseniesen
Registered Member
I can confirm the issue. For me the loop implementation is also faster, by about 20%. But in my case, "Eigen 1" (the one with cwiseProduct) is slightly slower than "Eigen 2" (the one with array multiplication).

This is with gcc 4.5.1, compiler flags "-O2 -DNDEBUG -msse2", Intel Core2 Duo Processor E8500 (6M Cache, 3.16 GHz), 32-bit Linux. Other optimization flags do not seem to make a difference. However, if SSE2 is turned off, then Eigen is as fast as the loop implementation.
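For reference, a hand-written SSE2 version of the same product handles two doubles per multiply instruction. A sketch, assuming an x86 target that provides `<emmintrin.h>`; it uses unaligned loads and stores for simplicity, falls back to the scalar loop on other targets, and is not Eigen's actual implementation. `cwise_product_sse2` and `HAVE_SSE2_SKETCH` are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2_SKETCH 1
#endif

// c[k] = a[k] * b[k] for k in [0, n)
void cwise_product_sse2(const double* a, const double* b, double* c, std::size_t n)
{
    std::size_t k = 0;
#ifdef HAVE_SSE2_SKETCH
    // Main loop: one packet of two doubles per iteration.
    for (; k + 2 <= n; k += 2) {
        const __m128d va = _mm_loadu_pd(a + k);
        const __m128d vb = _mm_loadu_pd(b + k);
        _mm_storeu_pd(c + k, _mm_mul_pd(va, vb));
    }
#endif
    // Scalar tail (or the whole range when SSE2 is unavailable).
    for (; k < n; ++k)
        c[k] = a[k] * b[k];
}
```

Each packet still needs two loads and one store, so the ratio of memory operations to arithmetic is the same as in the scalar loop.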
ggael
Moderator
Again, such expressions cannot benefit much from vectorization: there are 2 loads and 1 store for only a single arithmetic operation. Nevertheless, the vectorized code should not be slower!
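The "2 loads, 1 store" argument can be put in numbers. A back-of-the-envelope sketch, assuming no cache reuse between elements; `Traffic`, `cwise_product_traffic` and `flops_per_byte` are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>

// Arithmetic work and memory traffic of C = A .* B on n doubles.
struct Traffic {
    std::size_t flops;
    std::size_t bytes;
};

Traffic cwise_product_traffic(std::size_t n)
{
    Traffic t;
    t.flops = n;                       // one multiply per element
    t.bytes = 3 * n * sizeof(double);  // load A, load B, store C
    return t;
}

double flops_per_byte(const Traffic& t)
{
    assert(t.bytes > 0);
    return static_cast<double>(t.flops) / static_cast<double>(t.bytes);
}
```

For the 100x100 matrices in this thread and 8-byte doubles, that is 10000 multiplies against 240000 bytes moved, so the loop is dominated by loads and stores rather than by the multiply.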

Here I get similar behavior:

Eigen0: 0.726506ms
Eigen1: 0.730992ms
Loop: 0.627779ms

There is no reason for the two Eigen variants to perform differently, since they generate the same code. Looking at the assembly generated by gcc for the Eigen version:

L27:
movq (%rdi), %r9
movq (%rsi), %r8
movapd (%r9,%rax,8), %xmm0
mulpd (%r8,%rax,8), %xmm0
movq (%rdx), %r8
movapd %xmm0, (%r8,%rax,8)
addq $2, %rax
cmpq %rax, %rcx
jg L27

we can see that there are 3 stupid movq instructions which clearly should not be there and which probably kill the performance.
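Each of those movq instructions reloads a pointer from memory inside the loop. One plausible reading, which is an assumption and is not confirmed in this thread, is that the compiler failed to hoist the matrices' data pointers out of the loop. A sketch of the two code shapes, with a hypothetical `Mat` struct standing in for a dynamic matrix:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a dynamically sized matrix: pointer plus size.
struct Mat {
    double* data;
    std::size_t size;
};

// Reads a.data, b.data and c.data through the objects on every iteration.
// Whether they are actually reloaded depends on the compiler's alias analysis.
void product_through_struct(const Mat& a, const Mat& b, Mat& c)
{
    for (std::size_t k = 0; k < c.size; ++k)
        c.data[k] = a.data[k] * b.data[k];
}

// Copies the pointers and the size into locals first, so the loop body
// only touches the element data.
void product_hoisted(const Mat& a, const Mat& b, Mat& c)
{
    const double* pa = a.data;
    const double* pb = b.data;
    double* pc = c.data;
    const std::size_t n = c.size;
    for (std::size_t k = 0; k < n; ++k)
        pc[k] = pa[k] * pb[k];
}
```

Both functions compute the same result. Comparing their assembly output (for example with `-S`) shows whether a given compiler reloads the pointers in the first form.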
ggael
Moderator
Hm, that's strange: these stupid movq instructions appear only with double. With float we get a nice:

L47:
movss (%rsi,%rax), %xmm0
addq $1, %rcx
mulss (%r9,%rax), %xmm0
movss %xmm0, (%rdx,%rax)
addq $4, %rax
cmpq %r8, %rcx
jne L47

Needless to say, Eigen's code for float and double is exactly the same... very strange.
manuels
Registered Member
This issue is not present with gcc (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1.

with -O2:
Code: Select all
   Eigen 1: 20ms
   Eigen 2: 20ms
   loops:   20ms


with -O1:
Code: Select all
   Eigen 1: 40ms
   Eigen 2: 50ms
   loops:   20ms


with -O0:
Code: Select all
   Eigen 1: 1020ms
   Eigen 2: 1360ms
   loops:   120ms
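The large gap at -O0 is consistent with how expression-template libraries work: an expression such as A.cwiseProduct(B) builds a lightweight proxy object, and its per-element accessor only becomes cheap once the compiler inlines it. A toy sketch of that mechanism, with hypothetical `Vec`, `CwiseProd` and `cwise_product` names; it is far simpler than Eigen's real classes:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Proxy for "lhs .* rhs": stores references, computes elements on demand.
template <typename L, typename R>
struct CwiseProd {
    const L& lhs;
    const R& rhs;
    double operator[](std::size_t k) const { return lhs[k] * rhs[k]; }
    std::size_t size() const { return lhs.size(); }
};

struct Vec {
    std::vector<double> v;
    explicit Vec(std::size_t n, double x = 0.0) : v(n, x) {}
    double operator[](std::size_t k) const { return v[k]; }
    double& operator[](std::size_t k) { return v[k]; }
    std::size_t size() const { return v.size(); }

    // Assigning an expression evaluates it element by element in one loop.
    template <typename E>
    Vec& operator=(const E& e)
    {
        assert(e.size() == size());
        for (std::size_t k = 0; k < size(); ++k)
            v[k] = e[k];
        return *this;
    }
};

// Builds the proxy; no multiplication happens here.
inline CwiseProd<Vec, Vec> cwise_product(const Vec& a, const Vec& b)
{
    return CwiseProd<Vec, Vec>{a, b};
}
```

With `Vec a(4, 2.0), b(4, 3.0), c(4);`, the statement `c = cwise_product(a, b);` fills `c` with 6.0. Without inlining, every `e[k]` is a chain of real function calls. With inlining, the assignment can reduce to the same loop as the hand-written version.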

