This forum has been archived. All content is frozen. Please use KDE Discuss instead.

Eigen code in C++ 10-20 times slower than matlab

Tags: matlab, eigen, c, optimization, fast, time matlab, eigen, c, optimization, fast, time matlab, eigen, c, optimization, fast, time
(comma "," separated)
parrimin
Registered Member
Posts
4
Karma
0
Hi,

I am trying to port some Matlab coded prototype to C++ program. I have chosen Eigen to make some calculations, but it has become that the consuming time is about 10 times slower in best case. Let me explain the situation:
- Matrix a, unsigned chars, 2782 rows x 128 cols.
- Matrix b, unsigned chars, 4000 rows x 128 cols.
I need to know for each row (vector) in a, the vector in b with minimum distance.

Here is the matlab code, that is running in 0.05 seconds:

Code: Select all
aa=sum(a.*a,2); bb=sum(b.*b,2); ab=a*b';
d = sqrt(abs(repmat(aa,[1 size(bb,1)]) + repmat(bb',[size(aa,1) 1]) - 2*ab));
[minz index]=min(d,[],2);


and here the c++ code. I tried the same code using ints, floats and doubles:

Code: Select all
   MatrixXf a(a_size, descrSize);
   MatrixXf b(b_size, descrSize);
   MatrixXf ab(a_size, b_size);

   const unsigned char* dataPtr = matrixa;
   for (int i=0; i<a_size; ++i)
   {
      for (int j=0; j<descrSize; ++j)
      {
         a(i,j)=(float)*dataPtr++;
      }
   }
   const unsigned char* vocPtr = matrixb;
   for (int i=0; i<b_size; ++i)
   {
      for (int j=0; j<descrSize; ++j)
      {
         b(i,j)=(float)*vocPtr ++;
      }
   }
   ab = a*b.transpose();
   a.cwiseProduct(a);
   b.cwiseProduct(b);
   MatrixXf aa = a.rowwise().sum();
   MatrixXf bb = b.rowwise().sum().transpose();
   MatrixXf d = (aa.replicate(1,b_size) + bb.replicate(a_size,1) - 2*ab).cwiseAbs2();

   indexes = new int[a_size];
   for (int i=0; i<a_size; ++i)
   {
      d.row(i).minCoeff(&indexes[i]);
   }


Time using SSE2 optimizations
int: 0.74916s.
float: 0.502975s.
double: 0.72532s.
Without SSE2
int: 1.25108s.
float: 0.940441s.
double: 1.02473s.

The best time (float with SSE2) is far from the results from Matlab. Is there something I am missing to make this faster?
nicaiwss
Registered Member
Posts
1
Karma
0
I found that eigen replicate is kind of slow, can you try like this:
Code: Select all
MatrixXf a(128, 2782);
MatrixXf b(128, 4000);
......
MatrixXf ab = a.transpose()*b;
a.cwiseProduct(a);
b.cwiseProduct(b);
aa = a.colwise().sum();
bb = b.colwise().sum();
MatrixXf d(a.cols(),b.cols());
    for (int i=0; i<b.cols(); i++) {
      for (int j=0; j<a.cols(); j++) {
        d(j,i) = (aa(0,j)+bb(0,i)-2*ab(j,i));
      }
    }


Also, I think you can replace your matlab code
Code: Select all
repmat(aa,[1 size(bb,1)]) + repmat(bb',[size(aa,1) 1])

with
Code: Select all
bsxfun(@plus,aa,bb')
.
It should be slightly faster.
parrimin
Registered Member
Posts
4
Karma
0
You are right. repmat 2 matrices is almost half of the time. Thanks for your solution, I did the same in other way, but yours is better. And of course thank you for the Matlab solution, but my Matlab prototype is fast enough, I need speed in C++. Any clue on speeding up the matrix product? That is almost the other half of the time spent in the function.
smajewski
Registered Member
Posts
6
Karma
0
Hi

I am new to Eigen and Eigen community but maybe I'll be able to help.
Make sure you use -O2 , -O3 or -Ofast options with g++/gcc compiler. It gives much better performance, some of my programs using Eigen started running even 20-30 times faster after I have used compiler optimization.

Cheers
manuels
Registered Member
Posts
47
Karma
0
Is it really neccessary to copy the dataPtr and vecPtr.
Use can probably use them directly using Mappings
parrimin
Registered Member
Posts
4
Karma
0
I am compiling using Visual Studio 2010, and optimizations are on.

I can use mappings, but, I should cast to float, so, it is the same. Copying the data is about 0.001s., so its not a problem at all. I was just copying because my luck of knowledge about Eigen library. I had some fool error using them last week.
manuels
Registered Member
Posts
47
Karma
0
You can try gperf (http://www.gnu.org/software/gperf/) to determine which function call takes the most time.
parrimin
Registered Member
Posts
4
Karma
0
Thanks, but I cannot understand how can that software/library help to this task. The part that is taking most of the time is the line that says
[code]
ab = a*b.transpose();
[\code]

If I transpose the b matrix beforehand, the product time is similar, so transposing is not the problem.
twood
Registered Member
Posts
17
Karma
0
How large are your matrices, roughly? What happens if you change that line to:
Code: Select all
ab.noalias() = a*b.transpose();
?
(Aliasing explained here: http://eigen.tuxfamily.org/dox/TopicAliasing.html)
hughperkins
Registered Member
Posts
6
Karma
0
I tried using Jeigen http://github.com/hughperkins/jeigen , which is a java wrapper around eigen, and therefore has wwwaayyy more overhead than calling it directly.

My timings are bad, but not as bad as for yours:

matlab: 0.76 seconds
jeigen: 2.79 seconds

matlab code:

a = rand(2782,128);
b = rand(4000,128);
tic, a * b'; toc

jeigen code:

DenseMatrix a = rand(2782,128);
DenseMatrix b = rand(2782,128);
tic(); DenseMatrix c = a.mmul(b.t()); toc();

Some thoughts:
- the optimization one has already been mentioned
- I'm using ubuntu, rather than windows. maybe something system specific
- the one I'm thinking most likely: are you perhaps running on a multicore system? matlab automatically parallelizes over multiple cores. So, let's say you're running on two cores (like me), then matlab will give half the elapsed time (as we see above). If you have 12 cores, then matlab will be twelve times faster...
hughperkins
Registered Member
Posts
6
Karma
0
Ooooh, I was using the wrong dimension for the second matrix for the jeigen test. In fact, using the correct dimensions, the timing becomes:

matlab: 0.76 seconds
jeigen: 3.05 seconds

I made a dummy end-to-end multiplier to test java/jna latency, which does everything except actually ask Eigen to multiply the matrices, ie the native method looks like this:

// dummy operation to measure end to end latency
void dummy_op2( int rows, int middle, int cols, double *afirst, double *asecond, double *aresult ) {
MatrixXd first(rows,middle);
valuesToMatrix( rows, middle, afirst, &first );
MatrixXd second(middle,cols);
valuesToMatrix( middle, cols, asecond, &second );
//MatrixXd result = first * second;
MatrixXd result(rows,cols);
matrixToValues( rows, cols, &result, aresult );
}

Then, testing from java:

DenseMatrix a = rand(2782,128);
DenseMatrix b = rand(4000,128);
tic();
DenseMatrix c;
c = a.dummy_mmul(b.t()); toc();
c = a.dummy_mmul(b.t()); toc();
c = a.dummy_mmul(b.t()); toc();

Elapsed time: 570 ms
Elapsed time: 442 ms
Elapsed time: 435 ms

.... so the overhead of java/jna for these matrices is about 0.44 seconds. So the native time is about 2.5 seconds.

If we assume matlab was running multithreaded on 2 cores, then we would expect matlab to take about 1.5 seconds on a single core.

So in fact it does seem that eigen is about 60% slower than matlab/MKL in this case.
manuels
Registered Member
Posts
47
Karma
0
Can you wrap the scalar product with
Code: Select all
__asm__("# begin");

and
Code: Select all
__asm__("# end");

compile it with the -S flag and post the resulting assembly code here?


Bookmarks



Who is online

Registered users: Bing [Bot], Google [Bot], Yahoo [Bot]