parrimin
Registered Member

Hi,
I am trying to port some Matlab coded prototype to C++ program. I have chosen Eigen to make some calculations, but it has become that the consuming time is about 10 times slower in best case. Let me explain the situation:  Matrix a, unsigned chars, 2782 rows x 128 cols.  Matrix b, unsigned chars, 4000 rows x 128 cols. I need to know for each row (vector) in a, the vector in b with minimum distance. Here is the matlab code, that is running in 0.05 seconds:
and here the c++ code. I tried the same code using ints, floats and doubles:
Time using SSE2 optimizations int: 0.74916s. float: 0.502975s. double: 0.72532s. Without SSE2 int: 1.25108s. float: 0.940441s. double: 1.02473s. The best time (float with SSE2) is far from the results from Matlab. Is there something I am missing to make this faster? 
nicaiwss
Registered Member

I found that eigen replicate is kind of slow, can you try like this:
Also, I think you can replace your matlab code
with
It should be slightly faster. 
parrimin
Registered Member

You are right. repmat 2 matrices is almost half of the time. Thanks for your solution, I did the same in other way, but yours is better. And of course thank you for the Matlab solution, but my Matlab prototype is fast enough, I need speed in C++. Any clue on speeding up the matrix product? That is almost the other half of the time spent in the function.

smajewski
Registered Member

Hi
I am new to Eigen and Eigen community but maybe I'll be able to help. Make sure you use O2 , O3 or Ofast options with g++/gcc compiler. It gives much better performance, some of my programs using Eigen started running even 2030 times faster after I have used compiler optimization. Cheers 
manuels
Registered Member

Is it really neccessary to copy the dataPtr and vecPtr.
Use can probably use them directly using Mappings 
parrimin
Registered Member

I am compiling using Visual Studio 2010, and optimizations are on.
I can use mappings, but, I should cast to float, so, it is the same. Copying the data is about 0.001s., so its not a problem at all. I was just copying because my luck of knowledge about Eigen library. I had some fool error using them last week. 
manuels
Registered Member

You can try gperf (http://www.gnu.org/software/gperf/) to determine which function call takes the most time.

parrimin
Registered Member

Thanks, but I cannot understand how can that software/library help to this task. The part that is taking most of the time is the line that says
[code] ab = a*b.transpose(); [\code] If I transpose the b matrix beforehand, the product time is similar, so transposing is not the problem. 
twood
Registered Member

How large are your matrices, roughly? What happens if you change that line to:
(Aliasing explained here: http://eigen.tuxfamily.org/dox/TopicAliasing.html) 
hughperkins
Registered Member

I tried using Jeigen http://github.com/hughperkins/jeigen , which is a java wrapper around eigen, and therefore has wwwaayyy more overhead than calling it directly.
My timings are bad, but not as bad as for yours: matlab: 0.76 seconds jeigen: 2.79 seconds matlab code: a = rand(2782,128); b = rand(4000,128); tic, a * b'; toc jeigen code: DenseMatrix a = rand(2782,128); DenseMatrix b = rand(2782,128); tic(); DenseMatrix c = a.mmul(b.t()); toc(); Some thoughts:  the optimization one has already been mentioned  I'm using ubuntu, rather than windows. maybe something system specific  the one I'm thinking most likely: are you perhaps running on a multicore system? matlab automatically parallelizes over multiple cores. So, let's say you're running on two cores (like me), then matlab will give half the elapsed time (as we see above). If you have 12 cores, then matlab will be twelve times faster... 
hughperkins
Registered Member

Ooooh, I was using the wrong dimension for the second matrix for the jeigen test. In fact, using the correct dimensions, the timing becomes:
matlab: 0.76 seconds jeigen: 3.05 seconds I made a dummy endtoend multiplier to test java/jna latency, which does everything except actually ask Eigen to multiply the matrices, ie the native method looks like this: // dummy operation to measure end to end latency void dummy_op2( int rows, int middle, int cols, double *afirst, double *asecond, double *aresult ) { MatrixXd first(rows,middle); valuesToMatrix( rows, middle, afirst, &first ); MatrixXd second(middle,cols); valuesToMatrix( middle, cols, asecond, &second ); //MatrixXd result = first * second; MatrixXd result(rows,cols); matrixToValues( rows, cols, &result, aresult ); } Then, testing from java: DenseMatrix a = rand(2782,128); DenseMatrix b = rand(4000,128); tic(); DenseMatrix c; c = a.dummy_mmul(b.t()); toc(); c = a.dummy_mmul(b.t()); toc(); c = a.dummy_mmul(b.t()); toc(); Elapsed time: 570 ms Elapsed time: 442 ms Elapsed time: 435 ms .... so the overhead of java/jna for these matrices is about 0.44 seconds. So the native time is about 2.5 seconds. If we assume matlab was running multithreaded on 2 cores, then we would expect matlab to take about 1.5 seconds on a single core. So in fact it does seem that eigen is about 60% slower than matlab/MKL in this case. 
manuels
Registered Member

Can you wrap the scalar product with
and
compile it with the S flag and post the resulting assembly code here? 
Registered users: aaricabbot, Baidu [Spider], Bing [Bot], broadstairs, dargaud, davidemme, edward.81, elvisangelaccio, Exabot [Bot], fabianr, Google [Bot], google01103, Hoog, jmacleod, Majestic12 [Bot], mgraesslin, Uri_Herrera, YaCy [Bot], Yahoo [Bot], zebulon