This forum has been archived. All content is frozen. Please use KDE Discuss instead.

Questions on Eigen + MKL performance

dhlin
Registered Member
I ran a simple benchmark on Eigen, comparing performance with and without MKL enabled. After looking at the results, I have some questions.

Here is the experimental setup. I compare performance on double-precision square matrix products across five settings:
- a hand-coded implementation using simple for loops
- Eigen without MKL (using MatrixXd)
- Eigen + MKL (NUM_MKL_THREADS = 1) (using MatrixXd)
- Eigen + MKL (NUM_MKL_THREADS = 2) (using MatrixXd)
- Eigen (using Matrix4d): I found that enabling MKL does not affect this case

Note: when using MatrixXd, the result matrix is pre-allocated, and I use the following statement:

C.noalias() = A * B;

Platform: MacBook Pro (OS X 10.7) / Intel Core i7 (2.66 GHz) / 8 GB Ram
Compiler: clang++ (LLVM 3.0)
Optimization flag: -O3

I repeated the same computation many times (so that the total runtime is about 1 sec), then evaluated GFLOP/s as the performance measure (the code skeleton below shows how I timed it). The results are shown in the code block further down.

Code:

// ... some code to generate A and B randomly ...

MatrixXd C(n, n);

timer.start();
for (int i = 0; i < nrepeats; ++i) C.noalias() = A * B;
elapsed_secs = timer.elapsed();

// ... some code to evaluate GFLOP/s ...



A few questions:

(1) I notice that the setting with MKL is consistently and notably better than the one without, when MatrixXd is used (for all dimensions). Does that mean MKL is invoked even when the matrix is as small as 4 x 4?

(2) I notice that the performance with NUM_MKL_THREADS = 1 and with NUM_MKL_THREADS = 2 is basically the same when n < 30. There are two possible causes: (a) MKL is not used when n < 30, but that does not explain why the performance is still considerably better than with MKL turned off; (b) MKL simply disables threading when n is small.

(3) I am surprised to find that when n < 10, Eigen is even slower than the hand-coded for loop. What is the reason? (I thought the program would not re-allocate memory on every iteration, so what incurs such a large overhead?)

I guess a reasonable implementation would only invoke MKL when the matrix is large enough. But by default, how large is "large enough" to bring in MKL?


Here are the results:

Code:

# The matrix size is n x n

Hand-code:
------------------------------------------
n =    4:   1.3462 GFLOP/s
n =    8:   1.4522 GFLOP/s
n =   10:   1.5089 GFLOP/s
n =   20:   1.6291 GFLOP/s
n =   30:   1.6622 GFLOP/s
n =   40:   1.6595 GFLOP/s
n =   60:   1.6451 GFLOP/s
n =   80:   1.5426 GFLOP/s
n =  100:   1.5447 GFLOP/s
n =  200:   1.4502 GFLOP/s
n =  400:   1.1189 GFLOP/s


Eigen Benchmark (Matrix4d):
------------------------------------------
n =    4:   6.5269 GFLOP/s


Eigen without MKL (MatrixXd)
------------------------------------------
n =    4:   0.2605 GFLOP/s
n =    8:   1.2848 GFLOP/s
n =   10:   1.9232 GFLOP/s
n =   20:   4.5917 GFLOP/s
n =   30:   5.5190 GFLOP/s
n =   40:   6.2192 GFLOP/s
n =   60:   6.8257 GFLOP/s
n =   80:   7.1658 GFLOP/s
n =  100:   7.2200 GFLOP/s
n =  200:   7.4408 GFLOP/s
n =  400:   7.4796 GFLOP/s
n =  800:   7.3855 GFLOP/s
n = 1600:   7.7084 GFLOP/s


Eigen + MKL (MatrixXd): (NUM_MKL_THREADS=1)
------------------------------------------
n =    4:   0.5312 GFLOP/s
n =    8:   2.4924 GFLOP/s
n =   10:   3.5573 GFLOP/s
n =   20:   6.1772 GFLOP/s
n =   30:   7.4957 GFLOP/s
n =   40:   9.3723 GFLOP/s
n =   60:  10.2933 GFLOP/s
n =   80:  10.9199 GFLOP/s
n =  100:  10.9439 GFLOP/s
n =  200:  11.7925 GFLOP/s
n =  400:  11.6909 GFLOP/s
n =  800:  11.5765 GFLOP/s
n = 1600:  11.9823 GFLOP/s


Eigen + MKL (MatrixXd): (NUM_MKL_THREADS=2)
------------------------------------------
n =    4:   0.5105 GFLOP/s
n =    8:   2.4208 GFLOP/s
n =   10:   3.5824 GFLOP/s
n =   20:   6.1843 GFLOP/s
n =   30:   7.6997 GFLOP/s
n =   40:  10.3168 GFLOP/s
n =   60:  14.6734 GFLOP/s
n =   80:  16.6423 GFLOP/s
n =  100:  17.5865 GFLOP/s
n =  200:  18.9480 GFLOP/s
n =  400:  19.4909 GFLOP/s
n =  800:  19.1130 GFLOP/s
n = 1600:  20.6289 GFLOP/s
manuels
Registered Member
"such that the total runtime is about 1 sec"

Is this for the n = 4 case?

Maybe you should increase the number of repetitions so that it runs for at least about 10 sec.

1 sec is really short for a benchmark.
ggael
Moderator
1 sec is largely enough; what is important is that the system runs as few other processes as possible (typically, the graphics system should be turned off).

The reason for the nearly 2x gap between Eigen's MatrixXd and MKL is that MKL exploits the AVX instruction set supported by your Core i7, while Eigen only exploits SSE instructions for the moment (4 doubles at once with AVX versus only 2 with SSE). For a MatrixX* we always trigger the same algorithm regardless of its actual size. The rationale is that when the matrices are really small, this information is usually known at compile time, and so the user should really use a fixed-size matrix, which is an order of magnitude faster (no allocation, loop unrolling, etc.).
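To make the fixed-size point concrete, here is a minimal plain-C++ sketch (no Eigen; the function names are mine): when the size is a template parameter, the compiler can fully unroll the loops and keep operands in registers, while the runtime-sized variant gives the optimizer far less information.

```cpp
#include <cstddef>

// Size known at compile time: for small N the compiler can fully
// unroll these loops and keep everything in registers.
template <int N>
void matmul_fixed(const double (&A)[N][N], const double (&B)[N][N],
                  double (&C)[N][N]) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            double s = 0.0;
            for (int k = 0; k < N; ++k) s += A[i][k] * B[k][j];
            C[i][j] = s;
        }
}

// Size known only at run time: same algorithm, but the trip counts
// are opaque to the optimizer and nothing can be unrolled statically.
void matmul_dynamic(const double* A, const double* B, double* C, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double s = 0.0;
            for (int k = 0; k < n; ++k) s += A[i * n + k] * B[k * n + j];
            C[i * n + j] = s;
        }
}
```

With -O3, the fixed-size version typically compiles to straight-line code for tiny N, which is essentially what Eigen's fixed-size Matrix types give you.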

Note that Eigen can also exploit multithreading (-fopenmp), and like MKL we enable it only for matrices larger than about 32.
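For reference, the relevant knobs look roughly like this (a sketch; note that the environment variable MKL itself reads is MKL_NUM_THREADS, and that OpenMP support requires a compiler that provides it, e.g. GCC; the clang 3.0 used above does not):

```shell
# MKL threading (what the benchmark varies)
export MKL_NUM_THREADS=2

# OpenMP threading, used by Eigen's own parallelized products
export OMP_NUM_THREADS=2

# Build with Eigen's multithreading enabled (GCC shown; adjust paths)
g++ -O3 -fopenmp -I /path/to/eigen bench.cpp -o bench
```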
ggael
Moderator
btw, which Eigen version did you use ? I remember that I've recently reduced the overhead of the general matrix product algorithm for small matrices. This is probably only in the devel branch.
dhlin
Registered Member
I was using the default branch of the Mercurial repository. Also, only the benchmark was running during the experiment (Mac OS X's own desktop UI and an xterm were up, though ...)

Code:
changeset:   4597:3128918ee8de
tag:         tip
user:        Gael Guennebaud <g.gael@free.fr>
date:        Wed Apr 11 09:49:52 2012 +0200
summary:     remove an extra ';' and suppress a 'variable used before its value is set' warning


Does the update that reduces the overhead take effect in this version?

You said, "For a MatrixX* we always trigger the same algorithm regardless of its actual size."

Does that mean that even a 1 x 1 product triggers a call to the external gemm routine?

I agree that in some cases (especially in geometry/graphics), the sizes of small matrices/vectors are known at compile time (e.g. 2D/3D points or transforms). However, in many other domains this is not the case.

I am working on machine learning and would like to use Eigen to develop some generic algorithms.
Say I want to write a multivariate Gaussian class. Considering that MatrixNd and MatrixXd differ significantly in performance, I may have to write something like this:

Code:

template<int N>
class Gauss
{
    ...... using Matrix<double, N, N> as members ......
};

typedef Gauss<Dynamic> GaussXd;
typedef Gauss<2> Gauss2d;
typedef Gauss<3> Gauss3d;


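The idea can be sketched without Eigen; `Dynamic` and the storage selection below are illustrative stand-ins for what Matrix<double, N, N> does internally:

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr int Dynamic = -1;  // stand-in for Eigen's Eigen::Dynamic

// Fixed-size case: storage is an inline std::array whose size is
// known at compile time (no heap allocation).
template <int N>
struct Gauss {
    std::array<double, std::size_t(N) * N> cov;  // N x N covariance entries
    static constexpr int dim() { return N; }
};

// Dynamic case: storage is heap-allocated and the dimension is
// carried at run time.
template <>
struct Gauss<Dynamic> {
    std::vector<double> cov;
    int n = 0;
    int dim() const { return n; }
};

using GaussXd = Gauss<Dynamic>;
using Gauss2d = Gauss<2>;
using Gauss3d = Gauss<3>;
```

Eigen's Matrix<double, N, N> makes the same choice internally: inline storage for a fixed N, heap storage for Dynamic.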

This is feasible, but a little cumbersome, especially when it is not just Gauss but hundreds of other classes that face this issue and thus need an additional template argument to specify the dimension. It may also complicate interactions between them.

I think the following approach in Eigen would be very useful:

Code:

if (matrix size > some threshold)
{
      call some heavyweight routine (e.g. dgemm)
}
else
{
      call some (inlined) hand-written code
}


The only overhead incurred here is a single comparison in the if statement (the threshold can be an empirically chosen compile-time constant), which would often be negligible when the computation is relatively heavyweight (e.g. a matrix product). What do you think about this?
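A concrete version of that dispatch, as a plain-C++ sketch (the threshold value and the heavy_gemm placeholder are made up for illustration; in a real build the heavy branch would be a blocked gemm or a backend such as dgemm):

```cpp
#include <cstddef>

constexpr int kSmallThreshold = 8;  // illustrative; would be tuned empirically

// Placeholder for a heavyweight backend call such as dgemm.
void heavy_gemm(const double* A, const double* B, double* C, int n);

// Inlined naive kernel for tiny sizes: no setup cost at all.
inline void small_gemm(const double* A, const double* B, double* C, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double s = 0.0;
            for (int k = 0; k < n; ++k) s += A[i * n + k] * B[k * n + j];
            C[i * n + j] = s;
        }
}

void gemm_dispatch(const double* A, const double* B, double* C, int n) {
    if (n > kSmallThreshold)
        heavy_gemm(A, B, C, n);   // amortizes its setup cost on large inputs
    else
        small_gemm(A, B, C, n);   // overhead is just this one comparison
}

// For this self-contained sketch, the "heavy" routine reuses the naive loop.
void heavy_gemm(const double* A, const double* B, double* C, int n) {
    small_gemm(A, B, C, n);
}
```

The dispatch cost is a single integer comparison per product, which is negligible next to even a small matrix multiply.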
ggael
Moderator
defining a template<int N> class Gauss is really what I would do. Look how Eigen::Matrix4f is considerably faster than all the other generic variants!

Nevertheless, I admit that MatrixXd should always be at least as fast as a naive implementation. In Eigen 2 we had such a threshold with dynamic branching, but in practice the threshold for switching to a simpler algorithm depends a lot on the compiler, scalar type, and hardware, so I'd rather try to reduce the overhead even more.

Could you share your benchmark? I'd like to try it with different architectures/compilers and also to test the lazy product algorithm (A.lazyProduct(B)) on very small matrices.
ggael
Moderator
ok, for the record, here is what I get on a Core i7 with gcc 4.6, in MFLOP/s, using double, for the following 6 variants:

- fixed size, C.noalias() += A * B;
- fixed size, C = A.lazyProduct(B);
- fixed size, hand-written naive loops;
- the same 3 variants with MatrixXd.

#  n  ; fixed noalias ; fixed lazyProduct ; fixed naive ; ;; dyn noalias ; dyn lazyProduct ; dyn naive
1x1 ; 619 ; 563 ; 619 ; ;; 6 ; 146 ; 274 ;
2x2 ; 5438 ; 5438 ; 3295 ; ;; 48 ; 672 ; 708 ;
3x3 ; 4515 ; 4524 ; 1817 ; ;; 149 ; 934 ; 965 ;
4x4 ; 8894 ; 8863 ; 1969 ; ;; 354 ; 1420 ; 1166 ;
5x5 ; 3644 ; 3610 ; 1962 ; ;; 557 ; 1400 ; 1167 ;
6x6 ; 7030 ; 7027 ; 1937 ; ;; 784 ; 1757 ; 1309 ;
7x7 ; 5061 ; 5064 ; 2181 ; ;; 1104 ; 1746 ; 1361 ;
8x8 ; 3976 ; 8512 ; 2179 ; ;; 1675 ; 1874 ; 1396 ;
9x9 ; 3754 ; 4999 ; 2215 ; ;; 1927 ; 1879 ; 1405 ;
10x10 ; 4326 ; 8282 ; 2191 ; ;; 2403 ; 1858 ; 1434 ;
11x11 ; 4156 ; 4489 ; 1980 ; ;; 2629 ; 1956 ; 1399 ;
12x12 ; 5499 ; 8245 ; 2026 ; ;; 3949 ; 1938 ; 1423 ;
13x13 ; 5394 ; 4663 ; 2052 ; ;; 4096 ; 1990 ; 1434 ;
14x14 ; 5719 ; 7582 ; 2149 ; ;; 4512 ; 1945 ; 1369 ;
15x15 ; 5538 ; 4361 ; 1905 ; ;; 4519 ; 2002 ; 1414 ;
16x16 ; 6848 ; 7483 ; 1844 ; ;; 5631 ; 1984 ; 1440 ;

In this case, using lazyProduct for matrices smaller than 8x8 seems to be a good threshold, but the results are still very far from the fixed-size ones!
Andre
Registered Member
ggael: I'm a bit confused by the original numbers:


Hand-code: n = 4: 1.3462 GFLOP/s

Eigen without MKL (MatrixXd): n = 4: 0.2605 GFLOP/s


Especially as I thought that clang does not support autovectorization?



