Registered Member
|
I ran a simple benchmark on Eigen, comparing performance with and without MKL enabled. After looking at the results, I have some questions.
Here is the experimental setup. I compare the performance of double-precision square matrix products under five settings:
- a hand-coded implementation using simple for-loops
- Eigen without MKL (using MatrixXd)
- Eigen + MKL, NUM_MKL_THREADS = 1 (using MatrixXd)
- Eigen + MKL, NUM_MKL_THREADS = 2 (using MatrixXd)
- Eigen using Matrix4d (I found that enabling MKL makes no difference here)

Note: when using MatrixXd, the result matrix is pre-allocated, and I compute the product with C.noalias() = A * B.

Platform: MacBook Pro (OS X 10.7) / Intel Core i7 (2.66 GHz) / 8 GB RAM
Compiler: clang++ (LLVM 3.0)
Optimization flag: -O3

I repeated the same computation many times (so that the total runtime is about 1 sec) and then evaluated GFLOPS as the performance measure; the skeleton below shows roughly how I timed things.
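(A minimal reconstruction of my timing skeleton; the repetition count and the use of clock() are illustrative.)
[code]
#include <Eigen/Dense>
#include <ctime>
#include <cstdio>

using namespace Eigen;

int main()
{
    const int n = 4;            // matrix dimension being benchmarked
    const long reps = 10000000; // chosen so the total runtime is ~1 sec

    MatrixXd A = MatrixXd::Random(n, n);
    MatrixXd B = MatrixXd::Random(n, n);
    MatrixXd C(n, n);           // result pre-allocated outside the loop

    clock_t start = clock();
    for (long i = 0; i < reps; ++i)
        C.noalias() = A * B;    // no aliasing check, no temporary
    double secs = double(clock() - start) / CLOCKS_PER_SEC;

    // one n x n product costs about 2*n^3 flops
    double gflops = 2.0 * n * n * n * reps / secs * 1e-9;
    printf("n = %d: %.4f GFLOP/s\n", n, gflops);
    return 0;
}
[/code]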
A couple of questions:

(1) I notice that the setting with MKL is consistently and notably better than the one without MKL when MatrixXd is used (for all dimensions). Does that mean MKL is invoked even when the matrix is as small as 4 x 4?

(2) I notice that the performance with NUM_MKL_THREADS = 1 and with NUM_MKL_THREADS = 2 is basically the same when n < 30. There may be either of the following causes: (a) MKL is not used when n < 30, but that does not explain why the performance is still considerably better than with MKL turned off; (b) MKL simply disables threading when n is small.

(3) I am surprised to find that when n < 10, Eigen is even slower than the hand-coded for loop. What is the reason? (I thought the program does not re-allocate memory on every iteration, so what would incur such a huge overhead?)

I guess a reasonable implementation would only bring in MKL when the matrix is large enough. But by default, how large is "large enough" for MKL to kick in?

Here are the results
|
Registered Member
|
Is this for the n = 4 case? Maybe you should increase your number of realizations so that it runs for at least about 10 sec; 1 sec is really short for a benchmark. |
Moderator
|
1 sec is largely enough. What is important is that the system runs as few other processes as possible; typically the graphics system should be turned off.
The reason for the nearly 2x gap between Eigen's MatrixXd and MKL is that MKL exploits the AVX instruction set supported by your Core i7, while Eigen only exploits SSE instructions for the moment (4 doubles at once with AVX versus only 2 with SSE). For a MatrixX* we always trigger the same algorithm regardless of its actual size. The rationale is that when the matrices are really small, this information is usually known at compile time, and so the user should really use a fixed-size matrix, which is an order of magnitude faster (no allocation, loop unrolling, etc.), as in the sketch below. Note that Eigen can also exploit multithreading (-fopenmp), and like MKL we enable it only for matrices larger than about 32.
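(A minimal sketch of the fixed-size versus dynamic-size contrast, just to illustrate; the 4x4 size is arbitrary.)
[code]
#include <Eigen/Dense>
using namespace Eigen;

void products()
{
    // Dynamic size: heap allocation, and the product always goes
    // through the general gemm-like kernel, whatever the size.
    MatrixXd Ad = MatrixXd::Random(4, 4), Bd = MatrixXd::Random(4, 4);
    MatrixXd Cd(4, 4);
    Cd.noalias() = Ad * Bd;

    // Fixed size: no allocation, unrolled and vectorized code,
    // typically an order of magnitude faster at such sizes.
    Matrix4d Af = Matrix4d::Random(), Bf = Matrix4d::Random();
    Matrix4d Cf;
    Cf.noalias() = Af * Bf;
}
[/code]
|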
Moderator
|
btw, which Eigen version did you use? I remember that I've recently reduced the overhead of the general matrix product algorithm for small matrices. This is probably only in the devel branch.
|
Registered Member
|
I was using the default branch of the Mercurial repository. Also, I was running only the benchmark (no other apps) during the experiment (Mac OS X's own desktop UI and an xterm were up, though ...).
Does the update that reduces the overhead take effect in this version? You said "For a MatrixX* we always trigger the same algorithm regardless of its actual size." Does that mean that even when the size is 1 x 1, it triggers a call to the external gemm routine? I agree that in some cases (especially in geometry/graphics) the sizes of small matrices/vectors are known at compile time (e.g. 2D/3D points or transforms). However, in many other domains this is not the case. I am working on machine learning and would like to use Eigen to develop some generic algorithms. Say I want to write a multivariate Gaussian class. Given that MatrixNd and MatrixXd differ significantly in performance, I may have to write something like this:
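(A minimal sketch of the kind of class I mean; the names are just illustrative.)
[code]
#include <Eigen/Dense>

// Dimension known at compile time -> fixed-size matrices;
// Dim == Eigen::Dynamic -> falls back to dynamic-size storage.
template <int Dim>
class Gauss
{
public:
    typedef Eigen::Matrix<double, Dim, Dim> CovMat;
    typedef Eigen::Matrix<double, Dim, 1>   MeanVec;

    Gauss(const MeanVec& mu, const CovMat& sigma)
        : mu_(mu), sigma_(sigma) {}

    // ... density evaluation, sampling, etc. ...

private:
    MeanVec mu_;
    CovMat  sigma_;
};

// Gauss<3>              g3(mu3, sigma3); // fixed size, fast
// Gauss<Eigen::Dynamic> gx(mux, sigmax); // dynamic size, generic
[/code]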
This is feasible, but a little cumbersome, especially since it is not just Gauss: hundreds of other classes would face the same issue and thus need to carry an additional template argument specifying the dimension. It may also complicate interactions between them. I think something along the following lines inside Eigen would be very useful:
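(Only a sketch of the idea; the function and the threshold constant are hypothetical, not existing Eigen API.)
[code]
#include <Eigen/Dense>
using namespace Eigen;

const int SMALL_THRESHOLD = 8; // hypothetical compile-time constant

// Hypothetical dispatch: one runtime comparison routes small products
// to a naive kernel and everything else to the usual blocked one.
void product_dispatch(const MatrixXd& A, const MatrixXd& B, MatrixXd& C)
{
    if (A.rows() <= SMALL_THRESHOLD && B.cols() <= SMALL_THRESHOLD) {
        // simple triple loop: no blocking/packing setup cost
        for (int j = 0; j < B.cols(); ++j)
            for (int i = 0; i < A.rows(); ++i) {
                double s = 0;
                for (int k = 0; k < A.cols(); ++k)
                    s += A(i, k) * B(k, j);
                C(i, j) = s;
            }
    } else {
        C.noalias() = A * B; // blocked, cache-aware kernel
    }
}
[/code]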
The only overhead incurred here is a single comparison in the if statement (the threshold can be an empirically chosen, compile-time-resolved number), which would often be negligible when the computation is relatively heavy-weight (e.g. a matrix product). What do you think about this? |
Moderator
|
Defining a template<int N> class Gauss is really what I would do. Look how Eigen::Matrix4f is considerably faster than all the other generic variants!
Nevertheless, I admit that MatrixXd should always be at least as fast as a naive implementation. In Eigen2 we had such a threshold with dynamic branching, but in practice the threshold for switching to a simpler algorithm depends a lot on the compiler, scalar type, and hardware, so I'd rather try to reduce the overhead even more. Could you share your benchmark? I'd like to try it with a different architecture/compiler and also test the lazy product algorithm (A.lazyProduct(B)) on very small matrices.
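(For reference, a minimal usage sketch of the lazy product; the sizes are just an example.)
[code]
#include <Eigen/Dense>
using namespace Eigen;

void lazy_example()
{
    MatrixXd A = MatrixXd::Random(4, 4), B = MatrixXd::Random(4, 4);
    MatrixXd C(4, 4);

    // Coefficient-wise evaluation: skips the blocking/packing machinery
    // of the general product, which pays off only for small matrices.
    C = A.lazyProduct(B);
}
[/code]
|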
Moderator
|
ok, for the record, here is what I get on a Core i7 with gcc 4.6, in MFLOPS, using double, for the following 6 variants:
- fixed size, C.noalias() += A * B;
- fixed size, C = A.lazyProduct(B);
- fixed size, hand-written naive loops;
- the same 3 variants with MatrixXd.

size  ; fixed noalias ; fixed lazy ; fixed naive ;; dyn noalias ; dyn lazy ; dyn naive
1x1   ;  619 ;  563 ;  619 ;;    6 ;  146 ;  274
2x2   ; 5438 ; 5438 ; 3295 ;;   48 ;  672 ;  708
3x3   ; 4515 ; 4524 ; 1817 ;;  149 ;  934 ;  965
4x4   ; 8894 ; 8863 ; 1969 ;;  354 ; 1420 ; 1166
5x5   ; 3644 ; 3610 ; 1962 ;;  557 ; 1400 ; 1167
6x6   ; 7030 ; 7027 ; 1937 ;;  784 ; 1757 ; 1309
7x7   ; 5061 ; 5064 ; 2181 ;; 1104 ; 1746 ; 1361
8x8   ; 3976 ; 8512 ; 2179 ;; 1675 ; 1874 ; 1396
9x9   ; 3754 ; 4999 ; 2215 ;; 1927 ; 1879 ; 1405
10x10 ; 4326 ; 8282 ; 2191 ;; 2403 ; 1858 ; 1434
11x11 ; 4156 ; 4489 ; 1980 ;; 2629 ; 1956 ; 1399
12x12 ; 5499 ; 8245 ; 2026 ;; 3949 ; 1938 ; 1423
13x13 ; 5394 ; 4663 ; 2052 ;; 4096 ; 1990 ; 1434
14x14 ; 5719 ; 7582 ; 2149 ;; 4512 ; 1945 ; 1369
15x15 ; 5538 ; 4361 ; 1905 ;; 4519 ; 2002 ; 1414
16x16 ; 6848 ; 7483 ; 1844 ;; 5631 ; 1984 ; 1440

In this case, using the lazy product for matrices smaller than 8x8 seems to be a good threshold, but the MatrixXd results are still very far from the fixed-size ones! |
Registered Member
|
ggael: I'm a bit confused by the original numbers:
Hand-coded, n = 4: 1.3462 GFLOP/s
Eigen without MKL (MatrixXd), n = 4: 0.2605 GFLOP/s

Especially as I thought that clang does not support autovectorization?