Registered Member
|
I ran a simple benchmark on Eigen, comparing performance with and without MKL enabled. After looking at the results, I have some questions.
Here is the experimental setup. I compare the performance of double-precision square matrix products under five settings:
- a hand-coded implementation using simple for-loops
- Eigen without MKL (using MatrixXd)
- Eigen + MKL, NUM_MKL_THREADS = 1 (using MatrixXd)
- Eigen + MKL, NUM_MKL_THREADS = 2 (using MatrixXd)
- Eigen using Matrix4d (I found that enabling MKL makes no difference here)

Note: when using MatrixXd, the result matrix is pre-allocated, and I compute the product with C.noalias() = A * B.

Platform: MacBook Pro (OS X 10.7) / Intel Core i7 (2.66 GHz) / 8 GB RAM
Compiler: clang++ (LLVM 3.0)
Optimization flag: -O3

I repeated the same computation many times (so that the total runtime is about 1 sec) and then evaluated GFLOPS as the performance measure; the skeleton below shows roughly how I timed things.
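(A minimal reconstruction of my timing skeleton; the repetition count and the use of clock() are illustrative.)
[code]
#include <Eigen/Dense>
#include <ctime>
#include <cstdio>

using namespace Eigen;

int main()
{
    const int n = 4;            // matrix dimension being benchmarked
    const long reps = 10000000; // chosen so the total runtime is ~1 sec

    MatrixXd A = MatrixXd::Random(n, n);
    MatrixXd B = MatrixXd::Random(n, n);
    MatrixXd C(n, n);           // result pre-allocated outside the loop

    clock_t start = clock();
    for (long i = 0; i < reps; ++i)
        C.noalias() = A * B;    // no aliasing check, no temporary
    double secs = double(clock() - start) / CLOCKS_PER_SEC;

    // one n x n product costs about 2*n^3 flops
    double gflops = 2.0 * n * n * n * reps / secs * 1e-9;
    printf("n = %d: %.4f GFLOP/s\n", n, gflops);
    return 0;
}
[/code]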
A couple of questions:

(1) I notice that the setting with MKL is consistently and notably better than the one without MKL when MatrixXd is used (for all dimensions). Does that mean MKL is invoked even when the matrix is as small as 4 x 4?

(2) I notice that the performance with NUM_MKL_THREADS = 1 and with NUM_MKL_THREADS = 2 is basically the same when n < 30. There may be either of the following causes: (a) MKL is not used when n < 30, but that does not explain why the performance is still considerably better than with MKL turned off; (b) MKL simply disables threading when n is small.

(3) I am surprised to find that when n < 10, Eigen is even slower than the hand-coded for loop. What is the reason? (I thought the program does not re-allocate memory on every iteration, so what would incur such a huge overhead?)

I guess a reasonable implementation would only bring in MKL when the matrix is large enough. But by default, how large is "large enough" for MKL to kick in?

Here are the results
|
Registered Member
|
Is this for the n = 4 case? Maybe you should increase your number of realizations so that it runs for at least about 10 sec; 1 sec is really short for a benchmark. |
Moderator
|
1 sec is largely enough. What is important is that the system runs as few other processes as possible; typically the graphics system should be turned off.
The reason for the nearly 2x gap between Eigen's MatrixXd and MKL is that MKL exploits the AVX instruction set supported by your Core i7, while Eigen only exploits SSE instructions for the moment (4 doubles at once with AVX versus only 2 with SSE). For a MatrixX* we always trigger the same algorithm regardless of its actual size. The rationale is that when the matrices are really small, this information is usually known at compile time, and so the user should really use a fixed-size matrix, which is an order of magnitude faster (no allocation, loop unrolling, etc.), as in the sketch below. Note that Eigen can also exploit multithreading (-fopenmp), and like MKL we enable it only for matrices larger than about 32.
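(A minimal sketch of the fixed-size versus dynamic-size contrast, just to illustrate; the 4x4 size is arbitrary.)
[code]
#include <Eigen/Dense>
using namespace Eigen;

void products()
{
    // Dynamic size: heap allocation, and the product always goes
    // through the general gemm-like kernel, whatever the size.
    MatrixXd Ad = MatrixXd::Random(4, 4), Bd = MatrixXd::Random(4, 4);
    MatrixXd Cd(4, 4);
    Cd.noalias() = Ad * Bd;

    // Fixed size: no allocation, unrolled and vectorized code,
    // typically an order of magnitude faster at such sizes.
    Matrix4d Af = Matrix4d::Random(), Bf = Matrix4d::Random();
    Matrix4d Cf;
    Cf.noalias() = Af * Bf;
}
[/code]
|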
Moderator
|
btw, which Eigen version did you use? I remember that I've recently reduced the overhead of the general matrix product algorithm for small matrices. This is probably only in the devel branch.
|
Registered Member
|
I was using the default branch of the Mercurial repository. Also, I was running only the benchmark (no other apps) during the experiment (Mac OS X's own desktop UI and an xterm were up, though ...).
Does the update that reduces the overhead take effect in this version? You said "For a MatrixX* we always trigger the same algorithm regardless of its actual size." Does that mean that even when the size is 1 x 1, it triggers a call to the external gemm routine? I agree that in some cases (especially in geometry/graphics) the sizes of small matrices/vectors are known at compile time (e.g. 2D/3D points or transforms). However, in many other domains this is not the case. I am working on machine learning and would like to use Eigen to develop some generic algorithms. Say I want to write a multivariate Gaussian class. Given that MatrixNd and MatrixXd differ significantly in performance, I may have to write something like this:
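(A minimal sketch of the kind of class I mean; the names are just illustrative.)
[code]
#include <Eigen/Dense>

// Dimension known at compile time -> fixed-size matrices;
// Dim == Eigen::Dynamic -> falls back to dynamic-size storage.
template <int Dim>
class Gauss
{
public:
    typedef Eigen::Matrix<double, Dim, Dim> CovMat;
    typedef Eigen::Matrix<double, Dim, 1>   MeanVec;

    Gauss(const MeanVec& mu, const CovMat& sigma)
        : mu_(mu), sigma_(sigma) {}

    // ... density evaluation, sampling, etc. ...

private:
    MeanVec mu_;
    CovMat  sigma_;
};

// Gauss<3>              g3(mu3, sigma3); // fixed size, fast
// Gauss<Eigen::Dynamic> gx(mux, sigmax); // dynamic size, generic
[/code]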
This is feasible, but a little cumbersome, especially since it is not just Gauss: hundreds of other classes would face the same issue and thus need to carry an additional template argument specifying the dimension. It may also complicate interactions between them. I think something along the following lines inside Eigen would be very useful:
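(Only a sketch of the idea; the function and the threshold constant are hypothetical, not existing Eigen API.)
[code]
#include <Eigen/Dense>
using namespace Eigen;

const int SMALL_THRESHOLD = 8; // hypothetical compile-time constant

// Hypothetical dispatch: one runtime comparison routes small products
// to a naive kernel and everything else to the usual blocked one.
void product_dispatch(const MatrixXd& A, const MatrixXd& B, MatrixXd& C)
{
    if (A.rows() <= SMALL_THRESHOLD && B.cols() <= SMALL_THRESHOLD) {
        // simple triple loop: no blocking/packing setup cost
        for (int j = 0; j < B.cols(); ++j)
            for (int i = 0; i < A.rows(); ++i) {
                double s = 0;
                for (int k = 0; k < A.cols(); ++k)
                    s += A(i, k) * B(k, j);
                C(i, j) = s;
            }
    } else {
        C.noalias() = A * B; // blocked, cache-aware kernel
    }
}
[/code]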
The only overhead incurred here is a single comparison in the if statement (the threshold can be an empirically chosen, compile-time-resolved number), which would often be negligible when the computation is relatively heavy-weight (e.g. a matrix product). What do you think about this? |
Moderator
|
Defining a template<int N> class Gauss is really what I would do. Look how Eigen::Matrix4f is considerably faster than all the other generic variants!
Nevertheless, I admit that MatrixXd should always be at least as fast as a naive implementation. In Eigen2 we had such a threshold with dynamic branching, but in practice the threshold for switching to a simpler algorithm depends a lot on the compiler, scalar type, and hardware, so I'd rather try to reduce the overhead even more. Could you share your benchmark? I'd like to try it with a different architecture/compiler and also test the lazy product algorithm (A.lazyProduct(B)) on very small matrices.
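(For reference, a minimal usage sketch of the lazy product; the sizes are just an example.)
[code]
#include <Eigen/Dense>
using namespace Eigen;

void lazy_example()
{
    MatrixXd A = MatrixXd::Random(4, 4), B = MatrixXd::Random(4, 4);
    MatrixXd C(4, 4);

    // Coefficient-wise evaluation: skips the blocking/packing machinery
    // of the general product, which pays off only for small matrices.
    C = A.lazyProduct(B);
}
[/code]
|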
Moderator
|
ok, for the record, here is what I get on a Core i7 with gcc 4.6, in MFLOPS, using double, for the following 6 variants:
- fixed size, C.noalias() += A * B;
- fixed size, C = A.lazyProduct(B);
- fixed size, hand-written naive loops;
- the same 3 variants with MatrixXd.

size  ; fixed noalias ; fixed lazy ; fixed naive ;; dyn noalias ; dyn lazy ; dyn naive
1x1   ;  619 ;  563 ;  619 ;;    6 ;  146 ;  274
2x2   ; 5438 ; 5438 ; 3295 ;;   48 ;  672 ;  708
3x3   ; 4515 ; 4524 ; 1817 ;;  149 ;  934 ;  965
4x4   ; 8894 ; 8863 ; 1969 ;;  354 ; 1420 ; 1166
5x5   ; 3644 ; 3610 ; 1962 ;;  557 ; 1400 ; 1167
6x6   ; 7030 ; 7027 ; 1937 ;;  784 ; 1757 ; 1309
7x7   ; 5061 ; 5064 ; 2181 ;; 1104 ; 1746 ; 1361
8x8   ; 3976 ; 8512 ; 2179 ;; 1675 ; 1874 ; 1396
9x9   ; 3754 ; 4999 ; 2215 ;; 1927 ; 1879 ; 1405
10x10 ; 4326 ; 8282 ; 2191 ;; 2403 ; 1858 ; 1434
11x11 ; 4156 ; 4489 ; 1980 ;; 2629 ; 1956 ; 1399
12x12 ; 5499 ; 8245 ; 2026 ;; 3949 ; 1938 ; 1423
13x13 ; 5394 ; 4663 ; 2052 ;; 4096 ; 1990 ; 1434
14x14 ; 5719 ; 7582 ; 2149 ;; 4512 ; 1945 ; 1369
15x15 ; 5538 ; 4361 ; 1905 ;; 4519 ; 2002 ; 1414
16x16 ; 6848 ; 7483 ; 1844 ;; 5631 ; 1984 ; 1440

In this case, using the lazy product for matrices smaller than 8x8 seems to be a good threshold, but the MatrixXd results are still very far from the fixed-size ones! |
Registered Member
|
ggael: I'm a bit confused by the original numbers:
Hand-coded, n = 4: 1.3462 GFLOP/s
Eigen without MKL (MatrixXd), n = 4: 0.2605 GFLOP/s

Especially as I thought that clang does not support autovectorization?