Registered Member
|
Hi,
I'm really interested in using Eigen, but I cannot reproduce performance similar to what the benchmarks show. I suppose I'm doing something wrong, but I can't figure out what. I only need fairly large ( O(1000) ) matrix-matrix products.
My CPU is: Intel(R) Core(TM) i7-2720QM CPU @ 2.20GHz, cache size 6144 KB.
I tried g++ 4.6.1 and g++ 4.4 (Debian, 64 bits) with the flags -msse4.2 -O3, but for a dgemm operation on square matrices of size 1000 I only get half the speed of MKL (10.3.6.233_intel64, sequential). Even ATLAS is 40% faster. Those ratios are more or less the same in single and double precision.
I'm using the Eigen 3.0.2 Debian package (I also tried hg revision 4277 but didn't get significant differences). If I were able to get performance somewhere between ATLAS and MKL, I would switch immediately.
Here is my code (I also tried the eigen_blas library so that I can use the same code to test MKL and Eigen, but I get similar results):
|
Moderator
|
Are you comparing against a sequential or a parallel version of MKL? To enable parallelization you have to add the -fopenmp flag. Be careful about hyper-threading: if hyper-threading is enabled, run your program with:
$ OMP_NUM_THREADS=<true number of cores> ./myprog
If you're unsure, please report the timing numbers so we can see how far we are from the theoretical peak performance of your system. |
Registered Member
|
Thank you for your answer.
I'm linking against the sequential MKL, and I compiled a sequential ATLAS for the timing. (I'm not really interested in a parallel BLAS for this application because the main program itself is parallel (MPI).)
Here are my timings for a dgemm of square (size 1000) matrices (blas = mkl):
time eigen = 1.713138e-01
time blas = 8.847635e-02
They were obtained with the following code, compiled with:
|
Moderator
|
0.0884s with blas is theoretically impossible to achieve in sequential code. Your CPU can run at most at 3.3GHz (turbo boost), which implies a maximum of 13.2e9 double-precision operations per second: 3.3G * 2 (for SSE) * 2 (for 1 add and 1 mul per cycle). The 0.0884s you get would imply a rate of about 22e9 ops per second...
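To make the arithmetic above easy to re-check, here is a minimal C++ sketch of the peak model described in this post (clock rate times SIMD lanes times vector operations per cycle). The function names `peak_flops` and `exceeds_peak` are hypothetical helpers introduced for illustration, and the model deliberately ignores memory bandwidth and anything beyond the three factors quoted.

```cpp
#include <cassert>

// Theoretical peak in floating-point operations per second.
// clock_hz:      core clock in Hz (3.3e9 with turbo boost in this thread)
// simd_lanes:    doubles per vector register (2 for 128-bit SSE)
// ops_per_cycle: vector operations issued per cycle (1 add + 1 mul = 2)
double peak_flops(double clock_hz, int simd_lanes, int ops_per_cycle) {
    assert(clock_hz > 0 && simd_lanes > 0 && ops_per_cycle > 0);
    return clock_hz * simd_lanes * ops_per_cycle;
}

// True when a measured rate is higher than the modelled peak, which
// means the model (not the measurement) is missing something.
bool exceeds_peak(double measured_flops, double peak) {
    return measured_flops > peak;
}
```

With the factors quoted in the post, `peak_flops(3.3e9, 2, 2)` gives 13.2e9, and `exceeds_peak(22e9, peak_flops(3.3e9, 2, 2))` is true, which is the contradiction the post points out.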
|
Registered Member
|
Somehow I get better results than this number. Times for square matrix (size = 5000, double precision) products:
eigen : 20.5s (12.2e9 FLOPS)
atlas : 13.3s (18.8e9 FLOPS)
mkl : 10.5s (23.8e9 FLOPS)
eigen parallel : 7.8s (8.0e9 FLOPS per core)
mkl parallel : 3.5s (17.8e9 FLOPS per core)
My processor is an Intel Core i7-2720QM (quad core, hyper-threading disabled in the BIOS, and I run the parallel tests with OMP_NUM_THREADS=4).
I'm sorry to insist. I'm really interested in Eigen and willing to use it, but I have to convince myself and my team that, in sequential mode, its performance is comparable to ATLAS or MKL, as stated on the website. |
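The FLOPS figures in this post appear to follow from the usual operation count for a dense n x n matrix product, 2 * n^3 (one multiply and one add per inner-loop step), divided by the wall time. A small C++ sketch of that conversion, assuming this counting convention; `gemm_flops` and `gemm_flops_per_core` are hypothetical names, not part of Eigen, ATLAS, or MKL:

```cpp
#include <cassert>

// Floating-point operations per second for an n x n by n x n product,
// counting 2 * n^3 operations in total.
double gemm_flops(double n, double seconds) {
    assert(n > 0 && seconds > 0);
    return 2.0 * n * n * n / seconds;
}

// The same rate divided evenly across the cores used for the run.
double gemm_flops_per_core(double n, double seconds, int cores) {
    assert(cores > 0);
    return gemm_flops(n, seconds) / cores;
}
```

For example, `gemm_flops(5000, 20.5)` is roughly 12.2e9 and `gemm_flops_per_core(5000, 7.8, 4)` is roughly 8.0e9, matching the sequential and parallel Eigen figures listed above.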
Moderator
|
OK, I got it: you have a very recent processor with support for AVX instructions, and Eigen does not support them yet... So a factor of 2 is to be expected, indeed. As soon as we get AVX support we'll very likely reach MKL.
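The factor of 2 comes from register width: a 128-bit SSE register holds 2 doubles, while a 256-bit AVX register holds 4. As a rough illustration, the sketch below reads the macros that GCC and Clang predefine when the corresponding instruction sets are enabled (`__SSE2__`, `__AVX__`, `__AVX512F__`). The helper names `simd_double_lanes` and `scaled_peak` are hypothetical. Note that these macros only describe what the compiler is allowed to emit; a library whose hand-written kernels use SSE intrinsics will still run at SSE width.

```cpp
#include <cassert>

// Number of doubles per SIMD register for the instruction set the
// compiler is currently targeting (depends on -m flags / -march).
constexpr int simd_double_lanes() {
#if defined(__AVX512F__)
    return 8;   // 512-bit registers
#elif defined(__AVX__)
    return 4;   // 256-bit registers
#elif defined(__SSE2__) || defined(_M_X64)
    return 2;   // 128-bit registers
#else
    return 1;   // scalar fallback
#endif
}

// Scale a peak rate computed for 2-lane SSE to the targeted lane count.
double scaled_peak(double sse_peak_flops) {
    assert(sse_peak_flops > 0);
    return sse_peak_flops * simd_double_lanes() / 2.0;
}
```

Built with -mavx (or -march=native on an AVX-capable machine), `simd_double_lanes()` returns 4, so a 13.2e9 SSE peak scales to 26.4e9, which is consistent with the 23.8e9 measured for sequential MKL.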
|
Registered Member
|
ccorail, have you tried compiling with the -march=native -mtune=native flags?
This is advised by the GCC manual /* http://gcc.gnu.org/onlinedocs/gcc/i386- ... tions.html */; otherwise the code is targeted at a generic lowest-common-denominator CPU (and not tuned).
// Make sure you have a recent GCC version, too, so that targeting & tuning for the corei7/corei7-avx/core-avx-i cpu-types is supported.
I see that one of your compilers (g++ 4.6.1) should support it:
"Support for Intel Core i3/i5/i7 processors is now available through the -march=corei7 and -mtune=corei7 options. Support for Intel Core i3/i5/i7 processors with AVX is now available through the -march=corei7-avx and -mtune=corei7-avx options." // http://gcc.gnu.org/gcc-4.6/changes.html
You will have to settle for targeting older architectures (and the potentially sub-optimal performance that goes with it) if you're using a 4.4 compiler, though. // GCC 4.4.0 was released on April 21, 2009, while the Core i7-2720QM was released on January 9, 2011, so it's only natural.
Note also that it usually takes time to implement proper support and optimizations in a compiler, so it's always best to use the most recent version of GCC if you have a recent CPU and want to make the most of it.
/* On a side note, core-avx-i is for the yet-to-be-released (some time in 2012) Intel Ivy Bridge architecture (http://patchwork.ozlabs.org/patch/108601/); your CPU is the current Sandy Bridge arch., hence corei7/corei7-avx are the appropriate cpu-types, so don't worry about core-avx-i. */
Incidentally, ggael, I have a question about this:
"AVX floating-point arithmetic can now be enabled by default at configure time with the new --with-fpmath=avx option."
Since AVX is not yet supported, would you advise against using the above flag (in other words, can AVX fpmath hurt more than help)? Similarly, would you suggest corei7 over corei7-avx?
// OP: "native" selects the best target & tuning for your CPU as deemed by the compiler and is recommended in most cases, but based on the answer to the above you may try to manually experiment with -march=corei7 -mtune=corei7 vs. -march=corei7-avx -mtune=corei7-avx and see which one fares better in your case. |
Registered Member
|
Independently of the above, you may also try using the -Ofast flag (caveats apply):
http://gcc.gnu.org/onlinedocs/gcc/Optim ... -Ofast-689 |
Registered Member
|
Thanks for your suggestions.
I tried -Ofast and -march,-mtune=[corei7,corei7-avx,native], but none of those options made a significant difference. I'll try an older CPU on Monday. |
Moderator
|
I meant that Eigen does not exploit AVX instructions yet, regardless of the compilation flags. This is something we have planned to do.
|
Registered Member
|
ccorail,
I have added support for both AVX and FMA instructions to Eigen. On Sandy Bridge and Ivy Bridge machines, the code runs almost twice as fast on the set of benchmarks I used when working on this. On Haswell, I have measured an additional 30% speedup. The code is currently available in this branch. I am working with Gael to get it merged into the next version of the Eigen library. In the meantime, can you give it a try and let us know how it performs for you? That will help us find any remaining issues. Thanks. |