
Don't get expected performance for gemm operation

ccorail (Registered Member)
Hi,
I'm really interested in using Eigen, but I cannot manage to reproduce performance similar to what is presented in the benchmarks. I suppose I'm doing something wrong, but I can't figure out what.

I only need fairly big (O(1000)) matrix-matrix products.
My CPU is an Intel(R) Core(TM) i7-2720QM @ 2.20GHz with 6144 KB of cache.
I tried g++ 4.6.1 and g++ 4.4 (Debian, 64-bit) with the flags -msse4.2 -O3, but for a dgemm on square matrices of size 1000 I only get half the speed of MKL (10.3.6.233_intel64, sequential). Even ATLAS is 40% faster. These ratios are more or less the same in single and double precision.

I'm using the Eigen 3.0.2 Debian package (I also tried hg revision 4277, but didn't see significant differences).

If I were able to get performance somewhere between ATLAS and MKL, I would switch immediately.

Here is my code. (I also tried the eigen_blas library, so that the same code can be used to test both MKL and Eigen, but I get similar results.)
Code:
#include <Eigen/Core>
#include <cstdio>
#include <time.h>

using namespace Eigen;

int main(int argc, char **argv)
{
  int SIZE1 = 1000;
  int SIZE2 = 1000;
  int SIZE3 = 1000;
  MatrixXd a = MatrixXd::Random(SIZE1, SIZE3), b = MatrixXd::Random(SIZE1, SIZE2), c = MatrixXd::Random(SIZE2, SIZE3);

  timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC_RAW, &t0);

  a.noalias() += b * c; // the timed product; noalias() lets Eigen write directly into a, without a temporary

  clock_gettime(CLOCK_MONOTONIC_RAW, &t1);
  printf("time = %e\n", (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9);
}
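
For completeness, here is a slightly more careful variant of the timing (just a sketch): it adds a warm-up product, repeats the measurement a few times, keeps the best time, and reports the implied GFLOP/s, counting 2*n^3 flops for an n x n gemm:
Code:
#include <Eigen/Core>
#include <cstdio>
#include <time.h>

using namespace Eigen;

// Elapsed seconds between two clock_gettime() samples.
static double elapsed(const timespec &t0, const timespec &t1) {
  return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main()
{
  const int n = 1000;
  MatrixXd a = MatrixXd::Zero(n, n);
  MatrixXd b = MatrixXd::Random(n, n), c = MatrixXd::Random(n, n);

  a.noalias() += b * c; // warm-up run (touches the memory, warms the caches)

  double best = 1e30;
  for (int rep = 0; rep < 5; ++rep) {
    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC_RAW, &t0);
    a.noalias() += b * c;
    clock_gettime(CLOCK_MONOTONIC_RAW, &t1);
    const double t = elapsed(t0, t1);
    if (t < best) best = t;
  }
  // An n x n gemm costs 2*n^3 floating-point operations.
  printf("best time = %e s (%.2f GFLOP/s)\n", best, 2.0 * n * n * n / (best * 1e9));
}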
ggael (Moderator)
Are you comparing against a sequential or a parallel version of MKL? To enable parallelization in Eigen you have to add the -fopenmp flag. Be careful about hyper-threading: if it is enabled, run your program with:

$ OMP_NUM_THREADS=<true number of cores> ./myprog

If you're unsure, please report the timing numbers so we can see how far you are from the theoretical peak performance of your system.
ccorail (Registered Member)
Thank you for your answer.
I'm linking against the sequential MKL, and I compiled a sequential ATLAS for the timings. (I'm not really interested in a parallel BLAS for this application, because the main program itself is parallelized with MPI.)

Here are my timings for a dgemm on square matrices of size 1000 (blas = MKL):
time eigen = 1.713138e-01
time blas = 8.847635e-02


These numbers were obtained with the code below, compiled as follows:
Code:
main_eigen : main_eigen.cc
   g++ -o $@ -I/usr/include/eigen3 -lrt -O3 -msse4.2 -L$(MKL_LIB_DIR) -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lm -Wl,-rpath,$(MKL_LIB_DIR) $< -D DGEMM=dgemm

Code:
#include <Eigen/Core>
#include <cstdio>
#include <time.h>

using namespace Eigen;

extern "C" {
  void DGEMM (const char *transa, const char *transb, int *m, int *n, int *k, double *alpha, double *a, int *lda, double *b, int *ldb, double *beta, double *c, int *ldc);
}

void gemm(MatrixXd &a, MatrixXd &b, MatrixXd &c) {
  int m = a.rows();   // M: rows of b and of the result a
  int n = c.rows();   // K: inner dimension (cols of b == rows of c)
  int o = a.cols();   // N: cols of c and of the result a
  double alpha = 1.;
  double beta = 0.;
  // Computes a = alpha * b * c + beta * a (column-major, no transposition).
  DGEMM("N", "N", &m, &o, &n, &alpha, b.data(), &m, c.data(), &n, &beta, a.data(), &m);
}

int main(int argc, char **argv)
{
  int SIZE1 = 1000;
  int SIZE2 = 1000;
  int SIZE3 = 1000;
  MatrixXd a = MatrixXd::Zero(SIZE1, SIZE3), b = MatrixXd::Random(SIZE1, SIZE2), c = MatrixXd::Random(SIZE2, SIZE3); // a must be initialized: the += below reads it

  timespec t0, t1, t2;
  clock_gettime(CLOCK_MONOTONIC_RAW, &t0);
  a.noalias() += b * c;
  clock_gettime(CLOCK_MONOTONIC_RAW, &t1);
  gemm(a,b,c);
  clock_gettime(CLOCK_MONOTONIC_RAW, &t2);

  printf("time eigen = %e\n", (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9);
  printf("time blas  = %e\n", (t2.tv_sec - t1.tv_sec) + (t2.tv_nsec - t1.tv_nsec) * 1e-9);
}
ggael (Moderator)
0.0884 s with BLAS is theoretically impossible to achieve in sequential code. Your CPU runs at most at 3.3 GHz (Turbo Boost), which implies a maximum of 13.2e9 double-precision operations per second: 3.3G * 2 (for SSE) * 2 (for 1 add and 1 mul per cycle). A dgemm of size 1000 costs 2*n^3 = 2e9 floating-point operations, so the 0.0884 s you get would imply a rate of about 22.6e9 ops per second...
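
In numbers (a quick back-of-the-envelope check; the 3.3 GHz turbo frequency is the figure quoted above):
Code:
#include <cstdio>

int main()
{
  const double freq  = 3.3e9;            // max turbo frequency of the i7-2720QM (Hz)
  const double peak  = freq * 2.0 * 2.0; // x2 doubles per SSE register, x2 for 1 add + 1 mul per cycle
  const double n     = 1000.0;
  const double flops = 2.0 * n * n * n;  // cost of an n x n dgemm
  printf("SSE peak = %.1f GFLOP/s\n", peak * 1e-9);           // 13.2
  printf("measured = %.1f GFLOP/s\n", flops / 0.0884 * 1e-9); // ~22.6, above the peak
}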
ccorail (Registered Member)
ggael wrote: that implies a maximum of 13.2e9 double-precision operations per second: 3.3G * 2 (for SSE) * 2 (for 1 add and 1 mul per cycle).

Somehow I get better results than that number. Timings for square-matrix products (size = 5000, double precision, i.e. 2*n^3 = 2.5e11 flops):
eigen: 20.5 s (12.2e9 flop/s)
atlas: 13.3 s (18.8e9 flop/s)
mkl: 10.5 s (23.8e9 flop/s)
eigen parallel: 7.8 s (8.0e9 flop/s per core)
mkl parallel: 3.5 s (17.8e9 flop/s per core)

My processor is an Intel Core i7-2720QM (quad-core; hyper-threading is disabled in the BIOS, and I ran the parallel tests with OMP_NUM_THREADS=4).

I'm sorry to insist. I'm really interested in Eigen and willing to use it, but I have to convince myself and my team that, in sequential mode, its performance is comparable to ATLAS or MKL, as stated on the website.
ggael (Moderator)
OK, I got it: you have a very recent processor with support for AVX instructions, which Eigen does not use yet... So a factor of 2 is to be expected, indeed. As soon as we get AVX support we will very likely reach MKL's level.
mattd (Registered Member)
ccorail, have you tried compiling with the -march=native -mtune=native flags?

This is advised by the GCC manual /* http://gcc.gnu.org/onlinedocs/gcc/i386- ... tions.html */; otherwise the code is targeted at a generic lowest-common-denominator CPU (and not tuned).
// Make sure you have a recent GCC version, too, so that targeting & tuning for the corei7/corei7-avx/core-avx-i cpu-types is supported.

I see that one of your compilers (g++ 4.6.1) should support it:
Support for Intel Core i3/i5/i7 processors is now available through the -march=corei7 and -mtune=corei7 options.
Support for Intel Core i3/i5/i7 processors with AVX is now available through the -march=corei7-avx and -mtune=corei7-avx options.

// http://gcc.gnu.org/gcc-4.6/changes.html

You will have to settle for targeting older architectures (and the potentially sub-optimal performance that goes with it) if you're using the 4.4 compiler, though.
// GCC 4.4.0 was released on April 21, 2009, while the Core i7-2720QM was released on January 9, 2011, so that's only natural. Note also that it usually takes time to implement proper support and optimizations in a compiler, so it's always best to use the most recent version of GCC if you have a recent CPU and want to make the most of it.

/* On a side note, core-avx-i is for the yet-to-be-released (some time in 2012) Intel Ivy Bridge architecture (http://patchwork.ozlabs.org/patch/108601/); your CPU is the current Sandy Bridge architecture, hence corei7/corei7-avx are the appropriate cpu-types -- so don't worry about core-avx-i. */

Incidentally, ggael, I have a question about this:
AVX floating-point arithmetic can now be enabled by default at configure time with the new --with-fpmath=avx option.

Since AVX is not yet supported, would you advise against using the above flag (in other words, can AVX fpmath hurt more than help)?

Similarly, would you suggest corei7 over corei7-avx?

// OP: "native" selects the best target & tuning for your CPU, as deemed by the compiler, and is recommended in most cases; but depending on the answer to the above you may want to experiment manually with -march=corei7 -mtune=corei7 vs. -march=corei7-avx -mtune=corei7-avx and see which one fares better in your case.
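
// For example, the Makefile rule posted earlier in the thread would become something like this (an untested sketch; only the -msse4.2 flag is swapped for -march/-mtune, everything else is kept as before):
Code:
main_eigen : main_eigen.cc
   g++ -o $@ -I/usr/include/eigen3 -lrt -O3 -march=native -mtune=native -L$(MKL_LIB_DIR) -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lm -Wl,-rpath,$(MKL_LIB_DIR) $< -D DGEMM=dgemm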
mattd (Registered Member)
Independently of the above, you may also try using the -Ofast flag (caveats apply):
http://gcc.gnu.org/onlinedocs/gcc/Optim ... -Ofast-689
ccorail (Registered Member)
Thanks for your suggestions.
I tried -Ofast and -march/-mtune set to corei7, corei7-avx, and native, but none of these options made a significant difference.
I'll try an older CPU on Monday.
ggael (Moderator)
I meant that Eigen does not exploit AVX instructions yet, regardless of the compilation flags. This is something we plan to do.
benoitsteiner (Registered Member)
ccorail,
I have added support for both AVX and FMA instructions to Eigen. On Sandy Bridge and Ivy Bridge machines, the code runs almost twice as fast on the set of benchmarks I used while working on this. On Haswell, I measured an additional 30% speedup.

The code is currently available in this branch. I am working with Gael to get it merged into the next version of the Eigen library. In the meantime, can you give it a try and let us know how it performs for you? That will help us find any remaining issues.

Thanks.



