don't get expected performances for gemm operation

ccorail
Registered Member
Hi,
I'm really interested in using Eigen, but I cannot reproduce performance similar to what is presented in the benchmarks. I suppose I'm doing something wrong, but I can't figure out what.

I only need fairly big ( O(1000) ) matrix-matrix products.
My CPU is:
Intel(R) Core(TM) i7-2720QM CPU @ 2.20GHz, cache size 6144 KB
I tried g++ 4.6.1 and g++ 4.4 (Debian, 64-bit) with the flags -msse4.2 -O3,

but for a dgemm operation on square matrices of size 1000, I only get half the speed of MKL (10.3.6.233_intel64, sequential). Even ATLAS is 40% faster.
Those ratios are more or less the same in single and double precision.

I'm using the Eigen 3.0.2 Debian package (I also tried hg revision 4277 but didn't get a significant difference).

If I could get performance somewhere between ATLAS and MKL, I would switch immediately.

Here is my code (I also tried the eigen_blas library so that I could use the same code to test MKL and Eigen, but I got similar results):
Code: Select all
#include <Eigen/Core>
#include <cstdio>
#include <time.h>

using namespace Eigen;

int main(int argc, char **argv)
{
  int SIZE1 = 1000;
  int SIZE2 = 1000;
  int SIZE3 = 1000;
  MatrixXd a = MatrixXd::Random(SIZE1, SIZE3), b = MatrixXd::Random(SIZE1, SIZE2), c= MatrixXd::Random(SIZE2, SIZE3);

  timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC_RAW, &t0);

  a.noalias() += b * c;

  clock_gettime(CLOCK_MONOTONIC_RAW, &t1);
  printf("time = %e\n", (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9);
}
ggael
Moderator
Are you comparing against a sequential or a parallel version of MKL? To enable parallelization in Eigen, you have to add the -fopenmp flag. Be careful about hyper-threading: if it is enabled, run your program with:

$ OMP_NUM_THREADS=<true number of cores> ./myprog

If you're unsure, please report the timings so we can see how far we are from the theoretical peak performance of your system.
ccorail
Registered Member
Thank you for your answer.
I'm linking against the sequential MKL, and I compiled a sequential ATLAS for the timing. (I'm not really interested in a parallel BLAS for this application because the main program itself is parallel (MPI).)

Here are my timings for a dgemm of square (size 1000) matrices (blas = MKL):
time eigen = 1.713138e-01
time blas = 8.847635e-02


They were obtained with the following code, compiled with:
Code: Select all
main_eigen : main_eigen.cc
   g++ -o $@ -I/usr/include/eigen3 -lrt -O3 -msse4.2 -L$(MKL_LIB_DIR) -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lm -Wl,-rpath,$(MKL_LIB_DIR) $< -D DGEMM=dgemm

Code: Select all
#include <Eigen/Core>
#include <cstdio>
#include <time.h>

using namespace Eigen;

extern "C" {
  void DGEMM (const char *transa, const char *transb, int *m, int *n, int *k, double *alpha, double *a, int *lda, double *b, int *ldb, double *beta, double *c, int *ldc);
}

void gemm(MatrixXd &a, MatrixXd &b, MatrixXd &c) {
  int m = a.rows();
  int n = c.rows();
  int o = a.cols();
  double alpha = 1.;
  double beta = 0.;
  DGEMM("N", "N", &m, &o, &n, &alpha, b.data(), &m, c.data(), &n, &beta, a.data(), &m);
}

int main(int argc, char **argv)
{
  int SIZE1 = 1000;
  int SIZE2 = 1000;
  int SIZE3 = 1000;
  MatrixXd a(SIZE1, SIZE3), b = MatrixXd::Random(SIZE1, SIZE2), c= MatrixXd::Random(SIZE2, SIZE3);

  timespec t0, t1, t2;
  clock_gettime(CLOCK_MONOTONIC_RAW, &t0);
  a.noalias() += b * c;
  clock_gettime(CLOCK_MONOTONIC_RAW, &t1);
  gemm(a,b,c);
  clock_gettime(CLOCK_MONOTONIC_RAW, &t2);

  printf("time eigen = %e\n", (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9);
  printf("time blas  = %e\n", (t2.tv_sec - t1.tv_sec) + (t2.tv_nsec - t1.tv_nsec) * 1e-9);
}
ggael
Moderator
0.0884s with BLAS is theoretically impossible to achieve in sequential code. Your CPU can run at 3.3GHz at most (turbo boost), which implies a maximum of 14.17e9 double operations per second: 3.3G * 2 (for SSE) * 2 (for 1 add and 1 mul per cycle). The 0.0884s you get would imply a rate of 22e9 ops per second...
ccorail
Registered Member
ggael wrote: that implies a maximum of 14.17e9 double operations per second: 3.3G * 2 (for SSE) * 2 (for 1 add and 1 mul per cycle).

Somehow I get better results than this number.
Times for square matrix (size = 5000, double precision) products:
eigen : 20.5s (12.2e9 FLOPS)
atlas : 13.3s (18.8e9 FLOPS)
mkl : 10.5s (23.8e9 FLOPS)
eigen parallel : 7.8s (8.0e9 FLOPS per core)
mkl parallel : 3.5s (17.8e9 FLOPS per core)

My processor is an Intel Core i7-2720QM (quad core, hyper-threading disabled in the BIOS, and I ran the parallel test with OMP_NUM_THREADS=4).

I'm sorry to insist. I'm really interested in Eigen and willing to use it, but I have to convince myself and my team that, in sequential mode, its performance is comparable to ATLAS or MKL, as stated on the website.
ggael
Moderator
OK, I got it: you have a very recent processor with support for AVX instructions, which Eigen does not exploit yet. So a factor of 2 is indeed to be expected. As soon as we get AVX support, we'll very likely reach MKL.
mattd
Registered Member
ccorail, have you tried compiling with the -march=native -mtune=native flags?

It's advised by the GCC manual /* http://gcc.gnu.org/onlinedocs/gcc/i386- ... tions.html */; otherwise the code is targeted at a generic lowest-common-denominator CPU (and not tuned).
// Make sure you have a recent GCC version, too, so that targeting & tuning for corei7/corei7-avx/core-avx-i cpu-type is supported.

I see one of your compilers (g++ 4.6.1) should support it:
Support for Intel Core i3/i5/i7 processors is now available through the -march=corei7 and -mtune=corei7 options.
Support for Intel Core i3/i5/i7 processors with AVX is now available through the -march=corei7-avx and -mtune=corei7-avx options.

// http://gcc.gnu.org/gcc-4.6/changes.html

You will have to settle for targeting older architectures (and the potentially sub-optimal performance that goes with it) if you're using a 4.4 compiler, though.
// GCC 4.4.0 was released on April 21, 2009, while the Core i7-2720QM was released on January 9, 2011, so it's only natural. Note also that it usually takes time to implement proper support and optimizations in a compiler, so it's always best to use the most recent version of GCC if you have a recent CPU and want to make the most of it.

/* On a side note, core-avx-i is for the yet-to-be-released (some time in 2012) Intel Ivy Bridge architecture /* http://patchwork.ozlabs.org/patch/108601/ */; your CPU is the current Sandy Bridge arch., hence corei7/corei7-avx are the appropriate cpu-types -- so don't worry about core-avx-i. */

Incidentally, ggael, I have a question about this:
AVX floating-point arithmetic can now be enabled by default at configure time with the new --with-fpmath=avx option.

Since AVX is not yet supported, would you advise against using the above flag (in other words, can AVX fpmath hurt more than help)?

Similarly, would you suggest corei7 over corei7-avx?

// OP: "native" selects the best target & tuning for your CPU as deemed by the compiler and is recommended in most cases, but based on the answer to the above you may try to manually experiment with -march=corei7 -mtune=corei7 vs. -march=corei7-avx -mtune=corei7-avx and see which one fares better in your case.
mattd
Registered Member
Independently of the above, you may also try using the -Ofast flag (caveats apply):
http://gcc.gnu.org/onlinedocs/gcc/Optim ... -Ofast-689
ccorail
Registered Member
Thanks for your suggestions.
I tried -Ofast and -march,-mtune=[corei7,corei7-avx,native], but none of those options made a significant difference.
I'll try an older CPU on Monday.
ggael
Moderator
I meant that Eigen does not exploit AVX instructions yet, regardless of the compilation flags. This is something we have planned to do.
benoitsteiner
Registered Member
ccorail,
I have added support for both AVX and FMA instructions to Eigen. On Sandy Bridge and Ivy Bridge machines, the code runs almost twice as fast on the set of benchmarks I used when working on this. On Haswell, I have measured an additional 30% speedup.

The code is currently available in this branch. I am working with Gael to get it merged into the next version of the Eigen library. In the meantime, can you give it a try and let us know how it performs for you? That will help us find any remaining issues.

Thanks.