Eigen GEMM Benchmarks vs MKL and my own code

raxhem
Registered Member, Posts: 5, Karma: 0
I have written my own code to do large (1000x1000) dense matrix multiplication. I estimate the max GFLOPs/s of an Intel CPU (Core 2 through Ivy Bridge) as

Max SP FLOPs/s = frequency * 4 (SSE) or 8 (AVX) * 2 (MAC) * number of cores (not HW threads)
Max DP FLOPs/s = 0.5 * Max SP FLOPs/s

By MAC I mean that the CPU can do an SSE (AVX) multiplication and an addition at the same time.

I estimate the number of GFLOPs in a matrix multiplication as 2.0*m*n*p*1E-9 (2 GFLOPs for 1000x1000). Once I know the time I can get GFLOPs/s. Then I divide that by the max GFLOPs/s to get the efficiency. I get about 45% efficiency with Eigen in my own tests on an SSE-only system. I also get about 45% with my own matrix multiplication code on this system.

Based on the benchmarks at the link below (Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz ( x86_64 )) I get Max SP GFLOPs/s = 2.66 GHz * 4 (SSE) * 2 (MAC) * 4 (cores) = 85.12 and Max DP GFLOPs/s = 42.56. In the link it's clear that matrix multiplication does not even reach 20 GFLOPs/s (I assume this is DGEMM). That's about 45% efficiency, the same as I get when I run Eigen myself and with my own GEMM code.

However, when I run MKL on a system with AVX I get about 80% efficiency (using eight lanes instead of four for AVX doubles the max, and I still get 80% of that). I don't have MKL on an SSE-only system, but I would guess the efficiency would still be at least 80%, so it should reach around 35 GFLOPs/s on the SSE-only system used in that benchmark instead of the less than 20 GFLOPs/s shown. What's going on?

http://eigen.tuxfamily.org/index.php?title=Benchmark
ggael
Moderator, Posts: 3447, Karma: 19
The link you saw is for a single core execution and floating point arithmetic. So the efficiency is about 20/21.28=94%. Multithreading is a nearly orthogonal problem. First you should make sure your kernel can nearly achieve peak performance on a single core, as Eigen does, and only then start thinking about multithreading and organising the computation to reduce duplicated work, synchronisation, memory issues...
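
To check a kernel against the single-core peak, a timing harness along these lines could be used. It is only a sketch: `gemm_naive` is a placeholder triple loop (not Eigen's kernel, and nowhere near peak), `time_gemm_gflops` is a hypothetical helper name, and the matrices are assumed square, row-major and single precision. The 21.28 figure corresponds to 2.66 GHz * 4 (SSE) * 2 (MAC) * 1 core.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Naive reference kernel: C += A * B for n x n row-major float matrices.
// Only a stand-in; replace it with the kernel under test.
void gemm_naive(const std::vector<float>& a, const std::vector<float>& b,
                std::vector<float>& c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const float aik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
}

// Runs the kernel once on the calling thread and returns the achieved
// GFLOP/s, counting 2*n^3 floating-point operations.
double time_gemm_gflops(const std::vector<float>& a,
                        const std::vector<float>& b,
                        std::vector<float>& c, std::size_t n) {
    const auto t0 = std::chrono::steady_clock::now();
    gemm_naive(a, b, c, n);
    const auto t1 = std::chrono::steady_clock::now();
    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    return 2.0 * n * n * n * 1e-9 / seconds;
}
```

Dividing the returned rate by the single-core peak (21.28 GFLOPs/s for the Q9400) gives the efficiency. A single timed run is noisy, so in practice one would repeat it and keep the best or median time.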
ggael
btw, I have some scaling plots there:
https://plafrim.bordeaux.inria.fr/doku. ... e:guenneba

On the Xeon with 8 cores/threads, MKL is at nearly 100% efficiency and Eigen is at about 80%.
raxhem
"The link you saw is for a single core execution and floating point arithmetic. So the efficiency is about 20/21.28=94%."
Thanks, that makes sense now. Though the rate is less than 20, so it's probably closer to 80% or so.

From your plot I saw something I did not expect. Using the hyper-threads actually makes the rate worse! In other words, on a system with four physical cores and eight HW threads, the rate is lower running eight threads than running four. I tried running Eigen with only four threads and the rate went up quite a bit. I tried my own GEMM code (which also uses OpenMP) with four threads and it went up as well. However, in both cases the rate is less stable with four threads than with eight: it jumps around a lot from one iteration to the next.

I am using a Xeon E5630 @ 2.66 GHz. The max rate for single precision should be 85.12 GFLOPs/s (2.66*4*2*4). The best I get with Eigen (using four threads instead of eight) is 55 GFLOPs/s. That's about 65% efficiency (only 45% using eight threads). The efficiency for a single thread is 80% (16.9 GFLOPs/s).
raxhem
I looked up the processor used in your plot

http://ark.intel.com/products/37109/Int ... -Intel-QPI

The clock speed is 2.8 GHz, but the max turbo boost is 3.2 GHz. At 2.8 GHz the max rate is 179.2 GFLOPs/s and at 3.2 GHz it would be 204.8 GFLOPs/s. Which number should I use when estimating the maximum efficiency? I thought turbo boost would be enabled for all cores under load.
raxhem
Sorry, I misunderstood turbo boost. It varies depending on how many threads are used. I guess when you quote Xeon X5560 @ 2.80 GHz that means turbo boost was not enabled? Because with turbo boost the frequency under load will be between 2.8 and 3.2 GHz with that processor.
ggael
Yes, turbo boost was disabled, as always when benchmarking. Turbo boost depends a lot on system load, CPU temperature, etc. It's impossible to do any benchmark with it enabled.
raxhem
Thank you for the information. Turbo boost is a real pain. It seems very difficult to find the operating CPU frequency under load from code when turbo boost is enabled. I think I will disable it, since I mostly care about benchmarking my code and don't need the boost.

