Eigen seems to be much slower on Intel CPUs than on AMD CPUs?

lamtung
Registered Member, Posts: 28, Karma: 0
Hi,

I'm seeing something odd: my code using Eigen runs slower on an Intel CPU than on an AMD one. I'm using Eigen Beta 2.

In my program, I reimplemented some time-critical functions using Eigen but kept the old code (selectable with a command-line switch) so that I can measure the speed-up from the Eigen code.

On the AMD CPU the result is quite promising; for some datasets the run time even drops by 50% (a factor of 2).

My AMD machine has the following configuration:
model name : AMD Phenom(tm) II X3 720 Processor
stepping : 2
cpu MHz : 2800.190
cache size : 512 KB

GCC version : 4.3.2 (SUSE)
Runtime:
Old code took : 4.53s
New code (with EIGEN) took : 2.73s
Speed up = 40%


But on my Centrino 2 laptop it doesn't look as good:
model name : Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz
stepping : 6
cpu MHz : 800.000
cache size : 3072 KB

GCC version: 4.4.3 (Ubuntu)
Old code took: 4.14s
New code (with EIGEN) took: 3.25s
Speed up = 22%


So you can see that my Intel laptop runs the old code clearly faster than my AMD machine, yet it runs the Eigen code slower??

One could say it is because of the different GCC versions, but I also tested the code on another Intel CPU with the same GCC version as on the AMD machine:

model name : Intel(R) Xeon(R) CPU X5560 @ 2.80GHz
stepping : 5
cpu MHz : 1596.000
cache size : 8192 KB

GCC version: 4.3.2(SUSE)
Old code took : 2.71s
New code (with EIGEN) took : 2.07s
Speed up = 24%


This machine is the fastest one, but the speed-up is still relatively low in comparison with the AMD CPU.

The results are consistent: I repeated the runs many times, and other datasets look pretty much the same.
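
For reference, the percentages above are the reduction in run time, (old - new) / old, while a "factor" is old / new. A minimal sketch of that arithmetic with the timings quoted in this post (the struct and function names are made up for illustration):

```cpp
#include <cassert>

// Hypothetical helper type: one pair of timings from the post, in seconds.
struct Timing {
    double old_s;  // run time of the old (loop-based) code
    double new_s;  // run time of the Eigen-based code
};

// Relative reduction in run time, in percent: (old - new) / old * 100.
double speedup_percent(const Timing& t) {
    assert(t.old_s > 0.0);
    return (t.old_s - t.new_s) / t.old_s * 100.0;
}

// Speed-up factor: how many times faster the new code is (old / new).
double speedup_factor(const Timing& t) {
    assert(t.new_s > 0.0);
    return t.old_s / t.new_s;
}

// Timings quoted above.
const Timing kPhenom = {4.53, 2.73};  // AMD Phenom II X3 720, GCC 4.3.2
const Timing kCore2  = {4.14, 3.25};  // Intel Core 2 Duo P8400, GCC 4.4.3
const Timing kXeon   = {2.71, 2.07};  // Intel Xeon X5560, GCC 4.3.2
```

With these inputs, `speedup_percent` gives about 39.7%, 21.5% and 23.6%, close to the rounded figures above, and `speedup_factor` gives about 1.66, 1.27 and 1.31.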

Can anyone give me an explanation?

Thanks
ggael
Moderator, Posts: 3447, Karma: 19
Usually it is the other way round: gcc 4.4 gives better performance than 4.3, and Eigen is faster on Intel CPUs than on AMD ones, but who knows... So could you try gcc 4.3 on your Ubuntu machine? I'm pretty sure they provide packages for it.

Also, can you be more specific about the kind of operations you are doing? Problem sizes, scalar type, etc.

Oh, one last thing: do you have a 64-bit system installed everywhere?
lamtung
Hi ggael,

GCC 4.4 gives better performance than 4.3, that is true. I just compiled the program on my Ubuntu machine (Core 2 Duo / Centrino 2) with GCC 4.3, and the result was indeed worse than with 4.4. But the Eigen code still runs slower than on the AMD CPU, although both were compiled with GCC 4.3.

model name : Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz
stepping : 6
cpu MHz : 800.000
cache size : 3072 KB

GCC version: 4.3.4 (Ubuntu)
Old code took: 4.60s
New code (with EIGEN) took: 3.46s


Compared with the AMD result from my first post, we can assume that both CPUs are equally fast, since they finished the old code in almost the same time (4.60s vs. 4.53s). But the Intel CPU ran the Eigen code much slower (3.46s vs. 2.73s).

I forgot to mention that all my machines are 64-bit and the code was compiled with -O3 (I noticed that -O3 gives better performance than -O2). This behavior is very strange to me, given that I've also heard/read that Intel CPUs are supposed to handle SSE better than AMD's.

I'd appreciate any solution or explanation. Thanks!

Last edited by lamtung on Sun Nov 07, 2010 4:00 pm, edited 2 times in total.
lamtung
The Eigen code that is used is very simple:

In my program there are 2 functions which take up 90% of the runtime. The Eigen code in one function is the expression:
Code:
VectorA.noalias() = (VectorB * MatrixA).cwiseProduct(VectorA);


And in the other:
Code:
double score = (VectorB * MatrixA).dot(VectorA);


All the matrices and vectors are of fixed size, either 4x4 or 20x20 (I used templates for that), depending on the dataset.
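
For anyone who wants to check results against the two Eigen expressions, here is a minimal loop-based sketch of what they compute. It assumes VectorB is a row vector, MatrixA is N x N and the scalar type is double. This is not the actual old code, and the names `row_times_matrix`, `update_vector` and `score` are made up for illustration:

```cpp
#include <array>
#include <cstddef>

// Fixed-size types, with N = 4 or N = 20 as in the post.
template <std::size_t N> using Vec = std::array<double, N>;
template <std::size_t N> using Mat = std::array<std::array<double, N>, N>;  // m[row][col]

// Row vector times matrix: r[j] = sum over i of b[i] * m[i][j].
template <std::size_t N>
Vec<N> row_times_matrix(const Vec<N>& b, const Mat<N>& m) {
    Vec<N> r{};  // zero-initialized accumulator
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            r[j] += b[i] * m[i][j];
    return r;
}

// Same result as: VectorA = (VectorB * MatrixA).cwiseProduct(VectorA)
template <std::size_t N>
void update_vector(Vec<N>& a, const Vec<N>& b, const Mat<N>& m) {
    const Vec<N> r = row_times_matrix(b, m);
    for (std::size_t j = 0; j < N; ++j)
        a[j] *= r[j];
}

// Same result as: (VectorB * MatrixA).dot(VectorA)
template <std::size_t N>
double score(const Vec<N>& a, const Vec<N>& b, const Mat<N>& m) {
    const Vec<N> r = row_times_matrix(b, m);
    double s = 0.0;
    for (std::size_t j = 0; j < N; ++j)
        s += r[j] * a[j];
    return s;
}
```

Because N is a template parameter, the compiler knows the loop bounds at compile time, which is the same information the fixed-size Eigen types provide.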
lamtung
To make the difference a little clearer, I ran the test with a bigger dataset, of course with the same GCC version (4.3).

The AMD machine (AMD Phenom II X3 720) finished the code without Eigen in 25.3s, whereas my Intel laptop (Core 2 Duo) finished it in 19.5s. At this point I can conclude that my Intel laptop is faster.

But with the Eigen code, the AMD machine finished faster: 11.5s vs. 13.8s on Intel. Here it is obvious that Eigen runs much more efficiently on the AMD machine than on the Intel one???!

I'm still waiting for an explanation, and it would be nice if there were a "fix".

Thanks !!!
ggael
Well, if I normalize your results with respect to the CPU speed, I get for Eigen:

AMD: 7.644s
Core2: 7.345s
Xeon: 5.796s

and these results look very consistent to me! I don't see any evidence of a bug here. Why would it be Eigen, and not your old code, that behaves in a strange manner...
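
The three figures above match each Eigen run time multiplied by the nominal clock frequency in GHz (2.8 for the Phenom and the Xeon, 2.26 for the Core 2), so they are an estimate of billions of clock cycles rather than seconds. That reading is inferred from the numbers, not stated in the post. A minimal sketch, with a made-up helper name:

```cpp
#include <cassert>

// Clock-normalized cost: seconds * GHz = billions of cycles.
// Assumption: the CPU actually ran at its nominal frequency during the test.
double giga_cycles(double seconds, double nominal_ghz) {
    assert(seconds >= 0.0 && nominal_ghz > 0.0);
    return seconds * nominal_ghz;
}

// Eigen timings from the first post, with the nominal clock of each CPU.
const double kAmdEigen   = giga_cycles(2.73, 2.80);  // Phenom II X3 720
const double kCore2Eigen = giga_cycles(3.25, 2.26);  // Core 2 Duo P8400
const double kXeonEigen  = giga_cycles(2.07, 2.80);  // Xeon X5560
```

These evaluate to approximately 7.644, 7.345 and 5.796. Applying the same function to the old-code timings (4.53s, 4.14s, 2.71s) gives about 12.68, 9.36 and 7.59, so by this measure the gap between the CPUs is wider for the old code than for the Eigen code.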
lamtung
Hi Gael,

I don't quite understand the point of normalizing the results by CPU speed, since there are obviously other important factors in the CPU that affect performance...?!

For the benchmark to be more precise, I used the same binary on different test machines (4 x AMD Phenom machines and 4 x Core2). The old code and the Eigen code are simply 2 different functions in that binary, which are switched on/off in turn using a command-line option for the sake of comparison.
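
A minimal sketch of that kind of harness, assuming a hypothetical `--eigen` switch. `run_old` and `run_eigen` are placeholders standing in for the two real functions, which are not shown in this thread:

```cpp
#include <cassert>
#include <chrono>
#include <string>

// Placeholder kernels; the real ones are the two implementations being compared.
inline double run_old()   { double s = 0.0; for (int i = 0; i < 1000; ++i) s += i; return s; }
inline double run_eigen() { double s = 0.0; for (int i = 0; i < 1000; ++i) s += i; return s; }

using Kernel = double (*)();

// Pick the implementation from a command-line flag (hypothetical "--eigen").
Kernel select_kernel(const std::string& flag) {
    return flag == "--eigen" ? &run_eigen : &run_old;
}

// Wall-clock time of one call, in seconds, measured with a monotonic clock.
// The result is stored in `sink` so the call is not discarded as dead code.
double time_seconds(Kernel k, volatile double& sink) {
    assert(k != nullptr);
    const auto start = std::chrono::steady_clock::now();
    sink = k();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

In `main`, one would pass `argv[1]` (or an empty string when no argument is given) to `select_kernel` and hand the result to `time_seconds`. Timing both variants inside one binary keeps the compiler, flags and data identical between the two measurements.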

For a specific type of dataset, the AMD machines consistently gave me a speed-up factor of 2.5 using Eigen, whereas the speed-up factor on Core2 was only 1.6 (also consistently).

I cannot think of any reason why it would be the old code that behaves strangely, since it is just a straightforward implementation, using for loops, of the vector*matrix operations I mentioned previously. And the old code has actually been widely used since it was published in a scientific paper long ago; some have even used it for benchmarking CPUs.

On the other hand, Eigen contains low-level code which is more or less CPU-specific (cache size, SSE, loop unrolling, ...). Therefore I assume it is Eigen that makes the difference.

To be more concrete, the AMD Phenom II X3 720 CPU is slower than the Intel Core2 P8400, since it ran my old code slower, and this benchmark showed the same. But the AMD CPU ran the Eigen code faster than the Intel CPU?!

Is there anything I can do to make Eigen run more efficiently on Intel, like cache size tuning, etc.?

Thanks !
ggael
Sorry, but again, according to your results, the two Intel CPUs are already slightly faster than the AMD one. Comparing speed-ups over naive code makes no sense at all. Your results simply mean that the Intel CPUs are even more efficient than the AMD CPU with naive code. There are many possible explanations for that: better built-in instruction scheduling, better built-in prefetching, the compiler generating more Intel-friendly code (Intel has some engineers working on GCC), etc.

Regarding tuning, you told me you are doing 4x4 and 20x20 matrix products: for such small matrices there is nothing to tune. The only thing you can try to improve performance is to optimize cache use at the level of your application...
Andre
Registered Member, Posts: 90, Karma: 1
lamtung wrote: "For the benchmark to be more precise, I used the same binary on different test machines (4 x AMD Phenom machines and 4 x Core2)."

I assume this means that you didn't even compile it natively on the Intel machine? In that case, what happens if you compile it on the Intel machine with e.g. -O3 -march=native?

lamtung
@ggael: I've already thought about what you said. It is probably true that the Intel CPUs run the naive code faster than the AMD ones, whereas with the Eigen code the 2 CPUs produce comparable results. I just wanted to ask whether there is any way to make the Eigen code scale; I mean, if a CPU runs the naive code efficiently, then it should also run Eigen with the same efficiency. Why can't Eigen be that efficient when it runs on the same CPU, with code generated by the same GCC, ...

@andre: I've done all that before (native compilation, ...), and nothing changed. Then I thought that maybe for some reason GCC produces better code on my AMD machines than on Intel, which is why I also tried using the same binary, built on the AMD machines.

