This forum has been archived. All content is frozen. Please use KDE Discuss instead.

Eigen 3: OpenMP vs. OpenMP/SSE2 performance comparison

isluser
Registered Member
Posts
10
Karma
0
Hi,

I'd like to know what kind of performance gain I can expect from enabling SSE2 with Visual Studio 2008 on the development branch of Eigen 3.

I'm currently using the test between n and z to do some benchmarking.
For 15 iterations, I get 4 min 15 sec with only OpenMP.
When I activate SSE2, I get 32 sec. It just seems too good to be true.

As a more precise example, the product_symm test takes 55 sec with OpenMP and 5 sec with SSE2.

I was wondering whether those numbers are plausible, or is there something else going on?

Thank you,
bjacob
Registered Member
Posts
658
Karma
3
isluser wrote:
For 15 iterations, I get 4 min 15 sec with only OpenMP.
When I activate SSE2, I get 32 sec. It just seems too good to be true.


No, that's normal. You get a total speedup of x8, which decomposes into:
x4 because 4 floats fit in one SSE packet
x2 because an SSE addition and an SSE multiplication can execute together in 1 cycle


Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list!
renorm
Registered Member
Posts
31
Karma
0
bjacob wrote:
x2 because SSE addition and multiplication can run together in 1 cycle

I am very curious how that is possible. Is it in PacketMath.h?
bjacob
Registered Member
Posts
658
Karma
3
renorm wrote:
x2 because SSE addition and multiplication can run together in 1 cycle

I am very curious how that is possible. Is it in PacketMath.h?


No, it's in your CPU. Recent Intel CPUs (not sure about AMD) are able to execute a (mulps, addps) pair in 1 cycle, and likewise a (mulpd, addpd) pair, EVEN if both operate on the same registers. It's as if there were a combined mul-and-add instruction working in 1 cycle.


renorm
Registered Member
Posts
31
Karma
0
How do you trigger it? By following _mm_mul_ps with _mm_add_ps?
bjacob
Registered Member
Posts
658
Karma
3
Yes: see ei_pmadd() in GenericPacketMath.h:

/** \internal \returns a * b + c (coeff-wise) */
template<typename Packet> inline Packet
ei_pmadd(const Packet& a,
         const Packet& b,
         const Packet& c)
{ return ei_padd(ei_pmul(a, b), c); }


renorm
Registered Member
Posts
31
Karma
0
Thanks for explaining.

The original question mentioned OpenMP. Does Eigen explicitly use it?
bjacob
Registered Member
Posts
658
Karma
3
Yes, Eigen 3 uses OpenMP if it's enabled. By default it's disabled, but for example with GCC you just have to pass -fopenmp.


red_cat
Registered Member
Posts
5
Karma
0
Hi!
I am using matrix-vector multiplication. Does it use only one core?
When I multiply two matrices, all the cores are used.
System configuration: Intel i7-970, Windows 7. Compiler: VS2008, OpenMP enabled.
ggael
Moderator
Posts
3447
Karma
19
Currently only matrix * matrix products are multi-threaded, not matrix * vector products.
red_cat
Registered Member
Posts
5
Karma
0
Will multithreading be implemented for matrix-vector multiplication?
ggael
Moderator
Posts
3447
Karma
19
Sure, but I cannot say when...
red_cat
Registered Member
Posts
5
Karma
0
If I enable OpenMP, LU decomposition becomes slower.
Why?
ggael
Moderator
Posts
3447
Karma
19
Strange, because I observed the opposite. What is the matrix size? Do you have multi-threading enabled? If so, run your executable with:

$ OMP_NUM_THREADS=number_of_real_cores ./my_app
red_cat
Registered Member
Posts
5
Karma
0
Thanks, it helped.

