Eigen performance on Cygwin/Mingw and inlining issues

Mon Sep 03, 2012 1:14 pm


martinb (Registered Member), Mon Sep 03, 2012 1:14 pm

There seems to be a problem with Eigen under Cygwin and MinGW: the performance is unsatisfying, while under Linux it is quite good.

Let's take the Coulomb energy as an example:

```cpp
Eigen::ArrayXf x, y, z, charges, res;
... // initialization
// Position and charge of particle id1; these are constant inside the loop
const float xp = x[id1], yp = y[id1], zp = z[id1], qp = charges[id1];
// res_i = q_i * q_p / sqrt(|r_i - r_p|^2 + cutoff)
res = (charges * qp) * (((x - xp).square() + (y - yp).square()) + ((z - zp).square() + cutoff)).inverse().sqrt();
```

When compiled on Linux with `g++ -S -m32 -msse2 -mfpmath=sse -O3` (GCC 4.5.0), the loop body looks something like the following, which is pretty good:

```asm
movl -88(%ebp), %edx
movl -84(%ebp), %eax
movaps -136(%ebp), %xmm1
movaps -152(%ebp), %xmm2
movaps -168(%ebp), %xmm0
addps (%eax,%ebx,4), %xmm2
leal -56(%ebp), %eax
addps (%edx,%ebx,4), %xmm0
mulps %xmm2, %xmm2
addps (%esi,%ebx,4), %xmm1
mulps %xmm0, %xmm0
movl %eax, (%esp)
mulps %xmm1, %xmm1
addps %xmm2, %xmm0
addps -120(%ebp), %xmm1
addps %xmm1, %xmm0
movaps .LC23, %xmm1
divps %xmm0, %xmm1
movaps %xmm1, -56(%ebp)
call _ZN5Eigen8internal5psqrtIU8__vectorfEET_RKS3_
movl -172(%ebp), %edx
movaps -104(%ebp), %xmm1
mulps (%edx,%ebx,4), %xmm1
mulps %xmm0, %xmm1
movaps %xmm1, (%edi,%ebx,4)
addl $4, %ebx
cmpl %ebx, -64(%ebp)
```

On the other hand, compiling under Cygwin with `g++ -S -O3 -msse2 -mfpmath=sse` (GCC 4.5.3) results in the following:

```asm
movl -68(%ebp), %eax
movl %ebx, 4(%esp)
movl %eax, (%esp)
call __ZNK5Eigen16CwiseUnaryOpImplINS_8internal16scalar_square_opIfEEKNS_12CwiseUnaryOpINS1_13scalar_add_opIfEEKNS_5ArrayIfLin1ELi1ELi0ELin1ELi1EEEEENS_5DenseEE6packetI$
movl -64(%ebp), %edx
movaps %xmm0, %xmm1
movss 84(%esi), %xmm0
movl %edx, (%esp)
movl %ebx, 4(%esp)
shufps $0, %xmm0, %xmm0
addps %xmm0, %xmm1
movaps %xmm1, -104(%ebp)
call __ZNK5Eigen16CwiseUnaryOpImplINS_8internal16scalar_square_opIfEEKNS_12CwiseUnaryOpINS1_13scalar_add_opIfEEKNS_5ArrayIfLin1ELi1ELi0ELin1ELi1EEEEENS_5DenseEE6packetI$
movl -76(%ebp), %eax
movl %ebx, 4(%esp)
movl %eax, (%esp)
movaps %xmm0, -56(%ebp)
call __ZNK5Eigen16CwiseUnaryOpImplINS_8internal16scalar_square_opIfEEKNS_12CwiseUnaryOpINS1_13scalar_add_opIfEEKNS_5ArrayIfLin1ELi1ELi0ELin1ELi1EEEEENS_5DenseEE6packetI$
leal -40(%ebp), %eax
movaps -104(%ebp), %xmm1
movl %eax, (%esp)
addps -56(%ebp), %xmm0
addps %xmm1, %xmm0
movaps LC19, %xmm1
divps %xmm0, %xmm1
movaps %xmm1, -40(%ebp)
call __ZN5Eigen8internal5psqrtIU8__vectorfEET_RKS3_
movl -72(%ebp), %edx
movl %ebx, 4(%esp)
movl %edx, (%esp)
movaps %xmm0, -56(%ebp)
call __ZNK5Eigen16CwiseUnaryOpImplINS_8internal18scalar_multiple_opIfEEKNS_5ArrayIfLin1ELi1ELi0ELin1ELi1EEENS_5DenseEE6packetILi1EEEU8__vectorfi
mulps -56(%ebp), %xmm0
movaps %xmm0, (%edi,%ebx,4)
addl $4, %ebx
cmpl %ebx, -60(%ebp)
jg L5512
```

with `__ZNK5Eigen16CwiseUnaryOpImplINS_8internal18scalar_multiple_opIfEEKNS_5ArrayIfLin1ELi1ELi0ELin1ELi1EEENS_5DenseEE6packetILi1EEEU8__vectorfi` being:

```asm
LFB21081:
pushl %ebp
LCFI2872:
movl %esp, %ebp
LCFI2873:
movl 8(%ebp), %eax
movl 12(%ebp), %edx
popl %ebp
LCFI2874:
movss 8(%eax), %xmm0
movl 4(%eax), %eax
shufps $0, %xmm0, %xmm0
movl (%eax), %eax
addps (%eax,%edx,4), %xmm0
mulps %xmm0, %xmm0
ret
```

Obviously GCC does not inline the `.square()` calls, which dramatically reduces performance. It also fails to recognize `xp`, `yp`, ... as loop constants and therefore reloads them on every loop iteration.

I've tried playing with GCC parameters and flags, but without success. Has anyone else come across this problem and perhaps found a solution? I could really use some help here. Thanks!
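For reference, here is a plain scalar C++ loop computing, element by element, the same quantity as the Eigen expression in the first post: res_i = q_i q_p / sqrt(|r_i - r_p|^2 + cutoff). It is a sketch, not the poster's code; the function name `coulomb_reference`, the use of `std::vector`, and passing `cutoff` as a `float` are assumptions. In the Cygwin listing the constant is re-loaded and re-broadcast (`movss` followed by `shufps`) inside the loop and inside each out-of-line packet function, whereas here `xp`, `yp`, `zp` and `qp` are read once before the loop. This is meant as a correctness reference for checking results, not as a faster replacement for the vectorized version.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical scalar reference for the Eigen expression (not Eigen API):
//   res[i] = charges[i] * qp / sqrt((x[i]-xp)^2 + (y[i]-yp)^2 + (z[i]-zp)^2 + cutoff)
std::vector<float> coulomb_reference(const std::vector<float>& x,
                                     const std::vector<float>& y,
                                     const std::vector<float>& z,
                                     const std::vector<float>& charges,
                                     std::size_t id1, float cutoff) {
  assert(y.size() == x.size() && z.size() == x.size() && charges.size() == x.size());
  assert(id1 < x.size());

  // Loop invariants: read once, before the loop.
  const float xp = x[id1], yp = y[id1], zp = z[id1], qp = charges[id1];

  std::vector<float> res(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    const float dx = x[i] - xp;
    const float dy = y[i] - yp;
    const float dz = z[i] - zp;
    // Same grouping as the Eigen expression: (dx^2 + dy^2) + (dz^2 + cutoff).
    const float r2 = (dx * dx + dy * dy) + (dz * dz + cutoff);
    // .inverse().sqrt() in the Eigen code corresponds to sqrt(1 / r2).
    res[i] = (charges[i] * qp) * std::sqrt(1.0f / r2);
  }
  return res;
}
```

Comparing its output with the Eigen result on a few particles is a cheap way to confirm that any inlining workaround leaves the numbers unchanged.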

ggael (Moderator), Re: Eigen performance on Cygwin/Mingw and inlining issues, Tue Sep 04, 2012 10:21 pm

That's GCC weirdness. Can you try adding `EIGEN_STRONG_INLINE` in front of the relevant functions that are not properly inlined?
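A sketch of forcing inlining on the small helpers that GCC leaves out of line. As far as I know, in the Eigen 3 sources `EIGEN_STRONG_INLINE` expands to `__forceinline` only for MSVC and the Intel compiler and to plain `inline` elsewhere, so with GCC (including MinGW/Cygwin) it is only a hint; GCC's own way to insist is `__attribute__((always_inline))`, which Eigen's `EIGEN_ALWAYS_INLINE` macro also uses on GCC, if I recall correctly. The macro `FORCE_INLINE` and the functions below are hypothetical, a scalar stand-in for the `(x - xp).square()` packet function, not Eigen code.

```cpp
#include <cassert>
#include <cstddef>

// FORCE_INLINE is a hypothetical macro for this sketch, not an Eigen macro.
#if defined(_MSC_VER)
#define FORCE_INLINE __forceinline
#elif defined(__GNUC__)
#define FORCE_INLINE __attribute__((always_inline)) inline
#else
#define FORCE_INLINE inline
#endif

// Scalar stand-in for the packet function GCC emitted out of line
// in the Cygwin build: square of (value - constant).
FORCE_INLINE float shifted_square(float v, float c) {
  const float d = v - c;
  return d * d;
}

// Squared distance of n points to (xp, yp, zp). With always_inline, GCC either
// expands the three helper calls into this loop body or reports an error.
void squared_distances(const float* x, const float* y, const float* z, std::size_t n,
                       float xp, float yp, float zp, float* out) {
  assert(n == 0 || (x && y && z && out));
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = shifted_square(x[i], xp) + shifted_square(y[i], yp) + shifted_square(z[i], zp);
  }
}
```

The attribute should be used sparingly: it overrides GCC's cost model for every caller and can noticeably increase compile times on heavily templated code.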

ggael (Moderator), Re: Eigen performance on Cygwin/Mingw and inlining issues, Tue Sep 04, 2012 10:22 pm

By the way, you might also play with GCC's parameters controlling inlining (such as the maximum number of instructions for inlining).
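GCC's inlining limits are tuned with options such as `--param max-inline-insns-single=N` and `--param inline-unit-growth=N`; useful values would have to be found by experiment, and `-Winline` warns when a function declared inline was not inlined, which helps locate the offenders. A more targeted alternative is GCC's `flatten` function attribute, which asks the compiler to inline every call made inside one function body, where possible. The sketch below assumes a GCC-compatible compiler; the `FLATTEN` macro and all function names are hypothetical, and the macro expands to nothing on other compilers.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical macro: GCC's flatten attribute, or nothing elsewhere.
#if defined(__GNUC__)
#define FLATTEN __attribute__((flatten))
#else
#define FLATTEN
#endif

// Ordinary helpers with no inlining hints of their own.
static float shifted(float v, float c) { return v - c; }
static float square(float v) { return v * v; }

// With FLATTEN, GCC tries to inline every call in this body (the helpers
// above included), instead of applying its usual limits to this function.
FLATTEN void inverse_distances(const float* x, const float* y, const float* z, std::size_t n,
                               float xp, float yp, float zp, float cutoff, float* out) {
  assert(n == 0 || (x && y && z && out));
  for (std::size_t i = 0; i < n; ++i) {
    const float r2 = square(shifted(x[i], xp)) + square(shifted(y[i], yp)) +
                     square(shifted(z[i], zp)) + cutoff;
    out[i] = 1.0f / std::sqrt(r2);
  }
}
```

Placing the attribute on the one hot kernel keeps its effect local, whereas raising the global `--param` limits affects inlining decisions in the whole translation unit.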