
Eigen performance on Cygwin/Mingw and inlining issues

martinb
Registered Member
There seems to be a problem with Eigen when it is built with Cygwin or MinGW: the performance is unsatisfactory, whereas under Linux the performance is quite good.

Let's take the Coulomb energy as an example:

Code: Select all
Eigen::ArrayXf x,y,z,charges,res;
... // initialization
const float xp = x[id1], yp = y[id1], zp = z[id1], qp = charges[id1];
res = (charges * qp) * (((x - xp).square() + (y - yp).square()) + ((z - zp).square() + cutoff)).inverse().sqrt();
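
For checking results across platforms, the expression above can be written out as a plain scalar loop. This is only a sketch: `coulomb_reference` is a hypothetical helper name, the inputs are assumed to be equal-length `std::vector<float>`s, and `cutoff` is passed in explicitly because its declaration is elided in the snippet above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar reference for the array expression:
//   res[i] = charges[i] * qp / sqrt(dx^2 + dy^2 + dz^2 + cutoff)
// coulomb_reference is a hypothetical name, not part of Eigen.
std::vector<float> coulomb_reference(const std::vector<float>& x,
                                     const std::vector<float>& y,
                                     const std::vector<float>& z,
                                     const std::vector<float>& charges,
                                     std::size_t id1, float cutoff)
{
    assert(id1 < x.size());
    // Values of the reference particle, read once before the loop.
    const float xp = x[id1], yp = y[id1], zp = z[id1], qp = charges[id1];

    std::vector<float> res(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float dx = x[i] - xp;
        const float dy = y[i] - yp;
        const float dz = z[i] - zp;
        const float r2 = dx * dx + dy * dy + dz * dz + cutoff;
        // .inverse().sqrt() in the array expression equals 1 / sqrt(r2).
        res[i] = charges[i] * qp / std::sqrt(r2);
    }
    return res;
}
```

Comparing the output of this loop with `res` from the Eigen expression is a quick way to confirm that both builds compute the same values before looking at speed.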


When compiled on Linux with g++ -S -m32 -msse2 -mfpmath=sse -O3 (version 4.5.0), the loop body looks something like the code below, which is pretty good:
Code: Select all
    movl    -88(%ebp), %edx
    movl    -84(%ebp), %eax
    movaps  -136(%ebp), %xmm1
    movaps  -152(%ebp), %xmm2
    movaps  -168(%ebp), %xmm0
    addps   (%eax,%ebx,4), %xmm2
    leal    -56(%ebp), %eax
    addps   (%edx,%ebx,4), %xmm0
    mulps   %xmm2, %xmm2
    addps   (%esi,%ebx,4), %xmm1
    mulps   %xmm0, %xmm0
    movl    %eax, (%esp)
    mulps   %xmm1, %xmm1
    addps   %xmm2, %xmm0
    addps   -120(%ebp), %xmm1
    addps   %xmm1, %xmm0
    movaps  .LC23, %xmm1
    divps   %xmm0, %xmm1
    movaps  %xmm1, -56(%ebp)
    call    _ZN5Eigen8internal5psqrtIU8__vectorfEET_RKS3_
    movl    -172(%ebp), %edx
    movaps  -104(%ebp), %xmm1
    mulps   (%edx,%ebx,4), %xmm1
    mulps   %xmm0, %xmm1
    movaps  %xmm1, (%edi,%ebx,4)
    addl    $4, %ebx
    cmpl    %ebx, -64(%ebp)


On the other hand, using Cygwin g++ -S -O3 -msse2 -mfpmath=sse (version 4.5.3) results in the following:

Code: Select all
        movl    -68(%ebp), %eax
        movl    %ebx, 4(%esp)
        movl    %eax, (%esp)
        call    __ZNK5Eigen16CwiseUnaryOpImplINS_8internal16scalar_square_opIfEEKNS_12CwiseUnaryOpINS1_13scalar_add_opIfEEKNS_5ArrayIfLin1ELi1ELi0ELin1ELi1EEEEENS_5DenseEE6packetI$
        movl    -64(%ebp), %edx
        movaps  %xmm0, %xmm1
        movss   84(%esi), %xmm0
        movl    %edx, (%esp)
        movl    %ebx, 4(%esp)
        shufps  $0, %xmm0, %xmm0
        addps   %xmm0, %xmm1
        movaps  %xmm1, -104(%ebp)
        call    __ZNK5Eigen16CwiseUnaryOpImplINS_8internal16scalar_square_opIfEEKNS_12CwiseUnaryOpINS1_13scalar_add_opIfEEKNS_5ArrayIfLin1ELi1ELi0ELin1ELi1EEEEENS_5DenseEE6packetI$
        movl    -76(%ebp), %eax
        movl    %ebx, 4(%esp)
        movl    %eax, (%esp)
        movaps  %xmm0, -56(%ebp)
        call    __ZNK5Eigen16CwiseUnaryOpImplINS_8internal16scalar_square_opIfEEKNS_12CwiseUnaryOpINS1_13scalar_add_opIfEEKNS_5ArrayIfLin1ELi1ELi0ELin1ELi1EEEEENS_5DenseEE6packetI$
        leal    -40(%ebp), %eax
        movaps  -104(%ebp), %xmm1
        movl    %eax, (%esp)
        addps   -56(%ebp), %xmm0
        addps   %xmm1, %xmm0
        movaps  LC19, %xmm1
        divps   %xmm0, %xmm1
        movaps  %xmm1, -40(%ebp)
        call    __ZN5Eigen8internal5psqrtIU8__vectorfEET_RKS3_
        movl    -72(%ebp), %edx
        movl    %ebx, 4(%esp)
        movl    %edx, (%esp)
        movaps  %xmm0, -56(%ebp)
        call    __ZNK5Eigen16CwiseUnaryOpImplINS_8internal18scalar_multiple_opIfEEKNS_5ArrayIfLin1ELi1ELi0ELin1ELi1EEENS_5DenseEE6packetILi1EEEU8__vectorfi
        mulps   -56(%ebp), %xmm0
        movaps  %xmm0, (%edi,%ebx,4)
        addl    $4, %ebx
        cmpl    %ebx, -60(%ebp)
        jg      L5512


with __ZNK5Eigen16CwiseUnaryOpImplINS_8internal18scalar_multiple_opIfEEKNS_5ArrayIfLin1ELi1ELi0ELin1ELi1EEENS_5DenseEE6packetILi1EEEU8__vectorfi
Code: Select all
LFB21081:
        pushl   %ebp
LCFI2872:
        movl    %esp, %ebp
LCFI2873:
        movl    8(%ebp), %eax
        movl    12(%ebp), %edx
        popl    %ebp
LCFI2874:
        movss   8(%eax), %xmm0
        movl    4(%eax), %eax
        shufps  $0, %xmm0, %xmm0
        movl    (%eax), %eax
        addps   (%eax,%edx,4), %xmm0
        mulps   %xmm0, %xmm0
        ret


Obviously GCC does not inline the .square() call, which dramatically reduces performance. It also does not recognize xp, yp, ... as loop constants and therefore reloads them on every loop iteration.
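
For comparison, here is a hand-written sketch of what a fully inlined loop with hoisted constants looks like at source level. It is not Eigen code: `coulomb_hoisted` and its parameter list are hypothetical, unaligned loads are used to keep the example simple, and a scalar loop handles the tail (and targets without SSE).

```cpp
#include <cmath>
#include <cstddef>
#if defined(__SSE__)
#  include <xmmintrin.h>
#endif

// Hypothetical hand-written kernel, not taken from Eigen.
// res[i] = q[i] * qp / sqrt((x[i]-xp)^2 + (y[i]-yp)^2 + (z[i]-zp)^2 + cutoff)
void coulomb_hoisted(const float* x, const float* y, const float* z,
                     const float* q, float* res, std::size_t n,
                     float xp, float yp, float zp, float qp, float cutoff)
{
    std::size_t i = 0;
#if defined(__SSE__)
    // Broadcast the loop-invariant scalars once, outside the loop.
    const __m128 vxp = _mm_set1_ps(xp);
    const __m128 vyp = _mm_set1_ps(yp);
    const __m128 vzp = _mm_set1_ps(zp);
    const __m128 vqp = _mm_set1_ps(qp);
    const __m128 vcut = _mm_set1_ps(cutoff);
    for (; i + 4 <= n; i += 4) {
        const __m128 dx = _mm_sub_ps(_mm_loadu_ps(x + i), vxp);
        const __m128 dy = _mm_sub_ps(_mm_loadu_ps(y + i), vyp);
        const __m128 dz = _mm_sub_ps(_mm_loadu_ps(z + i), vzp);
        __m128 r2 = _mm_add_ps(_mm_mul_ps(dx, dx), _mm_mul_ps(dy, dy));
        r2 = _mm_add_ps(r2, _mm_add_ps(_mm_mul_ps(dz, dz), vcut));
        const __m128 qq = _mm_mul_ps(_mm_loadu_ps(q + i), vqp);
        _mm_storeu_ps(res + i, _mm_div_ps(qq, _mm_sqrt_ps(r2)));
    }
#endif
    // Scalar tail; also the complete fallback when SSE is unavailable.
    for (; i < n; ++i) {
        const float dx = x[i] - xp, dy = y[i] - yp, dz = z[i] - zp;
        res[i] = q[i] * qp / std::sqrt(dx * dx + dy * dy + dz * dz + cutoff);
    }
}
```

Timing a kernel like this against the Eigen expression on the same machine gives a rough upper bound on what proper inlining could recover.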

I've tried playing with GCC params and flags, but without success. Has anyone else come across this problem and maybe found a solution? I could really use some help here. Thanks!
ggael
Moderator
That's GCC weirdness. Can you try adding EIGEN_STRONG_INLINE in front of the relevant functions that are not being inlined properly?
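
The general technique behind this suggestion can be sketched without Eigen. `FORCE_INLINE` below is a hypothetical macro written for this example; it is not Eigen's own definition of EIGEN_STRONG_INLINE, which depends on the Eigen version and the compiler. The sketch assumes GCC's `always_inline` attribute or MSVC's `__forceinline`, and falls back to a plain `inline` hint elsewhere.

```cpp
#include <cassert>

// FORCE_INLINE is a hypothetical macro for this sketch only.
#if defined(__GNUC__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#else
#  define FORCE_INLINE inline
#endif

// Small helper standing in for a per-element operation such as
// "add a scalar, then square".
FORCE_INLINE float square_shifted(float v, float shift)
{
    const float d = v + shift;
    return d * d;
}

// Caller: with forced inlining, the loop body should contain no call
// to square_shifted in the generated assembly.
float sum_of_squares(const float* data, int n, float shift)
{
    assert(n >= 0);
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += square_shifted(data[i], shift);
    return acc;
}
```

Compiling such a file with `g++ -S -O3` and searching the output for `call` shows whether the helper was actually inlined.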
ggael
Moderator
By the way, you might also play with GCC's parameters that control inlining (such as the maximum number of instructions allowed for inlining, etc.).
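
A small test file makes it easier to see the effect of such parameters in isolation. The sketch below is a hypothetical miniature of a nested expression chain, not Eigen code; the options in the header comment are GCC options, but the numeric values are illustrative guesses rather than tuned settings, and the file name `kernel.cpp` is assumed.

```cpp
// Possible build-and-inspect commands (values are illustrative only):
//   g++ -S -O3 -msse2 -mfpmath=sse -Winline kernel.cpp
//   g++ -S -O3 -msse2 -mfpmath=sse -finline-limit=10000 kernel.cpp
//   g++ -S -O3 -msse2 -mfpmath=sse --param max-inline-insns-single=10000 kernel.cpp
//   g++ -S -O3 -msse2 -mfpmath=sse --param inline-unit-growth=1000 kernel.cpp
//   g++ -S -O3 -msse2 -mfpmath=sse --param large-function-growth=1000 kernel.cpp
// Then look for remaining "call" instructions in kernel.s.
#include <cassert>
#include <cstddef>

// Innermost layer: array element plus a scalar.
struct AddScalar {
    const float* data;
    float s;
    float coeff(std::size_t i) const { return data[i] + s; }
};

// Outer layer: squares whatever the inner layer returns. Each layer is a
// tiny function that is only cheap if it is inlined into the loop.
template <typename Inner>
struct Square {
    Inner inner;
    float coeff(std::size_t i) const
    {
        const float v = inner.coeff(i);
        return v * v;
    }
};

float sum_squares(const float* data, std::size_t n, float shift)
{
    assert(data != 0 || n == 0);
    const Square<AddScalar> expr = { { data, shift } };
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc += expr.coeff(i);
    return acc;
}
```

`-Winline` only reports functions declared `inline` that GCC declined to inline, so it is a diagnostic aid here rather than a fix.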

