This forum has been archived. All content is frozen. Please use KDE Discuss instead.

ArrayBase< MatrixXcf >::abs2 not auto-vectorizing.

Tags: None
(comma "," separated)
User avatar
willo
Registered Member
Posts
14
Karma
0
OS
Eigen community and developers,

I've found a case where I expect some auto-vectorized code to result, but am observing none.

I've trying to use a complex, dynamic matrix, and use the ArrayBase< Derived >::abs2() to get a real, dynamic matrix with the squared magnitude of the input.

I'm observing that the resulting code is not auto-vectorized.

Doubly worse, I'm observing code that gets the complex-absolute-value, then squares it. This is wasteful, as the complex-absolute-value does a square-root which is un-done by the subsequent square operation.

I'd expect that 'array.abs2()' should evaluate to something nearly identical to:
array.real().square() + array.imag().square()

My question is, should I expect:
- the result of abs2 to skip the square-root then square?
- the resulting code to be auto-vectorized?


My processor is Intel 64-bit Core 2 Duo.

My OS is Ubuntu 11.04.

My Eigen library version is 3.0.2.

My compiler is g++-4.5.

My compiler flags are: -Wall -O3 -DNDEBUG -march=native

My test code is here:
Code: Select all
#include <Eigen/Dense>

#include <iostream>

int main( int argc, char* argv[] )
{
    Eigen::MatrixXcf in_matrix( 20, 40 );
    Eigen::MatrixXf out_matrix( 20, 40 );

    asm( "# BEGIN LIBRARY" );
    out_matrix.array() = in_matrix.array().abs2();
    asm( "# END LIBRARY" );

    asm( "# BEGIN MANUAL" );
    out_matrix.array() = in_matrix.array().real().square()
                         + in_matrix.array().imag().square();
    asm( "# END MANUAL" );

    std::cout << "In matrix: " << in_matrix << std::endl;
    std::cout << "Out matrix: " << out_matrix << std::endl;

    return 0;
}
User avatar
ggael
Moderator
Posts
3447
Karma
19
OS
The difficulty is that the result of abs2 is not a complex, and so we would need to take two packets of complexes to return one of reals. That's not impossible, but difficult to do in a general manner. In the future we'll support packet of various sizes for the same scalar type. Therefore, we can imagine meta packet types that will aggregate multiple atomic packets that is exactly what we need for abs2 and efficient loop peeling.
User avatar
willo
Registered Member
Posts
14
Karma
0
OS
Gaël,

Thanks for the quick response.

Okay, so I can't expect Array< MatrixXcf >::abs2() to be vectorized any time soon.

Is my manual formulation of 'abs2' the best I can do with Eigen 3.0.x?
User avatar
ggael
Moderator
Posts
3447
Karma
19
OS
in the meantime the best is to use .abs2(). Indeed, std::norm(complex) returns the squared magnitude and is therefore implemented as re^2+im^2. Yes, I know that's very misleading, a mistake of C++ among many others...
User avatar
willo
Registered Member
Posts
14
Karma
0
OS
Gaël,

I've discovered that, for my environment, it is better to use the "manual" evaluation of 'abs2' for a complex matrix type. Specifically, the resulting assembly code is making SSE packet-calls, and the resulting run-time is roughly an order of magnitude improved.

I suspect that, despite the limitations of Eigen to auto-vectorize complex-to-real operations, this is still faster because:
[list=]
[*] some auto-vectorized code is being created.
[*] the square-root then square operations are not required.
[/list]

What do you think about these results?


Below is my updated example:
Code: Select all
#include <Eigen/Dense>

#include <ctime> // timespec, clock_gettime, CLOCK_REALTIME
#include <iostream>

double get_time()
{
    timespec timespec;
    int rc = -1;
    rc = clock_gettime( CLOCK_REALTIME, &timespec );
    assert( rc == 0 );
    const double time = static_cast< double >( timespec.tv_sec )
                        + static_cast< double >( timespec.tv_nsec ) * 1e-9;
    return time;
}

int main( int argc, char* argv[] )
{   
    const size_t num_rows = 1000u;
    const size_t num_cols = 1000u;
    Eigen::MatrixXcf in_matrix = Eigen::MatrixXcf::Random( num_rows, num_cols );
    Eigen::MatrixXf out_matrix = Eigen::MatrixXf::Random( num_rows, num_cols );
       
    const double library_begin_time = get_time();
    asm( "# BEGIN LIBRARY" );
    out_matrix.array() = in_matrix.array().abs2();
    asm( "# END LIBRARY" );
    const double library_end_time = get_time();

    const double manual_begin_time = get_time();
    asm( "# BEGIN MANUAL" );
    out_matrix.array() = in_matrix.array().real().square()
                         + in_matrix.array().imag().square();
    asm( "# END MANUAL" );
    const double manual_end_time = get_time();

    std::cout << "In matrix: " << in_matrix << std::endl;
    std::cout << "Out matrix: " << out_matrix << std::endl;

    std::cout << "Library elapsed-time: "
              << library_end_time - library_begin_time << "." << std::endl;
    std::cout << "Manual elapsed-time: "
              << manual_end_time - manual_begin_time << "." << std::endl;

    return 0;
}



My compiler was: g++ (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2

My compiler flags were: "-Wall -O2 -l rt"

Below is a section of the resulting assembly code:
Code: Select all
        call    _Z8get_timev
        movsd   %xmm0, (%rsp)
#APP
# 25 "test_eigen_complex_abs2_vectorization.cpp" 1
        # BEGIN LIBRARY
# 0 "" 2
#NO_APP
        movq    40(%rsp), %rax
        cmpq    72(%rsp), %rax
        jne     .L722
        movq    48(%rsp), %rbp
        cmpq    80(%rsp), %rbp
        jne     .L722
        imulq   %rax, %rbp
        testq   %rbp, %rbp
        jle     .L724
        movq    32(%rsp), %r13
        movq    64(%rsp), %r12
        xorl    %ebx, %ebx
        .p2align 4,,10
        .p2align 3
.L725:
        movq    (%r12,%rbx,8), %xmm0
        call    cabsf
        mulss   %xmm0, %xmm0
        movss   %xmm0, 0(%r13,%rbx,4)
        addq    $1, %rbx
        cmpq    %rbp, %rbx
        jne     .L725
.L724:
#APP
# 27 "test_eigen_complex_abs2_vectorization.cpp" 1
        # END LIBRARY
# 0 "" 2
#NO_APP
        call    _Z8get_timev
        movsd   %xmm0, 8(%rsp)
        call    _Z8get_timev
        movsd   %xmm0, 16(%rsp)
#APP
# 31 "test_eigen_complex_abs2_vectorization.cpp" 1
        # BEGIN MANUAL
# 0 "" 2
#NO_APP
        movq    72(%rsp), %rax
        cmpq    40(%rsp), %rax
        movq    80(%rsp), %rcx
        jne     .L726
        cmpq    48(%rsp), %rcx
        jne     .L726
        imulq   %rax, %rcx
        testq   %rcx, %rcx
        jle     .L728
        movq    32(%rsp), %rsi
        movq    64(%rsp), %rdx
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L729:
        movss   4(%rdx), %xmm0
        movss   (%rdx), %xmm1
        mulss   %xmm0, %xmm0
        addq    $8, %rdx
        mulss   %xmm1, %xmm1
        addss   %xmm1, %xmm0
        movss   %xmm0, (%rsi,%rax,4)
        addq    $1, %rax
        cmpq    %rcx, %rax
        jne     .L729
.L728:
#APP
# 34 "test_eigen_complex_abs2_vectorization.cpp" 1
        # END MANUAL
# 0 "" 2
#NO_APP
        call    _Z8get_timev
        movl    $11, %edx



The resuling timing numbers are usually:
In matrix: (-0.211234,0.680375)
Out matrix: 0.507531
Library elapsed-time: 0.0228312.
Manual elapsed-time: 0.00304437.
User avatar
ggael
Moderator
Posts
3447
Karma
19
OS
The "manual" version is not vectorized. It uses scalar SSE instructions, not the vector ones. The problem of .abs2() is that it calls the function cabsf and then squares. This clearly must be reimplemented in a more optimized way!
User avatar
willo
Registered Member
Posts
14
Karma
0
OS
You're right, they are SIMD instructions, but not the vectorized/packet ones.

I agree 'abs2' for complex types should be implemented in a different, optimized way. Should I file a bug-report, or is this something that is already being addressed?
User avatar
ggael
Moderator
Posts
3447
Karma
19
OS
this is already fixed in the default branch.
User avatar
willo
Registered Member
Posts
14
Karma
0
OS
I see your change at:
https://bitbucket.org/eigen/eigen/chang ... a04dc004e5

I just wrote a small unit-test for std::norm, vs. a manual 'magnitude squared', and it is doing 'cabs' then squaring it! I can't believe GNU's 'std::norm' has such a basic bug.


Bookmarks



Who is online

Registered users: Bing [Bot], Google [Bot], Yahoo [Bot]