Can't get Eigen to use ARM NEON instructions

Fri Mar 13, 2015 11:34 am

I'm having trouble getting Eigen to use ARM NEON vector instructions. It's detecting that the NEON instructions are available, but doesn't seem to actually use them and instead just generates four separate operations. I'm not sure if this is because something isn't set up quite right, I'm missing some compiler options, or perhaps the compiler I'm using is not supported. When targeting an Intel processor with SSE enabled, it uses the vector instructions as expected. Any ideas how to get Eigen working with ARM NEON instructions?

Test Code
Based on the suggested test operating on vectors of 4 floats at http://eigen.tuxfamily.org/index.php?ti ... bly_output. I've added to it to check what Eigen is detecting, check that the NEON intrinsics are supported by the compiler, and try with arrays and vectors of floats and ints:

Targeting Intel
Compiler version and command:

The resulting assembly for foo() is similar to that given in the example, using vector instructions to operate on all 4 elements of the vector in one pass:

Targeting ARM
Compiler version and command targeting ARM:

The resulting assembly contains:

confirming that Eigen has detected that NEON instructions are available. Looking at the intrin() function confirms that NEON intrinsics are supported:

But the foo() function operates on a single element of the vector at a time, repeated 4 times, rather than using vector instructions:

Similar behaviour is seen with the other functions I added to operate on arrays and vectors of floats and ints.

Things I've already tried:

Omit -mfpu=neon: Not expecting it to help, just checking that the detection works. As expected, Eigen detects that NEON is not available (EIGEN_VECTORIZE_NEON not defined) and that it will not vectorise (EIGEN_VECTORIZE not defined). The generated assembly is functionally identical.
Use -mfloat-abi=softfp (as specified at http://eigen.tuxfamily.org/index.php?ti ... ization.3F) instead of hard: The generated assembly omits ".eabi_attribute 28, 1" near the top, but is otherwise identical to -mfloat-abi=hard. This fails to link an executable, I think because main startup code or something else g++ links it to is precompiled using -mfloat-abi=hard.
Add -ffast-math -funsafe-math-optimizations -ftree-vectorize -mvectorize-with-neon-quad to the compiler options, from suggestions in various places: Not much difference. Some of the ".eabi_attribute" are different or omitted, .LFB lines have different numbering, and registers are used differently in intrin(), but no change to the assembly for foo() or the other functions using Eigen, which still operate on the 4 floats separately.

System Info

Compiling from an Intel x86_64 PC running Ubuntu 14.04.
Intel native compiler is the one on that PC.
Target ARM processor is an ARM Cortex-A9 MPCore (within an Altera Cyclone V SoC 5CSXFC6D6F31C6 on a Cyclone V SoC FPGA Development Kit) running a Yocto Linux build from http://rocketboards.org/ (although I'm not actually running the code on it yet - just looking at the assembly to check if Eigen is using the NEON instructions).
Arm cross-compiler is the version included in the latest Altera SoC Embedded Design Suite (version 14.1, from http://dl.altera.com/soceds/?edition=subscription).
Eigen is the latest release, version 3.2.4, "installed" by extracting the Eigen/ directory to the include path.

Any suggestions on how to get Eigen working with ARM NEON instructions would be very welcome.

Mark.

mbourne Registered Member Posts 2 Karma 0	Can't get Eigen to use ARM NEON instructions Fri Mar 13, 2015 11:34 am I'm having trouble getting Eigen to use ARM NEON vector instructions. It's detecting that the NEON instructions are available, but doesn't seem to actually use them and instead just generates four separate operations. I'm not sure if this is because something isn't set up quite right, I'm missing some compiler options, or perhaps the compiler I'm using is not supported. When targeting an Intel processor with SSE enabled, it uses the vector instructions as expected. Any ideas how to get Eigen working with ARM NEON instructions? Test Code Based on the suggested test operating on vectors of 4 floats at http://eigen.tuxfamily.org/index.php?ti ... bly_output. I've added to it to check what Eigen is detecting, check that the NEON intrinsics are supported by the compiler, and try with arrays and vectors of floats and ints: Code: Select all #include<Eigen/Core> using namespace Eigen; // EIGEN_ASM_COMMENT does not support __arm__ #if (defined __GNUC__) && ( defined(__i386__) \|\| defined(__x86_64__) ) #define CUSTOM_ASM_COMMENT(X) __asm__("#" X) #elif (defined __GNUC__) && (defined __arm__) #define CUSTOM_ASM_COMMENT(X) __asm__("@" X) #else #define CUSTOM_ASM_COMMENT(X) #endif // Check if vectorisation is enabled #if defined EIGEN_VECTORIZE CUSTOM_ASM_COMMENT("Vectorisation enabled"); #else CUSTOM_ASM_COMMENT("Vectorisation NOT enabled"); #endif // Check which instruction set is detected #if defined EIGEN_VECTORIZE_SSE CUSTOM_ASM_COMMENT("EIGEN_VECTORIZE_SSE"); #elif defined EIGEN_VECTORIZE_ALTIVEC CUSTOM_ASM_COMMENT("EIGEN_VECTORIZE_ALTIVEC"); #elif defined EIGEN_VECTORIZE_NEON CUSTOM_ASM_COMMENT("EIGEN_VECTORIZE_NEON"); #else CUSTOM_ASM_COMMENT("No EIGEN_VECTORIZE_"); #endif #if defined EIGEN_VECTORIZE_NEON // Check that NEON intrinsics work void intrin(float32x4_t& u, float32x4_t& v, float32x4_t& w) { CUSTOM_ASM_COMMENT("begin intrin"); static float32x4_t const t = {3.0f,3.0f,3.0f,3.0f}; u = vmlaq_f32(v, w, t); CUSTOM_ASM_COMMENT("end intrin"); } #endif // Vector4f, based on http://eigen.tuxfamily.org/index.php?title=Developer%27s_Corner#Studying_assembly_output void foo(Vector4f& u, Vector4f& v, Vector4f& w) { CUSTOM_ASM_COMMENT("begin foo"); u = v + (3w); CUSTOM_ASM_COMMENT("end foo"); } // Int version of foo void foo_i(Vector4i& u, Vector4i& v, Vector4i& w) { CUSTOM_ASM_COMMENT("begin foo_i"); u = v + 3w; CUSTOM_ASM_COMMENT("end foo_i"); } // Coefficient-wise array multiply void arrmult(Array4f& u, Array4f& v, Array4f& w) { CUSTOM_ASM_COMMENT("begin arrmult"); u = v w; CUSTOM_ASM_COMMENT("end arrmult"); } // Int version of arrmult void arrmult_i(Array4i& u, Array4i& v, Array4i& w) { CUSTOM_ASM_COMMENT("begin arrmult_i"); u = v * w; CUSTOM_ASM_COMMENT("end arrmult_i"); } Targeting Intel Compiler version and command: Code: Select all `$ g++ --version g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2 Copyright (C) 2013 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. $ g++ -I/usr/local/include/eigen3 -O3 -Wall -S -msse2 -o foo-intel.s foo.cpp` The resulting assembly for foo() is similar to that given in the example, using vector instructions to operate on all 4 elements of the vector in one pass: Code: Select all `#APP # 46 "foo.cpp" 1 #begin foo # 0 "" 2 #NO_APP movaps .LC0(%rip), %xmm0 mulps (%rdx), %xmm0 addps (%rsi), %xmm0 movaps %xmm0, (%rdi) #APP # 48 "foo.cpp" 1 #end foo # 0 "" 2 #NO_APP` Targeting ARM Compiler version and command targeting ARM: Code: Select all $ /opt/altera/14.1/embedded/ds-5/sw/gcc/bin/arm-linux-gnueabihf-g++ --version arm-linux-gnueabihf-g++ (crosstool-NG linaro-1.13.1-4.8-2014.04 - Linaro GCC 4.8-2014.04) 4.8.3 20140401 (prerelease) Copyright (C) 2013 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. $ /opt/altera/14.1/embedded/ds-5/sw/gcc/bin/arm-linux-gnueabihf-g++ -I/opt/altera/14.1/embedded/ds-5/sw/gcc/arm-linux-gnueabihf/include/eigen3 -O3 -Wall -S -marm -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=hard -o foo-arm.s foo.cpp The resulting assembly contains: Code: Select all `@Vectorisation enabled @EIGEN_VECTORIZE_NEON` confirming that Eigen has detected that NEON instructions are available. Looking at the intrin() function confirms that NEON intrinsics are supported: Code: Select all `#APP @ 36 "foo.cpp" 1 @begin intrin @ 0 "" 2 vmov.f32 q10, #3.0e+0 @ v4sf vld1.64 {d16-d17}, [r1:64] vld1.64 {d18-d19}, [r2:64] vmla.f32 q8, q9, q10 vst1.64 {d16-d17}, [r0:64] #APP @ 39 "foo.cpp" 1 @end intrin @ 0 "" 2` But the foo() function operates on a single element of the vector at a time, repeated 4 times, rather than using vector instructions: Code: Select all #APP @ 46 "foo.cpp" 1 @begin foo @ 0 "" 2 flds s13, [r2] flds s14, [r1] fconsts s15, #8 fmacs s14, s13, s15 fsts s14, [r0] flds s13, [r2, #4] flds s14, [r1, #4] fmacs s14, s13, s15 fsts s14, [r0, #4] flds s13, [r2, #8] flds s14, [r1, #8] fmacs s14, s13, s15 fsts s14, [r0, #8] flds s13, [r2, #12] flds s14, [r1, #12] fmacs s14, s13, s15 fsts s14, [r0, #12] #APP @ 48 "foo.cpp" 1 @end foo @ 0 "" 2 Similar behaviour is seen with the other functions I added to operate on arrays and vectors of floats and ints. Things I've already tried: Omit -mfpu=neon: Not expecting it to help, just checking that the detection works. As expected, Eigen detects that NEON is not available (EIGEN_VECTORIZE_NEON not defined) and that it will not vectorise (EIGEN_VECTORIZE not defined). The generated assembly is functionally identical. Use -mfloat-abi=softfp (as specified at http://eigen.tuxfamily.org/index.php?ti ... ization.3F) instead of hard: The generated assembly omits ".eabi_attribute 28, 1" near the top, but is otherwise identical to -mfloat-abi=hard. This fails to link an executable, I think because main startup code or something else g++ links it to is precompiled using -mfloat-abi=hard. Add -ffast-math -funsafe-math-optimizations -ftree-vectorize -mvectorize-with-neon-quad to the compiler options, from suggestions in various places: Not much difference. Some of the ".eabi_attribute" are different or omitted, .LFB lines have different numbering, and registers are used differently in intrin(), but no change to the assembly for foo() or the other functions using Eigen, which still operate on the 4 floats separately. System Info Compiling from an Intel x86_64 PC running Ubuntu 14.04. Intel native compiler is the one on that PC. Target ARM processor is an ARM Cortex-A9 MPCore (within an Altera Cyclone V SoC 5CSXFC6D6F31C6 on a Cyclone V SoC FPGA Development Kit) running a Yocto Linux build from http://rocketboards.org/ (although I'm not actually running the code on it yet - just looking at the assembly to check if Eigen is using the NEON instructions). Arm cross-compiler is the version included in the latest Altera SoC Embedded Design Suite (version 14.1, from http://dl.altera.com/soceds/?edition=subscription). Eigen is the latest release, version 3.2.4, "installed" by extracting the Eigen/ directory to the include path. Any suggestions on how to get Eigen working with ARM NEON instructions would be very welcome. Mark.
ggael Moderator Posts 3447 Karma 19 OS	Re: Can't get Eigen to use ARM NEON instructions Tue Mar 17, 2015 9:11 am On ARM, only dynamically sized vector and matrices are vectorized because, to be worth the effort, the vectorization of small fixed sized vector requires that the stack is 16-bytes aligned, which cannot be guaranteed on ARM.
mbourne Registered Member Posts 2 Karma 0	Re: Can't get Eigen to use ARM NEON instructions Wed Mar 18, 2015 1:23 pm Thanks. That makes sense. Changing the parameters from Vector4f to VectorXf does indeed result in vectorised instructions on groups of 4 elements, so looks like my setup is working - just an unsuitable test in this case. The FAQs give the example using Vector4f as a test that vectorisation is working: http://eigen.tuxfamily.org/index.php?ti ... ng_used.3F http://eigen.tuxfamily.org/index.php?ti ... bly_output Would it be worth noting in those sections that these examples using fixed-size vectors may not be vectorised anyway on some architectures?

Can't get Eigen to use ARM NEON instructions

Page 1 of 1 (3 posts)

Can't get Eigen to use ARM NEON instructions

Re: Can't get Eigen to use ARM NEON instructions

Re: Can't get Eigen to use ARM NEON instructions

Bookmarks

Who is online