
Can't get Eigen to use ARM NEON instructions

mbourne
I'm having trouble getting Eigen to use ARM NEON vector instructions. Eigen detects that NEON is available, but the generated code doesn't use the NEON instructions; instead it performs four separate scalar operations. I'm not sure whether something isn't set up quite right, I'm missing some compiler options, or the compiler I'm using is not supported. When targeting an Intel processor with SSE enabled, Eigen uses the vector instructions as expected. Any ideas how to get Eigen working with ARM NEON instructions?

Test Code
This is based on the suggested test operating on vectors of 4 floats at http://eigen.tuxfamily.org/index.php?ti ... bly_output. I've extended it to check what Eigen is detecting, to check that the compiler supports the NEON intrinsics, and to try arrays and vectors of both floats and ints:
Code:
#include <Eigen/Core>

using namespace Eigen;

// EIGEN_ASM_COMMENT does not support __arm__
#if (defined __GNUC__) && ( defined(__i386__) || defined(__x86_64__) )
  #define CUSTOM_ASM_COMMENT(X) __asm__("#" X)
#elif (defined __GNUC__) && (defined __arm__)
  #define CUSTOM_ASM_COMMENT(X) __asm__("@" X)
#else
  #define CUSTOM_ASM_COMMENT(X)
#endif

// Check if vectorisation is enabled
#if defined EIGEN_VECTORIZE
   CUSTOM_ASM_COMMENT("Vectorisation enabled");
#else
   CUSTOM_ASM_COMMENT("Vectorisation NOT enabled");
#endif

// Check which instruction set is detected
#if defined EIGEN_VECTORIZE_SSE
   CUSTOM_ASM_COMMENT("EIGEN_VECTORIZE_SSE");
#elif defined EIGEN_VECTORIZE_ALTIVEC
   CUSTOM_ASM_COMMENT("EIGEN_VECTORIZE_ALTIVEC");
#elif defined EIGEN_VECTORIZE_NEON
   CUSTOM_ASM_COMMENT("EIGEN_VECTORIZE_NEON");
#else
   CUSTOM_ASM_COMMENT("No EIGEN_VECTORIZE_*");
#endif

#if defined EIGEN_VECTORIZE_NEON
// Check that NEON intrinsics work
void intrin(float32x4_t& u, float32x4_t& v, float32x4_t& w)
{
   CUSTOM_ASM_COMMENT("begin intrin");
   static float32x4_t const t = {3.0f,3.0f,3.0f,3.0f};
   u = vmlaq_f32(v, w, t);
   CUSTOM_ASM_COMMENT("end intrin");
}
#endif

// Vector4f, based on http://eigen.tuxfamily.org/index.php?title=Developer%27s_Corner#Studying_assembly_output
void foo(Vector4f& u, Vector4f& v, Vector4f& w)
{
   CUSTOM_ASM_COMMENT("begin foo");
   u = v + (3*w);
   CUSTOM_ASM_COMMENT("end foo");
}

// Int version of foo
void foo_i(Vector4i& u, Vector4i& v, Vector4i& w)
{
   CUSTOM_ASM_COMMENT("begin foo_i");
   u = v + 3*w;
   CUSTOM_ASM_COMMENT("end foo_i");
}

// Coefficient-wise array multiply
void arrmult(Array4f& u, Array4f& v, Array4f& w)
{
   CUSTOM_ASM_COMMENT("begin arrmult");
   u = v * w;
   CUSTOM_ASM_COMMENT("end arrmult");
}

// Int version of arrmult
void arrmult_i(Array4i& u, Array4i& v, Array4i& w)
{
   CUSTOM_ASM_COMMENT("begin arrmult_i");
   u = v * w;
   CUSTOM_ASM_COMMENT("end arrmult_i");
}

Targeting Intel
Compiler version and command:
Code:
$ g++ --version
g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ g++ -I/usr/local/include/eigen3 -O3 -Wall -S -msse2 -o foo-intel.s foo.cpp

The resulting assembly for foo() is similar to that given in the example, using vector instructions to operate on all 4 elements of the vector in one pass:
Code:
#APP
# 46 "foo.cpp" 1
   #begin foo
# 0 "" 2
#NO_APP
   movaps   .LC0(%rip), %xmm0
   mulps   (%rdx), %xmm0
   addps   (%rsi), %xmm0
   movaps   %xmm0, (%rdi)
#APP
# 48 "foo.cpp" 1
   #end foo
# 0 "" 2
#NO_APP

Targeting ARM
Compiler version and command targeting ARM:
Code:
$ /opt/altera/14.1/embedded/ds-5/sw/gcc/bin/arm-linux-gnueabihf-g++ --version
arm-linux-gnueabihf-g++ (crosstool-NG linaro-1.13.1-4.8-2014.04 - Linaro GCC 4.8-2014.04) 4.8.3 20140401 (prerelease)
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ /opt/altera/14.1/embedded/ds-5/sw/gcc/bin/arm-linux-gnueabihf-g++ -I/opt/altera/14.1/embedded/ds-5/sw/gcc/arm-linux-gnueabihf/include/eigen3 -O3 -Wall -S -marm -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=hard -o foo-arm.s foo.cpp

The resulting assembly contains:
Code:
   @Vectorisation enabled
   @EIGEN_VECTORIZE_NEON

confirming that Eigen has detected that NEON instructions are available. Looking at the intrin() function confirms that NEON intrinsics are supported:
Code:
#APP
@ 36 "foo.cpp" 1
   @begin intrin
@ 0 "" 2
   vmov.f32   q10, #3.0e+0  @ v4sf
   vld1.64   {d16-d17}, [r1:64]
   vld1.64   {d18-d19}, [r2:64]
   vmla.f32   q8, q9, q10
   vst1.64   {d16-d17}, [r0:64]
#APP
@ 39 "foo.cpp" 1
   @end intrin
@ 0 "" 2

But the foo() function operates on a single element of the vector at a time, repeated 4 times, rather than using vector instructions:
Code:
#APP
@ 46 "foo.cpp" 1
   @begin foo
@ 0 "" 2
   flds   s13, [r2]
   flds   s14, [r1]
   fconsts   s15, #8
   fmacs   s14, s13, s15
   fsts   s14, [r0]
   flds   s13, [r2, #4]
   flds   s14, [r1, #4]
   fmacs   s14, s13, s15
   fsts   s14, [r0, #4]
   flds   s13, [r2, #8]
   flds   s14, [r1, #8]
   fmacs   s14, s13, s15
   fsts   s14, [r0, #8]
   flds   s13, [r2, #12]
   flds   s14, [r1, #12]
   fmacs   s14, s13, s15
   fsts   s14, [r0, #12]
#APP
@ 48 "foo.cpp" 1
   @end foo
@ 0 "" 2

Similar behaviour is seen with the other functions I added to operate on arrays and vectors of floats and ints.

Things I've already tried:
  • Omit -mfpu=neon: Not expecting it to help, just checking that the detection works. As expected, Eigen detects that NEON is not available (EIGEN_VECTORIZE_NEON not defined) and that it will not vectorise (EIGEN_VECTORIZE not defined). The generated assembly is functionally identical.
  • Use -mfloat-abi=softfp (as specified at http://eigen.tuxfamily.org/index.php?ti ... ization.3F) instead of hard: The generated assembly omits ".eabi_attribute 28, 1" near the top, but is otherwise identical to -mfloat-abi=hard. Linking an executable then fails, I think because the startup code or the libraries that g++ links against were built with -mfloat-abi=hard.
  • Add -ffast-math -funsafe-math-optimizations -ftree-vectorize -mvectorize-with-neon-quad to the compiler options, from suggestions in various places: Not much difference. Some of the ".eabi_attribute" are different or omitted, .LFB lines have different numbering, and registers are used differently in intrin(), but no change to the assembly for foo() or the other functions using Eigen, which still operate on the 4 floats separately.
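As a cross-check on the first bullet, the compiler's own predefined macros can be inspected independently of Eigen's EIGEN_VECTORIZE_* macros. The sketch below is only an illustration: the helper name compiler_simd_target is hypothetical, and it reports what the compiler flags enable, not what Eigen decides to vectorise.

```cpp
#include <string>

// Hypothetical helper (not part of Eigen): reports which SIMD instruction set
// the compiler has enabled, judging only by its predefined macros.
// GCC defines __ARM_NEON__ (and newer versions __ARM_NEON) with -mfpu=neon,
// __SSE2__ with -msse2, and __ALTIVEC__ with -maltivec.
std::string compiler_simd_target()
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return "NEON";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__ALTIVEC__)
    return "AltiVec";
#else
    return "none";
#endif
}
```

If this reports "NEON" while the Eigen code is still scalar, the compiler flags are probably not the problem and the cause is more likely inside Eigen's own vectorisation logic.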

System Info
  • Compiling from an Intel x86_64 PC running Ubuntu 14.04.
  • Intel native compiler is the one on that PC.
  • Target ARM processor is an ARM Cortex-A9 MPCore (within an Altera Cyclone V SoC 5CSXFC6D6F31C6 on a Cyclone V SoC FPGA Development Kit) running a Yocto Linux build from http://rocketboards.org/ (although I'm not actually running the code on it yet - just looking at the assembly to check if Eigen is using the NEON instructions).
  • ARM cross-compiler is the version included in the latest Altera SoC Embedded Design Suite (version 14.1, from http://dl.altera.com/soceds/?edition=subscription).
  • Eigen is the latest release, version 3.2.4, "installed" by extracting the Eigen/ directory to the include path.

Any suggestions on how to get Eigen working with ARM NEON instructions would be very welcome.

Mark.
ggael
On ARM, only dynamically sized vectors and matrices are vectorized. For the vectorization of small fixed-size vectors to be worth the effort, the stack must be 16-byte aligned, and that cannot be guaranteed on ARM.
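To illustrate the alignment point, here is a minimal sketch in plain C++11. The types PlainFloat4 and AlignedFloat4 and the helper is_aligned are hypothetical names for this example, not Eigen's own definitions. A plain aggregate of 4 floats only carries the alignment of float, so 16-byte aligned vector loads cannot be assumed for it unless the type is explicitly over-aligned and the toolchain honours that request for stack objects.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical types for illustration only; these are not Eigen's definitions.
struct PlainFloat4 { float v[4]; };                // only as aligned as a float
struct alignas(16) AlignedFloat4 { float v[4]; };  // explicitly over-aligned to 16 bytes

// Returns true if the address p is a multiple of the given alignment.
inline bool is_aligned(const void* p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

static_assert(alignof(PlainFloat4) == alignof(float),
              "a plain float[4] aggregate has the alignment of float");
static_assert(alignof(AlignedFloat4) == 16,
              "the over-aligned type requests 16-byte alignment");
```

On a toolchain that honours over-alignment for automatic variables, is_aligned(&x, 16) holds for a local AlignedFloat4 x; for a local PlainFloat4 it may or may not hold, depending on where the object happens to land on the stack.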
mbourne
Thanks, that makes sense. Changing the parameters from Vector4f to VectorXf does indeed result in vectorised instructions operating on groups of 4 elements, so it looks like my setup is working - this was just an unsuitable test in this case.

The FAQs give the example using Vector4f as a test that vectorisation is working:
http://eigen.tuxfamily.org/index.php?ti ... ng_used.3F
http://eigen.tuxfamily.org/index.php?ti ... bly_output
Would it be worth noting in those sections that these examples using fixed-size vectors may not be vectorised anyway on some architectures?

