Registered Member
|
Hi Eigen users,
I need to get the solution x to the equation Ax=b. A is square and invertible, so I'm using `partialPivLu`. In our software, the typical size N is between 1000 and 6000. The time Eigen takes to solve this problem is too long for our use case, so I tried alternative solvers (LAPACK from the Accelerate framework on Mac OS X, and MKL). Surprisingly, they run approx. 2.5-3x faster than Eigen's `partialPivLu`. I use Eigen 3.2.8 on Mac OS X/Xcode 7.3. I did a cross-check on Windows using Visual Studio 2013 and got similar results. I've set the appropriate environment variables to run all cases in single-threaded mode (the time difference in multi-threaded mode is even bigger).

I'm wondering if it's expected that Eigen's `partialPivLu` is much slower. This seems to confirm it (although it's with the 3.3 beta): http://www.mathematik.uni-ulm.de/~lehn/ ... age01.html but the official Eigen benchmark page gives a different result: http://download.tuxfamily.org/eigen/btl ... decomp.pdf

I've tried to run Eigen's benchmarks (on Mac OS X), but compilation failed with `use of undeclared identifier 'CLOCK_*'`. What can I try to get better performance out of Eigen's solver? (I'd like to avoid MKL if possible.) Below is my simple benchmark code.
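A minimal sketch of it, assuming random data and a single timed solve, looks like this:

```cpp
// Minimal benchmark sketch: time one partialPivLu() solve of a random NxN
// system (sizes and data are illustrative).
#include <Eigen/Dense>
#include <chrono>
#include <iostream>

int main()
{
    const int n = 6000; // typical problem sizes here are 1000..6000
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n); // square, generically invertible
    Eigen::VectorXd b = Eigen::VectorXd::Random(n);

    auto t0 = std::chrono::steady_clock::now();
    Eigen::VectorXd x = A.partialPivLu().solve(b);
    auto t1 = std::chrono::steady_clock::now();

    std::cout << "n = " << n
              << ", solve time = " << std::chrono::duration<double>(t1 - t0).count() << " s"
              << ", residual = " << (A * x - b).norm() << std::endl;
    return 0;
}
```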
Thanks for your help. |
Moderator
|
Here, for n=6000, it takes 4.7s on my 2.6GHz MacBook, compiled with "g++ -mfma -O3" on OS X and using Eigen 3.3. You need that flag so Eigen can exploit AVX and FMA; otherwise it takes 12s, which matches the expected speedups (AVX -> x2, FMA -> x1.5).
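To double-check which SIMD paths got compiled in, you can print `Eigen::SimdInstructionSetsInUse()`:

```cpp
// Print the SIMD instruction sets Eigen was compiled to use; with the right
// flags the output should mention AVX (whether FMA is listed depends on the
// Eigen version).
#include <Eigen/Core>
#include <iostream>

int main()
{
    std::cout << Eigen::SimdInstructionSetsInUse() << std::endl;
    return 0;
}
```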
|
Registered Member
|
Thanks. Great to see the improvements in 3.3 that close the gap. I could observe similar improvements when using the default clang compiler of Xcode 7.3 (with 3.3 beta 1 and the -mfma flag). With Visual Studio, however, the results were less spectacular. I tried both Visual Studio 2013 and Visual Studio 2015. In all the combinations I tried, the runtime for n=6000 drops from 12s to 8.5s when using the '/arch:AVX2' flag. I tried with 3.3 beta 1 and today's 'default' branch in the repository. According to https://blogs.msdn.microsoft.com/vcblog/2014/02/28/avx2-support-in-visual-studio-c-compiler/, the '/arch:AVX2' flag should enable FMA. Is there any other flag I missed?
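A small check of which instruction-set macros the compiler defines under the current flags:

```cpp
// Print which instruction-set macros are defined under the current compiler
// flags (e.g. MSVC's /arch:AVX2 or gcc/clang's -mfma).
#include <iostream>

int main()
{
#if defined(__AVX2__)
    std::cout << "__AVX2__ is defined\n";
#else
    std::cout << "__AVX2__ is not defined\n";
#endif
#if defined(__FMA__)
    std::cout << "__FMA__ is defined\n";
#else
    std::cout << "__FMA__ is not defined\n";
#endif
    return 0;
}
```
|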
Moderator
|
The only way to investigate this is to look at the generated assembly of the function Eigen::internal::gebp_kernel. Maybe MSVC is messing with register allocation.
|
Registered Member
|
I actually saw no FMA instructions in the assembly. I couldn't find a separate FMA flag in MSVC, but the page linked above suggested that '/arch:AVX2' would enable FMA. However, I noticed that the '__FMA__' macro was not defined: it seems that MSVC defines '__AVX2__', but not '__FMA__' (https://msdn.microsoft.com/en-us/library/b0084kay.aspx). I then tried to add a '#define __FMA__' before including Eigen, and the runtime improved by 1.5s (7s, vs 8.5s without the __FMA__ macro). This is still quite a lot worse than what gcc or clang produces. I'm not used to reading assembly, so I can't judge whether MSVC is messing with register allocation. Below is the assembly (with the '#define __FMA__' hack). Please let me know if I can help otherwise.
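For reference, the hack boils down to this (note that '__FMA__' is normally a compiler-reserved macro, so defining it by hand is exactly that, a hack; it merely unlocks Eigen's FMA code paths under MSVC):

```cpp
// Hack sketch: MSVC defines __AVX2__ under /arch:AVX2 but not __FMA__, so
// Eigen's FMA kernels stay disabled. Defining __FMA__ manually before
// including Eigen enables them (use with care: the macro is reserved).
#if defined(_MSC_VER) && defined(__AVX2__) && !defined(__FMA__)
#define __FMA__
#endif
#include <Eigen/Dense>

int main() { return 0; } // Eigen now compiles with its FMA paths enabled
```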
|
Moderator
|
Thank you for the asm. As I guessed, MSVC fails to properly allocate registers, and it needlessly introduces costly register spilling, even though our code is designed so that the register allocator has nothing to do: we declare 16 variables (= the number of ymm registers) and perform all operations in a way that no temporary is required.
I don't know how to work around this.
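For illustration, a simplified sketch of such a kernel (not our actual gebp_kernel, just the register-allocation idea) could look like this; twelve accumulators plus three A packets and one B packet account for all 16 ymm registers:

```cpp
// Illustrative GEBP-style micro-kernel sketch (not Eigen's actual code):
// 12 accumulators (C0..C11) + 3 A packets (A0..A2) + 1 B packet (B_0) = 16
// named values, matching the 16 ymm registers, so a good allocator has no
// decisions left to make. Compile with -mfma (gcc/clang) or /arch:AVX2 (MSVC).
#include <immintrin.h>

void gebp_like_12x4(const double* A_panel, const double* B_panel,
                    double* C_out, int depth)
{
    __m256d C0  = _mm256_setzero_pd(), C1  = _mm256_setzero_pd(),
            C2  = _mm256_setzero_pd(), C3  = _mm256_setzero_pd(),
            C4  = _mm256_setzero_pd(), C5  = _mm256_setzero_pd(),
            C6  = _mm256_setzero_pd(), C7  = _mm256_setzero_pd(),
            C8  = _mm256_setzero_pd(), C9  = _mm256_setzero_pd(),
            C10 = _mm256_setzero_pd(), C11 = _mm256_setzero_pd();

    for (int k = 0; k < depth; ++k) {
        __m256d A0 = _mm256_loadu_pd(A_panel + 12*k + 0);
        __m256d A1 = _mm256_loadu_pd(A_panel + 12*k + 4);
        __m256d A2 = _mm256_loadu_pd(A_panel + 12*k + 8);

        __m256d B_0 = _mm256_broadcast_sd(B_panel + 4*k + 0);
        C0 = _mm256_fmadd_pd(A0, B_0, C0);
        C1 = _mm256_fmadd_pd(A1, B_0, C1);
        C2 = _mm256_fmadd_pd(A2, B_0, C2);

        B_0 = _mm256_broadcast_sd(B_panel + 4*k + 1);
        C3 = _mm256_fmadd_pd(A0, B_0, C3);
        C4 = _mm256_fmadd_pd(A1, B_0, C4);
        C5 = _mm256_fmadd_pd(A2, B_0, C5);

        B_0 = _mm256_broadcast_sd(B_panel + 4*k + 2);
        C6 = _mm256_fmadd_pd(A0, B_0, C6);
        C7 = _mm256_fmadd_pd(A1, B_0, C7);
        C8 = _mm256_fmadd_pd(A2, B_0, C8);

        B_0 = _mm256_broadcast_sd(B_panel + 4*k + 3);
        C9  = _mm256_fmadd_pd(A0, B_0, C9);
        C10 = _mm256_fmadd_pd(A1, B_0, C10);
        C11 = _mm256_fmadd_pd(A2, B_0, C11);
    }

    _mm256_storeu_pd(C_out + 0,  C0);  _mm256_storeu_pd(C_out + 4,  C1);
    _mm256_storeu_pd(C_out + 8,  C2);  _mm256_storeu_pd(C_out + 12, C3);
    _mm256_storeu_pd(C_out + 16, C4);  _mm256_storeu_pd(C_out + 20, C5);
    _mm256_storeu_pd(C_out + 24, C6);  _mm256_storeu_pd(C_out + 28, C7);
    _mm256_storeu_pd(C_out + 32, C8);  _mm256_storeu_pd(C_out + 36, C9);
    _mm256_storeu_pd(C_out + 40, C10); _mm256_storeu_pd(C_out + 44, C11);
}
```
If the compiler spills any of these values to the stack, every iteration of the k loop pays memory traffic that the design intends to avoid. |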
Moderator
|
Actually, we had a similar issue with clang/gcc on ARM targets that we fixed by adding inline assembly comments to tame the optimizer. You could try them by defining the following before including Eigen:

```cpp
#define EIGEN_ASM_COMMENT(X) __asm { ; X }
```

Also, the interesting part of the generated assembly follows the first occurrence of "EIGEN_GEBP_ONESTEP(0);", where you can see problematic register copies wasting registers:

```
00007FF6F87D4FAE vmovupd ymm6,ymm0
00007FF6F87D4FB2 vmovupd ymm7,ymm8
00007FF6F87D4FB7 vmovupd ymm8,ymm11
```

as well as register spilling such as:

```
00007FF6F87D4FF3 vmovupd ymm0,ymmword ptr [C3]
00007FF6F87D5010 vmovupd ymmword ptr [C5],ymm0
```

The variables C* should stay in registers... |
Registered Member
|
Thanks for the explanations about the asm output. I tried your suggestion with the `__asm` comment. Unfortunately, it seems that MSVC doesn't support inline assembly on x64 [1]; I get `Error C4235 nonstandard extension used: '__asm' keyword not supported on this architecture`. Is there any point in trying a compiler intrinsic instead [2]? After reading [3], I guess I'm out of luck.
[1] https://msdn.microsoft.com/en-us/library/wbk4z78b.aspx
[2] https://msdn.microsoft.com/en-us/library/hh977022.aspx
[3] http://stackoverflow.com/questions/13955162/why-does-adding-assembly-comments-cause-such-radical-change-in-generated-code |
Registered Member
|
The Intel C++ Compiler can be integrated into MSVC, and I guess that it fully supports inline assembly:
https://software.intel.com/en-us/node/513428 |
Registered Member
|
Thanks for the suggestion. Yes, I think 'icc' does support inline assembly on x64, and it's likely that it wouldn't need the inline-assembly hack in the first place. Using 'icc' is not possible in my project, though. A working solution is to use MKL with `EIGEN_USE_MKL_ALL`, which solves all the performance issues. I'd prefer to avoid MKL because of the 50+ MB of DLLs to distribute, the proprietary license, and the more complicated build setup. I wanted to investigate whether it's possible to come close to MKL performance directly in Eigen with the standard compilers. If this is not possible, I think I'll take the MKL option.
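For completeness, the MKL route is just a define away (a minimal sketch, assuming MKL is installed and on the include/link paths):

```cpp
// Sketch: route Eigen's dense decompositions (including PartialPivLU) to MKL.
// Requires the MKL headers and libraries to be available at build time.
#define EIGEN_USE_MKL_ALL
#include <Eigen/Dense>

int main()
{
    const int n = 1000;
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
    Eigen::VectorXd b = Eigen::VectorXd::Random(n);
    Eigen::VectorXd x = A.partialPivLu().solve(b); // now backed by MKL LAPACK
    return 0;
}
```
|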
Moderator
|
Alright, I'm running out of ideas on this issue. Maybe you could give a try to revision bb9fc0721496c (e.g., run: hg up bb9fc0721496c), for which the kernel code is slightly simpler, or maybe add the "register" keyword to the declarations of the variables C0-C11, A0, A1, A2, and B_0 to see if that helps.
|