Registered Member
|
Hi
I started using Eigen because I am working on real-time applications. The benchmark at http://eigen.tuxfamily.org/index.php?title=Benchmark encouraged me to give it a try. The accuracy of the results is fine, but I have run into efficiency problems with the example I am trying to implement. I am therefore writing to ask whether I am using Eigen incorrectly, not setting the best compiler options, or both. First I will explain the example I am trying to solve, and then the main compiler options I used to build the program.

Testing example
I have to evaluate the following general expression many times:
A(4,1) = B(4,4) * C(4,1); //where A and C are column vectors and B is a 4x4 matrix.
The expression must be evaluated inside two for loops: the outer one goes from 0 to nB and the inner one from 0 to nPoints, where nPoints takes different values for each index from 0 to nB, but these values are always the same. Each C vector is a row of a matrix TRP(nRows,4), and the resulting A vector must be stored in the corresponding row of an equivalent matrix called TP.
The variables of my problem are:
ATT(4,4) //This is the matrix B shown in the general expression above.
TRP(nRows,4) //Matrix with nRows rows and 4 columns.
TP(nRows,4) //Matrix with nRows rows and 4 columns.
The matrices ATT and TRP and all the other values (nB, nPoints and nRows) are known at all times. Only the TP matrix needs to be calculated.

Resolution
To work out the solution I have written two different functions in the same program. The first one solves the problem the classical way, and the second one uses Eigen. The numerical results of both functions are identical in every case, but the first one is almost 30% faster than the function that uses Eigen.
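As a rough illustration of the computation described above (not the poster's actual code), the classical method could look something like the sketch below. It assumes row-major storage, flattens the two loops into a single loop over all nRows rows, and uses a hypothetical function name apply_classic:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: applies a row-major 4x4 matrix B to every row of TRP
// (nRows x 4, row-major) and stores each result in the matching row of TP.
void apply_classic(const std::array<double, 16>& B,
                   const std::vector<double>& TRP,
                   std::vector<double>& TP,
                   std::size_t nRows)
{
    assert(TRP.size() >= 4 * nRows && TP.size() >= 4 * nRows);
    for (std::size_t r = 0; r < nRows; ++r) {
        const double* c = &TRP[4 * r];  // C: one row of TRP, used as a column vector
        double* a = &TP[4 * r];         // A: the corresponding row of TP
        for (std::size_t i = 0; i < 4; ++i) {
            a[i] = B[4 * i + 0] * c[0] + B[4 * i + 1] * c[1]
                 + B[4 * i + 2] * c[2] + B[4 * i + 3] * c[3];
        }
    }
}
```

A loop of this shape is easy for a compiler to unroll and vectorize, which is one reason it makes a demanding baseline.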
Since the arrays ATT and TRP are known at all times, I have created Eigen matrices that point to them using the Map expression, as below:
Map<Matrix4d, Aligned> A44(AT[0],4,4);
Map<MatrixXd, Aligned> mTNP(TP[0],4,nRows);
Map<MatrixXd, Aligned> mTNRP(TRP[0],4,nRows);
Below I have copied a brief summary of the C++ code.

Setting options of compiler
I have built and run the program on two platforms, Visual Studio 2005 and 2008. With Visual Studio 2008 I have used both its own compiler and the Intel® Compiler Suite Professional Edition 11.1 for Windows. The main options I have set in Release mode are:
C/C++
- Optimization, Optimization: Maximize Speed (/O2)
- Code Generation, Runtime Library: Multi-threaded DLL (/MD)
- Code Generation, Enable Enhanced Instruction Set: Streaming SIMD Extensions 2 (/arch:SSE2)
Linker
- Debugging, Generate Debug Info: Yes (/DEBUG)
I have tried setting the last option to No, but although compiling and linking succeed, the program then fails at run time because it lacks debug information.
The code is the following:
Could anybody help me with this low efficiency problem? Thanks. |
Registered Member
|
I have little time at the moment, but could you try out the following two options?
1) Change your maps to: Map< Matrix<double,4,Dynamic>, Aligned>
2) Simply change your product to (leave out the for-loops)
HTH, Hauke
Note: This advice applies to the development branch... |
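To see why a 4 x nRows map fits this data, recall that Eigen stores matrices column-major by default. A row-major nRows x 4 array therefore occupies memory exactly like a column-major 4 x nRows matrix in which every point is one column, so a single product over the mapped data can cover all points (presumably something along the lines of mTNP = A44 * mTNRP, though that exact expression is an assumption). The index helpers below are hypothetical and only illustrate the layout equivalence:

```cpp
#include <cassert>
#include <cstddef>

// Offset of TRP[row][col] in a row-major nRows x 4 array.
inline std::size_t row_major_offset(std::size_t row, std::size_t col)
{
    assert(col < 4);
    return row * 4 + col;
}

// Offset of element (i, j) in a column-major matrix with 4 rows,
// which is the default storage order for Eigen matrices.
inline std::size_t col_major_offset(std::size_t i, std::size_t j)
{
    assert(i < 4);
    return j * 4 + i;
}

// True if component i of point j sits at the same offset in both views
// for every point j < nRows.
inline bool layouts_agree(std::size_t nRows)
{
    for (std::size_t j = 0; j < nRows; ++j)
        for (std::size_t i = 0; i < 4; ++i)
            if (col_major_offset(i, j) != row_major_offset(j, i))
                return false;
    return true;
}
```

Fixing the row count at 4 at compile time, as in option 1, also gives Eigen a chance to unroll the inner 4x4 work.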
Registered Member
|
I see that you have automatic vectorization turned on in your compiler, but you might want to put the compiler in verbose mode to find out whether your loops are actually being vectorized. I know gcc and icc can do this. In fact, I seem to remember icc giving output of this sort by default. In our work we've seen at least one matrix multiplication that ran faster with (auto-vectorized) nested for loops than with calls to OpenCV (which I think was wrapping the precompiled Debian ATLAS BLAS at the time).
Cheers, Drew |
Registered Member
|
For Aligned Map, you absolutely need the development branch of Eigen, aka Eigen3. It does not work in Eigen2.
But then do note that you must guarantee that the pointer you're passing to it is 16-byte aligned. So you must declare your AT array with an alignment attribute. You can use EIGEN_ALIGN16.
Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list! |
Registered Member
|
First, what is the advantage of using an Aligned Map instead of a plain Map?
All I need is to pass in the arrays AT and TNRP, which are defined beforehand in an external function. How can I declare the AT array with an alignment attribute? By using the Struct Member Alignment (/Zp) compiler option? |
Registered Member
|
If you want vectorization (SSE...) of your 4x4 objects to happen, the arrays must be aligned at 16-byte boundaries. This is how SSE and friends work. You can do that in a portable way by
So this requires that you have control over the creation of the array in question. Then you must tell Eigen that it can rely on the pointer being aligned. This is what Aligned does, but it only works correctly in the devel branch. Otherwise, you can also forget about Aligned altogether. Your code will run safely, just without SSE. The point is that here you have a fixed size. If you had a dynamic size, you'd still get vectorization without Aligned (although Aligned would still help).
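When the array is created elsewhere, it can be worth checking the pointer before promising alignment to Eigen. A minimal sketch of such a check, using a hypothetical helper name is_aligned16:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if p sits on a 16-byte boundary, which is what
// aligned SSE loads and stores require.
inline bool is_aligned16(const void* p)
{
    assert(p != nullptr);
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

If the check fails for a given buffer, a plain Map without Aligned remains the safe choice.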
|
Registered Member
|
The advantage of an Aligned Map is that Eigen then knows it may use aligned load routines when using SSE. Stack data can be aligned as follows (though you need to test this, since I am not sure how it behaves for multi-dimensional array declarations; EIGEN_ALIGN16 double AT[16] works for sure):
EIGEN_ALIGN16 double AT[4][4];
- Hauke |
Registered Member
|
I have declared:
EIGEN_ALIGN16 double AT[4][4]; ..., but there is a compile error C2065: 'EIGEN_ALIGN16' : undeclared identifier. I have also tried with a double AA[10], only for testing, but the result is the same... |
Registered Member
|
Did you include <Eigen/Core> ? The definition is located in Eigen\src\Core\util\Macros.h. Alternatively, since you are running on MSVC you can do:
The Eigen declaration is just more portable. - Hauke |
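For readers without the Eigen macro at hand, a compiler-specific declaration can be wrapped in a small macro. The sketch below is only an illustration: DEMO_ALIGN16 and demo_alignment_remainder are hypothetical names, and it assumes either MSVC or a GCC-compatible compiler:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical macro: picks the compiler-specific 16-byte alignment syntax.
#if defined(_MSC_VER)
#  define DEMO_ALIGN16 __declspec(align(16))
#else
#  define DEMO_ALIGN16 __attribute__((aligned(16)))
#endif

// Declares an aligned 4x4 array on the stack and returns its start address
// modulo 16; a result of 0 means the alignment request was honoured.
inline std::uintptr_t demo_alignment_remainder()
{
    DEMO_ALIGN16 double AT[4][4] = {};
    return reinterpret_cast<std::uintptr_t>(&AT[0][0]) % 16;
}
```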
Registered Member
|
With Struct Member Alignment set to /Zp16 in the compiler options and all variables (AT, TNP, TP) declared as __declspec(align(16)), the program runs well, but it is still slower than the classic method...
|
Registered Member
|
Hi again, as soon as you post a self-contained example that allows us to verify your issue, we will be happy to assist you further. Below you will find a starting point for creating such a self-contained example. For timing your code, please take a look at Eigen/Bench/BenchTimer.h.
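If the Eigen bench timer is not at hand, a rough stand-in can be built on std::chrono. The sketch below is not BenchTimer and not the starting point mentioned above; best_time_seconds is a hypothetical helper that runs a callable several times and reports the fastest run, which is a common way to reduce timing noise:

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs fn `repeats` times and returns the shortest
// wall-clock duration of a single run, in seconds.
template <typename Fn>
double best_time_seconds(Fn&& fn, int repeats)
{
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;
    double best = -1.0;
    for (int i = 0; i < repeats; ++i) {
        const auto t0 = clock::now();
        fn();
        const auto t1 = clock::now();
        const double s = std::chrono::duration<double>(t1 - t0).count();
        if (best < 0.0 || s < best)
            best = s;
    }
    return best;
}
```

For a product as small as 4x4, the callable should itself loop over many points, and its result should be used afterwards so the compiler cannot optimize the work away.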
Regards, Hauke |
Registered Member
|
Hi
Here I have pasted a complete, simplified example for testing. I have used both the Intel compiler and VC++ 2008. I have set /Zp16 for aligned data and enabled SSE2 vectorization. The times were taken with the 'tbb' libraries; I still need to look more carefully at how to use Eigen's bench timer. In the end, 'EIGEN_METHOD' still performs worse than 'CLASSIC_METHOD'…
Regards, Andrés |
Registered Member
|
I am seeing a tiny performance impact, too. With 64 bit builds it vanishes though.
Here is my even more simplified test code:
For 64bit builds, the assembly of the classic and modern methods is identical (when I remove the transposition). When the transposition is present, Eigen produces even faster code than the hand-written one, though I did not really investigate further.
In 32bit builds there seem to be some issues with MSVC vectorizing properly. There are some movsd/fstp/fld ops hidden between the SSE stuff. I don't really know why and have no time to dig deeper. Since many users rely on those tiny matrix multiplications, at some point we will definitely try to fix the issue. I filed a small enhancement report so we don't forget to look into it in the future.
Regards, Hauke |
Registered Member
|
Thanks Hauke,
If you don't mind, could you tell me, as a percentage, how much faster your Eigen implementation was compared with the hand-written code? How many cores does your processor have? And you are using vectorization, aren't you?
I will try to move to 64 bits, but if I want to achieve real-time performance, what are the main points to take care of? Summarizing, I would say:
- 64 bits
- To use vectorization
- To work with aligned arrays
- Other compiler settings?
Regards, Andrés |
Registered Member
|
What I am going to say applies to MSVC. First, you don't have to take care of alignment anymore when you are compiling in 64bit mode - and only then. Neither do you need to take care of SSE/vectorization. Alignment and the use of SSE/vectorization are all turned on by default. That said, it does not hurt to enforce alignment in all cases, since then your code should perform nearly equally well in 32bit as well as in 64bit builds.
The number of cores does not matter here, since your tiny product will always be evaluated on a single core and there was no OpenMP/TBB involved.
One of the core features of Eigen is that you can use it for fixed size as well as dynamic size problems while getting almost the same performance as hand-written code. In theory the 'almost' should actually be at least 'exact same performance', but different compilers manage to optimize the code to different degrees, and we have now seen many times that MSVC is not optimal, in particular for 32bit builds. Your hand-written loop is trivial for the compiler to vectorize, and thus in this special case you are really expecting Eigen to produce the absolute optimal SSE code.
If you can reformulate your problems such that you are working on bigger systems, Eigen will most likely outperform any naive hand-written code, since Gael took care of optimal cache utilization. So, going back to your initial problem, where you wanted to apply the 4x4 matrix to a bunch of points and not a single one, Eigen should perform faster than your hand-written loop.
I hope that helps you a bit more...
- Hauke |