Registered Member
|
I am working with a function that is called many times, and depends on efficient computations. I wanted to make sure I was getting the most performance out of Eigen.
The function uses 3x3 matrices to store intermediate results and to build the large matrix Y.
- Is it more efficient to declare the 3x3 matrices as class variables, so they are not re-instantiated every time the function is called, or can the compiler "optimize away" the instantiations?
- Will the compiler exploit the fact that some of the matrices are constant?
- Can the compiler exploit the special properties of Y when multiplying with P_map? (Less than half of the elements of Y are nonzero, and only about half of the nonzero elements change each time the function is called.)
Note: this is only the computation-intensive part of the function, so some of it may not make sense out of context.
Thank you for any help. |
Moderator
|
Since they are fixed-size matrices, they are allocated on the stack, so declaring them as class variables won't help. On the other hand, you can declare them directly as a submatrix of Y:
Likewise, it is faster to rewrite:
as
Unfortunately, the answer is no. For your use case, the best option would be to store Y in a sparse block matrix of fixed-size blocks, but we don't have that in Eigen yet. We do have generic sparse matrices, but in your case I don't think they are worth it (your matrices are too small). Make sure you compile your program with a recent compiler, with -DNDEBUG, and with optimizations (-O2) and SSE2 (-msse2) enabled. Also, since your matrices are 15x15, you could try to define:
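The define itself was lost; given the description that follows, the constant in question was presumably Eigen 2's EIGEN_CACHEFRIENDLY_PRODUCT_THRESHOLD, and the value below is just something to experiment with:

```cpp
// Value to experiment with; the default is 16. (The macro name and its
// exact effect are from Eigen 2 and may differ in later versions.)
#define EIGEN_CACHEFRIENDLY_PRODUCT_THRESHOLD 32
#include <Eigen/Core>
```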
just before including Eigen/Core. This constant controls whether a naive or a cache-friendly product implementation is used (the default is 16). Also:
can be optimized as:
|
Registered Member
|
You mean:
Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list! |
Registered Member
|
Thanks for the suggestions. I have another performance question with a simple example.
I tested this in a loop against a C version using static arrays, and against the tvmet library. The Eigen version always seems much slower, though I did notice an improvement when I forced lazy evaluation. Here are some average times:

for 100,000,000 loops: static: 4 seconds, tvmet: 7 seconds, eigen: 25 seconds
for 1,000,000,000 loops: static: 36 seconds, tvmet: 1 minute 10 seconds, eigen: 4 minutes 9 seconds

Is there something else I could do to increase speed? (All matrices A, B, C.. are Matrix3d.)
|
Moderator
|
Hm, that's strange. Can you post the static and tvmet versions, as well as the surrounding code? I suspect the compiler detects that some parts can be factored out of the loop (or removed entirely), which would make your benchmark meaningless.
Indeed, here you have about 337 floating-point operations per iteration, so 36 s for 1e9 iterations means 9 GFlops => a 9 GHz CPU! (Well, in some cases current CPUs can perform one + and one * per cycle, so let's say this implies a 4.5 GHz CPU.) So, could you also tell us what your CPU and compiler are? Here it takes ~10 s for 1e8 iterations on a 2.66 GHz Core2 (only one core used); I checked the assembly and nothing has been hoisted out of the loop. That means ~3.37 GFlops, which is not bad at all compared to the theoretical peak of 5.32 GFlops. Also, the .lazy() calls are only needed around matrix products; the other expressions are always evaluated lazily. |
Moderator
|
To give you an example, here is a fair benchmark wrapping the key piece of code into a function which is enforced not to be inlined:
http://pastebin.com/m12b8aa87 This way you guarantee the compiler won't optimize too much. |
Registered Member
|
Sorry, I forgot to mention the compiler. Those estimates were with icpc on an Intel i7 processor, using the -O3 flag. I know it isn't a proper benchmark; I just wanted to roughly recreate what my main program is dealing with.
Recently, I tried the tests with g++ and the results were the opposite. The static C version was slowest, while eigen and tvmet got faster. Are there any ways I can improve performance when using icpc? I will post the rest of the tests soon if it will help. |