Registered Member
|
I've done some benchmarks to see how Eigen actually performs, since after integrating Eigen into my program, the performance was much lower than I had expected. The benchmark results were mixed.
For very simple operations like the dot product of 2 vectors (sizes tested are 4 and 20), Eigen performs quite well, usually 3-4 times faster than a naive implementation using a for loop. In my program, there are several big arrays whose sizes are multiples of 4 or 20. The dot product is applied to segments (of size 4 or 20) of those arrays. For example, if I have 2 big Eigen vectors ei_a and ei_b, then the code would look like this:
I have benchmarked the code and it is only 20% faster than a naive implementation, which is very disappointing to me, given the simple benchmark above where Eigen ran 3-4 times faster than a naive implementation on a single dot product. I also tried managing the segments myself using pointers, as is done in the naive implementation, and then using Map to create the corresponding Eigen vector for each segment. The result is also only 20% faster than a naive implementation. I am aware that this decrease in performance is caused by malloc. How can I avoid it? Should I split my big arrays into chunks, so that I don't have to call segment or use Map? Thanks in advance
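The snippet from this post is not preserved in the thread, so for reference, here is a minimal sketch of what the "naive implementation" baseline might look like, assuming float data and contiguous segments. The names naive_segment_dot and naive_segmented_dot_sum are made up for illustration and are not taken from the poster's program.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the thread): dot product of one segment of
// N floats, written as the plain for loop the post calls "naive".
template <std::size_t N>
float naive_segment_dot(const float* a, const float* b) {
    float sum = 0.0f;
    for (std::size_t k = 0; k < N; ++k) {
        sum += a[k] * b[k];
    }
    return sum;
}

// Adds up the per-segment dot products of two big arrays whose length is a
// multiple of N. Each segment is only a pointer offset into the existing
// storage, so nothing is allocated inside the loop.
template <std::size_t N>
float naive_segmented_dot_sum(const std::vector<float>& a,
                              const std::vector<float>& b) {
    assert(a.size() == b.size());
    assert(a.size() % N == 0);
    float total = 0.0f;
    for (std::size_t i = 0; i < a.size() / N; ++i) {
        total += naive_segment_dot<N>(a.data() + i * N, b.data() + i * N);
    }
    return total;
}
```

Calling naive_segmented_dot_sum<20>(a, b) or naive_segmented_dot_sum<4>(a, b) covers the two segment sizes mentioned in the post.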
Registered Member
|
Sounds like you have high expectations.
The reason why the original dot product was fast was probably SIMD vectorization. The reason why the segment() version isn't as fast is probably that you didn't give Eigen any way to know that this segment starts at a 16-byte-aligned location, so it can't know that it can use SIMD there. For large (dynamic-size) segments, it would do runtime checks, but for small fixed-size segments it just gives up on SIMD. So you're getting no SIMD at all there.
To fix that, you can use Eigen3's aligned map feature:
1. Upgrade to Eigen 3 (development branch).
2. Define this typedef (here I assume you're using floats): typedef Matrix<float,20,1> Vector20f;
3. Replace vector.segment<20>(i * 20) by Vector20f::MapAligned(vector.data() + i * 20)
Regarding the Map version being only 20% faster: yep, for the same reason. You need Eigen3's MapAligned here.
Regarding malloc being the cause: I don't think so, nothing in this for loop should be causing a malloc.
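Whether the aligned map is safe to use depends on every segment really starting on a 16-byte boundary. With 4-byte floats, a segment of 20 floats is 80 bytes (5 x 16) and a segment of 4 floats is 16 bytes, so if the base pointer is 16-byte aligned, every segment start is too. A small sketch to check this at run time, in plain C++ with no Eigen calls (the helper names are hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

static_assert(sizeof(float) == 4, "this sketch assumes 4-byte floats");

// True when p sits on a 16-byte boundary, the alignment the reply above
// says Eigen needs to know about before it can use SIMD.
inline bool is_aligned_16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical check: does every segment of seg_len floats inside
// base[0..total) start on a 16-byte boundary?
inline bool all_segments_aligned(const float* base, std::size_t total,
                                 std::size_t seg_len) {
    assert(seg_len > 0 && total % seg_len == 0);
    for (std::size_t i = 0; i < total / seg_len; ++i) {
        if (!is_aligned_16(base + i * seg_len)) {
            return false;
        }
    }
    return true;
}
```

For an alignas(16) buffer this holds for segment lengths 4 and 20, but not for a length such as 5 (20 bytes), where the second segment already falls off the boundary.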
Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list!
Registered Member
|
THANKSSSSSSS ! It now works perfectly !!!!
|
Registered Member
|
Dear bjacob, I have another question concerning the usage of Map. I am actually using Eigen 3, and the code I used for the benchmark looks like this:
This code is only 20% faster than a naive implementation. What is actually wrong with it? I think I am already using the aligned map, am I not?
Registered Member
|
Here the issue is that SSE2 doesn't have a great way to vectorize small (here, 4D) dot products. SSE2 is only good at vectorizing larger (like your size 20 above) dot products.
SSE4 does bring the missing dot product instruction, but we're not yet using it.
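To make the point about small dot products concrete, here is a rough sketch (not Eigen's code) of a 4D dot product written with basic SSE intrinsics. The four multiplications take a single instruction, but adding the four lanes together needs a shuffle/add sequence, and that horizontal step eats much of the gain at size 4. The name dot4 is made up for illustration, and a scalar fallback is included so the sketch also builds on non-x86 targets.

```cpp
#include <cassert>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#else
#define SKETCH_HAS_SSE 0
#endif

// Hypothetical 4D dot product; a and b each point to 4 floats.
inline float dot4(const float* a, const float* b) {
    assert(a != nullptr && b != nullptr);
#if SKETCH_HAS_SSE
    const __m128 va = _mm_loadu_ps(a);           // unaligned loads, to stay safe
    const __m128 vb = _mm_loadu_ps(b);
    const __m128 prod = _mm_mul_ps(va, vb);      // [p0, p1, p2, p3] in one multiply
    // Horizontal sum: swap neighbours and add, then fold the high half down.
    __m128 shuf = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(2, 3, 0, 1)); // [p1, p0, p3, p2]
    __m128 sums = _mm_add_ps(prod, shuf);        // [p0+p1, p0+p1, p2+p3, p2+p3]
    shuf = _mm_movehl_ps(shuf, sums);            // lowest lane now holds p2+p3
    sums = _mm_add_ss(sums, shuf);               // lowest lane: (p0+p1) + (p2+p3)
    return _mm_cvtss_f32(sums);
#else
    // Scalar fallback with the same summation order.
    return (a[0] * b[0] + a[1] * b[1]) + (a[2] * b[2] + a[3] * b[3]);
#endif
}
```

For a size-20 segment the same horizontal step is paid once after five multiply-accumulate steps, which fits the larger speed-up reported for that size.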
Registered Member
|
Yes, that is true, and I did notice the performance difference between 4D and 20D (the SIMD speed-up for 4D is about 2x, whereas for 20D it is 3-4x).
In my test benchmark for 4D, I noticed some very strange behavior: using vectors of dynamic size outperforms using vectors of fixed size, which contradicts what is written in the tutorial?! Here is the code. Dynamic size:
Fixed size:
The dynamic-size code is 2 times faster than the fixed-size code. BUT, if I use vectors of size 20, then the fixed-size code is 1.5 times faster than the dynamic-size code?? Can you give me an explanation? Thanks
Registered Member
|
Hm.
So this is really interesting. What's your exact CPU and compiler version? The fixed-size version is actually probably not vectorized at all, because Eigen decides that for 4D vectors it's not worth it. The dynamic-size version is vectorized, but with runtime checks that slow it down. It seems that you have proven that on your particular CPU, it's already worth vectorizing the 4D case.
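As a loose illustration of the fixed-size versus dynamic-size distinction, in plain C++ rather than Eigen's actual internals: with a compile-time size the compiler sees the trip count and can unroll the loop completely, while with a run-time size the loop bound has to be tested as the code runs. The function names below are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Size known at compile time: N is a constant, so the loop can be fully
// unrolled and no size check remains at run time.
template <std::size_t N>
float dot_fixed(const float* a, const float* b) {
    float sum = 0.0f;
    for (std::size_t k = 0; k < N; ++k) {
        sum += a[k] * b[k];
    }
    return sum;
}

// Size known only at run time: n has to be compared against the loop
// counter on every iteration (or checked up front by a vectorized path).
inline float dot_dynamic(const float* a, const float* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
    float sum = 0.0f;
    for (std::size_t k = 0; k < n; ++k) {
        sum += a[k] * b[k];
    }
    return sum;
}
```

Both functions return the same value for the same input; any difference shows up only in the generated code and the timing.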
Registered Member
|
I am really confused now. Probably it is my CPU that behaves strangely (AMD Phenom(tm) II X3 710). Now, out of nowhere, things seem to be normal again: fixed size is faster than dynamic size (for both 4D and 20D).
I swear I ran the test 10 times before posting here. For now, just ignore my previous post; I will let you know if I can reproduce it.
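Since the numbers changed between runs, one common way to make such micro-benchmarks less jumpy is to repeat the measurement and keep the best time. A minimal sketch, assuming std::chrono::steady_clock is precise enough for the workload; best_seconds is a made-up helper, not part of Eigen:

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Runs work() `repeats` times and returns the shortest wall-clock time in
// seconds. The minimum is less affected by other processes and frequency
// changes than a single run or an average.
template <typename F>
double best_seconds(F&& work, int repeats) {
    assert(repeats > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        const std::chrono::duration<double> elapsed = stop - start;
        best = std::min(best, elapsed.count());
    }
    return best;
}
```

The work passed in should loop over many dot products and store the result somewhere visible, otherwise the compiler may remove the computation being timed.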