This forum has been archived. All content is frozen. Please use KDE Discuss instead.

Segment/block/map expressions very slow !

lamtung
I've done some benchmarks to see how Eigen actually performs, since after integrating Eigen into my program, the performance was much lower than I had expected. The benchmark results were mixed.

For a very simple operation like the dot product of 2 vectors (the sizes tested are 4 and 20), Eigen performs quite well, usually 3-4 times faster than a naive implementation using a for loop.

In my program, there are several big arrays whose sizes are multiples of 4 or 20. The dot product is applied to the segments (size 4 or 20) of those arrays. For example, if I have 2 big Eigen vectors ei_a and ei_b, then the code would look like this:

Code: Select all
   
for (i = 0; i < 20; ++i) {
  result += ei_a.segment<20>(i * 20).dot(ei_b.segment<20>(i * 20));
}


I have benchmarked the code and it is only 20% faster than a naive implementation, which is very disappointing to me, given the simple benchmark above where Eigen ran 3-4 times faster than a naive implementation for a single dot product.
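For reference, the "naive implementation" being compared against could look roughly like the sketch below. The original benchmark code was not posted, so the function names (naiveDot, naiveSegmentedDot) and the double element type are assumptions made for illustration only.

```cpp
#include <cstddef>

// Plain-loop dot product of one segment of `len` elements.
// Hypothetical helper; the thread does not show the actual baseline.
double naiveDot(const double* a, const double* b, std::size_t len) {
    double sum = 0.0;
    for (std::size_t k = 0; k < len; ++k) {
        sum += a[k] * b[k];
    }
    return sum;
}

// Adds up the dot products of `numSegments` consecutive segments,
// each `len` elements long, mirroring the segment<20>(i * 20) loop.
double naiveSegmentedDot(const double* a, const double* b,
                         std::size_t numSegments, std::size_t len) {
    double result = 0.0;
    for (std::size_t i = 0; i < numSegments; ++i) {
        result += naiveDot(a + i * len, b + i * len, len);
    }
    return result;
}
```

With 20 segments of length 20, the call would be naiveSegmentedDot(a, b, 20, 20), where a and b each point to 400 elements.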

I also tried managing the segments myself with pointers, as is done in a naive implementation, and then using Map to create the corresponding Eigen vector for each segment. The result is also only 20% faster than a naive implementation.

I am aware that this decrease in performance is caused by malloc. How can I avoid it? Should I split my big array into chunks so that I don't have to call segment or use Map?

Thanks in advance
bjacob
lamtung wrote:
Code: Select all
   
for (i = 0; i < 20; ++i) {
  result += ei_a.segment<20>(i * 20).dot(ei_b.segment<20>(i * 20));
}


I have benchmarked the code and it is only 20% faster than a naive implementation, which is very disappointing to me, given the simple benchmark above where Eigen ran 3-4 times faster than a naive implementation for a single dot product.


Sounds like you have high expectations :-)

The reason why the original dot product was fast was probably SIMD vectorization.

The reason why the segment() version isn't as fast is probably that you didn't give Eigen any way to know that this segment is starting at a 16-byte-aligned location. So it can't know that it can use SIMD there. For large (dynamic-size) segments, it would do runtime checks, but for small fixed-size segments it just gives up SIMD. So you're getting no SIMD at all there.
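Whether the segments really do start on 16-byte boundaries can be checked by hand. The sketch below is plain C++ with no Eigen involved; the helper names (isAligned16, allSegmentsAligned16) are made up for this illustration. It only verifies the pointer precondition at runtime. It does not change what Eigen can know at compile time, which is the actual problem described above.

```cpp
#include <cstddef>
#include <cstdint>

// True if `p` sits on a 16-byte boundary.
bool isAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// True if every segment of `segLen` floats starts on a 16-byte boundary.
// Only addresses are computed here; no element is read.
bool allSegmentsAligned16(const float* base, std::size_t numSegments,
                          std::size_t segLen) {
    for (std::size_t i = 0; i < numSegments; ++i) {
        if (!isAligned16(base + i * segLen)) {
            return false;
        }
    }
    return true;
}
```

A segment of 20 floats is 80 bytes, which is a multiple of 16, so if the base pointer is 16-byte aligned then every segment start is as well. With a segment length of 5 floats (20 bytes) that would no longer hold.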

To fix that, you can use Eigen3's aligned map feature:

1. upgrade to Eigen 3 (development branch)
2. define this typedef (here I assume you're using floats)
typedef Matrix<float,20,1> Vector20f;
3. Replace
vector.segment<20>(i * 20)
by
Vector20f::MapAligned(vector.data() + i * 20)


I also tried managing the segments myself with pointers, as is done in a naive implementation, and then using Map to create the corresponding Eigen vector for each segment. The result is also only 20% faster than a naive implementation.


Yep, for the same reason. You need Eigen3's MapAligned here.

I am aware that this decrease in performance is caused by malloc.


I don't think so, nothing in this for loop should be causing a malloc.


Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list!
lamtung
THANKSSSSSSS ! It now works perfectly !!!!
lamtung
bjacob wrote:
I also tried managing the segments myself with pointers, as is done in a naive implementation, and then using Map to create the corresponding Eigen vector for each segment. The result is also only 20% faster than a naive implementation.


Yep, for the same reason. You need Eigen3's MapAligned here.


Dear bjacob,

I have another question concerning the usage of Map. I am actually using Eigen 3, and the code I used for the benchmark looks like this:

Code: Select all
    va = Map<Vector4d, Aligned> (a);
    vb = Map<Vector4d, Aligned> (b);
    return va.dot(vb);


This code is only 20% faster than a naive implementation. What is actually wrong with it? I think I am already using an aligned map, aren't I?
bjacob
Here the issue is that SSE2 doesn't have a great way to vectorize small (here, 4D) dot products. SSE2 is only good at vectorizing larger (like your size 20 above) dot products.

SSE4 does bring the missing dot product instruction, but we're not yet using it.
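To see why small sizes gain less, here is a scalar model of what a 2-lane dot product on doubles does (an SSE2 register holds 2 doubles). This is plain C++ written for illustration, not Eigen's actual code, and the name twoLaneDot is made up. It assumes the length n is even.

```cpp
#include <cstddef>

// Scalar model of a 2-lane SIMD dot product on doubles.
// Assumption: n is even. Each loop iteration stands for one packed
// multiply and one packed add on a pair of doubles.
double twoLaneDot(const double* a, const double* b, std::size_t n) {
    double lane0 = 0.0;
    double lane1 = 0.0;
    for (std::size_t k = 0; k + 1 < n; k += 2) {
        lane0 += a[k] * b[k];
        lane1 += a[k + 1] * b[k + 1];
    }
    // Horizontal step: the two lanes must be combined into one scalar.
    // This cost is the same whether n is 4 or 20.
    return lane0 + lane1;
}
```

For n = 4 the loop body runs only twice before the horizontal step, so that fixed cost is a large share of the work. For n = 20 it runs ten times and the same fixed cost is spread over more useful work.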


lamtung
Yes, that is true, and I did notice the performance difference between 4D and 20D (the SIMD speed-up for 4D is about 2x, whereas for 20D it is 3-4x).

In my test benchmark for 4D, I noticed a very strange behavior: using a vector of dynamic size outperforms one of fixed size, which contradicts what is written in the tutorial?!

Here is the code:

Dynamic Size
Code: Select all
double eiDotMapDyn(double* a, double* b) {
    return Map<VectorXd, Aligned> (a, 4).dot(Map<VectorXd, Aligned> (b, 4));
}


Fixed Size
Code: Select all
double eiDotMapFix(double* a, double* b) {
    return Map<Vector4d, Aligned>(a).dot(Map<Vector4d, Aligned>(b));

}


The dynamic-size code is 2 times faster than the fixed-size code. BUT if I use vectors of size 20, then the fixed-size code is 1.5 times faster than the dynamic-size code??

Can you give me an explanation?
Thanks
bjacob
Hm.

So this is really interesting.

What's your exact CPU and compiler version ?

The fixed-size version is actually probably not vectorized at all because Eigen decides that for 4D vectors it's not worth it. The dynamic-size version is vectorized, but with runtime checks that slow it down.

It seems that you have proven that on your particular CPU, it's already worth vectorizing the 4D case.


lamtung
I am really confused now; it is probably my CPU that behaves strangely (AMD Phenom(tm) II X3 710). Now, out of nowhere, things seem to be normal again: fixed size is faster than dynamic size (for both 4D and 20D).

I swear I ran the test 10 times before posting here. For now, just ignore my previous post; I will let you know if I can reproduce that.
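Since the timings moved around between runs, one way to make such micro-benchmarks a bit more repeatable is to run several trials and keep the best one. The harness below is a minimal sketch in standard C++11, not the code used for the numbers in this thread; the name bestTimePerCallNs and its parameters are hypothetical.

```cpp
#include <chrono>

// Runs `fn` `reps` times per trial, for `trials` trials, and returns the
// smallest measured time per call in nanoseconds. Taking the minimum
// reduces the influence of one-off disturbances such as scheduling.
// Assumption: `fn` takes no arguments and returns a double.
template <typename F>
double bestTimePerCallNs(F fn, int trials, int reps) {
    typedef std::chrono::steady_clock Clock;
    volatile double sink = 0.0;  // keeps the results from being discarded
    double best = -1.0;
    for (int t = 0; t < trials; ++t) {
        const Clock::time_point start = Clock::now();
        for (int r = 0; r < reps; ++r) {
            sink = sink + fn();
        }
        const std::chrono::duration<double, std::nano> elapsed =
            Clock::now() - start;
        const double perCall = elapsed.count() / reps;
        if (best < 0.0 || perCall < best) {
            best = perCall;
        }
    }
    return best;
}
```

It could be called as bestTimePerCallNs([&] { return eiDotMapFix(a, b); }, 10, 1000000). An optimizing compiler may still simplify a call whose inputs never change, so the numbers should be read with some caution.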

