
Using Map for fast small dynamic matrices: how to?

EamonNerbonne
A while back on the mailing list, the topic of malloc-free dynamic matrices came up. It was said that optimizing for small dynamic matrices was mostly hopeless anyhow, and that a better approach would be to use Map.

How would this be done?

For the sake of argument, let's say you'd have a large number of 3-element vectors - i.e. a bad case for static-sized Vector3d anyhow. How would one go about filling 1000000 such vectors and then doing some math on them (say, summing their projections under a 3x3 MatrixXd)?

I'm thinking you'd need something along the line of
Code: Select all
typedef Eigen::Map<Eigen::VectorXd, Eigen::Aligned> MappedVectorXd;
//....
int count = 1000000;
int dims = 3;
// round each vector's byte footprint up to a multiple of 16
int padded_size = (dims * (int)sizeof(double) + (16 - 1)) & -16;
int dSize = padded_size / sizeof(double); // blech
double* storage = (double*)ei_aligned_malloc(padded_size * count);
// access the n-th vector:
MappedVectorXd vecN(storage + dSize * n, dims);


Is there a better way to do this? Does using Map like this avoid any calls to resize() and the like?

bjacob
There are two things that make small dynamically-sized matrices slow:
- allocating their arrays with malloc() is slow
- since the sizes are not known at compile time, loops can't be unrolled.

Using Map can help you avoid the first problem, essentially by doing the memory management yourself instead of letting Eigen call malloc().

Your program could first allocate a big array once and for all:
Code: Select all
float *array = ei_aligned_new<float>(1000000);


Then your program could know where in that big array there is room to allocate a vector of n floats, and do:
Code: Select all
Map<VectorXf, Aligned> my_vector(ptr, n);


Now my_vector can be used exactly like a VectorXf of size n; the difference is that the line creating it didn't call malloc at all. In fact, with compiler optimizations enabled, that line typically vanishes and has zero cost.
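Putting those two pieces together, here's a minimal end-to-end sketch (assuming the ei_aligned_new/ei_aligned_delete helpers as in this thread's Eigen version, and picking dims as a multiple of the 4-float packet size so every slice stays aligned):
Code: Select all
#include <Eigen/Core>
using namespace Eigen;

int main()
{
    const int count = 1000000;
    const int dims  = 4; // multiple of the packet size, so no padding needed

    // One 16-byte-aligned allocation for all vectors at once.
    float* array = ei_aligned_new<float>(count * dims);

    for (int i = 0; i < count; ++i) {
        // No malloc here: just reinterpret a slice of the big array.
        Map<VectorXf, Aligned> v(array + i * dims, dims);
        v.setRandom();
    }

    ei_aligned_delete(array, count * dims);
}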

I'm sorry, I didn't really get the rest of your post...


EamonNerbonne
Thanks for the answer!

There's another factor that potentially causes a slowdown: the matrix contains size information, and it actually checks that information to see whether a resize is needed when you write to it. If you have thousands of small vectors, the size information dilutes your cache; also, the compiler can't optimize away size checks if each vector can potentially have its own size.

What I'm looking for is a way to store many, many small vectors and matrices of unknown precise size, and to be able to process them without each store checking whether the target has enough space (at least, not when NDEBUG is on).

Basically, I'm looking for an alternative for

Code: Select all
VectorXf* vecs = new VectorXf[10000];
for(int i=0;i<10000;++i) vecs[i] = VectorXf::Random(2);
//...
while(a-long-time)
   for(int i=0;i<10000;++i) vecs[i] = process(vecs[i]);


So, in the above code, initialization isn't ideal because it involves 10000 distinct calls to malloc, and the processing loop isn't ideal because each step accesses a vector that's much larger than it needs to be and non-contiguous in memory (the vector's backing store is separately malloc'd, after all, and the vector itself carries unnecessary information about its size). Finally, on each assignment, vecs[i] checks whether the right-hand expression has the same number of rows (which it always will: the vector size is unknown at compile time but generally 1-16, and for a given run all vectors have the same size).

However, I can't just do...
Code: Select all
float* array = ei_aligned_new<float>(10000*dims);
//...
Map<VectorXf, Aligned> vecs(int i) {return Map<VectorXf, Aligned>(array +dims*i,dims);}
//...
for(int i=0;i<10000;++i) vecs(i) = VectorXf::Random(dims);
//...
while(a-long-time)
   for(int i=0;i<10000;++i) vecs(i) = process(vecs(i));
...because after all there's no guarantee that array+dims*i is actually aligned.

So, is the best alternative to manually compute the required padding, as in the original question? And does using Map like this actually avoid checking for the necessity of resize?
bjacob
EamonNerbonne wrote:There's another factor that potentially causes a slowdown: the matrix contains size information, and it actually checks that information to see whether a resize is needed when you write to it. If you have thousands of small vectors, the size information dilutes your cache; also, the compiler can't optimize away size checks if each vector can potentially have its own size.


True, I didn't think of that.

And while we're at it, here's another potential source of slowdown. As you may know, for dynamic sizes we always try to vectorize. What if the matrix size is not a multiple of the packet size (typically 16 bytes)? Then we vectorize as much as possible and finish the last few scalars separately. For example, a vector of 5 floats will be handled as 1 packet of 4 floats and then 1 last float. This relies of course on some runtime logic. By contrast, for fixed-size objects, we only vectorize at all if the object's size is a multiple of 16 bytes, in which case we align it to a 16-byte boundary and don't need to check anything at runtime.
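To illustrate that runtime logic (just a rough sketch of the strategy, not Eigen's actual code), summing n floats with 4-float packets looks roughly like:
Code: Select all
// Sketch of the dynamic-size strategy: vectorizable prefix + scalar tail.
float sum_floats(const float* data, int n)
{
    float sum4[4] = {0, 0, 0, 0};
    int aligned_end = (n / 4) * 4;           // vectorizable prefix
    for (int i = 0; i < aligned_end; i += 4)
        for (int k = 0; k < 4; ++k)          // stands in for one SIMD add
            sum4[k] += data[i + k];
    float sum = sum4[0] + sum4[1] + sum4[2] + sum4[3];
    for (int i = aligned_end; i < n; ++i)    // scalar tail: the last n % 4
        sum += data[i];
    return sum;
}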

EamonNerbonne wrote:What I'm looking for is a way to store many, many small vectors and matrices of unknown precise size, and to be able to process them without each store checking whether the target has enough space (at least, not when NDEBUG is on).


Notice that if you don't want assignments to check sizes, then in particular you don't want assignment to automatically resize the destination. Thus, the feature that you want is incompatible with Eigen's default semantics.

Now in Eigen (dev branch), we have an EIGEN_NO_AUTOMATIC_RESIZING option (see DenseStorageBase.h, in _resize_to_match) that disables this automatic resizing, except that when assigning to a zero-sized matrix, it will still resize it.

You could implement what you want by doing something similar to this but going even further: never resize, not even if the matrix had size 0. Only then could you get rid of the size check. The key is that you want _resize_to_match to be a NOP.

I don't think we want this in Eigen itself. Among other reasons, even if you did this, you would still be paying the other costs mentioned above (runtime alignment checks). But feel free to tweak your Eigen :)
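For reference, using that dev-branch option would look like this (assuming, as with any such option, that it's defined before any Eigen header is included):
Code: Select all
// must come before any Eigen include to take effect
#define EIGEN_NO_AUTOMATIC_RESIZING
#include <Eigen/Core>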

EamonNerbonne wrote:Basically, I'm looking for an alternative for

Code: Select all
VectorXf* vecs = new VectorXf[10000];
for(int i=0;i<10000;++i) vecs[i] = VectorXf::Random(2);
//...
while(a-long-time)
   for(int i=0;i<10000;++i) vecs[i] = process(vecs[i]);


So, in the above code, initialization isn't ideal because it involves 10000 distinct calls to malloc, and the processing loop isn't ideal because each step accesses a vector that's much larger than it needs to be and non-contiguous in memory (the vector's backing store is separately malloc'd, after all, and the vector itself carries unnecessary information about its size). Finally, on each assignment, vecs[i] checks whether the right-hand expression has the same number of rows (which it always will: the vector size is unknown at compile time but generally 1-16, and for a given run all vectors have the same size).


If for a given run all vectors have the same size, then you have a good solution: just allocate one matrix of size 10000*vectorsize! Check that you pick the right storage order for what you're doing.

EDIT: this won't guarantee that each column (each vector) is aligned.
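A sketch of that suggestion (the doubling is just a stand-in for whatever per-vector work process() would do); column-major is Eigen's default, so each vector is one contiguous column:
Code: Select all
#include <Eigen/Core>
using namespace Eigen;

int main()
{
    const int dims = 3, count = 10000;

    // One allocation for everything; column i is the i-th "vector".
    MatrixXf vecs = MatrixXf::Random(dims, count);

    for (int i = 0; i < count; ++i)
        vecs.col(i) = 2.0f * vecs.col(i); // stand-in for process()

    // Caveat from the EDIT above: with dims == 3 floats (12 bytes),
    // the individual columns are not all 16-byte aligned.
}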

EamonNerbonne wrote:However, I can't just do...
Code: Select all
float* array = ei_aligned_new<float>(10000*dims);
//...
Map<VectorXf, Aligned> vecs(int i) {return Map<VectorXf, Aligned>(array +dims*i,dims);}
//...
for(int i=0;i<10000;++i) vecs(i) = VectorXf::Random(dims);
//...
while(a-long-time)
   for(int i=0;i<10000;++i) vecs(i) = process(vecs(i));
...because after all there's no guarantee that array+dims*i is actually aligned.

So, is the best alternative to manually compute the required padding, as in the original question? And does using Map like this actually avoid checking for the necessity of resize?


Ah yes, indeed it is a good alternative to manually compute the required padding. And then, yes, provided that you use NDEBUG, you won't have any size checks, because expressions like Map are not resizable (only plain matrices are resizable).

Finally, consider the possibility of using the max-size-at-compile-time, if you really know it's not more than 16.

Code: Select all
typedef Matrix<float, Dynamic, 1, 0, 16, 1> VectorOfSizeAtMost16;


Such vectors are guaranteed to be aligned and don't cause mallocs. But... yes, they do have runtime checks.
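For example (the sizes here are just illustration):
Code: Select all
#include <Eigen/Core>
using namespace Eigen;

typedef Matrix<float, Dynamic, 1, 0, 16, 1> VectorOfSizeAtMost16;

int main()
{
    VectorOfSizeAtMost16 v(5); // no malloc: the 16 floats live inside the object
    v.setRandom();
    v.resize(7);               // still no malloc, as long as the size is <= 16
}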


EamonNerbonne
(quotes not in-order)
bjacob wrote:Finally, consider the possibility of using the max-size-at-compile-time, if you really know it's not more than 16.
Code: Select all
typedef Matrix<float, Dynamic, 1, 0, 16, 1> VectorOfSizeAtMost16;
Such vectors are guaranteed to be aligned and don't cause mallocs. But... yes, they do have runtime checks.
This kind of limited-size vector isn't really an option if you have lots of vectors, however; after all, you're basically just adding tons of padding. Adding a bit of padding at runtime is OK, but if you're dealing with only a few dimensions, it's not so great to pad up to whatever maximal value I happen to think of. It also degrades poorly: with Map, if someone uses a too-large vector size, I just need more memory and my code isn't necessarily tuned optimally; with a hard maximum size, the code breaks (which you can check for, but the old code still won't work).

bjacob wrote:
EamonNerbonne wrote:There's another factor that potentially causes a slowdown: the matrix contains size information, and it actually checks that information to see whether a resize is needed when you write to it. If you have thousands of small vectors, the size information dilutes your cache; also, the compiler can't optimize away size checks if each vector can potentially have its own size.
True, I didn't think of that.

And while we're at it, here's another potential source of slowdown. As you may know, for dynamic sizes we always try to vectorize. What if the matrix size is not a multiple of the packet size (typically 16 bytes)? Then we vectorize as much as possible and finish the last few scalars separately. For example, a vector of 5 floats will be handled as 1 packet of 4 floats and then 1 last float. This relies of course on some runtime logic. By contrast, for fixed-size objects, we only vectorize at all if the object's size is a multiple of 16 bytes, in which case we align it to a 16-byte boundary and don't need to check anything at runtime.

Right, that's the price you pay for flexibility. Incidentally, why can't fixed-size vectors use the same logic? I mean, if you're working with fixed 7-dimensional vectors, why can't the first 6 (if doubles) or 4 (if floats) be vectorized as usual and the rest looped? In any case, that's a tangent...

bjacob wrote:Ah yes, indeed it is a good alternative to manually compute the required padding. And then, yes, provided that you use NDEBUG, you won't have any size checks, because expressions like Map are not resizable (only plain matrices are resizable).

I guess that confirms my fears. Map's fine, but it's not very attractive to use. Your (unquoted) suggestion to just use a matrix and interpret each column as a vector indeed isn't aligned (as you said), and in any case it's not really a general solution, since it won't work for small matrices (say, 2x2 matrices).

I did some really quick testing with Map; unfortunately, the results are confusing. Sometimes it's faster and sometimes slower than plain VectorXd (the test is written to ensure VectorXd only allocates once; otherwise it loses dramatically, of course). That's weird, but probably due to odd interactions with GCC's and MSVC's optimizers; that kind of thing simply matters at small sizes...

Right now, my solution is to use the multi-column matrix you suggested as storage for single-column vectors, and then to copy the vector currently being processed into a normal (but reused) VectorXd that *is* aligned. That works, but it's kind of messy, and of course the various processing steps and intermediates all involve the unnecessary size calculations, so performance could almost certainly be better.

Anyhow, it's a pretty minor speed-bump - I'll experiment and if I find anything really weird I'll post again. Thanks for the clarification and ideas, anyhow!
bjacob
I have another idea for you: if you really want the speed of fixed-size and you don't care too much about compilation times and about the size of your executable, you can simply compile code specialized for each fixed size from 2 to 16, and have a function that dispatches at runtime between these paths.

Your compile-time-specific paths could share the same source code by templating it on the SizeAtCompileTime. Your dispatcher code is where you instantiate this template for all the values that you want.
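A sketch of that dispatcher pattern (process_all_fixed and the doubling are placeholders, not Eigen API):
Code: Select all
#include <cassert>
#include <Eigen/Core>
using namespace Eigen;

// Shared source, templated on the compile-time size so Eigen can
// unroll and avoid all runtime size logic inside the loop.
template <int N>
void process_all_fixed(float* data, int count)
{
    for (int i = 0; i < count; ++i) {
        Map<Matrix<float, N, 1> > v(data + N * i);
        v = 2.0f * v; // placeholder for the real per-vector work
    }
}

// Runtime dispatcher: one instantiation per supported size.
void process_all(float* data, int count, int dims)
{
    switch (dims) {
        case 2:  process_all_fixed<2>(data, count);  break;
        case 3:  process_all_fixed<3>(data, count);  break;
        case 4:  process_all_fixed<4>(data, count);  break;
        // ... one case per size, up to 16
        default: assert(!"unsupported vector size");
    }
}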

Incidentally, why can't fixed-size vectors use the same logic? I mean, if you're working with fixed 7-dimensional vectors, why can't the first 6 (if doubles) or 4 (if floats) be vectorized as usual and the rest looped? In any case, that's a tangent...


Because, in order for this to work without any runtime check, we would need to align that to a 16-byte boundary, which would round up the sizeof of that object to the next multiple of 16 bytes, which would break our design principle that fixed-size objects are zero-overhead (in CPU and in memory).


EamonNerbonne
bjacob wrote:Because, in order for this to work without any runtime check, we would need to align that to a 16-byte boundary, which would round up the sizeof of that object to the next multiple of 16 bytes, which would break our design principle that fixed-size objects are zero-overhead (in CPU and in memory).

No, I mean like this: let's say you've got 5 doubles, i.e. 40 bytes. That could be templated to act like 4+1 doubles, right? Not like 6-1 doubles, if you get what I mean. I.e. do what dynamic-sized vectors do: vectorize the easy stuff first and then just loop to finish the stragglers.
EamonNerbonne
bjacob wrote:I have another idea for you: if you really want the speed of fixed-size and you don't care too much about compilation times and about the size of your executable, you can simply compile code specialized for each fixed size from 2 to 16, and have a function that dispatches at runtime between these paths.
Yeah, I thought of that, and I tried it using vectors with a maximum size (still dynamic, so that vectorization of odd sizes works), but it's a bit tricky, and it started generating truly huge executables in GCC 4.5 (no idea why; 4.4 was still sane), didn't really improve things that much, and it complicates the rest of the app too. I might do that at a more fine-grained level (I originally templated pretty much the whole algorithm and if-then-else'd to choose the right template instance), but that'd take some more thought. Part of the problem is that maximum-size vectors are at heart still dynamic matrices, which means they still track their size (even though it never actually changes). Fixed size would be better, but then you run into vectorization problems, and that probably outweighs any gains from not tracking sizes.

Honestly, I think I'm starting to get close to what's possible. 'course, I've thought that before and found another 30% or so since :-).
bjacob
EamonNerbonne wrote:
bjacob wrote:Because, in order for this to work without any runtime check, we would need to align that to a 16-byte boundary, which would round up the sizeof of that object to the next multiple of 16 bytes, which would break our design principle that fixed-size objects are zero-overhead (in CPU and in memory).

No, I mean like this: let's say you've got 5 doubles, i.e. 40 bytes. That could be templated to act like 4+1 doubles, right? Not like 6-1 doubles, if you get what I mean. I.e. do what dynamic-sized vectors do: vectorize the easy stuff first and then just loop to finish the stragglers.


Let me re-explain. For fixed sizes, we want to be able to work without any extra runtime logic, so we stay zero-overhead. But SIMD will only be fast if our packets are 16-byte aligned. If we have a vector of 5 doubles, and we are on a platform where doubles are aligned to an 8-byte boundary, then there are 2 cases:
Code: Select all
(unaligned) (aligned packet) (aligned packet)

and
Code: Select all
(aligned packet) (aligned packet) (unaligned)


Now this was the "good case". On other platforms, such as 32-bit x86 Linux, doubles may not actually even be 8-byte aligned, so it may be that no packet is aligned at all!

Anyway, even in that "good case" (when doubles are 8-byte aligned), we need a runtime check to determine which case we are in, and then we are dealing with runtime offsets. All that is overhead, which we don't want: it is a basic Eigen design principle that fixed-size objects come with zero overhead. The only way to avoid that overhead is to explicitly align our vector of 5 doubles to a 16-byte boundary, so we are guaranteed to be in this case:
Code: Select all
(aligned packet) (aligned packet) (unaligned)

but if we do that, then the compiler will give our 5-double vector type a sizeof() of 48 instead of 40. This is so that, when you create an array of such vectors, they all have the requested alignment. Thus we have inflated the sizeof() of our vector types, which is something we don't want to do: we want fixed-size types to remain absolutely zero-overhead.

In this case it might not sound too bad to go from 40 to 48, but think of the case of a vector of 5 floats: it would go from 20 to 32 bytes, a 60% increase.
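A quick way to see this sizeof inflation (using C++11 alignas purely for illustration; the values hold on typical platforms):
Code: Select all
#include <cassert>

struct Plain5d   { double d[5]; };             // 5 * 8 = 40 bytes
struct Aligned5d { alignas(16) double d[5]; }; // sizeof must be a multiple
                                               // of the alignment: 48 bytes
int main()
{
    assert(sizeof(Plain5d)   == 40);
    assert(sizeof(Aligned5d) == 48);
}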

We might still add that as a non-default option in the future; it just isn't something we want to do by default when people ask for plain 5-vectors.


bjacob
EamonNerbonne wrote:
bjacob wrote:I have another idea for you: if you really want the speed of fixed-size and you don't care too much about compilation times and about the size of your executable, you can simply compile code specialized for each fixed size from 2 to 16, and have a function that dispatches at runtime between these paths.
Yeah, I thought of that, and I tried it using vectors with a maximum size (still dynamic, so that vectorization of odd sizes works), but it's a bit tricky, and it started generating truly huge executables in GCC 4.5 (no idea why; 4.4 was still sane)


Don't let that sound mysterious. First of all, GCC 4.5 is still 3 months away from release. Also, since you're using GCC, your friend is 'nm':
Code: Select all
nm -C yourprogram | grep Eigen

will dump a list of Eigen symbols.

Also, when comparing executable sizes, always do it on executables compiled with -O2 -g0, and make sure they're stripped:
Code: Select all
strip yourprogram


didn't really improve things that much


If you did it correctly and it doesn't improve things much, that is pretty much the end of the story... indeed, it is definitely the most efficient solution you can get, provided that the functions being dispatched to are expensive enough to make the overhead of the dispatcher negligible, and that there's no issue with flushing the CPU's instruction cache...


Part of the problem is that maximum-size vectors are at heart still dynamic matrices, which means they still track their size (even though it never actually changes)


Here with this dispatcher idea I was suggesting to dispatch to fully fixed-size types.

fixed size would be better, but then you run into vectorization problems, and that probably outweighs any gains from not tracking sizes.


Since you're working with doubles, vectorization is at best a 2x speed increase. By contrast, for very small sizes, loop unrolling can be a 5x-10x speed increase.



