This forum has been archived. All content is frozen. Please use KDE Discuss instead.

using a Vector4f for lower dimensions

xgrnds (Registered Member, Posts: 7, Karma: 0)
Hi.

I'm mostly doing 3D work, and am in the process of evaluating Eigen. One of my major reasons for wanting to use Eigen is to use the SSE optimisations. I am exclusively interested in the low dimension fixed size types.

According to the documentation, I have to use a Vector4f (or other similarly sized types) to enable the SSE optimisations, and I'm perfectly happy to do this. I have read http://eigen.tuxfamily.org/index.php?ti ... Operations .

However, when using a Vector4f to do 3D work, I'd like to automatically initialize the w component to 0 in the 3 argument constructor, and to be able to call functions like the cross product without issue (even though it's not defined for 4 dimensions...) I am prevented from doing this by THIS_METHOD_IS_ONLY_FOR_VECTORS_OF_A_SPECIFIC_SIZE.
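To make the friction concrete, here is the kind of thing I end up writing today (a minimal illustrative snippet, not my real code):

#include <Eigen/Core>
#include <Eigen/Geometry>

// w has to be written out by hand, and cross() forces a round-trip
// through Vector3f copies, losing the SSE-friendly 4-float storage.
int main()
{
    Eigen::Vector4f a(1.f, 2.f, 3.f, 0.f); // w = 0 spelled out manually
    Eigen::Vector4f b(4.f, 5.f, 6.f, 0.f);

    Eigen::Vector3f c = Eigen::Vector3f(a.x(), a.y(), a.z())
                            .cross(Eigen::Vector3f(b.x(), b.y(), b.z()));
    (void)c;
    return 0;
}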

Coming at this from a different direction (which I am also perfectly happy to do): enable SSE for 3-dimensional vectors and have the 4th component ignored / set to 0.

This seems like a specific case of a general optimisation where all fixed-size matrix storage sizes are rounded up to the next multiple of 128 bits, so that SSE instructions can be used for all components?

One point which skews the argument in favor of using Vec4 instead of rounding up the Vec3 storage is that the Eigen Transform class uses an (N+1)-dimensional matrix to operate on N-dimensional vectors, due to the requirements of translation. By using 4-dimensional vectors throughout, the complexity of the Transform system may be reduced (and it may well operate faster with pre-aligned, SSE-ready inputs).

This assumes that end users are aware that vector operations (dot, cross) require w to be set to 0 and point operations (translation) have w=1.

Thanks for your time.

Last edited by xgrnds on Mon Apr 13, 2009 5:14 pm, edited 1 time in total.
bjacob (Registered Member, Posts: 658, Karma: 3)
xgrnds wrote:I'm mostly doing 3D work, and am in the process of evaluating Eigen. One of my major reasons for wanting to use Eigen is to use the SSE optimisations. I am exclusively interested in the low dimension fixed size types.

According to the documentation, I have to use a Vector4f (or other similarly sized types) to enable the SSE optimisations, and I'm perfectly happy to do this. I have read http://eigen.tuxfamily.org/index.php?ti ... Operations .

However, when using a Vector4f to do 3D work, I'd like to automatically initialize the w component to 0 in the 3 argument constructor, and to be able to call functions like the cross product without issue (even though it's not defined for 4 dimensions...) I am prevented from doing this by THIS_METHOD_IS_ONLY_FOR_VECTORS_OF_A_SPECIFIC_SIZE.


Yes, this has been discussed recently on this forum and on the mailing list. We definitely do have plans to make it more convenient, but it's true that at the moment it isn't.

The basic idea is that Eigen already makes the distinction between the actual size and the storage size, so you could do:

Matrix<float, 3, 1, 0, 4, 1>

where 0 means default options, and the final 4, 1 is the storage size.

However, Eigen currently doesn't vectorize operations on such vectors, because its vectorization logic is based purely on the actual size. We can envision changes here, but for now we have a lot on our plate already.
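A closely related variant that does work today is a Dynamic actual size with a fixed maximum storage size (an illustrative sketch; the typedef name is made up):

#include <Eigen/Core>

// Actual size chosen at runtime, storage a fixed block of 4 floats on
// the stack (no heap allocation).
// Template parameters: <Scalar, Rows, Cols, Options, MaxRows, MaxCols>.
typedef Eigen::Matrix<float, Eigen::Dynamic, 1, 0, 4, 1> UpTo4f;

int main()
{
    UpTo4f v(3);           // actual size 3, storage size 4
    v << 1.f, 2.f, 3.f;
    return (int)v.size();  // 3
}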

xgrnds wrote:Coming at this from a different direction (which I am also perfectly happy to do): enable SSE for 3-dimensional vectors and have the 4th component ignored / set to 0.


This won't work because the SSE "load" instruction wants 128 bits so you'd be doing an invalid read. So really, there must be a 4th component allocated.
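A minimal intrinsics sketch of that constraint (illustrative only, not Eigen's internals):

#include <xmmintrin.h>

// _mm_load_ps always reads a full 128 bits (four floats) from a
// 16-byte aligned address, so the 4th float must exist in memory.
int main()
{
    alignas(16) float xyzw[4] = {1.f, 2.f, 3.f, 0.f}; // padded to 4
    __m128 v = _mm_load_ps(xyzw); // one 16-byte read
    (void)v;
    return 0;
}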

xgrnds wrote:This seems like a specific case of a general optimisation where all fixed-size matrix storage sizes are rounded up to the next multiple of 128 bits, so that SSE instructions can be used for all components?


Yes, indeed. However, because of the wasted memory, this will never be the default. Also notice that this really only applies to vector operations. A matrix-matrix product, for example, is much harder to optimize: a 3x3 matrix would need to be transformed into a 4x4 matrix for this to work, so there is more wasted memory and the benefit is lower.

xgrnds wrote:One point which skews the argument in favor of using Vec4 instead of rounding up the Vec3 storage is that the Eigen Transform class uses an (N+1)-dimensional matrix to operate on N-dimensional vectors, due to the requirements of translation. By using 4-dimensional vectors throughout, the complexity of the Transform system may be reduced (and it may well operate faster with pre-aligned, SSE-ready inputs).


Indeed, some particular operations here become a bit simpler.

xgrnds wrote:This assumes that end users are aware that vector operations (dot, cross) require w to be set to 0 and point operations (translation) have w=1.


This is another reason why this can't ever become the default. We have to find an API for that while preserving the current, safer behavior as the default.
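To illustrate the convention with plain 4x4 arithmetic (an illustrative sketch, not a proposed API):

#include <Eigen/Core>

// With w = 0 the translation column has no effect (vector semantics);
// with w = 1 it is applied (point semantics).
int main()
{
    Eigen::Matrix4f T = Eigen::Matrix4f::Identity();
    T.block<3, 1>(0, 3) << 1.f, 2.f, 3.f;     // translation part

    Eigen::Vector4f dir(0.f, 0.f, 1.f, 0.f);  // direction: w = 0
    Eigen::Vector4f pt (0.f, 0.f, 1.f, 1.f);  // point:     w = 1

    Eigen::Vector4f d2 = T * dir;  // still (0, 0, 1, 0)
    Eigen::Vector4f p2 = T * pt;   // now (1, 2, 4, 1)
    (void)d2; (void)p2;
    return 0;
}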


Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list!
xgrnds
Registered Member
Posts
7
Karma
0
Thanks for your response - I'm glad this general area is being considered even if it's not currently being actively worked on. I'm going to start a second thread in a moment discussing the current API and error-reporting mechanism, which I have been having some trouble with.

xgrnds wrote:This seems like a specific case of a general optimisation where all fixed-size matrix storage sizes are rounded up to the next multiple of 128 bits, so that SSE instructions can be used for all components?

bjacob wrote:Yes, indeed. However, because of the wasted memory, this will never be the default. Also notice that this really only applies to vector operations. A matrix-matrix product, for example, is much harder to optimize: a 3x3 matrix would need to be transformed into a 4x4 matrix for this to work, so there is more wasted memory and the benefit is lower.


I had assumed that initializing the padding components to zero, and only extending the padding across each row, would be fine for dealing with matrices. My knowledge of the more advanced matrix algorithms is probably letting me down here. I'd be interested in any pointers to algorithms that would be broken by zeros in the padding area being involved in the calculation but ultimately discarded.

I'm quite surprised to hear that memory wastage is a concern. The percentage wastage is fairly high with small matrices but it drops off sharply, and is ultimately amortized by the matrix data itself.

On a similar topic, I was assuming that, purely in terms of memory alignment, padding to take the 3x3 matrix example into a 4x3 would be necessary to maintain SSE alignment for each row? I was under the impression that SSE2 instructions would crash if their data was misaligned. Perhaps your cache-aware data structures are doing something clever here?

Thanks!
bjacob (Registered Member, Posts: 658, Karma: 3)
bjacob wrote:Yes, indeed. However, because of the wasted memory, this will never be the default. Also notice that this really only applies to vector operations. A matrix-matrix product, for example, is much harder to optimize: a 3x3 matrix would need to be transformed into a 4x4 matrix for this to work, so there is more wasted memory and the benefit is lower.


xgrnds wrote:I had assumed that initializing the padding components to zero, and only extending the padding across each row, would be fine for dealing with matrices.


OK, right. So for 3x3 matrices, one only needs to store a 3x4 or 4x3 matrix, not a 4x4 one.

xgrnds wrote:I'm quite surprised to hear that memory wastage is a concern. The percentage wastage is fairly high with small matrices but it drops off sharply, and is ultimately amortized by the matrix data itself.


But that's not the main issue: in Eigen, for fixed-size types, we want no overhead at all: no speed overhead, no memory overhead. At least by default.

There are many reasons for that. For example: we want Eigen, by default, to produce binary-compatible code regardless of whether vectorization is enabled. This is really important for binary libraries exposing Eigen types. So if we added padding by default, we'd have to add it also when there is no vectorization -- even when the padding is useless and actually slows things down!

Also, the argument that the waste drops sharply for bigger matrices is irrelevant, because this whole issue only exists for fixed-size matrices, which are assumed to be very small.

For large matrices, the user is expected to use dynamic-sized objects like MatrixXf. Then Eigen allows itself some runtime logic at the cost of a little speed overhead, and is then able to vectorize operations on matrices of any size (dealing with unaligned boundaries with a bit of non-vectorized code).

xgrnds wrote:On a similar topic, I was assuming that, purely in terms of memory alignment, padding to take the 3x3 matrix example into a 4x3 would be necessary to maintain SSE alignment for each row? I was under the impression that SSE2 instructions would crash if their data was misaligned. Perhaps your cache-aware data structures are doing something clever here?


No, you are perfectly right. For best performance, data must be aligned. Dealing with unaligned data is possible with separate SSE instructions, but then write operations are slow (and read operations are also a bit slower).

This is one of the issues I had in mind when I told you that vectorizing 3x3 matrix products would require making them 4x3 or 3x4.

It's not the only one: with a packed 3x3 matrix there's also the issue that any contiguous packet of 4 entries straddles 2 rows or columns.
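A layout sketch of that overlap (illustrative):

// Packed column-major 3x3: nine contiguous floats.
//   index:   0    1    2    3    4    5    6    7    8
//   entry:  a00  a10  a20  a01  a11  a21  a02  a12  a22
// Column 1 starts at index 3 (byte offset 12), so loading it as a packet
// is an unaligned access even when the array itself is aligned, and any
// 4-float packet straddles two columns (indices 0..3 cover all of
// column 0 plus a01).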


Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list!
kajala (Registered Member, Posts: 11, Karma: 0)
Hi,

If you're doing mainly 3D graphics, check out also slmath library at
http://sourceforge.net/projects/slmath

It's a C++ math lib for 3D-graphics programming, with GPU shading language style classes such as vec2, vec3, vec4, mat4, etc., plus lots of helper functionality. The GLSL specs are followed closely. Robust implementation: all results are checked in debug builds, plus there are many unit tests. The code is very portable; besides Visual Studio 2008, I know people have used the lib on iPhone, Linux, N-Gage and Symbian S60. The code is licensed under BSD/MIT, so it's as free/liberal as it gets.

The library doesn't have the SSE optimizations though.


Br,
Jani

xgrnds wrote:Hi.

I'm mostly doing 3D work, and am in the process of evaluating Eigen. One of my major reasons for wanting to use Eigen is to use the SSE optimisations. I am exclusively interested in the low dimension fixed size types.

Thanks for your time.
kajala (Registered Member, Posts: 11, Karma: 0)
FYI:
I just added SIMD/SSE2 support (VS2008) to the slmath lib:
https://sourceforge.net/projects/slmath

Not comprehensive (yet), but I'd guess the most critical functions are in SSE2 now. If you're doing mostly hw-accelerated 3D graphics, I'd guess matrix*matrix multiplication is the most performance-sensitive function (at least it is for me). The SIMD version of the 4x4 matrix*matrix multiplication is about 5x faster.
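For reference, the SIMD version of a 4x4 column-major product typically follows the classic pattern below (a generic sketch of the technique, not the actual slmath code):

#include <xmmintrin.h>

// 4x4 column-major matrix product r = a * b; all pointers reference
// 16 floats and must be 16-byte aligned. Each column of r is a linear
// combination of a's columns, weighted by the matching column of b.
void mat4_mul(const float* a, const float* b, float* r)
{
    __m128 a0 = _mm_load_ps(a + 0);   // column 0 of a
    __m128 a1 = _mm_load_ps(a + 4);   // column 1
    __m128 a2 = _mm_load_ps(a + 8);   // column 2
    __m128 a3 = _mm_load_ps(a + 12);  // column 3
    for (int c = 0; c < 4; ++c)
    {
        __m128 b0 = _mm_set1_ps(b[4 * c + 0]); // broadcast b(0, c)
        __m128 b1 = _mm_set1_ps(b[4 * c + 1]); // broadcast b(1, c)
        __m128 b2 = _mm_set1_ps(b[4 * c + 2]); // broadcast b(2, c)
        __m128 b3 = _mm_set1_ps(b[4 * c + 3]); // broadcast b(3, c)
        __m128 col = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(a0, b0), _mm_mul_ps(a1, b1)),
            _mm_add_ps(_mm_mul_ps(a2, b2), _mm_mul_ps(a3, b3)));
        _mm_store_ps(r + 4 * c, col); // column c of r
    }
}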

Maybe the code could be useful in Eigen as well, but I'm not yet comfortable enough with the Eigen library to help with the VS2008 SIMD support at this point. Maybe later. I'm planning to use Eigen for some least-squares stuff, where I need n-vectors and matrices.

Some performance tests:
(Vista 64bit, Intel Core 2 Duo E8400 @ 3GHz)

--------------------------------------------------------------
Y += alpha * X (4-vector fixed)
--------------------------------------------------------------
ops (slmath, SIMD) = 9.96341e+008 (res=6.71089e+007)
ops (Eigen) = 7.46835e+008 (res=6.71089e+007)
ops (slmath) = 7.46671e+008 (res=6.71089e+007)
--------------------------------------------------------------
matrix * matrix (4x4 fixed)
--------------------------------------------------------------
ops (slmath, SIMD) = 7.10329e+007 (res=0.215397)
ops (slmath) = 2.09706e+007 (res=0.215397)
ops (Eigen) = 1.49464e+007 (res=0.215397)

The difference between the non-SIMD matrix*matrix versions puzzles me a bit, since 40% is quite a big difference. Maybe the unrolling in slmath helps compared to Eigen, but I'm not sure. The SIMD version is still clearly the fastest, of course.


Br,
Jani

xgrnds wrote:Hi.

I'm mostly doing 3D work


Last edited by kajala on Mon May 04, 2009 8:41 am, edited 1 time in total.
ggael (Moderator, Posts: 3447, Karma: 19)
kajala wrote:FYI:
I just added SIMD/SSE2 support (VS2008) to the slmath lib:
https://sourceforge.net/projects/slmath

Not comprehensive (yet), but I'd guess the most critical functions are in SSE2 now. If you're doing mostly hw-accelerated 3D graphics, I'd guess matrix*matrix multiplication is the most performance-sensitive function (at least it is for me). The SIMD version of the 4x4 matrix*matrix multiplication is about 5x faster.

Maybe the code could be useful in Eigen as well, but I'm not yet comfortable enough with the Eigen library to help with the VS2008 SIMD support at this point. Maybe later. I'm planning to use Eigen for some least-squares stuff, where I need n-vectors and matrices.

Some performance tests:
(Vista 64bit, Intel Core 2 Duo E8400 @ 3GHz)

--------------------------------------------------------------
Y += alpha * X (4-vector fixed)
--------------------------------------------------------------
ops (slmath, SIMD) = 9.96341e+008 (res=6.71089e+007)
ops (Eigen) = 7.46835e+008 (res=6.71089e+007)
ops (slmath) = 7.46671e+008 (res=6.71089e+007)
--------------------------------------------------------------
matrix * matrix (4x4 fixed)
--------------------------------------------------------------
ops (slmath, SIMD) = 7.10329e+007 (res=0.215397)
ops (slmath) = 2.09706e+007 (res=0.215397)
ops (Eigen) = 1.49464e+007 (res=0.215397)

The difference between the non-SIMD matrix*matrix versions puzzles me a bit, since 40% is quite a big difference. Maybe the unrolling in slmath helps compared to Eigen, but I'm not sure. The SIMD version is still clearly the fastest, of course.


Br,
Jani

xgrnds wrote:Hi.

I'm mostly doing 3D work




Actually, the 4x4 matrix-matrix product is automatically unrolled and vectorized by Eigen, just like Y += alpha*X. Here with GCC, the generated ASM is as good as a hand-written one (like the one you wrote). So I'm a bit puzzled by your results. Perhaps you forgot to enable SSE when compiling the Eigen version? Or maybe MSVC 2008 is just very bad at generating optimized code.
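For reference, the usual ways to enable SSE2 (generic command lines, not the exact setup used for these benchmarks):

g++ -O2 -msse2 bench.cpp          (GCC)
cl /O2 /arch:SSE2 bench.cpp       (MSVC, 32-bit; x64 targets always have SSE2)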

BTW, such discussions are probably better suited for the mailing-list.

To get back to the initial topic of this thread, I recently committed an AlignedVector3 class which enables vectorization for 3D vectors (internally it is stored as a 4D vector).
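A minimal usage sketch (assuming the class sits in Eigen's unsupported module, with the semantics described above):

#include <unsupported/Eigen/AlignedVector3>

// Behaves like a Vector3f, but is stored as a vectorizable block of 4
// floats internally, so 3D arithmetic can use SSE.
int main()
{
    Eigen::AlignedVector3<float> a(1.f, 2.f, 3.f);
    Eigen::AlignedVector3<float> b(4.f, 5.f, 6.f);
    Eigen::AlignedVector3<float> c = a.cross(b); // cross product works here
    (void)c;
    return 0;
}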
kajala (Registered Member, Posts: 11, Karma: 0)
Ok, now that I've looked at it in more detail, the SSE2 MSVC platform detection doesn't work in Eigen: EIGEN_SSE2_ON_MSVC_2008_OR_LATER does not get automatically defined, because when compiling for x64, _M_IX86_FP is not defined (since SSE2 is supported by all x64 CPUs, MSVC doesn't define it there).

From MSVC docs: "/arch is only available when compiling for the x86 platform. This compiler option is not available when compiling for x64 or Itanium."

So I think the check should be in Core:
#ifdef _MSC_VER
  // for _aligned_malloc -- needed regardless of whether vectorization is enabled
  #include <malloc.h>
  #if (_MSC_VER >= 1500) // 2008 or later
    // Usage of defined() inside a #define is undefined by the standard,
    // but since _MSC_VER has already been checked above, a plain #if is safe here:
    #if (defined(_M_IX86_FP) && _M_IX86_FP >= 2) || defined(_M_X64)
      #define EIGEN_SSE2_ON_MSVC_2008_OR_LATER
    #endif
  #endif
#endif

Now I get much better results, but there is still a big difference (2x). Maybe MSVC is bad at generating code, or there is some problem in my test.


Jani


ggael wrote:Actually, the 4x4 matrix-matrix product is automatically unrolled and vectorized by Eigen, just like Y += alpha*X. Here with GCC, the generated ASM is as good as a hand-written one (like the one you wrote). So I'm a bit puzzled by your results. Perhaps you forgot to enable SSE when compiling the Eigen version? Or maybe MSVC 2008 is just very bad at generating optimized code.

BTW, such discussions are probably better suited for the mailing-list.

To get back to the initial topic of this thread, I recently committed an AlignedVector3 class which enables vectorization for 3D vectors (internally it is stored as a 4D vector).

Last edited by kajala on Mon May 04, 2009 10:19 am, edited 1 time in total.
bjacob (Registered Member, Posts: 658, Karma: 3)
Thanks, I committed your fix to both trunk and branch.


Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list!
xgrnds (Registered Member, Posts: 7, Karma: 0)
kajala wrote:If you're doing mainly 3D graphics, check out also slmath library at
http://sourceforge.net/projects/slmath


I hadn't found that library in my searches - thank you. I'll certainly have a go at evaluating it. It's true that I don't need any types above 4 dimensions. My first impression is that there is no support for expression templates, which is a minus point for me at this stage.
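For context, expression templates are what let Eigen fuse a whole right-hand side into a single loop with no temporaries (a generic illustration of the concept):

#include <Eigen/Core>

// The right-hand side is evaluated coefficient-by-coefficient in one
// pass; no intermediate Vector4f is materialized for 2*a, 2*a + b, etc.
int main()
{
    Eigen::Vector4f a(1.f, 2.f, 3.f, 4.f);
    Eigen::Vector4f b(4.f, 3.f, 2.f, 1.f);
    Eigen::Vector4f c(1.f, 1.f, 1.f, 1.f);
    Eigen::Vector4f r = 2.f * a + b - c; // single fused loop
    (void)r;
    return 0;
}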

I'm not continuing to evaluate Eigen. It's almost perfect, but the one issue I do have with it is too big to ignore: errors are reported at the site of the template instantiation rather than at the location where the error actually occurred (in VC2008 only, I understand).

Thanks for your support.

