
AVX/Vectorization Performance with 3.3 beta-1

renes
Registered Member
Posts
5
Karma
0
I'm a longtime Eigen 3 user and recently updated to 3.3 beta-1 to give it a try. I benchmarked Eigen with and without vectorization, mostly to verify that vectorization was working, but the results surprised me, especially given the AVX comments on the 3.3 page. They suggest that vectorization is disabled or not used for the Matrix4f and Vector4f types when AVX is enabled. Enabling AVX clearly improves double-precision performance (Matrix4d*Matrix4d), but overall performance is better without AVX. I verified with SimdInstructionSetsInUse() that the expected instruction sets were enabled. Also, interestingly -- but perhaps expected? -- with AVX enabled, Matrix4f and Matrix4d are 32-byte aligned (checked with alignof()) but only 16-byte aligned without AVX; Vector4f is always 16-byte aligned.

Below are the numbers I got. The timings are the average time per operation over 100M iterations. I'm compiling with VS2015 (I plan to try this on Linux later).

EIGEN_DONT_VECTORIZE defined:
Matrix4f*Vector4f: 49ns
Matrix4f*Matrix4f: 235ns
Matrix4d*Matrix4d: 238ns

With /arch:AVX:
Matrix4f*Vector4f: 46ns -- About the same as EIGEN_DONT_VECTORIZE
Matrix4f*Matrix4f: 252ns -- About the same as EIGEN_DONT_VECTORIZE
Matrix4d*Matrix4d: 11ns

Without /arch:AVX:
Matrix4f*Vector4f: 2ns
Matrix4f*Matrix4f: 7ns
Matrix4d*Matrix4d: 14ns -- Almost 30% slower than AVX version
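
For reference, the speedup factors implied by these timings can be worked out directly. A minimal sketch; the `speedup` helper is a hypothetical name used only for illustration and is not part of Eigen:

```cpp
#include <cassert>

// Ratio of a baseline time to a candidate time (both in nanoseconds).
// A value above 1 means the candidate is faster than the baseline.
inline double speedup(double baseline_ns, double candidate_ns) {
    assert(candidate_ns > 0.0);  // avoid dividing by zero
    return baseline_ns / candidate_ns;
}
```

With the numbers above, speedup(49, 2) is 24.5 for Matrix4f*Vector4f (SSE/SSE2 vs. EIGEN_DONT_VECTORIZE), and speedup(14, 11) is about 1.27 for Matrix4d*Matrix4d (AVX vs. SSE/SSE2).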
ggael
Moderator
Posts
3447
Karma
19
Be careful with such micro-benchmarks, as the compiler might over-optimize some versions, or it might just be messing with inlining. For instance, a factor of 25x for the matrix-vector case does not make any sense. You might also try the devel branch, as beta-1 is already quite old.
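
One common way to reduce the risk of the compiler removing the measured work is to feed each result into the next iteration and make the final value observable. A rough sketch in plain C++, without Eigen; `mat_vec`, `bench_mat_vec` and `BenchResult` are hypothetical names, and this is not the benchmark used in this thread:

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstddef>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<float, 16>;  // column-major: element (r, c) is m[c * 4 + r]

// Plain scalar 4x4 matrix times 4-vector product.
inline Vec4 mat_vec(const Mat4& m, const Vec4& v) {
    Vec4 out{};
    for (std::size_t c = 0; c < 4; ++c)
        for (std::size_t r = 0; r < 4; ++r)
            out[r] += m[c * 4 + r] * v[c];
    return out;
}

struct BenchResult {
    double ns_per_op;  // average wall-clock time per product
    float checksum;    // sum of the final vector, returned so the result is used
};

// Each product feeds the next one (loop-carried dependency), and the final
// value passes through a volatile, which makes it harder for the optimizer
// to drop or hoist the work. It is a mitigation, not a guarantee.
inline BenchResult bench_mat_vec(const Mat4& m, Vec4 v, std::size_t iters) {
    assert(iters > 0);
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iters; ++i)
        v = mat_vec(m, v);
    const auto stop = std::chrono::steady_clock::now();
    volatile float sink = v[0] + v[1] + v[2] + v[3];
    const double total_ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return {total_ns / static_cast<double>(iters), sink};
}
```

Checking the returned checksum against a known value is a cheap sanity test that the loop really ran. Looking at the generated assembly is still the most reliable check.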

Your observations regarding alignment are correct, and the behavior is intentional: sizeof(Vector4f)==16, so there is no need for 32-byte alignment (which would also waste memory).
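
The memory cost of over-aligning a 16-byte type can be seen with plain structs. A small sketch; `Vec4f`, `Mat4d`, `PaddedVec4f` and `is_aligned` are hypothetical stand-ins for illustration and are not Eigen's definitions:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// 16 bytes: exactly one 128-bit SSE register, so 16-byte alignment is enough.
struct alignas(16) Vec4f { float data[4]; };

// 128 bytes: a multiple of a 256-bit AVX register, so 32-byte alignment fits.
struct alignas(32) Mat4d { double data[16]; };

// The same 4 floats forced to 32-byte alignment: sizeof is rounded up to 32,
// so half of every element in an array of these would be padding.
struct alignas(32) PaddedVec4f { float data[4]; };

static_assert(sizeof(Vec4f) == 16 && alignof(Vec4f) == 16, "no padding needed");
static_assert(sizeof(Mat4d) == 128 && alignof(Mat4d) == 32, "no padding needed");
static_assert(sizeof(PaddedVec4f) == 32, "16 bytes of padding per object");

// True if pointer p is a multiple of the given alignment (in bytes).
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```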
renes
Registered Member
Posts
5
Karma
0
Thanks for the reply. I modified my benchmark test to use a realistic workload from one of my applications in order to (hopefully!) prevent the compiler from optimizing away the tests. Plus I downloaded the latest Eigen devel version.

Now, the performance difference between EIGEN_DONT_VECTORIZE and the SSE/SSE2 vectorized versions is 5x-10x instead of 25x. Roughly speaking, what would be the expected improvement? However, the basic problem remains -- it seems that with /arch:AVX the single-precision tests produce the same results as EIGEN_DONT_VECTORIZE, while the double-precision tests show a ~30% improvement over the SSE/SSE2 version.

I guess I'll look at the ASM output to see what the compiler is doing and I'll test this on Linux in a few days to see if there's any difference.
ggael
Moderator
Posts
3447
Karma
19
OK, it looks like there is indeed an issue with fixed-size products: half-register instructions are not considered for them.
ggael
Moderator
Posts
3447
Karma
19
Fixed:
https://bitbucket.org/eigen/eigen/commits/31f783860864/
Summary: Enable the use of half-packet in coeff-based product. For instance, Matrix4f*Vector4f is now vectorized again when using AVX
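
To illustrate the idea behind the summary, here is a deliberately simplified model of packet selection. It is a hypothetical sketch, not Eigen's actual logic; `pick_packet_size` and its parameters are made-up names. With AVX a full float packet holds 8 scalars, so a 4-float vector only fits if the half packet (4 floats, one SSE register) is also considered:

```cpp
#include <cassert>
#include <cstddef>

// Simplified, hypothetical model; not Eigen's implementation.
// vector_size: number of scalars in the fixed-size vector or matrix column.
// full_packet: scalars per full SIMD register (e.g. 8 floats or 4 doubles
//              with AVX, 4 floats with SSE).
// allow_half:  whether a half-width packet may be used as a fallback.
// Returns the number of scalars handled per instruction; 1 means scalar code.
inline std::size_t pick_packet_size(std::size_t vector_size,
                                    std::size_t full_packet,
                                    bool allow_half) {
    assert(vector_size > 0 && full_packet > 0);
    if (vector_size % full_packet == 0) return full_packet;
    const std::size_t half = full_packet / 2;
    if (allow_half && half > 1 && vector_size % half == 0) return half;
    return 1;  // nothing fits: fall back to scalar code
}
```

In this model, pick_packet_size(4, 8, false) returns 1 and pick_packet_size(4, 8, true) returns 4, while 4 doubles already fill a full AVX packet. That is at least consistent with the reported timings, where only the single-precision cases regressed under AVX.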
renes
Registered Member
Posts
5
Karma
0
Thanks! I'll update today and test it out.
renes
Registered Member
Posts
5
Karma
0
ggael wrote: Fixed:
https://bitbucket.org/eigen/eigen/commits/31f783860864/
Summary: Enable the use of half-packet in coeff-based product. For instance, Matrix4f*Vector4f is now vectorized again when using AVX


I tested this out and, indeed, Matrix4f*Vector4f performs much better now with AVX enabled: it is now about the same as with SSE/SSE2 only. Is that the expected result? However, when AVX is enabled, Matrix4f*Matrix4f still seems to perform the same as if EIGEN_DONT_VECTORIZE were defined. The SSE/SSE2-only version of Matrix4f*Matrix4f is ~9x faster.
ggael
Moderator
Posts
3447
Karma
19
Make sure you have updated your clone. I fixed that case 2 days ago: https://bitbucket.org/eigen/eigen/commi ... d0dd05b906
renes
Registered Member
Posts
5
Karma
0
Yep, that worked. I updated this morning to the latest revision -- af907dececc0 -- and now the SSE/SSE2 and AVX versions of my tests all run as expected.

Thanks for your help!

