Registered Member
|
I'm a long-time Eigen 3 user and recently updated to 3.3 beta-1 to give it a try. I benchmarked Eigen with and without vectorization, mostly to verify that vectorization was working, but the results surprised me, especially given the AVX comments on the 3.3 page. The results I got suggest that vectorization is disabled or not used for the Matrix4f and Vector4f types when AVX is enabled. Enabling AVX clearly improves double-precision performance (Matrix4d*Matrix4d), but overall performance is better without AVX. I verified with SimdInstructionSetsInUse() that the expected instruction sets were enabled. Also, interestingly -- but perhaps expected? -- with AVX enabled, Matrix4f and Matrix4d are 32-byte aligned (checking with alignof()) but only 16-byte aligned without AVX; Vector4f is always 16-byte aligned.
Below are the numbers I got. The timings are the average time per operation over 100M iterations. I'm compiling with VS2015 (I plan to try this on Linux later).

With EIGEN_DONT_VECTORIZE defined:
Matrix4f*Vector4f: 49ns
Matrix4f*Matrix4f: 235ns
Matrix4d*Matrix4d: 238ns

With /arch:AVX:
Matrix4f*Vector4f: 46ns -- about the same as EIGEN_DONT_VECTORIZE
Matrix4f*Matrix4f: 252ns -- about the same as EIGEN_DONT_VECTORIZE
Matrix4d*Matrix4d: 11ns

Without /arch:AVX:
Matrix4f*Vector4f: 2ns
Matrix4f*Matrix4f: 7ns
Matrix4d*Matrix4d: 14ns -- almost 30% slower than the AVX version |
Moderator
|
Be careful with such micro-benchmarks, as the compiler might over-optimize some versions, or maybe it is just messing with inlining. For instance, a 25x factor for the matrix-vector case does not make any sense. You might also try the devel branch, as beta-1 is already quite old.
Your observations regarding alignment are right and on purpose: sizeof(Vector4f)==16, so there is no need for 32-byte alignment (which would also waste memory). |
Registered Member
|
Thanks for the reply. I modified my benchmark test to use a realistic workload from one of my applications in order to (hopefully!) prevent the compiler from optimizing away the tests. Plus I downloaded the latest Eigen devel version.
Now, the performance difference between EIGEN_DONT_VECTORIZE and the SSE/SSE2-vectorized versions is 5x-10x instead of 25x. Roughly speaking, what would be the expected improvement? However, the basic problem remains -- it seems that with /arch:AVX the single-precision tests produce the same results as EIGEN_DONT_VECTORIZE, while the double-precision tests show a ~30% improvement over the SSE/SSE2 version. I guess I'll look at the ASM output to see what the compiler is doing, and I'll test this on Linux in a few days to see if there's any difference. |
Moderator
|
OK, it looks like there is indeed an issue with fixed-size products, for which half-register instructions are not considered.
|
Moderator
|
Fixed:
https://bitbucket.org/eigen/eigen/commits/31f783860864/
Summary: Enable the use of half-packet in coeff-based product. For instance, Matrix4f*Vector4f is now vectorized again when using AVX. |
Registered Member
|
Thanks! I'll update today and test it out.
|
Registered Member
|
I tested this out and, indeed, Matrix4f*Vector4f now performs much better with AVX enabled -- about the same as with SSE/SSE2 only. Is that the expected result? However, when AVX is enabled, Matrix4f*Matrix4f still seems to perform the same as if EIGEN_DONT_VECTORIZE were defined; the SSE/SSE2-only version of Matrix4f*Matrix4f is ~9x faster. |
Moderator
|
Make sure you have updated your clone. I fixed that case 2 days ago: https://bitbucket.org/eigen/eigen/commi ... d0dd05b906
|
Registered Member
|
Yep, that worked. I updated this morning to the latest revision -- af907dececc0 -- and now the SSE/SSE2 and AVX versions of my tests all run as expected.
Thanks for your help! |