Affine3d slower than Matrix4d

tobiaskunz

Tue Jun 11, 2013 6:55 pm
Our code uses a lot of 3D affine transformations, so I compared the speed of concatenating transformations using the Affine3d class against the Matrix4d class. My expectation was that Affine3d would be faster because it can make additional assumptions. However, in my experiments Affine3d is slower by a factor of 3 to 4.
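To make the "additional assumptions" concrete: an affine transform has a fixed last row (0, 0, 0, 1), so a product of two of them only needs the upper 3x4 block. The symbols R_i (3x3 linear part) and t_i (translation column) are notation introduced here for illustration, not taken from Eigen:

```latex
\[
\begin{pmatrix} R_1 & t_1 \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} R_2 & t_2 \\ 0 & 1 \end{pmatrix}
=
\begin{pmatrix} R_1 R_2 & R_1 t_2 + t_1 \\ 0 & 1 \end{pmatrix}
\]
```

Counting scalar operations, the right-hand side costs 36 multiplications and 27 additions (27 + 18 for R_1 R_2, 9 + 6 for R_1 t_2, plus 3 additions for t_1), against 64 multiplications and 48 additions for a general 4x4 product.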

Am I doing anything wrong? I don't think a specialized class should be slower than the general matrix class. I also compared the speed to hard-coding the concatenation, which was faster than Affine3d and Matrix4d. I am using Visual Studio 2010 32-bit. Here are the results for 100 million concatenations:

With SSE2:
Affine3d: 9.3 s
Matrix4d: 3.1 s
Hard-coded: 1.8 s

Without SSE:
Affine3d: 9.7 s
Matrix4d: 2.4 s
Hard-coded: 2.0 s

I am also confused about the fact that Matrix4d is slowed down by SSE2.

Here is my code for the experiment:

Code:
#include <Eigen/Dense>
#include <ctime>
#include <iostream>

using namespace Eigen;
using namespace std;

Affine3d concatenate(const Affine3d& A1, const Affine3d& A2) {
   Affine3d res;

   res(0,0) = A1(0,0) * A2(0,0) + A1(0,1) * A2(1,0) + A1(0,2) * A2(2,0);
   res(1,0) = A1(1,0) * A2(0,0) + A1(1,1) * A2(1,0) + A1(1,2) * A2(2,0);
   res(2,0) = A1(2,0) * A2(0,0) + A1(2,1) * A2(1,0) + A1(2,2) * A2(2,0);

   res(0,1) = A1(0,0) * A2(0,1) + A1(0,1) * A2(1,1) + A1(0,2) * A2(2,1);
   res(1,1) = A1(1,0) * A2(0,1) + A1(1,1) * A2(1,1) + A1(1,2) * A2(2,1);
   res(2,1) = A1(2,0) * A2(0,1) + A1(2,1) * A2(1,1) + A1(2,2) * A2(2,1);

   res(0,2) = A1(0,0) * A2(0,2) + A1(0,1) * A2(1,2) + A1(0,2) * A2(2,2);
   res(1,2) = A1(1,0) * A2(0,2) + A1(1,1) * A2(1,2) + A1(1,2) * A2(2,2);
   res(2,2) = A1(2,0) * A2(0,2) + A1(2,1) * A2(1,2) + A1(2,2) * A2(2,2);

   res(0,3) = A1(0,0) * A2(0,3) + A1(0,1) * A2(1,3) + A1(0,2) * A2(2,3) + A1(0,3);
   res(1,3) = A1(1,0) * A2(0,3) + A1(1,1) * A2(1,3) + A1(1,2) * A2(2,3) + A1(1,3);
   res(2,3) = A1(2,0) * A2(0,3) + A1(2,1) * A2(1,3) + A1(2,2) * A2(2,3) + A1(2,3);

   res(3,0) = 0.0;
   res(3,1) = 0.0;
   res(3,2) = 0.0;
   res(3,3) = 1.0;

   return res;
}

int main() {
   const int iterations = 100000000;
   
   Affine3d A1 = Translation3d(0.1, 0.2, 0.3) * AngleAxisd(0.5, Vector3d(1.0 / sqrt(2.0), 1.0 / sqrt(2.0), 0.0));
   Affine3d A2 = A1;
   Matrix4d M1 = A1.matrix();
   Matrix4d M2 = M1;

   clock_t start = clock();
   for(int i = 0; i < iterations; i++) {
      A2 = A2 * A1;
   }
   cout << "Affine3d: " << (double)(clock() - start) / CLOCKS_PER_SEC << " s\n";

   start = clock();
   for(int i = 0; i < iterations; i++) {
      M2 = M2 * M1;
   }
   cout << "Matrix4d: " << (double)(clock() - start) / CLOCKS_PER_SEC << " s\n";

   A2 = A1;
   start = clock();
   for(int i = 0; i < iterations; i++) {
      A2 = concatenate(A2, A1);
   }
   cout << "Hard-coded: " << (double)(clock() - start) / CLOCKS_PER_SEC << " s\n";

   return 0;
}
ggael

Re: Affine3d slower than Matrix4d

Tue Jun 11, 2013 7:55 pm
Your benchmark is broken, as it allows the compiler to completely or partially remove the loops. That's what happens with gcc. Here is a fixed version:

Code:
// viewtopic.php?f=74&t=111512

#include <Eigen/Dense>
#include <ctime>
#include <iostream>

using namespace Eigen;
using namespace std;

EIGEN_DONT_INLINE
void concatenate(const Affine3d& A1, const Affine3d& A2, Affine3d& res) {
   res(0,0) = A1(0,0) * A2(0,0) + A1(0,1) * A2(1,0) + A1(0,2) * A2(2,0);
   res(1,0) = A1(1,0) * A2(0,0) + A1(1,1) * A2(1,0) + A1(1,2) * A2(2,0);
   res(2,0) = A1(2,0) * A2(0,0) + A1(2,1) * A2(1,0) + A1(2,2) * A2(2,0);

   res(0,1) = A1(0,0) * A2(0,1) + A1(0,1) * A2(1,1) + A1(0,2) * A2(2,1);
   res(1,1) = A1(1,0) * A2(0,1) + A1(1,1) * A2(1,1) + A1(1,2) * A2(2,1);
   res(2,1) = A1(2,0) * A2(0,1) + A1(2,1) * A2(1,1) + A1(2,2) * A2(2,1);

   res(0,2) = A1(0,0) * A2(0,2) + A1(0,1) * A2(1,2) + A1(0,2) * A2(2,2);
   res(1,2) = A1(1,0) * A2(0,2) + A1(1,1) * A2(1,2) + A1(1,2) * A2(2,2);
   res(2,2) = A1(2,0) * A2(0,2) + A1(2,1) * A2(1,2) + A1(2,2) * A2(2,2);

   res(0,3) = A1(0,0) * A2(0,3) + A1(0,1) * A2(1,3) + A1(0,2) * A2(2,3) + A1(0,3);
   res(1,3) = A1(1,0) * A2(0,3) + A1(1,1) * A2(1,3) + A1(1,2) * A2(2,3) + A1(1,3);
   res(2,3) = A1(2,0) * A2(0,3) + A1(2,1) * A2(1,3) + A1(2,2) * A2(2,3) + A1(2,3);

   res(3,0) = 0.0;
   res(3,1) = 0.0;
   res(3,2) = 0.0;
   res(3,3) = 1.0;
}

template<typename T>
EIGEN_DONT_INLINE
void prod(const T& a, const T& b, T& c) { c = a*b; }


int main() {
   const int iterations = 100000000;
   
   Affine3d A1 = Translation3d(0.1, 0.2, 0.3) * AngleAxisd(0.5, Vector3d(1.0 / sqrt(2.0), 1.0 / sqrt(2.0), 0.0));
   Affine3d A2 = A1, A3;
   Matrix4d M1 = A1.matrix();
   Matrix4d M2 = M1, M3;

   clock_t start = clock();
   for(int i = 0; i < iterations; i++) {
      prod( A2 , A1, A3);
   }
   cout << "Affine3d: " << (double)(clock() - start) / CLOCKS_PER_SEC << " s\n";

   start = clock();
   for(int i = 0; i < iterations; i++) {
      prod ( M2 , M1, M3 );
   }
   cout << "Matrix4d: " << (double)(clock() - start) / CLOCKS_PER_SEC << " s\n";

   A2 = A1;
   start = clock();
   for(int i = 0; i < iterations; i++) {
      concatenate(A2, A1, A3);
   }
   cout << "Hard-coded: " << (double)(clock() - start) / CLOCKS_PER_SEC << " s\n";

   return 0;
}


And with g++ -O2 -DNDEBUG I get:

With SSE:
Affine3d: 2.22058 s
Matrix4d: 2.12294 s
Hard-coded: 2.13547 s

No SSE:
Affine3d: 2.02658 s
Matrix4d: 3.95067 s
Hard-coded: 2.1355 s
tobiaskunz

Re: Affine3d slower than Matrix4d

Tue Jun 11, 2013 9:22 pm
Thanks for correcting my benchmark. With your changes I get results similar to yours when using gcc. Here are my results with gcc on a 64-bit Linux machine with the same CPU as my Windows machine:

Affine3d: 1.49 s
Matrix4d: 1.42 s
Hard-coded: 1.33 s

However, with Visual Studio my results haven't changed much. Instead, the difference between Affine3d and Matrix4d has become even larger. Here are my results using your code with Visual Studio. I also tried VS 2012 and 64-bit compilation. 64-bit seems to improve the situation. VS 2012 makes it a lot worse. For now I am just going to assume the problem is caused by VS not optimizing as well as GCC does.

VS 2010, 32-bit, SSE2:
Affine3d: 9.6 s
Matrix4d: 2.0 s
Hard-coded: 1.5 s

VS 2010, 32-bit, No SSE:
Affine3d: 10.0 s
Matrix4d: 2.9 s
Hard-coded: 1.5 s

VS 2010, 64-bit:
Affine3d: 3.3 s
Matrix4d: 2.2 s
Hard-coded: 1.4 s

VS 2012, 32-bit, SSE2:
Affine3d: 20.9 s
Matrix4d: 1.6 s
Hard-coded: 1.4 s

VS 2012, 64-bit:
Affine3d: 17.8 s
Matrix4d: 1.9 s
Hard-coded: 1.4 s
ggael

Re: Affine3d slower than Matrix4d

Tue Jun 11, 2013 9:50 pm
Make sure you compiled in "release" mode, i.e., with optimizations on and -DNDEBUG. If that was the case, then this is very likely an inlining issue. The generated assembly would help identify where MSVC is failing.
tobiaskunz

Re: Affine3d slower than Matrix4d

Tue Jun 11, 2013 10:15 pm
Yes, it was compiled in release mode.
tobiaskunz

Re: Affine3d slower than Matrix4d

Thu Jun 13, 2013 7:15 pm
Is there a specific reason you are not using g++ -O3? With g++ -O3, Affine3d is still significantly slower than Matrix4d for me:

Affine3d: 1.43 s
Matrix4d: 1.11 s
Hard-coded: 1.27 s
Hauke

Re: Affine3d slower than Matrix4d

Mon Jun 17, 2013 10:55 am
I just checked the assembly. There are quite a few function calls which are not inlined even with /O2 and /Ob2 (inline any suitable).

Here is a list of functions which are not inlined:
Eigen::internal::transform_transform_product_impl<Eigen::Transform<double,3,2,0>,Eigen::Transform<double,3,2,0>,0>::run
Eigen::MapBase<Eigen::Block<Eigen::Matrix<double,4,4,0,4,4> const ,3,3,0>,0>::coeff
Eigen::EigenBase<Eigen::Matrix<double,3,3,0,3,3> >::derived // also with other types
Eigen::DenseStorage<double,9,3,3,0>::rows
Eigen::DenseCoeffsBase<Eigen::Block<Eigen::Matrix<double,4,4,0,4,4> const ,3,3,0>,2>::rowStride
Eigen::DenseStorage<double,9,3,3,0>::data

I even tested with /Ox but it did not bring much improvement.

Also, there are lots of 'inline' statements in the code. We should keep in mind that these have little to do with inlining in the sense of copying and pasting the function body into the caller. Many of them are not required, or should be replaced by EIGEN_STRONG_INLINE if that is what we actually intend. The 'inline' keyword just prevents linker errors for non-templated free functions in header-only code.

Btw, inserting a few EIGEN_STRONG_INLINE just shifts the inlining issue to other functions. That is, we would need to add it in many more places if we want one-liner functions to actually be inlined, which is probably what we want for almost all of them.

Regards,
Hauke

