Registered Member
|
Hi everyone,

Thanks for reading my topic! We are developing a scientific computing code, essentially by translating an old Fortran code into C++, using Eigen3 as the matrix library. The Fortran code is hardcoded (and thus ugly), so it makes a good performance yardstick.

For our physical purposes, we only need two types of matrix operations: (1) A * B and (2) A.cwiseProduct(B), where A and B are both 5*5 fixed-size matrices. (We have to use 5*5, not 4*4, by the way.) By changing the input parameters, we can make either (1) or (2) predominant.

The results show that when (1) is dominant, the new code is slightly faster than the Fortran code, which means matrix-matrix multiplications are handled very well by Eigen3. This is amazingly brilliant! However, when (2) becomes dominant, the new code can be slower than the Fortran code, and the more cwiseProduct calls there are, the slower the new code gets.

We did some careful CPU-time measurements and are now fairly sure that the performance loss is caused by (2), which was implemented simply as the array expression A(0:4, 0:4) * B(0:4, 0:4) in Fortran. Our tests were run with both gcc and icc, on both OS X and a Linux workstation. Compile options are -O3 -DNDEBUG -std=c++11.

Can someone please offer some suggestions? Thanks a lot for your time!
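For reference, the two operations differ as follows: (1) is the ordinary matrix product, while (2) is the element-wise (Hadamard) product, which is what both Eigen's cwiseProduct and the Fortran array expression A(0:4,0:4)*B(0:4,0:4) compute. A minimal plain-C++ sketch of the two semantics (no Eigen, just to pin down the meaning):

```cpp
#include <array>

using Mat5 = std::array<std::array<double, 5>, 5>;

// (1) Ordinary matrix product: C(i,j) = sum_k A(i,k) * B(k,j)
Mat5 matmul(const Mat5& A, const Mat5& B) {
    Mat5 C{};
    for (int i = 0; i < 5; ++i)
        for (int j = 0; j < 5; ++j)
            for (int k = 0; k < 5; ++k)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}

// (2) Element-wise (Hadamard) product: C(i,j) = A(i,j) * B(i,j),
// the semantics of Eigen's cwiseProduct and Fortran's array '*'.
Mat5 hadamard(const Mat5& A, const Mat5& B) {
    Mat5 C{};
    for (int i = 0; i < 5; ++i)
        for (int j = 0; j < 5; ++j)
            C[i][j] = A[i][j] * B[i][j];
    return C;
}
```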
Last edited by kleng on Wed Apr 13, 2016 11:20 am, edited 1 time in total.
|
Registered Member
|
I think I found the problem. It is because std::complex is too slow.
|
Moderator
|
Indeed, in many cases std::complex::operator* is not inlined, and thus it is extremely slow... This is why we try to bypass it as much as we can, for instance in matrix products. In your case, using a fixed-size type such as Matrix<complex<double>,5,5> should help the compiler produce better code.
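To illustrate what "bypassing" std::complex::operator* means: the multiplication can be expanded into four real multiplies and two adds, avoiding the library call (which on some implementations also performs extra NaN/infinity handling and may not be inlined). A small sketch of the expanded form:

```cpp
#include <complex>

// Expanded complex multiply:
// (a.r + i*a.i) * (b.r + i*b.i) = (a.r*b.r - a.i*b.i) + i*(a.i*b.r + a.r*b.i)
// This is the arithmetic that Eigen inlines directly when it bypasses
// std::complex::operator* inside its matrix-product kernels.
std::complex<float> mul_expanded(const std::complex<float>& a,
                                 const std::complex<float>& b) {
    return std::complex<float>(a.real() * b.real() - a.imag() * b.imag(),
                               a.imag() * b.real() + a.real() * b.imag());
}
```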
|
Registered Member
|
Thanks a lot for replying! But our tests show that Eigen seems very inefficient for the fixed-size Matrix<complex<float>,5,5>. On my OS X machine, the following code takes 0.840583 s, while for real numbers it takes only 0.02 s! The dynamic type is even faster: Matrix<complex<float>,-1,-1> with size 5*5 takes only 0.18 s. And, very strangely, Matrix<complex<float>,6,6> takes only 0.18 s. What is happening at 5*5? I wish you could try the simple code:
|
Moderator
|
Right, I confirm. You can work around the inlining issue by adding the following overload (similarly for complex<double> if needed):
namespace std {
  complex<float> operator*(const complex<float> &a, const complex<float> &b) {
    return complex<float>(a.real()*b.real() - a.imag()*b.imag(),
                          a.imag()*b.real() + a.real()*b.imag());
  }
}

This should be enough to match the Fortran code. The situation is then roughly as follows:

- Fixed size 5x5 -> cannot be properly aligned for vectorization -> vectorization is disabled, but you benefit from explicit loop unrolling and no malloc.
- Dynamic size -> properly aligned at the cost of mallocs -> vectorization is enabled and partly resolved at runtime, with no loop unrolling.

For float/double this strategy works well, but I have to admit that for complexes, disabling vectorization for fixed 5x5 is not very good, and enabling unaligned vectorization would probably pay off, especially on recent CPUs for which the overhead of unaligned loads/stores is marginal. This is something I have planned for the near future. In the meantime, in addition to the above trick, you could either:

- use fixed 6x6 matrices padded with zeros -> excellent for vectorization for both coefficient-wise and matrix products, because each column is perfectly aligned with SSE -> the best approach for matrix products; or
- use Matrix<complex<float>, Dynamic, Dynamic, 0, 6, 5>, which enables runtime vectorization without malloc. |
Registered Member
|
Very helpful!
Thanks again! |