Registered Member
|
Hi everyone,

Thanks for reading my topic! We are developing a scientific computing code, essentially by translating an old Fortran code into C++, using Eigen3 as the matrix library. The Fortran code is hardcoded (and thus ugly), so it makes a good performance yardstick.

For our physical purposes, we only need two types of matrix operations: (1) A * B and (2) A.cwiseProduct(B), where A and B are both 5*5 fixed-size matrices. (We have to use 5*5, not 4*4, by the way.) By changing the input parameters, we can make either (1) or (2) predominant.

The results show that when (1) is dominant, the new code is slightly faster than the Fortran code, which means matrix-matrix multiplications are handled very well by Eigen3. This is amazingly brilliant! However, when (2) becomes dominant, the new code can be slower than the Fortran code, and the more cwiseProduct calls there are, the slower the new code gets.

We did some careful CPU-time measurements and are now fairly sure that the performance loss is caused by (2), which was implemented simply as the array expression A(0:4, 0:4) * B(0:4, 0:4) in Fortran. Our tests were run with both gcc and icc, on both OS X and a Linux workstation. Compile options are -O3 -DNDEBUG -std=c++11.

Can someone please offer some suggestions? Thanks a lot for your time!
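For reference, the two operations differ as follows: (1) is the ordinary matrix product, while (2) is the element-wise (Hadamard) product, which is what both Eigen's cwiseProduct and the Fortran array expression A(0:4,0:4)*B(0:4,0:4) compute. A minimal plain-C++ sketch of the two semantics (no Eigen, just to pin down the meaning):

```cpp
#include <array>

using Mat5 = std::array<std::array<double, 5>, 5>;

// (1) Ordinary matrix product: C(i,j) = sum_k A(i,k) * B(k,j)
Mat5 matmul(const Mat5& A, const Mat5& B) {
    Mat5 C{};
    for (int i = 0; i < 5; ++i)
        for (int j = 0; j < 5; ++j)
            for (int k = 0; k < 5; ++k)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}

// (2) Element-wise (Hadamard) product: C(i,j) = A(i,j) * B(i,j),
// the semantics of Eigen's cwiseProduct and Fortran's array '*'.
Mat5 hadamard(const Mat5& A, const Mat5& B) {
    Mat5 C{};
    for (int i = 0; i < 5; ++i)
        for (int j = 0; j < 5; ++j)
            C[i][j] = A[i][j] * B[i][j];
    return C;
}
```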
Last edited by kleng on Wed Apr 13, 2016 11:20 am, edited 1 time in total.
|
Registered Member
|
I think I found the problem. It is because std::complex is too slow.
|
Moderator
|
Indeed, in many cases std::complex::operator* is not inlined, and thus it is extremely slow... This is why we try to bypass it as much as we can, for instance in matrix products. In your case, using a fixed-size type such as Matrix<complex<double>,5,5> should help the compiler produce better code.
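To illustrate what "bypassing" std::complex::operator* means: the multiplication can be expanded into four real multiplies and two adds, avoiding the library call (which on some implementations also performs extra NaN/infinity handling and may not be inlined). A small sketch of the expanded form:

```cpp
#include <complex>

// Expanded complex multiply:
// (a.r + i*a.i) * (b.r + i*b.i) = (a.r*b.r - a.i*b.i) + i*(a.i*b.r + a.r*b.i)
// This is the arithmetic that Eigen inlines directly when it bypasses
// std::complex::operator* inside its matrix-product kernels.
std::complex<float> mul_expanded(const std::complex<float>& a,
                                 const std::complex<float>& b) {
    return std::complex<float>(a.real() * b.real() - a.imag() * b.imag(),
                               a.imag() * b.real() + a.real() * b.imag());
}
```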
|
Registered Member
|
Thanks a lot for replying! But our tests show that Eigen seems very inefficient for the fixed-size Matrix<complex<float>,5,5>. On my OS X machine, the following code takes 0.840583 s, while for real numbers it takes only 0.02 s! The dynamic type is even faster: Matrix<complex<float>,-1,-1> with size 5*5 takes only 0.18 s. And, very strangely, Matrix<complex<float>,6,6> takes only 0.18 s. What is happening at 5*5? I wish you could try the simple code:
|
Moderator
|
Right, I confirm. You can work around the inlining issue by adding the following overload (similarly for complex<double> if needed):
namespace std {
  complex<float> operator*(const complex<float> &a, const complex<float> &b) {
    return complex<float>(a.real()*b.real() - a.imag()*b.imag(),
                          a.imag()*b.real() + a.real()*b.imag());
  }
}

This should be enough to match the Fortran code. The situation is then roughly as follows:

- Fixed size 5x5 -> cannot be properly aligned for vectorization -> vectorization is disabled, but you benefit from explicit loop unrolling and no malloc.
- Dynamic size -> properly aligned at the cost of mallocs -> vectorization is enabled and partly resolved at runtime, with no loop unrolling.

For float/double this strategy works well, but I have to admit that for complexes, disabling vectorization for fixed 5x5 is not very good, and enabling unaligned vectorization would probably pay off, especially on recent CPUs for which the overhead of unaligned loads/stores is marginal. This is something I have planned for the near future. In the meantime, in addition to the above trick, you could either:

- use fixed 6x6 matrices padded with zeros -> excellent for vectorization for both coefficient-wise and matrix products, because each column is perfectly aligned with SSE -> the best approach for matrix products; or
- use Matrix<complex<float>, Dynamic, Dynamic, 0, 6, 5>, which enables runtime vectorization without malloc. |
Registered Member
|
Very helpful!
Thanks again! |