Registered Member
Hi,
I've been using Eigen and up until now haven't had any real performance issues with my code. But now I'm multiplying quite large matrices (400+ rows/columns) and am getting VERY slow program speeds. An example piece of code is:

-------------------------
Eigen::MatrixXd Sc;
Sc = Hcl.transpose()*Qblock*Hcl
   + Hclu.transpose()*Rblock*Hclu
   + 2*Hcl.transpose()*Nblock*Hclu
   + Hclnp.transpose()*P*Hclnp;
-------------------------

In the example all the matrices (except Sc) are members of a class 'predictionModel'; I just removed the predictionModel. prefix above for brevity. The dimensions are:

Hcl is 400x100
Qblock is 400x400
Hclu is 100x100
Rblock is 100x100
Nblock is 400x100
Hclnp is 8x100

In MATLAB the calculation takes 0.004555 s; using Eigen it's taking over 0.5 s (hard to get exact timings; in debug mode it runs at 5 s). At present, in release mode I have /O2, SSE2 and EIGEN_NO_DEBUG all set.

Should I be grouping this calculation differently? Qblock and Rblock only have non-zero entries on the diagonal; should I make these sparse matrices?

Any help would be much appreciated.

Best regards,
Avi
Moderator
What's your compiler? The following takes 0.004s on my laptop using Eigen 3.2 with a recent clang or gcc:

It also takes only 0.0024s on our AVX branch.
Registered Member
Hi ggael,
Just while typing this reply I found that even though I had selected /O2 optimisations, they weren't being applied for some reason! My apologies for wasting your time in that regard. The code is much faster now, but I think I could still speed things up with my own optimisations. Do you know of any additional compiler settings that can slow down the performance of Eigen (MSVS 2008 with Eigen 3, last updated 29/1/2014)? My current set-up uses:

------------------------------------------------------------
/Ob2 /Ot /GL /I "C:\Projects\Controllers\RnD\MPCCART3newAPI\Source\..\Infrastructure\Source\Eigen"
/D "WIN32" /D "WIN32_LEAN_AND_MEAN" /D "_USE_32BIT_TIME_T" /D "_CRT_SECURE_NO_WARNINGS"
/D "NOMINMAX" /D "EIGEN_NO_DEBUG" /D "DNDEBUG" /D "_MBCS"
/FD /EHsc /MT /Zp4 /arch:SSE2 /W3 /nologo /c /Zi /TP
------------------------------------------------------------

Regards,
Avi
Moderator
I don't know much about MSVC, but compiling in 64-bit mode helps a lot (more registers are available).
Registered Member
Hi ggael,
I've found another bottleneck in my code:

------------------------------------------------------------
BENCH(t, 3, 1, Lc1.noalias() = 2*((Hcl.transpose()*Qblock*Pcl + Hclu.transpose()*Rblock*Pclu + Hclnp.transpose()*P*Pclnp)*scaledState
                                + (Hcl.transpose()*Qblock*Pclop + Hclu.transpose()*Rblock*Pcluop + Hclnp.transpose()*P*Pclopnp)*stateTarget
                                + (Hcl.transpose()*Qblock*Dcl + Hclu.transpose()*Rblock*Dclu + Hclnp.transpose()*P*Dclnp)*deltaDisturbance));
------------------------------------------------------------

This is a slightly larger version of the previous example, where stateTarget is 400x1 and deltaDisturbance is 200x1. Currently this is taking about 0.6 s (as returned by BENCH) while in MATLAB it takes 0.006 s.

I've separated out the expression, and

(Hcl.transpose()*Qblock*Pclop + Hclu.transpose()*Rblock*Pcluop + Hclnp.transpose()*P*Pclopnp)*stateTarget
(Hcl.transpose()*Qblock*Dcl + Hclu.transpose()*Rblock*Dclu + Hclnp.transpose()*P*Dclnp)*deltaDisturbance

seem to be the culprits, though I'm not entirely sure why.

Regards,
Avi
Registered Member
Okay, after a bit more investigation I have found that a slight code revision makes the calculation around 30x faster:

------------------------------------------------------------
// Slow code
Lc1.noalias() = 2*((predictionModel.Hcl.transpose()*predictionModel.Qblock*predictionModel.Pcl
                  + predictionModel.Hclu.transpose()*predictionModel.Rblock*predictionModel.Pclu
                  + Hclnp.transpose()*predictionModel.P*Pclnp)*scaledState
                 + (predictionModel.Hcl.transpose()*predictionModel.Qblock*predictionModel.Pclop
                  + predictionModel.Hclu.transpose()*predictionModel.Rblock*predictionModel.Pcluop
                  + Hclnp.transpose()*predictionModel.P*Pclopnp)*stateTarget
                 + (predictionModel.Hcl.transpose()*predictionModel.Qblock*predictionModel.Dcl
                  + predictionModel.Hclu.transpose()*predictionModel.Rblock*predictionModel.Dclu
                  + Hclnp.transpose()*predictionModel.P*Dclnp)*deltaDisturbance);

// Split Lc calculation for speed
temp1 = predictionModel.Hcl.transpose()*predictionModel.Qblock*predictionModel.Pcl
      + predictionModel.Hclu.transpose()*predictionModel.Rblock*predictionModel.Pclu
      + Hclnp.transpose()*predictionModel.P*Pclnp;
temp1 *= scaledState;
temp2 = predictionModel.Hcl.transpose()*predictionModel.Qblock*predictionModel.Pclop
      + predictionModel.Hclu.transpose()*predictionModel.Rblock*predictionModel.Pcluop
      + Hclnp.transpose()*predictionModel.P*Pclopnp;
temp2 *= stateTarget;
temp3 = predictionModel.Hcl.transpose()*predictionModel.Qblock*predictionModel.Dcl
      + predictionModel.Hclu.transpose()*predictionModel.Rblock*predictionModel.Dclu
      + Hclnp.transpose()*predictionModel.P*Dclnp;
temp3 *= deltaDisturbance;
temp1 += temp2 + temp3;
temp1 *= 2;
------------------------------------------------------------

I don't understand why the second case is faster; would you be able to give me some insight?
Moderator
I don't understand how the second version could be faster either, but let me give you some hints. When using vectors (400x1), make sure you use a true vector type such as VectorXd instead of a matrix type. Then you must take care how operator* groups products: A*B*v parses as (A*B)*v, which is much slower than A*(B*v) when A and B are matrices and v is a vector, because the second version involves far fewer operations.