|   Registered Member   
 | 
Hi,

I've been using Eigen and up until now haven't had any real performance issues with my code. But now I'm multiplying quite large matrices (400+ rows/columns) and am getting VERY slow program speeds. An example piece of code is:

-------------------------
Eigen::MatrixXd Sc;
Sc = Hcl.transpose()*Qblock*Hcl + Hclu.transpose()*Rblock*Hclu + 2*Hcl.transpose()*Nblock*Hclu + Hclnp.transpose()*P*Hclnp;
------------------------

In the example all the matrices (except Sc) are part of a class 'predictionModel'; I just removed the predictionModel. prefix above for brevity.

Hcl is 400x100
Qblock is 400x400
Hclu is 100x100
Rblock is 100x100
Nblock is 400x100
Hclnp is 8x100

In MATLAB the calculation takes 0.004555s; using Eigen it's taking over 0.5s (it's hard to get exact timings; in debug mode it runs at 5s). At present, in release mode I have O2, SSE2 and EIGEN_NO_DEBUG all set.

Should I be grouping this calculation differently? Qblock and Rblock only have non-zero entries on the diagonal; should I make these sparse matrices?

Any help would be much appreciated.

Best regards, Avi | 
|   Moderator   
 | 
Which compiler are you using? The following takes 0.004s on my laptop using Eigen 3.2 with a recent clang or gcc: 
 It also takes only 0.0024s on our AVX branch. | 
|   Registered Member   
 | 
Hi ggael,

Just while typing this reply I found that even though I had selected /O2 optimisations, they weren't being applied for some reason! My apologies for wasting your time in that regard. Now the code is much faster, but I think I could still speed things up with my own optimisations.

Do you know of any additional compiler settings that can slow down the performance of Eigen (MSVS 2008 with Eigen 3, last updated 29/1/2014)? My current setup uses:

/Ob2 /Ot /GL /I "C:\Projects\Controllers\RnD\MPCCART3newAPI\Source\..\Infrastructure\Source\Eigen" /D "WIN32" /D "WIN32_LEAN_AND_MEAN" /D "_USE_32BIT_TIME_T" /D "_CRT_SECURE_NO_WARNINGS" /D "NOMINMAX" /D "EIGEN_NO_DEBUG" /D "DNDEBUG" /D "_MBCS" /FD /EHsc /MT /Zp4 /arch:SSE2 /W3 /nologo /c /Zi /TP

Regards, Avi | 
|   Moderator   
 | 
I don't know much about MSVC, but compiling in 64-bit mode helps a lot (more registers are available).
						 | 
|   Registered Member   
 | 
Hi ggael,

I've found another bottleneck in my code:

BENCH(t, 3, 1, Lc1.noalias() = 2*((Hcl.transpose()*Qblock*Pcl + Hclu.transpose()*Rblock*Pclu + Hclnp.transpose()*P*Pclnp)*scaledState
    + (Hcl.transpose()*Qblock*Pclop + Hclu.transpose()*Rblock*Pcluop + Hclnp.transpose()*P*Pclopnp)*stateTarget
    + (Hcl.transpose()*Qblock*Dcl + Hclu.transpose()*Rblock*Dclu + Hclnp.transpose()*P*Dclnp)*deltaDisturbance));

This is a slightly larger version of the previous example, where stateTarget is 400x1 and deltaDisturbance is 200x1.

Currently this is taking about 0.6s (as returned by BENCH), while in MATLAB it's taking 0.006s.

I've separated out the expression, and the terms

(Hcl.transpose()*Qblock*Pclop + Hclu.transpose()*Rblock*Pcluop + Hclnp.transpose()*P*Pclopnp)*stateTarget

(Hcl.transpose()*Qblock*Dcl + Hclu.transpose()*Rblock*Dclu + Hclnp.transpose()*P*Dclnp)*deltaDisturbance

seem to be the culprits, though I'm not entirely sure why.

Regards, Avi | 
|   Registered Member   
 | 
Okay, after a bit more investigation I have found that a slight code revision makes the calculation around 30x faster:

------------------------------------------------------------
// Slow code
Lc1.noalias() = 2*((predictionModel.Hcl.transpose()*predictionModel.Qblock*predictionModel.Pcl + predictionModel.Hclu.transpose()*predictionModel.Rblock*predictionModel.Pclu + Hclnp.transpose()*predictionModel.P*Pclnp)*scaledState
    + (predictionModel.Hcl.transpose()*predictionModel.Qblock*predictionModel.Pclop + predictionModel.Hclu.transpose()*predictionModel.Rblock*predictionModel.Pcluop + Hclnp.transpose()*predictionModel.P*Pclopnp)*stateTarget
    + (predictionModel.Hcl.transpose()*predictionModel.Qblock*predictionModel.Dcl + predictionModel.Hclu.transpose()*predictionModel.Rblock*predictionModel.Dclu + Hclnp.transpose()*predictionModel.P*Dclnp)*deltaDisturbance);

// Split Lc calculation for speed
temp1 = predictionModel.Hcl.transpose()*predictionModel.Qblock*predictionModel.Pcl + predictionModel.Hclu.transpose()*predictionModel.Rblock*predictionModel.Pclu + Hclnp.transpose()*predictionModel.P*Pclnp;
temp1 *= scaledState;
temp2 = predictionModel.Hcl.transpose()*predictionModel.Qblock*predictionModel.Pclop + predictionModel.Hclu.transpose()*predictionModel.Rblock*predictionModel.Pcluop + Hclnp.transpose()*predictionModel.P*Pclopnp;
temp2 *= stateTarget;
temp3 = predictionModel.Hcl.transpose()*predictionModel.Qblock*predictionModel.Dcl + predictionModel.Hclu.transpose()*predictionModel.Rblock*predictionModel.Dclu + Hclnp.transpose()*predictionModel.P*Dclnp;
temp3 *= deltaDisturbance;
temp1 += temp2 + temp3;
temp1 *= 2;
------------------------------------------------------------

I don't understand why the second case is faster. Would you be able to give me some insight? | 
|   Moderator   
 | 
I don't understand how the second version could be faster, but let me give you some hints. First, when using vectors (400x1), make sure you are using a vector type such as VectorXd instead of a matrix type. Second, pay attention to the evaluation order of operator*. For instance, A*B*v is much slower than A*(B*v) when A and B are matrices and v is a vector: A*B*v is evaluated left to right, so it first forms the matrix-matrix product A*B, whereas A*(B*v) performs only two matrix-vector products and therefore involves far fewer operations.
						 | 