Registered Member
|
I was trying to simplify the test-case, and I ran into an unrelated (I think) slowdown.
Just do "mu_vK.dot(P * vK);" in the test, and eigen dev takes about twice as long as v2.
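In plain loops, that expression amounts to the bilinear form mu^T * P * v. A small sketch of the work involved (an illustration only, not Eigen code):

```cpp
#include <cstddef>
#include <vector>

// Illustrative plain-C++ sketch (not Eigen code) of what mu_vK.dot(P * vK)
// computes: the bilinear form mu^T * P * v. Accumulating row by row fuses
// the matrix-vector product with the dot product, so no temporary vector
// is ever materialized.
double bilinear_form(const std::vector<std::vector<double>>& P,
                     const std::vector<double>& mu,
                     const std::vector<double>& v) {
    double acc = 0.0;
    for (std::size_t i = 0; i < P.size(); ++i) {
        double row = 0.0;                       // (P * v)(i), computed on the fly
        for (std::size_t j = 0; j < P[i].size(); ++j)
            row += P[i][j] * v[j];
        acc += mu[i] * row;                     // dot with mu as we go
    }
    return acc;
}
```

Whether the `P * vK` part is materialized into a temporary or fused away like this is exactly the kind of decision the product dispatcher makes, which is presumably where the v2/dev difference comes from.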
|
Moderator
|
OK, I've updated the product dispatcher to use the unrolled path for hybrid small/large outer products. That fixes the issue you had with:
Here it still creates a temporary, but that's not what is causing the slowdown. Ideally we would need a top-down evaluation mechanism (instead of our current bottom-up one), such that we could automatically evaluate expressions like these without a temporary:

P.noalias() -= lr_P * mu_vJ * vJ.transpose();
P.noalias() -= lr_P * mu_vK * vK.transpose();

And this for all kinds of products, not only outer ones...

Regarding the problem with diagonal, I would suggest overloading ProductBase::diagonal() to return the correct expression, i.e., the coefficient-based product nested in a diagonal expression. In the same vein I also plan to overload ProductBase::block(): it would return a product expression with block expressions applied to its right and left hand sides: (A*B).block() => A.block() * B.block(). Again, with a top-down evaluator we could handle such cases much more easily...

I plan to present and discuss that idea at the meeting. |
Moderator
|
ok, the diagonal stuff is committed.
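The idea behind the committed change can be sketched in plain C++ loops (my own illustration of the principle, not Eigen's actual implementation): the diagonal of a product can be read off coefficient-wise without forming the full product.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the idea behind overloading ProductBase::diagonal() (an
// illustration, not Eigen's implementation): diag(A*B)(i) = sum_k
// A(i,k) * B(k,i), so only the coefficients that land on the diagonal
// are ever computed.
std::vector<double> product_diagonal(const std::vector<std::vector<double>>& A,
                                     const std::vector<std::vector<double>>& B) {
    std::vector<double> d(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t k = 0; k < B.size(); ++k)
            d[i] += A[i][k] * B[k][i];   // skip every off-diagonal coefficient
    return d;
}
```

For square n-by-n operands this replaces an O(n^3) full product followed by a diagonal extraction with a single O(n^2) pass.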
|
Registered Member
|
OK, the new version is faster, but still more than three times slower than v2. The temporaries are causing a slowdown (at least on VS.NET; I'll try with g++ in a sec). The profile is striking by now: thanks to your optimizations, most of the "real" work is fading away, making malloc/free loom even larger - they now consume 38% of the overall run-time (!).
I'll post again once I have some more detailed data. |
Moderator
|
Yes, I forgot to mention that in the devel branch we generate far too many unnecessary temporaries for such expressions. We are discussing it on the ML and I think we found a solution. However, here those temporaries and extra copies represent only epsilon, so perhaps malloc is extremely slow on your platform?
Again, let me recall that your expression can advantageously be rewritten like this:

P.noalias() -= (lr_P * mu_vJ).eval() * vJ.transpose();
P.noalias() -= (lr_P * mu_vK).eval() * vK.transpose();

=> no dynamic alloc, and only 4 muls with lr_P instead of 50. |
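In plain loops, the rewrite suggested above amounts to the following (an illustrative sketch, not Eigen internals):

```cpp
#include <cstddef>
#include <vector>

// Plain-loop sketch of the suggested rewrite (an illustration, not Eigen
// internals): P -= (lr * mu) * v^T with the scalar hoisted, so lr
// multiplies the short vector mu once instead of every coefficient of the
// m-by-n outer product, and the scaled mu fits in a small buffer (hence no
// dynamic allocation in Eigen's fixed-size case).
void rank1_update(std::vector<std::vector<double>>& P, double lr,
                  const std::vector<double>& mu, const std::vector<double>& v) {
    std::vector<double> s(mu.size());
    for (std::size_t i = 0; i < mu.size(); ++i)
        s[i] = lr * mu[i];                  // the only multiplications by lr
    for (std::size_t i = 0; i < P.size(); ++i)
        for (std::size_t j = 0; j < P[i].size(); ++j)
            P[i][j] -= s[i] * v[j];         // rank-1 update; lr no longer appears
}
```

Folding lr into the outer product instead would cost one multiplication by lr per coefficient of P's update, which is the "50 instead of 4" mentioned above.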
Registered Member
|
Compiler Choice
Well, I tried g++, and the results are surprising (and inspiring?). I tried 4.4.1 for x86 and 4.4.3 for x64 (cross-compiling resulted in weird compiler errors, unfortunately, and I didn't feel like compiling gcc itself with appropriate configure options). The x64 variants were easily faster; also, to get optimal performance on x86 I really needed to tweak the compiler flags, whereas x64 performed best with just -march=core2 -O2 (with -m64 being implied).
There was essentially no variation in runtime between runs. So, the compiler really matters (and yes, I did define NDEBUG in all cases, and I fiddled with various optimization settings for cl; 2.91 was the best I could get).

Dev branch vs. v2

Now, without your suggestion (to rewrite the update to P) the same benchmark takes about 5.5 seconds on gcc, and about 6.5 on cl (slower with EIGEN_DONT_VECTORIZE). With your suggestion to rewrite the P update to avoid temporaries I get a compilation error in MS's cl:
Though it works fine in gcc, the following variant performs the same on gcc and at least works on cl:
I get the following results:
So, the dev version is still quite a bit slower; on cl, malloc/free are still using 28% of the time (cl + eigen's vectorization), compared to less than 0.1% with eigen v2. |
Registered Member
|
Just FYI, Gael, who is a genius, just committed large fixes to the devel branch. Could you please retry now?
Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list! |
Registered Member
|
I definitely will. Right now, I'm in the middle of converting my builds to cmake to ease the pain of dealing with both visual studio and the gnu toolchain, so it might take me a while until I've got it all running satisfactorily again. Incidentally, the performance fix has no urgency for me - I just wanted to help identify room for improvement to make eigen3 even better than 2 (which is great, thank you!). |
Registered Member
|
Hi again, I've retried the tests and the new version is much better!
These were the most extreme samples, but I've got a few more; the full program still runs a bit slower on eigen3 than eigen2 - would it be useful to you to extract more testcases (if I can pinpoint the cause in the first place)?

Detailed Results

To reiterate, the code that's being looped 5'000'000 times:
Results:
runs without +eigenVec were compiled with -DEIGEN_DONT_VECTORIZE |
Registered Member
|
Incidentally, this variant:
Doesn't compile in MSC (can't deduce template argument), and is actually slower for gcc. |
Registered Member
|
Whenever you can make a short, self-contained, compilable program that runs slower with eigen3 than eigen2, I'd say that's very interesting!
|
Registered Member
|
Ok here's another one.
Results in seconds (64-bit)
Particularly the vectorized versions perform poorly. Full test case:
I define the symbol EIGEN3 in the compile parameters depending on the test case. |
Moderator
|
Ouch! Again, the problem was simply a bad product-selection rule. It should be better now. Thanks a lot for reporting those problems!
|
Registered Member
|
OK, that looks better!
The last patch did degrade eigen3 performance without vectorization - before the patch, eigen3 + EIGEN_DONT_VECTORIZE on mingw took 1.39 seconds; now it takes 2.4 seconds. On the other hand, it got faster on msc, so maybe it's just triggering some slightly different optimizations this time round. Full output:
Nice! |
Registered Member
|
I think I have two more test cases. One of them's a little tricky to isolate, so I'll wait till next week to take a look. This one isn't very big, and maybe it's due to compiler settings, but at least it's easy to isolate.
The slowdown involves plain non-aliased subtraction of VectorXd instances. The performance profile's a little weird on this one, though:
The eigen2 code without vectorization on mingw is the fastest. In particular, the msc slowdown is large. From what I can see in the profiler (which at least shows something now), ei_assign_impl::run isn't being inlined by MSC. I haven't looked into what dst.template copyPacket does, but that's where the slowdown is anyhow. Ah, I almost forgot the testcase...
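The shape of the slowdown described here can be sketched as follows (a hypothetical minimal reproduction - the actual testcase is not preserved in this post):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal shape of the benchmark described above (the real
// testcase is not preserved here): repeated non-aliased subtraction of
// double vectors. Per coefficient, VectorXd's operator-= boils down to
// this loop; if the compiler fails to inline the assignment kernel, each
// (packet) copy turns into a function call and call overhead dominates.
void sub_inplace(std::vector<double>& a, const std::vector<double>& b) {
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] -= b[i];
}
```

A loop this trivial should compile to straight-line (ideally vectorized) code, which is why a failure to inline shows up so dramatically in the profile.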
Last edited by EamonNerbonne on Thu Feb 25, 2010 11:15 pm, edited 2 times in total.
|