This forum has been archived. All content is frozen. Please use KDE Discuss instead.

Performance degradation: Updating from eigen2 to devel

EamonNerbonne (Registered Member, Posts: 36, Karma: 0)
I thought I'd check out the devel branch to see how it compares. In the performance sensitive segment, the app I'm working on deals purely with dense matrices and vectors of up to 50 dimensions.

Most parts are faster, which is nice, but some parts are slower, and those parts are so much slower that the overall result is several times slower. This seems to be caused by the devel branch's use of temporaries where none are needed. Such expressions are much slower for two apparent reasons.

In particular, I've got several code segments that look as follows:

matM = [...] + vecA * vecA0.transpose() + [...];

or some variant thereof (the vec? variables are column vectors, matM is an appropriately sized pre-allocated matrix).

In eigen2, I marked the multiplication as .lazy(), since there's no aliasing and the multiplication is very cheap. In -devel, I marked matM as .noalias(). Using an intermediate here is expensive primarily because of memory allocation. On the other hand, inline evaluation doesn't actually recompute anything unnecessarily: even without an intermediate, the product of element i of vecA and element j of vecA0 is used only once, namely to fill element (i,j) of matM.
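To make the "each product is used once" point concrete, here is a plain-C++ sketch (the function name and row-major layout are my own illustration, not Eigen code) of accumulating an outer product directly into the destination, with no temporary:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch: accumulate the outer product a * b^T directly into a
// row-major dense matrix. Each product a[i]*b[j] is computed exactly once and
// lands in exactly one destination coefficient, so no temporary is needed.
void add_outer_product(std::vector<double>& mat,      // rows*cols, row-major
                       const std::vector<double>& a,  // length rows
                       const std::vector<double>& b)  // length cols
{
    const std::size_t rows = a.size(), cols = b.size();
    assert(mat.size() == rows * cols);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            mat[i * cols + j] += a[i] * b[j];
}
```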

As a result, for the overall application, malloc+free time is 13.5% of the total in the eigen-devel variant of the program, but only 3.4% in the original eigen2-based program, which in CPU time is roughly a factor-12 increase (the total running time of the eigen-devel variant was longer).

The second reason such intermediate-introducing code segments are slower is that the multiplication expressions are sometimes not even fully used. As part of a matrix normalization, there's a line:

double scale = 1.0 / ( (matM.transpose() * matM).lazy().diagonal().sum());

With the devel branch, .lazy() falls away, and the entire product matM^T * matM is computed even though only the diagonal is needed!

Now, I could work around the second problem by manually computing the sum of dot products of the corresponding rows and columns, and I could mitigate the first problem by accepting the intermediates but preallocating some temp variable (which is probably still slower, however).
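For what it's worth, the manual workaround can avoid the product entirely: the diagonal of matM^T * matM holds the squared norms of matM's columns, so its sum is just the sum of all squared coefficients of matM (which, if I'm not mistaken, is what Eigen's squaredNorm() computes). A plain-C++ sketch of the identity, with an illustrative row-major layout and a hypothetical function name:

```cpp
#include <cstddef>
#include <vector>

// Sketch of the workaround: trace(M^T * M) equals the sum of squared
// coefficients of M, so the diagonal sum can be computed in a single pass
// with no matrix product and no temporary allocation.
double diagonal_sum_of_gram(const std::vector<double>& mat, // rows*cols, row-major
                            std::size_t rows, std::size_t cols)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < rows * cols; ++i)
        sum += mat[i] * mat[i];
    return sum; // equals (M.transpose() * M).diagonal().sum()
}
```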

Is there some alternative to force lazy evaluation where .noalias() doesn't quite cover the intent? Improved heuristics might cover it too; for my cases, .diagonal() on a product expression should suggest lazy evaluation, and a column vector times a row vector obviously doesn't need an intermediate.

Everything else compiled with almost no changes on my part (basically replacing lazy with noalias) and simply runs about 20% faster, so the -devel branch has some juicy stuff, it looks like!
EamonNerbonne (Registered Member, Posts: 36, Karma: 0)
I also checked the mailing list:
http://listengine.tuxfamily.org/lists.t ... 00039.html

and that heuristic doesn't seem to hold for small matrices and vectors; lazy() outperforms eager evaluation there (by quite a margin, eventually).

So, I'm slightly worried to see lazy go...
jitseniesen (Registered Member, Posts: 204, Karma: 2)
Thanks for the feedback. As that thread hints at, a workaround for the first problem which (I believe) does not allocate a temporary is to rewrite

matM = [...] + vecA * vecA0.transpose() + [...];

as

matM = [...];
matM += vecA * vecA0.transpose();
matM += [...];

You may need a .noalias() on the first += statement. But this solution is not ideal.
bjacob (Registered Member, Posts: 658, Karma: 3)
Indeed it's not ideal: you're then traversing the arrays N times instead of just once. Traversing arrays only once is another important benefit of lazy evaluation.
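To illustrate the single-traversal point (purely a sketch; the names and row-major layout are hypothetical, not Eigen's implementation), a fused loop writes each destination coefficient exactly once, whereas the split form above walks the destination once per statement:

```cpp
#include <cstddef>
#include <vector>

// Illustrative comparison: splitting "mat = pre + a*b^T + post" into three
// statements traverses the destination three times; this fused loop touches
// each destination coefficient exactly once.
void fused_update(std::vector<double>& mat,        // rows*cols, row-major
                  const std::vector<double>& pre,  // same size as mat
                  const std::vector<double>& a,    // length rows
                  const std::vector<double>& b,    // length cols
                  const std::vector<double>& post) // same size as mat
{
    const std::size_t cols = b.size();
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < cols; ++j)
            mat[i * cols + j] = pre[i * cols + j] + a[i] * b[j] + post[i * cols + j];
}
```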

The performance regression that you're mentioning is possibly related to the one that was just reported to the list:
http://listengine.tuxfamily.org/lists.t ... 00300.html

So, as you can see, you're not the only one to be concerned about it. It is definitely a simple bug, something that will soon be fixed.

Don't worry about the removal of lazy(). The noalias() system should perfectly cover your use case; it's just that you hit a bug. This happens with the development branch ;)


Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list!
EamonNerbonne (Registered Member, Posts: 36, Karma: 0)
So, the .diagonal() issue is just a bug, I guess - but you're saying the vec*vec.transpose() issue should indeed evaluate lazily (i.e. without a temporary) without further hints, and that that's also just a bug?

The point is that .noalias() is generally similar in purpose to .lazy(), but sometimes .lazy() had the additional hint that suggested that despite eigen's own defaults a product should be evaluated lazily, not eagerly... That additional hint is now no longer expressible, it seems.

As to the performance regression mentioned in the mailing list thread: it's possible it's related, but that was about ArrayXd. I'm unfamiliar with the details, but as far as I can tell, eigen2 in general seems to prefer eager evaluation of product expressions. I'm unsure about eigen-devel, but if it's similar, then vecA*vecB.transpose() would by that heuristic be evaluated eagerly, despite that rarely being optimal, mostly due to memory allocation overhead.

Honestly though, I don't really understand why the following is apparently true for non-tiny matrices:

D = C + (A*B).eval(); //faster
D = C + (A*B).lazy(); //slower

It strikes me that each coefficient of C + A*B should be evaluable in one loop (so D in three nested loops), and that an intermediate won't help, but nevertheless the non-lazy code is faster when the matrices aren't tiny. Is that due to cache-friendliness?
bjacob (Registered Member, Posts: 658, Karma: 3)
EamonNerbonne wrote:So, the .diagonal() issue is just a bug, I guess - but you're saying the vec*vec.transpose() issue should indeed evaluate lazily (i.e. without temporary) without further hints and that that's also just a bug?

The point is that .noalias() is generally similar in purpose to .lazy(), but sometimes .lazy() had the additional hint that suggested that despite eigen's own defaults a product should be evaluated lazily, not eagerly... That additional hint is now no longer expressible, it seems.


Oh oh, I see now the potential problem.
.noalias() can't prevent evaluation into a temporary for expressions that have the "evaluate before nesting" flag, and that are nested. So in
Code: Select all
result.noalias() = A + B*C;

since the product expression has the "evaluate before nesting" flag, it gets evaluated right away, when it is nested in the sum expression, long before the .noalias() comes into play.

Gael, I'm interested in your opinion. Am I missing a reason why this is not a problem? Note that here we're talking about an outer product of vectors; while in general evaluating products into a temporary is good, that is probably not the case for outer products, right? Or did your measurements show otherwise (which would be really surprising)?

EamonNerbonne wrote:Honestly though, I don't really understand why the following is apparently true for non-tiny matrices:

D = C + (A*B).eval(); //faster
D = C + (A*B).lazy(); //slower


Here too I need to let Gael answer, since he took care of that stuff.


bjacob (Registered Member, Posts: 658, Karma: 3)
So, proposed solution: remove the EvalBeforeNestingBit on outer products?


ggael (Moderator, Posts: 3447, Karma: 19)
hi,

The problem was not about lazy evaluation. The problem was that all kinds of outer products took the outer-product specialization, which is tailored for large matrices. Problem solved now.
EamonNerbonne (Registered Member, Posts: 36, Karma: 0)
Did you merge the fix? I'm still seeing similar performance here; malloc/free are taking 16.5% of the runtime, which is about 30 times more than on stable.

It might be due to the specialization chosen, but the following two lines


P.noalias() -= lr_P * ( mu_vJ * vJ.transpose() + mu_vK * vK.transpose()) ;
double pNormScale = 1.0 / ( (P.transpose() * P).diagonal().sum());

are much slower in the development version than in stable. Both of them do dynamic memory allocation only in the development version, and it's possible that .diagonal() is still over-evaluating; the profiler output is kind of hard to read with all the templating :-).


with:
Vector2d mu_vJ, mu_vK;
VectorXd vJ, vK; //25 rows.
Matrix<double,2, Dynamic> P; //2 rows and 25 columns this run

The compiler is VS2010 beta2.
Hauke (Registered Member, Posts: 109, Karma: 3)
EamonNerbonne wrote:Did you merge the fix? I'm still seeing similar performance here; malloc/free are taking 16.5% of the runtime, which is about 30 times more than on stable.

It might be due to the specialization chosen; but the following two lines


P.noalias() -= lr_P * ( mu_vJ * vJ.transpose() + mu_vK * vK.transpose()) ;
double pNormScale = 1.0 / ( (P.transpose() * P).diagonal().sum());

are much slower in the development version than in stable. Both of them do dynamic memory allocation only in the development version, and it's possible that .diagonal() is still over-evaluating; the profiler output is kind of hard to read with all the templating :-).


with:
Vector2d mu_vJ, mu_vK;
VectorXd vJ, vK; //25 rows.
Matrix<double,2, Dynamic> P; //2 rows and 25 columns this run

The compiler is VS2010 beta2.


Over here, I get the following timings with VS2008 SP1, Release, x64:

eigen devel.: 0.00476292 ms
eigen 2.....: 0.00490333 ms

Code: Select all
#include <Bench/BenchTimer.h>
#include <Eigen/Core>
#include <Eigen/Array>
#include <iostream>

using namespace Eigen;

EIGEN_DONT_INLINE double run_test(
  const Vector2d& mu_vJ, const Vector2d& mu_vK,
  const VectorXd& vJ, const VectorXd& vK,
  const double lr_P,
  Matrix<double,2,Dynamic>& P)
{
#if 1
  P -= lr_P * ( mu_vJ * vJ.transpose() + mu_vK * vK.transpose()).lazy();
  double pNormScale = 1.0 / ( (P.transpose() * P).diagonal().sum());
  return pNormScale;
#else
  P.noalias() -= lr_P * ( mu_vJ * vJ.transpose() + mu_vK * vK.transpose());
  double pNormScale = 1.0 / ( (P.transpose() * P).diagonal().sum());
  return pNormScale;
#endif
}

int main()
{
  Vector2d mu_vJ = Vector2d::Random();
  Vector2d mu_vK = Vector2d::Random();
  VectorXd vJ = VectorXd::Random(25);
  VectorXd vK = VectorXd::Random(25);
  Matrix<double,2,Dynamic> P = Matrix<double,2,Dynamic>::Random(2,25);
  double lr_P = ei_random<double>();

  BenchTimer eigen_timer;

  const int num_runs = 20000;

  double sum = 0.0;
  eigen_timer.start();
  for (int i=0; i<num_runs; ++i) sum += run_test(mu_vJ, mu_vK, vJ, vK, lr_P, P);
  eigen_timer.stop();
  double elapsed_eigen = eigen_timer.value();

  std::cout << "eigen.: " << elapsed_eigen*1000.0/num_runs << " ms" << std::endl;
}



Maybe it's the beta version of VC...

- Hauke
Hauke (Registered Member, Posts: 109, Karma: 3)
Ok, in 32 bit with SSE disabled there might be a problem. When I have time I will investigate. Probably the GCC guys won't be able to reproduce this.

eigen default.: 0.0167268 ms
eigen 2.......: 0.00664499 ms

- Hauke
EamonNerbonne (Registered Member, Posts: 36, Karma: 0)
Hauke wrote:Overe here, I get the following timings with VS2008 SP 1, Release, x64:

eigen devel.: 0.00476292 ms
eigen 2.....: 0.00490333 ms


Maybe it's the beta version of VC...

- Hauke


I upped the iterations to 100k and get pretty much your results (devel: 4.2 us, v2: 4.1 us).

(so it doesn't look like the compiler - and that would surprise me anyhow, since the compiler really shouldn't so dramatically affect malloc/free *usage*, right?)

But the devil's in the details; try the following v2 code:

Code: Select all
  P -=  lr_P * ( (mu_vJ * vJ.transpose()).lazy() + (mu_vK * vK.transpose()).lazy()) ;
  double pNormScale = 1.0 /  (P.transpose() * P).lazy().diagonal().sum();


It's 10 times faster, and the profile is missing the extremely high malloc/free times.
bjacob (Registered Member, Posts: 658, Karma: 3)
Guys, products have the EvalBeforeNestingBit. In 2.0, we could remove it with .lazy(). In the devel branch, we can't remove it. Isn't that the simple explanation of this problem?

Gael, is that correct? Can't we just make that bit conditional, e.g. do not set it in the case of outer products?


bjacob (Registered Member, Posts: 658, Karma: 3)
How about this diff to begin with:

Code: Select all
diff -r 9e0b51ec5fbc Eigen/src/Core/ProductBase.h
--- a/Eigen/src/Core/ProductBase.h      Wed Feb 03 19:20:25 2010 +0100
+++ b/Eigen/src/Core/ProductBase.h      Wed Feb 03 19:21:43 2010 -0500
@@ -42,7 +42,9 @@
     ColsAtCompileTime = ei_traits<Rhs>::ColsAtCompileTime,
     MaxRowsAtCompileTime = ei_traits<Lhs>::MaxRowsAtCompileTime,
     MaxColsAtCompileTime = ei_traits<Rhs>::MaxColsAtCompileTime,
-    Flags = EvalBeforeNestingBit | EvalBeforeAssigningBit,
+    Flags = ei_traits<Lhs>::ColsAtCompileTime == 1
+            ? int(0)
+            : int(EvalBeforeNestingBit | EvalBeforeAssigningBit),
     CoeffReadCost = 0 // FIXME why is it needed ?
   };
 };


Then I don't understand one more thing. Just below that, in ProductBase.h line 53, we have:
Code: Select all
// enforce evaluation before nesting
template<typename Derived, typename Lhs, typename Rhs,int N,typename EvalType>
struct ei_nested<ProductBase<Derived,Lhs,Rhs>, N, EvalType>
{
  typedef EvalType type;
};

I don't remember why this is needed. Can't we just remove it, since the EvalBeforeNestingBit is supposed to take care of doing exactly that? This partial specialization prevents the above patch from taking effect.


EamonNerbonne (Registered Member, Posts: 36, Karma: 0)
bjacob wrote:Guys, products have the EvalBeforeNestingBit. In 2.0, we could remove it with .lazy(). In the devel branch, we can't remove it. Isn't that the simple explanation of this problem?


That's the cause of these slowdowns as far as I can tell.

bjacob wrote:Gael, is that correct? Can't we just make that bit conditional, e.g. do not set it in the case of outer products?

That's one case, but the other is .diagonal(): here the product is a fairly normal matrix product; it's just that the result is mostly discarded anyhow. Setting .lazy() means that only the required coefficients are evaluated (and, as a bonus, avoids malloc/free).

