Registered Member
I thought I'd check out the devel branch to see how it compares. In its performance-sensitive segment, the app I'm working on deals purely with dense matrices and vectors of up to 50 dimensions.
Most parts are faster, which is nice, but some parts are slower - and those parts are so much slower that the overall result is several times slower. This seems to be caused by the devel branch's use of temporaries where none are needed. Such expressions are much slower for, apparently, two reasons.

In particular, I've got several code segments that look as follows:

    matM = [...] + vecA * vecA0.transpose() + [...];

or some variant thereof (the vec? variables are column vectors, matM is the appropriately sized pre-allocated matrix). In eigen2, I marked the multiplication as .lazy(), since there's no aliasing and since the multiplication is very cheap. In -devel, I marked matM as .noalias().

Using an intermediate here is expensive primarily because of memory allocation. On the other hand, inline evaluation doesn't actually recompute anything unnecessarily anyhow: even without an intermediate, the product of element i of vecA and element j of vecA0 is used only once, namely to fill element (i,j) of matM. As a result, for the overall application, malloc+free time is 13.5% of total in the eigen-devel variant of the program, but only 3.4% of total in the original eigen2-based program - which in CPU time is a factor-12 increase, since the total running time of the eigen-devel variant was longer.

The second reason such intermediate-introducing code segments are slower is that the multiplication expressions are sometimes not completely used anyhow. As part of a matrix normalization, there's a line:

    double scale = 1.0 / ( (matM.transpose() * matM).lazy().diagonal().sum() );

With the devel branch, lazy falls away, and the entire product matM^T * matM is computed even though only the diagonal is needed!

Now, I could work around the second problem by manually computing the sum of dot products of rows and columns, and I could mitigate the first problem by accepting the intermediates but preallocating some temp variable (which is probably still slower, however). Is there some alternative to force lazy evaluation where .noalias() doesn't quite cover the intent? Improved heuristics might cover it too; for me that would just be: .diagonal() on an expression should suggest lazy evaluation, and obviously a column vector times a row vector doesn't need to generate an intermediate.

Everything else compiled with almost no changes on my part (replacing lazy with noalias, basically) and simply runs about 20% faster, so the -devel branch has some juicy stuff, it looks like!
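For reference, here are the two spellings side by side as a minimal compilable sketch (devel-branch style API; the size and the otherTerm placeholder are made up by me):

    #include <Eigen/Dense>
    using namespace Eigen;

    int main() {
        const int n = 50;                                 // made-up size
        MatrixXd otherTerm = MatrixXd::Random(n, n);      // stand-in for the [...] terms
        VectorXd vecA  = VectorXd::Random(n);
        VectorXd vecA0 = VectorXd::Random(n);
        MatrixXd matM(n, n);

        // Devel-branch spelling: .noalias() promises the destination does not
        // alias the right-hand side; the open question is whether the outer
        // product still gets materialized into a temporary inside the sum.
        matM.noalias() = otherTerm + vecA * vecA0.transpose() + otherTerm;

        // Eigen 2 spelling, for comparison (does not compile against devel):
        //   matM = otherTerm + (vecA * vecA0.transpose()).lazy() + otherTerm;

        // The normalization line: only the diagonal of matM^T * matM is
        // needed, yet the full n-by-n product may get computed.
        double scale = 1.0 / (matM.transpose() * matM).diagonal().sum();
        return scale > 0 ? 0 : 1;                         // keep the result observable
    }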
Registered Member
I also checked the mailing list:
http://listengine.tuxfamily.org/lists.t ... 00039.html and that heuristic doesn't seem to hold for small matrices and vectors; lazy() outperforms eager there (by quite a margin, eventually). So, I'm slightly worried to see lazy go...
Registered Member
Thanks for the feedback. As that thread hints at, a workaround for the first problem which (I believe) does not allocate a temporary is to write

    matM = [...] + vecA * vecA0.transpose() + [...];

as

    matM = [...];
    matM += vecA * vecA0.transpose();
    matM += [...];

You may need a noalias() in the first += statement. But this solution is not ideal.
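A compilable version of that rewrite might look like this (a sketch; the size and the otherTerm placeholder are made up, with the noalias() placed on the outer-product +=):

    #include <Eigen/Dense>
    using namespace Eigen;

    int main() {
        const int n = 50;
        MatrixXd otherTerm = MatrixXd::Random(n, n);    // stand-in for the [...] terms
        VectorXd vecA  = VectorXd::Random(n);
        VectorXd vecA0 = VectorXd::Random(n);
        MatrixXd matM(n, n);

        matM = otherTerm;                               // first summand
        matM.noalias() += vecA * vecA0.transpose();     // outer product, intended without a temporary
        matM += otherTerm;                              // remaining summands
        return 0;
    }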
Registered Member
Indeed it's not ideal: you're then traversing the arrays N times instead of just once. Traversing arrays only once is another important benefit of lazy evaluation.
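To spell out that cost in plain C++ (a toy sketch, not Eigen code; sizes made up):

    #include <vector>

    int main() {
        const int n = 50;
        std::vector<double> t1(n * n, 1.0), t2(n * n, 2.0), m(n * n);
        std::vector<double> a(n, 3.0), a0(n, 4.0);

        // Fused (what lazy evaluation enables): one pass over m.
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                m[i * n + j] = t1[i * n + j] + a[i] * a0[j] + t2[i * n + j];

        // Split into separate statements: three passes over m, so every
        // coefficient of m is loaded and stored three times instead of once.
        for (int i = 0; i < n * n; ++i) m[i] = t1[i];
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                m[i * n + j] += a[i] * a0[j];
        for (int i = 0; i < n * n; ++i) m[i] += t2[i];
        return 0;
    }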
The performance regression that you're mentioning is possibly related to the one that was just reported to the list: http://listengine.tuxfamily.org/lists.t ... 00300.html So, as you can see, you're not the only one to be concerned about it. It is definitely a simple bug, something that will soon be fixed. Don't worry about the removal of lazy(): the noalias() system should perfectly cover your use case, it's just that you hit a bug. This happens with the development branch.
Registered Member
So, the .diagonal() issue is just a bug, I guess - but you're saying the vec*vec.transpose() issue should indeed evaluate lazily (i.e. without temporary) without further hints and that that's also just a bug?
The point is that .noalias() is generally similar in purpose to .lazy(), but sometimes .lazy() carried an additional hint: that despite Eigen's own defaults, a product should be evaluated lazily, not eagerly. That additional hint is now no longer expressible, it seems.

As to the performance regression mentioned in the mailing list thread: it's possible it's related. That thread was talking of ArrayXd, however. I'm unfamiliar with the details, but as far as I can tell, eigen2 in general seems to prefer eager evaluation of product expressions. I'm unsure about eigen-devel, but if it's similar, then vecA*vecB.transpose() would by that heuristic be evaluated eagerly, despite that rarely being optimal, mostly due to memory allocation overhead.

Honestly though, I don't really understand why the following is apparently true for non-tiny matrices:

    D = C + (A*B).eval(); // faster
    D = C + (A*B).lazy(); // slower

It strikes me that each coefficient of C + A*B should be evaluable in one loop (so D in three nested loops), and that an intermediate won't help; but nevertheless the non-lazy code is faster if the matrices aren't tiny. Is that due to cache-friendliness?
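To make the question concrete, here are the two strategies as plain-C++ toy loops (made-up sizes; not what either branch actually generates):

    #include <vector>

    int main() {
        const int n = 256;
        std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 1.0), D(n * n);

        // Lazy: each D(i,j) in one pass (ijk order). The inner loop walks a
        // column of B with stride n, which thrashes the cache for large n.
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                double s = C[i * n + j];
                for (int k = 0; k < n; ++k)
                    s += A[i * n + k] * B[k * n + j];
                D[i * n + j] = s;
            }

        // Eager: evaluate T = A*B first in ikj order, so B is read row by row
        // with unit stride; then one cheap extra pass for D = C + T. The
        // temporary costs memory, but the product loop is far more
        // cache-friendly, which can dominate once the matrices aren't tiny.
        std::vector<double> T(n * n, 0.0);
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k) {
                const double aik = A[i * n + k];
                for (int j = 0; j < n; ++j)
                    T[i * n + j] += aik * B[k * n + j];
            }
        for (int i = 0; i < n * n; ++i)
            D[i] = C[i] + T[i];
        return 0;
    }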
Registered Member
Oh oh, I see now the potential problem. .noalias() can't prevent evaluation into a temporary for expressions that have the "evaluate before nesting" flag and that are nested. So in an expression like

    matM.noalias() = [...] + vecA * vecA0.transpose() + [...];

since the product expression has the "evaluate before nesting" flag, it gets evaluated right away, when it is nested in the sum expression, long before the .noalias() comes into play (a toy illustration follows below). Gael, I'm interested in your opinion: am I missing a reason why this is not a problem? Note that here we're talking about an outer product of vectors; while in general evaluating products into a temporary is good, it is probably not the case for outer products, right? Or did your measurements show otherwise (which would be really surprising)?
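Here is a toy model of that mechanism (made-up types; this is not Eigen's actual implementation):

    #include <cstdio>

    // Toy model of "evaluate before nesting" (not Eigen's real internals).
    struct Mat2 { double d[2][2]; };
    struct Vec2 { double d[2]; };

    // Lazy outer-product node: coefficients are computed on demand.
    struct Outer {
        const Vec2& a;
        const Vec2& b;
        double coeff(int i, int j) const { return a.d[i] * b.d[j]; }
    };

    // A sum node that nests the product. Because the product carries the
    // "evaluate before nesting" flag, the sum stores it as a materialized
    // Mat2: the temporary is created here, at nesting time, long before
    // any .noalias() on the destination is even seen.
    struct Sum {
        const Mat2& lhs;
        Mat2 rhsTemp;                              // the unwanted temporary
        Sum(const Mat2& l, const Outer& r) : lhs(l) {
            for (int i = 0; i < 2; ++i)
                for (int j = 0; j < 2; ++j)
                    rhsTemp.d[i][j] = r.coeff(i, j);
        }
        double coeff(int i, int j) const { return lhs.d[i][j] + rhsTemp.d[i][j]; }
    };

    int main() {
        Mat2 m = {};
        Vec2 a = {{1.0, 2.0}};
        Vec2 b = {{3.0, 4.0}};
        Sum s(m, Outer{a, b});              // the product is already evaluated here
        std::printf("%g\n", s.coeff(1, 1)); // any later assignment comes too late
        return 0;
    }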
Here too I need to let Gael answer, since he took care of that stuff.
Registered Member
So, proposed solution: remove the EvalBeforeNestingBit on outer products?
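As a sketch of the idea (toy traits code, not the actual Eigen sources; every name except EvalBeforeNestingBit is made up):

    #include <cstdio>

    // Toy flag machinery (not Eigen's real definitions).
    enum { EvalBeforeNestingBit = 0x1 };

    template <int Rows, int Cols>
    struct ToyMatrix {
        enum { RowsAtCompileTime = Rows, ColsAtCompileTime = Cols };
    };

    // Keep the flag on general products, but drop it when the product is an
    // outer product (column vector times row vector), where a temporary buys
    // nothing: each coefficient of the result is used exactly once.
    template <class Lhs, class Rhs>
    struct ToyProductTraits {
        enum {
            IsOuterProduct = (Lhs::ColsAtCompileTime == 1 && Rhs::RowsAtCompileTime == 1),
            Flags = IsOuterProduct ? 0 : EvalBeforeNestingBit
        };
    };

    int main() {
        typedef ToyMatrix<50, 1>  ColVec;
        typedef ToyMatrix<1, 50>  RowVec;
        typedef ToyMatrix<50, 50> Mat;
        std::printf("outer product flags:   %d\n",
                    (int)ToyProductTraits<ColVec, RowVec>::Flags); // 0: stays lazy
        std::printf("general product flags: %d\n",
                    (int)ToyProductTraits<Mat, Mat>::Flags);       // 1: eval first
        return 0;
    }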
Moderator
hi,
the problem was not about lazy evaluation. The problem was that all kinds of outer products took the outer-product specialization, which is tailored for large matrices. Problem solved now.
Registered Member
Did you merge the fix? I'm still seeing similar performance here; malloc/free are taking 16.5% of runtime, which is about 30 times more than on stable.

It might be due to the specialization chosen, but the following two lines

    P.noalias() -= lr_P * ( mu_vJ * vJ.transpose() + mu_vK * vK.transpose() );
    double pNormScale = 1.0 / ( (P.transpose() * P).diagonal().sum() );

are much slower in the development version than in stable. Both of them do dynamic memory allocation only in the development version, and it's possible that diagonal is still overevaluating; the profiler output is kind of hard to read with all the templating. The variables are declared as:

    Vector2d mu_vJ, mu_vK;
    VectorXd vJ, vK;                 // 25 rows
    Matrix<double, 2, Dynamic> P;    // 2 rows and 25 columns this run

The compiler is VS2010 beta2.
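A possible workaround for the second line, as a sketch (it relies only on the identity that trace(P^T * P) equals the sum of the squared column norms of P):

    #include <Eigen/Dense>
    using namespace Eigen;

    int main() {
        Matrix<double, 2, Dynamic> P = Matrix<double, 2, Dynamic>::Random(2, 25);

        // trace(P^T * P) == sum of squared norms of P's columns, so only the
        // needed coefficients are computed and nothing is heap-allocated.
        double pNormScale = 1.0 / P.colwise().squaredNorm().sum();
        return pNormScale > 0 ? 0 : 1;
    }

That sidesteps both the temporary and the wasted off-diagonal work.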
Registered Member
Over here, I get the following timings with VS2008 SP1, Release, x64:

    eigen devel.: 0.00476292 ms
    eigen 2.....: 0.00490333 ms

Maybe it's the beta version of VC...

- Hauke
Registered Member
Ok, in 32-bit with SSE disabled there might be a problem. When I have time I will investigate. Probably the GCC guys won't be able to reproduce this.

    eigen default.: 0.0167268 ms
    eigen 2.......: 0.00664499 ms

- Hauke
Registered Member
I upped the iterations to 100k and get pretty much your results (devel: 4.2 us, v2: 4.1 us), so it doesn't look like the compiler - and that would surprise me anyhow, since the compiler really shouldn't so dramatically affect malloc/free *usage*, right? But the devil's in the details; try the following v2 code:

It's 10 times faster, and the profile is missing the extremely high malloc/free times.
Registered Member
Guys, products have the EvalBeforeNestingBit. In 2.0, we could remove it with .lazy(). In the devel branch, we can't remove it. Isn't that the simple explanation of this problem?
Gael, is that correct? Can't we just make that bit conditional, e.g. do not set it in the case of outer products?
Registered Member
How about this diff to begin with:
Then I don't understand one more thing. Just below that, in ProductBase.h line 53, we have:
I don't remember why this is needed. Can't we just remove it, since the EvalBeforeNestingBit is supposed to take care of doing exactly that? This partial specialization prevents the above patch from taking effect.
Registered Member
That's the cause of these slowdowns as far as I can tell. That's one case, but the other is .diagonal(): there the product is a fairly normal matrix product; it's just that the result is mostly discarded anyhow. Setting .lazy() means that only the required coefficients are evaluated (and as a bonus, it avoids the malloc/free).
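In other words, for the .diagonal() case, lazy evaluation buys something like the following (a sketch with made-up sizes):

    #include <Eigen/Dense>
    #include <cmath>
    using namespace Eigen;

    int main() {
        MatrixXd A = MatrixXd::Random(25, 25);
        MatrixXd B = MatrixXd::Random(25, 25);

        // Eager: computes all 625 coefficients of A*B, keeps only 25 of them.
        double eager = (A * B).diagonal().sum();

        // Lazy equivalent: (A*B)(i,i) = A.row(i).dot(B.col(i)), so only the
        // 25 diagonal coefficients are evaluated and no temporary is formed.
        double lazy = 0.0;
        for (int i = 0; i < A.rows(); ++i)
            lazy += A.row(i).dot(B.col(i));

        return std::abs(eager - lazy) < 1e-9 ? 0 : 1;  // they agree up to rounding
    }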