This forum has been archived. All content is frozen. Please use KDE Discuss instead.

Performance degradation: Updating from eigen2 to devel

EamonNerbonne (Registered Member, Posts: 36, Karma: 0)
I thought I'd check out the devel branch to see how it compares. In the performance sensitive segment, the app I'm working on deals purely with dense matrices and vectors of up to 50 dimensions.

Most parts are faster, which is nice, but some parts are slower, and those parts are so much slower that the overall result is several times slower. This seems to be caused by the devel branch's use of temporaries where none are needed. Such expressions are much slower for two apparent reasons.

In particular, I've got several code segments that look as follows:

matM = [...] + vecA * vecA0.transpose() + [...];

or some variant thereof (the vec? variables are column vectors, matM is an appropriately sized pre-allocated matrix).

In eigen2, I marked the multiplication as .lazy(), since there's no aliasing and the multiplication is very cheap. In -devel, I marked matM as .noalias(). Using an intermediate here is expensive primarily because of memory allocation. On the other hand, inline evaluation doesn't actually recompute anything unnecessarily: even without an intermediate, the product of element i of vecA and element j of vecA0 is used only once, namely to fill element (i,j) of matM.
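To make the "each product is used once" point concrete, here is a plain-C++ sketch (the function name and row-major layout are my own illustration, not Eigen code) of accumulating an outer product directly into the destination, with no temporary:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch: accumulate the outer product a * b^T directly into a
// row-major dense matrix. Each product a[i]*b[j] is computed exactly once and
// lands in exactly one destination coefficient, so no temporary is needed.
void add_outer_product(std::vector<double>& mat,      // rows*cols, row-major
                       const std::vector<double>& a,  // length rows
                       const std::vector<double>& b)  // length cols
{
    const std::size_t rows = a.size(), cols = b.size();
    assert(mat.size() == rows * cols);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            mat[i * cols + j] += a[i] * b[j];
}
```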

As a result, for the overall application, malloc+free time is 13.5% of the total in the eigen-devel variant of the program, but only 3.4% in the original eigen2-based program, which in CPU time is roughly a factor-12 increase (the total running time of the eigen-devel variant was longer).

The second reason such intermediate-introducing code segments are slower is that the multiplication expressions are sometimes not even fully used. As part of a matrix normalization, there's a line:

double scale = 1.0 / ( (matM.transpose() * matM).lazy().diagonal().sum());

With the devel branch, .lazy() falls away, and the entire product matM^T * matM is computed even though only the diagonal is needed!

Now, I could work around the second problem by manually computing the sum of dot products of the corresponding rows and columns, and I could mitigate the first problem by accepting the intermediates but preallocating some temp variable (which is probably still slower, however).
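For what it's worth, the manual workaround can avoid the product entirely: the diagonal of matM^T * matM holds the squared norms of matM's columns, so its sum is just the sum of all squared coefficients of matM (which, if I'm not mistaken, is what Eigen's squaredNorm() computes). A plain-C++ sketch of the identity, with an illustrative row-major layout and a hypothetical function name:

```cpp
#include <cstddef>
#include <vector>

// Sketch of the workaround: trace(M^T * M) equals the sum of squared
// coefficients of M, so the diagonal sum can be computed in a single pass
// with no matrix product and no temporary allocation.
double diagonal_sum_of_gram(const std::vector<double>& mat, // rows*cols, row-major
                            std::size_t rows, std::size_t cols)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < rows * cols; ++i)
        sum += mat[i] * mat[i];
    return sum; // equals (M.transpose() * M).diagonal().sum()
}
```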

Is there some alternative to force lazy evaluation where .noalias() doesn't quite cover the intent? Improved heuristics might cover it too; for my cases, .diagonal() on a product expression should suggest lazy evaluation, and a column vector times a row vector obviously doesn't need an intermediate.

Everything else compiled with almost no changes on my part (basically replacing lazy with noalias) and simply runs about 20% faster, so the -devel branch has some juicy stuff, it looks like!
EamonNerbonne (Registered Member, Posts: 36, Karma: 0)
I also checked the mailing list:
http://listengine.tuxfamily.org/lists.t ... 00039.html

and that heuristic doesn't seem to hold for small matrices and vectors; lazy() outperforms eager evaluation there (by quite a margin, eventually).

So, I'm slightly worried to see lazy go...
jitseniesen (Registered Member, Posts: 204, Karma: 2)
Thanks for the feedback. As that thread hints at, a workaround for the first problem which (I believe) does not allocate a temporary is to rewrite

matM = [...] + vecA * vecA0.transpose() + [...];

as

matM = [...];
matM += vecA * vecA0.transpose();
matM += [...];

You may need a .noalias() on the first += statement. But this solution is not ideal.
bjacob (Registered Member, Posts: 658, Karma: 3)
Indeed it's not ideal: you're then traversing the arrays N times instead of just once. Traversing arrays only once is another important benefit of lazy evaluation.
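To illustrate the single-traversal point (purely a sketch; the names and row-major layout are hypothetical, not Eigen's implementation), a fused loop writes each destination coefficient exactly once, whereas the split form above walks the destination once per statement:

```cpp
#include <cstddef>
#include <vector>

// Illustrative comparison: splitting "mat = pre + a*b^T + post" into three
// statements traverses the destination three times; this fused loop touches
// each destination coefficient exactly once.
void fused_update(std::vector<double>& mat,        // rows*cols, row-major
                  const std::vector<double>& pre,  // same size as mat
                  const std::vector<double>& a,    // length rows
                  const std::vector<double>& b,    // length cols
                  const std::vector<double>& post) // same size as mat
{
    const std::size_t cols = b.size();
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < cols; ++j)
            mat[i * cols + j] = pre[i * cols + j] + a[i] * b[j] + post[i * cols + j];
}
```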

The performance regression that you're mentioning is possibly related to the one that was just reported to the list:
http://listengine.tuxfamily.org/lists.t ... 00300.html

So, as you can see, you're not the only one to be concerned about it. It is definitely a simple bug, something that will soon be fixed.

Don't worry about the removal of lazy(). The noalias() system should perfectly cover your use case; it's just that you hit a bug. This happens with the development branch ;)


Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list!
EamonNerbonne (Registered Member, Posts: 36, Karma: 0)
So, the .diagonal() issue is just a bug, I guess - but you're saying the vec*vec.transpose() issue should indeed evaluate lazily (i.e. without a temporary) without further hints, and that that's also just a bug?

The point is that .noalias() is generally similar in purpose to .lazy(), but sometimes .lazy() had the additional hint that suggested that despite eigen's own defaults a product should be evaluated lazily, not eagerly... That additional hint is now no longer expressible, it seems.

As to the performance regression mentioned in the mailing list thread: it's possible it's related, but that was about ArrayXd. I'm unfamiliar with the details, but as far as I can tell, eigen2 in general seems to prefer eager evaluation of product expressions. I'm unsure about eigen-devel, but if it's similar, then vecA*vecB.transpose() would by that heuristic be evaluated eagerly, despite that rarely being optimal, mostly due to memory allocation overhead.

Honestly though, I don't really understand why the following is apparently true for non-tiny matrices:

D = C + (A*B).eval(); //faster
D = C + (A*B).lazy(); //slower

It strikes me that each coefficient of C + A*B should be evaluable in one loop (so D in three nested loops), and that an intermediate won't help, but nevertheless the non-lazy code is faster when the matrices aren't tiny. Is that due to cache-friendliness?
bjacob (Registered Member, Posts: 658, Karma: 3)
EamonNerbonne wrote:So, the .diagonal() issue is just a bug, I guess - but you're saying the vec*vec.transpose() issue should indeed evaluate lazily (i.e. without temporary) without further hints and that that's also just a bug?

The point is that .noalias() is generally similar in purpose to .lazy(), but sometimes .lazy() had the additional hint that suggested that despite eigen's own defaults a product should be evaluated lazily, not eagerly... That additional hint is now no longer expressible, it seems.


Oh oh, I see now the potential problem.
.noalias() can't prevent evaluation into a temporary for expressions that have the "evaluate before nesting" flag, and that are nested. So in
Code: Select all
result.noalias() = A + B*C;

since the product expression has the "evaluate before nesting" flag, it gets evaluated right away, when it is nested in the sum expression, long before the .noalias() comes into play.

Gael, I'm interested in your opinion. Am I missing a reason why this is not a problem? Note that here we're talking about an outer product of vectors; while in general evaluating products into a temporary is good, that is probably not the case for outer products, right? Or did your measurements show otherwise (which would be really surprising)?

EamonNerbonne wrote:Honestly though, I don't really understand why the following is apparently true for non-tiny matrices:

D = C + (A*B).eval(); //faster
D = C + (A*B).lazy(); //slower


Here too I need to let Gael answer, since he took care of that stuff.


bjacob (Registered Member, Posts: 658, Karma: 3)
So, proposed solution: remove the EvalBeforeNestingBit on outer products?


ggael (Moderator, Posts: 3447, Karma: 19)
hi,

The problem was not about lazy evaluation. The problem was that all kinds of outer products took the outer-product specialization, which is tailored for large matrices. Problem solved now.
EamonNerbonne (Registered Member, Posts: 36, Karma: 0)
Did you merge the fix? I'm still seeing similar performance here; malloc/free are taking 16.5% of the runtime, which is about 30 times more than on stable.

It might be due to the specialization chosen, but the following two lines


P.noalias() -= lr_P * ( mu_vJ * vJ.transpose() + mu_vK * vK.transpose()) ;
double pNormScale = 1.0 / ( (P.transpose() * P).diagonal().sum());

are much slower in the development version than in stable. Both of them do dynamic memory allocation only in the development version, and it's possible that .diagonal() is still over-evaluating; the profiler output is kind of hard to read with all the templating :-).


with:
Vector2d mu_vJ, mu_vK;
VectorXd vJ, vK; //25 rows.
Matrix<double,2, Dynamic> P; //2 rows and 25 columns this run

The compiler is VS2010 beta2.
Hauke (Registered Member, Posts: 109, Karma: 3)
EamonNerbonne wrote:Did you merge the fix? I'm still seeing similar performance here; malloc/free are taking 16.5% of the runtime, which is about 30 times more than on stable.

It might be due to the specialization chosen; but the following two lines


P.noalias() -= lr_P * ( mu_vJ * vJ.transpose() + mu_vK * vK.transpose()) ;
double pNormScale = 1.0 / ( (P.transpose() * P).diagonal().sum());

are much slower in the development version than in stable. Both of them do dynamic memory allocation only in the development version, and it's possible that .diagonal() is still over-evaluating; the profiler output is kind of hard to read with all the templating :-).


with:
Vector2d mu_vJ, mu_vK;
VectorXd vJ, vK; //25 rows.
Matrix<double,2, Dynamic> P; //2 rows and 25 columns this run

The compiler is VS2010 beta2.


Over here, I get the following timings with VS2008 SP1, Release, x64:

eigen devel.: 0.00476292 ms
eigen 2.....: 0.00490333 ms

Code: Select all
#include <Bench/BenchTimer.h>
#include <Eigen/Core>
#include <Eigen/Array>
#include <iostream>

using namespace Eigen;

EIGEN_DONT_INLINE double run_test(
  const Vector2d& mu_vJ, const Vector2d& mu_vK,
  const VectorXd& vJ, const VectorXd& vK,
  const double lr_P,
  Matrix<double,2,Dynamic>& P)
{
#if 1
  P -= lr_P * ( mu_vJ * vJ.transpose() + mu_vK * vK.transpose()).lazy();
  double pNormScale = 1.0 / ( (P.transpose() * P).diagonal().sum());
  return pNormScale;
#else
  P.noalias() -= lr_P * ( mu_vJ * vJ.transpose() + mu_vK * vK.transpose());
  double pNormScale = 1.0 / ( (P.transpose() * P).diagonal().sum());
  return pNormScale;
#endif
}

int main()
{
  Vector2d mu_vJ = Vector2d::Random();
  Vector2d mu_vK = Vector2d::Random();
  VectorXd vJ = VectorXd::Random(25);
  VectorXd vK = VectorXd::Random(25);
  Matrix<double,2,Dynamic> P = Matrix<double,2,Dynamic>::Random(2,25);
  double lr_P = ei_random<double>();

  BenchTimer eigen_timer;

  const int num_runs = 20000;

  double sum = 0.0;
  eigen_timer.start();
  for (int i=0; i<num_runs; ++i) sum += run_test(mu_vJ, mu_vK, vJ, vK, lr_P, P);
  eigen_timer.stop();
  double elapsed_eigen = eigen_timer.value();

  std::cout << "eigen.: " << elapsed_eigen*1000.0/num_runs << " ms" << std::endl;
}



Maybe it's the beta version of VC...

- Hauke
Hauke (Registered Member, Posts: 109, Karma: 3)
Ok, in 32 bit with SSE disabled there might be a problem. When I have time I will investigate. Probably the GCC guys won't be able to reproduce this.

eigen default.: 0.0167268 ms
eigen 2.......: 0.00664499 ms

- Hauke
EamonNerbonne (Registered Member, Posts: 36, Karma: 0)
Hauke wrote:Overe here, I get the following timings with VS2008 SP 1, Release, x64:

eigen devel.: 0.00476292 ms
eigen 2.....: 0.00490333 ms


Maybe it's the beta version of VC...

- Hauke


I upped the iterations to 100k and get pretty much your results (devel: 4.2 us, v2: 4.1 us).

(so it doesn't look like the compiler - and that would surprise me anyhow, since the compiler really shouldn't so dramatically affect malloc/free *usage*, right?)

But the devil's in the details; try the following v2 code:

Code: Select all
  P -=  lr_P * ( (mu_vJ * vJ.transpose()).lazy() + (mu_vK * vK.transpose()).lazy()) ;
  double pNormScale = 1.0 /  (P.transpose() * P).lazy().diagonal().sum();


It's 10 times faster, and the profile is missing the extremely high malloc/free times.
bjacob (Registered Member, Posts: 658, Karma: 3)
Guys, products have the EvalBeforeNestingBit. In 2.0, we could remove it with .lazy(). In the devel branch, we can't remove it. Isn't that the simple explanation of this problem?

Gael, is that correct? Can't we just make that bit conditional, e.g. do not set it in the case of outer products?


bjacob (Registered Member, Posts: 658, Karma: 3)
How about this diff to begin with:

Code: Select all
diff -r 9e0b51ec5fbc Eigen/src/Core/ProductBase.h
--- a/Eigen/src/Core/ProductBase.h      Wed Feb 03 19:20:25 2010 +0100
+++ b/Eigen/src/Core/ProductBase.h      Wed Feb 03 19:21:43 2010 -0500
@@ -42,7 +42,9 @@
     ColsAtCompileTime = ei_traits<Rhs>::ColsAtCompileTime,
     MaxRowsAtCompileTime = ei_traits<Lhs>::MaxRowsAtCompileTime,
     MaxColsAtCompileTime = ei_traits<Rhs>::MaxColsAtCompileTime,
-    Flags = EvalBeforeNestingBit | EvalBeforeAssigningBit,
+    Flags = ei_traits<Lhs>::ColsAtCompileTime == 1
+            ? int(0)
+            : int(EvalBeforeNestingBit | EvalBeforeAssigningBit),
     CoeffReadCost = 0 // FIXME why is it needed ?
   };
 };


Then I don't understand one more thing. Just below that, in ProductBase.h line 53, we have:
Code: Select all
// enforce evaluation before nesting
template<typename Derived, typename Lhs, typename Rhs,int N,typename EvalType>
struct ei_nested<ProductBase<Derived,Lhs,Rhs>, N, EvalType>
{
  typedef EvalType type;
};

I don't remember why this is needed. Can't we just remove it, since the EvalBeforeNestingBit is supposed to take care of doing exactly that? This partial specialization prevents the above patch from taking effect.


EamonNerbonne (Registered Member, Posts: 36, Karma: 0)
bjacob wrote:Guys, products have the EvalBeforeNestingBit. In 2.0, we could remove it with .lazy(). In the devel branch, we can't remove it. Isn't that the simple explanation of this problem?


That's the cause of these slowdowns as far as I can tell.

bjacob wrote:Gael, is that correct? Can't we just make that bit conditional, e.g. do not set it in the case of outer products?

That's one case, but the other is .diagonal(): here the product is a fairly normal matrix product; it's just that the result is mostly discarded anyhow. Setting .lazy() means that only the required coefficients are evaluated (and, as a bonus, avoids malloc/free).

