
Performance degradation: Updating from eigen2 to devel

EamonNerbonne (Registered Member)
I was trying to simplify the test-case, and I ran into an unrelated (I think) slowdown.

Just do "mu_vK.dot(P * vK);" in the test, and eigen dev takes about 2 times longer than v2.

Code: Select all
#include <boost/progress.hpp>
#include <iostream>
#include <Eigen/Core>
#include <Eigen/Array>

using namespace Eigen;

int main(int argc, char* argv[])
{
  Vector2d mu_vK = Vector2d::Random();
  VectorXd vK = VectorXd::Random(25);
  Matrix<double,2,Dynamic> P = Matrix<double,2,Dynamic>::Random(2,25);
  const int num_runs = 50000000;
  double sum = 0.0;

  boost::progress_timer t;
  for (int i=0; i<num_runs; ++i) {
    P(i%2, (i/2)%25) = 1.0; // vary an entry of P each iteration to keep the optimizer honest
    sum += mu_vK.dot(P * vK);
  }
  std::cout << sum << std::endl; // print the sum so the loop isn't optimized away
  return 0;
}
ggael (Moderator)
OK, I've updated the product dispatcher to use the unrolled path for hybrid small/large outer products. That fixes the issue you had with:

Code: Select all
P.noalias() -= lr_P * ( mu_vJ * vJ.transpose() + mu_vK * vK.transpose())


Here it still creates a temporary, but that's not causing a slowdown. Ideally we would need a top-down evaluation mechanism (instead of our current bottom-up one), so that we could automatically evaluate such expressions as:

P.noalias() -= lr_P * mu_vJ * vJ.transpose();
P.noalias() -= lr_P * mu_vK * vK.transpose();

And this for all kinds of products, not only outer ones...

Regarding the problem with diagonal, I would suggest overloading ProductBase::diagonal() to return the correct expression, i.e., the coefficient-based product nested in a diagonal expression.

In the same vein I also plan to overload ProductBase::block(): it would return a product expression with block expressions applied to its left- and right-hand sides:

(A*B).block() => A.block() * B.block()

Again, with a top-down evaluator we could handle such cases much more easily... I plan to present and discuss that idea at the meeting.
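
To make that concrete, here is a small user-level sketch (not the actual Eigen internals) of the arithmetic such overloads would save: the i-th diagonal coefficient of A*B is just A.row(i).dot(B.col(i)), and a block of A*B only needs the corresponding rows of A times the corresponding columns of B.

Code: Select all
#include <Eigen/Core>
#include <iostream>
using namespace Eigen;

int main()
{
  MatrixXd A = MatrixXd::Random(2, 25);
  MatrixXd B = MatrixXd::Random(25, 2);

  // Naive: materializes the full 2x2 product, then reads its diagonal.
  Vector2d d_full = (A * B).diagonal();

  // What an overloaded ProductBase::diagonal() could boil down to:
  // only the matching row/column dot products are evaluated.
  Vector2d d_cheap;
  for (int i = 0; i < 2; ++i)
    d_cheap(i) = A.row(i).dot(B.col(i));

  // Same idea for blocks:
  // (A*B).block(i,j,p,q) == A.block(i,0,p,A.cols()) * B.block(0,j,B.rows(),q)

  std::cout << (d_full - d_cheap).norm() << std::endl; // ~0 up to rounding
  return 0;
}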
ggael (Moderator)
OK, the diagonal stuff is committed.
EamonNerbonne (Registered Member)
OK, the new version is faster, but still more than three times slower than v2. The temporaries are causing a slowdown (at least on VS.NET; I'll try with g++ in a sec). The profile is striking by now: thanks to your optimizations, most of the "real" work is fading away, making malloc/free loom even larger, and they now consume 38% of the overall run-time (!).

I'll post again once I have some more detailed data.
ggael (Moderator)
Yes, I forgot to mention that in the devel branch we generate far too many unnecessary temporaries for those expressions. We are discussing it on the ML and I think we have found a solution. However, here those temporaries and extra copies represent only epsilon, so perhaps malloc is extremely slow on your platform?

Again let me recall that your expression can advantageously be rewritten like this:

P.noalias() -= (lr_P * mu_vJ).eval() * vJ.transpose();
P.noalias() -= (lr_P * mu_vK).eval() * vK.transpose();

=> no dynamic alloc, and only 4 muls with lr_P instead of 50.
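
Spelled out as a self-contained test (the shapes, P being 2x25, mu_v* Vector2d, v* VectorXd of size 25, and lr_P a plain double, match what this thread shows; the values are random placeholders, not your actual data):

Code: Select all
#include <Eigen/Core>
#include <iostream>
using namespace Eigen;

int main()
{
  Matrix<double,2,Dynamic> P = Matrix<double,2,Dynamic>::Random(2, 25);
  Matrix<double,2,Dynamic> Q = P;              // copy, to compare both forms
  Vector2d mu_vJ = Vector2d::Random();
  Vector2d mu_vK = Vector2d::Random();
  VectorXd vJ = VectorXd::Random(25);
  VectorXd vK = VectorXd::Random(25);
  double lr_P = 0.01;

  // Original form: lr_P scales the whole 2x25 sum, i.e. 50 multiplies by lr_P.
  P.noalias() -= lr_P * (mu_vJ * vJ.transpose() + mu_vK * vK.transpose());

  // Suggested form: lr_P scales the two 2-vectors first (4 multiplies total),
  // and .eval() yields a plain Vector2d, so each outer product runs without a
  // dynamically allocated temporary.
  Q.noalias() -= (lr_P * mu_vJ).eval() * vJ.transpose();
  Q.noalias() -= (lr_P * mu_vK).eval() * vK.transpose();

  std::cout << (P - Q).norm() << std::endl;    // ~0 up to rounding
  return 0;
}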
EamonNerbonne (Registered Member)
Compiler Choice
Well, I tried g++, and the results are surprising (and inspiring?). I tried 4.4.1 for x86 and 4.4.3 for x64 (cross-compiling resulted in weird compiler errors, unfortunately, and I didn't feel like compiling gcc itself with appropriate configure options). The x64 variants were easily faster; also, to get optimal performance on x86 I really needed to tweak the compiler flags, whereas x64 performed best with just -march=core2 -O2 (with -m64 being implied).

Code: Select all
EIGENV2 + eigen's vectorization on gcc: 1.06 seconds
EIGENV2 + EIGEN_DONT_VECTORIZE  on gcc: 0.94 seconds (>10% faster!)
EIGENV2 + eigen's vectorization on cl : 1.70 seconds (ouch)
EIGENV2 + EIGEN_DONT_VECTORIZE  on cl : 2.91 seconds (OUCH)

There was essentially no variation in runtime between runs.

So, the compiler really matters (and yes, I did define NDEBUG in all cases; I fiddled with various optimization settings for cl, and 2.91 was the best I could get).

Dev branch vs. v2
Now, without your suggestion (to rewrite the update to P) the same benchmark takes about 5.5 seconds on gcc, and about 6.5 on cl (slower with EIGEN_DONT_VECTORIZE).

With your suggestion to rewrite the P update to avoid temporaries I get a compilation error in MS's cl:
Code: Select all
learningBench.cpp(24): error C2784: 'ei_enable_if<!Eigen::ei_is_same_type<Derived::Scalar,Derived::RealScalar>::ret,const Eigen::ScaledProduct<Derived>>::type Eigen::operator *(Derived::RealScalar,const Eigen::ProductBase<Derived,_Lhs,_Rhs> &)' : could not deduce template argument for 'const Eigen::ProductBase<Derived,_Lhs,_Rhs> &' from 'const Eigen::Vector2d'


Though your suggestion works fine in gcc, the following variant performs the same on gcc and at least works on cl:
Code: Select all
    Vector2d tmpJ = lr_P * mu_vJ;
    Vector2d tmpK = lr_P * mu_vK;
    P.noalias() -= tmpJ * vJ.transpose();
    P.noalias() -= tmpK * vK.transpose();
    return 1.0 / ( (P.transpose() * P).diagonal().sum());



I get the following results:
Code: Select all
cl+EIGEN_DONT_VECTORIZE       :  5.28 - 5.33 sec
cl+eigen's vectorization      :  3.37 - 3.41 sec
g++ +eigen's vectorization    :  2.91 - 2.92 sec
g++ -DEIGEN_DONT_VECTORIZE    :  2.73 - 2.74 sec
g++ -O3 -DEIGEN_DONT_VECTORIZE:  2.15 - 2.22 sec (usually on the fast end of that range)


So, the dev version is still quite a bit slower; on cl, malloc/free are still using 28% of the time (cl+eigen's vectorization) compared to less than 0.1% with eigen v2.
bjacob (Registered Member)
Just FYI, Gael, who is a genius, just committed large fixes to the devel branch. Could you please retry now?


Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list!
EamonNerbonne (Registered Member)
bjacob wrote: Just FYI, Gael, who is a genius, just committed large fixes to the devel branch. Could you please retry now?


I definitely will. Right now I'm in the middle of converting my builds to cmake to ease the pain of dealing with both Visual Studio and the GNU toolchain, so it might take me a while until I've got it all running satisfactorily again.

Incidentally, the performance fix has no urgency for me - I just wanted to help identify room for improvement to make eigen3 even better than 2 (which is great, thank you!).
EamonNerbonne (Registered Member)
Hi again, I've retried the tests and the new version is much better!

The results below were the most extreme samples, but I've got a few more; the full program still runs a bit slower on eigen3 than eigen2 - would it be useful to you to extract more test cases (if I can pinpoint the cause in the first place)?

Detailed Results
To reiterate, the code that's being looped 5'000'000 times:
Code: Select all
   P.noalias() -= lr_P * ( mu_vJ * vJ.transpose() + mu_vK * vK.transpose());
   return 1.0 / ( (P.transpose() * P).diagonal().sum());
...or similar for eigen2.
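(The eigen2 variant isn't quoted in this thread; a plausible reconstruction, given that eigen2 has no .noalias(), is a plain -=:)

Code: Select all
   P -= lr_P * ( mu_vJ * vJ.transpose() + mu_vK * vK.transpose());
   return 1.0 / ( (P.transpose() * P).diagonal().sum());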

Results:
Code: Select all
Compilation settings    Eigen 2    Eigen 3
Msc                     2.86       1.30
Msc+eigenVec            1.66       1.25
Mingw                   0.89       0.78
Mingw+eigenVec          1.20       0.71

Runs without +eigenVec were compiled with -DEIGEN_DONT_VECTORIZE.
EamonNerbonne (Registered Member)
Incidentally, this variant:

Code: Select all
   P.noalias() -= (lr_P * mu_vJ).eval() * vJ.transpose();
   P.noalias() -= (lr_P * mu_vK).eval() * vK.transpose();

Doesn't compile in MSC (can't deduce template argument), and is actually slower for gcc.
bjacob (Registered Member)
Whenever you can make a short, self-contained, compilable program that runs slower with eigen3 than eigen2, I'd say that's very interesting!


Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list!
EamonNerbonne (Registered Member)
OK, here's another one.

Code: Select all
// const VectorXd& point, Matrix<double,2,Dynamic>& P
Vector2d P_point = P * point; // slow in eigen3, especially when vectorized


Results in seconds (64-bit)
Code: Select all
Compiler Eigen2 Eigen3
Msc      2.38   3.75   
Msc+v    1.34   4.14
Mingw    2.46   1.39
Mingw+v  1.42   7.11

Particularly the vectorized versions perform poorly.


Full test case:
Code: Select all
#include <boost/progress.hpp>
#include <iostream>
#include <Eigen/Core>
#if !EIGEN3
#include <Eigen/Array>
#endif

using namespace Eigen;
using namespace boost;
using namespace std;

#define DIMS 25

EIGEN_DONT_INLINE
double projectionTestIter(
      const VectorXd& point,
      Matrix<double,2,Dynamic>& P) {
   Vector2d P_point = P*point;
   return P_point.sum();
}

void projectionTest() {
   VectorXd a = VectorXd::Random(DIMS);
   VectorXd b = VectorXd::Random(DIMS);
   Matrix<double,2,Dynamic> P = Matrix<double,2,Dynamic>::Random(2,DIMS);

   progress_timer t(cerr);
   double sum = 0.0;
   const int num_runs = 30000000;
   for (int i=0; i<num_runs; ++i) {
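      // data-dependent branch: keeps the compiler from hoisting or folding the projectionTestIter calls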
      if(num_runs % (i+1) > sum)
         sum -= projectionTestIter(a, P);
      else
         sum += projectionTestIter(b, P);
   }
   cout <<"(" << sum<<") ";
}

int main(int argc, char* argv[]){
   projectionTest();
   return 0;
}


I define the symbol EIGEN3 in the compile parameters depending on the test case.
ggael (Moderator)
Ouch! Again, the problem was simply a bad product selection rule. It should be better now. Thanks a lot for reporting those problems!
EamonNerbonne (Registered Member)
OK, that looks better!
The last patch did degrade eigen3 performance without vectorization: before the patch, eigen3+EIGEN_DONT_VECTORIZE on mingw took 1.39 seconds; now it takes 2.4 seconds. On the other hand, it got faster on msc, so maybe it's just triggering some slightly different optimization this time round.

Full output:
Code: Select all
Compiler Eigen2 Eigen3-old Eigen3-new
Msc      2.38   3.75       2.35
Msc+v    1.34   4.14       1.34
Mingw    2.46   1.39       2.41
Mingw+v  1.42   7.11       1.29


Nice!
EamonNerbonne (Registered Member)
I think I have two more test cases. One of them is a little tricky to isolate, so I'll wait till next week to take a look. This one isn't very big, and maybe it's due to compiler settings, but at least it's easy to isolate ;-).

The slowdown involves plain non-aliased subtraction of VectorXd instances.

The performance profile's a little weird on this one, though:
Code: Select all
Compiler Eigen2  Eigen3
Msc      2.18    3.24       
Msc+v    1.37    1.97(!)
Mingw    0.85(!) 1.60       
Mingw+v  0.99    0.90


The eigen2 code without vectorization on mingw is fastest. In particular, the msc slowdown is large. From what I can see in the profiler (namely, the fact that I can see anything at all), ei_assign_impl::run isn't being inlined by MSC. I haven't looked into what dst.template copyPacket does, but that's where the slowdown is anyhow.

Ah, I almost forgot the testcase...
Code: Select all
#include <boost/progress.hpp>
#include <iostream>
#include <Eigen/Core>
#if !EIGEN3
#include <Eigen/Array>
#endif

using namespace Eigen;
using namespace boost;
using namespace std;

#define DIMS 25
void subtractTest() {
   VectorXd a = VectorXd::Random(DIMS);
   VectorXd b = VectorXd::Random(DIMS);
   VectorXd c = VectorXd::Random(DIMS);

   progress_timer t(cerr);
   double sum = 0.0;
   const int num_runs = 30000000;
   for (int i=0; i<num_runs; ++i) {
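      // alternate which vector is overwritten so successive iterations depend on each other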
#if EIGEN3
      if(i%2==0)
         c.noalias() = a - b;
      else
         b.noalias()  = a - c;
#else
      if(i%2==0)
         c = (a - b).lazy();
      else
         b  = (a - c).lazy();
#endif

   }
   cout <<"(" << c(0) <<") ";
}

int main(int argc, char* argv[]){
   subtractTest();
   return 0;
}

Last edited by EamonNerbonne on Thu Feb 25, 2010 11:15 pm, edited 2 times in total.

