Registered Member
|
I was trying to simplify the test-case, and I ran into an unrelated (I think) slowdown.
Just do "mu_vK.dot(P * vK);" in the test, and eigen dev takes about twice as long as v2.
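In plain loops, that expression amounts to the bilinear form mu^T * P * v. A small sketch of the work involved (an illustration only, not Eigen code):

```cpp
#include <cstddef>
#include <vector>

// Illustrative plain-C++ sketch (not Eigen code) of what mu_vK.dot(P * vK)
// computes: the bilinear form mu^T * P * v. Accumulating row by row fuses
// the matrix-vector product with the dot product, so no temporary vector
// is ever materialized.
double bilinear_form(const std::vector<std::vector<double>>& P,
                     const std::vector<double>& mu,
                     const std::vector<double>& v) {
    double acc = 0.0;
    for (std::size_t i = 0; i < P.size(); ++i) {
        double row = 0.0;                       // (P * v)(i), computed on the fly
        for (std::size_t j = 0; j < P[i].size(); ++j)
            row += P[i][j] * v[j];
        acc += mu[i] * row;                     // dot with mu as we go
    }
    return acc;
}
```

Whether the `P * vK` part is materialized into a temporary or fused away like this is exactly the kind of decision the product dispatcher makes, which is presumably where the v2/dev difference comes from.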
|
Moderator
|
OK, I've updated the product dispatcher to use the unrolled path for hybrid small/large outer products. That fixes the issue you had with:
Here it still creates a temporary, but that's not what is causing the slowdown. Ideally we would need a top-down evaluation mechanism (instead of our current bottom-up one), such that we could automatically evaluate expressions like these without a temporary:

P.noalias() -= lr_P * mu_vJ * vJ.transpose();
P.noalias() -= lr_P * mu_vK * vK.transpose();

And this for all kinds of products, not only outer ones...

Regarding the problem with diagonal, I would suggest overloading ProductBase::diagonal() to return the correct expression, i.e., the coefficient-based product nested in a diagonal expression. In the same vein I also plan to overload ProductBase::block(): it would return a product expression with block expressions applied to its right and left hand sides: (A*B).block() => A.block() * B.block(). Again, with a top-down evaluator we could handle such cases much more easily...

I plan to present and discuss that idea at the meeting. |
Moderator
|
ok, the diagonal stuff is committed.
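The idea behind the committed change can be sketched in plain C++ loops (my own illustration of the principle, not Eigen's actual implementation): the diagonal of a product can be read off coefficient-wise without forming the full product.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the idea behind overloading ProductBase::diagonal() (an
// illustration, not Eigen's implementation): diag(A*B)(i) = sum_k
// A(i,k) * B(k,i), so only the coefficients that land on the diagonal
// are ever computed.
std::vector<double> product_diagonal(const std::vector<std::vector<double>>& A,
                                     const std::vector<std::vector<double>>& B) {
    std::vector<double> d(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t k = 0; k < B.size(); ++k)
            d[i] += A[i][k] * B[k][i];   // skip every off-diagonal coefficient
    return d;
}
```

For square n-by-n operands this replaces an O(n^3) full product followed by a diagonal extraction with a single O(n^2) pass.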
|
Registered Member
|
OK, the new version is faster, but still more than three times slower than v2. The temporaries are causing a slowdown (at least on VS.NET; I'll try with g++ in a sec). The profile is striking by now: thanks to your optimizations, most of the "real" work is fading away, making malloc/free loom even larger - they now consume 38% of the overall run-time (!).
I'll post again once I have some more detailed data. |
Moderator
|
Yes, I forgot to mention that in the devel branch we generate far too many unnecessary temporaries for such expressions. We are discussing it on the ML and I think we found a solution. However, here those temporaries and extra copies represent only epsilon, so perhaps malloc is extremely slow on your platform?
Again, let me recall that your expression can advantageously be rewritten like this:

P.noalias() -= (lr_P * mu_vJ).eval() * vJ.transpose();
P.noalias() -= (lr_P * mu_vK).eval() * vK.transpose();

=> no dynamic alloc, and only 4 muls with lr_P instead of 50. |
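In plain loops, the rewrite suggested above amounts to the following (an illustrative sketch, not Eigen internals):

```cpp
#include <cstddef>
#include <vector>

// Plain-loop sketch of the suggested rewrite (an illustration, not Eigen
// internals): P -= (lr * mu) * v^T with the scalar hoisted, so lr
// multiplies the short vector mu once instead of every coefficient of the
// m-by-n outer product, and the scaled mu fits in a small buffer (hence no
// dynamic allocation in Eigen's fixed-size case).
void rank1_update(std::vector<std::vector<double>>& P, double lr,
                  const std::vector<double>& mu, const std::vector<double>& v) {
    std::vector<double> s(mu.size());
    for (std::size_t i = 0; i < mu.size(); ++i)
        s[i] = lr * mu[i];                  // the only multiplications by lr
    for (std::size_t i = 0; i < P.size(); ++i)
        for (std::size_t j = 0; j < P[i].size(); ++j)
            P[i][j] -= s[i] * v[j];         // rank-1 update; lr no longer appears
}
```

Folding lr into the outer product instead would cost one multiplication by lr per coefficient of P's update, which is the "50 instead of 4" mentioned above.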
Registered Member
|
Compiler Choice
Well, I tried g++, and the results are surprising (and inspiring?). I tried 4.4.1 for x86 and 4.4.3 for x64 (cross-compiling resulted in weird compiler errors, unfortunately, and I didn't feel like compiling gcc itself with appropriate configure options). The x64 variants were easily faster; also, to get optimal performance on x86 I really needed to tweak the compiler flags, whereas x64 performed best with just -march=core2 -O2 (with -m64 being implied).
There was essentially no variation in runtime between runs. So, the compiler really matters (and yes, I did define NDEBUG in all cases, and I fiddled with various optimization settings for cl; 2.91 was the best I could get).

Dev branch vs. v2

Now, without your suggestion (to rewrite the update to P) the same benchmark takes about 5.5 seconds on gcc, and about 6.5 on cl (slower with EIGEN_DONT_VECTORIZE). With your suggestion to rewrite the P update to avoid temporaries I get a compilation error in MS's cl:
Though it works fine in gcc, the following variant performs the same on gcc and at least works on cl:
I get the following results:
So, the dev version is still quite a bit slower; on cl, malloc/free are still using 28% of the time (cl + eigen's vectorization), compared to less than 0.1% with eigen v2. |
Registered Member
|
Just FYI, Gael, who is a genius, just committed large fixes to the devel branch. Could you please retry now?
Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list! |
Registered Member
|
I definitely will. Right now, I'm in the middle of converting my builds to cmake to ease the pain of dealing with both visual studio and the gnu toolchain, so it might take me a while until I've got it all running satisfactorily again. Incidentally, the performance fix has no urgency for me - I just wanted to help identify room for improvement to make eigen3 even better than 2 (which is great, thank you!). |
Registered Member
|
Hi again, I've retried the tests and the new version is much better!
These were the most extreme samples, but I've got a few more; the full program still runs a bit slower on eigen3 than eigen2 - would it be useful to you to extract more testcases (if I can pinpoint the cause in the first place)?

Detailed Results

To reiterate, the code that's being looped 5'000'000 times:
Results:
runs without +eigenVec were compiled with -DEIGEN_DONT_VECTORIZE |
Registered Member
|
Incidentally, this variant:
Doesn't compile in MSC (can't deduce template argument), and is actually slower for gcc. |
Registered Member
|
Whenever you can make a short, self-contained, compilable program that runs slower with eigen3 than eigen2, I'd say that's very interesting!
|
Registered Member
|
Ok here's another one.
Results in seconds (64-bit)
Particularly the vectorized versions perform poorly. Full test case:
I define the symbol EIGEN3 in the compile parameters depending on the test case. |
Moderator
|
Ouch! Again, the problem was simply a bad product-selection rule. It should be better now. Thanks a lot for reporting those problems!
|
Registered Member
|
OK, that looks better!
The last patch did degrade eigen3 performance without vectorization - before the patch, eigen3 + EIGEN_DONT_VECTORIZE on mingw took 1.39 seconds; now it takes 2.4 seconds. On the other hand, it got faster on msc, so maybe it's just triggering some slightly different optimizations this time round. Full output:
Nice! |
Registered Member
|
I think I have two more test cases. One of them's a little tricky to isolate, so I'll wait till next week to take a look. This one isn't very big, and maybe it's due to compiler settings, but at least it's easy to isolate.
The slowdown involves plain non-aliased subtraction of VectorXd instances. The performance profile's a little weird on this one, though:
The eigen2 code without vectorization on mingw is the fastest. In particular, the msc slowdown is large. From what I can see in the profiler (which at least shows something now), ei_assign_impl::run isn't being inlined by MSC. I haven't looked into what dst.template copyPacket does, but that's where the slowdown is anyhow. Ah, I almost forgot the testcase...
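The shape of the slowdown described here can be sketched as follows (a hypothetical minimal reproduction - the actual testcase is not preserved in this post):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal shape of the benchmark described above (the real
// testcase is not preserved here): repeated non-aliased subtraction of
// double vectors. Per coefficient, VectorXd's operator-= boils down to
// this loop; if the compiler fails to inline the assignment kernel, each
// (packet) copy turns into a function call and call overhead dominates.
void sub_inplace(std::vector<double>& a, const std::vector<double>& b) {
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] -= b[i];
}
```

A loop this trivial should compile to straight-line (ideally vectorized) code, which is why a failure to inline shows up so dramatically in the profile.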
Last edited by EamonNerbonne on Thu Feb 25, 2010 11:15 pm, edited 2 times in total.
|