Registered Member
I have a sparse ldlt solver of type:
Eigen::SimplicialLDLT<Eigen::SparseMatrix<double,Eigen::ColMajor>, Eigen::Upper, Eigen::NaturalOrdering<int>>

My matrix is already in nearly optimal form, so reordering it is just a waste of time. Yet when profiling my code, the 'ordering' step takes 32% of the total 'compute' time. NaturalOrdering is supposed to be just a no-op, right? Two steps inside 'ordering' consume time: internal::permute_symm_to_symm and internal::permute_symm_to_fullsymm. Is this as intended?
Moderator
Indeed, with NaturalOrdering the current code performs many useless operations. Here is a patch which should significantly speed up computations. It works best if your input is column major and only the upper triangular part is stored.
In theory we could even avoid a copy in some ideal cases, but I doubt that this copy is significant. Please tell us how well this patch performs for you.
Registered Member
I tried the patch, and there was a notable speedup. The copying was not entirely insignificant though, so it is worthwhile to remove that step as well. My intuition agrees with yours that the copying shouldn't be an issue, but this may be a borderline case: the matrix is roughly the size at which an unnecessary copy could push data out of the L2 cache.
The strange thing is that factorize_preordered saw a ~25% reduction in run time with this fix. I'm not sure how to interpret this result; the one shouldn't really affect the other.
Moderator
Profiling is often very tricky. Could you export (see http://eigen.tuxfamily.org/dox-devel/gr ... tml#title3), zip, and send me your matrix so that I can try to reproduce the issue? Thank you.
Moderator
For the record, I applied the patch: https://bitbucket.org/eigen/eigen/commits/f9b6d0d26703/
Registered Member
Test matrix sent by email.
I did another test where I skipped the ordering and copying entirely. My results, as obtained with the VerySleepy profiler, are:

Original code:
- compute = 3.9 s
-- factorize = 2.2 s
-- ordering = 1.3 s
-- analyze = 0.43 s

Fix:
- compute = 2.6 s
-- factorize = 1.6 s
-- ordering = 0.5 s
-- analyze = 0.54 s

Fix v2 w/o copy:
- compute = 1.9 s
-- factorize = 1.4 s
-- ordering = NA
-- analyze = 0.56 s

Note that this is for several iterations with a profiler running in the background, not the time for a single inversion; the numbers are only useful for relative comparison. I did some wall-time measurements as well, but it is hard to be very quantitative about them since the LDLT inversion is a fairly small part of the entire runtime and only becomes important because it is a bottleneck in a highly parallel section. That said, the wall times agree roughly with the profiler.
Moderator
Yes, I also observed a non-negligible overhead from the deep copy. Fixed in the devel branch:
https://bitbucket.org/eigen/eigen/commits/dafda2a707af/