This forum has been archived. All content is frozen. Please use KDE Discuss instead.

NaturalOrdering takes a third of total inversion time

haavind (Registered Member, Posts: 13, Karma: 0)
I have a sparse LDLT solver of this type:
Eigen::SimplicialLDLT<Eigen::SparseMatrix<double,Eigen::ColMajor>, Eigen::Upper, Eigen::NaturalOrdering<int>>

My matrix is already in nearly optimal form, so reordering it is just a waste of time. Yet when profiling my code, the 'ordering' step takes 32% of the total 'compute' time.
NaturalOrdering is supposed to be just a no-op, right?

Two steps consume the time inside 'ordering':
internal::permute_symm_to_symm and internal::permute_symm_to_fullsymm.

Is this as intended?
ggael (Moderator, Posts: 3447, Karma: 19)
Indeed, with NaturalOrdering the current code performs many useless operations. Here is a patch which should significantly speed up the computation. It works best if your input is column-major and its upper triangular part is stored:
diff --git a/Eigen/src/SparseCholesky/SimplicialCholesky.h b/Eigen/src/SparseCholesky/SimplicialCholesky.h
--- a/Eigen/src/SparseCholesky/SimplicialCholesky.h
+++ b/Eigen/src/SparseCholesky/SimplicialCholesky.h
@@ -602,29 +602,42 @@ public:
     bool m_LDLT;
 };
 
 template<typename Derived>
 void SimplicialCholeskyBase<Derived>::ordering(const MatrixType& a, CholMatrixType& ap)
 {
   eigen_assert(a.rows()==a.cols());
   const Index size = a.rows();
-  // Note that amd compute the inverse permutation
+  // Note that ordering methods compute the inverse permutation
+  if(!internal::is_same<OrderingType,NaturalOrdering<Index> >::value)
   {
-    CholMatrixType C;
-    C = a.template selfadjointView<UpLo>();
+    {
+      CholMatrixType C;
+      C = a.template selfadjointView<UpLo>();
+     
+      OrderingType ordering;
+      ordering(C,m_Pinv);
+    }
+
+    if(m_Pinv.size()>0) m_P = m_Pinv.inverse();
+    else                m_P.resize(0);
     
-    OrderingType ordering;
-    ordering(C,m_Pinv);
+    ap.resize(size,size);
+    ap.template selfadjointView<Upper>() = a.template selfadjointView<UpLo>().twistedBy(m_P);
   }
-
-  if(m_Pinv.size()>0)
-    m_P = m_Pinv.inverse();
   else
+  {
+    m_Pinv.resize(0);
     m_P.resize(0);
-
-  ap.resize(size,size);
-  ap.template selfadjointView<Upper>() = a.template selfadjointView<UpLo>().twistedBy(m_P);
+    if(UpLo==Lower)
+    {
+      ap.resize(size,size);
+      ap.template selfadjointView<Upper>() = a.template selfadjointView<UpLo>().twistedBy(m_P);
+    }
+    else
+      ap = a.template triangularView<Upper>();
+  } 
 }
 
 } // end namespace Eigen
 
 #endif // EIGEN_SIMPLICIAL_CHOLESKY_H


In theory we could even avoid a copy in some ideal cases, but I doubt that this copy is significant. Please tell us how well this patch performs for you.
haavind (Registered Member, Posts: 13, Karma: 0)
I tried the patch, and there was a notable speedup. The copying was not entirely insignificant, though, so it is worthwhile to remove that step as well. My intuition agrees with yours that the copy shouldn't be an issue, but this may be a borderline case: the matrix is roughly at the size where an unnecessary copy could push data out of the L2 cache.

The strange thing is that factorize_preordered saw a ~25% reduction in run time with this fix. I'm not sure how to interpret that result; the one change shouldn't really affect the other.
ggael (Moderator, Posts: 3447, Karma: 19)
Profiling is often very tricky. Could you export (see http://eigen.tuxfamily.org/dox-devel/gr ... tml#title3), zip, and send me your matrix so that I can try to reproduce the issue? Thank you.
haavind (Registered Member, Posts: 13, Karma: 0)
Test matrix sent by email.

I did another test where I skipped the ordering and copying entirely. My results, as obtained with the Very Sleepy profiler, are:

Original code:
- compute = 3.9 s
-- factorize = 2.2 s
-- ordering = 1.3 s
-- analyze = 0.43 s

fix:
- compute = 2.6 s
-- factorize = 1.6 s
-- ordering = 0.5 s
-- analyze = 0.54 s

fix v2 w/o copy:
- compute = 1.9 s
-- factorize = 1.4 s
-- ordering = NA
-- analyze = 0.56 s

Note that these figures are for several iterations with a profiler running in the background, not the time for a single inversion, so they are only useful for relative comparison.
I did some wall-time measurements as well, but it is hard to be very quantitative about them, since the LDLT inversion is a fairly small part of the entire runtime and only matters because it is a bottleneck in a highly parallel section. That said, the wall times agree roughly with the profiler.
ggael (Moderator, Posts: 3447, Karma: 19)
Yes, I also observed a non-negligible overhead from the deep copy. Fixed in the devel branch:

https://bitbucket.org/eigen/eigen/commits/dafda2a707af/

