This forum has been archived. All content is frozen. Please use KDE Discuss instead.

Eigen solver performance

kmanuele
Registered Member

Eigen solver performance

Wed Dec 09, 2015 10:30 pm
Some (long-winded) observations on Eigen sparse-matrix performance. This is a specific use case, so it may or may not be illustrative. Comments and questions appreciated.

We have used LDL from Tim Davis (of SuiteSparse fame) in our custom electrical simulator for several years. It is quite fast, but it is offered as demonstration code rather than production-level code. We considered replacing LDL with Eigen's tools, since Eigen uses current C++ standards and is actively supported. Results are below.

Our matrices are very sparse (nnz ~ 4N). RHS is sparse too (nnz ~ 0.2N).
N ranges from a few hundred to > 10,000.

Our sparse matrix is structured as: std::vector<std::map<int, double> > sparseMatrix;

Eigen's is a SparseMatrix<double>.

Both processes are similar:

LDL process:

1) load sparse matrix from the electrical node array, using sparseMatrix[i][j]
2) convert sparseMatrix to traditional CCS for LDL
Then Tim Davis LDL code:
3) ordering, via AMD or CAMD
4) LDL symbolic
5) LDL numeric
6) solve(RHS)


Eigen SimplicialLDLT process:

1) load Eigen SparseMatrix from the node array, using coeffRef(i,j)
2) makeCompressed()
3) analyzePattern()
4) factorize()
5) solve(RHS)

Using the same test arrays and RHS vectors, total times are:

N = 120:
LDL = 0.12 ms
Eigen = 0.18 ms

N = 940:
LDL = 0.84 ms
Eigen = 3.8 ms

For N = 940, loading the Eigen matrix (steps 1 and 2) alone takes 3.2 ms, compared to 0.37 ms for the LDL sparseMatrix[][] process. So if Eigen's load process could be improved, it might be viable for us, though still a bit slower overall.

SparseLU is even slower, and ConjugateGradient is so slow as to be useless (which surprised us).

Ordering:

Ordering takes a big chunk of the solve time. The matrix pattern changes only slightly between iterations in our models, so we improve LDL performance by reusing the ordering permutation, re-ordering only every 100 or so iterations (or when N changes). The improvements for the two LDL cases are significant:

N = 120: 0.067 ms with reused ordering vs 0.12 ms without
N = 940: 0.52 ms vs 0.84 ms

This can't be done (directly) in Eigen, as factorize() often fails unless we run analyzePattern() at every iteration. We are not sure how much of analyzePattern()'s time is ordering, or whether the ordering could be separated out as we do with LDL.

Thanks

Kevin
ggael
Moderator
Re: Eigen solver performance

Thu Dec 10, 2015 8:51 am
Both LDL and Eigen::SimplicialLDLT should be equally fast. Regarding the assembly, it is not very clear what you are actually doing. Does Eigen step 1 correspond to a conversion from your vector<map> to Eigen::SparseMatrix, or do you assemble the Eigen::SparseMatrix directly?

In both cases, naively calling coeffRef without reserving memory is indeed very slow.

In the latter case, if you know in advance the number of nonzeros per column (or an upper bound), then calling spmat::reserve(Eigen::VectorXi::Constant(N,nnz_per_col)) will significantly speed up the process. If the nnz counts are not constant, then fill an Eigen::VectorXi or std::vector<int> with appropriate per-column values. If you do not insert the same value twice, then calling insert instead of coeffRef will be faster. Finally, you might also consider following the process recommended here: http://eigen.tuxfamily.org/dox/group__T ... tml#title3. It is not fully optimal because it does not exploit any knowledge you might have about the order of the nonzeros, but if you are calling coeffRef(i,j) in a totally unstructured manner, then this option might be good as well.

Here is a small bench I've extended with your vector<map>:

Code:
#include <iostream>
#include <Eigen/Sparse>
#include <bench/BenchTimer.h>
#include <vector>

using namespace Eigen;
int main() {
  int n = 970;
  int nnz = 3;

  typedef Triplet<double> T;
  std::vector<T> elements;
  elements.reserve(n*nnz);
  MatrixXd D(n,n); D.setZero();
  for(int j=0; j<n; ++j)
    for(int k=0; k<nnz; ++k)
    {
      int i;
      do {
        i = internal::random<int>(0,n-1);
      } while(D(i,j)!=0);
      D(i,j) = 1;
      elements.push_back(T(i,j,1));
    }


  BenchTimer t1, t2, t3, t4, t5, t6;

  t1.start();
  {
    MatrixXd D(n,n);
    std::vector<T> el;
    el.reserve(n*nnz);
    for(int k=0; k<elements.size(); ++k)
      el.push_back(T(elements[k].row(), elements[k].col(), elements[k].value()));
    SparseMatrix<double> S(n,n);
    S.setFromTriplets(el.begin(), el.end());
  }
  t1.stop();
  std::cout << "setFromTriplets:     " << t1.best()*1000 << "ms\n";

  t2.start();
  {
    MatrixXd D(n,n);
    D.setZero(n,n);
    for(int k=0; k<elements.size(); ++k)
      D(elements[k].row(), elements[k].col()) = elements[k].value();
    SparseMatrix<double> S(n,n);
    S.reserve(n*nnz);
    S = D.sparseView();
    S.makeCompressed();
  }
  t2.stop();
  std::cout << "Dense+sparse_view:   " << t2.best()*1000 << "ms\n";

  t3.start();
  {
    SparseMatrix<double> S(n,n);
    S.reserve(VectorXi::Constant(n,nnz));
    for(int k=0; k<elements.size(); ++k)
      S.insert(elements[k].row(), elements[k].col()) = elements[k].value();
    S.makeCompressed();
  }
  t3.stop();
  std::cout << "reserve+insert:      " << t3.best()*1000 << "ms\n";

  t4.start();
  {
    SparseMatrix<double> S(n,n);
    S.reserve(VectorXi::Constant(n,nnz));
    for(int k=0; k<elements.size(); ++k)
      S.coeffRef(elements[k].row(), elements[k].col()) = elements[k].value();
    S.makeCompressed();
  }
  t4.stop();
  std::cout << "reserve+coeffRef:    " << t4.best()*1000 << "ms\n";

  t5.start();
  {
    SparseMatrix<double> S(n,n);
    for(int k=0; k<elements.size(); ++k)
      S.coeffRef(elements[k].row(), elements[k].col()) = elements[k].value();
    S.makeCompressed();
  }
  t5.stop();
  std::cout << "naive coeffRef:      " << t5.best()*1000 << "ms\n";

  t6.start();
  {
    std::vector<std::map<int,double> > S(n);
    for(int k=0; k<elements.size(); ++k)
      S[elements[k].col()][elements[k].row()] = elements[k].value();
  }
  t6.stop();
  std::cout << "vector<map> only:    " << t6.best()*1000 << "ms\n";
}


Here, with 3 elements per column to get values similar to yours, I get with Eigen 3.2.7:
Code:
setFromTriplets:     0.139267ms
Dense+sparse_view:   3.56398ms
reserve+insert:      0.043682ms
reserve+coeffRef:    0.052446ms
naive coeffRef:      3.5257ms
vector<map> only:    0.38527ms


Note that in Eigen 3.3 (devel), the "naive coeffRef" approach has been significantly optimized:
Code:
naive coeffRef:      0.064089ms



For such extremely sparse matrices, it is not surprising at all that both SparseLU and CG are much slower.

Regarding ordering, you can use a SimplicialLDLT with NaturalOrdering (i.e., no ordering) and apply, by hand, a permutation computed once by AMDOrdering<>.
steveshi
Registered Member

Re: Eigen solver performance

Fri Dec 11, 2015 7:29 am
It is exciting that naive coeffRef is so efficient that it can replace setFromTriplets.
ggael
Moderator

Re: Eigen solver performance

Fri Dec 11, 2015 10:39 am
This benchmark is not representative because entries are inserted one column after the other. For pure random filling, without reserving memory for each column, the performance will still be very poor, and there is not much we can do about it.
kmanuele
Registered Member

Re: Eigen solver performance

Mon Dec 21, 2015 6:29 pm
Thanks for the very detailed response.

Re: the question on Eigen step 1: the SparseMatrix is assembled directly, just like the vector<map>. We start with a vector of Nodes, where a Node is a connection point with a list of all elements connected between it and some other Node. Our sparse matrix is a vector of columns (std::map).

The code that loads vector<Node> elements into the sparse matrices is identical except for one line:

vector<map>:
sm[ce->node2->index][i] += -ce->admittance.real();

SparseMatrix:
sm.coeffRef(ce->node2->index, i) += -ce->admittance.real();


For SparseMatrix, we were using resize(N,N) and reserve(5*N). Per your suggestion we changed this to:

sm.reserve(VectorXi::Constant(N, 5));

and coeffRef loading improved dramatically, from 3.2 ms to 0.15 ms. Thanks for that!!

We ran your test code using our compiler (Embarcadero's 32-bit Clang) and we get similar results, though naive coeffRef clearly has some compiler dependencies.

setFromTriplets: 0.139308 ms
Dense+sparse_view: 3.95567 ms
reserve+insert: 0.0457338 ms
reserve+coeffRef: 0.0535567 ms
naive coeffRef: 6.36362 ms
vector<map> only: 0.282827 ms


Will work on your ordering suggestion.

Thanks again

Kevin

