
Sparse matrix-vector product is slow

Tags: sparse, matrix-vector, benchmark, eigen, gmm
munyanko
Registered Member
Sparse matrix-vector product is slow

Thu Aug 28, 2014 12:14 pm
I've been using Eigen to compute products of many sparse matrices and was very pleased with the performance. Since half of the matrices in the product are essentially exponentials of another matrix, it is not efficient to exponentiate directly (sizes grow up to 2^10-2^14); instead I use Krylov-like methods to compute exp(t*H)*v, so that the product slims down to many sparse matrix * sparse vector products over selected columns. However, the performance of the sparse matrix * sparse vector product is at least 10 times slower than that of the direct matrix-matrix product. One would expect similar performance (plus some overhead) from both algorithms, as is the case with gmm.
Code:
mm gmm took 410
mm eigen took 499
mv gmm took 293
mv eigen took 3866


I wrote a small test case to benchmark both algorithms with both libraries (its output is above). I do use NDEBUG and native optimisations, but the results are disastrous :-\
Code:
#include <iostream>
#include <chrono>
#include <gmm/gmm.h>
#include <Eigen/Sparse>

// Helper: copy an Eigen sparse matrix into a gmm col_matrix (write-optimised wsvector storage).
void eigen_to_gmm(const Eigen::SparseMatrix<double> &e, gmm::col_matrix<gmm::wsvector<double>> &g)
{
    gmm::clear(g);
    gmm::resize(g, e.rows(), e.cols());
    for(int k = 0; k < e.outerSize(); ++k)
        for(Eigen::SparseMatrix<double>::InnerIterator it(e,k); it; ++it)
        {
            g(it.row(), it.col()) = it.value();
        }
}

// Same helper for the read-optimised rsvector storage.
void eigen_to_gmm(const Eigen::SparseMatrix<double> &e, gmm::col_matrix<gmm::rsvector<double>> &g)
{
    gmm::clear(g);
    gmm::resize(g, e.rows(), e.cols());
    for(int k = 0; k < e.outerSize(); ++k)
        for(Eigen::SparseMatrix<double>::InnerIterator it(e,k); it; ++it)
        {
            g(it.row(), it.col()) = it.value();
        }
}

int main()
{
    Eigen::SparseMatrix<double> exph(16,16);
    Eigen::SparseMatrix<double> opc(16,16);
    Eigen::SparseMatrix<double> opa(16,16);
    exph.insert(0,0) = 1;
    exph.insert(1,1) = 1.00014;
    exph.insert(2,2) = 1.00014;
    exph.insert(3,3) = 1.00014;
    exph.insert(4,4) = 1.00014;
    exph.insert(5,5) = 1.00007;
    exph.insert(6,6) = 0.999984;
    exph.insert(9,6) = -2.99995e-05;
    exph.insert(7,7) = 1.00004;
    exph.insert(8,7) = -3.00013e-05;
    exph.insert(7,8) = -3.00013e-05;
    exph.insert(8,8) = 1.00004;
    exph.insert(6,9) = -2.99995e-05;
    exph.insert(9,9) = 0.999984;
    exph.insert(10,10) = 1.00007;
    exph.insert(11,11) = 0.999676;
    exph.insert(12,12) = 0.999676;
    exph.insert(13,13) = 0.999676;
    exph.insert(14,14) = 0.999676;
    exph.insert(15,15) = 0.999068;

    opa.insert(0,1) = 1;
    opa.insert(2,5) = 1;
    opa.insert(3,6) = 1;
    opa.insert(4,8) = 1;
    opa.insert(7,11) = 1;
    opa.insert(9,12) = 1;
    opa.insert(10,13) = 1;
    opa.insert(14,15) = 1;

    opc.insert(1,0) = 1;
    opc.insert(5,2) = 1;
    opc.insert(6,3) = 1;
    opc.insert(8,4) = 1;
    opc.insert(11,7) = 1;
    opc.insert(12,9) = 1;
    opc.insert(13,10) = 1;
    opc.insert(15,14) = 1;

    gmm::col_matrix<gmm::rsvector<double>> exph_g;
    eigen_to_gmm(exph, exph_g);
    gmm::col_matrix<gmm::rsvector<double>> opa_g;
    gmm::col_matrix<gmm::rsvector<double>> opc_g;
    eigen_to_gmm(opa, opa_g);
    eigen_to_gmm(opc, opc_g);

    Eigen::SparseMatrix<double> iden(exph.rows(), exph.cols());
    for(int i = 0; i < iden.rows(); i++)
        iden.insert(i,i) = 1;
    gmm::col_matrix<gmm::rsvector<double>> iden_g;
    eigen_to_gmm(iden, iden_g);
    gmm::col_matrix<gmm::wsvector<double>> prod_g(iden.cols(), iden.rows());
    gmm::copy(iden_g, prod_g);
    gmm::col_matrix<gmm::wsvector<double>> tmp_g(exph.rows(), exph.cols());
    // Benchmark 1: repeated sparse matrix * sparse matrix products with gmm.
    auto begin = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < 100000; i++) {
        gmm::mult(prod_g, exph_g, tmp_g);
        gmm::mult(tmp_g, opa_g, prod_g);
        gmm::mult(prod_g, exph_g, tmp_g);
        gmm::mult(tmp_g, opc_g, prod_g);
    }
    gmm::mult(prod_g, exph_g, tmp_g);
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "mm gmm took " << std::chrono::duration_cast<std::chrono::milliseconds>(end-begin).count() << std::endl;


    // Benchmark 2: the same matrix * matrix chain with Eigen.
    Eigen::SparseMatrix<double> prod(iden);
    begin = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < 100000; i++) {
        prod = prod * exph;
        prod = prod * opa;
        prod = prod * exph;
        prod = prod * opc;
    }
    prod = prod * exph;
    end = std::chrono::high_resolution_clock::now();
    std::cout << "mm eigen took " << std::chrono::duration_cast<std::chrono::milliseconds>(end-begin).count() << std::endl;

    // Benchmark 3: the same computation column by column (sparse matrix * sparse vector) with gmm.
    gmm::wsvector<double> v(exph.rows());
    gmm::wsvector<double> y(exph.rows());
    gmm::col_matrix<gmm::wsvector<double>> prod_g2(iden.cols(), iden.rows());
    begin = std::chrono::high_resolution_clock::now();
    for(int state = 0; state < exph.cols(); state++) {
        gmm::copy(iden_g.col(state), v);
        for(int i = 0; i < 100000; i++) {
            gmm::mult(exph_g, v, y);
            gmm::mult(opa_g, y, v);
            gmm::mult(exph_g, v, y);
            gmm::mult(opc_g, y, v);
        }
        gmm::mult(exph_g, v, y);
        prod_g2.col(state) = y;
    }
    end = std::chrono::high_resolution_clock::now();
    std::cout << "mv gmm took " << std::chrono::duration_cast<std::chrono::milliseconds>(end-begin).count() << std::endl;

    // Benchmark 4: the column-by-column computation with Eigen sparse vectors.
    Eigen::SparseVector<double> vec1(exph.rows());
    Eigen::SparseVector<double> vec2(exph.rows());
    Eigen::SparseMatrix<double> prod2(iden);
    begin = std::chrono::high_resolution_clock::now();
    for(int state = 0; state < exph.cols(); state++) {
        vec1 = iden.col(state);
        for(int i = 0; i < 100000; i++) {
            vec2 = exph * vec1;
            vec1 = opa * vec2;
            vec2 = exph * vec1;
            vec1 = opc * vec2;
        }
        prod2.col(state) = exph * vec1;
    }
    end = std::chrono::high_resolution_clock::now();
    std::cout << "mv eigen took " << std::chrono::duration_cast<std::chrono::milliseconds>(end-begin).count() << std::endl;

    return 0;
}

ggael
Moderator
I guess that your example is not representative, as using a sparse representation for such small matrices is not recommended. What are the typical sizes of your real-world use case? Are you sure that the result remains sparse?
munyanko
Registered Member
ggael wrote:I guess that your example is not representative, as using a sparse representation for such small matrices is not recommended. What are the typical sizes of your real-world use case? Are you sure that the result remains sparse?


The matrices are generally very sparse due to their nature: H is an interaction Hamiltonian, which is mostly diagonal in the occupation basis, and the op matrices are fermionic operators.
And the matrix product is always sparse by the physical nature of the problem, but here I ran the test case with 1024x1024 matrices (limited to 1000 products):
Code:
mm gmm took 9994
mm eigen took 3939
sparsity: product 99.6267% exph 99.5895% op 99.9512%
mv gmm took 9088
mv eigen took 150801
average vector product sparsity 99.7234

In the direct matrix-matrix product Eigen, to my pleasure, overtook gmm, but the same product computed column by column yields an abysmal time for Eigen.
To make matters worse, in the real case I have to compute the following product millions of times per iteration, where n can go up to 1000:
Code:
exp(t_{2n}*H) * Op(t_{2n}) * exp(t_{2n-1}*H) * Op(t_{2n-1}) * ... * exp(t_1*H)

Since the exponential argument is variable, I have to compute exp every time, and Taylor or Padé methods are not optimal for such large matrices.
Thus I compute exp(t*H)*v instead via the Newton-Leja algorithm, but it has dozens of matrix-vector products inside, and this is where I first noticed the major slowdown.
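
To illustrate the structure, here is a rough sketch of the per-column evaluation (a sketch only: expmv stands in for a Newton-Leja/Krylov approximation of exp(t*H)*v, and op[k], t[k], v0 and n are placeholders, not names from the benchmark code above):
Code:
// Apply the chain right to left so that only sparse matrix * sparse vector
// products remain; expmv(t, H, v) is a hypothetical stand-in for exp(t*H)*v.
Eigen::SparseVector<double> v = v0;   // starting column
v = expmv(t[1], H, v);                // exp(t_1*H) * v
for (int k = 2; k <= 2*n; ++k) {
    v = op[k] * v;                    // Op(t_k) * v
    v = expmv(t[k], H, v);            // exp(t_k*H) * v
}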

I first tried the matrix-vector product in 3.2.1 or 3.2.2, so my first thought was that this is a regression in Eigen.

Now I'm faced with a choice of either introducing yet another matrix library into the project (gmm), optimizing my matrix-vector product code if I did something wrong here, or simply waiting for a fix if this is a regression, so I brought the matter here.
ggael
Moderator
Can you try with the devel branch? I've just committed a fix.
ggael
Moderator
Moreover, it might also help to call prod2.reserve(prod2.cols() * approximate_nnz_per_column) before starting the vector-by-vector computation, to prevent multiple memory reallocations and copies when inserting the column vectors.
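
As a minimal sketch of that suggestion, reusing the names from the test case above (nnz_per_col is only an assumed estimate of the fill per column, not a value from this thread):
Code:
const int nnz_per_col = 8;                   // assumed average non-zeros per column
Eigen::SparseMatrix<double> prod2(exph.rows(), exph.cols());
prod2.reserve(prod2.cols() * nnz_per_col);   // preallocate once instead of growing while inserting columns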
munyanko
Registered Member
Thank you for the fix!
Code:
mm gmm took 10786
mm eigen took 4509
mv gmm took 9299
mv eigen took 35470

Not 15 times slower as before, but only 4 times slower now for the matrix-vector product! Moreover, reserving space for vec1, vec2 and prod2 shaved off another 3 seconds.
However, it is still very suspicious: when I make H fully diagonal, so that all vectors and columns in the intermediate products have only one element, things are still off:
Code:
mm gmm took 1591
mm eigen took 1820
mv gmm took 1388
mv eigen took 26533


On a side note: the matrix-matrix product is 15% slower in the dev version :(
ggael
Moderator
Yes, the slowdown in this extreme case is due to the fact that by default we assume aliasing, so each product creates a temporary. In general this is not an issue, because the cost of this dynamically allocated temporary is small compared to the cost of the product, but in your extreme case it is killing the performance. This will be fixed soon through the use of noalias (as with dense products). In the meantime, you can update your copy of the devel branch and directly call some internal functions to bypass this temporary:
Code:
template<typename Lhs, typename Rhs, typename Res>
void mul_colmajor_in_place(const Lhs &lhs, const Rhs &rhs, Res &res)
{
  res.resize(lhs.rows(), rhs.cols());
  res.setZero();
  Eigen::internal::conservative_sparse_sparse_product_impl<Lhs,Rhs,Res>(lhs, rhs, res, true);
}
// ...
mul_colmajor_in_place(A, vec1, vec2);
mul_colmajor_in_place(A, vec2, vec1);


and for larger matrices you may also need to increase the stack allocation limit by compiling with, e.g.:

-DEIGEN_STACK_ALLOCATION_LIMIT=1000000
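
For reference, a full compile line with the options mentioned in this thread might look like this (the include paths and file names are placeholders):
Code:
g++ -O3 -DNDEBUG -march=native -DEIGEN_STACK_ALLOCATION_LIMIT=1000000 \
    -I /path/to/eigen -I /path/to/gmm bench.cpp -o bench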

