[SOLVED] LU only half as fast as LAPACK/BLAS? • KDE Community Forums

This forum has been archived. All content is frozen. Please use KDE Discuss instead.

Board index

[SOLVED] LU only half as fast as LAPACK/BLAS?

Page 1 of 3 (31 posts)

1, 2, 3 • Next

Tags:

linoleo Registered Member Posts 14 Karma 0	[SOLVED] LU only half as fast as LAPACK/BLAS? Sun Jan 25, 2009 3:11 am Hi guys, I just discovered Eigen2 and tried it out in our software last night, which computes inverses of large skew-symmetric matrices via LU-factorisation. I was a bit disappointed to discover that for all matrix sizes, Eigen2 took about twice as long to do this as gsl, which sits on top of LAPACK/BLAS. This is both for the factorisation and solving for the inverse. For instance, for a 1520x1520 matrix: LU factorisation: 5.8s gsl, 12.8s Eigen2 solve for inverse: 4.0s gsl, 8.2s Eigen2 Can anyone else confirm this? LU is not among the benchmark graphs on http://eigen.tuxfamily.org/index.php?title=Benchmark So what gives - is this the full pivoting (I think LAPACK only does partial), some other known inefficiency, or am I doing something wrong? This is on an Intel MacBook, compiling with -DNDEBUG -msse2. The code boils down to the following: Eigen::MatrixXd A(2n, 2n); // initialize A Eigen::LU lu(A); assert(lu.isInvertible()); A = lu.inverse(); // use A and the gsl version: gsl_matrix A = gsl_matrix_alloc(2n, 2n); gsl_permutation p = gsl_permutation_alloc(2n); gsl_vector b = gsl_vector_alloc(2n); // initialize A int sgn; gsl_linalg_LU_decomp(A, p, &sgn); for (i = 0; i < n; ++i) { gsl_vector_set_basis(b, 2i + 1); gsl_linalg_LU_svx(A, p, b); // use b } One obvious difference is that with gsl I am doing n individual solves whereas Eigen computes the full inverse, which amounts to 2n solves. However, doing n individual solves with Eigen (the same way as the gsl code) took far longer still - in the above example, 52 seconds! Specifically, this is when replacing the lu.inverse() call with: for (i = 0; i < n; ++i) { b = Eigen::VectorXd::Unit(2n, 2i + 1); lu.solve(b, &b); // use b } Why is this so slow? Is there a better way to do this in Eigen? Cheers Last edited by linoleo on Sun Jan 25, 2009 3:14 am, edited 1 time in total.
bjacob Registered Member Posts 658 Karma 3	[SOLVED] LU only half as fast as LAPACK/BLAS? Sun Jan 25, 2009 4:58 am linoleo wrote:So what gives - is this the full pivoting (I think LAPACK only does partial) Yes, that's it. Eigen2 does full pivoting. Last time I checked, iirc, GSL does only partial pivoting. In my experience, LU with only partial pivoting is just useless for a 1000x1000 matrix. Computing the inverse or rank of a matrix based on partial pivoting LU will frequently fail at such sizes. (I know I tried this initially for Eigen). EDIT (Jan 28, 2009): computing the inverse actually works well with partial pivoting. What doesn't work with partial pivoting is computing the rank, and all that's depending on it (kernel, image, etc). I think that doing only partial pivoting is a plain bug, or at least these libraries should say clearly that these functions handle only small matrices (at most 100x100). That said, there is room for improvement. Eigen2's LU spends more than 50% of its time doing the pivot lookup (that is, looking for the locationg of the biggest coefficient in a matrix). The reason for that, is that this operation is not currently vectorized by Eigen, while all the rest is already vectorized. So that operation is relatively very costly. It should be perfectly possible to vectorize that operation, and that should (for doubles) cut almost by 50% its cost. So one could expect a 25% improvement from just that. Which would make Eigen's full pivoting LU almost as fast as GSL's partial pivoting LU. Last edited by bjacob on Wed Jan 28, 2009 4:07 pm, edited 1 time in total. Join us on Eigen's IRC channel: #eigen on irc.freenode.net Have a serious interest in Eigen? Then join the mailing list!
bjacob Registered Member Posts 658 Karma 3	[SOLVED] LU only half as fast as LAPACK/BLAS? Sun Jan 25, 2009 5:13 am linoleo wrote:One obvious difference is that with gsl I am doing n individual solves whereas Eigen computes the full inverse, which amounts to 2n solves. Wow, so we've really been comparing apples and oranges since the beginning. linoleo wrote:Why is this so slow? Is there a better way to do this in Eigen? First of all it's to be expected that n vector solves take longer than one matrix solve with n columns. So the correct way is to construct a matrix and do only 1 matrix solve. It remains puzzling that according to your numbers Eigen is doing vector solves slower than GSL; the partial pivoting is only a marginal simplification here. At this point however GSL is using a LU decomposition that's wrong in the first place (since its partial pivoting and your matrix is too big for that) so I don't really know what this speed comparison is worth. Join us on Eigen's IRC channel: #eigen on irc.freenode.net Have a serious interest in Eigen? Then join the mailing list!
linoleo Registered Member Posts 14 Karma 0	[SOLVED] LU only half as fast as LAPACK/BLAS? Sun Jan 25, 2009 6:47 am bjacob wrote:Wow, so we've really been comparing apples and oranges since the beginning. Not at all. First, Eigen also takes twice as long to do the LU factorisation in the first place. Second, when comparing individual vector solves, gsl is over 10 times faster than Eigen. Both are apples-to-apples comparisons. bjacob wrote:So the correct way is to construct a matrix and do only 1 matrix solve. True but irrelevant, assuming that you'd like to be reasonably competitive (as in: not an order of magnitude slower) on a single vector solve. It may well be that I'm doing something wrong - I'm not a numerical guru by any means. That's why I'm asking if someone else can confirm this independently. Edit: Doing a single matrix solve with Eigen is exactly as fast (4.0s) as the individual vector solves under gsl, so Eigen's slowdown for vector solves is due to some unreasonably large constant overhead in lu.solve(). bjacob wrote:At this point however GSL is using a LU decomposition that's wrong in the first place (since its partial pivoting and your matrix is too big for that) so I don't really know what this speed comparison is worth. My matrices are guaranteed to have at least two +/-1 entries in each row and column, so partial pivoting suffices. FWIW, if I just want raw speed I'll use SuperLU (my matrix is sparse); what interests me in Eigen is that down the line I'll have to do this with non-standard arithmetic (i.e., with my own payload class), and Eigen seems really well-suited to this sort of thing. bjacob wrote:That said, there is room for improvement. Eigen2's LU spends more than 50% of its time doing the pivot lookup (that is, looking for the locationg of the biggest coefficient in a matrix). The reason for that, is that this operation is not currently vectorized by Eigen, while all the rest is already vectorized. So that operation is relatively very costly. It should be perfectly possible to vectorize that operation, and that should (for doubles) cut almost by 50% its cost. So one could expect a 25% improvement from just that. Which would make Eigen's full pivoting LU almost as fast as GSL's partial pivoting LU. Ah, that explains the speed difference in the LU factorisation, thanks. I totally agree on the shortcomings of partial pivoting in general, but unless you can indeed make full pivoting virtually as fast as partial pivoting, you may want to consider making the latter an option for cases such as mine where the matrix structure makes full pivoting unnecessary. Cheers Last edited by linoleo on Sun Jan 25, 2009 7:47 am, edited 1 time in total.
ggael Moderator Posts 3447 Karma 19 OS	[SOLVED] LU only half as fast as LAPACK/BLAS? Sun Jan 25, 2009 9:28 am Hi, first could you precise a bit your system settings ? - which complier versions ? - which compiler flags ? - which BLAS library ? (ATLAS, MKL, GOTO, etc ????) - 32 or 64 bits system ? - what are the average sizes of your matrix (n) ? Actually that the fact we are doing full-pivoting significantly slowdown both the LU factorization (for obvious reason) but also the LU solver since two permutations have to be applied in both directions. However your results for the LU solver are still a bit weird because the triangular solver should be the most expensive part and Eigen's triangular solver for vectors is the fastest one (see the benchmark). My matrices are guaranteed to have at least two +/-1 entries in each row and column, so partial pivoting suffices This does not mean at all that partial pivoting is enough for you ! The numerical stability depends on many other factors (magnitude order variation, linear dependency, etc.)
bjacob Registered Member Posts 658 Karma 3	[SOLVED] LU only half as fast as LAPACK/BLAS? Sun Jan 25, 2009 3:03 pm First of all, I haven't been clear enough on the shortcomings of partial pivoting. Take a random matrix m of size 1520x1520, with double precision. You can safely assume it's invertible (the opposite would be a huge coincidence). Take a LU decomposition of m with partial pivoting. Use it to compute the inverse of m. Now compute the product of m with the inverse just computed. If the inverse is correct, you should obtain the identity matrix.... in my experience, this will fail very frequently at these sizes. By contrast, Eigen's full pivoting LU will always succeed. Another good test is the rank computation... construct a matrix m with less than full rank, compute its rank, I'm pretty sure that partial pivoting LU can only fail here. So I'm not exaggerating when I say that partial pivoting LU is just _cheating_ here. It is just plain wrong. linoleo wrote:First, Eigen also takes twice as long to do the LU factorisation in the first place. And that is explained by the full pivoting -- although as I explained there is some room for improvement here, full pivoting LU will never be as fast as partial pivoting LU; however it has the advantage of returning a valid result for large sizes. linoleo wrote:Second, when comparing individual vector solves, gsl is over 10 times faster than Eigen. Both are apples-to-apples comparisons. I was talking about the comparison of individual vector solves in GSL against matrix inversion in Eigen. However this should have been to the advantage of Eigen. It's true that there must be a real problem if individual vector solves in Eigen are 10 times slower. Need to look into this. The most useful thing to do at this point would be to profile Eigen, adding some no-inline attributes to functions called inside the critical part. linoleo wrote: bjacob wrote:So the correct way is to construct a matrix and do only 1 matrix solve. True but irrelevant, assuming that you'd like to be reasonably competitive (as in: not an order of magnitude slower) on a single vector solve. Yes, ok, I agree, we must make sure single vector solving is fast. But I was saying, that whatever we do, it won't be as fast as one grouped matrix solving because the latter offers big optimization opportunities. linoleo wrote:It may well be that I'm doing something wrong - I'm not a numerical guru by any means. That's why I'm asking if someone else can confirm this independently. Well the first wrong thing is to apply partial pivoting to a 1520x1520 matrix, but as I said, this indeed doesn't explain the speed difference on individual vector solves. linoleo wrote:Edit: Doing a single matrix solve with Eigen is exactly as fast (4.0s) as the individual vector solves under gsl, so Eigen's slowdown for vector solves is due to some unreasonably large constant overhead in lu.solve(). Ah, good to know. The next thing is that you could pass the unit expression directly to solve(), no need to first copy it into b. Of course I don't expect this useless copy to account for the 10x speed difference, except that it might cause cache misses, one never knows. linoleo wrote:My matrices are guaranteed to have at least two +/-1 entries in each row and column, so partial pivoting suffices. No, that's completely wrong (as Gael also says above). Assuming that you also mean that your matrix coefficients don't get much bigger than 1 in absolute values (otherwise having +/- 1 entries is irrelevant), this is still wrong: Indeed what you say means that partial pivoting is enough for the FIRST step, when one zeroes out the FIRST column. But what happens after that? All the coefficients in all the other columns have been modified by this first round of operations. So you no longer have a +/- 1 coefficient in each column, after the very first step! Anyway, check the result for yourself (check the inverse) you'll see it's completely wrong (at least, frequently). linoleo wrote:FWIW, if I just want raw speed I'll use SuperLU (my matrix is sparse); what interests me in Eigen is that down the line I'll have to do this with non-standard arithmetic (i.e., with my own payload class), and Eigen seems really well-suited to this sort of thing. Then you should rather use Eigen's own sparse module. It lets you use SuperLU as backend so you get the same speed. There is just no point feeding a sparse matrix into a dense LU. (Eigen/LU is only dense. Eigen::Matrix is only dense matrices. If you want sparse, #include and see the documentation there). If your matrix is sparse enough, then OK I can believe that partial pivoting may be enough, by accident. But anyway we were discussing here the DENSE LU so really it's irrelevant that partial pivoting may be enough for sparse matrices. linoleo wrote:I totally agree on the shortcomings of partial pivoting in general, but unless you can indeed make full pivoting virtually as fast as partial pivoting, you may want to consider making the latter an option for cases such as mine where the matrix structure makes full pivoting unnecessary. No! Here, if full pivoting is unnecessary for you, that's only because your matrix is sparse, but then you shouldn't be using the dense LU in the first place, use . I'm not going to optimize the dense LU for the case of sparse matrices! I hope BTW that this isn't the reason for the speed of GSL's solver... Join us on Eigen's IRC channel: #eigen on irc.freenode.net Have a serious interest in Eigen? Then join the mailing list!
bjacob Registered Member Posts 658 Karma 3	[SOLVED] LU only half as fast as LAPACK/BLAS? Sun Jan 25, 2009 3:16 pm If we want to optimize this, the first thing to do is to profile inside the LU::solve(), to see which of the 4 steps takes the most time. For vector solves, there is a lot of room for improvement here in Step 2: Code: Select all `// Step 2 Matrix l(rows, rows); l.setZero(); l.corner(Eigen::TopLeft,rows,smalldim) = m_lu.corner(Eigen::TopLeft,rows,smalldim); l.template marked().solveTriangularInPlace(c);` Here we construct the lower-triangular matrix everytime. At least in the case of a square matrix, that's useless. In general, this should be dealt with either constructing L once and for all the first time it's used, or letting solveTriangularInPlace take an extra parameter to emulate zeros on the right. So linoleo: can you please try (since you have a square matrix) replacing this code with: Code: Select all `// Step 2 m_lu.template marked().solveTriangularInPlace(c);` That is, the matrix 'l' completely disappears. Join us on Eigen's IRC channel: #eigen on irc.freenode.net Have a serious interest in Eigen? Then join the mailing list!
bjacob Registered Member Posts 658 Karma 3	[SOLVED] LU only half as fast as LAPACK/BLAS? Sun Jan 25, 2009 3:42 pm OK, that was it. The speed difference is impressive indeed. I'll commit the fix ASAP but it comes with a couple more changes so takes time. Join us on Eigen's IRC channel: #eigen on irc.freenode.net Have a serious interest in Eigen? Then join the mailing list!
bjacob Registered Member Posts 658 Karma 3	[SOLVED] LU only half as fast as LAPACK/BLAS? Sun Jan 25, 2009 4:26 pm Committed in r916609. Can you retry ? The performance is much better now here. Join us on Eigen's IRC channel: #eigen on irc.freenode.net Have a serious interest in Eigen? Then join the mailing list!
linoleo Registered Member Posts 14 Karma 0	[SOLVED] LU only half as fast as LAPACK/BLAS? Mon Jan 26, 2009 1:32 am Damn you guys are fast... all this while I was sleeping! :thumbs_up: I've patched my LU.h as follows: Code: Select all `// Step 2 if (m_lu.rows() == m_lu.cols()) m_lu.template marked().solveTriangularInPlace(c); else { // construct l as before }` and can confirm that individual vector solves are now as fast for me as a single matrix solve, same speed as gsl (4.0s for 1520x1520). The next thing is that you could pass the unit expression directly to solve(), no need to first copy it into b. Of course I don't expect this useless copy to account for the 10x speed difference, except that it might cause cache misses, one never knows. Just for completeness, I checked this now and it didn't make a difference. I asssume that the copy is optimized away, g++ should be smart enough for that. Indeed what you say means that partial pivoting is enough for the FIRST step, when one zeroes out the FIRST column. But what happens after that? All the coefficients in all the other columns have been modified by this first round of operations. So you no longer have a +/- 1 coefficient in each column, after the very first step! You're quite right, but the sparseness of my matrices makes this very unlikely until the bitter (dense) end of the factorisation, where it doesn't matter anymore, so I can get away with partial pivoting. Then you should rather use Eigen's own sparse module. It lets you use SuperLU as backend so you get the same speed. There is just no point feeding a sparse matrix into a dense LU. (Eigen/LU is only dense. Eigen::Matrix is only dense matrices. If you want sparse, #include and see the documentation there). Point taken. But what documentation? I have shied away from the Sparse module since at this point it appears to in alpha stage - it is mentioned on the forum, but I can find no documentation on it at all, neither on the website nor in the source code. I have seen far too many linear algebra packages with vapourware or alpha-stage sparse support (e.g., octave for about a decade) to base my project on such a thing until it's officially there & documented... but yes I'd be very interested in Eigen/Sparse. Cheers
linoleo Registered Member Posts 14 Karma 0	RE: [SOLVED] LU only half as fast as LAPACK/BLAS? Mon Jan 26, 2009 1:58 am To give a bit more background, I'm writing code to compute properties of Ising spin glasses. This entails factoring matrices (up to as large as memory allows) with the following structure: * size 2n x 2n * skew-symmetric * sparse: each row/column has from 3n to about 10n entries * each row/column has one entry that can be extremely large or small, as it's the exponential of some other quantity * perversely, that entry sits in the natural pivot position (2i, 2i+1) * all other entries are +/-1 Down the line, I will have to do this in a non-standard arithmetic, i.e., with my own payload class. So in an ideal world I'd use a factorisation that is (a) sparse, (b) specialised for skew-symmetric matrices, and (c) coded in templated C++ so I can drop in my payload class. Now this is a tall order unlikely to be met in full, ever. Dropping (b) only increases computational cost by a factor of 2, so LU is fine for me. In my experience, (a) speeds up things for me by a factor of 10 or more, and allows me to handle larger matrices, so that's important. (c) is why I got interested in Eigen. So the big question for me is, are there any plans to have a generic (templated) sparse LU in Eigen? I assume the backends will only work for specialisation to float, double, or std::complex... BTW quite impressed with Eigen so far! Cheers Last edited by linoleo on Mon Jan 26, 2009 2:00 am, edited 1 time in total.
ggael Moderator Posts 3447 Karma 19 OS	RE: [SOLVED] LU only half as fast as LAPACK/BLAS? Mon Jan 26, 2009 10:16 am just for the record I wanted to say that LAPACK provides two LU routines: 1 - an auxiliary routine which performs full pivoting without any blocking optimization, so just like Eigen's LU does. What I can say is that Eigen's full-pivoting LU is faster than all other LAPACK implementation I tried. 2 - the main LU routine which implement partial pivoting in a block fashion. According to netlibs website, this routine is supposed to use the above auxiliary routine in a block fashion, but I don't know if all LAPACK implementations do that. I also don't know how a hybrid full/partial pivoting LU compares to a truly full-pivoting one wrt to stability (for the perf it is of course much faster, because of the block approach). linoleo wrote:* size 2n x 2n * sparse: each row/column has from 3n to about 10n entries I guess you meant from 3 to 10 entries ? Currently the sparse module provide the following direct factorization/solvers: LLT, with 3 backends: 1 - custom naive implementation (ok for custom types) 2 - CHOLMOD (float, double, complex) 3 - TAUCS (float, double, complex) LDLT: custom almost naive implementation (ok for custom types) LU: 1 - SuperLU (float, double, complex) 2 - UmfPack (float, double, complex) Seeing the very specific shape of your matrix I believe a sparse LLT or LDLT factorization should be enough, no ? moreover since you have a skew matrix you probably don't need to reorder the matrix and so the current naive implementations should works well. In case you wanna try, check the sparse tutorial to know how to fill a sparse matrix (in the doc) and the sparse_solver unit test to see how to call the sparse solvers (in eigen2/test/).
Seb Registered Member Posts 99 Karma 0	RE: [SOLVED] LU only half as fast as LAPACK/BLAS? Mon Jan 26, 2009 10:41 am linoleo wrote:(a) sparse, (b) specialised for skew-symmetric matrices, and (c) coded in templated C++ so I can drop in my payload class. From my understanding there exists a sparse module in Eigen as well though not being finished regarding features and optimizations. The sparse module offers its own LU and LDL implementation that you should try. I don't know about the status yet, my own tests last week were not that good regarding memory and speed (compared with Mumps). EDIT: I have to correct: LDLT is very slow, LU seems to be reasonable... (but I test it against a very sparse band structured matrix) Last edited by Seb on Mon Jan 26, 2009 11:38 am, edited 1 time in total.
linoleo Registered Member Posts 14 Karma 0	RE: [SOLVED] LU only half as fast as LAPACK/BLAS? Mon Jan 26, 2009 12:51 pm ggael wrote: linoleo wrote:* size 2n x 2n * sparse: each row/column has from 3n to about 10n entries I guess you meant from 3 to 10 entries ? Yes, that's exactly what I meant, thanks. ggael wrote:Seeing the very specific shape of your matrix I believe a sparse LLT or LDLT factorization should be enough, no? I thought LLT and LDLT were for symmetric matrices only? I know an LDLT-like decomposition for skew-symmetric matrices, but it's not the same beast because the 'D' in there is actually the Kronecker product of a diagonal matrix with [0 -1; 1 0]. It might be similar enough though to **** Eigen's sparse LDLT implementation for it... hmmm... ggael wrote:moreover since you have a skew matrix you probably don't need to reorder the matrix and so the current naive implementations should works well. Reordering should still be done to reduce fill-in, no? It certainly improves SuperLU's performance for my matrices. ggael wrote:In case you wanna try, check the sparse tutorial to know how to fill a sparse matrix (in the doc) and the sparse_solver unit test to see how to call the sparse solvers (in eigen2/test/). I'll definitely look into it. The macports package I installed has neither docs nor test directory, guess I need to sign onto the repository :shade:
ggael Moderator Posts 3447 Karma 19 OS	RE: [SOLVED] LU only half as fast as LAPACK/BLAS? Mon Jan 26, 2009 3:25 pm linoleo wrote: ggael wrote: linoleo wrote:* size 2n x 2n * sparse: each row/column has from 3n to about 10n entries I guess you meant from 3 to 10 entries ? Yes, that's exactly what I meant, thanks. ggael wrote:Seeing the very specific shape of your matrix I believe a sparse LLT or LDLT factorization should be enough, no? I thought LLT and LDLT were for symmetric matrices only? I know an LDLT-like decomposition for skew-symmetric matrices, but it's not the same beast because the 'D' in there is actually the Kronecker product of a diagonal matrix with [0 -1; 1 0]. It might be similar enough though to **** Eigen's sparse LDLT implementation for it... hmmm... ggael wrote:moreover since you have a skew matrix you probably don't need to reorder the matrix and so the current naive implementations should works well. Reordering should still be done to reduce fill-in, no? It certainly improves SuperLU's performance for my matrices. ggael wrote:In case you wanna try, check the sparse tutorial to know how to fill a sparse matrix (in the doc) and the sparse_solver unit test to see how to call the sparse solvers (in eigen2/test/). I'll definitely look into it. The macports package I installed has neither docs nor test directory, guess I need to sign onto the repository :shade: hm, sorry for some reason I mixed up the meaning of skew with something else. So now I know that a skew-symmetric matrix is actually an anti-symmetric matrix, please forgot all what I said ! I mean all ! So indeed, currently the sparse module is not able to handle your matrix with a custom types, and to my knowledge what you're looking for does not exist yet ! But having a custom sparse LU in something on the TODO list but I won't be able to implement something that competes with SuperLU ! Any help is more than welcome !

Page 1 of 3 (31 posts)

1, 2, 3 • Next

Bookmarks

Who is online

Registered users: Bing [Bot], Google [Bot], Sogou [Bot]