
[SOLVED] LU only half as fast as LAPACK/BLAS?

linoleo
Registered Member
ggael wrote:LDLT: custom almost naive implementation (ok for custom types)


Okay, I did some research. The routines you're using here (both dense and sparse) are very basic, with no pivoting or blocking. That is fine for Hermitian positive definite matrices but misses a key advantage of LDLT over LLT: a good LDLT implementation can also handle indefinite and skew-Hermitian matrices.

To cope with indefinite matrices, one must go to a blocked version, where the pivot block is normally 1x1 but grows to 2x2 when the scalar pivot is too small (the selection rule is sketched below). Skew-Hermitian matrices can be handled by premultiplying them by i (the imaginary unit), which makes them Hermitian (but generally indefinite). A very good reference for all this is Chapter 11 of:
Nicholas J. Higham, Accuracy and Stability of Numerical Algorithms
2nd Ed., SIAM 2002. ISBN 0898715210, 9780898715217
(I got the entire chapter as preview on Google Books :-) )
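
To make that concrete, here is a sketch of the Bunch-Kaufman pivot choice as Higham describes it. This is not Eigen code: choosePivot is a made-up helper, and a real implementation would also apply the symmetric permutations and the rank-1/rank-2 updates:

Code:
#include <Eigen/Core>
#include <algorithm>
#include <cmath>

enum PivotKind { Pivot1x1, Pivot1x1Swap, Pivot2x2 };

// Decide the pivot at step k of a symmetric LDLT; the lower triangle of A
// holds the remaining submatrix. On return, r is the candidate swap row.
PivotKind choosePivot(const Eigen::MatrixXd& A, int k, int& r)
{
    const double alpha = (1.0 + std::sqrt(17.0)) / 8.0; // ~0.64, bounds element growth
    const int n = A.rows();

    // lambda = magnitude of the largest subdiagonal entry in column k
    double lambda = 0.0;
    r = k;
    for (int i = k + 1; i < n; ++i)
        if (std::abs(A(i, k)) > lambda) { lambda = std::abs(A(i, k)); r = i; }

    if (lambda == 0.0 || std::abs(A(k, k)) >= alpha * lambda)
        return Pivot1x1; // the diagonal entry is big enough

    // sigma = largest off-diagonal magnitude in row/column r of the submatrix
    double sigma = 0.0;
    for (int i = k; i < n; ++i)
        if (i != r) sigma = std::max(sigma, std::abs(i > r ? A(i, r) : A(r, i)));

    if (std::abs(A(k, k)) * sigma >= alpha * lambda * lambda)
        return Pivot1x1; // A(k,k) is still acceptable as a 1x1 pivot
    if (std::abs(A(r, r)) >= alpha * sigma)
        return Pivot1x1Swap; // symmetrically swap rows/columns k and r, then pivot 1x1
    return Pivot2x2; // eliminate rows/columns k and r together
}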

So I see two ways forward:
1. Hack 2x2 blocks and pivoting into Eigen's existing LDLT codes, or
2. Adopt existing generic dense/sparse blocked LDLT codes with pivoting into Eigen.

I'll have a look around to see if I can find a code suitable for option 2. So far I've found a dense one that looks interesting: http://www.alglib.net/matrixops/symmetric/ldlt.php
The MPFR/GMP version seems to use custom payload types. What do you think?

No luck so far for a sparse version. Cheers

ggael
Moderator
linoleo wrote:Okay, I did some research. The routines you're using here (both dense and sparse) are very basic, with no pivoting or blocking. That is fine for Hermitian positive definite matrices but misses a key advantage of LDLT over LLT: a good LDLT implementation can also handle indefinite and skew-Hermitian matrices.


Yes, I know it's very basic!

linoleo wrote:So I see two ways forward:
1. Hack 2x2 blocks and pivoting into Eigen's existing LDLT codes, or
2. Adopt existing generic dense/sparse blocked LDLT codes with pivoting into Eigen.

I'll have a look around to see if I can find a code suitable for option 2.


For sparse matrices and direct solvers there is a very interesting book:
http://www.cise.ufl.edu/research/sparse/CSparse/
The book is accompanied by a small C library, which is open source (LGPL) and available on the website. If I remember well, it includes routines for LLT, LDLT, and LU sharing the same code base. Though the library is designed for teaching purposes (e.g., it supports only double), it has a very good simplicity/performance ratio, so I think it would be a very good starting point.
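
To give a feel for the library's style, here is a minimal sketch of solving Ax = b with its LU driver. The cs_* names follow the CSparse API as I remember it, so double-check them against the book:

Code:
#include <cstdio>
#include "cs.h" // CSparse

int main()
{
    // assemble a tiny 2x2 matrix in triplet form, then compress to CSC
    cs *T = cs_spalloc(2, 2, 4, 1, 1);
    cs_entry(T, 0, 0, 4.0);
    cs_entry(T, 0, 1, 1.0);
    cs_entry(T, 1, 0, 1.0);
    cs_entry(T, 1, 1, 3.0);
    cs *A = cs_compress(T);
    cs_spfree(T);

    double b[2] = { 1.0, 2.0 };
    // order = 1 asks for an AMD fill-reducing ordering; tol = 1.0 means
    // partial pivoting. On success, b is overwritten with the solution x.
    cs_lusol(1, A, b, 1.0);
    std::printf("x = [%g, %g]\n", b[0], b[1]);
    cs_spfree(A);
    return 0;
}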
bjacob
Registered Member
Guys,

I'm sorry: I was wrong when I said that partial pivoting was unusable for large matrices.

I experimented with partial pivoting today, and it works well for invertible matrices: it can be used to invert large invertible matrices.

Where partial pivoting doesn't work is for non-invertible matrices: computing the rank/kernel/image only works with full pivoting.

Sorry for the noise!


linoleo
Registered Member
ggael wrote:If I remember well, it includes routines for LLT, LDLT, and LU sharing the same code base.


As far as I can tell, CSparse does not include any LDLT, blocked or otherwise. The same author (Tim Davis) wrote the simple unblocked LDLT code you're already using, so this won't help us at all.
linoleo
Registered Member
bjacob wrote:Where partial pivoting doesn't work, is for non-invertible matrices: computing the rank/kernel/image only works with full pivoting.


Ah, that makes sense: if the matrix can have a large nullspace, you may not be able to find a non-zero pivot at all unless you search the whole matrix. Simplest example: a matrix with only one non-zero element...
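
For instance, assuming the LU class exposes rank() (I believe it does), this illustrates the point with Eigen's full-pivoting lu():

Code:
#include <iostream>
#include <Eigen/LU>
using namespace Eigen;

int main()
{
    Matrix3d A = Matrix3d::Zero();
    A(1, 2) = 5.0; // the only non-zero element

    // At step 0, partial pivoting would search column 0 only, find nothing
    // but zeros, and get stuck; full pivoting scans the whole remaining
    // matrix, finds A(1,2), and correctly reports rank 1.
    std::cout << "rank = " << A.lu().rank() << std::endl;
}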
bjacob
Registered Member
I just committed partial-pivoting LU in trunk. I restricted it to square invertible matrices -- which is the main use case for it.

So if you're interested in performance, give it a try. The methods inverse() and determinant() in MatrixBase have been updated to use it instead of full-pivoting LU, so if you use them you don't have to change anything in your code.

The class name is currently PartialLU but that is not definitive.

Here, with gcc-4.4 and SSE2 on a 1000x1000 float matrix, I get the following improvements:

inverse() :
3.0s (full pivoting LU) --> 2.0s (partial pivoting LU)

LU decomposition itself (e.g. if you want a single vector solve):
1.8s (full pivoting LU) --> 0.8s (partial pivoting LU)

Thanks to everybody for your input and making me realize that there is a use case for partial pivoting LU... ;-)


n2k
Registered Member
Hi,

I'm new to Eigen and C++. First of all I would like to thank all the developers for this impressive library.

I'm not sure how to use the new partial-pivoting LU:

bjacob wrote:So if you're interested in performance, give it a try. The methods inverse() and determinant() in MatrixBase have been updated to use it instead of full-pivoting LU, so if you use them you don't have to change anything in your code.

The class name is currently PartialLU but that is not definitive.


Is this implemented in the latest release of Eigen (I'm using 2.0.2)? My original problem is the solution of a linear system:

Code:
#include <iostream>
#include <ctime>
#include <Eigen/LU>
using namespace std;
using namespace Eigen;

int main()
{
    MatrixXd A = MatrixXd::Random(1000, 1000);
    MatrixXd B = MatrixXd::Random(1000, 1);
    MatrixXd X;
    clock_t starttime = clock();
    A.lu().solve(B, &X); // Eigen 2 API: lu() is full-pivoting; solve writes into X
    cout << "Used Time = " << (clock() - starttime) * 1000.0 / CLOCKS_PER_SEC << " ms" << endl;
}


Compiled with g++-4.2 -O3 -msse3, this takes 1.8 s on my new 2 GHz Intel MacBook.

Equivalent code written first in Matlab and then in C++ using the Armadillo library (BLAS/LAPACK provided by the Apple Accelerate framework) takes 0.3 s and 0.2 s respectively.

Is this difference due to the use of full pivoting in Eigen? I'm sure I'm missing something here, but I can't figure out what.

Thanks for your help,
Giacomo
bjacob
Registered Member
n2k wrote:Is this implemented in the latest release of Eigen (I'm using 2.0.2)?


No, it is in the development branch; it will appear in 2.1.

n2k wrote:Is this difference due to the use of full pivoting in Eigen? I'm sure I'm missing something here, but I can't figure out what.


Part of it is the partial pivoting: try the development version. It should give you 2x or 3x better performance.
Part of it might be that you didn't disable assertions: try -DEIGEN_NO_DEBUG or -DNDEBUG.
Part of it might be, also, that we have room for improvement: the partial pivoting should ideally work on blocks. This is on our TODO but not done yet. Also, if you have >1 core and your software is able to parallelize this operation, that may also explain it.

EDIT: I can see that you defined B as a MatrixXd while it is really a vector. You have some room for improvement by defining B as a VectorXd instead. Same for X.
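
For example, your benchmark from above with just the types changed:

Code:
#include <iostream>
#include <ctime>
#include <Eigen/LU>
using namespace std;
using namespace Eigen;

int main()
{
    MatrixXd A = MatrixXd::Random(1000, 1000);
    VectorXd B = VectorXd::Random(1000); // a true vector instead of a 1000x1 matrix
    VectorXd X;
    clock_t starttime = clock();
    A.lu().solve(B, &X);
    cout << "Used Time = " << (clock() - starttime) * 1000.0 / CLOCKS_PER_SEC << " ms" << endl;
}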



n2k
Registered Member
Thanks for your fast reply!

bjacob wrote:Part of it is the partial pivoting: try the development version. It should give you 2x or 3x better performance.
Part of it might be that you didn't disable assertions: try -DEIGEN_NO_DEBUG or -DNDEBUG.
Part of it might be, also, that we have room for improvement: the partial pivoting should ideally work on blocks. This is on our TODO but not done yet. Also, if you have >1 core and your software is able to parallelize this operation, that may also explain it.

EDIT: I can see that you defined B as a MatrixXd while it is really a vector. You have some room for improvement by defining B as a VectorXd instead. Same for X.


For the record, disabling assertions and using VectorXd didn't have any significant effect. I downloaded the development version and found PartialLU.h in it, but I'm not sure how I should modify my simple code (please see above) to take advantage of partial pivoting.

Any help is very appreciated. Thank you.
bjacob
Registered Member
n2k wrote:For the record, disabling assertions and using VectorXd didn't have any significant effect. I downloaded the development version and found PartialLU.h in it, but I'm not sure how I should modify my simple code (please see above) to take advantage of partial pivoting.


Just call partialLu() instead of lu().
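
In your snippet that's a one-line change:

Code:
A.lu().solve(B, &X);        // full-pivoting LU
A.partialLu().solve(B, &X); // partial-pivoting LU (development branch)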

And you're welcome :)


n2k
Registered Member
bjacob wrote:Just call partialLu() instead of lu().


Oh, OK... I was typing "PartialLu()", that's why. I now get 1.25 s.
bjacob
Registered Member
n2k wrote:
bjacob wrote:Just call partialLu() instead of lu().


Oh, OK... I was typing "PartialLu()", that's why. I now get 1.25 s.


Hm, so it's still 4x or 5x slower than the competition? OK, as I said, we have room for improvement, beginning with implementing the block algorithm.

I'd be interested if you can check how many threads the other programs use, and what is the performance when you force the program to use only 1 thread.


n2k
Registered Member
bjacob wrote:I'd be interested if you can check how many threads the other programs use, and what is the performance when you force the program to use only 1 thread.


I'd be happy to check that for you if only I knew how to do that :) I'll try to find out and keep you posted.
bjacob
Registered Member
n2k wrote:
bjacob wrote:I'd be interested if you can check how many threads the other programs use, and what is the performance when you force the program to use only 1 thread.


I'd be happy to check that for you if only I knew how to do that :) I'll try to find out and keep you posted.


If you're using Linux or Unix, you can always make it work on a huge matrix so that it takes around 10 seconds to complete, and run 'top' in a terminal during the computation. If it says "200%", this means it's using 2 cores of your CPU. Of course this experiment only works if you have a dual-core CPU. If you have 1 core, disregard all I said!

Also notice that it is a well-known issue (see the TODO) that vectorization doesn't bring as strong a performance increase as it should with PartialLU. What's needed here is for someone to examine the generated assembly carefully and understand where the code we're generating is bad...


n2k
Registered Member
bjacob wrote:If you're using Linux or Unix, you can always make it work on a huge matrix so that it takes around 10 seconds to complete, and run 'top' in a terminal during the computation. If it says "200%", this means it's using 2 cores of your CPU. Of course this experiment only works if you have a dual-core CPU. If you have 1 core, disregard all I said!


I do have 2 cores. I observed that Armadillo (BLAS/LAPACK) reached 191% while Matlab exceeded 200%; Eigen peaked at 101%. So it seems that Armadillo and Matlab are indeed using two cores, but this happens only after a few seconds (I'm not sure whether this is due to the refresh rate of "top").

I also did other tests, and Eigen turned out to be the fastest at smaller matrix sizes. These are the results for the solution of a 50x50 system:
Eigen (partial LU) -> 0.216 ms
Armadillo -> 0.238 ms
Eigen (full LU) -> 0.332 ms
Matlab -> 0.350 ms

As the matrix size increases, Eigen seems to slow down more rapidly than the other two.

