Reply to topic

Improving partialPivLu speed?

mhassell
Registered Member
Posts
4
Karma
0

Improving partialPivLu speed?

Mon Oct 23, 2017 11:22 pm
EDIT: I was attempting to post this on the Eigen page, which I realized I did not. Sorry.

Hello,

I've been learning Eigen by implementing some tools, and am now at a point where I am aiming to improve the speed of my methods. I've enabled openMP via -fopenmp and used gprof to profile my code, and found that my Matlab/Octave code is actually faster, by a factor of 5 or so. In looking over the results of my profiling, I've found that the largest consumer of run time is using the partialPivLu() solver for my linear systems. Specifically, gprof indicates the following for the largest consumer of time:

Flat profile:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
69.61 225.32 225.32 12923 0.02 0.02 Eigen::internal::inplace_transpose_selector<Eigen::Matrix<double, -1, -1, 0, -1, -1>, false, false>::run(Eigen::Matrix<double, -1, -1, 0, -1, -1>&)

Looking at the call tree, I find specifically that the following call consumes most of the time (by making the most calls to the above method:

Eigen::internal::triangular_solve_matrix<double, long, 1, 5, false, 0, 0>::run(long, long, double const*, long, double*, long, Eigen::internal::level3_blocking<double, double>&)

I've also tried a different solver, colPivHouseholderQr(), with similar timing results.

Is there a way to speed up some of these calls? I cannot use sparse or symmetric matrix solvers, since my matrices are generally neither. Any help would be appreciated.
User avatar ggael
Moderator
Posts
3431
Karma
19
OS

Re: Improving partialPivLu speed?

Tue Oct 24, 2017 12:11 pm
First thing is to get compiler options right, for instance with gcc/clang: -O3 -march=native -fopenmp, all three are very important to fully exploit your CPU. Then, depending on your system, you might need to restrict OpenMP to the true number of physical cores, for instance if you have 8 hyper-threads and 4 physical cores, then try:

$ OMP_NUM_THREADS=4 ./my_app

For all your measurements with multi-threading you need to properly measure the wall time, not the CPU's time.

Also, if you think PartialPivLU::compute is the bottleneck, then properly measure its runtime by measuring the wall time. If you have Eigen's sources, you can do:

Code: Select all
#include <path/to/eigensrc/bench/BenchTimer.h>

PartialPivLU<MatrixXd> lu;
BenchTimer t;
t.start();
lu.compute(A);
t.stop();
cout << t.value(REAL_TIMER) << endl;


and compare to the whole running time.

Finally, inplace_transpose_selector is never called by PartialPivLU, so this directly comes from your own code (calls to transposeInPlace). If those really dominate, then try to avoid them.
mhassell
Registered Member
Posts
4
Karma
0

Re: Improving partialPivLu speed?

Tue Oct 24, 2017 10:35 pm
Yes, I have the compiler flags set for optimization, no debugging assertions, vectorization, and openMP. Here's an excerpt from my call graph:

Call graph (explanation follows)


granularity: each sample hit covers 2 byte(s) for 0.00% of 424.31 seconds

index % time self children called name
0.03 0.00 1/12923 testPotentialYh(geometry const&, double (*)(double), Eigen::Matrix<double, -1, -1, 0, -1, -1> const&, int, Eigen::Matrix<double, -1, -1, 0, -1, -1> const&) [14]
15.87 0.00 634/12923 Eigen::internal::triangular_solve_matrix<double, long, 1, 2, false, 0, 0>::run(long, long, double const*, long, double*, long, Eigen::internal::level3_blocking<double, double>&) [16]
26.08 0.00 1042/12923 std::ctype<char>::do_widen(char) const [13]
281.50 0.00 11246/12923 Eigen::internal::triangular_solve_matrix<double, long, 1, 5, false, 0, 0>::run(long, long, double const*, long, double*, long, Eigen::internal::level3_blocking<double, double>&) [7]
[1] 76.2 323.48 0.00 12923 Eigen::internal::inplace_transpose_selector<Eigen::Matrix<double, -1, -1, 0, -1, -1>, false, false>::run(Eigen::Matrix<double, -1, -1, 0, -1, -1>&) [1]
-----------------------------------------------
0.00 37.90 1/8 solve(Eigen::Matrix<double, -1, -1, 0, -1, -1> const&, Eigen::Matrix<double, -1, 1, 0, -1, 1> const&) [10]
0.00 265.27 7/8 solve(Eigen::Matrix<double, -1, -1, 0, -1, -1> const&, Eigen::Matrix<double, -1, -1, 0, -1, -1> const&) [6]
[2] 71.4 0.00 303.17 8 Eigen::PartialPivLU<Eigen::Matrix<double, -1, -1, 0, -1, -1> >::PartialPivLU<Eigen::Matrix<double, -1, -1, 0, -1, -1> >(Eigen::EigenBase<Eigen::Matrix<double, -1, -1, 0, -1, -1> > const&) [2]
0.00 303.01 8/8 Eigen::PartialPivLU<Eigen::Matrix<double, -1, -1, 0, -1, -1> >::compute() [3]
0.09 0.07 272/272 Eigen::internal::general_matrix_matrix_product<long, double, 0, false, double, 0, false, 0>::run(long, long, long, double const*, long, double const*, long, double*, long, double, Eigen::internal::level3_blocking<double, double>&, Eigen::internal::GemmParallelInfo<long>*) [19]
-----------------------------------------------
0.00 303.01 8/8 Eigen::PartialPivLU<Eigen::Matrix<double, -1, -1, 0, -1, -1> >::PartialPivLU<Eigen::Matrix<double, -1, -1, 0, -1, -1> >(Eigen::EigenBase<Eigen::Matrix<double, -1, -1, 0, -1, -1> > const&) [2]
[3] 71.4 0.00 303.01 8 Eigen::PartialPivLU<Eigen::Matrix<double, -1, -1, 0, -1, -1> >::compute() [3]
0.00 303.01 8/8 Eigen::internal::partial_lu_impl<double, 0, int>::blocked_lu(long, long, double*, long, int*, int&, long) [4]
-----------------------------------------------
434 Eigen::internal::partial_lu_impl<double, 0, int>::blocked_lu(long, long, double*, long, int*, int&, long) [4]
0.00 303.01 8/8 Eigen::PartialPivLU<Eigen::Matrix<double, -1, -1, 0, -1, -1> >::compute() [3]
[4] 71.4 0.00 303.01 8+434 Eigen::internal::partial_lu_impl<double, 0, int>::blocked_lu(long, long, double*, long, int*, int&, long) [4]
0.00 276.93 424/431 Eigen::internal::triangular_solve_matrix<double, long, 1, 5, false, 0, 0>::run(long, long, double const*, long, double*, long, Eigen::internal::level3_blocking<double, double>&) [7]
0.00 26.08 380/380 void Eigen::internal::parallelize_gemm<true, Eigen::internal::gemm_functor<double, long, Eigen::internal::general_matrix_matrix_product<long, double, 0, false, double, 0, false, 0>, Eigen::Block<Eigen::Block<Eigen::Map<Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0, Eigen::Stride<0, 0> >, -1, -1, false>, -1, -1, false>, Eigen::Block<Eigen::Block<Eigen::Map<Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0, Eigen::Stride<0, 0> >, -1, -1, false>, -1, -1, false>, Eigen::Block<Eigen::Block<Eigen::Map<Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0, Eigen::Stride<0, 0> >, -1, -1, false>, -1, -1, false>, Eigen::internal::gemm_blocking_space<0, double, double, -1, -1, -1, 1, false> >, long>(Eigen::internal::gemm_functor<double, long, Eigen::internal::general_matrix_matrix_product<long, double, 0, false, double, 0, false, 0>, Eigen::Block<Eigen::Block<Eigen::Map<Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0, Eigen::Stride<0, 0> >, -1, -1, false>, -1, -1, false>, Eigen::Block<Eigen::Block<Eigen::Map<Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0, Eigen::Stride<0, 0> >, -1, -1, false>, -1, -1, false>, Eigen::Block<Eigen::Block<Eigen::Map<Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0, Eigen::Stride<0, 0> >, -1, -1, false>, -1, -1, false>, Eigen::internal::gemm_blocking_space<0, double, double, -1, -1, -1, 1, false> > const&, long, long, long, bool) [12]
0.00 0.00 424/438 void Eigen::internal::evaluateProductBlockingSizesHeuristic<double, double, 4, long>(long&, long&, long&, long) [51]
0.00 0.00 388/388 Eigen::internal::partial_lu_impl<double, 0, int>::unblocked_lu(Eigen::Block<Eigen::Map<Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0, Eigen::Stride<0, 0> >, -1, -1, false>&, int*, int&) [52]
0.00 0.00 380/531 void Eigen::internal::evaluateProductBlockingSizesHeuristic<double, double, 1, long>(long&, long&, long&, long) [50]
434 Eigen::internal::partial_lu_impl<double, 0, int>::blocked_lu(long, long, double*, long, int*, int&, long) [4]
-----------------------------------------------
<spontaneous>
[5] 67.4 0.00 285.78 projectXh(geometry const&, double (*)(double, double), int, Eigen::Matrix<double, -1, -1, 0, -1, -1> const&) [5]
0.00 285.72 2/2 solve(Eigen::Matrix<double, -1, -1, 0, -1, -1> const&, Eigen::Matrix<double, -1, -1, 0, -1, -1> const&) [6]
0.00 0.06 2/2 testXh(geometry const&, double (*)(double, double), int, Eigen::Matrix<double, -1, -1, 0, -1, -1> const&) [22]
0.00 0.00 2/19 void Eigen::internal::parallelize_gemm<true, Eigen::internal::gemm_functor<double, long, Eigen::internal::general_matrix_matrix_product<long, double, 0, false, double, 0, false, 0>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::internal::gemm_blocking_space<0, double, double, -1, -1, -1, 1, false> >, long>(Eigen::internal::gemm_functor<double, long, Eigen::internal::general_matrix_matrix_product<long, double, 0, false, double, 0, false, 0>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::internal::gemm_blocking_space<0, double, double, -1, -1, -1, 1, false> > const&, long, long, long, bool) <cycle 1> [55]
0.00 0.00 2/26 legendrebasis(int, Eigen::Matrix<double, -1, 1, 0, -1, 1>&, int, Eigen::Matrix<double, -1, -1, 0, -1, -1>&) [54]
0.00 0.00 2/531 void Eigen::internal::evaluateProductBlockingSizesHeuristic<double, double, 1, long>(long&, long&, long&, long) [50]
-----------------------------------------------
0.00 285.72 2/2 projectXh(geometry const&, double (*)(double, double), int, Eigen::Matrix<double, -1, -1, 0, -1, -1> const&) [5]
[6] 67.3 0.00 285.72 2 solve(Eigen::Matrix<double, -1, -1, 0, -1, -1> const&, Eigen::Matrix<double, -1, -1, 0, -1, -1> const&) [6]
0.00 265.27 7/8 Eigen::PartialPivLU<Eigen::Matrix<double, -1, -1, 0, -1, -1> >::PartialPivLU<Eigen::Matrix<double, -1, -1, 0, -1, -1> >(Eigen::EigenBase<Eigen::Matrix<double, -1, -1, 0, -1, -1> > const&) [2]
0.00 15.87 7/7 Eigen::internal::triangular_solve_matrix<double, long, 1, 2, false, 0, 0>::run(long, long, double const*, long, double*, long, Eigen::internal::level3_blocking<double, double>&) [16]
0.00 4.57 7/431 Eigen::internal::triangular_solve_matrix<double, long, 1, 5, false, 0, 0>::run(long, long, double const*, long, double*, long, Eigen::internal::level3_blocking<double, double>&) [7]
0.00 0.00 14/438 void Eigen::internal::evaluateProductBlockingSizesHeuristic<double, double, 4, long>(long&, long&, long&, long) [51]
-----------------------------------------------
0.00 4.57 7/431 solve(Eigen::Matrix<double, -1, -1, 0, -1, -1> const&, Eigen::Matrix<double, -1, -1, 0, -1, -1> const&) [6]
0.00 276.93 424/431 Eigen::internal::partial_lu_impl<double, 0, int>::blocked_lu(long, long, double*, long, int*, int&, long) [4]
[7] 66.3 0.00 281.50 431 Eigen::internal::triangular_solve_matrix<double, long, 1, 5, false, 0, 0>::run(long, long, double const*, long, double*, long, Eigen::internal::level3_blocking<double, double>&) [7]
281.50 0.00 11246/12923 Eigen::internal::inplace_transpose_selector<Eigen::Matrix<double, -1, -1, 0, -1, -1>, false, false>::run(Eigen::Matrix<double, -1, -1, 0, -1, -1>&) [1]
0.00 0.00 7236/7560 Eigen::internal::gemm_pack_rhs<double, long, Eigen::internal::blas_data_mapper<double, long, 0, 0>, 4, 0, false, true>::operator()(double*, Eigen::internal::blas_data_mapper<double, long, 0, 0> const&, long, long, long, long) [49]
-----------------------------------------------

The thing is that the most expensive call is to inplaceTransposeSelector, which is called by triangular_solve_matrix 11246 times, which is called by partial_lu_impl 424 times, which seems to be called recursively. These all seem solver related, and is far more calls to the solvers than I am using (I should have about 10 calls to partialPivLU in total).

I've dug into my code and seem to only call transposeInPlace a handful of times, about 4 that I could find, which wouldn't account for the ~13k calls to inplace_transpose_selector. I ran the profile around the solver routines, and found it was ~0.01 s for each call. I'll dig into the transpose in place suggestion, but what would cause such a difference in information between gprof and the wall time?
User avatar ggael
Moderator
Posts
3431
Karma
19
OS

Re: Improving partialPivLu speed?

Wed Oct 25, 2017 8:06 pm
Your callgraph does not make sense to me, inplace_transpose_selector is never called by triangular_solve_matrix. On linux I would recommend perf/hotspot for profiling (https://github.com/KDAB/hotspot), and if you have access to a mac Instrument (shipped with xcode) is super easy to use and reliable.
mhassell
Registered Member
Posts
4
Karma
0

Re: Improving partialPivLu speed?

Wed Oct 25, 2017 11:16 pm
Many thanks. I'll do some more profiling and see what I find.

 
Reply to topic

Bookmarks



Who is online

Registered users: Baidu [Spider], Bing [Bot], boudewijn, Capitain_Jack, chrisogilvie, Exabot [Bot], flyfan, Google [Bot], JesusM, johnguicar, johssw, jsola, lynnux, maymax, rblackwell, Sogou [Bot], waynes, Yahoo [Bot]