Reply to topic

Solve without memory allocation

twithaar
Registered Member
Posts
10
Karma
0

Solve without memory allocation

Mon May 06, 2013 9:56 am
Hi All,

I'm trying to use Eigen for a per-pixel processing.
This is a massive for()-loop, which currently has (among many other things) a svd.solve() and a qr.solve().
Both the svd and Qr objects and their result vectors are pre-allocated.
Still, a statement like:
Code: Select all
x = qr.solve(b)

causes a call to the memory allocator, and the following doesn't compile:
Code: Select all
x.noalias() = qr.solve(b)


Is there anyway to use these solvers without memory allocations ?
On windows 7 the code runs fine, but on windows Xp, the code I have is currently
blocking on the memory allocation, and therefore effectively running on 1 thread, even
though the outermost for()-loop is parallelized using OpenMp.

As a workaround I currently form the inverse explicitly, even though that does reduce the numerical precision....

Any suggestions are appreciated,

Thijs
User avatar ggael
Moderator
Posts
2204
Karma
15
OS

Re: Solve without memory allocation

Tue May 07, 2013 8:30 am
That's not easily doable because the amount of memory that would have to be preallocated depend on the number of columns of b. Nevertheless, it seems your are hitting a more serious issue. Isn't malloc thread safe on winXP??
twithaar
Registered Member
Posts
10
Karma
0

Re: Solve without memory allocation

Tue May 07, 2013 9:22 am
Isn't malloc thread safe on winXP??

It's thread-safe for sure, it's a performance issue.
One of the test-machines has 4 cores with hyperthreading, so OpenMp runs with 8 threads.
Those threads block on the mutex inside malloc, reducing parallelism to effectively 1 core.

In my case 'b' always has 1 column, and 'x', to which I assign the solution, is guaranteed (by my code) to have the correct size pre-allocated.
So although I didn't test it, the following code should work fine (as in: correct results, no memory allocations):
Code: Select all
auto xr = qr.solve(b);
for(int i=0; i<x.size(); i++)
{
    x[i] = xr[i];
}

Something tells me the for()-loop shouldn't be necessary. Note that the following does give a compiler error:
Code: Select all
auto xr = qr.solve(b);
x.noalias() = xr;

The error is:
DenseCoeffsBase.h(495): error C2440: '=' : cannot convert from 'const Eigen::ReturnByValue<Derived>::YOU_ARE_TRYING_TO_ACCESS_A_SINGLE_COEFFICIENT_IN_A_SPECIAL_EXPRESSION_WHERE_THAT_IS_NOT_ALLOWED_BECAUSE_THAT_WOULD_BE_INEFFICIENT' to 'double'


As a side note, the following code also does work without re-allocations:
Initialization:
Code: Select all
Eigen::ColPivHouseholderQR<Eigen::MatrixXd> decomposition = AA.colPivHouseholderQr();

inside the pixel-for()-loop:
Code: Select all
x.noalias() = decomposition.solve(Ab).cast<float>();

Here, it's the cast<float>() that actually makes the assignment work as expected.
For the svd() I've tried eval() and evalTo(), but can't get it to work without memory re-allocations
User avatar ggael
Moderator
Posts
2204
Karma
15
OS

Re: Solve without memory allocation

Tue May 07, 2013 10:00 am
I don't really get what you mean by:
Those threads block on the mutex inside malloc, reducing parallelism to effectively 1 core.

I guess that's just killing the perf.

For your information, when you do x = qr.solve(b); we need a temporary that has the same dimension than b to receive the result of Q^T b. In the special case where the underlying matrix is squared, we could directly use x for that purpose, but that's a rather rare use case for a QR factorization.

Maybe the sizes of your matrices are known at compile time, or at least bounded at compile time? If that's so, and that they are not too large, then you can avoid dynamic allocation using fixed size matrices, e.g.:

typedef Matrix<double, 20,3> DesignMatrixType;

or:

typedef Matrix<double, Dynamic, Dynamic, 0, 20, 3> DesignMatrixType;

Same for the vector types.
twithaar
Registered Member
Posts
10
Karma
0

Re: Solve without memory allocation

Tue May 07, 2013 5:40 pm
ggael wrote:I don't really get what you mean by:
Those threads block on the mutex inside malloc, reducing parallelism to effectively 1 core.

I guess that's just killing the perf.


Ah, it was an unclear statement. The memory allocator in windows XP appears to be fully single threaded.
When running code that creates intermediate results in multiple threads, all those threads call malloc(), which
has a mutex for thread-safety. On my test system, 8 threads are stalling on that mutex, the resulting cpu-load is
12%. 86% of the time is spent on waiting on the mutex inside malloc().
My laptop, with 4 threads and windows 7 runs nicely with a load of 100%. So memory allocations are killing parallelism on windows XP.

ggael wrote:For your information, when you do x = qr.solve(b); we need a temporary that has the same dimension than b to receive the result of Q^T b.

Ah, thanks, that's good to know!
Would there be any way to pass that temporary to solve ? (or let qr store it itself, for the next call ?)
In my code, copying data around is not a problem, but the memory allocations are absolutely killing performance.

If I understand Eigen's code correctly, svd is similar to qr; it needs a temporary for the singularvalues.

Anyway, thanks for the replies so far. Currently the explicitly pre-calculated inverse doesn't seem to give
me numerical problems, but if there's any way to use a qr or svd-decomposition directly, I'd love to hear that.

 
Reply to topic

Bookmarks



Who is online

Registered users: adabrowski, Alexa [Bot], Baidu [Spider], Bing [Bot], bjoernbalazs, CyberAngel, edmael, Exabot [Bot], Felix VI, firewalker, Google [Bot], gp[], jensreuterberg, lazyit, maggy, MSNbot Media, nezumi, orbmiser, pbCyanide, salvochea, SeaJey, sebas, stavallo, TheraHedwig, Uri_Herrera, Vindex17, Yahoo [Bot]