Solve without memory allocation

Board index

Page 1 of 1 (5 posts)

Tags:

twithaar Registered Member Posts 23 Karma 0	Solve without memory allocation Mon May 06, 2013 9:56 am Hi All, I'm trying to use Eigen for a per-pixel processing. This is a massive for()-loop, which currently has (among many other things) a svd.solve() and a qr.solve(). Both the svd and Qr objects and their result vectors are pre-allocated. Still, a statement like: Code: Select all `x = qr.solve(b)` causes a call to the memory allocator, and the following doesn't compile: Code: Select all `x.noalias() = qr.solve(b)` Is there anyway to use these solvers without memory allocations ? On windows 7 the code runs fine, but on windows Xp, the code I have is currently blocking on the memory allocation, and therefore effectively running on 1 thread, even though the outermost for()-loop is parallelized using OpenMp. As a workaround I currently form the inverse explicitly, even though that does reduce the numerical precision.... Any suggestions are appreciated, Thijs
ggael Moderator Posts 3447 Karma 19 OS	Re: Solve without memory allocation Tue May 07, 2013 8:30 am That's not easily doable because the amount of memory that would have to be preallocated depend on the number of columns of b. Nevertheless, it seems your are hitting a more serious issue. Isn't malloc thread safe on winXP??
twithaar Registered Member Posts 23 Karma 0	Re: Solve without memory allocation Tue May 07, 2013 9:22 am Isn't malloc thread safe on winXP?? It's thread-safe for sure, it's a performance issue. One of the test-machines has 4 cores with hyperthreading, so OpenMp runs with 8 threads. Those threads block on the mutex inside malloc, reducing parallelism to effectively 1 core. In my case 'b' always has 1 column, and 'x', to which I assign the solution, is guaranteed (by my code) to have the correct size pre-allocated. So although I didn't test it, the following code should work fine (as in: correct results, no memory allocations): Code: Select all `auto xr = qr.solve(b); for(int i=0; i<x.size(); i++) { x[i] = xr[i]; }` Something tells me the for()-loop shouldn't be necessary. Note that the following does give a compiler error: Code: Select all `auto xr = qr.solve(b); x.noalias() = xr;` The error is: DenseCoeffsBase.h(495): error C2440: '=' : cannot convert from 'const Eigen::ReturnByValue<Derived>::YOU_ARE_TRYING_TO_ACCESS_A_SINGLE_COEFFICIENT_IN_A_SPECIAL_EXPRESSION_WHERE_THAT_IS_NOT_ALLOWED_BECAUSE_THAT_WOULD_BE_INEFFICIENT' to 'double' As a side note, the following code also does work without re-allocations: Initialization: Code: Select all `Eigen::ColPivHouseholderQR<Eigen::MatrixXd> decomposition = AA.colPivHouseholderQr();` inside the pixel-for()-loop: Code: Select all `x.noalias() = decomposition.solve(Ab).cast<float>();` Here, it's the cast<float>() that actually makes the assignment work as expected. For the svd() I've tried eval() and evalTo(), but can't get it to work without memory re-allocations
ggael Moderator Posts 3447 Karma 19 OS	Re: Solve without memory allocation Tue May 07, 2013 10:00 am I don't really get what you mean by: Those threads block on the mutex inside malloc, reducing parallelism to effectively 1 core. I guess that's just killing the perf. For your information, when you do x = qr.solve(b); we need a temporary that has the same dimension than b to receive the result of Q^T b. In the special case where the underlying matrix is squared, we could directly use x for that purpose, but that's a rather rare use case for a QR factorization. Maybe the sizes of your matrices are known at compile time, or at least bounded at compile time? If that's so, and that they are not too large, then you can avoid dynamic allocation using fixed size matrices, e.g.: typedef Matrix<double, 20,3> DesignMatrixType; or: typedef Matrix<double, Dynamic, Dynamic, 0, 20, 3> DesignMatrixType; Same for the vector types.
twithaar Registered Member Posts 23 Karma 0	Re: Solve without memory allocation Tue May 07, 2013 5:40 pm ggael wrote:I don't really get what you mean by: Those threads block on the mutex inside malloc, reducing parallelism to effectively 1 core. I guess that's just killing the perf. Ah, it was an unclear statement. The memory allocator in windows XP appears to be fully single threaded. When running code that creates intermediate results in multiple threads, all those threads call malloc(), which has a mutex for thread-safety. On my test system, 8 threads are stalling on that mutex, the resulting cpu-load is 12%. 86% of the time is spent on waiting on the mutex inside malloc(). My laptop, with 4 threads and windows 7 runs nicely with a load of 100%. So memory allocations are killing parallelism on windows XP. ggael wrote:For your information, when you do x = qr.solve(b); we need a temporary that has the same dimension than b to receive the result of Q^T b. Ah, thanks, that's good to know! Would there be any way to pass that temporary to solve ? (or let qr store it itself, for the next call ?) In my code, copying data around is not a problem, but the memory allocations are absolutely killing performance. If I understand Eigen's code correctly, svd is similar to qr; it needs a temporary for the singularvalues. Anyway, thanks for the replies so far. Currently the explicitly pre-calculated inverse doesn't seem to give me numerical problems, but if there's any way to use a qr or svd-decomposition directly, I'd love to hear that.