Registered Member
|
Hi All,
I'm trying to use Eigen for a per-pixel processing. This is a massive for()-loop, which currently has (among many other things) a svd.solve() and a qr.solve(). Both the svd and Qr objects and their result vectors are pre-allocated. Still, a statement like:
causes a call to the memory allocator, and the following doesn't compile:
Is there anyway to use these solvers without memory allocations ? On windows 7 the code runs fine, but on windows Xp, the code I have is currently blocking on the memory allocation, and therefore effectively running on 1 thread, even though the outermost for()-loop is parallelized using OpenMp. As a workaround I currently form the inverse explicitly, even though that does reduce the numerical precision.... Any suggestions are appreciated, Thijs |
Moderator
|
That's not easily doable because the amount of memory that would have to be preallocated depend on the number of columns of b. Nevertheless, it seems your are hitting a more serious issue. Isn't malloc thread safe on winXP??
|
Registered Member
|
It's thread-safe for sure, it's a performance issue. One of the test-machines has 4 cores with hyperthreading, so OpenMp runs with 8 threads. Those threads block on the mutex inside malloc, reducing parallelism to effectively 1 core. In my case 'b' always has 1 column, and 'x', to which I assign the solution, is guaranteed (by my code) to have the correct size pre-allocated. So although I didn't test it, the following code should work fine (as in: correct results, no memory allocations):
Something tells me the for()-loop shouldn't be necessary. Note that the following does give a compiler error:
The error is:
As a side note, the following code also does work without re-allocations: Initialization:
inside the pixel-for()-loop:
Here, it's the cast<float>() that actually makes the assignment work as expected. For the svd() I've tried eval() and evalTo(), but can't get it to work without memory re-allocations |
Moderator
|
I don't really get what you mean by:
I guess that's just killing the perf. For your information, when you do x = qr.solve(b); we need a temporary that has the same dimension than b to receive the result of Q^T b. In the special case where the underlying matrix is squared, we could directly use x for that purpose, but that's a rather rare use case for a QR factorization. Maybe the sizes of your matrices are known at compile time, or at least bounded at compile time? If that's so, and that they are not too large, then you can avoid dynamic allocation using fixed size matrices, e.g.: typedef Matrix<double, 20,3> DesignMatrixType; or: typedef Matrix<double, Dynamic, Dynamic, 0, 20, 3> DesignMatrixType; Same for the vector types. |
Registered Member
|
Ah, it was an unclear statement. The memory allocator in windows XP appears to be fully single threaded. When running code that creates intermediate results in multiple threads, all those threads call malloc(), which has a mutex for thread-safety. On my test system, 8 threads are stalling on that mutex, the resulting cpu-load is 12%. 86% of the time is spent on waiting on the mutex inside malloc(). My laptop, with 4 threads and windows 7 runs nicely with a load of 100%. So memory allocations are killing parallelism on windows XP.
Ah, thanks, that's good to know! Would there be any way to pass that temporary to solve ? (or let qr store it itself, for the next call ?) In my code, copying data around is not a problem, but the memory allocations are absolutely killing performance. If I understand Eigen's code correctly, svd is similar to qr; it needs a temporary for the singularvalues. Anyway, thanks for the replies so far. Currently the explicitly pre-calculated inverse doesn't seem to give me numerical problems, but if there's any way to use a qr or svd-decomposition directly, I'd love to hear that. |
Registered users: Bing [Bot], Google [Bot], Sogou [Bot]