This forum has been archived. All content is frozen. Please use KDE Discuss instead.

Performance: avoiding temporaries and malloc/free

Tags: None
(comma "," separated)
EamonNerbonne
Registered Member
Posts
36
Karma
0
Short Story:
Can I avoid bounds-computations and malloc/free in an expression (all variables are VectorXd's) such as the following:
Code: Select all
vec1 = vec2 - vec3;


Long Story:
I've got a bunch of code which consists of many simple matrix expressions; generally, nothing more complex than:
Code: Select all
VectorXd vec1 = vec2 + scalar * (mat1.transpose() * vec3).lazy();

or even just
Code: Select all
VectorXd vec1 = vec2 - vec3;

However, the exact dimensions are unknown at compile time. Evaluating such expressions sometimes requires temporaries. Profiling the code shows that a lot of unnecessary time is spent in malloc/free. Although I can't avoid all temporaries, I can preallocate a few scratch-vectors and matrices of the right size in advance. For example, rather than declaring a local variable vec1, I could use a preexisting vector of the right size. However, it's not clear to me that doing so is actually preventing the overhead I'm trying to avoid: I still see significant malloc/free time when profiling.

Presumably, eigen2 can't be sure that the old value of vec1 has an equal amount of dimensions as the new value and thus still reallocates memory. This would also suggest that bounds-checks are being performed that aren't necessary.

Since my app deals with vectors that are fairly small, this overhead is quite significant. My code is spending at least 15% of its time in malloc/free (despite essentially looping over the same data all the time), and and unknown additional amount computing various bounds in the first place. Using fixed temporaries might also improve data-locality, so this might really add up.

Is there a (preferably clean) way of telling eigen2 that the target variable is already of the right size and to not do any bounds-checking (when NDEBUG's not defined, anyhow)?
User avatar
ggael
Moderator
Posts
3447
Karma
19
OS
It is hard to be specific without knowing more about your code, so here are some generalities about Eigen's internals that might help you to optimize your code:

1 - If you have "a = b - c;" and a already has the exact same sizes as b and c, then there is no reallocation. So all the optimizations you can do to reuse preallocated objects will indeed reduce the number of memory allocations.

2 - Eigen does not manage a global pool of memory, so every time you explicitly request for a temporary (e.g., "VectorXd a = b -c;") then of course Eigen will call malloc.

3 - Tip: if your objects are small enough and that you know at compile time the maximal sizes, then you can use stack allocation instead of heap allocation, e.g. for a vector:

typedef Matrix<double,Dynamic,1,0,MyMaxSize,1> StackVectorXd;
StackVectorXd a = b - c;

=> no call to malloc. On the down size MyMaxSize will be reserved on the stack wasting memory if the actual size of a is much smaller than MyMaxSize.

4 - Another tip:

VectorXd vec1 = vec2 + scalar * (mat1.transpose() * vec3).lazy();

If you are using the devel branch you can rewrite this expression as:

VectorXd vec1 = vec2;
vec1.noalias() += scalar * mat1.transpose() * vec3;

This way the matrix product will be more efficiently performed inplace in vec1.
EamonNerbonne
Registered Member
Posts
36
Karma
0
Thanks for the info - using the fact that a preallocated matrix of the right size is not reallocated, I now use a bunch of temporaries. This is slightly unfortunate in terms of readability, but it's a big win for performance, so that's OK.

Incidentally, I tried using the eigen's head branch; this however is quite a bit slower. According to the profiler, lines such as

Code: Select all
P = P - lr_P * ( (muK2_BjT_Bj_P_vJ * vJ.transpose()).lazy() + (muJ2_BkT_Bk_P_vK * vK.transpose()).lazy()) ;

(lr_P being a double, and P being a matrix, having a number of columns equal to vectors' vJ/vK size and as many rows as vectors muK.../muJ...)

Such expressions are on the order of three times as slow with the head branch as compared to 2.08 (using VS2010 or VS2008 compilers). Obviously, using noalias() here would communicate the intent incorrectly and perhaps cause bugs.
User avatar
ggael
Moderator
Posts
3447
Karma
19
OS
Indeed, with gcc, and vector sizes of 200, I observed a very small slow down (x1.18), but note that with the devel branch the lazy() function is deprecated (it is even ignored here). So your expression should be ideally written like this:

p.noalias() -= lr_P * muK2_BjT_Bj_P_vJ * vJ.transpose();
p.noalias() += lr_P * muJ2_BkT_Bk_P_vK * vK.transpose();

in which case here the devel branch is faster than your current expression+eigen2.0.9 (x0.73).

Perhaps, in the future we'll have an even more sophisticated expression analyzer which will be able to automatically split such expressions, but currently it is better to split it yourself. This is because matrix products are now always evaluated when nested in an expression. It appeared to be much faster all of the time.


Bookmarks



Who is online

Registered users: Bing [Bot], daret, Google [Bot], sandyvee, Sogou [Bot]