Registered Member
|
I did some benchmarks on column-wise and row-wise reductions and compared them with a for-loop implementation. I found that the column-wise performance is similar to the for-loop, but Eigen's row-wise performance is 7x - 8x slower than the for-loop. I think this needs to be fixed ...
Specifically, here is the Eigen-based code:
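Something along these lines (a sketch; the exact original listing is not reproduced here, and X is assumed to be a MatrixXd):

    MatrixXd X = MatrixXd::Random(2000, 2000);
    RowVectorXd cw = X.colwise().sum();   // sum of each column (fast)
    VectorXd    rw = X.rowwise().sum();   // sum of each row (observed 7x - 8x slower)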
Below is my for-loop implementation:
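A sketch of the loop-based version (illustrative, not the exact original code; the loop order follows Eigen's default column-major storage):

    #include <Eigen/Dense>
    using namespace Eigen;

    // sum of each column: the inner loop walks down a column,
    // which is contiguous in memory for column-major storage
    void colwise_sum(const MatrixXd& X, VectorXd& cw)
    {
        cw.setZero(X.cols());
        for (int j = 0; j < X.cols(); ++j)
            for (int i = 0; i < X.rows(); ++i)
                cw[j] += X(i, j);
    }

    // sum of each row, computed by accumulating whole columns into rw;
    // memory is still traversed sequentially, column by column
    void rowwise_sum(const MatrixXd& X, VectorXd& rw)
    {
        rw.setZero(X.rows());
        for (int j = 0; j < X.cols(); ++j)
            for (int i = 0; i < X.rows(); ++i)
                rw[i] += X(i, j);
    }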
Note that when implementing rowwise_sum, I treat it as summing up the column vectors, rather than evaluating the sum of each row vector separately. I conducted the benchmark on a MacBook Pro / Mac OS X 10.7 / clang 3.0 / -O3. Here are the timing results (the elapsed time is measured by repeating the operation 100 times on a pre-allocated 2000 x 2000 random matrix)
I briefly looked at Eigen's code for row-wise evaluation. It seems to me that what it does is cache each row in a temporary vector (so as to make it a contiguous vector), and then compute the sum. This is done for each row (correct me if this is not the case). If this is indeed how things are implemented, then it is a very slow approach. The bottleneck is traversing each row vector: with a column-major layout, the interval between consecutive elements of a row vector is m (the number of rows). When the matrix is large, traversing such a vector incurs lots of cache misses, which is a big performance penalty. My implementation of rowwise_sum largely avoids this issue. The same approach can also be used for most of the row-wise functions provided in Eigen, such as minCoeff, maxCoeff, and the various norms. What do you think about this? |
Registered Member
|
Just realized that for row-wise sum, there is a faster way:
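Presumably a matrix-vector product with a vector of ones, along these lines (a sketch):

    // row-wise sum as a single matrix-vector product; Eigen's optimized
    // matrix-vector kernel handles the column-major traversal internally
    VectorXd rw = X * VectorXd::Ones(X.cols());

    // similarly, e.g. squared row norms:
    VectorXd sq = X.cwiseAbs2() * VectorXd::Ones(X.cols());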
Row-wise norms can also be evaluated using this linear algebraic trick. However, some other reductions (e.g. minCoeff and maxCoeff) and row-wise broadcasting cannot be computed with such a trick, so a faster generic row-wise evaluation scheme is still useful. |
Moderator
|
I did not expect such a difference in performance. The inconvenience is that the row-wise case would require a temporary. Could you share your benchmark? I'd like to test smaller matrices too, as well as some other variants, like:
void rowwise_sum(const MatrixXd& X, VectorXd& rw)
{
    enum { PacketSize = 4 };
    int n = (int)X.cols();
    for (int k = 0; k < X.rows(); k += PacketSize)
    {
        // accumulate a fixed-size chunk of PacketSize rows across all columns
        Matrix<double,PacketSize,1> tmp = Matrix<double,PacketSize,1>::Zero();
        for (int j = 0; j < n; ++j)
            tmp += X.col(j).segment<PacketSize>(k);
        rw.segment<PacketSize>(k) = tmp;
    }
}

(yes, I know this variant should be adjusted to handle the case when X.rows() % PacketSize != 0)

thanks |
Registered Member
|
Please refer to my earlier post for how I implemented rowwise_sum and the performance reported there, which is all I have at this point. But I guess you may do a more thorough benchmark, comparing more methods.
In some cases, row-wise evaluation may need a temporary, but in many useful cases this is not true (for example, row-wise sum/max/min/norm, etc.). Generally, a row-wise reduction written as follows can completely avoid temporary copies:
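Something like the following sketch (rowwise_reduce and Fun are illustrative names, not Eigen API):

    // generic row-wise reduction by folding in whole columns
    template <typename Fun>
    void rowwise_reduce(const MatrixXd& A, VectorXd& r, Fun fun)
    {
        r = A.col(0);                 // seed the accumulator with the first column
        for (int j = 1; j < A.cols(); ++j)
            fun(r, A.col(j));         // e.g. for sum: r += A.col(j)
    }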
One may devise different functors for different types of reduction: e.g. for sum, fun(r, A.col(j)) should be implemented as r += A.col(j); for min, it can be r = min(r, A.col(j)), etc. Also, these operations (e.g. r + A.col(j)) are vector-based operations, and thus can benefit from vectorized code (e.g. SSE2, VML). I think a good design would be to use a temporary-copy-based approach as the generic implementation of row-wise operations, while using a specialized version (through the template specialization mechanism) to implement some row-wise reductions in this more efficient way. BTW, I think other row-wise operations can also avoid temporary copies and traversing each row. For example, A.rowwise() += b can be implemented as:
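Along these lines (a sketch; rowwise_add is an illustrative name):

    // broadcasting a row vector b over the rows of A, one column at a time:
    // each column gets the scalar b[j] added, so memory access stays sequential
    void rowwise_add(MatrixXd& A, const RowVectorXd& b)
    {
        for (int j = 0; j < A.cols(); ++j)
            A.col(j).array() += b[j];
    }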
This is clearly much more efficient than the approach that traverses and clones each row. The same strategy can also be applied to more operations, e.g. assignment, plus/minus/multiplication/division, etc. |
Moderator
|
Note that all of Eigen's code is already generic with respect to the operation you apply, and explicitly vectorized, so no need for VML here.
Also, I was not referring to temporary copies, but to the actual creation of a temporary when the result of the row-wise operation is used in a more complex expression, e.g.:

v = A.rowwise().sum() + B.rowwise().sum();

On the other hand, that makes sense because this is what we do for the matrix-vector product (creating temporaries), and a matrix-vector product is actually just a special partial reduction:

A*v <=> (A.array().rowwise() * v.transpose().array()).rowwise().sum(); |
Registered Member
|
I did some tests on other statistics: minCoeff / maxCoeff / norm. The results are similar to what I reported in the first post of this thread.
X.rowwise().sum(), X.rowwise().minCoeff(), ... all such statements took 6x - 7x the run time of X.colwise().sum() or of my hand-written for-loop that uses the column-by-column implementation shown above. At this point, I can roll my own function like rowwise_sum as a workaround, but I really wish this issue could be fixed soon. BTW, here is the entire benchmark code (for sum), compiled with -O3 (clang 3.0):
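A minimal sketch of such a benchmark (illustrative, not the exact original code; rowwise_sum here is a hand-written column-accumulating loop):

    #include <Eigen/Dense>
    #include <cstdio>
    #include <ctime>
    using namespace Eigen;

    // hand-written row-wise sum: accumulate whole columns
    void rowwise_sum(const MatrixXd& X, VectorXd& rw)
    {
        rw.setZero(X.rows());
        for (int j = 0; j < X.cols(); ++j)
            rw += X.col(j);
    }

    int main()
    {
        const int m = 2000, n = 2000, reps = 100;
        MatrixXd X = MatrixXd::Random(m, n);
        VectorXd rw(m);

        std::clock_t t0 = std::clock();
        for (int r = 0; r < reps; ++r)
            rw = X.rowwise().sum();            // Eigen's built-in row-wise sum
        std::clock_t t1 = std::clock();

        for (int r = 0; r < reps; ++r)
            rowwise_sum(X, rw);                // hand-written column loop
        std::clock_t t2 = std::clock();

        std::printf("rowwise().sum(): %.3f s, rowwise_sum: %.3f s (check %g)\n",
                    double(t1 - t0) / CLOCKS_PER_SEC,
                    double(t2 - t1) / CLOCKS_PER_SEC,
                    rw[0]);                    // use the result so it is not optimized away
        return 0;
    }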
|
Registered Member
|
I just skimmed through the Bugzilla site and found the following bug:
"Partial reductions can be evaluated/vectorized in a more clever way with respect to the relative storage order" It is tagged as a bug blocking the release of Eigen 3.2 (does it mean that this is going to be addressed before v3.2?) Hope the performance issue that I noted in this thread would be gone by then. |