Registered Member
|
I did some benchmarks on column-wise and row-wise reductions and compared them with a for-loop implementation. I found that the column-wise performance is similar to the for-loop, but Eigen's row-wise performance is 7x - 8x slower than the for-loop. I think this needs to be fixed ...
Specifically, here is the Eigen-based code:
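Something along these lines (a sketch; the exact original listing is not reproduced here, and X is assumed to be a MatrixXd):

    MatrixXd X = MatrixXd::Random(2000, 2000);
    RowVectorXd cw = X.colwise().sum();   // sum of each column (fast)
    VectorXd    rw = X.rowwise().sum();   // sum of each row (observed 7x - 8x slower)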
Below is my for-loop implementation:
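A sketch of the loop-based version (illustrative, not the exact original code; the loop order follows Eigen's default column-major storage):

    #include <Eigen/Dense>
    using namespace Eigen;

    // sum of each column: the inner loop walks down a column,
    // which is contiguous in memory for column-major storage
    void colwise_sum(const MatrixXd& X, VectorXd& cw)
    {
        cw.setZero(X.cols());
        for (int j = 0; j < X.cols(); ++j)
            for (int i = 0; i < X.rows(); ++i)
                cw[j] += X(i, j);
    }

    // sum of each row, computed by accumulating whole columns into rw;
    // memory is still traversed sequentially, column by column
    void rowwise_sum(const MatrixXd& X, VectorXd& rw)
    {
        rw.setZero(X.rows());
        for (int j = 0; j < X.cols(); ++j)
            for (int i = 0; i < X.rows(); ++i)
                rw[i] += X(i, j);
    }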
Note that when implementing rowwise_sum, I treat it as summing up the column vectors, rather than evaluating the sum of each row vector separately. I conducted the benchmark on a MacBook Pro / Mac OS X 10.7 / clang 3.0 / -O3. Here are the timing results (the elapsed time is measured by repeating the operation 100 times on a pre-allocated 2000 x 2000 random matrix)
I briefly looked at Eigen's code for row-wise evaluation. It seems to me that what it does is cache each row in a temporary vector (so as to make it a contiguous vector), and then compute the sum. This is done for each row (correct me if this is not the case). If this is indeed how things are implemented, then it is a very slow approach. The bottleneck is traversing each row vector: with a column-major layout, the interval between consecutive elements of a row vector is m (the number of rows). When the matrix is large, traversing such a vector incurs lots of cache misses, which is a big performance penalty. My implementation of rowwise_sum largely avoids this issue. The same approach can also be used for most of the row-wise functions provided in Eigen, such as minCoeff, maxCoeff, and the various norms. What do you think about this? |
Registered Member
|
Just realized that for row-wise sum, there is a faster way:
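Presumably a matrix-vector product with a vector of ones, along these lines (a sketch):

    // row-wise sum as a single matrix-vector product; Eigen's optimized
    // matrix-vector kernel handles the column-major traversal internally
    VectorXd rw = X * VectorXd::Ones(X.cols());

    // similarly, e.g. squared row norms:
    VectorXd sq = X.cwiseAbs2() * VectorXd::Ones(X.cols());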
Row-wise norms can also be evaluated using this linear algebraic trick. However, some other reductions (e.g. minCoeff and maxCoeff) and row-wise broadcasting cannot be computed with such a trick, so a faster generic row-wise evaluation scheme is still useful. |
Moderator
|
I did not expect such a difference in performance. The inconvenience is that the row-wise case would require a temporary. Could you share your benchmark? I'd like to test smaller matrices too, as well as some other variants, like:
void rowwise_sum(const MatrixXd& X, VectorXd& rw)
{
    enum { PacketSize = 4 };
    int n = (int)X.cols();
    for (int k = 0; k < X.rows(); k += PacketSize)
    {
        // accumulate a fixed-size chunk of PacketSize rows across all columns
        Matrix<double,PacketSize,1> tmp = Matrix<double,PacketSize,1>::Zero();
        for (int j = 0; j < n; ++j)
            tmp += X.col(j).segment<PacketSize>(k);
        rw.segment<PacketSize>(k) = tmp;
    }
}

(yes, I know this variant should be adjusted to handle the case when X.rows() % PacketSize != 0)

thanks |
Registered Member
|
Please refer to my earlier post for how I implemented rowwise_sum and the performance reported there, which is all I have at this point. But I guess you may do a more thorough benchmark, comparing more methods.
In some cases, row-wise evaluation may need a temporary, but in many useful cases this is not true (for example, row-wise sum/max/min/norm, etc.). Generally, a row-wise reduction written as follows can completely avoid temporary copies:
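Something like the following sketch (rowwise_reduce and Fun are illustrative names, not Eigen API):

    // generic row-wise reduction by folding in whole columns
    template <typename Fun>
    void rowwise_reduce(const MatrixXd& A, VectorXd& r, Fun fun)
    {
        r = A.col(0);                 // seed the accumulator with the first column
        for (int j = 1; j < A.cols(); ++j)
            fun(r, A.col(j));         // e.g. for sum: r += A.col(j)
    }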
One may devise different functors for different types of reduction: e.g. for sum, fun(r, A.col(j)) should be implemented as r += A.col(j); for min, it can be r = min(r, A.col(j)), etc. Also, these operations (e.g. r + A.col(j)) are vector-based operations, and thus can benefit from vectorized code (e.g. SSE2, VML). I think a good design would be to use a temporary-copy-based approach as the generic implementation of row-wise operations, while using a specialized version (through the template specialization mechanism) to implement some row-wise reductions in this more efficient way. BTW, I think other row-wise operations can also avoid temporary copies and traversing each row. For example, A.rowwise() += b can be implemented as:
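Along these lines (a sketch; rowwise_add is an illustrative name):

    // broadcasting a row vector b over the rows of A, one column at a time:
    // each column gets the scalar b[j] added, so memory access stays sequential
    void rowwise_add(MatrixXd& A, const RowVectorXd& b)
    {
        for (int j = 0; j < A.cols(); ++j)
            A.col(j).array() += b[j];
    }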
This is clearly much more efficient than the approach that traverses and clones each row. The same strategy can also be applied to more operations, e.g. assignment, plus/minus/multiplication/division, etc. |
Moderator
|
Note that all of Eigen's code is already generic with respect to the operation you apply, and explicitly vectorized, so no need for VML here.
Also, I was not referring to temporary copies, but to the actual creation of a temporary when the result of the row-wise operation is used in a more complex expression, e.g.:

v = A.rowwise().sum() + B.rowwise().sum();

On the other hand, that makes sense because this is what we do for the matrix-vector product (creating temporaries), and a matrix-vector product is actually just a special partial reduction:

A*v <=> (A.array().rowwise() * v.transpose().array()).rowwise().sum(); |
Registered Member
|
I did some tests on other statistics: minCoeff / maxCoeff / norm. The results are similar to what I reported in the first post of this thread.
X.rowwise().sum(), X.rowwise().minCoeff(), ... all such statements took 6x - 7x the run time of X.colwise().sum() or of my hand-written for-loop that uses the column-by-column implementation shown above. At this point, I can roll my own function like rowwise_sum as a workaround, but I really wish this issue could be fixed soon. BTW, here is the entire benchmark code (for sum), compiled with -O3 (clang 3.0):
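A minimal sketch of such a benchmark (illustrative, not the exact original code; rowwise_sum here is a hand-written column-accumulating loop):

    #include <Eigen/Dense>
    #include <cstdio>
    #include <ctime>
    using namespace Eigen;

    // hand-written row-wise sum: accumulate whole columns
    void rowwise_sum(const MatrixXd& X, VectorXd& rw)
    {
        rw.setZero(X.rows());
        for (int j = 0; j < X.cols(); ++j)
            rw += X.col(j);
    }

    int main()
    {
        const int m = 2000, n = 2000, reps = 100;
        MatrixXd X = MatrixXd::Random(m, n);
        VectorXd rw(m);

        std::clock_t t0 = std::clock();
        for (int r = 0; r < reps; ++r)
            rw = X.rowwise().sum();            // Eigen's built-in row-wise sum
        std::clock_t t1 = std::clock();

        for (int r = 0; r < reps; ++r)
            rowwise_sum(X, rw);                // hand-written column loop
        std::clock_t t2 = std::clock();

        std::printf("rowwise().sum(): %.3f s, rowwise_sum: %.3f s (check %g)\n",
                    double(t1 - t0) / CLOCKS_PER_SEC,
                    double(t2 - t1) / CLOCKS_PER_SEC,
                    rw[0]);                    // use the result so it is not optimized away
        return 0;
    }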
|
Registered Member
|
I just skimmed through the Bugzilla site and found the following bug:
"Partial reductions can be evaluated/vectorized in a more clever way with respect to the relative storage order" It is tagged as a bug blocking the release of Eigen 3.2 (does it mean that this is going to be addressed before v3.2?) Hope the performance issue that I noted in this thread would be gone by then. |