This forum has been archived. All content is frozen. Please use KDE Discuss instead.

Row-wise reduction is slow

dhlin
Registered Member

Row-wise reduction is slow

Sat Apr 14, 2012 6:19 am
I did some benchmarks on column-wise and row-wise reductions and compared them with a for-loop implementation. I found that the column-wise performance is similar to the for-loop, but Eigen's row-wise performance is 7x - 8x slower than the for-loop. I think this needs to be fixed ...

Specifically, here is the Eigen-based code
Code: Select all
cw = A.colwise().sum();
rw = A.rowwise().sum();


Below is my for-loop implementation
Code: Select all
void colwise_sum(const MatrixXd& X, RowVectorXd& cw)
{
   int n = (int)X.cols();
   for (int j = 0; j < n; ++j)
   {
      cw.coeffRef(j) = X.col(j).sum();
   }
}

void rowwise_sum(const MatrixXd& X, VectorXd& rw)
{
   int n = (int)X.cols();
   rw.fill(0.0);
   for (int j = 0; j < n; ++j)
   {
      rw += X.col(j);
   }
}


Note that when implementing rowwise_sum, I treat it as summing up the column vectors, rather than evaluating the sum of each row vector.

I conducted the benchmark on a MacBook Pro / Mac OS X 10.7 / clang 3.0 / -O3. Here are the timing results (elapsed time measured over 100 repetitions of the operation on a 2000 x 2000 pre-allocated random matrix):

Code: Select all
col-wise sum (me):    0.4476 sec
col-wise sum (eigen): 0.4514 sec
row-wise sum (me):    0.4970 sec
row-wise sum (eigen): 3.7420 sec


I briefly looked at Eigen's code for row-wise evaluation. It seems to me that it caches each row into a temporary vector (to make it a contiguous vector) and then computes the sum, doing this for every row (correct me if this is not the case).

If this is indeed how things are implemented, then it is a very slow approach. The bottleneck is traversing each row vector: with a column-major layout, consecutive elements of a row vector are m (#rows) entries apart in memory. When the matrix is large, traversing such a vector incurs many cache misses, which is a big performance penalty.

My implementation of rowwise_sum largely avoids this issue. The same approach can be used for most of the row-wise functions provided by Eigen, such as minCoeff, maxCoeff, and the various norms.
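To make the access-pattern argument concrete, here is a minimal plain-C++ sketch (independent of Eigen; the function names are illustrative) of the two traversal orders over a column-major buffer:

```cpp
#include <vector>
#include <cstddef>

// Column-major m x n matrix stored as one contiguous buffer:
// element (i, j) lives at X[i + j * m].

// Cache-unfriendly: for each row, walk across its n elements,
// which sit m doubles apart in memory (stride m).
std::vector<double> rowwise_sum_by_rows(const std::vector<double>& X,
                                        std::size_t m, std::size_t n)
{
    std::vector<double> r(m, 0.0);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            r[i] += X[i + j * m];      // stride-m access
    return r;
}

// Cache-friendly: accumulate whole columns, which are contiguous,
// so every pass over the buffer is a unit-stride scan.
std::vector<double> rowwise_sum_by_cols(const std::vector<double>& X,
                                        std::size_t m, std::size_t n)
{
    std::vector<double> r(m, 0.0);
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < m; ++i)
            r[i] += X[i + j * m];      // unit-stride access
    return r;
}
```

Both produce identical results; only the memory-access order differs, which is where the 7x - 8x gap comes from.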

What do you think about this ?
dhlin
Registered Member

Re: Row-wise reduction is slow

Sat Apr 14, 2012 7:40 am
Just realized that for row-wise sum, there is a faster way:
Code: Select all
A * VectorXd::Ones(n);  // n = A.cols()

Row-wise norms can also be evaluated using this linear-algebraic trick.

However, some other reductions (e.g. minCoeff and maxCoeff) and row-wise broadcasting cannot be computed with such a trick, so a faster generic row-wise evaluation scheme is still useful.
ggael
Moderator

Re: Row-wise reduction is slow

Sun Apr 15, 2012 6:01 am
I did not expect such a difference in performance. The inconvenience is that the row-wise case would require a temporary. Could you share your benchmark? I'd like to test smaller matrices too, as well as some other variants, like:

void rowwise_sum(const MatrixXd& X, VectorXd& rw)
{
   enum { PacketSize = 4 };
   int n = (int)X.cols();
   for (int k = 0; k < X.rows(); k += PacketSize)
   {
      Matrix<double,PacketSize,1> tmp = Matrix<double,PacketSize,1>::Zero();
      for (int j = 0; j < n; ++j)
         tmp += X.col(j).segment<PacketSize>(k);
      rw.segment<PacketSize>(k) = tmp;
   }
}

(yes, I know this variant should be adjusted to handle the case where X.rows() % PacketSize != 0)

thanks
dhlin
Registered Member

Re: Row-wise reduction is slow

Mon Apr 16, 2012 9:47 pm
Please refer to my earlier post for how I implemented rowwise_sum and for the performance numbers reported there, which are all I have at this point. But I guess you may want to do a more thorough benchmark comparing more methods.

In some cases, row-wise evaluation may need a temporary, but in many useful cases it does not (for example, row-wise sum/max/min/norm, etc.).

Generally, a row-wise reduction written as follows can completely avoid temporary copies:
Code: Select all
// generic skeleton
template<typename Fun>
void rowwise_reduce(Fun fun, const MatrixXd& A, VectorXd& r)
{
    r = A.col(0);
    int n = A.cols();
    for (int j = 1; j < n; ++j)
    {
        fun(r, A.col(j));
    }
}


One may devise different functors for different types of reduction. For sum, fun(r, A.col(j)) would be implemented as r += A.col(j); for min, it can be r = r.cwiseMin(A.col(j)); etc.

Also, these operations (e.g. r += A.col(j)) are vector-based operations, and thus can benefit from vectorized code (e.g. SSE2, VML).

I think a good design would be to use a temporary-copy-based approach as the generic implementation of row-wise operations, while using specialized versions (through the template specialization mechanism) to implement certain row-wise reductions in this more efficient way.

BTW, I think other row-wise operations can also avoid temporary copies and traversing each row. For example, A.rowwise() += b (with b a row vector) can be implemented as
Code: Select all
void rowwise_addvec(MatrixXd& A, const RowVectorXd& b)
{
    int n = (int)A.cols();
    for (int j = 0; j < n; ++j)
    {
        A.col(j).array() += b.coeff(j);
    }
}

This is clearly much more efficient than traversing and copying each row. The same strategy can also be applied to more operations, e.g. assignment, plus/minus/multiplication/division, etc.
ggael
Moderator

Re: Row-wise reduction is slow

Tue Apr 17, 2012 7:18 am
Note that all of Eigen's code is already generic with respect to the operation you apply, and explicitly vectorized; no need for VML here.

Also, I was not referring to temporary copies, but to the creation of a real temporary when the result of the row-wise operation is used in a more complex expression, e.g.:

v = A.rowwise().sum() + B.rowwise().sum();

On the other hand that makes sense, because it is what we do for the matrix-vector product (creating temporaries), and a matrix-vector product is actually just a special partial reduction:

A*v <=> (A.array().rowwise() * v.transpose().array()).rowwise().sum();
dhlin
Registered Member

Re: Row-wise reduction is slow

Thu Apr 19, 2012 5:06 am
I did some tests on other reductions: minCoeff / maxCoeff / norm. The results are similar to what I reported in the first post of this thread.

X.rowwise().sum(), X.rowwise().minCoeff(), ...

All such statements took 6x - 7x the run time of X.colwise().sum() or of my hand-written for-loop using the column-by-column implementation shown above.

At this point I can roll my own functions like rowwise_sum as a workaround, but I really hope this issue can be fixed soon.

BTW, here is the entire benchmark code (for sum), compiled with -O3 (clang 3.0)
Code: Select all
#include <Eigen/Dense>
#include <cstdio>
#include "my_timer.h" 

using namespace gml;
using namespace Eigen;

// column-wise sums of A (returns a 1 x n row)
MatrixXd colwise_sum(const MatrixXd& A)
{
   // int m = (int)A.rows();
   int n = (int)A.cols();

   MatrixXd R(1, n);
   for (int i = 0; i < n; ++i)
   {
      R.coeffRef(i) = A.col(i).sum();
   }

   return R;
}

// row-wise sums of A (returns an m x 1 column)
MatrixXd rowwise_sum(const MatrixXd& A)
{
   int m = (int)A.rows();
   int n = (int)A.cols();

   MatrixXd R = MatrixXd::Zero(m, 1);

   for (int i = 0; i < n; ++i)
   {
      R += A.col(i);
   }

   return R;
}


int main(int argc, char *argv[])
{
   int n = 2000;
   int nrep = 100;

   MatrixXd A = MatrixXd::Random(n, n);

   Timer tm(true);
   for (int i = 0; i < nrep; ++i)
   {
      MatrixXd R = colwise_sum(A);
   }
   std::printf("my colwise: elapsed time = %.4f sec\n", tm.elapsed_secs());

   Timer tm2(true);
   for (int i = 0; i < nrep; ++i)
   {
      MatrixXd R = A.colwise().sum();
   }
   std::printf("eigen colwise: elapsed time = %.4f sec\n", tm2.elapsed_secs());

   Timer tm3(true);
   for (int i = 0; i < nrep; ++i)
   {
      MatrixXd R = rowwise_sum(A);
   }
   std::printf("my rowwise: elapsed time = %.4f sec\n", tm3.elapsed_secs());

   Timer tm4(true);
   for (int i = 0; i < nrep; ++i)
   {
      MatrixXd R = A.rowwise().sum();
   }
   std::printf("eigen rowwise: elapsed time = %.4f sec\n", tm4.elapsed_secs());
}
dhlin
Registered Member

Re: Row-wise reduction is slow

Thu Apr 19, 2012 6:28 am
I just skimmed through the Bugzilla site and found the following bug:

"Partial reductions can be evaluated/vectorized in a more clever way with
respect to the relative storage order"

It is tagged as a bug blocking the release of Eigen 3.2 (does that mean it will be addressed before v3.2?). I hope the performance issue I noted in this thread will be gone by then.

