Why is this code slow ? (Matrix Multiplication 4x4 * 4x1000)

Tue Oct 12, 2010 8:06 pm

In my program, I have a very big C-Array of size nstates * ncat * npattern ( nstates is usually 4 or 20, ncat is 4, npattern can be as big as 10K ) and a small array of size nstates * nstates.

What I have to do are pretty simple matrix operations, such as take a segment of nstates elements from the big array and multiply it with the matrix (nstates * nstates)

I tried to implement it in EIGEN and then compare with a naive implementation. Here is a snippet of the code:

As you might easily see, the code does exactly what Ive written above: take small vector segments (4 elements) and use them for computation one after another. This works fine and give me 10% speed up by using EIGEN.

Through Benchmarking I noticed that EIGEN does not work so well with small matrix/vector. So I think it might be better to somehow do the computation in a whole instead of extracting small vector like I did in the code above. Here is the new code, which does exactly the same, the only difference is that it extracts a very big segment (nstates*npattern element) to do the computation (I have to use Stride for this), the outer for loop therefore also disappears, which I thought would help since this loops 10K+ times.

To my big surprise, this code runs even slower than a naive implementation. Could it be that the matrix multiplication between (4x4) and (4x10K) are not so efficiently implemented ?

bjacob, can you help me ? THANKS !

lamtung Registered Member Posts 28 Karma 0	Why is this code slow ? (Matrix Multiplication 4x4 * 4x1000) Tue Oct 12, 2010 8:06 pm In my program, I have a very big C-Array of size nstates * ncat * npattern ( nstates is usually 4 or 20, ncat is 4, npattern can be as big as 10K ) and a small array of size nstates * nstates. What I have to do are pretty simple matrix operations, such as take a segment of nstates elements from the big array and multiply it with the matrix (nstates * nstates) I tried to implement it in EIGEN and then compare with a naive implementation. Here is a snippet of the code: Code: Select all for (ptn = 0; ptn < npattern; ++ptn) { ptn_freq = (aln)[ptn].frequency; lh_ptn = 0.0; for (cat = 0; cat < ncat; ++cat) { padding1 = cat trans_size; padding2 = ptn * block + cat * NSTATES; trans_state = trans_mat + padding1; partial_lh_site = node_branch->partial_lh + padding2; partial_lh_child = dad_branch->partial_lh + padding2; Map<Matrix<double, 1, NSTATES>, Aligned> ei_partial_lh_site(partial_lh_site); Map<Matrix<double, 1, NSTATES>, Aligned> ei_partial_lh_child(partial_lh_child); Map<Matrix<double, NSTATES, NSTATES>, Aligned> ei_trans_state(trans_state); lh_ptn += (ei_partial_lh_child * ei_trans_state).dot(ei_partial_lh_site); } lh_ptn = p_var_cat; if ((aln)[ptn].is_const && (aln)[ptn][0] < NSTATES) { lh_ptn += p_invar state_freq[(aln)[ptn][0]]; } assert(lh_ptn > 0); tree_lh += log(lh_ptn) ptn_freq; } return tree_lh; As you might easily see, the code does exactly what Ive written above: take small vector segments (4 elements) and use them for computation one after another. This works fine and give me 10% speed up by using EIGEN. Through Benchmarking I noticed that EIGEN does not work so well with small matrix/vector. So I think it might be better to somehow do the computation in a whole instead of extracting small vector like I did in the code above. Here is the new code, which does exactly the same, the only difference is that it extracts a very big segment (nstatesnpattern element) to do the computation (I have to use Stride for this), the outer for loop therefore also disappears, which I thought would help since this loops 10K+ times. Code: Select all for (cat = 0; cat < ncat; ++cat) { ............. int oStride = ncat NSTATES; // Outer Stride Map<Matrix<double, Dynamic, NSTATES, RowMajor>, Aligned, Stride<Dynamic, 1 > > ei_partial_lh_site(partial_lh_site, npattern, NSTATES, Stride<Dynamic, 1 > (oStride, 1)); Map<Matrix<double, Dynamic, NSTATES, RowMajor>, Aligned, Stride<Dynamic, 1 > > ei_partial_lh_child(partial_lh_child, npattern, NSTATES, Stride<Dynamic, 1 > (oStride, 1)); lh_allPtn.noalias() += ((ei_partial_lh_child * ei_trans_cat).cwiseProduct(ei_partial_lh_site)) * (Matrix<double, NSTATES, 1>::Ones()); partial_lh_site += NSTATES; partial_lh_child += NSTATES; } EIGEN_ALIGN16 double ptn_freqs[npattern]; for (ptn = 0; ptn < alnSize; ++ptn) { ptn_freqs[ptn] = (aln)[ptn].frequency; lh_allPtn[ptn] = p_var_cat; if ((aln)[ptn].is_const && (aln)[ptn][0] < NSTATES) { lh_allPtn[ptn] += p_invar * state_freq[(aln)[ptn][0]]; assert(lh_allPtn[ptn] > 0); } tree_lh += log(lh_allPtn[ptn]) ptn_freqs[ptn]; } return tree_lh; To my big surprise, this code runs even slower than a naive implementation. Could it be that the matrix multiplication between (4x4) and (4x10K) are not so efficiently implemented ? bjacob, can you help me ? THANKS !
lamtung Registered Member Posts 28 Karma 0	Re: Why is this code slow ? (Matrix Multiplication 4x4 * 4x1 Wed Oct 13, 2010 12:15 am I think , in the second code, I create a dynamic vector lh_allPtn of very big size (10K elements) to store the results in the function, which causes malloc. The function is called thousands time during the run, that might be the reason for bad performance. I will try to use a class variable for that vector, which is only malloced once at the beginning.
bjacob Registered Member Posts 658 Karma 3	Re: Why is this code slow ? (Matrix Multiplication 4x4 * 4x1 Wed Oct 13, 2010 1:36 am Yes, that's a good direction to look into. Your original post would take me too much time to look into; if you need help, please isolate a smaller portion of code and ask a more precise question PS. Eigen does very well on small matrices (like 4x4) provided that you tell it at compile time about these (template parameters). Join us on Eigen's IRC channel: #eigen on irc.freenode.net Have a serious interest in Eigen? Then join the mailing list!

Why is this code slow ? (Matrix Multiplication 4x4 * 4x1000)

Page 1 of 1 (3 posts)

Why is this code slow ? (Matrix Multiplication 4x4 * 4x1000)

Re: Why is this code slow ? (Matrix Multiplication 4x4 * 4x1

Re: Why is this code slow ? (Matrix Multiplication 4x4 * 4x1

Bookmarks

Who is online