This forum has been archived. All content is frozen. Please use KDE Discuss instead.

Why is this code slow ? (Matrix Multiplication 4x4 * 4x1000)

Tags: None
(comma "," separated)
lamtung
Registered Member
Posts
28
Karma
0
In my program, I have a very big C-Array of size nstates * ncat * npattern ( nstates is usually 4 or 20, ncat is 4, npattern can be as big as 10K ) and a small array of size nstates * nstates.

What I have to do are pretty simple matrix operations, such as take a segment of nstates elements from the big array and multiply it with the matrix (nstates * nstates)

I tried to implement it in EIGEN and then compare with a naive implementation. Here is a snippet of the code:

Code: Select all
  for (ptn = 0; ptn < npattern; ++ptn) {
        ptn_freq = (*aln)[ptn].frequency;
        lh_ptn = 0.0;
            for (cat = 0; cat < ncat; ++cat) {
                padding1 = cat * trans_size;
                padding2 = ptn * block + cat * NSTATES;
                trans_state = trans_mat + padding1;
                partial_lh_site = node_branch->partial_lh + padding2;
                partial_lh_child = dad_branch->partial_lh + padding2;
      Map<Matrix<double, 1, NSTATES>, Aligned> ei_partial_lh_site(partial_lh_site);
      Map<Matrix<double, 1, NSTATES>, Aligned> ei_partial_lh_child(partial_lh_child);
      Map<Matrix<double, NSTATES, NSTATES>, Aligned> ei_trans_state(trans_state);
                lh_ptn += (ei_partial_lh_child * ei_trans_state).dot(ei_partial_lh_site);               
            }       
        lh_ptn *= p_var_cat;
        if ((*aln)[ptn].is_const && (*aln)[ptn][0] < NSTATES) {
            lh_ptn += p_invar * state_freq[(*aln)[ptn][0]];
        }
        assert(lh_ptn > 0);
   tree_lh += log(lh_ptn) * ptn_freq;
}       
   return tree_lh;


As you might easily see, the code does exactly what Ive written above: take small vector segments (4 elements) and use them for computation one after another. This works fine and give me 10% speed up by using EIGEN.

Through Benchmarking I noticed that EIGEN does not work so well with small matrix/vector. So I think it might be better to somehow do the computation in a whole instead of extracting small vector like I did in the code above. Here is the new code, which does exactly the same, the only difference is that it extracts a very big segment (nstates*npattern element) to do the computation (I have to use Stride for this), the outer for loop therefore also disappears, which I thought would help since this loops 10K+ times.

Code: Select all
    for (cat = 0; cat < ncat; ++cat) {
        .............   
   int oStride = ncat * NSTATES; // Outer Stride
        Map<Matrix<double, Dynamic, NSTATES, RowMajor>, Aligned, Stride<Dynamic, 1 > > ei_partial_lh_site(partial_lh_site, npattern, NSTATES, Stride<Dynamic, 1 > (oStride, 1));
        Map<Matrix<double, Dynamic, NSTATES, RowMajor>, Aligned, Stride<Dynamic, 1 > > ei_partial_lh_child(partial_lh_child, npattern, NSTATES, Stride<Dynamic, 1 > (oStride, 1));
        lh_allPtn.noalias() += ((ei_partial_lh_child * ei_trans_cat).cwiseProduct(ei_partial_lh_site)) * (Matrix<double, NSTATES, 1>::Ones());
        partial_lh_site += NSTATES;
        partial_lh_child += NSTATES;
    }
    EIGEN_ALIGN16 double ptn_freqs[npattern];
    for (ptn = 0; ptn < alnSize; ++ptn) {
        ptn_freqs[ptn] = (*aln)[ptn].frequency;
        lh_allPtn[ptn] *= p_var_cat;
        if ((*aln)[ptn].is_const && (*aln)[ptn][0] < NSTATES) {
            lh_allPtn[ptn] += p_invar * state_freq[(*aln)[ptn][0]];
            assert(lh_allPtn[ptn] > 0);
        }
        tree_lh += log(lh_allPtn[ptn]) * ptn_freqs[ptn];
    }
    return tree_lh;


To my big surprise, this code runs even slower than a naive implementation. Could it be that the matrix multiplication between (4x4) and (4x10K) are not so efficiently implemented ?

bjacob, can you help me ? THANKS !
lamtung
Registered Member
Posts
28
Karma
0
I think , in the second code, I create a dynamic vector lh_allPtn of very big size (10K elements) to store the results in the function, which causes malloc. The function is called thousands time during the run, that might be the reason for bad performance.

I will try to use a class variable for that vector, which is only malloced once at the beginning.
User avatar
bjacob
Registered Member
Posts
658
Karma
3
Yes, that's a good direction to look into.

Your original post would take me too much time to look into; if you need help, please isolate a smaller portion of code and ask a more precise question :)

PS. Eigen does very well on small matrices (like 4x4) provided that you tell it at compile time about these (template parameters).


Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list!


Bookmarks



Who is online

Registered users: Bing [Bot], blue_bullet, Google [Bot], rockscient, Yahoo [Bot]