This forum has been archived. All content is frozen. Please use KDE Discuss instead.

Outer Loop Vectorization

Tags: None
(comma "," separated)
fwitherden
Registered Member
Posts
1
Karma
0

Outer Loop Vectorization

Mon Apr 14, 2014 8:02 pm
Hi all,

While profiling my code I found a significant portion of time was being spent in the following function (where T = double and both matrices reside in either L2 or L3 cache):

Code: Select all
template<typename T>
eval_orthob(const Matrix<T, 2, Dynamic>& pq, Matrix<T, Dynamic, Dynamic>& out) const
{
    for (int k = 0; k < out.cols(); ++k)
    {
        const T& p = pq(0, k);
        const T& q = pq(1, k);

        T a = (q != 1) ? 2*(1 + p)/(1 - q) - 1 : 0;
        T pow1mqi = 1;
        T pow2ip1 = 0.5;

        for (int i = 0, off = 0; i <= qdeg(); ++i)
        {
            for (int j = i; j <= qdeg() - i; ++j, ++off)
            {
                T cij = sqrt(T((2*i + 1)*(2*i + 2*j + 2))) * pow2ip1;

                out(off, k) = cij*pow1mqi* jacobi_poly(i, 0, 0, a)*jacobi_poly(j, 2*i + 1, 0, q);
            }
            pow1mqi *= 1 - q;
            pow2ip1 /= 2;
        }
    }
}

template<typename T>
EIGEN_ALWAYS_INLINE T
jacobi_poly(int n, int a, int b, const T& x)
{
    int apb = a + b, amb = a - b, bbmaa = b*b - a*a;

    if (n == 0) return 1;
    else if (n == 1) return ((apb + 2)*x + amb) / 2;
    else
    {
        T jm1 = ((apb + 2)*x + amb) / 2, jm2 = 1;

        for (int q = 2; q <= n; ++q)
        {
            int qapbpq = q*(apb + q), apbp2q = apb + 2*q;
            int apbp2qm1 = apbp2q - 1, apbp2qm2 = apbp2q - 2;

            T aq = T(apbp2q*apbp2qm1) / (2*qapbpq);
            T bq = T(apbp2qm1*bbmaa) / (2*qapbpq*apbp2qm2);
            T cq = T(apbp2q)*((a + q - 1)*(b + q - 1)) / (qapbpq*apbp2qm2);

            std::swap(jm1, jm2);
            jm1 = (aq*x - bq)*jm2 - cq*jm1;
        }

        return jm1;
    }
}


Neither ICC, GCC or Clang produce particularly good code for the above function. I would therefore like to vectorize the outer (int k) loop as a means of improving the performance. However, I would like to do so in a way that avoids the allocation of temporaries and still permits the code to function when T is a non-standard type. Christoph on IRC suggested I investigate the packet interface -- although warned that it may be an implementation detail.

Hence, I guess my question is: what options are open to me to improve the performance of the above?

Regards, Freddie.


Bookmarks



Who is online

Registered users: Baidu [Spider], Bing [Bot], Google [Bot], rblackwell