
Performance vs hand-written code for small sizes

fabiol
Registered Member
Hi all

I've seen some comments in this forum, and benchmarks, showing the gain of Eigen over hand-written code for matrix multiplication when the involved matrices are *small* - say in the range [3x3, 30x30], possibly non-square.

In my experience this is true when comparing against GCC (-O3, with and without -mavx or -msse4.2), but not when comparing against ICC (-O3 -xAVX or -xSSE4.2). "Surprisingly" (OK, I'm not entirely sure it should be surprising), the gap is even worse when there is a chain of matrix multiplications like A += B*C + D*E + F*G.

I'm using noalias() for the destination matrix and all of the involved matrices are of fixed size.

For example, this code:

Code: Select all
  Map<Matrix<double, 10, 14, RowMajor> > M_B0(B0);
  Map<Matrix<double, 10, 14, RowMajor> > M_B1(B1);
  Map<Matrix<double, 10, 14, RowMajor> > M_B2(B2);
  Map<Matrix<double, 10, 14, RowMajor> > M_B3(B3);

  Map<Matrix<double, 14, 10, RowMajor> > M_C0(C0);
  Map<Matrix<double, 14, 10, RowMajor> > M_C1(C1);
  Map<Matrix<double, 14, 10, RowMajor> > M_C2(C2);
  Map<Matrix<double, 14, 10, RowMajor> > M_C3(C3);

  Map<Matrix<double, 10, 10, RowMajor> > M_A(A);

  M_A.noalias() += M_B2*M_C0 + M_B1*M_C1 + M_B3*M_C2 + M_B0*M_C3;


runs *slower* than the following (here one operand in each matrix multiply is transposed w.r.t. the previous case, but that doesn't make any difference):

Code: Select all
// Hand-written version: here M_B0..M_B3, M_C0..M_C3 and A denote the raw padded
// C arrays (stored transposed w.r.t. the Eigen version above), not the Eigen Maps.
for (int i = 0; i < 14; i++)
  for (int j = 0; j < 10; j++)
  {
    #pragma vector aligned  // legal because each row of each matrix is padded to 12 elements
    for (int k = 0; k < 10; k++)
      A[j][k] += (M_B2[i][k]*M_C0[i][j]) + (M_B1[i][k]*M_C1[i][j]) + (M_B3[i][k]*M_C2[i][j]) + (M_B0[i][k]*M_C3[i][j]);
  }


The best (i.e. fastest) Eigen code was obtained by compiling with GCC 4.7 -O3, whereas the best hand-written C code was obtained with ICC 14 -O3.
The code was run in an outer loop several times, leading to:

Runtime Eigen = 1.812738 s
Runtime Hand-written = 1.329352 s
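
For reference, a minimal sketch of the kind of outer timing loop meant here (the repetition count and the use of std::chrono are arbitrary choices for illustration, not the exact harness):

Code: Select all
  #include <chrono>

  const int REPS = 1000000;  // arbitrary repetition count
  auto t0 = std::chrono::steady_clock::now();
  for (int r = 0; r < REPS; ++r)
    M_A.noalias() += M_B2*M_C0 + M_B1*M_C1 + M_B3*M_C2 + M_B0*M_C3;
  auto t1 = std::chrono::steady_clock::now();
  double seconds = std::chrono::duration<double>(t1 - t0).count();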

Everywhere I read that Eigen is particularly good for small matrices, but here straight ICC compilation of a naive loop pays off. OK, I haven't tried all possible sizes, but before spending hours coding up more examples I ask:
- Am I missing something important?
- Are there benchmarks that compare Eigen and hand-written code exhaustively for small sizes?
- And most importantly, for which range of sizes do you think that Eigen >> best-possible hand-written code, and still Eigen >> (say) MKL?

Thanks for considering this long question

-- Fabio

Last edited by fabiol on Mon May 19, 2014 9:41 pm, edited 1 time in total.
ggael
Moderator
Your comparison is unfair because in one case you are telling the compiler that the buffers are aligned, but not in the other. Use Map<Matrix<double, 10, 14, RowMajor>, Aligned> to tell Eigen that your pointers are aligned. The devel branch should also be significantly faster if you enable AVX.
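For instance (a sketch, assuming the underlying buffers really are allocated with sufficient alignment):

Code: Select all
  Map<Matrix<double, 10, 14, RowMajor>, Aligned> M_B0(B0);
  Map<Matrix<double, 14, 10, RowMajor>, Aligned> M_C0(C0);
  Map<Matrix<double, 10, 10, RowMajor>, Aligned> M_A(A);
  // ... same for the other B/C buffers; the expression itself is unchanged:
  M_A.noalias() += M_B2*M_C0 + M_B1*M_C1 + M_B3*M_C2 + M_B0*M_C3;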
ggael
Moderator
BTW, you wrote:
Runtime Eigen = 1.329352 s
Runtime Hand-written = 1.812738 s

from which I understand that Eigen is already faster.
fabiol
Registered Member
ggael wrote: BTW, you wrote:
Runtime Eigen = 1.329352 s
Runtime Hand-written = 1.812738 s

from which I understand that Eigen is already faster.


I meant it the other way round; I've edited the original post.

Adding Aligned doesn't make any difference. Is it because, on Sandy Bridge (if I remember correctly), a movups costs the same as a movaps when the data are actually aligned?

I understood that AVX support was merged into trunk a long time ago and then included in the latest release, but now you say I should switch to the devel branch to use it, so I guess I was just wrong, right? Also, you say I have to "enable AVX". How? In any case, I guess I have to add two zero columns to my matrices so that the length of each row (12) is a multiple of the AVX vector length (4 doubles).
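
Something like the following is what I have in mind for the padded layout (just a sketch; the use of C++11 alignas and Eigen's OuterStride here is my assumption, not something I've verified against the devel branch):

Code: Select all
  // Sketch only: a 14x10 logical matrix stored with rows padded to 12 doubles
  // (12*8 = 96 bytes, a multiple of 32), and the storage 32-byte aligned for AVX.
  alignas(32) double C0[14][12];

  Map<Matrix<double, 14, 10, RowMajor>, Aligned, OuterStride<12> > M_C0(&C0[0][0]);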

Thanks for your support

-- Fabio

