
Performance vs hand-written code for small sizes

fabiol
Registered Member
Hi all

I've seen some comments in this forum, and benchmarks, showing the gain of Eigen over hand-written code for matrix multiplication when the involved matrices are *small* - say in the range [3x3, 30x30], possibly non-square.

In my experience this is true when comparing against GCC (-O3, with and without -mavx or -msse4.2), but not when comparing against ICC (-O3 -xAVX or -xSSE4.2). "Surprisingly" (OK, I'm not entirely sure it should be surprising), the gap is even worse when there is a chain of matrix multiplications like A += B*C + D*E + F*G.

I'm using noalias() for the destination matrix and all of the involved matrices are of fixed size.

For example, this code:

Code: Select all
  Map<Matrix<double, 10, 14, RowMajor> > M_B0(B0);
  Map<Matrix<double, 10, 14, RowMajor> > M_B1(B1);
  Map<Matrix<double, 10, 14, RowMajor> > M_B2(B2);
  Map<Matrix<double, 10, 14, RowMajor> > M_B3(B3);

  Map<Matrix<double, 14, 10, RowMajor> > M_C0(C0);
  Map<Matrix<double, 14, 10, RowMajor> > M_C1(C1);
  Map<Matrix<double, 14, 10, RowMajor> > M_C2(C2);
  Map<Matrix<double, 14, 10, RowMajor> > M_C3(C3);

  Map<Matrix<double, 10, 10, RowMajor> > M_A(A);

  M_A.noalias() += M_B2*M_C0 + M_B1*M_C1 + M_B3*M_C2 + M_B0*M_C3;


runs *slower* than the following (here one operand in each matrix multiply is transposed w.r.t. the previous case, but that doesn't make any difference):

Code: Select all
// Hand-written version: here M_B0..M_B3, M_C0..M_C3 and A denote the raw padded
// C arrays (stored transposed w.r.t. the Eigen version above), not the Eigen Maps.
for (int i = 0; i < 14; i++)
  for (int j = 0; j < 10; j++)
  {
    #pragma vector aligned  // legal because each row of each matrix is padded to 12 elements
    for (int k = 0; k < 10; k++)
      A[j][k] += (M_B2[i][k]*M_C0[i][j]) + (M_B1[i][k]*M_C1[i][j]) + (M_B3[i][k]*M_C2[i][j]) + (M_B0[i][k]*M_C3[i][j]);
  }


The best (i.e. fastest) Eigen code was obtained by compiling with GCC 4.7 -O3, whereas the best hand-written C code was obtained with ICC 14 -O3.
The code was run in an outer loop several times, leading to:

Runtime Eigen = 1.812738 s
Runtime Hand-written = 1.329352 s
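
For reference, a minimal sketch of the kind of outer timing loop meant here (the repetition count and the use of std::chrono are arbitrary choices for illustration, not the exact harness):

Code: Select all
  #include <chrono>

  const int REPS = 1000000;  // arbitrary repetition count
  auto t0 = std::chrono::steady_clock::now();
  for (int r = 0; r < REPS; ++r)
    M_A.noalias() += M_B2*M_C0 + M_B1*M_C1 + M_B3*M_C2 + M_B0*M_C3;
  auto t1 = std::chrono::steady_clock::now();
  double seconds = std::chrono::duration<double>(t1 - t0).count();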

Everywhere I read that Eigen is particularly good for small matrices, but here straight ICC compilation of a naive loop pays off. OK, I haven't tried all possible sizes, but before spending hours coding up more examples I ask:
- Am I missing something important?
- Are there benchmarks that compare Eigen and hand-written code exhaustively for small sizes?
- And most importantly, for which range of sizes do you think that Eigen >> best-possible hand-written code, and still Eigen >> (say) MKL?

Thanks for considering this long question

-- Fabio

Last edited by fabiol on Mon May 19, 2014 9:41 pm, edited 1 time in total.
ggael
Moderator
Your comparison is unfair because in one case you are telling the compiler that the buffers are aligned, but not in the other. Use Map<Matrix<double, 10, 14, RowMajor>, Aligned> to tell Eigen that your pointers are aligned. The devel branch should also be significantly faster if you enable AVX.
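For instance (a sketch, assuming the underlying buffers really are allocated with sufficient alignment):

Code: Select all
  Map<Matrix<double, 10, 14, RowMajor>, Aligned> M_B0(B0);
  Map<Matrix<double, 14, 10, RowMajor>, Aligned> M_C0(C0);
  Map<Matrix<double, 10, 10, RowMajor>, Aligned> M_A(A);
  // ... same for the other B/C buffers; the expression itself is unchanged:
  M_A.noalias() += M_B2*M_C0 + M_B1*M_C1 + M_B3*M_C2 + M_B0*M_C3;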
ggael
Moderator
BTW, you wrote:
Runtime Eigen = 1.329352 s
Runtime Hand-written = 1.812738 s

from which I understand that Eigen is already faster.
fabiol
Registered Member
ggael wrote: BTW, you wrote:
Runtime Eigen = 1.329352 s
Runtime Hand-written = 1.812738 s

from which I understand that Eigen is already faster.


I meant it the other way round; I've edited the original post.

Adding Aligned doesn't make any difference. Is it because, on Sandy Bridge (if I remember correctly), a movups costs the same as a movaps when the data are actually aligned?

I understood that AVX support was merged into trunk a long time ago and then included in the latest release, but now you say I should switch to the devel branch to use it, so I guess I was just wrong, right? Also, you say I have to "enable AVX". How? In any case, I guess I have to add two zero columns to my matrices so that the length of each row (12) is a multiple of the AVX vector length (4 doubles).
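
Something like the following is what I have in mind for the padded layout (just a sketch; the use of C++11 alignas and Eigen's OuterStride here is my assumption, not something I've verified against the devel branch):

Code: Select all
  // Sketch only: a 14x10 logical matrix stored with rows padded to 12 doubles
  // (12*8 = 96 bytes, a multiple of 32), and the storage 32-byte aligned for AVX.
  alignas(32) double C0[14][12];

  Map<Matrix<double, 14, 10, RowMajor>, Aligned, OuterStride<12> > M_C0(&C0[0][0]);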

Thanks for your support

-- Fabio

