
Vectorization and enforcing order of instructions

japro
Edit: aaw, **** I just realized that the order change broke the algorithm (it overwrites the accumulator xmm0), but I guess it should be fixable.

Hi, I'm not entirely sure if this is the right place for this, but I don't know where else to ask, and I think it's relevant to Eigen.

I'm doing a parallel numerical computing course at the moment, and in one of the exercises we were asked to implement expression templates (basic arithmetic + dot product). I then went one step further and made mine vectorize automatically using the xmmintrin/emmintrin SSE intrinsics. After some fiddling with details I made a comparison of ddot operations against Eigen3 and MKL (GCC 4.4, "-O2 -msse2 -funroll-loops", on an Intel Core i3-330M):
[Image: ddot performance comparison against Eigen3 and MKL]
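For reference, here is a minimal sketch of the kind of SSE2 kernel my expression templates boil down to for ddot (sketch only: the function name is made up, and it assumes 16-byte aligned inputs and n a multiple of 2). With -funroll-loops, GCC turns a loop like this into inner loops like the one shown below:
Code: Select all
#include <emmintrin.h>

// Sketch only: assumes n is a multiple of 2 and a, b are 16-byte aligned.
inline double ddot_sse2_sketch(int n, const double* a, const double* b)
{
   __m128d acc = _mm_setzero_pd();
   for (int i = 0; i < n; i += 2) {
      __m128d pa = _mm_load_pd(a + i);              // movapd
      __m128d pb = _mm_load_pd(b + i);              // gets folded into the mulpd
      acc = _mm_add_pd(acc, _mm_mul_pd(pa, pb));    // mulpd + addpd
   }
   double tmp[2];
   _mm_storeu_pd(tmp, acc);                         // store the packed partial sums
   return tmp[0] + tmp[1];                          // horizontal reduction
}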

I was then kinda irritated by the fact that MKL was still faster, considering that the assembly output didn't seem to contain any irrelevant instructions in the inner loop. So I figured it must be the order of the instructions, and I changed the assembly output from this:
Code: Select all
.L129:
   movapd   (%esi,%eax,8), %xmm7
   movapd   16(%esi,%eax,8), %xmm6
   mulpd   (%edi,%eax,8), %xmm7
   mulpd   16(%edi,%eax,8), %xmm6
   addpd   %xmm0, %xmm7
   movapd   32(%esi,%eax,8), %xmm5
   movapd   48(%esi,%eax,8), %xmm4
   mulpd   32(%edi,%eax,8), %xmm5
   mulpd   48(%edi,%eax,8), %xmm4
   movapd   64(%esi,%eax,8), %xmm3
   movapd   80(%esi,%eax,8), %xmm0
   mulpd   64(%edi,%eax,8), %xmm3
   mulpd   80(%edi,%eax,8), %xmm0
   movapd   96(%esi,%eax,8), %xmm2
   addpd   %xmm7, %xmm6
   mulpd   96(%edi,%eax,8), %xmm2
   addpd   %xmm6, %xmm5
   addpd   %xmm5, %xmm4
   addpd   %xmm4, %xmm3
   addpd   %xmm3, %xmm0
   addpd   %xmm0, %xmm2
   movapd   112(%esi,%eax,8), %xmm0
   mulpd   112(%edi,%eax,8), %xmm0
   addl   $16, %eax
   cmpl   %eax, %ebx
   addpd   %xmm2, %xmm0
   jg   .L129


to this (as noted in the edit above, the following code is wrong... :( ):
Code: Select all
.L129:
   movapd   (%esi,%eax,8), %xmm7
   movapd   16(%esi,%eax,8), %xmm6
   movapd   32(%esi,%eax,8), %xmm5
   movapd   48(%esi,%eax,8), %xmm4
   movapd   64(%esi,%eax,8), %xmm3
   movapd   80(%esi,%eax,8), %xmm2
   movapd   96(%esi,%eax,8), %xmm0
   mulpd   (%edi,%eax,8), %xmm7
   addpd   %xmm0, %xmm7
   mulpd   16(%edi,%eax,8), %xmm6
   addpd   %xmm7, %xmm6
   mulpd   32(%edi,%eax,8), %xmm5
   addpd   %xmm6, %xmm5
   mulpd   48(%edi,%eax,8), %xmm4
   addpd   %xmm5, %xmm4
   mulpd   64(%edi,%eax,8), %xmm3
   addpd   %xmm4, %xmm3
   mulpd   80(%edi,%eax,8), %xmm2
   addpd   %xmm3, %xmm0   
   mulpd   96(%edi,%eax,8), %xmm0
   addpd   %xmm0, %xmm2
   movapd   112(%esi,%eax,8), %xmm0
   addl   $16, %eax
   cmpl   %eax, %ebx
   mulpd   112(%edi,%eax,8), %xmm0
   addpd   %xmm2, %xmm1
   jg   .L129


(only the order of the instructions is changed, and xmm1 is used to accumulate)

Et voilà:
[Image: ddot performance after reordering the instructions]

Now my question is: is there any way to influence how GCC's optimizer orders the instructions generated from intrinsics? The other idea I had to automate this was to insert asm("#start label") / asm("#end label") comments into the library and then run an external program on the assembly... Or is it just a case of "let's wait until GCC recognizes this kind of optimization itself"?
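Concretely, something like the following is what I have in mind for the markers (function name and marker strings are made up; GCC copies the text of a basic asm() statement verbatim into its assembly output, so an external tool could search for these comments):
Code: Select all
// Sketch only: the markers and the function are made up for illustration.
// GCC emits the text of a basic asm() statement verbatim into the .s file.
inline void ddot_kernel_marked(int n, const double* a, const double* b, double* out)
{
   asm volatile("# BEGIN ddot inner loop");   // shows up as a comment in the asm
   // ... vectorized inner loop (e.g. the intrinsics sketch above) ...
   asm volatile("# END ddot inner loop");
   (void)n; (void)a; (void)b; (void)out;      // placeholder body
}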

bjacob
No idea about the GCC question, but your finding is interesting, as we are seeing a similar phenomenon on the first graph of http://eigen.tuxfamily.org/index.php?title=Benchmark where MKL peaks higher than us in a very similar way.


japro
Well, the screwup with my algorithm seems not as trivial to fix as I thought. Apparently it's not so much the chaining of mul-add that is relevant; it seems to be the chaining of 8 (!) loads, which conflicts with keeping the accumulator in one of the registers. The algorithm would be fixable with, say, 4 loads, 4 madds, and then again 4 and 4, but then it is "slow" again. I can fix it by loading and storing the accumulator every iteration and still retain some of the speedup, but then it's slightly slower for larger arrays since the additional stores cut into the bandwidth.

EDIT:
I found a loop that gives me this performance (you may have to reload if you looked at the old image... I didn't change the name):
[Image: updated ddot performance with the new inner loop]

The inner loop looks like this:
Code: Select all
.L129:
   movapd   (%esi,%eax,8), %xmm7
   movapd   16(%esi,%eax,8), %xmm6
   movapd   32(%esi,%eax,8), %xmm5
   movapd   48(%esi,%eax,8), %xmm4
   
   mulpd   (%edi,%eax,8), %xmm7
   addpd   %xmm7, %xmm3
   mulpd   16(%edi,%eax,8), %xmm6
   addpd   %xmm6, %xmm2
   mulpd   32(%edi,%eax,8), %xmm5
   addpd   %xmm5, %xmm1
   mulpd   48(%edi,%eax,8), %xmm4
   addpd   %xmm4, %xmm0

   movapd   64(%esi,%eax,8), %xmm7
   movapd   80(%esi,%eax,8), %xmm6
   movapd   96(%esi,%eax,8), %xmm5
   movapd   112(%esi,%eax,8), %xmm4   

   mulpd   64(%edi,%eax,8), %xmm7
   addpd   %xmm7, %xmm3
   mulpd   80(%edi,%eax,8), %xmm6
   addpd   %xmm6, %xmm2
   mulpd   96(%edi,%eax,8), %xmm5
   addpd   %xmm5, %xmm1
   mulpd   112(%edi,%eax,8), %xmm4
   addpd   %xmm4, %xmm0
   
   addl   $16, %eax
   cmpl   %eax, %ebx
   jg   .L129


I still don't understand what exactly makes the difference. It seems to depend on which registers are used with which others, or on accessing them at the longest possible intervals... So this version now uses 4 accumulators (xmm0-3) that are added after the loop.

I guess there is no reasonable way to get the compiler to produce this code with templates and intrinsics alone. One could specialize the ddot template for the case that both operands are "primary" expressions and then directly inline this version...
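For illustration, such a specialized, directly inlined version could look roughly like this in intrinsics (sketch only: it assumes n is a multiple of 8 and 16-byte aligned inputs, and the compiler is of course free to pick a different register assignment):
Code: Select all
#include <emmintrin.h>

// Sketch only: assumes n % 8 == 0 and 16-byte aligned a, b.
inline double ddot_4acc_sketch(int n, const double* a, const double* b)
{
   __m128d acc0 = _mm_setzero_pd(), acc1 = _mm_setzero_pd();
   __m128d acc2 = _mm_setzero_pd(), acc3 = _mm_setzero_pd();
   for (int i = 0; i < n; i += 8) {
      acc0 = _mm_add_pd(acc0, _mm_mul_pd(_mm_load_pd(a + i    ), _mm_load_pd(b + i    )));
      acc1 = _mm_add_pd(acc1, _mm_mul_pd(_mm_load_pd(a + i + 2), _mm_load_pd(b + i + 2)));
      acc2 = _mm_add_pd(acc2, _mm_mul_pd(_mm_load_pd(a + i + 4), _mm_load_pd(b + i + 4)));
      acc3 = _mm_add_pd(acc3, _mm_mul_pd(_mm_load_pd(a + i + 6), _mm_load_pd(b + i + 6)));
   }
   // sum the four accumulators after the loop, as in the assembly above
   __m128d acc = _mm_add_pd(_mm_add_pd(acc0, acc1), _mm_add_pd(acc2, acc3));
   double tmp[2];
   _mm_storeu_pd(tmp, acc);
   return tmp[0] + tmp[1];
}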
japro
For completeness, here is an "extracted" version of the ddot kernel as inline asm. Does anyone know how I can solve the problem of duplicated labels when this gets inlined? I'm not very competent when it comes to GCC's asm syntax :-/.

EDIT: fixed the label issue and also added the single-precision (sdot) version.

Code: Select all
/*
 * a and b have to be 16-byte aligned
 */
inline double fast_ddot(int n, double *a, double *b)
{
  double res[2];

   asm volatile (
      " xorpd   %%xmm0, %%xmm0\n "
      "   testl   %%ebx, %%ebx;\n"
      "   jle   128f\n"
      "   leal   -1(%%ebx), %%edx\n"
      "   movapd   (%%esi), %%xmm1\n"
      "   movl   $2, %%eax\n"
      "   shrl   %%edx\n"
      "   mulpd   (%%edi), %%xmm1\n"
      "   addpd   %%xmm1, %%xmm0\n"
      "   andl   $7, %%edx\n"
      "   cmpl   $2, %%ebx\n"
      "   xorpd %%xmm1, %%xmm1\n"
      "   xorpd %%xmm2, %%xmm2\n"
      "   xorpd %%xmm3, %%xmm3\n"
      "   jle   128f\n"
      "   testl   %%edx, %%edx\n"
      "   je   129f\n"
      "   cmpl   $1, %%edx\n"
      "   je   410f\n"
      "   cmpl   $2, %%edx\n"
      "   .p2align 4,,5\n"
      "   je   411f\n"
      "   cmpl   $3, %%edx\n"
      "   .p2align 4,,5\n"
      "   je   412f\n"
      "   cmpl   $4, %%edx\n"
      "   .p2align 4,,5\n"
      "   je    413f\n"
      "   cmpl   $5, %%edx\n"
      "   .p2align 4,,5\n"
      "   je   414f\n"
      "   cmpl   $6, %%edx\n"
      "   .p2align 4,,5\n"
      "   je   415f\n"
      "   movl   $4, %%eax\n"
      "   movapd   16(%%esi), %%xmm2\n"
      "   mulpd   16(%%edi), %%xmm2\n"
      "   addpd   %%xmm2, %%xmm0\n"
      " 415:\n"
      "   movapd   (%%esi,%%eax,8), %%xmm3\n"
      "   mulpd   (%%edi,%%eax,8), %%xmm3\n"
      "   addl   $2, %%eax\n"
      "   addpd   %%xmm3, %%xmm0\n"
      " 414:\n"
      "   movapd   (%%esi,%%eax,8), %%xmm4\n"
      "   mulpd   (%%edi,%%eax,8), %%xmm4\n"
      "   addl   $2, %%eax\n"
      "   addpd   %%xmm4, %%xmm0\n"
      "413:\n"
      "   movapd   (%%esi,%%eax,8), %%xmm5\n"
      "   mulpd   (%%edi,%%eax,8), %%xmm5\n"
      "   addl   $2, %%eax\n"
      "   addpd   %%xmm5, %%xmm0\n"
      "412:\n"
      "   movapd   (%%esi,%%eax,8), %%xmm6\n"
      "   mulpd   (%%edi,%%eax,8), %%xmm6\n"
      "   addl   $2, %%eax\n"
      "   addpd   %%xmm6, %%xmm0\n"
      "411:\n"
      "   movapd   (%%esi,%%eax,8), %%xmm7\n"
      "   mulpd   (%%edi,%%eax,8), %%xmm7\n"
      "   addl   $2, %%eax\n"
      "   addpd   %%xmm7, %%xmm0\n"
      "410:\n"
      "   movapd   (%%esi,%%eax,8), %%xmm1\n"
      "   mulpd   (%%edi,%%eax,8), %%xmm1\n"
      "   addl   $2, %%eax\n"
      "   cmpl   %%eax, %%ebx\n"
      "   addpd   %%xmm1, %%xmm0\n"
      "   xorpd %%xmm1, %%xmm1\n"
      "   xorpd %%xmm2, %%xmm2\n"
      "   xorpd %%xmm3, %%xmm3\n"
      "   jle   128f\n"
      "   .p2align 4,,7\n"
      "   .p2align 3\n"
      "129:\n"
      "   movapd   (%%esi,%%eax,8), %%xmm7\n"
      "   movapd   16(%%esi,%%eax,8), %%xmm6\n"
      "   movapd   32(%%esi,%%eax,8), %%xmm5\n"
      "   movapd   48(%%esi,%%eax,8), %%xmm4\n"
      "   \n"
      "   mulpd   (%%edi,%%eax,8), %%xmm7\n"
      "   addpd   %%xmm7, %%xmm3\n"
      "   mulpd   16(%%edi,%%eax,8), %%xmm6\n"
      "   addpd   %%xmm6, %%xmm2\n"
      "   mulpd   32(%%edi,%%eax,8), %%xmm5\n"
      "   addpd   %%xmm5, %%xmm1\n"
      "   mulpd   48(%%edi,%%eax,8), %%xmm4\n"
      "   addpd   %%xmm4, %%xmm0\n"
      "\n"
      "   movapd   64(%%esi,%%eax,8), %%xmm7\n"
      "   movapd   80(%%esi,%%eax,8), %%xmm6\n"
      "   movapd   96(%%esi,%%eax,8), %%xmm5\n"
      "   movapd   112(%%esi,%%eax,8), %%xmm4\n"   

      "   mulpd   64(%%edi,%%eax,8), %%xmm7\n"
      "   addpd   %%xmm7, %%xmm3\n"
      "   mulpd   80(%%edi,%%eax,8), %%xmm6\n"
      "   addpd   %%xmm6, %%xmm2\n"
      "   mulpd   96(%%edi,%%eax,8), %%xmm5\n"
      "   addpd   %%xmm5, %%xmm1\n"
      "   mulpd   112(%%edi,%%eax,8), %%xmm4\n"
      "   addpd   %%xmm4, %%xmm0\n"
   
      "   addl   $16, %%eax\n"
      "   cmpl   %%eax, %%ebx\n"
      "   jg   129b\n"
      "128:\n"
      "   addpd   %%xmm3, %%xmm0\n"
      "   addpd   %%xmm2, %%xmm0\n"
      "   addpd   %%xmm1, %%xmm0\n"
      "   movupd   %%xmm0, %0\n"
      : "=m"(res)
      : "S"(a), "D"(b), "b"(n)
      : "%eax","%edx","%ecx","%xmm0","%xmm1","%xmm2","%xmm3","%xmm4","%xmm5","%xmm6","%xmm7"
      );

  return res[0] + res[1];
}


inline double fast_sdot(int n, float *a, float *b)
{
  float res[4];

   asm volatile (
      " xorps   %%xmm0, %%xmm0; "
      "   testl   %%ebx, %%ebx;   "
      "   jle   128f\n"
      "   leal   -1(%%ebx), %%edx\n"
      "   movaps   (%%esi), %%xmm1\n"
      "   movl   $4, %%eax\n"
      "   shrl   $2, %%edx\n"
      "   mulps   (%%edi), %%xmm1\n"
      "   addps   %%xmm1, %%xmm0\n"
      "   andl   $7, %%edx\n"
      "   cmpl   $4, %%ebx\n"
      "   xorps %%xmm1, %%xmm1\n"
      "   xorps %%xmm2, %%xmm2\n"
      "   xorps %%xmm3, %%xmm3\n"
      "   jle   128f\n"
      "   testl   %%edx, %%edx\n"
      "   je   129f\n"
      "   cmpl   $1, %%edx\n"
      "   je   410f\n"
      "   cmpl   $2, %%edx\n"
      "   .p2align 4,,5\n"
      "   je   411f\n"
      "   cmpl   $3, %%edx\n"
      "   .p2align 4,,5\n"
      "   je   412f\n"
      "   cmpl   $4, %%edx\n"
      "   .p2align 4,,5\n"
      "   je   413f\n"
      "   cmpl   $5, %%edx\n"
      "   .p2align 4,,5\n"
      "   je   414f\n"
      "   cmpl   $6, %%edx\n"
      "   .p2align 4,,5\n"
      "   je   415f\n"
      "   movl   $8, %%eax\n"
      "   movaps   16(%%esi), %%xmm2\n"
      "   mulps   16(%%edi), %%xmm2\n"
      "   addps   %%xmm2, %%xmm0\n"
      "415:\n"
      "   movaps   (%%esi,%%eax,4), %%xmm3\n"
      "   mulps   (%%edi,%%eax,4), %%xmm3\n"
      "   addl   $4, %%eax\n"
      "   addps   %%xmm3, %%xmm0\n"
      "414:\n"
      "   movaps   (%%esi,%%eax,4), %%xmm4\n"
      "   mulps   (%%edi,%%eax,4), %%xmm4\n"
      "   addl   $4, %%eax\n"
      "   addps   %%xmm4, %%xmm0\n"
      "413:\n"
      "   movaps   (%%esi,%%eax,4), %%xmm5\n"
      "   mulps   (%%edi,%%eax,4), %%xmm5\n"
      "   addl   $4, %%eax\n"
      "   addps   %%xmm5, %%xmm0\n"
      "412:\n"
      "   movaps   (%%esi,%%eax,4), %%xmm6\n"
      "   mulps   (%%edi,%%eax,4), %%xmm6\n"
      "   addl   $4, %%eax\n"
      "   addps   %%xmm6, %%xmm0\n"
      "411:\n"
      "   movaps   (%%esi,%%eax,4), %%xmm7\n"
      "   mulps   (%%edi,%%eax,4), %%xmm7\n"
      "   addl   $4, %%eax\n"
      "   addps   %%xmm7, %%xmm0\n"
      "410:\n"
      "   movaps   (%%esi,%%eax,4), %%xmm1\n"
      "   mulps   (%%edi,%%eax,4), %%xmm1\n"
      "   addl   $4, %%eax\n"
      "   cmpl   %%eax, %%ebx\n"
      "   addps   %%xmm1, %%xmm0\n"
      "   xorps %%xmm1, %%xmm1\n"
      "   xorps %%xmm2, %%xmm2\n"
      "   xorps %%xmm3, %%xmm3\n"
      "   jle   128f\n"
      "   .p2align 4,,7\n"
      "   .p2align 3\n"
      "129:\n"
      "   movaps   (%%esi,%%eax,4), %%xmm7\n"
      "   movaps   16(%%esi,%%eax,4), %%xmm6\n"
      "   movaps   32(%%esi,%%eax,4), %%xmm5\n"
      "   movaps   48(%%esi,%%eax,4), %%xmm4\n"
      "   "
      "   mulps   (%%edi,%%eax,4), %%xmm7\n"
      "   addps   %%xmm7, %%xmm3\n"
      "   mulps   16(%%edi,%%eax,4), %%xmm6\n"
      "   addps   %%xmm6, %%xmm2\n"
      "   mulps   32(%%edi,%%eax,4), %%xmm5\n"
      "   addps   %%xmm5, %%xmm1\n"
      "   mulps   48(%%edi,%%eax,4), %%xmm4\n"
      "   addps   %%xmm4, %%xmm0\n"
      " "
      "   movaps   64(%%esi,%%eax,4), %%xmm7\n"
      "   movaps   80(%%esi,%%eax,4), %%xmm6\n"
      "   movaps   96(%%esi,%%eax,4), %%xmm5\n"
      "   movaps   112(%%esi,%%eax,4), %%xmm4\n"   

      "   mulps   64(%%edi,%%eax,4), %%xmm7\n"
      "   addps   %%xmm7, %%xmm3\n"
      "   mulps   80(%%edi,%%eax,4), %%xmm6\n"
      "   addps   %%xmm6, %%xmm2\n"
      "   mulps   96(%%edi,%%eax,4), %%xmm5\n"
      "   addps   %%xmm5, %%xmm1\n"
      "   mulps   112(%%edi,%%eax,4), %%xmm4\n"
      "   addps   %%xmm4, %%xmm0\n"
   
      "   addl   $32, %%eax\n"
      "   cmpl   %%eax, %%ebx\n"
      "   jg   129b\n"
      "128:\n"
      "   addps   %%xmm3, %%xmm0\n"
      "   addps   %%xmm2, %%xmm0\n"
      "   addps   %%xmm1, %%xmm0\n"
      " haddps  %%xmm0, %%xmm0\n"
      " haddps  %%xmm0, %%xmm0\n"
      "   movups   %%xmm0, %0\n"
      : "=m"(res)
      : "S"(a), "D"(b), "b"(n)
      : "%eax","%edx","%ecx","%xmm0","%xmm1","%xmm2","%xmm3","%xmm4","%xmm5","%xmm6","%xmm7"
      );

   return res[0];
}
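Here is a hypothetical usage example (not part of my original code; it assumes a POSIX system for the aligned allocation, since both kernels require 16-byte aligned pointers):
Code: Select all
#include <cstdio>
#include <cstdlib>

int main()
{
   const int n = 1024;
   double *a = 0, *b = 0;
   // the kernels require 16-byte aligned data
   if (posix_memalign(reinterpret_cast<void**>(&a), 16, n * sizeof(double)) ||
       posix_memalign(reinterpret_cast<void**>(&b), 16, n * sizeof(double)))
      return 1;
   for (int i = 0; i < n; ++i) { a[i] = 1.0; b[i] = 2.0; }
   std::printf("ddot = %f\n", fast_ddot(n, a, b));   // expected: 2048
   std::free(a);
   std::free(b);
   return 0;
}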
japro
I'm just continuing my monologue :D. I hope that's ok.

bjacob wrote:No idea about the GCC question, but your finding is interesting, as we are seeing a similar phenomenon on the first graph of http://eigen.tuxfamily.org/index.php?title=Benchmark where MKL peaks higher than us in a very similar way.

So the operation there is saxpy, right? It doesn't say single precision on the benchmark page, but considering they peak around 4 operations per cycle, that seems the most plausible. I again tested my generic vectorization as well as a hand-optimized one.

saxpy:
[Image: saxpy benchmark]
and daxpy:
[Image: daxpy benchmark]

Eigen and MKL show curves similar to those on the benchmark page in the wiki. The hand-optimized versions almost hit the theoretical maximum (1.95 ops/cycle double precision, 3.9 ops/cycle single precision). Interestingly, my generic vectorization does better than Eigen in this case.

Looking at the assembly of Eigen's vectorization and mine shows an interesting difference. The output looks pretty much the same except for the order in which registers are used. When loop unrolling is used, Eigen tends to produce "linear" register usage like this:
addps %xmm1, %xmm0
addps %xmm2, %xmm0
addps %xmm3, %xmm0
addps %xmm4, %xmm0
movaps %xmm0, some location

My vectorizer often produces "cyclic" usage:
addps %xmm0, %xmm1
addps %xmm1, %xmm2
addps %xmm2, %xmm3
addps %xmm3, %xmm4
movaps %xmm4, some location

In this case that isn't very relevant, but in some cases where other instructions get interleaved, the cyclic versions seem to support pipelining better, since there is less repeated access to the same registers in consecutive instructions.

The ordering seems to be partly determined by the nesting of intrinsics. I found that defining "madd" as _mm_add_ps(_mm_mul_ps(a, b), c) produces cyclic versions while _mm_add_ps(a, _mm_mul_ps(b, c)) gives me more linear ones.
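Concretely, the two "madd" spellings I'm comparing are these (sketch; the comments describe the tendency I observed on my setup, not a general rule):
Code: Select all
#include <xmmintrin.h>

// tends to give me the "cyclic" register pattern
static inline __m128 madd_a(__m128 a, __m128 b, __m128 c)
{
   return _mm_add_ps(_mm_mul_ps(a, b), c);
}

// tends to give me the "linear" register pattern
static inline __m128 madd_b(__m128 a, __m128 b, __m128 c)
{
   return _mm_add_ps(a, _mm_mul_ps(b, c));
}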

I hope this helps someone. Also, please share your own experience with this kind of thing. I think it's very interesting, but there isn't much well-documented information around (or at least I'm unable to find it).

EDIT: Actually, that's even observable by just changing the order of instructions in the source code. Using -O3 -funroll-loops, I get consistently higher performance with Eigen if I write
Code: Select all
y = a*x + y;

instead of
Code: Select all
y = y + a*x; // or
y += a*x;
ggael (Moderator)
Hi, the reason why Eigen does not exhibit peak performance like MKL is that we don't do loop peeling (partial unrolling). However, naive loop peeling does not help, because GCC still reuses the same registers for the loads, like:

xmm0 <- x[i]
xmm1 <- y[i]
...
y[i] <- xmm1

xmm0 <- x[i+1]
xmm1 <- y[i+1]
...
y[i+1] <- xmm1

while we want to generate something like:

xmm0 <- x[i]
xmm1 <- y[i]
xmm2 <- x[i+1]
xmm3 <- y[i+1]
...
...
y[i] <- xmm1
y[i+1] <- xmm3

To enforce that, one possibility would be to introduce "super packets" containing multiple registers...
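Roughly something like this, just as a sketch of the idea (made-up names, not existing Eigen code):
Code: Select all
#include <xmmintrin.h>

// Sketch of the "super packet" idea: a packet bundling two registers, so the
// loads of consecutive packets are forced to be emitted together.
struct SuperPacket4f
{
   __m128 p0, p1;
};

static inline SuperPacket4f sp_load(const float* p)
{
   SuperPacket4f r;
   r.p0 = _mm_load_ps(p);
   r.p1 = _mm_load_ps(p + 4);
   return r;
}

static inline void sp_store(float* p, const SuperPacket4f& a)
{
   _mm_store_ps(p,     a.p0);
   _mm_store_ps(p + 4, a.p1);
}

static inline SuperPacket4f sp_madd(const SuperPacket4f& a, const SuperPacket4f& b,
                                    const SuperPacket4f& c)
{
   SuperPacket4f r;
   r.p0 = _mm_add_ps(_mm_mul_ps(a.p0, b.p0), c.p0);
   r.p1 = _mm_add_ps(_mm_mul_ps(a.p1, b.p1), c.p1);
   return r;
}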
japro
ggael wrote:Hi, the reason why Eigen does not exhibit peak performance like MKL is that we don't do loop peeling (partial unrolling). However, naive loop peeling does not help, because GCC still reuses the same registers for the loads, like:

xmm0 <- x[i]
xmm1 <- y[i]
...
y[i] <- xmm1

xmm0 <- x[i+1]
xmm1 <- y[i+1]
...
y[i+1] <- xmm1


Are you talking about meta loop unrolling or the unrolling the compiler does? With meta unrolling I got what you describe. If I use -funroll-loops on GCC, I get loops like this (straight from the asm output):
Code: Select all
.L51:
   movaps   (%ebx,%eax,4), %xmm3
   mulps   %xmm0, %xmm3
   addps   (%esi,%eax,4), %xmm3
   movaps   %xmm3, (%edx)
   movaps   16(%ebx,%eax,4), %xmm2
   mulps   %xmm0, %xmm2
   addps   16(%esi,%eax,4), %xmm2
   movaps   %xmm2, 16(%edx)
   movaps   32(%ebx,%eax,4), %xmm1
   mulps   %xmm0, %xmm1
   addps   32(%esi,%eax,4), %xmm1
   movaps   %xmm1, 32(%edx)
   movaps   48(%ebx,%eax,4), %xmm7
   mulps   %xmm0, %xmm7
   addps   48(%esi,%eax,4), %xmm7
   movaps   %xmm7, 48(%edx)
   movaps   64(%ebx,%eax,4), %xmm6
   mulps   %xmm0, %xmm6
   addps   64(%esi,%eax,4), %xmm6
   movaps   %xmm6, 64(%edx)
   movaps   80(%ebx,%eax,4), %xmm5
   mulps   %xmm0, %xmm5
   addps   80(%esi,%eax,4), %xmm5
   movaps   %xmm5, 80(%edx)
   movaps   96(%ebx,%eax,4), %xmm4
   mulps   %xmm0, %xmm4
   addps   96(%esi,%eax,4), %xmm4
   movaps   %xmm4, 96(%edx)
   movaps   112(%ebx,%eax,4), %xmm3
   mulps   %xmm0, %xmm3
   addps   112(%esi,%eax,4), %xmm3
   addl   $32, %eax
   movaps   %xmm3, 112(%edx)
   subl   $-128, %edx
   cmpl   %eax, %ecx
   jg   .L51
ggael
Yes, I was talking about explicit meta (partial) unrolling. Good to know that sometimes GCC is able to automatically unroll loops in a nice way. In your bench, did you also compile Eigen with -funroll-loops?
japro
ggael wrote:Yes, I was talking about explicit meta (partial) unrolling. Good to know that sometimes GCC is able to automatically unroll loops in a nice way. In your bench, did you also compile Eigen with -funroll-loops?

Yes, Eigen produces pretty much the same kind of loops, but not always at the same setting. Somehow Eigen often needs -O3 for max performance, while my implementation usually hits best performance at -O2. I haven't found out why. Interestingly, the Intel compiler refuses to unroll the intrinsics loops, so performance ends up generally lower than with GCC in my testing.

How the registers are used is mostly determined by how the intrinsics are nested: add(a, mul(b, c)) produces different results than add(mul(b, c), a)...
That's the reason why sometimes writing "a = a + 5*b" can give different performance than "a = 5*b + a". At the moment I have my symmetric operators always put the "heavy" expression on the left side, since I found it to produce better performance (also, the expressions then always have the same tree structure, which makes specialization for certain operations easier).
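As a sketch (made-up type names, not my actual code), the canonicalization looks roughly like this:
Code: Select all
// Sketch only: made-up types, just to show the operand swap.
struct Vec { const float* data; };                    // plain vector ("light")
template<class E> struct Scaled { E e; float s; };    // s*x ("heavy")
template<class L, class R> struct Sum { L l; R r; };

template<class E>
Sum<Scaled<E>, Vec> operator+(const Scaled<E>& heavy, const Vec& light)
{
   Sum<Scaled<E>, Vec> node = { heavy, light };       // already canonical
   return node;
}

template<class E>
Sum<Scaled<E>, Vec> operator+(const Vec& light, const Scaled<E>& heavy)
{
   Sum<Scaled<E>, Vec> node = { heavy, light };       // swapped: heavy goes left
   return node;
}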

I generally found meta unrolling to give only very small improvements, while compiler unrolling boosts certain operations tremendously (almost to the theoretical maximum). So I pretty much abandoned meta unrolling for these kinds of operations; otherwise one might end up accidentally combining both and get "multi-unrolled" loops that span several pages in the code, with severe and unpredictable penalties for jumping across page borders in an inner loop.
ggael
Yes, the problem is that in general we cannot expect the compiler to do the right thing, especially when we want to support multiple compilers. Actually, even the behavior of GCC varies a lot from one version to the other, and it also varies with some weird context changes. Moreover, with -funroll-loops, GCC is much too aggressive, in the sense that it unrolls many loops that should really not be unrolled!! For instance, the tiny loops which deal with the boundaries that cannot be vectorized...
japro
ggael wrote:Yes, the problem is that in general we cannot expect the compiler to do the right thing, especially when we want to support multiple compilers.

So I went back to the drawing board and tried to figure out a generic way to explicitly unroll the loops in the desired fashion and came up with this:

Along with the expression template "tree", I build a Storage type that is passed by reference through the operations. Then I dissected the operation that the "operation object" builds into three phases (load, op, store). So, for example, the "primary" expression that is the vector itself puts an __m128 into the storage type and loads an element in its load phase, while the other two phases are empty. The addition expression just passes the load and store orders through, but calls _mm_add in its op function, etc.
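A stripped-down sketch of what I mean (names made up for illustration; my real code is templated more generically):
Code: Select all
#include <xmmintrin.h>

// "primary" expression: a plain float vector
struct VecRef
{
   float* data;
   struct Storage { __m128 r; };
   void load_once(int, Storage&) const {}
   void load(int i, Storage& s) const  { s.r = _mm_load_ps(data + i); }   // load phase
   void vector_op(int, Storage&) const {}                                 // nothing to compute
   void store(int i, const Storage& s) const { _mm_store_ps(data + i, s.r); }
};

// s*x with the scalar loaded once before the loop
struct ScaledRef
{
   float scale; const float* data;
   struct Storage { __m128 s, r; };
   void load_once(int, Storage& st) const { st.s = _mm_set1_ps(scale); }
   void load(int i, Storage& st) const    { st.r = _mm_load_ps(data + i); }
   void vector_op(int, Storage& st) const { st.r = _mm_mul_ps(st.r, st.s); }
   void store(int, const Storage&) const  {}                              // nothing to write back
};

// top-level "operation object" for y = y + s*x
struct SaxpyAssign
{
   VecRef y; ScaledRef sx;
   struct Storage { VecRef::Storage sy; ScaledRef::Storage ss; };
   void load_once(int i, Storage& st) const { y.load_once(i, st.sy); sx.load_once(i, st.ss); }
   void load(int i, Storage& st) const      { y.load(i, st.sy); sx.load(i, st.ss); }
   void vector_op(int i, Storage& st) const
   {
      sx.vector_op(i, st.ss);
      st.sy.r = _mm_add_ps(st.sy.r, st.ss.r);
   }
   void store(int i, const Storage& st) const { y.store(i, st.sy); }
};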

In the following cases the expression template ao contains a saxpy (a = a + 0.5*b) operation.

Now I can write a "loop template" for the vectorized loop:

Code: Select all
      typename A::Storage s1;

      ao.load_once(i, s1);      // load constants

      for (; i < n; i += vec::len)
      {
         ao.load(i, s1);        // perform all loads of the expression
         ao.vector_op(i, s1);   // perform all operations
         ao.store(i, s1);       // perform all stores
      }

With the following expected result:
Code: Select all
.L70:
   movaps   (%esi,%eax,4), %xmm0
   mulps   %xmm1, %xmm0
   addps   (%edi,%eax,4), %xmm0
   movaps   %xmm0, (%edi,%eax,4)
   addl   $4, %eax
   cmpl   %eax, %edx
   jg   .L70


So if I want to unroll the loop I do this:
Code: Select all
      typename A::Storage s1;
      typename A::Storage s2;

      ao.load_once(i, s1);            // load constants
      ao.load_once(i+vec::len, s2);   // load constants

      for (; i < n; i += 2*vec::len)
      {
         ao.load(i, s1);                  // perform all loads of the expression
         ao.load(i+vec::len, s2);
         ao.vector_op(i, s1);             // perform all operations
         ao.vector_op(i+vec::len, s2);
         ao.store(i, s1);                 // perform all stores
         ao.store(i+vec::len, s2);
      }

and get this:
Code: Select all
.L47:
   movaps   (%esi,%eax,4), %xmm1
   movaps   16(%esi,%eax,4), %xmm0
   mulps   %xmm2, %xmm1
   mulps   %xmm2, %xmm0
   addps   (%edi,%eax,4), %xmm1
   addps   16(%edi,%eax,4), %xmm0
   movaps   %xmm1, (%edi,%eax,4)
   movaps   %xmm0, 16(%edi,%eax,4)
   addl   $8, %eax
   cmpl   %eax, %edx
   jg   .L47

Let's try another one:
Code: Select all
      typename A::Storage s1;
      typename A::Storage s2;

      ao.load_once(i, s1);
      ao.load_once(i+vec::len, s2);

      for (; i < n; i += 4*vec::len)
      {
         ao.load(i, s1);
         ao.load(i+vec::len, s2);

         ao.vector_op(i, s1);
         ao.vector_op(i+vec::len, s2);

         ao.store(i, s1);
         ao.store(i+vec::len, s2);

         ao.load(i+2*vec::len, s1);
         ao.load(i+3*vec::len, s2);

         ao.vector_op(i+2*vec::len, s1);
         ao.vector_op(i+3*vec::len, s2);

         ao.store(i+2*vec::len, s1);
         ao.store(i+3*vec::len, s2);
      }


result:
Code: Select all
.L47:
   movaps   (%ecx,%eax,4), %xmm2
   movaps   16(%ecx,%eax,4), %xmm1
   mulps   %xmm0, %xmm2
   mulps   %xmm0, %xmm1
   addps   (%edx,%eax,4), %xmm2
   addps   16(%edx,%eax,4), %xmm1
   movaps   %xmm2, (%edx,%eax,4)
   movaps   %xmm1, 16(%edx,%eax,4)
   movaps   32(%ecx,%eax,4), %xmm2
   movaps   48(%ecx,%eax,4), %xmm1
   mulps   %xmm0, %xmm2
   mulps   %xmm0, %xmm1
   addps   32(%edx,%eax,4), %xmm2
   addps   48(%edx,%eax,4), %xmm1
   movaps   %xmm2, 32(%edx,%eax,4)
   movaps   %xmm1, 48(%edx,%eax,4)
   addl   $16, %eax
   cmpl   %eax, %ebx
   jg   .L47


It seems to work nicely :). Also, it kinda answers the questions that I started the thread with, YAY.

Edit: OK, it doesn't strictly enforce the order of operations; I just had a few examples where GCC's optimizer will "weave" some of the arithmetic and memory operations. But still, the control this method offers over the output is way beyond all my earlier approaches. Also, the inconsistencies mostly seem to be gone. I just had "a = 4*a + 5*b" run at over 5 flops per cycle with this method.
ggael
Nice work! I've also thought about decoupling the loads from the actual computations, but this looked a bit scary to implement in a robust way. In particular, you have to take care not to over-unroll the loop; otherwise this may become worse, as the compiler might be tempted to push/pop some register values, thus killing the performance. This requires knowing not only the number of registers needed for the loads (that's easy), but also estimating the number of required temporary registers, which is more complicated to evaluate as it highly depends on the compiler.

Nevertheless, it is good to see that it does work very well for simple cases.

