Registered Member
|
Edit: aww, **** I just realized that the order change broke the algorithm (it overwrites the accumulator xmm0), but I guess it should be fixable.
Hi, I'm not entirely sure if this is the right place for this, but I don't know where else to ask and I think it's relevant to Eigen. I'm taking a parallel numerical computing course at the moment, and in one of the exercises we were asked to implement expression templates (basic arithmetic + dot product). I then went one step further and made mine vectorize automatically using xmm/emmintrin intrinsics. After some fiddling with the details I compared ddot operations against Eigen3 and MKL (gcc 4.4, "-O2 -msse2 -funroll-loops", on an Intel Core i3-330M):
I was then somewhat irritated that MKL was still faster, considering that the assembly output didn't seem to contain any irrelevant instructions in the inner loop. So I figured it must be the order of the instructions, and I changed the assembly output from this:
to this (note: the following code is wrong, see the edit above):
(only the order of the instructions is changed and xmm1 is used to accumulate). Et voilà:
Now my question is: is there any way to influence how gcc's optimizer orders instructions generated from intrinsics? The other idea I had to automate this was to insert asm("#start label") / asm("#end label") comments into the library and then run an external program on the assembly... Or is it just "let's wait until gcc recognizes this kind of optimization itself"?
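(For what it's worth, the "comment marker" idea is easy to prototype: an asm statement whose template is just an assembler comment shows up verbatim in the .s output, so an external script can locate and rewrite the region in between. A minimal sketch; the macro names are made up, and GCC may still schedule unrelated instructions across the markers:)

    #define REGION_BEGIN(name) asm volatile("# begin " name)
    #define REGION_END(name)   asm volatile("# end " name)

    void scale(double* x, int n) {
        REGION_BEGIN("scale_loop");
        for (int i = 0; i < n; ++i)
            x[i] *= 2.0;              // the generated loop ends up between the two comments
        REGION_END("scale_loop");
    }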
Last edited by japro on Sun Apr 10, 2011 8:11 pm, edited 1 time in total.
|
Registered Member
|
No idea about the GCC question, but your finding is interesting, as we are seeing a similar phenomenon on the first graph of http://eigen.tuxfamily.org/index.php?title=Benchmark where MKL peaks higher than us in a very similar way.
Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list! |
Registered Member
|
Well, the screwup with my algorithm is not as trivial to fix as I thought. Apparently it's not so much the chaining of mul-add that matters; it seems to be the chaining of 8 (!) loads, which conflicts with keeping the accumulator in one of the registers. The algorithm could be fixed by doing, say, 4 loads, 4 madds, and then another 4 and 4, but then it is "slow" again. I can also fix it by loading and storing the accumulator every iteration and still retain some of the speedup, but then it's slightly slower for larger arrays since the additional stores cut into the bandwidth.
EDIT: I found a loop that gives me this performance (you may have to reload if you looked at the old image... I didn't change the name); the inner loop looks like this:
I still don't understand what exactly makes the difference. It seems to depend on which registers are used together, or on accessing them at the longest possible intervals... This version now uses 4 accumulators (xmm0-3) that are added up after the loop. I guess there is no reasonable way to get the compiler to produce this code with templates and intrinsics alone. One could specialize the ddot template for the case where both operands are "primary" expressions and then directly inline this version (a sketch of this kind of loop follows below)... |
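(To make the multiple-accumulator idea concrete, here is a rough sketch in plain intrinsics. This is my own code, not the exact loop from the post; the unrolling factor and the unaligned loads are arbitrary choices.)

    #include <emmintrin.h>

    double ddot4(const double* x, const double* y, long n) {
        // four independent accumulators break the addpd dependency chain
        __m128d acc0 = _mm_setzero_pd(), acc1 = _mm_setzero_pd();
        __m128d acc2 = _mm_setzero_pd(), acc3 = _mm_setzero_pd();
        long i = 0;
        for (; i + 8 <= n; i += 8) {   // 4 packets of 2 doubles per iteration
            acc0 = _mm_add_pd(acc0, _mm_mul_pd(_mm_loadu_pd(x + i    ), _mm_loadu_pd(y + i    )));
            acc1 = _mm_add_pd(acc1, _mm_mul_pd(_mm_loadu_pd(x + i + 2), _mm_loadu_pd(y + i + 2)));
            acc2 = _mm_add_pd(acc2, _mm_mul_pd(_mm_loadu_pd(x + i + 4), _mm_loadu_pd(y + i + 4)));
            acc3 = _mm_add_pd(acc3, _mm_mul_pd(_mm_loadu_pd(x + i + 6), _mm_loadu_pd(y + i + 6)));
        }
        // combine the partial sums only after the loop
        __m128d acc = _mm_add_pd(_mm_add_pd(acc0, acc1), _mm_add_pd(acc2, acc3));
        double tmp[2];
        _mm_storeu_pd(tmp, acc);
        double sum = tmp[0] + tmp[1];
        for (; i < n; ++i) sum += x[i] * y[i];   // scalar tail
        return sum;
    }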
Registered Member
|
For completeness, here is an "extracted" version of the ddot kernel as inline asm. Does anyone know how I can solve the problem of duplicated labels when this gets inlined? I'm not very competent when it comes to GCC's asm syntax.
EDIT: fixed the label issue and also added the fdot version.
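(For anyone hitting the same duplicated-label problem: the usual fix is the assembler's numeric "local" labels, which may be repeated and are referenced as 1b/1f for the nearest matching label backwards/forwards, so they stay valid when the asm block is inlined several times. Alternatively, %= expands to a number that is unique per asm statement and can be appended to a named label. A minimal, made-up example:)

    static inline void count_down(long n) {   // assumes n > 0
        asm volatile(
            "1:               \n\t"   // numeric local label instead of a fixed name
            "   sub $1, %[n]  \n\t"
            "   jnz 1b        \n\t"   // jump back to the nearest '1:' above
            : [n] "+r"(n)
            :
            : "cc");
    }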
|
Registered Member
|
I'm just continuing my monologue. I hope that's ok.
So the operation there is saxpy, right? It doesn't say single precision on the benchmark page, but considering they peak around 4 operations per cycle, that seems most plausible (an SSE register holds twice as many floats as doubles, which also matches the double/single peaks below). I again tested my generic vectorization as well as a hand-optimized one, saxpy: and daxpy: Eigen and MKL show curves similar to those on the benchmark page in the wiki. The hand-optimized versions almost hit the theoretical maximum (1.95 ops/cycle double precision, 3.9 ops/cycle single precision). Interestingly, my generic vectorization does better than Eigen in this case. Looking at the assembly of Eigen's vectorization and mine shows an interesting difference: the output looks pretty much the same except for the order in which the registers are used. With loop unrolling, Eigen tends to produce "linear" register usage like this:
my vectorizer often produces "cyclic" usage:
In this case that isn't very relevant, but in cases where other instructions get interleaved, the cyclic versions seem to support pipelining better, since there is less repeated access to the same registers in consecutive instructions. The ordering seems to be partly determined by the nesting of the intrinsics: I found that defining "madd" as _mm_add_ps(_mm_mul_ps(a, b), c) produces cyclic versions, while _mm_add_ps(a, _mm_mul_ps(b, c)) gives me more linear ones (see the sketch after this post). I hope this helps someone. Also, please share your own experience with this kind of thing; I think it's very interesting, but there isn't much well-documented information around (or at least I'm unable to find it).
EDIT: Actually, that's even observable by just changing the order of instructions in the source code. Using -O3 -funroll-loops I get consistently higher performance with Eigen if I write
instead of
|
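(For reference, the two "madd" variants mentioned in the previous post, written out as compilable helpers. The names are mine, and which register-allocation pattern you actually get will of course depend on the compiler version and flags; note that the argument roles differ between the two definitions.)

    #include <xmmintrin.h>

    // tends to produce the "cyclic" register usage: the multiply result feeds
    // the first operand of the add
    static inline __m128 madd_v1(__m128 a, __m128 b, __m128 c) {
        return _mm_add_ps(_mm_mul_ps(a, b), c);
    }

    // tends to produce the more "linear" usage: here a plays the accumulator role
    static inline __m128 madd_v2(__m128 a, __m128 b, __m128 c) {
        return _mm_add_ps(a, _mm_mul_ps(b, c));
    }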
Moderator
|
Hi, the reason Eigen does not exhibit peak performance like MKL is that we don't do loop peeling (partial unrolling). However, naive loop peeling does not help, because GCC still reuses the same registers for the loads, like:
    xmm0 <- x[i]
    xmm1 <- y[i]
    ...
    y[i] <- xmm1
    xmm0 <- x[i+1]
    xmm1 <- y[i+1]
    ...
    y[i+1] <- xmm1

while we want to generate something like:

    xmm0 <- x[i]
    xmm1 <- y[i]
    xmm2 <- x[i+1]
    xmm3 <- y[i+1]
    ...
    ...
    y[i]   <- xmm1
    y[i+1] <- xmm3

To enforce that one possibility would be to introduce "super packets" containing multiple registers... |
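(A rough guess at what such a "super packet" could look like; the names and API below are mine, not Eigen's. The point is just that one packet object carries several registers, so a peeled loop naturally issues all of its loads before its stores.)

    #include <emmintrin.h>

    struct Packet2d2 {                       // two __m128d, i.e. 4 doubles
        __m128d p0, p1;
    };

    static inline Packet2d2 pload2(const double* a) {
        Packet2d2 r = { _mm_loadu_pd(a), _mm_loadu_pd(a + 2) };
        return r;
    }

    static inline void pstore2(double* a, const Packet2d2& p) {
        _mm_storeu_pd(a,     p.p0);
        _mm_storeu_pd(a + 2, p.p1);
    }

    static inline Packet2d2 padd2(const Packet2d2& a, const Packet2d2& b) {
        Packet2d2 r = { _mm_add_pd(a.p0, b.p0), _mm_add_pd(a.p1, b.p1) };
        return r;
    }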
Registered Member
|
Are you talking about meta loop unrolling or the unrolling the compiler does? With meta unrolling I got what you describe. If I use -funroll-loops on gcc I get loops like this (straight from the asm output):
|
Moderator
|
Yes, I was talking about explicit meta (partial) unrolling. Good to know that GCC is sometimes able to automatically unroll loops in a nice way. In your benchmark, did you also compile Eigen with -funroll-loops?
|
Registered Member
|
Yes, Eigen produces pretty much the same kind of loops, but not always at the same setting. Somehow Eigen often needs -O3 for maximum performance, while my implementation usually hits its best performance at -O2; I haven't found out why. Interestingly, the Intel compiler refuses to unroll the intrinsic loops, so performance ends up generally lower than with gcc in my testing.
How the registers are used is mostly determined by how the intrinsics are nested: add(a, mul(b, c)) produces different results than add(mul(b, c), a)... That's the reason why writing "a = a + 5*b" can sometimes give different performance than "a = 5*b + a". At the moment my symmetric operators always put the "heavy" expression on the left side, since I found that to produce better performance (and the expressions then always have the same tree structure, which makes specializing certain operations easier).
I generally found meta unrolling to give only very small improvements, while compiler unrolling boosts certain operations tremendously (almost to the theoretical maximum). So I pretty much abandoned meta unrolling for these kinds of operations; otherwise one might accidentally combine both and end up with "multi-unrolled" loops that span several pages of code, with severe and unpredictable penalties for jumping across page borders in an inner loop. |
Moderator
|
Yes, the problem is that in general we cannot expect the compiler to do the right thing, especially when we want to support multiple compilers. Actually, even the behavior of GCC varies a lot from one version to the next, and it also varies with some weird changes of context. Moreover, with -funroll-loops GCC is much too aggressive, in the sense that it unrolls many loops that really should not be unrolled!! For instance the tiny loops that deal with the boundaries and cannot be vectorized...
|
Registered Member
|
So I went back to the drawing board and tried to figure out a generic way to explicitly unroll the loops in the desired fashion, and came up with this: along with the expression template "tree" I build a Storage type that is passed by reference through the operations. Then I split the operation that the "operation object" performs into three parts (load, op, store). So, for example, the "primary" expression, i.e. the vector itself, puts an __m128 into the storage type and loads an element in its load operation, while the other two operations are empty. The addition expression just passes the load and store requests through, but calls _mm_add in its op function, etc. (A rough sketch of this decomposition is appended after this post.) In the following cases the expression template ao contains a saxpy (a = a + 0.5*b) operation. Now I can write a "loop template" for the vectorized loop:
With the following expected result:
So if I want to unroll the loop I do this:
and get this:
Let's try another one:
result:
It seems to work nicely. Also, it kind of answers the questions I started the thread with, YAY.
Edit: ok, it doesn't strictly enforce the order of operations; I just had a few examples where gcc's optimizer will "weave" some of the arithmetic and memory operations. But still, the control this method offers over the output is way beyond all my earlier approaches, and the inconsistencies seem to be mostly gone. I just had "a = 4*a + 5*b" run at over 5 flops per cycle with this method. |
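(For readers who want a concrete picture of the load/op/store decomposition described above: the listing below is my own minimal reconstruction of the idea, not japro's actual code; the names, the saxpy example and the unrolling scheme are all simplifications, and double precision SSE2 is assumed.)

    #include <emmintrin.h>
    #include <cstddef>

    // A "primary" expression: the vector itself. Its storage slot is one __m128d.
    struct Vec {
        double* data;
        struct Storage { __m128d r; };
        void    load (Storage& s, std::ptrdiff_t i) const { s.r = _mm_loadu_pd(data + i); }
        __m128d op   (Storage& s) const { return s.r; }
        void    store(Storage& s, std::ptrdiff_t i) const { _mm_storeu_pd(data + i, s.r); }
    };

    // scalar * vector: forwards the load, does its multiply in op, has no store
    struct ScaledVec {
        double a; Vec v;
        struct Storage { Vec::Storage vs; };
        void    load(Storage& s, std::ptrdiff_t i) const { v.load(s.vs, i); }
        __m128d op  (Storage& s) const { return _mm_mul_pd(_mm_set1_pd(a), v.op(s.vs)); }
    };

    // x = x + a*y: the root node; its op writes the result back into x's storage slot
    struct Axpy {
        Vec x; ScaledVec ay;
        struct Storage { Vec::Storage xs; ScaledVec::Storage ys; };
        void load (Storage& s, std::ptrdiff_t i) const { x.load(s.xs, i); ay.load(s.ys, i); }
        void op   (Storage& s) const { s.xs.r = _mm_add_pd(x.op(s.xs), ay.op(s.ys)); }
        void store(Storage& s, std::ptrdiff_t i) const { x.store(s.xs, i); }
    };

    // Loop template: for U unrolled steps, issue all loads first, then all ops,
    // then all stores, each step working on its own storage slot (its own registers).
    // (With a compile-time constant U these small inner loops are typically fully
    // unrolled by the compiler; the original uses meta unrolling instead.)
    template<int U, class Expr>
    void run(Expr e, std::ptrdiff_t n) {
        typename Expr::Storage s[U];
        std::ptrdiff_t i = 0;
        for (; i + 2*U <= n; i += 2*U) {                        // 2 doubles per SSE register
            for (int u = 0; u < U; ++u) e.load (s[u], i + 2*u);
            for (int u = 0; u < U; ++u) e.op   (s[u]);
            for (int u = 0; u < U; ++u) e.store(s[u], i + 2*u);
        }
        // scalar tail for the remaining n - i elements omitted for brevity
    }

Used like this (again, just an illustration):

    double x[1024], y[1024];
    // ... fill x and y ...
    Axpy op = { { x }, { 0.5, { y } } };
    run<4>(op, 1024);

the four storage slots give the compiler four independent sets of registers per unrolled group, so all loads of a group can be issued ahead of the arithmetic and the stores.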
Moderator
|
Nice work! I've also thought about decoupling the loads from the actual computations, but this looked a bit scary to implement in a robust way. In particular, you have to take care not to over-unroll the loop, otherwise things may get worse, as the compiler might be tempted to push/pop some register values and thus kill the performance. This requires not only knowing the number of registers needed for the loads (that's easy), but also estimating the number of required temporary registers, which is more complicated to evaluate as it highly depends on the compiler.
Nevertheless, it is good to see that it does work very well for simple cases. |