Registered Member
|
Edit: aww, **** I just realized that the order change broke the algorithm (it overwrites the accumulator xmm0), but I guess it should be fixable.
Hi, I'm not entirely sure if this is the right place for this, but I don't know where else to ask and I think it's relevant to Eigen. I'm taking a parallel numerical computing course at the moment, and in one of the exercises we were asked to implement expression templates (basic arithmetic + dot product). I then went one step further and made mine vectorize automatically using xmm/emmintrin intrinsics. After some fiddling with the details I compared ddot operations against Eigen3 and MKL (gcc 4.4, "-O2 -msse2 -funroll-loops", on an Intel Core i3-330M):
I was then somewhat irritated that MKL was still faster, considering that the assembly output didn't seem to contain any irrelevant instructions in the inner loop. So I figured it must be the order of the instructions, and I changed the assembly output from this:
to this (note: the following code is wrong, see the edit above):
(only the order of the instructions is changed and xmm1 is used to accumulate). Et voilà:
Now my question is: is there any way to influence how gcc's optimizer orders instructions generated from intrinsics? The other idea I had to automate this was to insert asm("#start label") / asm("#end label") comments into the library and then run an external program on the assembly... Or is it just "let's wait until gcc recognizes this kind of optimization itself"?
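(For what it's worth, the "comment marker" idea is easy to prototype: an asm statement whose template is just an assembler comment shows up verbatim in the .s output, so an external script can locate and rewrite the region in between. A minimal sketch; the macro names are made up, and GCC may still schedule unrelated instructions across the markers:)

    #define REGION_BEGIN(name) asm volatile("# begin " name)
    #define REGION_END(name)   asm volatile("# end " name)

    void scale(double* x, int n) {
        REGION_BEGIN("scale_loop");
        for (int i = 0; i < n; ++i)
            x[i] *= 2.0;              // the generated loop ends up between the two comments
        REGION_END("scale_loop");
    }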
Last edited by japro on Sun Apr 10, 2011 8:11 pm, edited 1 time in total.
|
Registered Member
|
No idea about the GCC question, but your finding is interesting, as we are seeing a similar phenomenon on the first graph of http://eigen.tuxfamily.org/index.php?title=Benchmark where MKL peaks higher than us in a very similar way.
Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list! |
Registered Member
|
Well, the screwup with my algorithm is not as trivial to fix as I thought. Apparently it's not so much the chaining of mul-add that matters; it seems to be the chaining of 8 (!) loads, which conflicts with keeping the accumulator in one of the registers. The algorithm could be fixed by doing, say, 4 loads, 4 madds, and then another 4 and 4, but then it is "slow" again. I can also fix it by loading and storing the accumulator every iteration and still retain some of the speedup, but then it's slightly slower for larger arrays since the additional stores cut into the bandwidth.
EDIT: I found a loop that gives me this performance (you may have to reload if you looked at the old image... I didn't change the name); the inner loop looks like this:
I still don't understand what exactly makes the difference. It seems to depend on which registers are used together, or on accessing them at the longest possible intervals... This version now uses 4 accumulators (xmm0-3) that are added up after the loop. I guess there is no reasonable way to get the compiler to produce this code with templates and intrinsics alone. One could specialize the ddot template for the case where both operands are "primary" expressions and then directly inline this version (a sketch of this kind of loop follows below)... |
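(To make the multiple-accumulator idea concrete, here is a rough sketch in plain intrinsics. This is my own code, not the exact loop from the post; the unrolling factor and the unaligned loads are arbitrary choices.)

    #include <emmintrin.h>

    double ddot4(const double* x, const double* y, long n) {
        // four independent accumulators break the addpd dependency chain
        __m128d acc0 = _mm_setzero_pd(), acc1 = _mm_setzero_pd();
        __m128d acc2 = _mm_setzero_pd(), acc3 = _mm_setzero_pd();
        long i = 0;
        for (; i + 8 <= n; i += 8) {   // 4 packets of 2 doubles per iteration
            acc0 = _mm_add_pd(acc0, _mm_mul_pd(_mm_loadu_pd(x + i    ), _mm_loadu_pd(y + i    )));
            acc1 = _mm_add_pd(acc1, _mm_mul_pd(_mm_loadu_pd(x + i + 2), _mm_loadu_pd(y + i + 2)));
            acc2 = _mm_add_pd(acc2, _mm_mul_pd(_mm_loadu_pd(x + i + 4), _mm_loadu_pd(y + i + 4)));
            acc3 = _mm_add_pd(acc3, _mm_mul_pd(_mm_loadu_pd(x + i + 6), _mm_loadu_pd(y + i + 6)));
        }
        // combine the partial sums only after the loop
        __m128d acc = _mm_add_pd(_mm_add_pd(acc0, acc1), _mm_add_pd(acc2, acc3));
        double tmp[2];
        _mm_storeu_pd(tmp, acc);
        double sum = tmp[0] + tmp[1];
        for (; i < n; ++i) sum += x[i] * y[i];   // scalar tail
        return sum;
    }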
Registered Member
|
For completeness, here is an "extracted" version of the ddot kernel as inline asm. Does anyone know how I can solve the problem of duplicated labels when this gets inlined? I'm not very competent when it comes to GCC's asm syntax.
EDIT: fixed the label issue and also added the fdot version.
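(For anyone hitting the same duplicated-label problem: the usual fix is the assembler's numeric "local" labels, which may be repeated and are referenced as 1b/1f for the nearest matching label backwards/forwards, so they stay valid when the asm block is inlined several times. Alternatively, %= expands to a number that is unique per asm statement and can be appended to a named label. A minimal, made-up example:)

    static inline void count_down(long n) {   // assumes n > 0
        asm volatile(
            "1:               \n\t"   // numeric local label instead of a fixed name
            "   sub $1, %[n]  \n\t"
            "   jnz 1b        \n\t"   // jump back to the nearest '1:' above
            : [n] "+r"(n)
            :
            : "cc");
    }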
|
Registered Member
|
I'm just continuing my monologue. I hope that's ok.
So the operation there is saxpy, right? It doesn't say single precision on the benchmark page, but considering they peak around 4 operations per cycle, that seems most plausible (an SSE register holds twice as many floats as doubles, which also matches the double/single peaks below). I again tested my generic vectorization as well as a hand-optimized one, saxpy: and daxpy: Eigen and MKL show curves similar to those on the benchmark page in the wiki. The hand-optimized versions almost hit the theoretical maximum (1.95 ops/cycle double precision, 3.9 ops/cycle single precision). Interestingly, my generic vectorization does better than Eigen in this case. Looking at the assembly of Eigen's vectorization and mine shows an interesting difference: the output looks pretty much the same except for the order in which the registers are used. With loop unrolling, Eigen tends to produce "linear" register usage like this:
my vectorizer often produces "cyclic" usage:
In this case that isn't very relevant, but in cases where other instructions get interleaved, the cyclic versions seem to support pipelining better, since there is less repeated access to the same registers in consecutive instructions. The ordering seems to be partly determined by the nesting of the intrinsics: I found that defining "madd" as _mm_add_ps(_mm_mul_ps(a, b), c) produces cyclic versions, while _mm_add_ps(a, _mm_mul_ps(b, c)) gives me more linear ones (see the sketch after this post). I hope this helps someone. Also, please share your own experience with this kind of thing; I think it's very interesting, but there isn't much well-documented information around (or at least I'm unable to find it).
EDIT: Actually, that's even observable by just changing the order of instructions in the source code. Using -O3 -funroll-loops I get consistently higher performance with Eigen if I write
instead of
|
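(For reference, the two "madd" variants mentioned in the previous post, written out as compilable helpers. The names are mine, and which register-allocation pattern you actually get will of course depend on the compiler version and flags; note that the argument roles differ between the two definitions.)

    #include <xmmintrin.h>

    // tends to produce the "cyclic" register usage: the multiply result feeds
    // the first operand of the add
    static inline __m128 madd_v1(__m128 a, __m128 b, __m128 c) {
        return _mm_add_ps(_mm_mul_ps(a, b), c);
    }

    // tends to produce the more "linear" usage: here a plays the accumulator role
    static inline __m128 madd_v2(__m128 a, __m128 b, __m128 c) {
        return _mm_add_ps(a, _mm_mul_ps(b, c));
    }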
Moderator
|
Hi, the reason Eigen does not exhibit peak performance like MKL is that we don't do loop peeling (partial unrolling). However, naive loop peeling does not help, because GCC still reuses the same registers for the loads, like:
    xmm0 <- x[i]
    xmm1 <- y[i]
    ...
    y[i] <- xmm1
    xmm0 <- x[i+1]
    xmm1 <- y[i+1]
    ...
    y[i+1] <- xmm1

while we want to generate something like:

    xmm0 <- x[i]
    xmm1 <- y[i]
    xmm2 <- x[i+1]
    xmm3 <- y[i+1]
    ...
    ...
    y[i]   <- xmm1
    y[i+1] <- xmm3

To enforce that one possibility would be to introduce "super packets" containing multiple registers... |
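(A rough guess at what such a "super packet" could look like; the names and API below are mine, not Eigen's. The point is just that one packet object carries several registers, so a peeled loop naturally issues all of its loads before its stores.)

    #include <emmintrin.h>

    struct Packet2d2 {                       // two __m128d, i.e. 4 doubles
        __m128d p0, p1;
    };

    static inline Packet2d2 pload2(const double* a) {
        Packet2d2 r = { _mm_loadu_pd(a), _mm_loadu_pd(a + 2) };
        return r;
    }

    static inline void pstore2(double* a, const Packet2d2& p) {
        _mm_storeu_pd(a,     p.p0);
        _mm_storeu_pd(a + 2, p.p1);
    }

    static inline Packet2d2 padd2(const Packet2d2& a, const Packet2d2& b) {
        Packet2d2 r = { _mm_add_pd(a.p0, b.p0), _mm_add_pd(a.p1, b.p1) };
        return r;
    }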
Registered Member
|
Are you talking about meta loop unrolling or the unrolling the compiler does? With meta unrolling I got what you describe. If I use -funroll-loops on gcc I get loops like this (straight from the asm output):
|
Moderator
|
Yes, I was talking about explicit meta (partial) unrolling. Good to know that GCC is sometimes able to automatically unroll loops in a nice way. In your benchmark, did you also compile Eigen with -funroll-loops?
|
Registered Member
|
Yes, Eigen produces pretty much the same kind of loops, but not always at the same setting. Somehow Eigen often needs -O3 for maximum performance, while my implementation usually hits its best performance at -O2; I haven't found out why. Interestingly, the Intel compiler refuses to unroll the intrinsic loops, so performance ends up generally lower than with gcc in my testing.
How the registers are used is mostly determined by how the intrinsics are nested: add(a, mul(b, c)) produces different results than add(mul(b, c), a)... That's the reason why writing "a = a + 5*b" can sometimes give different performance than "a = 5*b + a". At the moment my symmetric operators always put the "heavy" expression on the left side, since I found that to produce better performance (and the expressions then always have the same tree structure, which makes specializing certain operations easier).
I generally found meta unrolling to give only very small improvements, while compiler unrolling boosts certain operations tremendously (almost to the theoretical maximum). So I pretty much abandoned meta unrolling for these kinds of operations; otherwise one might accidentally combine both and end up with "multi-unrolled" loops that span several pages of code, with severe and unpredictable penalties for jumping across page borders in an inner loop. |
Moderator
|
Yes, the problem is that in general we cannot expect the compiler to do the right thing, especially when we want to support multiple compilers. Actually, even the behavior of GCC varies a lot from one version to the next, and it also varies with some weird changes of context. Moreover, with -funroll-loops GCC is much too aggressive, in the sense that it unrolls many loops that really should not be unrolled!! For instance the tiny loops that deal with the boundaries and cannot be vectorized...
|
Registered Member
|
So I went back to the drawing board and tried to figure out a generic way to explicitly unroll the loops in the desired fashion, and came up with this: along with the expression template "tree" I build a Storage type that is passed by reference through the operations. Then I split the operation that the "operation object" performs into three parts (load, op, store). So, for example, the "primary" expression, i.e. the vector itself, puts an __m128 into the storage type and loads an element in its load operation, while the other two operations are empty. The addition expression just passes the load and store requests through, but calls _mm_add in its op function, etc. (A rough sketch of this decomposition is appended after this post.) In the following cases the expression template ao contains a saxpy (a = a + 0.5*b) operation. Now I can write a "loop template" for the vectorized loop:
With the following expected result:
So if I want to unroll the loop I do this:
and get this:
Let's try another one:
result:
It seems to work nicely. Also, it kind of answers the questions I started the thread with, YAY.
Edit: ok, it doesn't strictly enforce the order of operations; I just had a few examples where gcc's optimizer will "weave" some of the arithmetic and memory operations. But still, the control this method offers over the output is way beyond all my earlier approaches, and the inconsistencies seem to be mostly gone. I just had "a = 4*a + 5*b" run at over 5 flops per cycle with this method. |
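(For readers who want a concrete picture of the load/op/store decomposition described above: the listing below is my own minimal reconstruction of the idea, not japro's actual code; the names, the saxpy example and the unrolling scheme are all simplifications, and double precision SSE2 is assumed.)

    #include <emmintrin.h>
    #include <cstddef>

    // A "primary" expression: the vector itself. Its storage slot is one __m128d.
    struct Vec {
        double* data;
        struct Storage { __m128d r; };
        void    load (Storage& s, std::ptrdiff_t i) const { s.r = _mm_loadu_pd(data + i); }
        __m128d op   (Storage& s) const { return s.r; }
        void    store(Storage& s, std::ptrdiff_t i) const { _mm_storeu_pd(data + i, s.r); }
    };

    // scalar * vector: forwards the load, does its multiply in op, has no store
    struct ScaledVec {
        double a; Vec v;
        struct Storage { Vec::Storage vs; };
        void    load(Storage& s, std::ptrdiff_t i) const { v.load(s.vs, i); }
        __m128d op  (Storage& s) const { return _mm_mul_pd(_mm_set1_pd(a), v.op(s.vs)); }
    };

    // x = x + a*y: the root node; its op writes the result back into x's storage slot
    struct Axpy {
        Vec x; ScaledVec ay;
        struct Storage { Vec::Storage xs; ScaledVec::Storage ys; };
        void load (Storage& s, std::ptrdiff_t i) const { x.load(s.xs, i); ay.load(s.ys, i); }
        void op   (Storage& s) const { s.xs.r = _mm_add_pd(x.op(s.xs), ay.op(s.ys)); }
        void store(Storage& s, std::ptrdiff_t i) const { x.store(s.xs, i); }
    };

    // Loop template: for U unrolled steps, issue all loads first, then all ops,
    // then all stores, each step working on its own storage slot (its own registers).
    // (With a compile-time constant U these small inner loops are typically fully
    // unrolled by the compiler; the original uses meta unrolling instead.)
    template<int U, class Expr>
    void run(Expr e, std::ptrdiff_t n) {
        typename Expr::Storage s[U];
        std::ptrdiff_t i = 0;
        for (; i + 2*U <= n; i += 2*U) {                        // 2 doubles per SSE register
            for (int u = 0; u < U; ++u) e.load (s[u], i + 2*u);
            for (int u = 0; u < U; ++u) e.op   (s[u]);
            for (int u = 0; u < U; ++u) e.store(s[u], i + 2*u);
        }
        // scalar tail for the remaining n - i elements omitted for brevity
    }

Used like this (again, just an illustration):

    double x[1024], y[1024];
    // ... fill x and y ...
    Axpy op = { { x }, { 0.5, { y } } };
    run<4>(op, 1024);

the four storage slots give the compiler four independent sets of registers per unrolled group, so all loads of a group can be issued ahead of the arithmetic and the stores.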
Moderator
|
Nice work! I've also thought about decoupling the loads from the actual computations, but this looked a bit scary to implement in a robust way. In particular, you have to take care not to over-unroll the loop, otherwise things may get worse, as the compiler might be tempted to push/pop some register values and thus kill the performance. This requires not only knowing the number of registers needed for the loads (that's easy), but also estimating the number of required temporary registers, which is more complicated to evaluate as it highly depends on the compiler.
Nevertheless, it is good to see that it does work very well for simple cases. |