
using two __m128 together when vectorizing complex<float>

inter1965
Registered Member
Hi, Eigen developers, thanks so much for bringing such a wonderful expression-template library to the world.
I have run into a problem while implementing vectorized functions targeting std::complex<float> (such as abs2, std::abs, and std::polar). For those functions I would like to use two __m128 registers together, since that makes the code somewhat easier to write.
My question is: can I use an alternative packet size that combines two __m128, or do I have to define my own scalar type and the associated traits?
Thank you in advance; I appreciate your help immensely.
ggael
Moderator
I'm not sure I understand your goal. Ideally, you would like to define a new packet type for std::complex that is twice as large, right? How is that more convenient? What's wrong with the current vectorization of std::complex?

Anyway, this is currently not possible, and to do so you would indeed have to define a new complex type (e.g., simply inherit from std::complex).
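For instance, a minimal sketch of such a wrapper type (illustrative only; the NumTraits and packet-traits specializations Eigen would also need are omitted):
Code:
#include <complex>

// Hypothetical wrapper: a type distinct from std::complex<T>, so that Eigen
// traits could be specialized for it without touching std::complex itself.
template <typename T>
struct MyComplex : std::complex<T> {
  using std::complex<T>::complex;  // inherit the std::complex constructors
};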
inter1965
Registered Member
OK, let the code speak for itself.
Suppose I want a function norm that performs a kind of normalization of a complex number; for example,
norm(complex<float>(3, 4)) returns complex<float>(0.6, 0.8);
norm(complex<float>(-6, -8)) returns complex<float>(-0.6, -0.8);
and it handles the special case complex<float>(0, 0) by returning complex<float>(1, 0).
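In scalar form, the intended behaviour is roughly the following (a minimal reference sketch):
Code:
#include <cmath>
#include <complex>
#include <limits>

// Scalar reference sketch of the norm described above:
// returns z/|z|, or 1+0i when |z| is (near) zero.
inline std::complex<float> norm_ref(const std::complex<float>& z) {
  const float n = std::sqrt(z.real() * z.real() + z.imag() * z.imag());
  if (n <= std::numeric_limits<float>::epsilon())
    return std::complex<float>(1.0f, 0.0f);
  return std::complex<float>(z.real() / n, z.imag() / n);
}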

Now I have the following naive packed implementations. pnorm1 fits the current packet size that Eigen3 uses, whereas pnorm2 takes two __m128 as input and processes them together, which means it may be more efficient than pnorm1.
My question is: how could I fit pnorm2 into Eigen3?
Code:
#include <emmintrin.h>  // SSE2 intrinsics
#include <limits>

// Normalize two complex<float> values packed as {re0, im0, re1, im1}.
inline __m128 pnorm1(const __m128 v) {
  const __m128 epsilon(_mm_set1_ps(std::numeric_limits<float>::epsilon()));
  const __m128 unit(_mm_set_ps(0, 1, 0, 1));  // {1+0i, 1+0i}
  // swap real/imag within each complex: {im0, re0, im1, re1}
  const __m128 c(_mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(v), _MM_SHUFFLE(2, 3, 0, 1))));
  // n holds |z| duplicated in both lanes of each complex
  const __m128 n(_mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(v, v), _mm_mul_ps(c, c))));
  const __m128 d(_mm_div_ps(v, n));

  const __m128 mz(_mm_cmplt_ps(epsilon, n));  // mask: |z| > epsilon

  const __m128 rd(_mm_and_ps(mz, d));        // z/|z| where |z| > epsilon
  const __m128 ra(_mm_andnot_ps(mz, unit));  // 1+0i otherwise
  const __m128 r(_mm_add_ps(rd, ra));
  return r;
}

// Same normalization on four complex<float> values held in two __m128
// registers; only one sqrt is needed for the four magnitudes.
inline void pnorm2(__m128 &u, __m128 &v) {
  const __m128 epsilon(_mm_set1_ps(std::numeric_limits<float>::epsilon()));
  const __m128 unit(_mm_set1_ps(1));
  const __m128 zero(_mm_setzero_ps());
  // de-interleave: r = {re0, re1, re2, re3}, i = {im0, im1, im2, im3}
  const __m128 r(_mm_shuffle_ps(u, v, _MM_SHUFFLE(2, 0, 2, 0)));
  const __m128 i(_mm_shuffle_ps(u, v, _MM_SHUFFLE(3, 1, 3, 1)));
  const __m128 n(_mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(r, r), _mm_mul_ps(i, i))));
  const __m128 mz(_mm_cmplt_ps(epsilon, n));  // mask: |z| > epsilon

  const __m128 dr(_mm_div_ps(r, n));
  const __m128 di(_mm_div_ps(i, n));

  const __m128 rra(_mm_and_ps(mz, dr));
  const __m128 ria(_mm_and_ps(mz, di));

  const __m128 rrb(_mm_andnot_ps(mz, unit));  // real part falls back to 1
  const __m128 rib(_mm_andnot_ps(mz, zero));  // imaginary part falls back to 0

  const __m128 rrc(_mm_add_ps(rra, rrb));
  const __m128 ric(_mm_add_ps(ria, rib));

  // re-interleave back into the {re, im, re, im} layout
  u = _mm_shuffle_ps(rrc, ric, _MM_SHUFFLE(1, 0, 1, 0));
  v = _mm_shuffle_ps(rrc, ric, _MM_SHUFFLE(3, 2, 3, 2));
  u = _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(u), _MM_SHUFFLE(3, 1, 2, 0)));
  v = _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(v), _MM_SHUFFLE(3, 1, 2, 0)));
}
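
For illustration, pnorm2 could be driven like this (a usage sketch, assuming four complex<float> values stored contiguously as {re, im} pairs):
Code:
#include <complex>

// Normalize four std::complex<float> stored contiguously, using pnorm2 above.
void normalize4(std::complex<float>* z) {
  float* f = reinterpret_cast<float*>(z);
  __m128 u = _mm_loadu_ps(f);      // z[0], z[1]
  __m128 v = _mm_loadu_ps(f + 4);  // z[2], z[3]
  pnorm2(u, v);
  _mm_storeu_ps(f, u);
  _mm_storeu_ps(f + 4, v);
}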


PS: If you think norm is not that expensive, consider std::polar, whose vectorized version would rely on _mm_sincos_ps, which is really expensive.
inter1965
Registered Member
Another example:
For a 3D vector {x0, y0, z0}, if the packet size is limited to 4, we have to fit it into one packet and the last lane is wasted.
However, if we could increase the packet size to 12, then 4 vectors {x0, y0, z0}, {x1, y1, z1}, {x2, y2, z2}, {x3, y3, z3} could be shuffled into 3 packets {x0, x1, x2, x3}, {y0, y1, y2, y3}, {z0, z1, z2, z3} so that every lane is occupied. The latter layout is more efficient than the former.
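The kind of shuffling this would allow looks roughly like this (a standard SSE de-interleaving sequence, shown only to illustrate the layout; it is not Eigen code):
Code:
#include <xmmintrin.h>

// De-interleave 4 packed 3D float vectors (array-of-structs) into 3 packets.
void aos_to_soa(const float* p, __m128& x, __m128& y, __m128& z) {
  const __m128 a = _mm_loadu_ps(p + 0);  // x0 y0 z0 x1
  const __m128 b = _mm_loadu_ps(p + 4);  // y1 z1 x2 y2
  const __m128 c = _mm_loadu_ps(p + 8);  // z2 x3 y3 z3

  const __m128 x2y2x3y3 = _mm_shuffle_ps(b, c, _MM_SHUFFLE(2, 1, 3, 2));
  const __m128 y0z0y1z1 = _mm_shuffle_ps(a, b, _MM_SHUFFLE(1, 0, 2, 1));

  x = _mm_shuffle_ps(a, x2y2x3y3, _MM_SHUFFLE(2, 0, 3, 0));         // x0 x1 x2 x3
  y = _mm_shuffle_ps(y0z0y1z1, x2y2x3y3, _MM_SHUFFLE(3, 1, 2, 0));  // y0 y1 y2 y3
  z = _mm_shuffle_ps(y0z0y1z1, c, _MM_SHUFFLE(3, 0, 3, 1));         // z0 z1 z2 z3
}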
ggael
Moderator
OK, I see: by processing 4 complexes at once you can save one sqrt. However, it is rather complicated to make Eigen exploit that. In the near future we plan to extend our vectorization engine to be able to use packets of various sizes for the same scalar type. This is very useful to fully exploit AVX and NEON, but it should also make it easier to implement your proposal. Typically, what you propose would additionally require an automatic mechanism for processing larger packets and an extended cost model to automatically determine the best packet size for a given expression.

For instance, in Eigen one would write (a+b).array().norm(). In order to make use of your pnorm2, we would have to evaluate a+b per set of 2 packets, or automatically generate a kind of padd for a "virtual packet" containing 2 "machine packets". The same goes for load, store, and all other trivial ops...
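Conceptually, such a virtual packet could look like this (an illustrative sketch only, not actual Eigen code; the names merely mimic Eigen's packet primitives):
Code:
#include <xmmintrin.h>

// A "virtual packet" of 8 floats made of 2 machine packets.
struct Packet8f { __m128 lo, hi; };

// Each primitive simply forwards to the machine packet on both halves.
inline Packet8f padd(const Packet8f& a, const Packet8f& b) {
  return Packet8f{ _mm_add_ps(a.lo, b.lo), _mm_add_ps(a.hi, b.hi) };
}
inline Packet8f pload(const float* from) {
  return Packet8f{ _mm_load_ps(from), _mm_load_ps(from + 4) };
}
inline void pstore(float* to, const Packet8f& a) {
  _mm_store_ps(to, a.lo);
  _mm_store_ps(to + 4, a.hi);
}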

This is just to give you an idea of the complexity...
inter1965
Registered Member
Thank you.
Will we see this new vectorization engine in Eigen 4?
ggael
Moderator
In 3.1, or more likely 3.2.
inter1965
Registered Member
That would be great!
I could start writing my own AVX functions now.

