Registered Member
|
Hi Eigen developers, thank you so much for bringing such a wonderful expression template library to the world.
I have run into a problem while implementing vectorized functions for std::complex (such as abs2, std::abs, and std::polar). For these functions I would like to process two __m128 registers together, since that makes the code somewhat easier to write. My question is: can we use an alternative packet size that combines two __m128, or do I have to define my own Scalar type and the associated traits? Thank you in advance; I appreciate your help immensely. |
Moderator
|
I'm not sure I understand your goal. Ideally, you would like to define a new packet type for std::complex that would be twice as large, right? How is that more convenient? What's wrong with the current vectorization of std::complex?
Anyway, this is currently not possible, and to do so you would indeed have to define a new complex type (e.g., simply inherit from std::complex). |
Registered Member
|
OK, let the code speak for itself.
Suppose I want a function norm which does some kind of normalization on a complex number; for example,
Now I have the following naive packed implementations. pnorm1 fits the packet size that Eigen3 currently uses, whereas pnorm2 takes two __m128 as input and processes them together, which means it may be more efficient than pnorm1. So my question is: how could I fit pnorm2 into Eigen3?
PS: If you think norm is not that expensive, please consider std::polar, whose vectorized version would rely on _mm_sincos_ps, which is really expensive. |
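The code from the post above is not reproduced in this thread, so here is only a rough sketch of the idea, not the poster's implementation. It assumes that "norm" means z / |z|, and it models a __m128 with a plain array of four floats (the hypothetical `Packet4f`) so that it compiles anywhere; `psqrt` and the call counter are likewise made up for illustration. The point is just that pnorm1 spends a packed sqrt on two distinct values, while pnorm2 fills all four lanes.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical stand-in for __m128: four floats handled as one unit.
using Packet4f = std::array<float, 4>;

// Counts packed square roots; each call models one _mm_sqrt_ps.
static int g_packed_sqrt_calls = 0;

Packet4f psqrt(const Packet4f& a) {
    ++g_packed_sqrt_calls;
    Packet4f r{};
    for (int i = 0; i < 4; ++i) r[i] = std::sqrt(a[i]);
    return r;
}

// One packet holds 2 interleaved complexes {re0, im0, re1, im1}.
// Assumed meaning of "norm": z / |z| (undefined for z == 0).
Packet4f pnorm1(const Packet4f& z) {
    const float m0 = z[0] * z[0] + z[1] * z[1];
    const float m1 = z[2] * z[2] + z[3] * z[3];
    // Only two of the four sqrt lanes carry distinct values.
    const Packet4f mag = psqrt(Packet4f{m0, m0, m1, m1});
    Packet4f r{};
    for (int i = 0; i < 4; ++i) r[i] = z[i] / mag[i];
    return r;
}

// Two packets hold 4 interleaved complexes; results are written in place.
void pnorm2(Packet4f& a, Packet4f& b) {
    // Deinterleave into real and imaginary parts (shuffles in real SSE code).
    const Packet4f re{a[0], a[2], b[0], b[2]};
    const Packet4f im{a[1], a[3], b[1], b[3]};
    Packet4f m2{};
    for (int i = 0; i < 4; ++i) m2[i] = re[i] * re[i] + im[i] * im[i];
    const Packet4f mag = psqrt(m2);  // all four lanes are useful here
    // Divide and re-interleave back into the original layout.
    a = Packet4f{re[0] / mag[0], im[0] / mag[0], re[1] / mag[1], im[1] / mag[1]};
    b = Packet4f{re[2] / mag[2], im[2] / mag[2], re[3] / mag[3], im[3] / mag[3]};
}
```

Under these assumptions, normalizing four complexes costs two packed square roots via pnorm1 and one via pnorm2, at the price of the extra deinterleave and re-interleave steps.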
Registered Member
|
Another example,
Take a 3D vector {x0, y0, z0}. If the packet size is limited to 4, we have to fit the vector into one packet, and the last lane is wasted. However, if we could increase the packet size to 12, then 4 vectors {x0, y0, z0}, {x1, y1, z1}, {x2, y2, z2}, {x3, y3, z3} could be shuffled into 3 packets {x0, x1, x2, x3}, {y0, y1, y2, y3}, {z0, z1, z2, z3} so that every lane is occupied. The latter layout is more efficient than the former. |
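As a minimal sketch of the shuffle described above (not anything Eigen provides), four 3D vectors can be regrouped into three 4-wide packets, after which an operation such as a dot product touches only full packets. `Packet4f`, `Vec3`, `Vec3x4`, `transpose4` and `dot4` are hypothetical names, and the packet is again modeled as an array of four floats rather than a real __m128.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for __m128.
using Packet4f = std::array<float, 4>;

struct Vec3 { float x, y, z; };

// Four 3D vectors stored component-wise: 12 floats, no wasted lane.
struct Vec3x4 { Packet4f x, y, z; };

// Regroup {x0,y0,z0} ... {x3,y3,z3} into {x0..x3}, {y0..y3}, {z0..z3}.
Vec3x4 transpose4(const std::array<Vec3, 4>& v) {
    Vec3x4 r{};
    for (int i = 0; i < 4; ++i) {
        r.x[i] = v[i].x;
        r.y[i] = v[i].y;
        r.z[i] = v[i].z;
    }
    return r;
}

// Four dot products at once; every lane of every packet does useful work.
Packet4f dot4(const Vec3x4& a, const Vec3x4& b) {
    Packet4f r{};
    for (int i = 0; i < 4; ++i)
        r[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i] + a.z[i] * b.z[i];
    return r;
}
```

With one vector per packet, the same four dot products would use four packets with one idle lane each, plus a horizontal reduction per vector.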
Moderator
|
OK, I see: by using 4 complexes at once you can save one sqrt. However, it is rather complicated to make Eigen exploit that. In the near future we plan to extend our vectorization engine to be able to use packets of various sizes for the same scalar type. This is very useful to fully exploit AVX and NEON, and it should also make it easier to implement your proposal. What you propose would additionally require an automatic mechanism to process larger packets, and an extension of the cost model to automatically determine the best packet size for a given expression.
For instance, in Eigen one would write (a+b).array().norm(). In order to make use of your pnorm2, we would have to evaluate a+b per set of 2 packets, or automatically generate a kind of padd for a "virtual packet" containing 2 "machine packets". The same goes for load and store, and all the other trivial ops... This is just to give you an idea of the complexity. |
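To make the "virtual packet" idea a bit more concrete, here is a small sketch of what generating the trivial ops could look like. It is not Eigen code: `Packet4f`, `VirtualPacket`, `pload_virtual` and the free functions below are hypothetical stand-ins, with the machine packet modeled as four plain floats. Each op on the virtual packet simply forwards to the same op on each machine packet it contains.

```cpp
#include <cassert>

// Hypothetical machine packet (stand-in for __m128).
struct Packet4f { float v[4]; };

inline Packet4f pload(const float* src) {
    Packet4f p;
    for (int i = 0; i < 4; ++i) p.v[i] = src[i];
    return p;
}

inline void pstore(float* dst, const Packet4f& p) {
    for (int i = 0; i < 4; ++i) dst[i] = p.v[i];
}

inline Packet4f padd(const Packet4f& a, const Packet4f& b) {
    Packet4f r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}

// A virtual packet made of N machine packets.
template <int N>
struct VirtualPacket { Packet4f part[N]; };

// Each trivial op is generated by applying the machine op part by part.
template <int N>
VirtualPacket<N> pload_virtual(const float* src) {
    VirtualPacket<N> r;
    for (int i = 0; i < N; ++i) r.part[i] = pload(src + 4 * i);
    return r;
}

template <int N>
void pstore(float* dst, const VirtualPacket<N>& p) {
    for (int i = 0; i < N; ++i) pstore(dst + 4 * i, p.part[i]);
}

template <int N>
VirtualPacket<N> padd(const VirtualPacket<N>& a, const VirtualPacket<N>& b) {
    VirtualPacket<N> r;
    for (int i = 0; i < N; ++i) r.part[i] = padd(a.part[i], b.part[i]);
    return r;
}
```

A two-packet function like pnorm2 would then take a `VirtualPacket<2>` as input. The forwarding itself is mechanical; deciding when the larger packet actually pays off is the cost-model question raised above.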
Registered Member
|
Thank you.
Will we see this new vectorization engine in Eigen 4? |
Moderator
|
In 3.1 or, more likely, 3.2.
|
Registered Member
|
That would be great!
I can start writing my own AVX functions now. |