This forum has been archived. All content is frozen. Please use KDE Discuss instead.

Vectorization Precision Problem

Tags: None
(comma "," separated)
L3Bee
Registered Member
Posts
6
Karma
0

Vectorization Precision Problem

Wed Jul 11, 2012 8:34 pm
I'm working on a project where I have a vector of sorted doubles, and I need to take the exp() of each. Since the data was already in an Eigen Vector, I thought I'd just use the provided methods. Unfortunately, the double versions aren't vectorized using SSE (or not in the 3.1.0 I'm using), so I thought I'd write my own using the Cephes library and the float version as a guide. It worked, kind of. The difference between this and the exp() from math.h is small most of the time, but for some values it's an absolute difference of 262144, which is enough to result in NANs farther in the code. The largest errors are from the largest input values. For example, the error of 262144 is from an input value of 57.7482317672092...

Anyway, I was hoping that someone with more experience that I could point out my mistake. This is my first attempt at doing anything with SSE intrinsics, so it's likely I missed something I should have changed between float and double.

Here's the code, from Eigen/src/Core/arch/SSE/MathFunctions.h
Code: Select all
Packet2d polylvl(const Packet2d& x, double* p, int N)
{
   Packet2d ans = pset1<Packet2d>(*p++);
   do
   {
      ans = pmadd(ans,x,pset1<Packet2d>(*p++));
   } while(--N);
   return ans;
}

template<> EIGEN_DEFINE_FUNCTION_ALLOWING_MULTIPLE_DEFINITIONS EIGEN_UNUSED
Packet2d pexp<Packet2d>(const Packet2d& _x)
{
  Packet2d x = _x;
 
  _EIGEN_DECLARE_CONST_Packet2d(1 , 1.0);
  _EIGEN_DECLARE_CONST_Packet2d(half, 0.5);
  _EIGEN_DECLARE_CONST_Packet4i(0x7f, 0x7f);

  _EIGEN_DECLARE_CONST_Packet2d(exp_hi,  88.029691931113054295988);
  _EIGEN_DECLARE_CONST_Packet2d(exp_lo, -88.72283911167299960540);

  _EIGEN_DECLARE_CONST_Packet2d(cephes_LOG2EF, 1.4426950408889634073599);

  static double P[] = {
    1.26177193074810590878e-4,
    3.02994407707441961300e-2,
    9.99999999999999999910e-1};
  static double Q[] = {
    3.00198505138664455042e-6,
    2.52448340349684104192e-3,
    2.27265548208155028766e-1,
    2.00000000000000000009e0};

  _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_C1, 0.693145751953125);
  _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_C2, 1.42860682030941723212e-6);

  Packet2d tmp = _mm_setzero_pd(), fx;

  // clamp x
  x = pmax(pmin(x, p2d_exp_hi), p2d_exp_lo);
  /* express exp(x) as exp(g + n*log(2)) */
  fx = pmadd(p2d_cephes_LOG2EF, x, p2d_half);
  /* how to perform a floorf with SSE: just below */
  Packet4i emm0 = _mm_cvttpd_epi32(fx);
  tmp  = _mm_cvtepi32_pd(emm0);
  /* if greater, substract 1 */
  Packet2d mask = _mm_cmpgt_pd(tmp, fx);
  mask = _mm_and_pd(mask, p2d_1);
  fx = psub(tmp, mask);

  tmp = pmul(fx, p2d_cephes_exp_C1);
  Packet2d z = pmul(fx, p2d_cephes_exp_C2), xx;
  x = psub(x, tmp);
  x = psub(x, z);

  xx = pmul(x,x);
  z = pmul(x,polylvl(xx,P,2));
  x = pdiv(z,psub(polylvl(xx,Q,3),z));
  x = pmadd(pset1<Packet2d>(2.0),x,p2d_1);
  // build 2^n
  emm0 = _mm_cvttpd_epi32(fx);
  emm0 = _mm_add_epi32(emm0, p4i_0x7f);
  emm0 = _mm_slli_epi32(emm0, 23);
  return pmul(x, _mm_cvtps_pd(_mm_castsi128_ps(emm0)));
}

Then, in PacketMath.h in the same folder, I added the line HasExp = 1, to the packet_traits<double> struct.

For reference, the Cephes library is http://www.netlib.org/cephes/
Edit: I should add that this is being run on a Xeon 5472 on Win XP x86 and is compiled with VS2010 on Ox and with SSE2 enabled. I'm pretty sure this is a logic error, but if not, here you go. If you need any other information, just ask.
User avatar
ggael
Moderator
Posts
3447
Karma
19
OS

Re: Vectorization Precision Problem

Thu Jul 12, 2012 6:57 pm
Hi,
one problem are the hi and lo bounds that should be adjusted for double.
L3Bee
Registered Member
Posts
6
Karma
0

Re: Vectorization Precision Problem

Thu Jul 12, 2012 7:26 pm
Well, I changed those to the +-708 limits from the cephes file (is that what you meant?), but of course, that doesn't help the problem. Still getting the same difference.
User avatar
ggael
Moderator
Posts
3447
Karma
19
OS

Re: Vectorization Precision Problem

Thu Jul 12, 2012 9:46 pm
Indeed, the other issue is when computing 2^fx, you really must do it for double, not float. The following works here but we of course need a faster solution:

double dat[2];
pstore(dat, fx);
dat[0] = std::pow(2,dat[0]);
dat[1] = std::pow(2,dat[1]);
return pmul(x,pload<Packet2d>(dat));
L3Bee
Registered Member
Posts
6
Karma
0

Re: Vectorization Precision Problem

Thu Jul 12, 2012 9:54 pm
Fast is good. I'll take your word that it's faster, at least until I get the accuracy right and can properly test it with my data.

Any idea what's causing the errors? The only thing I can think of is not having precise enough constants. I copied them straight from the Cephes files, but maybe there's a conversion somewhere in the code.
User avatar
ggael
Moderator
Posts
3447
Karma
19
OS

Re: Vectorization Precision Problem

Fri Jul 13, 2012 6:13 am
hm, to be clear the piece of code I pasted is a replacement for the end of the pexp routine:

// build 2^n
emm0 = _mm_cvttpd_epi32(fx);
emm0 = _mm_add_epi32(emm0, p4i_0x7f);
emm0 = _mm_slli_epi32(emm0, 23);
return pmul(x, _mm_cvtps_pd(_mm_castsi128_ps(emm0)));

which overflow for inputs greater than ~88. I have no clue whether this version is still faster than a sequential code, it was just to show the issue.
User avatar
ggael
Moderator
Posts
3447
Karma
19
OS

Re: Vectorization Precision Problem

Fri Jul 13, 2012 6:16 am
For the record here is a working version up to inputs between -88, 88:

Code: Select all
template<> EIGEN_DEFINE_FUNCTION_ALLOWING_MULTIPLE_DEFINITIONS EIGEN_UNUSED
Packet2d pexp<Packet2d>(const Packet2d& _x)
{
  Packet2d x = _x;

  _EIGEN_DECLARE_CONST_Packet2d(1 , 1.0);
  _EIGEN_DECLARE_CONST_Packet2d(2 , 2.0);
  _EIGEN_DECLARE_CONST_Packet2d(half, 0.5);
  _EIGEN_DECLARE_CONST_Packet4i(0x7f, 0x7f);

  _EIGEN_DECLARE_CONST_Packet2d(cephes_LOG2EF, 1.4426950408889634073599);

  _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_p0, 1.26177193074810590878e-4);
  _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_p1, 3.02994407707441961300e-2);
  _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_p2, 9.99999999999999999910e-1);

  _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_q0, 3.00198505138664455042e-6);
  _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_q1, 2.52448340349684104192e-3);
  _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_q2, 2.27265548208155028766e-1);
  _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_q3, 2.00000000000000000009e0);

  _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_C1, 0.693145751953125);
  _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_C2, 1.42860682030941723212e-6);

  Packet2d tmp = _mm_setzero_pd(), fx;
  /* express exp(x) as exp(g + n*log(2)) */
  fx = pmadd(p2d_cephes_LOG2EF, x, p2d_half);
  /* how to perform a floorf with SSE: just below */
  Packet4i emm0 = _mm_cvttpd_epi32(fx);
  tmp  = _mm_cvtepi32_pd(emm0);
  /* if greater, substract 1 */
  Packet2d mask = _mm_cmpgt_pd(tmp, fx);
  mask = _mm_and_pd(mask, p2d_1);
  fx = psub(tmp, mask);

  tmp = pmul(fx, p2d_cephes_exp_C1);
  Packet2d z = pmul(fx, p2d_cephes_exp_C2);
  x = psub(x, tmp);
  x = psub(x, z);

  Packet2d x2 = pmul(x,x);

  Packet2d px = p2d_cephes_exp_p0;
  px = pmadd(px, x2, p2d_cephes_exp_p1);
  px = pmadd(px, x2, p2d_cephes_exp_p2);
  px = pmul (px, x);

  Packet2d qx = p2d_cephes_exp_q0;
  qx = pmadd(qx, x2, p2d_cephes_exp_q1);
  qx = pmadd(qx, x2, p2d_cephes_exp_q2);
  qx = pmadd(qx, x2, p2d_cephes_exp_q3);

  x = pdiv(px,psub(qx,px));
  x = pmadd(p2d_2,x,p2d_1);

  // build 2^n
  emm0 = _mm_cvttpd_epi32(fx);
  emm0 = _mm_add_epi32(emm0, p4i_0x7f);
  emm0 = _mm_slli_epi32(emm0, 23);
  return pmul(x, _mm_cvtps_pd(_mm_castsi128_ps(emm0)));
}
L3Bee
Registered Member
Posts
6
Karma
0

Re: Vectorization Precision Problem

Fri Jul 13, 2012 2:20 pm
Actually, that's not quite working. At least, not as you described, and not on my computer. It has a minimum input value of approximately -87.68 Anything lower than that causes it to return a zero, which is bad. I don't know how precise you can get, but -87.68 returns about 8.33e-39, and -87.69 returns 0. An upper value of 88 is close enough for my purposes, or at least it's not causing a divide by zero error.

Thank you for your help. I have it close enough, so I can use this.

One last question. If you multiply a double array by a double, that operation is vectorized, right?
User avatar
ggael
Moderator
Posts
3447
Karma
19
OS

Re: Vectorization Precision Problem

Fri Jul 13, 2012 9:41 pm
yes, that's what I said, it works for a limited input range. To fix this, we need to correct the last part of the routine which is the multiplication by 2^fx. For instance, you can replace this last part by:

// build 2^n
double dat[2];
pstore(dat, fx);
dat[0] = std::pow(2,dat[0]);
dat[1] = std::pow(2,dat[1]);
return pmul(x,pload<Packet2d>(dat));

to get correct results but at the cost of a big performance lost....
L3Bee
Registered Member
Posts
6
Karma
0

Re: Vectorization Precision Problem

Fri Jul 27, 2012 3:20 pm
Ok, I've got 2^fx working.

Code: Select all
template<> EIGEN_DEFINE_FUNCTION_ALLOWING_MULTIPLE_DEFINITIONS EIGEN_UNUSED
Packet2d pexp<Packet2d>(const Packet2d& _x)
{
  Packet2d x = _x;

  _EIGEN_DECLARE_CONST_Packet2d(1 , 1.0);
  _EIGEN_DECLARE_CONST_Packet2d(2 , 2.0);
  _EIGEN_DECLARE_CONST_Packet2d(half, 0.5);
  //_EIGEN_DECLARE_CONST_Packet4i(0x7f, 0x7f);

  _EIGEN_DECLARE_CONST_Packet2d(cephes_LOG2EF, 1.4426950408889634073599);

  _EIGEN_DECLARE_CONST_Packet2d(exp_hi,  7.08396418532264106224e2);
  _EIGEN_DECLARE_CONST_Packet2d(exp_lo, -7.08396418532264106224e2);

  _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_p0, 1.26177193074810590878e-4);
  _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_p1, 3.02994407707441961300e-2);
  _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_p2, 9.99999999999999999910e-1);

  _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_q0, 3.00198505138664455042e-6);
  _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_q1, 2.52448340349684104192e-3);
  _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_q2, 2.27265548208155028766e-1);
  _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_q3, 2.00000000000000000009e0);

  _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_C1, 0.693145751953125);
  _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_C2, 1.42860682030941723212e-6);
  static const __m128i p4i_1023_0 = _mm_setr_epi32(1023, 1023, 0, 0);

  x = _mm_min_pd(p2d_exp_hi, _mm_max_pd(p2d_exp_lo, x));

  Packet2d tmp = _mm_setzero_pd(), fx;
  /* express exp(x) as exp(g + n*log(2)) */
  fx = pmadd(p2d_cephes_LOG2EF, x, p2d_half);
  /* how to perform a floorf with SSE: just below */
  Packet4i emm0 = _mm_cvttpd_epi32(fx);
  tmp  = _mm_cvtepi32_pd(emm0);
  /* if greater, substract 1 */
  Packet2d mask = _mm_cmpgt_pd(tmp, fx);
  mask = _mm_and_pd(mask, p2d_1);
  fx = psub(tmp, mask);

  tmp = pmul(fx, p2d_cephes_exp_C1);
  Packet2d z = pmul(fx, p2d_cephes_exp_C2);
  x = psub(x, tmp);
  x = psub(x, z);

  Packet2d x2 = pmul(x,x);

  Packet2d px = p2d_cephes_exp_p0;
  px = pmadd(px, x2, p2d_cephes_exp_p1);
  px = pmadd(px, x2, p2d_cephes_exp_p2);
  px = pmul (px, x);

  Packet2d qx = p2d_cephes_exp_q0;
  qx = pmadd(qx, x2, p2d_cephes_exp_q1);
  qx = pmadd(qx, x2, p2d_cephes_exp_q2);
  qx = pmadd(qx, x2, p2d_cephes_exp_q3);

  x = pdiv(px,psub(qx,px));
  x = pmadd(p2d_2,x,p2d_1);

  // build 2^n
  emm0 = _mm_cvttpd_epi32(fx);
  emm0 = _mm_add_epi32(emm0, p4i_1023_0);
  emm0 = _mm_slli_epi32(emm0, 20);
  emm0 = _mm_shuffle_epi32(emm0, _MM_SHUFFLE(1,2,0,3));
  return pmul(x, _mm_castsi128_pd(emm0));
}

This works for my cases, which don't come very close to the edge cases. You might want to check the bounds and make sure it doesn't do anything silly like return zero before you include this.
User avatar
ggael
Moderator
Posts
3447
Karma
19
OS

Re: Vectorization Precision Problem

Sat Jul 28, 2012 6:57 am
excellent:

https://bitbucket.org/eigen/eigen/chang ... 9d69a6827/
changeset: 3619d69a6827
user: ggael
date: 2012-07-27 23:40:04
summary: add SSE pexp function for double, make use of _mm_floor_p* for pexp with SSE4.1
Hauke
Registered Member
Posts
109
Karma
3
OS

Re: Vectorization Precision Problem

Tue Jul 31, 2012 11:47 am
I am running into troubles with the new code. I've got input values such as x = -737.77991177461399

The old version resulted in:
std::exp(x) = 3.858652694020e-321#DEN

The new one results in:
pexp(x) = -1.#INF000000000000

Do you see a way to fix this issue?

Regards,
Hauke
L3Bee
Registered Member
Posts
6
Karma
0

Re: Vectorization Precision Problem

Tue Jul 31, 2012 12:56 pm
Well, that's outside the bounds, so it should round. Unfortunately, the bounds that come from the Cephes methods don't seem to be exactly right. Try changing the minimum rounding bound (that's currently -7.08396418532264106224e2) upwards until it no longer fails. This is what I was worried about, unfortunately, I don't know enough about how the math works inside the computer to figure out the new bounds, or why it's not meeting the old. You'll probably just have to do it manually. Set up a small test program and fill a vector with -750 and use .exp() and keep changing the bounds until it works.

And I took my own advice and set up that small test program. I'm getting proper rounding behavior.
for x = -737.77991177461399;
Vec.setConstant(x); Vec.exp() = 2.22507e-308, which is the same as exp(-708....) the proper rounding.
and exp(x) = 3.85865e-321

So, I have to say those dreaded words, "It works on my machine".

Questions to ask. Are you using doubles?
Did you modify the PacketMath.h in the same folder as the MathFunctions.h? You need to add the line
HasExp = 1, or something like that (use the float version above it as an example), or it won't use the new function.
Hauke
Registered Member
Posts
109
Karma
3
OS

Re: Vectorization Precision Problem

Wed Aug 01, 2012 5:51 am
The issue is fixed with the following commit:

https://bitbucket.org/eigen/eigen/chang ... 815a797c62

Now I am getting nice and clean zeros instead of INFs. Perfect. Thanks Gael!

- Hauke


Bookmarks



Who is online

Registered users: Baidu [Spider], Bing [Bot], Google [Bot]