Vectorization Precision Problem • KDE Community Forums

This forum has been archived. All content is frozen. Please use KDE Discuss instead.

Board index

Vectorization Precision Problem

Page 1 of 1 (14 posts)

Tags:

L3Bee Registered Member Posts 6 Karma 0	Vectorization Precision Problem Wed Jul 11, 2012 8:34 pm I'm working on a project where I have a vector of sorted doubles, and I need to take the exp() of each. Since the data was already in an Eigen Vector, I thought I'd just use the provided methods. Unfortunately, the double versions aren't vectorized using SSE (or not in the 3.1.0 I'm using), so I thought I'd write my own using the Cephes library and the float version as a guide. It worked, kind of. The difference between this and the exp() from math.h is small most of the time, but for some values it's an absolute difference of 262144, which is enough to result in NANs farther in the code. The largest errors are from the largest input values. For example, the error of 262144 is from an input value of 57.7482317672092... Anyway, I was hoping that someone with more experience that I could point out my mistake. This is my first attempt at doing anything with SSE intrinsics, so it's likely I missed something I should have changed between float and double. Here's the code, from Eigen/src/Core/arch/SSE/MathFunctions.h Code: Select all Packet2d polylvl(const Packet2d& x, double* p, int N) { Packet2d ans = pset1<Packet2d>(p++); do { ans = pmadd(ans,x,pset1<Packet2d>(p++)); } while(--N); return ans; } template<> EIGEN_DEFINE_FUNCTION_ALLOWING_MULTIPLE_DEFINITIONS EIGEN_UNUSED Packet2d pexp<Packet2d>(const Packet2d& _x) { Packet2d x = _x; _EIGEN_DECLARE_CONST_Packet2d(1 , 1.0); _EIGEN_DECLARE_CONST_Packet2d(half, 0.5); _EIGEN_DECLARE_CONST_Packet4i(0x7f, 0x7f); _EIGEN_DECLARE_CONST_Packet2d(exp_hi, 88.029691931113054295988); _EIGEN_DECLARE_CONST_Packet2d(exp_lo, -88.72283911167299960540); _EIGEN_DECLARE_CONST_Packet2d(cephes_LOG2EF, 1.4426950408889634073599); static double P[] = { 1.26177193074810590878e-4, 3.02994407707441961300e-2, 9.99999999999999999910e-1}; static double Q[] = { 3.00198505138664455042e-6, 2.52448340349684104192e-3, 2.27265548208155028766e-1, 2.00000000000000000009e0}; _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_C1, 0.693145751953125); _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_C2, 1.42860682030941723212e-6); Packet2d tmp = _mm_setzero_pd(), fx; // clamp x x = pmax(pmin(x, p2d_exp_hi), p2d_exp_lo); /* express exp(x) as exp(g + nlog(2)) / fx = pmadd(p2d_cephes_LOG2EF, x, p2d_half); /* how to perform a floorf with SSE: just below / Packet4i emm0 = _mm_cvttpd_epi32(fx); tmp = _mm_cvtepi32_pd(emm0); / if greater, substract 1 */ Packet2d mask = _mm_cmpgt_pd(tmp, fx); mask = _mm_and_pd(mask, p2d_1); fx = psub(tmp, mask); tmp = pmul(fx, p2d_cephes_exp_C1); Packet2d z = pmul(fx, p2d_cephes_exp_C2), xx; x = psub(x, tmp); x = psub(x, z); xx = pmul(x,x); z = pmul(x,polylvl(xx,P,2)); x = pdiv(z,psub(polylvl(xx,Q,3),z)); x = pmadd(pset1<Packet2d>(2.0),x,p2d_1); // build 2^n emm0 = _mm_cvttpd_epi32(fx); emm0 = _mm_add_epi32(emm0, p4i_0x7f); emm0 = _mm_slli_epi32(emm0, 23); return pmul(x, _mm_cvtps_pd(_mm_castsi128_ps(emm0))); } Then, in PacketMath.h in the same folder, I added the line HasExp = 1, to the packet_traits<double> struct. For reference, the Cephes library is http://www.netlib.org/cephes/ Edit: I should add that this is being run on a Xeon 5472 on Win XP x86 and is compiled with VS2010 on Ox and with SSE2 enabled. I'm pretty sure this is a logic error, but if not, here you go. If you need any other information, just ask.
ggael Moderator Posts 3447 Karma 19 OS	Re: Vectorization Precision Problem Thu Jul 12, 2012 6:57 pm Hi, one problem are the hi and lo bounds that should be adjusted for double.
L3Bee Registered Member Posts 6 Karma 0	Re: Vectorization Precision Problem Thu Jul 12, 2012 7:26 pm Well, I changed those to the +-708 limits from the cephes file (is that what you meant?), but of course, that doesn't help the problem. Still getting the same difference.
ggael Moderator Posts 3447 Karma 19 OS	Re: Vectorization Precision Problem Thu Jul 12, 2012 9:46 pm Indeed, the other issue is when computing 2^fx, you really must do it for double, not float. The following works here but we of course need a faster solution: double dat[2]; pstore(dat, fx); dat[0] = std::pow(2,dat[0]); dat[1] = std::pow(2,dat[1]); return pmul(x,pload<Packet2d>(dat));
L3Bee Registered Member Posts 6 Karma 0	Re: Vectorization Precision Problem Thu Jul 12, 2012 9:54 pm Fast is good. I'll take your word that it's faster, at least until I get the accuracy right and can properly test it with my data. Any idea what's causing the errors? The only thing I can think of is not having precise enough constants. I copied them straight from the Cephes files, but maybe there's a conversion somewhere in the code.
ggael Moderator Posts 3447 Karma 19 OS	Re: Vectorization Precision Problem Fri Jul 13, 2012 6:13 am hm, to be clear the piece of code I pasted is a replacement for the end of the pexp routine: // build 2^n emm0 = _mm_cvttpd_epi32(fx); emm0 = _mm_add_epi32(emm0, p4i_0x7f); emm0 = _mm_slli_epi32(emm0, 23); return pmul(x, _mm_cvtps_pd(_mm_castsi128_ps(emm0))); which overflow for inputs greater than ~88. I have no clue whether this version is still faster than a sequential code, it was just to show the issue.
ggael Moderator Posts 3447 Karma 19 OS	Re: Vectorization Precision Problem Fri Jul 13, 2012 6:16 am For the record here is a working version up to inputs between -88, 88: Code: Select all template<> EIGEN_DEFINE_FUNCTION_ALLOWING_MULTIPLE_DEFINITIONS EIGEN_UNUSED Packet2d pexp<Packet2d>(const Packet2d& _x) { Packet2d x = _x; _EIGEN_DECLARE_CONST_Packet2d(1 , 1.0); _EIGEN_DECLARE_CONST_Packet2d(2 , 2.0); _EIGEN_DECLARE_CONST_Packet2d(half, 0.5); _EIGEN_DECLARE_CONST_Packet4i(0x7f, 0x7f); _EIGEN_DECLARE_CONST_Packet2d(cephes_LOG2EF, 1.4426950408889634073599); _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_p0, 1.26177193074810590878e-4); _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_p1, 3.02994407707441961300e-2); _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_p2, 9.99999999999999999910e-1); _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_q0, 3.00198505138664455042e-6); _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_q1, 2.52448340349684104192e-3); _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_q2, 2.27265548208155028766e-1); _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_q3, 2.00000000000000000009e0); _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_C1, 0.693145751953125); _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_C2, 1.42860682030941723212e-6); Packet2d tmp = _mm_setzero_pd(), fx; /* express exp(x) as exp(g + nlog(2)) / fx = pmadd(p2d_cephes_LOG2EF, x, p2d_half); /* how to perform a floorf with SSE: just below / Packet4i emm0 = _mm_cvttpd_epi32(fx); tmp = _mm_cvtepi32_pd(emm0); / if greater, substract 1 */ Packet2d mask = _mm_cmpgt_pd(tmp, fx); mask = _mm_and_pd(mask, p2d_1); fx = psub(tmp, mask); tmp = pmul(fx, p2d_cephes_exp_C1); Packet2d z = pmul(fx, p2d_cephes_exp_C2); x = psub(x, tmp); x = psub(x, z); Packet2d x2 = pmul(x,x); Packet2d px = p2d_cephes_exp_p0; px = pmadd(px, x2, p2d_cephes_exp_p1); px = pmadd(px, x2, p2d_cephes_exp_p2); px = pmul (px, x); Packet2d qx = p2d_cephes_exp_q0; qx = pmadd(qx, x2, p2d_cephes_exp_q1); qx = pmadd(qx, x2, p2d_cephes_exp_q2); qx = pmadd(qx, x2, p2d_cephes_exp_q3); x = pdiv(px,psub(qx,px)); x = pmadd(p2d_2,x,p2d_1); // build 2^n emm0 = _mm_cvttpd_epi32(fx); emm0 = _mm_add_epi32(emm0, p4i_0x7f); emm0 = _mm_slli_epi32(emm0, 23); return pmul(x, _mm_cvtps_pd(_mm_castsi128_ps(emm0))); }
L3Bee Registered Member Posts 6 Karma 0	Re: Vectorization Precision Problem Fri Jul 13, 2012 2:20 pm Actually, that's not quite working. At least, not as you described, and not on my computer. It has a minimum input value of approximately -87.68 Anything lower than that causes it to return a zero, which is bad. I don't know how precise you can get, but -87.68 returns about 8.33e-39, and -87.69 returns 0. An upper value of 88 is close enough for my purposes, or at least it's not causing a divide by zero error. Thank you for your help. I have it close enough, so I can use this. One last question. If you multiply a double array by a double, that operation is vectorized, right?
ggael Moderator Posts 3447 Karma 19 OS	Re: Vectorization Precision Problem Fri Jul 13, 2012 9:41 pm yes, that's what I said, it works for a limited input range. To fix this, we need to correct the last part of the routine which is the multiplication by 2^fx. For instance, you can replace this last part by: // build 2^n double dat[2]; pstore(dat, fx); dat[0] = std::pow(2,dat[0]); dat[1] = std::pow(2,dat[1]); return pmul(x,pload<Packet2d>(dat)); to get correct results but at the cost of a big performance lost....
L3Bee Registered Member Posts 6 Karma 0	Re: Vectorization Precision Problem Fri Jul 27, 2012 3:20 pm Ok, I've got 2^fx working. Code: Select all template<> EIGEN_DEFINE_FUNCTION_ALLOWING_MULTIPLE_DEFINITIONS EIGEN_UNUSED Packet2d pexp<Packet2d>(const Packet2d& _x) { Packet2d x = _x; _EIGEN_DECLARE_CONST_Packet2d(1 , 1.0); _EIGEN_DECLARE_CONST_Packet2d(2 , 2.0); _EIGEN_DECLARE_CONST_Packet2d(half, 0.5); //_EIGEN_DECLARE_CONST_Packet4i(0x7f, 0x7f); _EIGEN_DECLARE_CONST_Packet2d(cephes_LOG2EF, 1.4426950408889634073599); _EIGEN_DECLARE_CONST_Packet2d(exp_hi, 7.08396418532264106224e2); _EIGEN_DECLARE_CONST_Packet2d(exp_lo, -7.08396418532264106224e2); _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_p0, 1.26177193074810590878e-4); _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_p1, 3.02994407707441961300e-2); _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_p2, 9.99999999999999999910e-1); _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_q0, 3.00198505138664455042e-6); _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_q1, 2.52448340349684104192e-3); _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_q2, 2.27265548208155028766e-1); _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_q3, 2.00000000000000000009e0); _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_C1, 0.693145751953125); _EIGEN_DECLARE_CONST_Packet2d(cephes_exp_C2, 1.42860682030941723212e-6); static const __m128i p4i_1023_0 = _mm_setr_epi32(1023, 1023, 0, 0); x = _mm_min_pd(p2d_exp_hi, _mm_max_pd(p2d_exp_lo, x)); Packet2d tmp = _mm_setzero_pd(), fx; /* express exp(x) as exp(g + nlog(2)) / fx = pmadd(p2d_cephes_LOG2EF, x, p2d_half); /* how to perform a floorf with SSE: just below / Packet4i emm0 = _mm_cvttpd_epi32(fx); tmp = _mm_cvtepi32_pd(emm0); / if greater, substract 1 */ Packet2d mask = _mm_cmpgt_pd(tmp, fx); mask = _mm_and_pd(mask, p2d_1); fx = psub(tmp, mask); tmp = pmul(fx, p2d_cephes_exp_C1); Packet2d z = pmul(fx, p2d_cephes_exp_C2); x = psub(x, tmp); x = psub(x, z); Packet2d x2 = pmul(x,x); Packet2d px = p2d_cephes_exp_p0; px = pmadd(px, x2, p2d_cephes_exp_p1); px = pmadd(px, x2, p2d_cephes_exp_p2); px = pmul (px, x); Packet2d qx = p2d_cephes_exp_q0; qx = pmadd(qx, x2, p2d_cephes_exp_q1); qx = pmadd(qx, x2, p2d_cephes_exp_q2); qx = pmadd(qx, x2, p2d_cephes_exp_q3); x = pdiv(px,psub(qx,px)); x = pmadd(p2d_2,x,p2d_1); // build 2^n emm0 = _mm_cvttpd_epi32(fx); emm0 = _mm_add_epi32(emm0, p4i_1023_0); emm0 = _mm_slli_epi32(emm0, 20); emm0 = _mm_shuffle_epi32(emm0, _MM_SHUFFLE(1,2,0,3)); return pmul(x, _mm_castsi128_pd(emm0)); } This works for my cases, which don't come very close to the edge cases. You might want to check the bounds and make sure it doesn't do anything silly like return zero before you include this.
ggael Moderator Posts 3447 Karma 19 OS	Re: Vectorization Precision Problem Sat Jul 28, 2012 6:57 am excellent: https://bitbucket.org/eigen/eigen/chang ... 9d69a6827/ changeset: 3619d69a6827 user: ggael date: 2012-07-27 23:40:04 summary: add SSE pexp function for double, make use of _mm_floor_p* for pexp with SSE4.1
Hauke Registered Member Posts 109 Karma 3 OS	Re: Vectorization Precision Problem Tue Jul 31, 2012 11:47 am I am running into troubles with the new code. I've got input values such as x = -737.77991177461399 The old version resulted in: std::exp(x) = 3.858652694020e-321#DEN The new one results in: pexp(x) = -1.#INF000000000000 Do you see a way to fix this issue? Regards, Hauke
L3Bee Registered Member Posts 6 Karma 0	Re: Vectorization Precision Problem Tue Jul 31, 2012 12:56 pm Well, that's outside the bounds, so it should round. Unfortunately, the bounds that come from the Cephes methods don't seem to be exactly right. Try changing the minimum rounding bound (that's currently -7.08396418532264106224e2) upwards until it no longer fails. This is what I was worried about, unfortunately, I don't know enough about how the math works inside the computer to figure out the new bounds, or why it's not meeting the old. You'll probably just have to do it manually. Set up a small test program and fill a vector with -750 and use .exp() and keep changing the bounds until it works. And I took my own advice and set up that small test program. I'm getting proper rounding behavior. for x = -737.77991177461399; Vec.setConstant(x); Vec.exp() = 2.22507e-308, which is the same as exp(-708....) the proper rounding. and exp(x) = 3.85865e-321 So, I have to say those dreaded words, "It works on my machine". Questions to ask. Are you using doubles? Did you modify the PacketMath.h in the same folder as the MathFunctions.h? You need to add the line HasExp = 1, or something like that (use the float version above it as an example), or it won't use the new function.
Hauke Registered Member Posts 109 Karma 3 OS	Re: Vectorization Precision Problem Wed Aug 01, 2012 5:51 am The issue is fixed with the following commit: https://bitbucket.org/eigen/eigen/chang ... 815a797c62 Now I am getting nice and clean zeros instead of INFs. Perfect. Thanks Gael! - Hauke

Page 1 of 1 (14 posts)

Bookmarks

Who is online

Registered users: Baidu [Spider], Bing [Bot], Google [Bot]