Registered Member
|
It's pretty simple: with the development branch, when you say Aligned, you tell Eigen that it's safe to assume that your array's start pointer is a multiple of 16 bytes. Eigen then uses SSE instructions that rely on that assumption. Those instructions are in ei_pload, which is why you get the crash there.
The cause of the crash is that you mapped, as Aligned, a pointer (call it some_non_aligned_ptr) that is not a multiple of 16 bytes. For example, on many platforms, malloc() and "new" can return pointers that are not 16-byte aligned. When you need a pointer aligned to 16 bytes, you need to use Eigen's functions ei_aligned_malloc or ei_aligned_new.
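If you want to catch a misaligned pointer yourself before handing it to an Aligned map, a check along these lines should do. This is only a sketch in plain C++ with no Eigen in it; is_aligned_16, DemoBuffer and check_before_aligned_map are names I made up for illustration, not Eigen functions.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when p is a multiple of 16 bytes, which is what aligned
// SSE loads and stores require.
inline bool is_aligned_16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// A 16-byte-aligned buffer of doubles (hypothetical helper type), used only
// to obtain one aligned and one deliberately misaligned pointer.
struct DemoBuffer {
    alignas(16) double data[8];
};

// Hypothetical guard: call it on a pointer before mapping it as Aligned.
// In a debug build it aborts instead of letting an aligned load crash later.
inline void check_before_aligned_map(const double* p) {
    assert(is_aligned_16(p) && "pointer is not 16-byte aligned");
}
```

With a DemoBuffer buf, buf.data passes the check while buf.data + 1 (8 bytes further on) does not, which is the situation that makes an aligned load fault.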
Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list! |
Registered Member
|
Hi bjacob,
Thanks for your help again. I double-checked the memory allocation for the pointer I was mapping and found one place I wasn't controlling, and that was producing the exception. Now I’m controlling all the allocations and making them aligned, so the code I referred to before is working again with the development branch. However, I think there is still something else wrong: the code is still slower than the hand-written one. When I look at the disassembly for the line eResults += eVector * eMatrix.transpose(); I see some instructions and a call to ei_unaligned_assign_impl, which makes me think it’s not using the aligned version of the code. I attach the disassembled code for that line:

eResults += eVector * eMatrix.transpose();
00837E80 lea eax,[esp+18h]
00837E84 push eax
00837E85 lea ecx,[esp+60h]
00837E89 lea edx,[esp+44h]
00837E8D push ecx
00837E8E lea ecx,[esp+38h]
00837E92 mov dword ptr [esp+20h],edx
00837E96 call Eigen::MatrixBase<Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > >::operator*<Eigen::Transpose<Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > > > (839CC0h)
00837E9B lea edx,[esp+20h]
00837E9F mov dword ptr [esp+18h],edx
00837EA3 lea ecx,[esp+17h]
00837EA7 lea edx,[esp+50h]
00837EAB mov dword ptr [esp+1Ch],ecx
00837EAF push edx
00837EB0 mov ecx,eax
00837EB2 mov byte ptr [esp+110h],1
00837EBA call Eigen::DenseBase<Eigen::GeneralProduct<Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> >,Eigen::Transpose<Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > >,5> >::eval (839F80h)
00837EBF mov edi,dword ptr [esp+28h]
00837EC3 imul edi,dword ptr [esp+24h]
00837EC8 mov esi,eax
00837ECA mov eax,edi
00837ECC cdq
00837ECD sub eax,edx
00837ECF sar eax,1
00837ED1 add eax,eax
00837ED3 xor ecx,ecx
00837ED5 test eax,eax
00837ED7 jle ImageMetrics::DoubleMatrix::AddVecMatTransposeProduct+1A2h (837F02h)
00837ED9 lea esp,[esp]
00837EE0 mov edx,dword ptr [esi]
00837EE2 movapd xmm1,xmmword ptr [edx+ecx*8]
00837EE7 mov edx,dword ptr [esp+20h]
00837EEB movapd xmm0,xmmword ptr [edx+ecx*8]
00837EF0 lea edx,[edx+ecx*8]
00837EF3 add ecx,2
00837EF6 cmp ecx,eax
00837EF8 addpd xmm0,xmm1
00837EFC movapd xmmword ptr [edx],xmm0
00837F00 jl ImageMetrics::DoubleMatrix::AddVecMatTransposeProduct+180h (837EE0h)
00837F02 push edi
00837F03 push eax
00837F04 lea eax,[esp+20h]
00837F08 push eax
00837F09 push esi
00837F0A call Eigen::ei_unaligned_assign_impl<0>::run<Eigen::Matrix<double,33331,33331,0,33331,33331>,Eigen::SelfCwiseBinaryOp<Eigen::ei_scalar_sum_op<double>,Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > > > (839A30h)
00837F0F mov ecx,dword ptr [esp+60h]
00837F13 mov esi,dword ptr [__imp___aligned_free (9E86CCh)]
00837F19 push ecx
00837F1A call esi
00837F1C mov edx,dword ptr [esp+78h]
00837F20 push edx
00837F21 call esi
00837F23 add esp,18h

Call to operator*

template<typename Derived>
template<typename OtherDerived>
inline const typename ProductReturnType<Derived,OtherDerived>::Type
MatrixBase<Derived>::operator*(const MatrixBase<OtherDerived> &other) const
{
00839CC0 push ecx
// A note regarding the function declaration: In MSVC, this function will sometimes
// not be inlined since ei_matrix_storage is an unwindable object for dynamic
// matrices and product types are holding a member to store the result.
// Thus it does not help tagging this function with EIGEN_STRONG_INLINE.
enum {
ProductIsValid = Derived::ColsAtCompileTime==Dynamic || OtherDerived::RowsAtCompileTime==Dynamic || int(Derived::ColsAtCompileTime)==int(OtherDerived::RowsAtCompileTime),
AreVectors = Derived::IsVectorAtCompileTime && OtherDerived::IsVectorAtCompileTime,
SameSizes = EIGEN_PREDICATE_SAME_MATRIX_SIZE(Derived,OtherDerived)
};
// note to the lost user:
// * for a dot product use: v1.dot(v2)
// * for a coeff-wise product use: v1.cwiseProduct(v2)
EIGEN_STATIC_ASSERT(ProductIsValid || !(AreVectors && SameSizes), INVALID_VECTOR_VECTOR_PRODUCT__IF_YOU_WANTED_A_DOT_OR_COEFF_WISE_PRODUCT_YOU_MUST_USE_THE_EXPLICIT_FUNCTIONS)
EIGEN_STATIC_ASSERT(ProductIsValid || !(SameSizes && !AreVectors), INVALID_MATRIX_PRODUCT__IF_YOU_WANTED_A_COEFF_WISE_PRODUCT_YOU_MUST_USE_THE_EXPLICIT_FUNCTION)
EIGEN_STATIC_ASSERT(ProductIsValid || SameSizes, INVALID_MATRIX_PRODUCT)
return typename ProductReturnType<Derived,OtherDerived>::Type(derived(), other.derived());
00839CC1 mov eax,dword ptr [esp+8]
00839CC5 mov dword ptr [eax],ecx
00839CC7 mov ecx,dword ptr [esp+0Ch]
00839CCB mov ecx,dword ptr [ecx]
00839CCD xor edx,edx
00839CCF mov dword ptr [eax+4],ecx
00839CD2 mov dword ptr [esp],edx
00839CD5 mov dword ptr [eax+8],edx
00839CD8 mov dword ptr [eax+0Ch],edx
00839CDB mov dword ptr [eax+10h],edx
}
00839CDE pop ecx
00839CDF ret 8

Call to product / eval

/** \returns the matrix or vector obtained by evaluating this expression.
*
* Notice that in the case of a plain matrix or vector (not an expression) this function just returns
* a const reference, in order to avoid a useless copy.
*/
EIGEN_STRONG_INLINE const typename ei_eval<Derived>::type eval() const
{
00839F80 push 0FFFFFFFFh
00839F82 push offset __ehhandler$?eval@?$DenseBase@V?$GeneralProduct@V?$Map@V?$Matrix@N$0ICDD@$0ICDD@$0A@$0ICDD@$0ICDD@@Eigen@@$00V?$Stride@$0A@$0A@@2@@Eigen@@V?$Transpose@V?$Map@V?$Matrix@N$0ICDD@$0ICDD@$0A@$0ICDD@$0ICDD@@Eigen@@$00V?$Stride@$0A@$0A@@2@@Eigen@@@2@$04@Eigen@@@Eigen@@QBE?BV?$Matrix@N$0ICDD@$0ICDD@$0A@$0ICDD@$0ICDD@@2@XZ (9DD278h)
00839F87 mov eax,dword ptr fs:[00000000h]
00839F8D push eax
00839F8E mov dword ptr fs:[0],esp
00839F95 push ecx
00839F96 push esi
00839F97 mov esi,ecx
// Even though MSVC does not honor strong inlining when the return type
// is a dynamic matrix, we desperately need strong inlining for fixed
// size types on MSVC.
return typename ei_eval<Derived>::type(derived());
00839F99 mov eax,dword ptr [esi+4]
00839F9C mov eax,dword ptr [eax+4]
00839F9F mov ecx,dword ptr [esi]
00839FA1 mov ecx,dword ptr [ecx+4]
00839FA4 push edi
00839FA5 mov edi,dword ptr [esp+1Ch]
00839FA9 push eax
00839FAA imul eax,ecx
00839FAD push ecx
00839FAE push eax
00839FAF mov ecx,edi
00839FB1 mov dword ptr [esp+14h],0
00839FB9 call Eigen::ei_matrix_storage<double,33331,33331,33331,0>::ei_matrix_storage<double,33331,33331,33331,0> (838720h)
00839FBE mov edx,dword ptr [esi+4]
00839FC1 mov eax,dword ptr [edx+4]
00839FC4 mov ecx,dword ptr [esi]
00839FC6 mov ecx,dword ptr [ecx+4]
00839FC9 push eax
00839FCA imul eax,ecx
00839FCD push ecx
00839FCE push eax
00839FCF mov ecx,edi
00839FD1 mov dword ptr [esp+20h],0
00839FD9 call Eigen::ei_matrix_storage<double,33331,33331,33331,1>::resize (8386B0h)
00839FDE mov edx,dword ptr [esi+4]
00839FE1 mov eax,dword ptr [edx+4]
00839FE4 mov ecx,dword ptr [esi]
00839FE6 mov ecx,dword ptr [ecx+4]
00839FE9 push eax
00839FEA imul eax,ecx
00839FED push ecx
00839FEE push eax
00839FEF mov ecx,edi
00839FF1 call Eigen::ei_matrix_storage<double,33331,33331,33331,1>::resize (8386B0h)
00839FF6 push edi
00839FF7 mov ecx,esi
00839FF9 call Eigen::ProductBase<Eigen::GeneralProduct<Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> >,Eigen::Transpose<Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > >,5>,Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> >,Eigen::Transpose<Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > > >::evalTo<Eigen::Matrix<double,33331,33331,0,33331,33331> > (839ED0h)
}
00839FFE mov ecx,dword ptr [esp+0Ch]
0083A002 mov eax,edi
0083A004 pop edi
0083A005 pop esi
0083A006 mov dword ptr fs:[0],ecx
0083A00D add esp,10h
0083A010 ret 4

Call to ei_unaligned_assign_impl

template <> struct ei_unaligned_assign_impl<false>
{
// MSVC must not inline this functions. If it does, it fails to optimize the
// packet access path.
#ifdef _MSC_VER
template <typename Derived, typename OtherDerived>
static EIGEN_DONT_INLINE void run(const Derived& src, OtherDerived& dst, int start, int end)
#else
template <typename Derived, typename OtherDerived>
static EIGEN_STRONG_INLINE void run(const Derived& src, OtherDerived& dst, int start, int end)
#endif
{
for (int index = start; index < end; ++index)
00839A30 mov edx,dword ptr [esp+0Ch]
00839A34 push edi
00839A35 mov edi,dword ptr [esp+14h]
00839A39 cmp edx,edi
00839A3B jge Eigen::ei_unaligned_assign_impl<0>::run<Eigen::Matrix<double,33331,33331,0,33331,33331>,Eigen::SelfCwiseBinaryOp<Eigen::ei_scalar_sum_op<double>,Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > > >+0C7h (839AF7h)
00839A41 mov eax,edi
00839A43 push ebp
00839A44 mov ebp,dword ptr [esp+10h]
00839A48 sub eax,edx
00839A4A cmp eax,4
00839A4D push esi
00839A4E jl Eigen::ei_unaligned_assign_impl<0>::run<Eigen::Matrix<double,33331,33331,0,33331,33331>,Eigen::SelfCwiseBinaryOp<Eigen::ei_scalar_sum_op<double>,Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > > >+95h (839AC5h)
00839A50 mov eax,dword ptr [esp+10h]
00839A54 mov ecx,dword ptr [ebp]
00839A57 mov ecx,dword ptr [ecx]
00839A59 push ebx
00839A5A mov ebx,dword ptr [eax]
}; |
Registered Member
|
First of all, don't worry about ei_unaligned_assign_impl: it is just taking care of the small parts of your matrix that can't be addressed by 16-byte packets (the beginning and end of a row/column...).
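To picture what that leftover handling looks like, here is a rough sketch of the same idea in plain C++: a scalar loop for the misaligned start, a two-doubles-at-a-time loop for the aligned middle, and a scalar loop for whatever remains. It is not Eigen's code; first_aligned_index and add_assign_split are hypothetical names, and the "packet" is just a pair of scalar operations standing in for one 16-byte SSE operation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Index of the first element of p that sits on a 16-byte boundary, clamped to n.
// Assumes p is at least 8-byte aligned, as is normal for an array of doubles.
inline std::size_t first_aligned_index(const double* p, std::size_t n) {
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    assert(addr % sizeof(double) == 0);
    const std::size_t skip = (addr % 16 == 0) ? 0 : 1;
    return skip < n ? skip : n;
}

// dst[i] += src[i] for i in [0, n), split into head, packet body and tail.
// Returns how many elements went through the scalar (head + tail) loops.
inline std::size_t add_assign_split(double* dst, const double* src, std::size_t n) {
    const std::size_t head = first_aligned_index(dst, n);
    const std::size_t body_end = head + ((n - head) / 2) * 2;

    // Scalar head: elements before the first 16-byte boundary of dst.
    for (std::size_t i = 0; i < head; ++i) dst[i] += src[i];

    // Packet body: two doubles (16 bytes) per iteration.
    for (std::size_t i = head; i < body_end; i += 2) {
        dst[i] += src[i];
        dst[i + 1] += src[i + 1];
    }

    // Scalar tail: at most one element that does not fill a packet.
    for (std::size_t i = body_end; i < n; ++i) dst[i] += src[i];

    return head + (n - body_end);
}
```

For an aligned destination with an odd element count, only the last element takes the scalar path, which matches the shape of the disassembly above: a movapd/addpd loop followed by one call for the remainder.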
The most probable cause of bad performance is MSVC failing to correctly inline a function that must crucially be inlined. The reason why you'd be the first one to report this particular issue is that it's not too common to do row_vector * transpose_of_matrix. So can you just try replacing:
by
and see if that makes any difference. If that's still slow, then the best you can do is file a bug report on our issue tracker with a self-contained compilable test program, exhibiting poor performance with MSVC 2008.
Registered Member
|
Hi bjacob,
I have tried the alternative syntax. I don’t get any performance improvement. I’ll make an entry on the issue tracker. Thanks for your help. Martin. |
Moderator
|
For the record, there were two issues:
1 - You really should use Map<RowVectorXd> for the vectors; otherwise Eigen uses the general matrix*matrix product, which is not very well suited to matrix*vector products.
2 - There was a bug in Eigen preventing Map<> objects from being fully optimized.
If you update your local copy and make change 1), then the Eigen version should really be significantly faster because of SSE *and* better cache use. Also, adding Aligned is not really useful here, but adding .noalias() avoids one useless memory allocation/copy. |
Registered Member
|
Wow, this is huge, how could I let that pass!! Great job for 2) too. martinakos, what gael means with .noalias() is eResults.noalias() = otherstuff;
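For anyone wondering what the saved alloc/copy amounts to, here is a plain C++ sketch (no Eigen) of result += v * M^T done both ways. The function names and the column-major layout are my assumptions for illustration. For the original += line, my understanding is that the Eigen spelling would be eResults.noalias() += eVector * eMatrix.transpose(); but treat that as my reading of the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes result += v * M^T, i.e. result[i] += sum_j v[j] * M(i, j),
// where M is rows x cols and stored column-major: M(i, j) == M[j * rows + i].

// With a temporary: the product is evaluated into tmp first, then added.
inline void add_vec_mat_transpose_temp(std::vector<double>& result,
                                       const std::vector<double>& v,
                                       const std::vector<double>& M,
                                       std::size_t rows, std::size_t cols) {
    assert(result.size() == rows && v.size() == cols && M.size() == rows * cols);
    std::vector<double> tmp(rows, 0.0);  // the extra allocation
    for (std::size_t j = 0; j < cols; ++j)
        for (std::size_t i = 0; i < rows; ++i)
            tmp[i] += v[j] * M[j * rows + i];
    for (std::size_t i = 0; i < rows; ++i)  // the extra pass over the data
        result[i] += tmp[i];
}

// Without a temporary: accumulate straight into result. This is only valid
// because result does not overlap v or M, which is the promise that
// .noalias() makes on the caller's behalf.
inline void add_vec_mat_transpose_direct(std::vector<double>& result,
                                         const std::vector<double>& v,
                                         const std::vector<double>& M,
                                         std::size_t rows, std::size_t cols) {
    assert(result.size() == rows && v.size() == cols && M.size() == rows * cols);
    for (std::size_t j = 0; j < cols; ++j)
        for (std::size_t i = 0; i < rows; ++i)  // inner loop walks one column contiguously
            result[i] += v[j] * M[j * rows + i];
}
```

Both versions give the same numbers; the second one just skips the temporary buffer and the extra pass, and its inner loop reads memory in storage order.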
Registered Member
|
Now it's working as expected!!!
Using doubles, it is twice as fast as the non-SSE hand-written code! I have tried it with and without Aligned and the speed is more or less the same, so I imagine that for this size of matrices/vectors I can get good performance without needing to align the memory. Thanks very much ggael and bjacob for your help. Martinakos |