This forum has been archived. All content is frozen. Please use KDE Discuss instead.

I can't get SSE2 code in MSVC 2005, help!

bjacob
It's pretty simple: in the development branch, when you say Aligned, you tell Eigen that it is safe to assume that your array's start pointer is a multiple of 16 bytes. Eigen then uses SSE instructions that rely on that assumption, namely in ei_pload, which is why you get the crash there.

The cause of the crash is that you did something of the form
Code: Select all
Map<MatrixXd, Aligned> my_map(some_non_aligned_ptr, x, y);

where some_non_aligned_ptr is not a multiple of 16 bytes. For example, on many platforms, malloc() and "new" return such pointers. When you need a pointer aligned to 16 bytes, you need to use Eigen's functions ei_aligned_malloc or ei_aligned_new.
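A quick way to find the allocation that breaks the Aligned promise is to test each pointer before mapping it. The following is a minimal sketch in plain C++ that does not use Eigen; is_aligned and AlignedBuffer are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true when p is a multiple of `alignment` bytes.
// `alignment` must be a non-zero power of two (16 for SSE packets).
inline bool is_aligned(const void* p, std::size_t alignment = 16) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical helper: a buffer whose first element is guaranteed to sit on a
// 16-byte boundary, used here only to produce aligned and misaligned addresses.
struct AlignedBuffer {
    alignas(16) double data[4];
};
```

With an AlignedBuffer named buf, is_aligned(buf.data) holds, while is_aligned(buf.data + 1) does not, because that address is offset by 8 bytes. A check of this kind placed in front of every Map<..., Aligned> construction should point at the offending allocation before the SSE load crashes.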


Join us on Eigen's IRC channel: #eigen on irc.freenode.net
Have a serious interest in Eigen? Then join the mailing list!
martinakos
Hi bjacob,

Thanks for your help again. I double-checked the memory allocation for the pointer I was mapping and found one place I wasn't controlling, and that was producing the exception. Now I'm controlling all the allocations and making them aligned. So the code I referred to before is working again with the development branch; however, I think there is still something else wrong. The code is still slower than the hand-written one. And when I look at the disassembly for the line
eResults += eVector * eMatrix.transpose();

I see some instructions and a call to ei_unaligned_assign_impl, which makes me think it's not using the aligned version of the code.

I attach the disassembled code for the line:
eResults += eVector * eMatrix.transpose();


00837E80 lea eax,[esp+18h]
00837E84 push eax
00837E85 lea ecx,[esp+60h]
00837E89 lea edx,[esp+44h]
00837E8D push ecx
00837E8E lea ecx,[esp+38h]
00837E92 mov dword ptr [esp+20h],edx
00837E96 call Eigen::MatrixBase<Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > >::operator*<Eigen::Transpose<Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > > > (839CC0h)
00837E9B lea edx,[esp+20h]
00837E9F mov dword ptr [esp+18h],edx
00837EA3 lea ecx,[esp+17h]
00837EA7 lea edx,[esp+50h]
00837EAB mov dword ptr [esp+1Ch],ecx
00837EAF push edx
00837EB0 mov ecx,eax
00837EB2 mov byte ptr [esp+110h],1
00837EBA call Eigen::DenseBase<Eigen::GeneralProduct<Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> >,Eigen::Transpose<Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > >,5> >::eval (839F80h)
00837EBF mov edi,dword ptr [esp+28h]
00837EC3 imul edi,dword ptr [esp+24h]
00837EC8 mov esi,eax
00837ECA mov eax,edi
00837ECC cdq
00837ECD sub eax,edx
00837ECF sar eax,1
00837ED1 add eax,eax
00837ED3 xor ecx,ecx
00837ED5 test eax,eax
00837ED7 jle ImageMetrics::DoubleMatrix::AddVecMatTransposeProduct+1A2h (837F02h)
00837ED9 lea esp,[esp]
00837EE0 mov edx,dword ptr [esi]
00837EE2 movapd xmm1,xmmword ptr [edx+ecx*8]
00837EE7 mov edx,dword ptr [esp+20h]
00837EEB movapd xmm0,xmmword ptr [edx+ecx*8]
00837EF0 lea edx,[edx+ecx*8]
00837EF3 add ecx,2
00837EF6 cmp ecx,eax
00837EF8 addpd xmm0,xmm1
00837EFC movapd xmmword ptr [edx],xmm0
00837F00 jl ImageMetrics::DoubleMatrix::AddVecMatTransposeProduct+180h (837EE0h)
00837F02 push edi
00837F03 push eax
00837F04 lea eax,[esp+20h]
00837F08 push eax
00837F09 push esi
00837F0A call Eigen::ei_unaligned_assign_impl<0>::run<Eigen::Matrix<double,33331,33331,0,33331,33331>,Eigen::SelfCwiseBinaryOp<Eigen::ei_scalar_sum_op<double>,Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > > > (839A30h)
00837F0F mov ecx,dword ptr [esp+60h]
00837F13 mov esi,dword ptr [__imp___aligned_free (9E86CCh)]
00837F19 push ecx
00837F1A call esi
00837F1C mov edx,dword ptr [esp+78h]
00837F20 push edx
00837F21 call esi
00837F23 add esp,18h


Call to operator*

template<typename Derived>
template<typename OtherDerived>
inline const typename ProductReturnType<Derived,OtherDerived>::Type
MatrixBase<Derived>::operator*(const MatrixBase<OtherDerived> &other) const
{
00839CC0 push ecx
// A note regarding the function declaration: In MSVC, this function will sometimes
// not be inlined since ei_matrix_storage is an unwindable object for dynamic
// matrices and product types are holding a member to store the result.
// Thus it does not help tagging this function with EIGEN_STRONG_INLINE.
enum {
ProductIsValid = Derived::ColsAtCompileTime==Dynamic
|| OtherDerived::RowsAtCompileTime==Dynamic
|| int(Derived::ColsAtCompileTime)==int(OtherDerived::RowsAtCompileTime),
AreVectors = Derived::IsVectorAtCompileTime && OtherDerived::IsVectorAtCompileTime,
SameSizes = EIGEN_PREDICATE_SAME_MATRIX_SIZE(Derived,OtherDerived)
};
// note to the lost user:
// * for a dot product use: v1.dot(v2)
// * for a coeff-wise product use: v1.cwiseProduct(v2)
EIGEN_STATIC_ASSERT(ProductIsValid || !(AreVectors && SameSizes),
INVALID_VECTOR_VECTOR_PRODUCT__IF_YOU_WANTED_A_DOT_OR_COEFF_WISE_PRODUCT_YOU_MUST_USE_THE_EXPLICIT_FUNCTIONS)
EIGEN_STATIC_ASSERT(ProductIsValid || !(SameSizes && !AreVectors),
INVALID_MATRIX_PRODUCT__IF_YOU_WANTED_A_COEFF_WISE_PRODUCT_YOU_MUST_USE_THE_EXPLICIT_FUNCTION)
EIGEN_STATIC_ASSERT(ProductIsValid || SameSizes, INVALID_MATRIX_PRODUCT)
return typename ProductReturnType<Derived,OtherDerived>::Type(derived(), other.derived());
00839CC1 mov eax,dword ptr [esp+8]
00839CC5 mov dword ptr [eax],ecx
00839CC7 mov ecx,dword ptr [esp+0Ch]
00839CCB mov ecx,dword ptr [ecx]
00839CCD xor edx,edx
00839CCF mov dword ptr [eax+4],ecx
00839CD2 mov dword ptr [esp],edx
00839CD5 mov dword ptr [eax+8],edx
00839CD8 mov dword ptr [eax+0Ch],edx
00839CDB mov dword ptr [eax+10h],edx
}
00839CDE pop ecx
00839CDF ret 8


Call to product / eval

/** \returns the matrix or vector obtained by evaluating this expression.
*
* Notice that in the case of a plain matrix or vector (not an expression) this function just returns
* a const reference, in order to avoid a useless copy.
*/
EIGEN_STRONG_INLINE const typename ei_eval<Derived>::type eval() const
{
00839F80 push 0FFFFFFFFh
00839F82 push offset __ehhandler$?eval@?$DenseBase@V?$GeneralProduct@V?$Map@V?$Matrix@N$0ICDD@$0ICDD@$0A@$0ICDD@$0ICDD@@Eigen@@$00V?$Stride@$0A@$0A@@2@@Eigen@@V?$Transpose@V?$Map@V?$Matrix@N$0ICDD@$0ICDD@$0A@$0ICDD@$0ICDD@@Eigen@@$00V?$Stride@$0A@$0A@@2@@Eigen@@@2@$04@Eigen@@@Eigen@@QBE?BV?$Matrix@N$0ICDD@$0ICDD@$0A@$0ICDD@$0ICDD@@2@XZ (9DD278h)
00839F87 mov eax,dword ptr fs:[00000000h]
00839F8D push eax
00839F8E mov dword ptr fs:[0],esp
00839F95 push ecx
00839F96 push esi
00839F97 mov esi,ecx
// Even though MSVC does not honor strong inlining when the return type
// is a dynamic matrix, we desperately need strong inlining for fixed
// size types on MSVC.
return typename ei_eval<Derived>::type(derived());
00839F99 mov eax,dword ptr [esi+4]
00839F9C mov eax,dword ptr [eax+4]
00839F9F mov ecx,dword ptr [esi]
00839FA1 mov ecx,dword ptr [ecx+4]
00839FA4 push edi
00839FA5 mov edi,dword ptr [esp+1Ch]
00839FA9 push eax
00839FAA imul eax,ecx
00839FAD push ecx
00839FAE push eax
00839FAF mov ecx,edi
00839FB1 mov dword ptr [esp+14h],0
00839FB9 call Eigen::ei_matrix_storage<double,33331,33331,33331,0>::ei_matrix_storage<double,33331,33331,33331,0> (838720h)
00839FBE mov edx,dword ptr [esi+4]
00839FC1 mov eax,dword ptr [edx+4]
00839FC4 mov ecx,dword ptr [esi]
00839FC6 mov ecx,dword ptr [ecx+4]
00839FC9 push eax
00839FCA imul eax,ecx
00839FCD push ecx
00839FCE push eax
00839FCF mov ecx,edi
00839FD1 mov dword ptr [esp+20h],0
00839FD9 call Eigen::ei_matrix_storage<double,33331,33331,33331,1>::resize (8386B0h)
00839FDE mov edx,dword ptr [esi+4]
00839FE1 mov eax,dword ptr [edx+4]
00839FE4 mov ecx,dword ptr [esi]
00839FE6 mov ecx,dword ptr [ecx+4]
00839FE9 push eax
00839FEA imul eax,ecx
00839FED push ecx
00839FEE push eax
00839FEF mov ecx,edi
00839FF1 call Eigen::ei_matrix_storage<double,33331,33331,33331,1>::resize (8386B0h)
00839FF6 push edi
00839FF7 mov ecx,esi
00839FF9 call Eigen::ProductBase<Eigen::GeneralProduct<Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> >,Eigen::Transpose<Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > >,5>,Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> >,Eigen::Transpose<Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > > >::evalTo<Eigen::Matrix<double,33331,33331,0,33331,33331> > (839ED0h)
}
00839FFE mov ecx,dword ptr [esp+0Ch]
0083A002 mov eax,edi
0083A004 pop edi
0083A005 pop esi
0083A006 mov dword ptr fs:[0],ecx
0083A00D add esp,10h
0083A010 ret 4


Call to ei_unaligned_assign_impl<false>::run


template <>
struct ei_unaligned_assign_impl<false>
{
// MSVC must not inline this functions. If it does, it fails to optimize the
// packet access path.
#ifdef _MSC_VER
template <typename Derived, typename OtherDerived>
static EIGEN_DONT_INLINE void run(const Derived& src, OtherDerived& dst, int start, int end)
#else
template <typename Derived, typename OtherDerived>
static EIGEN_STRONG_INLINE void run(const Derived& src, OtherDerived& dst, int start, int end)
#endif
{
for (int index = start; index < end; ++index)
00839A30 mov edx,dword ptr [esp+0Ch]
00839A34 push edi
00839A35 mov edi,dword ptr [esp+14h]
00839A39 cmp edx,edi
00839A3B jge Eigen::ei_unaligned_assign_impl<0>::run<Eigen::Matrix<double,33331,33331,0,33331,33331>,Eigen::SelfCwiseBinaryOp<Eigen::ei_scalar_sum_op<double>,Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > > >+0C7h (839AF7h)
00839A41 mov eax,edi
00839A43 push ebp
00839A44 mov ebp,dword ptr [esp+10h]
00839A48 sub eax,edx
00839A4A cmp eax,4
00839A4D push esi
00839A4E jl Eigen::ei_unaligned_assign_impl<0>::run<Eigen::Matrix<double,33331,33331,0,33331,33331>,Eigen::SelfCwiseBinaryOp<Eigen::ei_scalar_sum_op<double>,Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > > >+95h (839AC5h)
00839A50 mov eax,dword ptr [esp+10h]
00839A54 mov ecx,dword ptr [ebp]
00839A57 mov ecx,dword ptr [ecx]
00839A59 push ebx
00839A5A mov ebx,dword ptr [eax]
};
bjacob
First of all, don't worry about ei_unaligned_assign_impl: it just takes care of the small parts of your matrix that can't be addressed by 16-byte packets (the beginning and end of a row/column...).
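The split between packet code and scalar leftovers can be sketched in plain C++ as follows. This is an illustration under stated assumptions, not Eigen's actual implementation: PacketRange, packet_range and add_in_place are hypothetical names, the packet body is written as two scalar additions where the real code uses movapd/addpd, and doubles are assumed to be at least 8-byte aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Index range [aligned_start, aligned_end) of an array of doubles that can be
// processed in 16-byte packets (two doubles per packet).
struct PacketRange {
    std::size_t aligned_start;
    std::size_t aligned_end;
};

inline PacketRange packet_range(const double* dst, std::size_t n) {
    const std::size_t packet_size = 2;  // doubles per 16-byte SSE2 packet
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(dst);
    assert(addr % sizeof(double) == 0);  // assumption: naturally aligned doubles
    // 0 if dst already sits on a 16-byte boundary, otherwise 1.
    std::size_t start =
        static_cast<std::size_t>((addr / sizeof(double)) % packet_size);
    if (start > n) start = n;
    const std::size_t end = start + ((n - start) / packet_size) * packet_size;
    return PacketRange{start, end};
}

// dst[i] += src[i] for all i, split into scalar head, packet body, scalar tail.
inline void add_in_place(double* dst, const double* src, std::size_t n) {
    const PacketRange r = packet_range(dst, n);
    // Scalar head: elements before the first 16-byte boundary.
    for (std::size_t i = 0; i < r.aligned_start; ++i) dst[i] += src[i];
    // Packet body: two doubles per step (stand-in for one aligned SSE2 add).
    for (std::size_t i = r.aligned_start; i < r.aligned_end; i += 2) {
        dst[i] += src[i];
        dst[i + 1] += src[i + 1];
    }
    // Scalar tail: the leftover element that does not fill a packet.
    for (std::size_t i = r.aligned_end; i < n; ++i) dst[i] += src[i];
}
```

For an aligned array of 7 doubles the packet body covers indices 0 to 5 and only the last element goes through the scalar tail, which matches the shape of the disassembly above: a movapd/addpd loop followed by a single call for the remainder.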

The most probable cause of bad performance is MSVC failing to correctly inline a function that must crucially be inlined. The reason why you'd be the first one to report this particular issue is that it's not too common to do row_vector * transpose_of_matrix, so can you just try replacing:
Code: Select all
eResults += eVector * eMatrix.transpose();

by
Code: Select all
eResults.transpose() += eMatrix * eVector.transpose();

and see if that makes any difference. If that's still slow, then the best you can do is file a bug report on our issue tracker with a self-contained compilable test program, exhibiting poor performance with MSVC 2008.


martinakos
Hi bjacob,

I have tried the alternative syntax. I don’t get any performance improvement. I’ll make an entry on the issue tracker.

Thanks for your help.
Martin.
ggael
For the record, there were two issues:

1 - You really should use Map<RowVectorXd> for the vectors; otherwise Eigen uses the general matrix*matrix product, which is not very well suited for matrix*vector products.

2 - There was a bug in Eigen preventing Map<> objects from being fully optimized.

If you update your local copy and make change 1), then the Eigen version should be significantly faster because of SSE *and* better cache use.

Also, adding Aligned is not really useful here, but adding .noalias() avoids one useless memory allocation/copy.
bjacob
1 - you really should use Map<RowVectorXd> for the vectors otherwise Eigen uses the general matrix*matrix product which is not very well suited for matrix*vectors...


Wow, this is huge, how could I let that pass!!

Great job for 2) too.

martinakos, what gael means with .noalias() is eResults.noalias() = otherstuff;
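To see what the extra allocation and copy amount to, here is a minimal sketch in plain C++ without Eigen. Both functions compute result += M * v for a row-major matrix, which is the same arithmetic as eResults += eVector * eMatrix.transpose(). The first goes through a temporary, as a product is evaluated when the destination might alias an operand; the second accumulates straight into the destination, which is roughly what .noalias() permits. The function names are hypothetical and this is not how Eigen implements either path.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// result += M * v via a temporary: one extra allocation and one extra pass.
inline void accumulate_with_temporary(const std::vector<double>& m,
                                      std::size_t rows, std::size_t cols,
                                      const std::vector<double>& v,
                                      std::vector<double>& result) {
    assert(m.size() == rows * cols && v.size() == cols && result.size() == rows);
    std::vector<double> tmp(rows, 0.0);  // the allocation .noalias() avoids
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            tmp[i] += m[i * cols + j] * v[j];
    for (std::size_t i = 0; i < rows; ++i) result[i] += tmp[i];  // extra copy pass
}

// result += M * v accumulated directly; only valid when result does not
// overlap m or v, which is the promise the caller makes.
inline void accumulate_in_place(const std::vector<double>& m,
                                std::size_t rows, std::size_t cols,
                                const std::vector<double>& v,
                                std::vector<double>& result) {
    assert(m.size() == rows * cols && v.size() == cols && result.size() == rows);
    for (std::size_t i = 0; i < rows; ++i) {
        double acc = 0.0;
        for (std::size_t j = 0; j < cols; ++j) acc += m[i * cols + j] * v[j];
        result[i] += acc;
    }
}
```

Both versions give the same numbers; the difference is only the temporary buffer and the second pass over the result. For the accumulating expression in this thread, the corresponding Eigen spelling would presumably be eResults.noalias() += eVector * eMatrix.transpose();.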


martinakos
Now it's working as expected!!! :)

Using doubles is twice as fast as the non-SSE hand-written code!

I have tried it with and without Aligned and the speed is more or less the same, so I imagine that for this size of matrices/vectors I can get good performance without needing to align the memory.

Thanks very much ggael and bjacob for your help.

Martinakos :)

