This forum has been archived. All content is frozen. Please use KDE Discuss instead.

I can't get SSE2 code in MSVC 2005, help!

Registered Member
It's pretty simple: with the development branch, when you say Aligned, you tell Eigen that it may rely on the assumption that your array's start pointer is 16-byte aligned (its address is a multiple of 16). Eigen then uses SSE instructions that depend on that assumption, namely in ei_pload, which is why you get the crash there.
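A minimal sketch, assuming a C++11 compiler, of checking that precondition yourself before building a Map with Aligned. The helpers is_aligned16 and require_aligned16 are hypothetical and not part of Eigen. They only test whether the address is a multiple of 16.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of Eigen): true if p's address is a
// multiple of 16, the precondition a Map declared Aligned relies on.
bool is_aligned16(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical guard to call before wrapping p in Map<MatrixXd, Aligned>.
// Like any assert, the check is compiled out when NDEBUG is defined.
void require_aligned16(const void* p) {
  assert(is_aligned16(p) && "Map<..., Aligned> needs a 16-byte aligned pointer");
  (void)p;  // avoid an unused-parameter warning in NDEBUG builds
}
```

For example, calling require_aligned16(ptr) just before `Map<MatrixXd, Aligned> m(ptr, rows, cols);` turns a crash deep inside ei_pload into a readable assertion failure in debug builds.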

The crash happens because you did something of the form
Code:
Map<MatrixXd, Aligned> my_map(some_non_aligned_ptr, x, y);

where some_non_aligned_ptr is not aligned on a 16-byte boundary (its address is not a multiple of 16). On many platforms, malloc() and "new" can return such pointers. When you need a 16-byte-aligned pointer, use Eigen's functions ei_aligned_malloc or ei_aligned_new.
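To illustrate what an aligned allocator has to do, here is a minimal sketch of one common technique: over-allocate with std::malloc, round the address up to the next multiple of 16, and store the original pointer just before the aligned block so it can be freed later. The names aligned_malloc16 and aligned_free16 are hypothetical. This is not necessarily how ei_aligned_malloc works, and in real code you would call Eigen's functions (or, on MSVC, the CRT's _aligned_malloc/_aligned_free; the disassembly later in this thread contains calls to _aligned_free).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper (not Eigen's API): returns n bytes starting at a
// 16-byte boundary, or nullptr on failure.
void* aligned_malloc16(std::size_t n) {
  // Room for the payload, up to 15 bytes of padding and one stored pointer.
  void* original = std::malloc(n + 15 + sizeof(void*));
  if (original == nullptr) return nullptr;
  std::uintptr_t raw = reinterpret_cast<std::uintptr_t>(original) + sizeof(void*);
  std::uintptr_t aligned = (raw + 15) & ~static_cast<std::uintptr_t>(15);
  // Remember where the block really starts, just before the aligned address.
  reinterpret_cast<void**>(aligned)[-1] = original;
  void* result = reinterpret_cast<void*>(aligned);
  assert(reinterpret_cast<std::uintptr_t>(result) % 16 == 0);
  return result;
}

// Frees a pointer obtained from aligned_malloc16 (never pass it to free()).
void aligned_free16(void* p) {
  if (p != nullptr) std::free(static_cast<void**>(p)[-1]);
}
```

Every aligned_malloc16 must be paired with aligned_free16: passing the aligned pointer to plain free() would hand the allocator an address it never returned.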

Join us on Eigen's IRC channel: #eigen on
Have a serious interest in Eigen? Then join the mailing list!
Registered Member
Hi bjacob,

Thanks for your help again. I double-checked the memory allocation for the pointer I was mapping and found one place I wasn't controlling; that was what produced the exception. Now I’m controlling all the allocations and making them aligned, so the code I referred to before works again with the development branch. However, I think something else is still wrong: the code is still slower than the hand-written one. And when I look at the disassembly for the line
eResults += eVector * eMatrix.transpose();

I see some SSE instructions (movapd, addpd) and a call to ei_unaligned_assign_impl, which makes me think it’s not using the aligned version of the code.

I attach the disassembled code for the line:
eResults += eVector * eMatrix.transpose();

00837E80 lea eax,[esp+18h]
00837E84 push eax
00837E85 lea ecx,[esp+60h]
00837E89 lea edx,[esp+44h]
00837E8D push ecx
00837E8E lea ecx,[esp+38h]
00837E92 mov dword ptr [esp+20h],edx
00837E96 call Eigen::MatrixBase<Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > >::operator*<Eigen::Transpose<Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > > > (839CC0h)
00837E9B lea edx,[esp+20h]
00837E9F mov dword ptr [esp+18h],edx
00837EA3 lea ecx,[esp+17h]
00837EA7 lea edx,[esp+50h]
00837EAB mov dword ptr [esp+1Ch],ecx
00837EAF push edx
00837EB0 mov ecx,eax
00837EB2 mov byte ptr [esp+110h],1
00837EBA call Eigen::DenseBase<Eigen::GeneralProduct<Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> >,Eigen::Transpose<Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > >,5> >::eval (839F80h)
00837EBF mov edi,dword ptr [esp+28h]
00837EC3 imul edi,dword ptr [esp+24h]
00837EC8 mov esi,eax
00837ECA mov eax,edi
00837ECC cdq
00837ECD sub eax,edx
00837ECF sar eax,1
00837ED1 add eax,eax
00837ED3 xor ecx,ecx
00837ED5 test eax,eax
00837ED7 jle ImageMetrics::DoubleMatrix::AddVecMatTransposeProduct+1A2h (837F02h)
00837ED9 lea esp,[esp]
00837EE0 mov edx,dword ptr [esi]
00837EE2 movapd xmm1,xmmword ptr [edx+ecx*8]
00837EE7 mov edx,dword ptr [esp+20h]
00837EEB movapd xmm0,xmmword ptr [edx+ecx*8]
00837EF0 lea edx,[edx+ecx*8]
00837EF3 add ecx,2
00837EF6 cmp ecx,eax
00837EF8 addpd xmm0,xmm1
00837EFC movapd xmmword ptr [edx],xmm0
00837F00 jl ImageMetrics::DoubleMatrix::AddVecMatTransposeProduct+180h (837EE0h)
00837F02 push edi
00837F03 push eax
00837F04 lea eax,[esp+20h]
00837F08 push eax
00837F09 push esi
00837F0A call Eigen::ei_unaligned_assign_impl<0>::run<Eigen::Matrix<double,33331,33331,0,33331,33331>,Eigen::SelfCwiseBinaryOp<Eigen::ei_scalar_sum_op<double>,Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > > > (839A30h)
00837F0F mov ecx,dword ptr [esp+60h]
00837F13 mov esi,dword ptr [__imp___aligned_free (9E86CCh)]
00837F19 push ecx
00837F1A call esi
00837F1C mov edx,dword ptr [esp+78h]
00837F20 push edx
00837F21 call esi
00837F23 add esp,18h

Call to operator*

template<typename Derived>
template<typename OtherDerived>
inline const typename ProductReturnType<Derived,OtherDerived>::Type
MatrixBase<Derived>::operator*(const MatrixBase<OtherDerived> &other) const
00839CC0 push ecx
// A note regarding the function declaration: In MSVC, this function will sometimes
// not be inlined since ei_matrix_storage is an unwindable object for dynamic
// matrices and product types are holding a member to store the result.
// Thus it does not help tagging this function with EIGEN_STRONG_INLINE.
enum {
ProductIsValid = Derived::ColsAtCompileTime==Dynamic
|| OtherDerived::RowsAtCompileTime==Dynamic
|| int(Derived::ColsAtCompileTime)==int(OtherDerived::RowsAtCompileTime),
AreVectors = Derived::IsVectorAtCompileTime && OtherDerived::IsVectorAtCompileTime,
SameSizes = EIGEN_PREDICATE_SAME_MATRIX_SIZE(Derived,OtherDerived)
// note to the lost user:
// * for a dot product use:
// * for a coeff-wise product use: v1.cwiseProduct(v2)
EIGEN_STATIC_ASSERT(ProductIsValid || !(AreVectors && SameSizes),
EIGEN_STATIC_ASSERT(ProductIsValid || !(SameSizes && !AreVectors),
return typename ProductReturnType<Derived,OtherDerived>::Type(derived(), other.derived());
00839CC1 mov eax,dword ptr [esp+8]
00839CC5 mov dword ptr [eax],ecx
00839CC7 mov ecx,dword ptr [esp+0Ch]
00839CCB mov ecx,dword ptr [ecx]
00839CCD xor edx,edx
00839CCF mov dword ptr [eax+4],ecx
00839CD2 mov dword ptr [esp],edx
00839CD5 mov dword ptr [eax+8],edx
00839CD8 mov dword ptr [eax+0Ch],edx
00839CDB mov dword ptr [eax+10h],edx
00839CDE pop ecx
00839CDF ret 8

Call to product / eval

/** \returns the matrix or vector obtained by evaluating this expression.
* Notice that in the case of a plain matrix or vector (not an expression) this function just returns
* a const reference, in order to avoid a useless copy.
EIGEN_STRONG_INLINE const typename ei_eval<Derived>::type eval() const
00839F80 push 0FFFFFFFFh
00839F82 push offset __ehhandler$?eval@?$DenseBase@V?$GeneralProduct@V?$Map@V?$Matrix@N$0ICDD@$0ICDD@$0A@$0ICDD@$0ICDD@@Eigen@@$00V?$Stride@$0A@$0A@@2@@Eigen@@V?$Transpose@V?$Map@V?$Matrix@N$0ICDD@$0ICDD@$0A@$0ICDD@$0ICDD@@Eigen@@$00V?$Stride@$0A@$0A@@2@@Eigen@@@2@$04@Eigen@@@Eigen@@QBE?BV?$Matrix@N$0ICDD@$0ICDD@$0A@$0ICDD@$0ICDD@@2@XZ (9DD278h)
00839F87 mov eax,dword ptr fs:[00000000h]
00839F8D push eax
00839F8E mov dword ptr fs:[0],esp
00839F95 push ecx
00839F96 push esi
00839F97 mov esi,ecx
// Even though MSVC does not honor strong inlining when the return type
// is a dynamic matrix, we desperately need strong inlining for fixed
// size types on MSVC.
return typename ei_eval<Derived>::type(derived());
00839F99 mov eax,dword ptr [esi+4]
00839F9C mov eax,dword ptr [eax+4]
00839F9F mov ecx,dword ptr [esi]
00839FA1 mov ecx,dword ptr [ecx+4]
00839FA4 push edi
00839FA5 mov edi,dword ptr [esp+1Ch]
00839FA9 push eax
00839FAA imul eax,ecx
00839FAD push ecx
00839FAE push eax
00839FAF mov ecx,edi
00839FB1 mov dword ptr [esp+14h],0
00839FB9 call Eigen::ei_matrix_storage<double,33331,33331,33331,0>::ei_matrix_storage<double,33331,33331,33331,0> (838720h)
00839FBE mov edx,dword ptr [esi+4]
00839FC1 mov eax,dword ptr [edx+4]
00839FC4 mov ecx,dword ptr [esi]
00839FC6 mov ecx,dword ptr [ecx+4]
00839FC9 push eax
00839FCA imul eax,ecx
00839FCD push ecx
00839FCE push eax
00839FCF mov ecx,edi
00839FD1 mov dword ptr [esp+20h],0
00839FD9 call Eigen::ei_matrix_storage<double,33331,33331,33331,1>::resize (8386B0h)
00839FDE mov edx,dword ptr [esi+4]
00839FE1 mov eax,dword ptr [edx+4]
00839FE4 mov ecx,dword ptr [esi]
00839FE6 mov ecx,dword ptr [ecx+4]
00839FE9 push eax
00839FEA imul eax,ecx
00839FED push ecx
00839FEE push eax
00839FEF mov ecx,edi
00839FF1 call Eigen::ei_matrix_storage<double,33331,33331,33331,1>::resize (8386B0h)
00839FF6 push edi
00839FF7 mov ecx,esi
00839FF9 call Eigen::ProductBase<Eigen::GeneralProduct<Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> >,Eigen::Transpose<Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > >,5>,Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> >,Eigen::Transpose<Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > > >::evalTo<Eigen::Matrix<double,33331,33331,0,33331,33331> > (839ED0h)
00839FFE mov ecx,dword ptr [esp+0Ch]
0083A002 mov eax,edi
0083A004 pop edi
0083A005 pop esi
0083A006 mov dword ptr fs:[0],ecx
0083A00D add esp,10h
0083A010 ret 4

Call to ei_unaligned_assign_impl

template <>
struct ei_unaligned_assign_impl<false>
// MSVC must not inline this functions. If it does, it fails to optimize the
// packet access path.
#ifdef _MSC_VER
template <typename Derived, typename OtherDerived>
static EIGEN_DONT_INLINE void run(const Derived& src, OtherDerived& dst, int start, int end)
template <typename Derived, typename OtherDerived>
static EIGEN_STRONG_INLINE void run(const Derived& src, OtherDerived& dst, int start, int end)
for (int index = start; index < end; ++index)
00839A30 mov edx,dword ptr [esp+0Ch]
00839A34 push edi
00839A35 mov edi,dword ptr [esp+14h]
00839A39 cmp edx,edi
00839A3B jge Eigen::ei_unaligned_assign_impl<0>::run<Eigen::Matrix<double,33331,33331,0,33331,33331>,Eigen::SelfCwiseBinaryOp<Eigen::ei_scalar_sum_op<double>,Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > > >+0C7h (839AF7h)
00839A41 mov eax,edi
00839A43 push ebp
00839A44 mov ebp,dword ptr [esp+10h]
00839A48 sub eax,edx
00839A4A cmp eax,4
00839A4D push esi
00839A4E jl Eigen::ei_unaligned_assign_impl<0>::run<Eigen::Matrix<double,33331,33331,0,33331,33331>,Eigen::SelfCwiseBinaryOp<Eigen::ei_scalar_sum_op<double>,Eigen::Map<Eigen::Matrix<double,33331,33331,0,33331,33331>,1,Eigen::Stride<0,0> > > >+95h (839AC5h)
00839A50 mov eax,dword ptr [esp+10h]
00839A54 mov ecx,dword ptr [ebp]
00839A57 mov ecx,dword ptr [ecx]
00839A59 push ebx
00839A5A mov ebx,dword ptr [eax]
Registered Member
First of all, don't worry about ei_unaligned_assign_impl: it only takes care of the small parts of your matrix that can't be processed with 16-byte packets (the beginning and end of a row/column...).
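A simplified sketch of that split, assuming x86 with SSE2 and a C++11 compiler: the bulk of dst += src is processed two doubles (one 16-byte packet) at a time with aligned loads and stores, and whatever does not fill a whole packet goes through a scalar loop. It mirrors the movapd/addpd/movapd loop followed by the ei_unaligned_assign_impl call in the listing above, but add_assign_aligned is a hypothetical function, not Eigen's code. Because both pointers are assumed aligned (as with Map<..., Aligned>), only a tail remains; an unaligned start would also need a scalar head loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics: x86/x64 compilers only

// Hypothetical function (not Eigen's code): dst[i] += src[i] for i in [0, n).
// Both pointers must be 16-byte aligned, as with Map<..., Aligned>.
void add_assign_aligned(double* dst, const double* src, std::ptrdiff_t n) {
  assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
  assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
  const std::ptrdiff_t packet = 2;                    // doubles per 128-bit register
  const std::ptrdiff_t aligned_end = n - n % packet;  // largest multiple of 2 <= n
  // Vectorized bulk: aligned load, add, aligned store (movapd/addpd/movapd).
  for (std::ptrdiff_t i = 0; i < aligned_end; i += packet) {
    const __m128d a = _mm_load_pd(dst + i);
    const __m128d b = _mm_load_pd(src + i);
    _mm_store_pd(dst + i, _mm_add_pd(a, b));
  }
  // Scalar tail: the coefficient left over when n is odd.
  for (std::ptrdiff_t i = aligned_end; i < n; ++i) dst[i] += src[i];
}
```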

The most probable cause of the poor performance is MSVC failing to inline a function that crucially must be inlined. The reason you'd be the first one to report this particular issue is that row_vector * transpose_of_matrix is not a very common operation. So can you just try replacing:
Code:
eResults += eVector * eMatrix.transpose();

with:
Code:
eResults.transpose() += eMatrix * eVector.transpose();

and see if that makes any difference? If it's still slow, the best you can do is file a bug report on our issue tracker with a self-contained, compilable test program exhibiting the poor performance with MSVC 2008.
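The two forms compute the same thing: transposing both sides of the update with (AB)^T = B^T A^T turns the row-vector product into a matrix times column-vector product. Here r, v and M stand for eResults, eVector and eMatrix:

```latex
r \leftarrow r + v\,M^{\top}
\quad\Longleftrightarrow\quad
r^{\top} \leftarrow r^{\top} + \left(v\,M^{\top}\right)^{\top} = r^{\top} + M\,v^{\top}
```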

Registered Member
Hi bjacob,

I have tried the alternative syntax. I don’t get any performance improvement. I’ll make an entry on the issue tracker.

Thanks for your help.
User avatar
For the record, there were two issues:

1 - You really should use Map<RowVectorXd> for the vectors; otherwise Eigen uses the general matrix*matrix product, which is not well suited to matrix*vector products...

2 - There was a bug in Eigen that prevented Map<> objects from being fully optimized.

If you update your local copy and make change 1), the Eigen version should be significantly faster, thanks to SSE *and* better cache use.
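To see why a dedicated matrix*vector path helps, here is a plain-loop sketch (not Eigen's actual kernel) of res += v * M^T for a row vector v and a matrix M stored column-major, Eigen's default layout. Since (v * M^T)(i) is the sum over j of v(j) * M(i,j), putting the column loop outside lets the inner loop read each column of M contiguously, so M is streamed through memory once. The function name and the std::vector storage are assumptions for illustration; in Eigen itself the fix is simply declaring the vectors as Map<RowVectorXd>, so the product is known at compile time to be a matrix*vector one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration: res(i) += sum_j v(j) * M(i, j), i.e.
// res += v * M^T, with M a rows x cols matrix stored column-major.
void add_rowvec_times_transpose(std::vector<double>& res,
                                const std::vector<double>& v,
                                const std::vector<double>& M,
                                std::size_t rows, std::size_t cols) {
  assert(res.size() == rows && v.size() == cols && M.size() == rows * cols);
  for (std::size_t j = 0; j < cols; ++j) {
    const double vj = v[j];
    const double* col = M.data() + j * rows;  // column j is contiguous
    for (std::size_t i = 0; i < rows; ++i) res[i] += vj * col[i];
  }
}
```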

Also, adding Aligned is not really useful here, but adding .noalias() avoids one unnecessary memory allocation and copy.
Registered Member
Quote:
1 - you really should use Map<RowVectorXd> for the vectors otherwise Eigen uses the general matrix*matrix product which is not very well suited for matrix*vectors...

Wow, this is huge. How could I let that pass!!

Great job on 2) too.

martinakos, what gael means by .noalias() is writing eResults.noalias() = otherstuff; (in your case, eResults.noalias() += eVector * eMatrix.transpose();).
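Roughly, the difference is this. Without noalias(), Eigen must assume that eResults might share memory with the operands, so it evaluates the product into a temporary and then adds it (the eval() call in the disassembly above appears to be that step). With noalias() you promise there is no overlap, so the product can be accumulated straight into the destination. Below is a plain-loop sketch of the two strategies; add_product and its use_temporary flag are hypothetical illustrations, not Eigen API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration (not Eigen API): res += v * M^T, with v a row
// vector of length cols and M a rows x cols matrix stored column-major.
//  use_temporary == true : evaluate v * M^T into a temporary, then add it
//                          to res (one extra allocation and one extra pass);
//                          stays correct even if res overlaps v or M.
//  use_temporary == false: accumulate straight into res; only correct when
//                          res shares no memory with v or M, which is the
//                          promise made by writing res.noalias() += ...
void add_product(std::vector<double>& res, const std::vector<double>& v,
                 const std::vector<double>& M, std::size_t rows,
                 std::size_t cols, bool use_temporary) {
  assert(res.size() == rows && v.size() == cols && M.size() == rows * cols);
  std::vector<double> tmp;
  if (use_temporary) tmp.assign(rows, 0.0);
  std::vector<double>& dst = use_temporary ? tmp : res;
  for (std::size_t j = 0; j < cols; ++j)
    for (std::size_t i = 0; i < rows; ++i) dst[i] += v[j] * M[j * rows + i];
  if (use_temporary)
    for (std::size_t i = 0; i < rows; ++i) res[i] += tmp[i];
}
```

Both paths give the same result when res, v and M are distinct buffers. If res did overlap v, only the temporary path would be safe, which is why noalias() is a promise you have to keep yourself.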

Registered Member
Now it's working as expected!!! :)

Using doubles, it is twice as fast as the non-SSE hand-written code!

I have tried with and without Aligned and the speed is more or less the same, so I imagine that for matrices/vectors of this size I can get good performance without aligning the memory.

Thanks very much ggael and bjacob for your help.

Martinakos :)

