Performance issue with outer product.

Board index

Page 1 of 1 (5 posts)

Tags:

egery Registered Member Posts 2 Karma 0	Performance issue with outer product. Thu Jun 28, 2012 2:57 pm Hello, First, thanks to all Eigen team for the great work. I've encountered a performance issue doing a simple outer product of vectors (M = V1 * V2.transpose();) It's a slower than a hand made loop ! We can make it almost twice faster by explicitly iterating on V2 elements (thus having vectorscalar operation for each col of M). Vectorization issue ? I've searched the forum for this specific issue, without success. Tested with Eigen 3.1, gcc 4.4.5, gcc 4.7.1 with options : -mmmx -msse -msse2 -O3 -msse3 -msse4 -mtune=pentium4 -march=i686 -DNDEBUG -lrt -I~/Eigen31 Results : \| Dim1* \| Dim2 \| nbIter \| By hand \| Eigen \| Eigen fixed \| \| 1024 \| 50 \| 3906 \| 77 ms \| 116 ms \| 62 ms \| From this code : Code: Select all /! @brief Benchmark vector * vector.t() outer product. * / #include <stdio.h> #include <time.h> #include "Eigen/Core" const int DIM_1 = 1024; const int DIM_2 = 50; namespace { typedef Eigen::MatrixXf Matrix ; typedef Eigen::VectorXf Vector ; //Helper, get realtime in micro seconds long long GetRealTime_ms() { timespec currentTime; clock_gettime(CLOCK_REALTIME, &currentTime); return currentTime.tv_sec1e6 + currentTime.tv_nsec/1000; } //Loops by hand. void OuterProduct_Ref( const Vector& v1, const Vector& v2, Matrix& o_result) { const int dim1 = v1.rows() ; const int dim2 = v2.rows() ; //Caller must have allocated room for result. assert(o_result.rows() == dim1); assert(o_result.cols() == dim2); const float* input = v1.data() ; const float* output = v2.data() ; float* result = o_result.data() ; for (int j=0; j<dim2; ++j) { const float beta = output[j]; const float * in = input ; for(int i=0; i<dim1; ++i) { // result[i+jdim1] = betainput[i] ; (result++) = beta (in++) ; } } } void OuterProduct_Eigen( const Vector& v1, const Vector& v2, Matrix & o_result) { //Caller must have allocated room for result. assert(o_result.rows() == v1.rows()); assert(o_result.cols() == v2.rows()); o_result.noalias() = v1 v2.transpose() ; } void OuterProduct_Eigen_Fix( const Vector& v1, const Vector& v2, Matrix & o_result) { //Caller must have allocated room for result. assert(o_result.rows() == v1.rows()); assert(o_result.cols() == v2.rows()); const int dim2 = v2.rows() ; for (int j=0 ; j != dim2 ; ++j) { o_result.col(j) = v1 * v2(j) ; } } void benchmark_OuterProduct() { long timeRef, timeEigen, timeEigenFix ; const int nbOperations = DIM_1DIM_2 ; const int nbIter = 2e08/nbOperations ; Vector V1 (DIM_1); Vector V2 (DIM_2); V1.setRandom(); V2.setRandom(); Matrix resultRef, result ; resultRef.resize(DIM_1, DIM_2) ; result.resize(DIM_1, DIM_2) ; OuterProduct_Ref (V1, V2, resultRef); //By hand { long long timeRef_start = GetRealTime_ms(); for(int n = 0; n < nbIter; n++) { OuterProduct_Ref (V1, V2, result); } timeRef = GetRealTime_ms() - timeRef_start ; assert (result.isApprox(resultRef)) ; } //With Eigen { long long timeEigen_start = GetRealTime_ms(); for(int n = 0; n < nbIter; n++) { OuterProduct_Eigen (V1, V2, result); } timeEigen = GetRealTime_ms() - timeEigen_start ; assert (result.isApprox(resultRef)) ; } //With Eigen, fixed implementation. { long long timeEigenFix_start = GetRealTime_ms(); for(int n = 0; n < nbIter; n++) { OuterProduct_Eigen_Fix (V1, V2, result); } timeEigenFix = GetRealTime_ms() - timeEigenFix_start ; assert (result.isApprox(resultRef)) ; } // Result ////////////////////////////////////////////////////////////////////// printf( "\| Dim1* \| Dim2 \| nbIter \| By hand \| Eigen \| Eigen fixed \| \n" "\| %d \| %d \| %d \| %d ms \| %d ms \| %d ms \|\n", DIM_1, DIM_2, nbIter, timeRef/1000, timeEigen/1000, timeEigenFix/1000) ; } } //anonymous namespace int main() { benchmark_OuterProduct() ; return 0 ; } Thanks, Yves
ggael Moderator Posts 3447 Karma 19 OS	Re: Performance issue with outer product. Thu Jun 28, 2012 3:43 pm your "OuterProduct_Eigen_Fix" version is how it is implemented. There might be a recent performance regression here. I'll check.
ggael Moderator Posts 3447 Karma 19 OS	Re: Performance issue with outer product. Thu Jun 28, 2012 4:47 pm alright, the issue is that the code is optimized to perform rank-1 updates: A += v1 * v2.transpose(); If you adjust your OuterProduct_Eigen_Fix to do this operation, then you get the exact same performance. Here the overhead for a simple A = v1 * v2.transpose(); is about 30%, and it is still quite faster than OuterProduct_Ref.
egery Registered Member Posts 2 Karma 0	Re: Performance issue with outer product. Fri Jun 29, 2012 3:26 pm Interesting ! It's true that for update, Eigen and EigenFix give the same performance, which is 1.7x better than hand-made update. For assignation, I still have to use the "fix" to obtain the best performance, which remains very close to hand-made version. I've compared 3 operations (assignation, assignation simulated with SetZero() and +=, and update). BTW, noalias() helps here. Byhand = : 77ms Byhand += : 130ms Eigen = (operator =) : 116ms Eigen = (SetZero then +=) : 115ms Eigen += : 77ms EigenFix = (operator =) : 62ms EigenFix = (SetZero then +=) : 115ms (no difference with plain Eigen expression) EigenFix += : 77ms (no difference with plain Eigen expression) FYI, /proc/cpuinfo gives : Code: Select all processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz stepping : 2 cpu MHz : 1596.000 cache size : 12288 KB physical id : 1 siblings : 12 core id : 0 cpu cores : 6 apicid : 32 initial apicid : 32 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx pdpe1gb rdtscp lm constant_tsc arch_pe rfmon pebs bts xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat tpr_shadow v nmi flexpriority ept vpid bogomips : 4787.79 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual Yves
ggael Moderator Posts 3447 Karma 19 OS	Re: Performance issue with outer product. Fri Jun 29, 2012 10:41 pm yes that's the expect behavior of the current implementation. I've added an entry there: http://eigen.tuxfamily.org/bz/show_bug.cgi?id=483 so that we don't forget to optimize this case.