This forum has been archived. All content is frozen. Please use KDE Discuss instead.

Eigen - performance issues (compared to my matrix class)

Tags: None
(comma "," separated)
nocturn
Registered Member
Posts
4
Karma
0
Hi
I'm currently writting an image processing application. In my application I need to perform a lot of matrix operations (like subtraction,multiplication etc). Originaly I started with writting my own matrix class in which I implemented basic operations. However after profiling the code it turned out that the most time consuming operations are matrix multiplicaton and matrix addition operations. So I decied to use some specialistic library for linear algebra. Thanks to stackoverflow's users I decied to use Eigen.
Unforunatelly after using Eigen matrix it turned out that my application works even slower then before.
Function which take majority of time looks like that
Code: Select all
TemplateClusterBase* TemplateClusterBase::TransformTemplateOne( vector<Eigen::Vector2d*>& pointVector, Eigen::Matrix2d& rotation ,Eigen::Matrix2d& scale,Eigen::Vector2d& translation )
{
   
   Eigen::Matrix2d temp = (rotation*scale);
   for (int i=0;i<pointVector.size();i++ )
   {
      //Eigen::MatrixXd outcome =
      Eigen::Vector2d outcome = temp * (*pointVector[i])  + translation;
   
      MatrixHelper::SetX(*prototypePointVector[i],MatrixHelper::GetX(outcome));
      MatrixHelper::SetY(*prototypePointVector[i],MatrixHelper::GetY(outcome));
      
   }
   
   return this;
}

Eigen::Vector2d AlgorithmPointBased::UpdateTranslationMatrix( int clusterIndex )
{
   double membershipSum = 0,outcome = 0;
   double currentPower = 0;
   Eigen::Vector2d outcomePoint = Eigen::Vector2d();
   outcomePoint << 0,0;
   Eigen::Vector2d templatePoint;
   Eigen::Matrix2d temp = prototypeVector[clusterIndex]->rotationMatrix*prototypeVector[clusterIndex]->scalingMatrix;
   for (int i=0;i< imageDataVector.size();i++)
   {
      currentPower =0;
      membershipSum += currentPower = pow(membershipMatrix[clusterIndex][i],m);
      outcomePoint.noalias() +=  (*imageDataVector[i] - (temp* ( *templateCluster->templatePointVector[prototypeVector[clusterIndex]->assosiatedPointIndexVector[i]]) ))*currentPower ;
   }

   outcomePoint.noalias() = outcomePoint/=membershipSum;
   return outcomePoint; //.ConvertToMatrix();
}
double AlgorithmPointBased::UpdateAngleCoefficient( int clusterIndex )
{
   double membershipSum = 0;
   double currentPower = 0;

   vector<Eigen::Vector2d*> prototypePointVecotr = prototypeVector[clusterIndex]->prototypePointVector;
   vector<Eigen::Vector2d*> templatePointVector = templateCluster->templatePointVector;

   Eigen::MatrixXd numerator =  Eigen::MatrixXd(1,1);
   numerator << 0;
   Eigen::MatrixXd denominator =  Eigen::MatrixXd(1,1);
   denominator << 0;
   Eigen::Vector2d templatePoint;
   for (int i=0;i< imageDataVector.size();i++)
   {
      membershipSum += currentPower = pow(membershipMatrix[clusterIndex][i],m);

      templatePoint = *templatePointVector[prototypeVector[clusterIndex]->assosiatedPointIndexVector[i]]; //*templatePointVector[matchingPointVector[i]->associatedTemplateIndex];
      numerator +=  (*imageDataVector[i] - prototypeVector[clusterIndex]->translationMatrix).transpose()* this->zeroOneMatrix * templatePoint * currentPower;
      denominator +=(*imageDataVector[i] - prototypeVector[clusterIndex]->translationMatrix).transpose()* templatePoint * currentPower;
   }
   
   double tangent =   (((numerator) /= (denominator)(0,0))(0,0) );
   return    atan(tangent);
}

I use Visual Studio 2010 and I enabled "Streaming SIMD Extensions 2 (/arch:SSE2) (/arch:SSE2)". Frankly speaking I don't konw how to enhance performance of these functions. After changing my MatrixClass into Eigen::Matrix class my application need 87 seconds to acomplish task - before change, it was 55 s.
Could somebody explain me why if I use Eigen::Matrix instead of myMatrixClass I get worse pefrormance ?? I thought it would be the other way around.
Maybe I use Eigen class in a wrong way ??

Last edited by nocturn on Thu Jun 02, 2011 11:47 pm, edited 1 time in total.
graphicsMan
Registered Member
Posts
16
Karma
0
OS
Why are you using dynamic matrices to store the numerator and denominator? These are scalar, and IMHO should just be of type double.
nocturn
Registered Member
Posts
4
Karma
0
graphicsMan wrote:Why are you using dynamic matrices to store the numerator and denominator? These are scalar, and IMHO should just be of type double.


Because if I did sth like that
Code: Select all
double numerator = 0;
numerator +=  (*imageDataVector[i] - prototypeVector[clusterIndex]->translationMatrix).transpose()* this->zeroOneMatrix * templatePoint * currentPower;

I can't compile the code -
I get an error
"no operator += matches these operands", so I decided to use matrix instead of double
graphicsMan
Registered Member
Posts
16
Karma
0
OS
It looks like the result of the RHS is a Vector2d. Is that correct? It seems that you're ignoring the second component of that vector, so you could just access the [0] element.

Additionally, you have a common sub-expression which might not be optimized by the compiler. You could try extracting that (the expression ending in transpose()), and reuse the result.

Have you profiled the code? Looking at it again, it could be the pow() function. What is the type of m?
nocturn
Registered Member
Posts
4
Karma
0
I got a bit confused :) What do You mean by RHS ??
m is a double so I can't use m << powValue in order to speed up pow function

However notice that the same code structure is used with my type myMatrixClass and it is faster than using Eigen::Matrix
graphicsMan
Registered Member
Posts
16
Karma
0
OS
RHS == right hand side. The lines
Code: Select all
numerator +=  (*imageDataVector[i] - prototypeVector[clusterIndex]->translationMatrix).transpose()* this->zeroOneMatrix * templatePoint * currentPower;
      denominator +=(*imageDataVector[i] - prototypeVector[clusterIndex]->translationMatrix).transpose()* templatePoint * currentPower;


Both have the common sub-expression

Code: Select all
(*imageDataVector[i] - prototypeVector[clusterIndex]->translationMatrix).transpose()


The compiler may be overwhelmed by the large amount of templateness, and might not recognize that this sub-expression only needs to be computed once.

You're also computing the pow() twice per element if both UpdateTranslationMatrix and UpdateAngleCoefficient are called for a given index.

However, none of this matters if these aren't hot-spots in your code.
Have you managed to profile your code? A good profiler should be able to point you to your bottleneck.
User avatar
ggael
Moderator
Posts
3447
Karma
19
OS
hi, here is slightly optimized version. Check this page to see how to write inner products and get a scalar instead of the 1x1 matrix:
http://eigen.tuxfamily.org/dox/QuickRef ... cOperators

I also changed some copy per values to const references and factorized duplicated expressions.

I believe the bottleneck of your code are the random memory accesses. Since you are looping over the same pattern multiple times, perhaps it would be worth it to pack the data once into a Matrix<double,2,Dynamic> and then leverage big matrix operations. Same for the powers that could be pre-computed once into a VectorXd.

If it is still slow, you could check without SSE, and alos check the assembly to see if MSVC is not messing up.


Code: Select all
Eigen::Vector2d AlgorithmPointBased::UpdateTranslationMatrix( int clusterIndex )
{
   double membershipSum = 0,outcome = 0;
   double currentPower = 0;
   Eigen::Vector2d outcomePoint(0,0);
   Eigen::Vector2d templatePoint;
   Eigen::Matrix2d temp = prototypeVector[clusterIndex]->rotationMatrix*prototypeVector[clusterIndex]->scalingMatrix;
   for (int i=0;i< imageDataVector.size();i++)
   {
      currentPower =0;
      membershipSum += currentPower = pow(membershipMatrix[clusterIndex][i],m);
      const Eigen::Vector2d& vec = (*templateCluster->templatePointVector[prototypeVector[clusterIndex]->assosiatedPointIndexVector[i]]);
      outcomePoint += (*imageDataVector[i] - temp * vec)*currentPower ;
   }

   outcomePoint /= membershipSum;
   return outcomePoint;
}
double AlgorithmPointBased::UpdateAngleCoefficient( int clusterIndex )
{
   double membershipSum = 0;
   double currentPower = 0;

   const vector<Eigen::Vector2d*>& prototypePointVecotr = prototypeVector[clusterIndex]->prototypePointVector;
   const vector<Eigen::Vector2d*>& templatePointVector = templateCluster->templatePointVector;

   double numerator = 0
   double denominator = 0;
   for (int i=0;i< imageDataVector.size();i++)
   {
      membershipSum += currentPower = pow(membershipMatrix[clusterIndex][i],m);

      const Eigen::Vector2d& templatePoint = *templatePointVector[prototypeVector[clusterIndex]->assosiatedPointIndexVector[i]];
      const Eigen::Vector2d diff = (*imageDataVector[i] - prototypeVector[clusterIndex]->translationMatrix) * currentPower;
      numerator   += (diff.transpose() * this->zeroOneMatrix * templatePoint).value();
      denominator += diff.dot(templatePoint);
   }
   
   return atan(numerator / denominator);
}
nocturn
Registered Member
Posts
4
Karma
0
Thanks for answer.Now it works a bit faster, however version with my own matrix class works a way faster. Here is a screen from profiler.

Image

Without sse is very slow - 108 s.
If it is still slow, you could check without SSE, and alos check the assembly to see if MSVC is not messing up.

Could You tell me how to check assembly.
Regarding this

I believe the bottleneck of your code are the random memory accesses. Since you are looping over the same pattern multiple times, perhaps it would be worth it to pack the data once into a Matrix<double,2,Dynamic> and then leverage big matrix operations. Same for the powers that could be pre-computed once into a VectorXd.

Could You tell me sth more about packing data into Matrix<double,2,Dynamic> what I would gain ?? Notice that function UpdateTranslationMatrix and UpdateAngleCoefficient are called a lot of times.
User avatar
ggael
Moderator
Posts
3447
Karma
19
OS
hm... when I see your profiler is able to see functions as coeff(), operator(), rows(), and so on I'm wondering whether you are benchmarking in debug or release mode, i.e., without or with optimizations enabled.


Bookmarks



Who is online

Registered users: Bing [Bot], Google [Bot], q.ignora, watchstar