
transpose block

johnm1019

transpose block

Tue Apr 03, 2012 9:20 am
Currently I do

x.array() *= ( A.transpose() * currEstimate ).array() / columnSums;

which is single-threaded. I wanted to try parallelizing it by manually issuing different blocks to different threads.
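Here is a minimal self-contained version of that update for anyone who wants to reproduce it (the declarations and sizes are illustrative assumptions, not my real setup; only the update line itself is the real code):

Code: Select all
#include <Eigen/Dense>

int main()
{
   // all sizes and declarations below are illustrative assumptions;
   // only the final update expression is the real code from this post
   const int rows = 1000, cols = 4000;
   Eigen::MatrixXf A = Eigen::MatrixXf::Random(rows, cols).cwiseAbs();  // random floats >= 0
   Eigen::VectorXf currEstimate = Eigen::VectorXf::Random(rows).cwiseAbs();
   Eigen::ArrayXf  columnSums = A.colwise().sum().transpose().array();  // one entry per column of A
   Eigen::VectorXf x = Eigen::VectorXf::Ones(cols);

   // the single-threaded update:
   x.array() *= ( A.transpose() * currEstimate ).array() / columnSums;
   return 0;
}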

Will I incur a huge performance hit with A.transpose().block(i,j,p,q), or is there a better way?

Will post benchmarks from the current method when I finish.

Thanks.
johnm1019

Re: transpose block

Tue Apr 03, 2012 10:02 am
First bench finished. Suggestions welcome. I've avoided OpenMP and am using the QtConcurrent framework instead; same concept, though.

Old code:
Code: Select all
x.array() *= ( A.transpose() * currEstimate ).array() / columnSums;

New code:
Code: Select all
// parallel block method
qRegisterMetaType<Eigen::VectorXf>("Eigen::VectorXf");
qRegisterMetaType<Eigen::ArrayXf>("Eigen::ArrayXf");
qRegisterMetaType<Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor> >("Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>");

// bundle pointers to the shared structures for the worker function
solverPointers theData;
theData.x = &x;
theData.A = &A;
theData.columnSums = &columnSums;
theData.currEstimate = &currEstimate;

// launch numBlocks-1 equal-width column stripes...
QVector<QFuture<void> > blockFutures;
int numBlocks = 4;
for(int b = 0; b < (numBlocks-1); ++b)
{
   blockFutures.push_back( QtConcurrent::run(calculatePatch, theData, b*(numCols/numBlocks), (numCols/numBlocks)) );
}
// ...and give the last stripe whatever columns remain
int numLeft = numCols - ((numBlocks-1)*(numCols/numBlocks));
blockFutures.push_back( QtConcurrent::run(calculatePatch, theData, (numBlocks-1)*(numCols/numBlocks), numLeft) );

// wait for all stripes to finish
for(int bfs = 0; bfs < blockFutures.size(); ++bfs)
{
   blockFutures[bfs].waitForFinished();
}

Code: Select all
void calculatePatch(solverPointers theData, int colStart, int numCols)
{
   // per-stripe version of:
   //    x.array() *= ( A.transpose() * currEstimate ).array() / columnSums;
   // rows [colStart, colStart+numCols) of A.transpose() are exactly
   // columns [colStart, colStart+numCols) of A
   theData.x->array().segment(colStart, numCols) *=
      ( theData.A->transpose().block(colStart, 0, numCols, theData.A->rows())
        * *theData.currEstimate ).array()
      / theData.columnSums->segment(colStart, numCols);
}


Running on ~5 GB of random floats >= 0, averaging over 20 runs.
Core i7-920, 12 GB RAM, Windows 7, VS2010 (latest SP), /Ox, 64-bit.

average time/iter:
old single  : 469.9 ms
new single  : 479.5 ms (expected)
new 2-thread: 370.3 ms
new 3-thread: 351.0 ms
new 4-thread: 343.4 ms
[chart: average time per iteration vs. number of threads]
Any suggestions? It's a little better...
ggael (Moderator)
Re: transpose block

Tue Apr 03, 2012 9:23 pm
Yes, the speedups are not very impressive. What is the typical size of A? I guess the bottleneck is the matrix-vector product: computing only x = A.transpose() * currEstimate should deliver similar perf, and for some unknown reason I've already observed that parallelizing a matrix-vector product this way does not scale very well (I used OpenMP). Again, I've no idea why. Perhaps duplicating currEstimate to each thread could help?
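Something like this is what I mean (a rough OpenMP sketch; the names and the striping logic are just illustrative, not code from this thread):

Code: Select all
#include <Eigen/Dense>
#include <omp.h>

void stripedProduct(const Eigen::MatrixXf& A,        // ColMajor by default
                    const Eigen::VectorXf& currEstimate,
                    const Eigen::ArrayXf&  columnSums,
                    Eigen::VectorXf&       x)
{
   const int numCols = static_cast<int>(A.cols());
   #pragma omp parallel
   {
      // per-thread private copy of the rhs vector, the thing being tested here
      const Eigen::VectorXf localEstimate = currEstimate;

      const int nt    = omp_get_num_threads();
      const int tid   = omp_get_thread_num();
      const int chunk = numCols / nt;
      const int start = tid * chunk;
      const int len   = (tid == nt - 1) ? numCols - start : chunk;

      // each thread updates a disjoint stripe of x, so no locking is needed
      x.segment(start, len).array() *=
         ( A.middleCols(start, len).transpose() * localEstimate ).array()
         / columnSums.segment(start, len);
   }
}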
johnm1019

Re: transpose block

Tue Apr 03, 2012 9:49 pm
ggael wrote: What is the typical size of A? [...] Perhaps duplicating currEstimate to each thread could help?


Matrices are a bit rectangular: typically 2e3 to 1e4 rows by 3e5 to 3e6 columns.
I feel like parallelizing a matrix-vector product should be easy, but obviously it's got some subtleties. My best guess is either limited memory bandwidth over a single area (the small vector), or locality algorithms that don't want to replicate items from a single memory location across the cache on each chip. That could be total bollocks, of course.
ggael (Moderator)

Re: transpose block

Wed Apr 04, 2012 6:46 am
Yes, that's why I suggested copying the rhs vector to each thread... not sure though, just something easy to try.
johnm1019

Re: transpose block

Wed Apr 04, 2012 10:06 am
Lots more benches.

Essentially, copying any of the other structures to make them thread-local does nothing. The only one where I could get a repeatable and measurable difference was my columnSums structure, which isn't even intuitive; it was about a 1-2% performance gain.
Having currEstimate local did absolutely nothing (but I was expecting it to).

Furthermore, for a generic A*x matrix-vector multiplication, a ColMajor storage order performs significantly better when striping (vs RowMajor), about 10% better in fact. This is repeatable.

So for now I will go with striping, 3 threads, and a ColMajor storage order, which provides a 20% boost in performance, which I'll take. (I was already ColMajor.)
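To illustrate why the ColMajor striping makes sense (a sketch only, not my benchmark code): a stripe of rows of A.transpose() is exactly a stripe of columns of A, and with ColMajor storage those columns are one contiguous slab of memory, so each thread streams through its own region of A:

Code: Select all
#include <Eigen/Dense>

Eigen::VectorXf stripe(const Eigen::MatrixXf& A,   // ColMajor by default
                       const Eigen::VectorXf& v,
                       int colStart, int numCols)
{
   // the expression used earlier in the thread...
   Eigen::VectorXf viaTranspose =
      A.transpose().block(colStart, 0, numCols, A.rows()) * v;

   // ...computes the same thing as this form, which makes the contiguous
   // ColMajor access pattern explicit: the stripe is numCols consecutive
   // columns of A
   Eigen::VectorXf viaColumns =
      A.middleCols(colStart, numCols).transpose() * v;

   return viaColumns;   // identical to viaTranspose
}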

I'm going to try benching on VS11 and see what happens.
johnm1019

Re: transpose block

Wed Apr 04, 2012 11:36 am
Just tried VS11; essentially the same when compiling with SSE2. If I switch on AVX, which is only available in VS11 (not in VS2010), it gets about 5x _worse_. No clue why turning on AVX would tank performance. VS11 is still a developer preview, though, so they may have a lot of work left on compiler optimizations.

