Registered Member
Currently I do

x.array() *= ( A.transpose() * currEstimate ).array() / columnSums;

which is single-threaded. I wanted to try parallelizing it by manually issuing different blocks to different threads. Will I incur a huge performance hit with A.transpose().block(i,j,p,q), or is there a better way? I'll post benchmarks from the current method when I finish. Thanks.
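For context, a per-stripe version of the update I have in mind would look roughly like this (just a sketch reusing the names from above; begin/len are the stripe bounds, and as far as I can tell a block of a transposed expression is a lazy view in Eigen, so neither form below copies any data):

[code]
#include <Eigen/Dense>

// One stripe of the update, covering columns [begin, begin + len) of A.
// A.transpose().block(begin, 0, len, A.rows()) would be an equivalent lazy
// view; middleCols() is used here only because it reads a little more directly.
void updateStripe(const Eigen::MatrixXf& A,
                  const Eigen::VectorXf& currEstimate,
                  const Eigen::VectorXf& columnSums,
                  Eigen::VectorXf& x,
                  Eigen::Index begin, Eigen::Index len)
{
    x.segment(begin, len).array() *=
        ( A.middleCols(begin, len).transpose() * currEstimate ).array()
        / columnSums.segment(begin, len).array();
}
[/code]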
Registered Member
First bench finished. Suggestions welcome. I've avoided OpenMP and am using the QtConcurrent framework instead; same concept though.
Old code and new code
Running on ~5 GB of random floats >= 0, averaging over 20 runs.
Core i7-920, 12 GB RAM, Windows 7, VS2010 (latest SP), -Ox, 64-bit.

Average time/iter:
Old single : 469.9 ms
New single : 479.5 ms (expected)
New 2-thread: 370.3 ms
New 3-thread: 351.0 ms
New 4-thread: 343.4 ms

[img=500x400]http://www.chartgo.com/trans.jsp?filename=chartgo&img=chart1&id=7C8DB4EE97F7154E39ECF011588443EA_9878[/img]

Any suggestions? It's a little better, but not by much.
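The dispatch is essentially the following (a sketch rather than the exact code linked above, reusing the updateStripe helper sketched earlier; it assumes a Qt new enough that QtConcurrent::run accepts a lambda, otherwise a tiny functor does the same job):

[code]
#include <QtConcurrent>
#include <QFutureSynchronizer>
#include <Eigen/Dense>

// Fan the columns of A out over numThreads contiguous stripes. Each task
// writes only its own x.segment, so no locking is needed beyond the final wait.
void parallelUpdate(const Eigen::MatrixXf& A,
                    const Eigen::VectorXf& currEstimate,
                    const Eigen::VectorXf& columnSums,
                    Eigen::VectorXf& x,
                    int numThreads)
{
    const Eigen::Index cols = A.cols();
    QFutureSynchronizer<void> sync;
    for (int t = 0; t < numThreads; ++t) {
        const Eigen::Index begin = t * cols / numThreads;
        const Eigen::Index len   = (t + 1) * cols / numThreads - begin;
        sync.addFuture(QtConcurrent::run([&, begin, len] {
            updateStripe(A, currEstimate, columnSums, x, begin, len);
        }));
    }
    sync.waitForFinished();  // join all stripes before the next iteration
}
[/code]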
Moderator
Yes, the speedups are not very impressive. What is the typical size of A? I guess the bottleneck is the matrix-vector product: computing only x = A.transpose() * currEstimate should deliver similar perf, and for some unknown reasons I've already observed that parallelizing a matrix-vector product this way does not scale very well (I used OpenMP). Again, I've no idea why. Perhaps duplicating currEstimate to each thread could help?
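Something quick to check that, as a sketch (names follow the earlier posts):

[code]
#include <Eigen/Dense>
#include <chrono>
#include <iostream>

// Time only the matrix-vector product, without the element-wise scaling,
// to see whether it already accounts for most of the iteration time.
void benchProductOnly(const Eigen::MatrixXf& A, const Eigen::VectorXf& currEstimate)
{
    Eigen::VectorXf x(A.cols());
    const auto t0 = std::chrono::steady_clock::now();
    x.noalias() = A.transpose() * currEstimate;
    const auto t1 = std::chrono::steady_clock::now();
    std::cout << "A^T * currEstimate: "
              << std::chrono::duration<double, std::milli>(t1 - t0).count()
              << " ms" << std::endl;
}
[/code]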
Registered Member
The matrices are fairly rectangular: typically 2e3 to 1e4 rows by 3e5 to 3e6 columns. I feel like parallelizing a matrix-vector product should be easy, but obviously it has some subtleties. My best guess is either limited memory bandwidth over a single area (the small vector), or locality algorithms that don't want to replicate items from a single memory location across the caches on each chip. That could be total bollocks, of course.
Moderator
Yes, that's why I suggested copying the rhs vector to each thread... not sure it will help though, it's just something easy to try.
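Concretely, something like this inside each worker is what I had in mind (a sketch reusing the names from the earlier posts; localEstimate is just an illustrative name):

[code]
#include <Eigen/Dense>

// Same stripe update as before, but each worker first takes its own deep copy
// of the rhs vector so every read hits thread-local memory instead of the
// single shared buffer.
void updateStripeLocalRhs(const Eigen::MatrixXf& A,
                          const Eigen::VectorXf& currEstimate,
                          const Eigen::VectorXf& columnSums,
                          Eigen::VectorXf& x,
                          Eigen::Index begin, Eigen::Index len)
{
    const Eigen::VectorXf localEstimate = currEstimate;  // per-thread copy
    x.segment(begin, len).array() *=
        ( A.middleCols(begin, len).transpose() * localEstimate ).array()
        / columnSums.segment(begin, len).array();
}
[/code]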
Registered Member
Lots more benches
Essentially, copying any of the other structures to make them thread-local does nothing. The only one where I could get a repeatable and measurable difference from making it local was my columnSums structure, which isn't even intuitive; it was about a 1-2% performance gain. Making currEstimate local did absolutely nothing (though I was expecting it to).

Furthermore, for a generic A*x matrix-vector multiplication, a ColMajor storage order performs significantly better than RowMajor when striping, about 10% better in fact. This is repeatable. So for now I will go with striping, 3 threads, and a ColMajor storage order, which provides about a 20% boost in performance, which I'll take. (I was already ColMajor.) I'm going to try a bench on VS11 and see what happens.
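For reference, the two layouts being compared are just a storage-order flag on the matrix type (Eigen defaults to ColMajor when the flag is omitted):

[code]
#include <Eigen/Dense>

// ColMajor (Eigen's default) stores each column contiguously, so a stripe of
// columns handed to one thread is a single contiguous block of memory.
// RowMajor stores rows contiguously, so the same column stripe is scattered
// through memory in row-length strides, which fits the striping much worse.
typedef Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor> MatrixXfCol;
typedef Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> MatrixXfRow;
[/code]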
Registered Member
Just tried VS11: essentially the same when compiling with SSE2. If I switch on AVX, which is only available in VS11 (not in VS2010), it gets about 5x _worse_. No clue why turning on AVX would tank performance, although VS11 is still a developer preview, so they may have a lot of work left on compiler optimizations.