Registered Member
Currently I do

x.array() *= ( A.transpose() * currEstimate ).array() / columnSums;

which is single-threaded. I wanted to try parallelizing it by manually issuing different blocks to different threads. Will I incur a huge performance hit with A.transpose().block(i,j,p,q), or is there a better way? I'll post benchmarks from the current method when I finish. Thanks.
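For context, a per-stripe version of the update I have in mind would look roughly like this (just a sketch reusing the names from above; begin/len are the stripe bounds, and as far as I can tell a block of a transposed expression is a lazy view in Eigen, so neither form below copies any data):

[code]
#include <Eigen/Dense>

// One stripe of the update, covering columns [begin, begin + len) of A.
// A.transpose().block(begin, 0, len, A.rows()) would be an equivalent lazy
// view; middleCols() is used here only because it reads a little more directly.
void updateStripe(const Eigen::MatrixXf& A,
                  const Eigen::VectorXf& currEstimate,
                  const Eigen::VectorXf& columnSums,
                  Eigen::VectorXf& x,
                  Eigen::Index begin, Eigen::Index len)
{
    x.segment(begin, len).array() *=
        ( A.middleCols(begin, len).transpose() * currEstimate ).array()
        / columnSums.segment(begin, len).array();
}
[/code]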
Registered Member
First bench finished. Suggestions welcome. I've avoided OpenMP and am using the QtConcurrent framework instead; same concept though.
Old code and new code
Running on ~5 GB of random floats >= 0, averaging over 20 runs.
Core i7-920, 12 GB RAM, Windows 7, VS2010 (latest SP), -Ox, 64-bit.

Average time/iter:
Old single : 469.9 ms
New single : 479.5 ms (expected)
New 2-thread: 370.3 ms
New 3-thread: 351.0 ms
New 4-thread: 343.4 ms

[img=500x400]http://www.chartgo.com/trans.jsp?filename=chartgo&img=chart1&id=7C8DB4EE97F7154E39ECF011588443EA_9878[/img]

Any suggestions? It's a little better, but not by much.
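The dispatch is essentially the following (a sketch rather than the exact code linked above, reusing the updateStripe helper sketched earlier; it assumes a Qt new enough that QtConcurrent::run accepts a lambda, otherwise a tiny functor does the same job):

[code]
#include <QtConcurrent>
#include <QFutureSynchronizer>
#include <Eigen/Dense>

// Fan the columns of A out over numThreads contiguous stripes. Each task
// writes only its own x.segment, so no locking is needed beyond the final wait.
void parallelUpdate(const Eigen::MatrixXf& A,
                    const Eigen::VectorXf& currEstimate,
                    const Eigen::VectorXf& columnSums,
                    Eigen::VectorXf& x,
                    int numThreads)
{
    const Eigen::Index cols = A.cols();
    QFutureSynchronizer<void> sync;
    for (int t = 0; t < numThreads; ++t) {
        const Eigen::Index begin = t * cols / numThreads;
        const Eigen::Index len   = (t + 1) * cols / numThreads - begin;
        sync.addFuture(QtConcurrent::run([&, begin, len] {
            updateStripe(A, currEstimate, columnSums, x, begin, len);
        }));
    }
    sync.waitForFinished();  // join all stripes before the next iteration
}
[/code]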
Moderator
Yes, the speedups are not very impressive. What is the typical size of A? I guess the bottleneck is the matrix-vector product: computing only x = A.transpose() * currEstimate should deliver similar perf, and for some unknown reasons I've already observed that parallelizing a matrix-vector product this way does not scale very well (I used OpenMP). Again, I've no idea why. Perhaps duplicating currEstimate to each thread could help?
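Something quick to check that, as a sketch (names follow the earlier posts):

[code]
#include <Eigen/Dense>
#include <chrono>
#include <iostream>

// Time only the matrix-vector product, without the element-wise scaling,
// to see whether it already accounts for most of the iteration time.
void benchProductOnly(const Eigen::MatrixXf& A, const Eigen::VectorXf& currEstimate)
{
    Eigen::VectorXf x(A.cols());
    const auto t0 = std::chrono::steady_clock::now();
    x.noalias() = A.transpose() * currEstimate;
    const auto t1 = std::chrono::steady_clock::now();
    std::cout << "A^T * currEstimate: "
              << std::chrono::duration<double, std::milli>(t1 - t0).count()
              << " ms" << std::endl;
}
[/code]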
Registered Member
The matrices are fairly rectangular: typically 2e3 to 1e4 rows by 3e5 to 3e6 columns. I feel like parallelizing a matrix-vector product should be easy, but obviously it has some subtleties. My best guess is either limited memory bandwidth over a single area (the small vector), or locality algorithms that don't want to replicate items from a single memory location across the caches on each chip. That could be total bollocks, of course.
Moderator
Yes, that's why I suggested copying the rhs vector to each thread... not sure it will help though, it's just something easy to try.
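Concretely, something like this inside each worker is what I had in mind (a sketch reusing the names from the earlier posts; localEstimate is just an illustrative name):

[code]
#include <Eigen/Dense>

// Same stripe update as before, but each worker first takes its own deep copy
// of the rhs vector so every read hits thread-local memory instead of the
// single shared buffer.
void updateStripeLocalRhs(const Eigen::MatrixXf& A,
                          const Eigen::VectorXf& currEstimate,
                          const Eigen::VectorXf& columnSums,
                          Eigen::VectorXf& x,
                          Eigen::Index begin, Eigen::Index len)
{
    const Eigen::VectorXf localEstimate = currEstimate;  // per-thread copy
    x.segment(begin, len).array() *=
        ( A.middleCols(begin, len).transpose() * localEstimate ).array()
        / columnSums.segment(begin, len).array();
}
[/code]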
Registered Member
Lots more benches
Essentially, copying any of the other structures to make them thread-local does nothing. The only one where I could get a repeatable and measurable difference from making it local was my columnSums structure, which isn't even intuitive; it was about a 1-2% performance gain. Making currEstimate local did absolutely nothing (though I was expecting it to).

Furthermore, for a generic A*x matrix-vector multiplication, a ColMajor storage order performs significantly better than RowMajor when striping, about 10% better in fact. This is repeatable. So for now I will go with striping, 3 threads, and a ColMajor storage order, which provides about a 20% boost in performance, which I'll take. (I was already ColMajor.) I'm going to try a bench on VS11 and see what happens.
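For reference, the two layouts being compared are just a storage-order flag on the matrix type (Eigen defaults to ColMajor when the flag is omitted):

[code]
#include <Eigen/Dense>

// ColMajor (Eigen's default) stores each column contiguously, so a stripe of
// columns handed to one thread is a single contiguous block of memory.
// RowMajor stores rows contiguously, so the same column stripe is scattered
// through memory in row-length strides, which fits the striping much worse.
typedef Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor> MatrixXfCol;
typedef Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> MatrixXfRow;
[/code]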
Registered Member
Just tried VS11: essentially the same when compiling with SSE2. If I switch on AVX, which is only available in VS11 (not in VS2010), it gets about 5x _worse_. No clue why turning on AVX would tank performance, although VS11 is still a developer preview, so they may have a lot of work left on compiler optimizations.