Registered Member
Hello,
I am trying to build a very naive metric for linear algebra performance on my system. The code I am using is here. I do not know whether my results are sensible, and I would appreciate comments and/or pointers to possible mistakes I am making.

My 4-year-old system (HP NW9440, Intel T7400 @ 2.16 GHz, 3 GB RAM) has the following configuration: Ubuntu 10.04.3 (kernel 2.6.32-34-generic); GSL 1.13; MKL 10.3; Eigen 2.0.12. MKL was installed manually by me and, to the best of my knowledge, works fine. GSL and Eigen were installed via Synaptic. g++ is version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) and icpc is 12.1.0 (gcc 4.4.3 compatibility mode). GCC (and g++ for that matter) were installed via Synaptic, icpc manually. Admittedly, GSL and Eigen are not their latest versions; I wanted to keep the configuration as "vanilla" as possible.

I initialize two 2000-by-2000 random matrices and compute their product into a third matrix: first using Eigen, then using naive C code, then using BLAS.

First I compile my code with g++, using:

g++ -I /usr/include/eigen2/ my_program.cpp -o my_program.out -lgsl -lgslcblas -O3

Running times on my system, in seconds (medians of 7 runs):

Eigen: 7.78
Naive C: 69.25
GSL cblas: 44.80

Then I compile with icpc, using:

icpc -I /usr/include/eigen2/ my_program.cpp -o my_programI.out -lgsl -mkl=sequential -fast

Eigen: 3.65
Naive C: 14.51
MKL cblas: 2.24

Using the formula 2 * N^3 * (1/execution_time) * (1/10^6), I calculate each algorithm's performance output in megaflops.
The theoretical gigaflop peak performance of each of my cores is 8.64 (= 17.28 / 2), as listed by Intel here, I think at least. Raw numbers in megaflops, followed by the percentage of peak utilization:

Eigen (g++ / gslcblas): 2056.55526 (23.8027%)
Naive C (g++ / gslcblas): 231.04693 (2.67415%)
BLAS by GSL (g++ / gslcblas): 357.14286 (4.13360%)
Eigen (icpc / mkl): 4383.56164 (50.73566%)
Naive C (icpc / mkl): 1102.6878 (12.76259%)
BLAS by MKL (icpc / mkl): 7142.85714 (82.67195%)

Are my results sensible? (I know this is a very subjective question, given that someone has to comment on the performance of a machine they don't have physical access to.)
Is it normal for GSL BLAS to be so... bad? Am I somehow hindering its performance?
Is it normal for Eigen to scale so abruptly between the two compilers?
Is it normal that Eigen2 (yes, I know it's an older version while MKL is a newer one) is so much slower than MKL? (I have seen the benchmark page, but I compiled these specifically for my system.)
Is it normal that my plain C code is so much slower when compiled with g++?
Do I calculate my code's performance sensibly?

If you spot any obvious mistake in the way I am conducting my matrix-matrix multiplication, please tell me so I can re-run my benchmarks and get more sensible results. (I tried to keep an almost pure-C syntax.) Thank you!
Registered Member
For GCC, use the -march=native -mtune=native flags. More info (incl. -Ofast) in my posts here: viewtopic.php?f=74&t=96825&p=203599#p203599
If your target is 32-bit x86, don't forget to also include the appropriate SSE options (for the x86-64 compiler, SSE is enabled by default). According to Wikipedia -- http://en.wikipedia.org/wiki/T7400#.22M ... C_65_nm.29 -- your CPU supports up to and including SSSE3. So also use -mfpmath=sse -mssse3
Registered Member
Thank you for your suggestions, they were very helpful.
I should have used -march=native in the first place; I have used it extensively in the past, but today I totally forgot it... My bad! The -march=native flag made a big difference to Eigen's performance but none to GSL's BLAS. Eigen's g++-compiled running time is now actually faster than MKL's (on icpc): Eigen's median execution time is 2.06 s. All the other flags (disappointingly) brought no additional performance gains, whether combined or used on their own. The obvious exception was -mfpmath=sse -mssse3, which had the same positive effect as -march=native. I am really impressed by Eigen's performance here, being faster than MKL. Bravo!
Registered Member
It seems your code contains a mistake: you compared the performance of Eigen::MatrixXf, which is single-precision float, with a double-precision cblas function. That is not a fair test.
I wrote a test similar to yours but got the opposite result, so I looked into your code and found this. After making it double-precision, Eigen (I use Eigen 3 because it's faster than v2) is a little slower than MKL but very close: Eigen 1.75 s vs. MKL cblas 1.71 s, on an i7 820M. The test was done single-threaded, with the CPU frequency limited to its 1.2 GHz minimum via cpufreq-set to avoid inaccuracy from Intel Turbo Boost and frequency scaling. Though a little slower, it is impressive that a compiler-optimised library can be almost as fast as a hand-tuned library like MKL or GotoBLAS! I didn't test ATLAS, which is also compiler-optimised, but ATLAS should be slower than Eigen, since Eigen's performance is so close to MKL's. I also tested Armadillo; its performance was almost the same as MKL cblas, which is reasonable because my Armadillo build uses MKL BLAS. Another impressive result is that g++ outperformed Intel icpc here!

g++ -DNDEBUG -march=native -O3 : Eigen 1.75 s; MKL (cblas or Armadillo) 1.71 s
icpc -DNDEBUG -xHost -O3 : Eigen 2.13 s; MKL (cblas or Armadillo) 1.74 s

PGO didn't improve performance compared to the above CXXFLAGS.

Sorry to bump an old thread, but I wanted to correct this error so that other people will not be misled if Google leads them here.