Registered Member
|
I tried to run a very simple (sparse) matrix-vector product (V0) on my laptop, but it is very slow... A naive implementation (V1) is much faster:
>> ./matVecProd.sh
g++ -I /home/fghoussen/Documents/INRIA/eigen-eigen-5a0156e40feb/local/include/eigen3 -mavx -march=native -O3 -o matVecProdV0.exe matVecProdV0.cpp
g++ -march=native -O3 -ftree-vectorize -fopt-info-optall -funroll-loops -ffast-math -fstrict-aliasing -o matVecProdV1.exe matVecProdV1.cpp
matVecProdV1.cpp:43:28: note: loop unrolled 7 times
matVecProdV0 : movpd 0, addpd 2, mulpd 0, time 136911 ms, KO
matVecProdV1 : movpd 0, addpd 0, mulpd 0, time 495 ms, KO

This happens both when I compile with -mavx and when I compile with -msse2. Can somebody help me to understand why?

My laptop has 4 logical processors (2 cores + hyperthreading).

>> cat /proc/cpuinfo
model name : Intel(R) Core(TM) i7-3687U CPU @ 2.10GHz
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts

Franck

Files are available here: https://filesender.renater.fr/?s=downlo ... dca165dd22 |
Moderator
|
but your code is plain wrong and does not compute anything! The problem is that pMatIr[i]==p for all i>0, and thus pMatIr[i+1]-pMatIr[i]==0, so the line "pRes[i] += ...;" is never executed.
Then this += line is also wrong. It should be:

    pRes[i] += pMatVal[pMatJc[startJc+j]]*pVec[pMatJc[startJc+j]];

and not:

    pRes[i] += pMatVal[pMatJc[startJc+j]]*pVec[i];

Finally, the checks are wrong too; you should replace n by p:

    int rc = 0;
    for (size_t i = 0; i < n; i++)
        if (abs(pRes[i] - AVE*p) > 1e-12) {rc = 1; break;} // Prevent compiler optimizations.
    cout << ((rc == 0) ? ", OK" : ", KO") << endl;

And finally, if you want more performance for sparse matrix-vector products, use a row-major matrix, SparseMatrix<double,RowMajor>, for which you can even enable multi-threading by compiling with -fopenmp. Generally, it is not a good idea to bench code written within the main function; better to wrap it in a non-inline function, to be closer to real-world usage.
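For later readers, here is a minimal sketch of the row-major approach suggested above; the sizes, the fill pattern and the helper name multiply are illustrative assumptions rather than the code from Franck's attachments:

    #include <Eigen/Dense>
    #include <Eigen/Sparse>
    #include <vector>

    using SpMat = Eigen::SparseMatrix<double, Eigen::RowMajor>;

    // Non-inline helper so the product is measured closer to real-world usage
    // instead of being folded into main().
    __attribute__((noinline))
    Eigen::VectorXd multiply(const SpMat& A, const Eigen::VectorXd& x) {
      return A * x;  // row-major sparse * dense vector; multi-threaded with -fopenmp
    }

    int main() {
      const int n = 100000, p = 20;  // illustrative sizes
      std::vector<Eigen::Triplet<double>> coefs;
      coefs.reserve(static_cast<size_t>(n) * p);
      for (int i = 0; i < n; ++i)
        for (int j = 0; j < p; ++j)
          coefs.emplace_back(i, (i + j) % n, 1.0);  // p nonzeros per row
      SpMat A(n, n);
      A.setFromTriplets(coefs.begin(), coefs.end());
      Eigen::VectorXd x = Eigen::VectorXd::Constant(n, 1.0);
      Eigen::VectorXd y = multiply(A, x);
      // Each row sums p ones, so every entry of y should equal p.
      return (y.array() - static_cast<double>(p)).abs().maxCoeff() < 1e-12 ? 0 : 1;
    }

Compile, for example, with g++ -O3 -march=native -fopenmp plus the -I path to the Eigen headers. |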
Registered Member
|
Just saw your reply here... Thanks, you are right.
Even with the fix, I still see the same kind of problem:

>> matVecProdV0 : movpd 0, addpd 31, mulpd 0, time 136682 ms, OK
>> matVecProdV1 : movpd 0, addpd 0, mulpd 0, time 64635 ms, OK

Franck |
Moderator
|
here I get the same speed after just fixing the code, then Eigen is about x1.8 faster when using SparseMatrix<double,RowMajor>, and finally about x5 faster when using RowMajor + "-fopenmp".
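For what it's worth, switching an existing default (column-major) sparse matrix to the row-major layout is a one-line copy; a small sketch, with made-up names:

    #include <Eigen/Dense>
    #include <Eigen/Sparse>

    // Copying into a RowMajor matrix changes only the storage order; with
    // -fopenmp the row-major sparse * dense-vector product runs multi-threaded.
    Eigen::VectorXd rowMajorProduct(const Eigen::SparseMatrix<double>& Acol,
                                    const Eigen::VectorXd& x) {
      Eigen::SparseMatrix<double, Eigen::RowMajor> Arow = Acol;
      return Arow * x;
    }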
|
Moderator
|
V0:
V1:
I'm using the 3.3 branch + clang-4.0.
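As a rough, hypothetical sketch of the two variants labelled V0 and V1 above (V0 through Eigen, V1 as a hand-written CSR loop), with array names borrowed from the thread but sizes and fill pattern invented here:

    #include <Eigen/Dense>
    #include <Eigen/Sparse>
    #include <chrono>
    #include <iostream>
    #include <vector>

    // V1-style kernel: hand-written CSR product. pMatIr holds row pointers,
    // pMatJc column indices, pMatVal the values (layout assumed, not verified
    // against the attachments).
    static void csrMatVec(int n, const std::vector<int>& pMatIr,
                          const std::vector<int>& pMatJc,
                          const std::vector<double>& pMatVal,
                          const std::vector<double>& pVec,
                          std::vector<double>& pRes) {
      for (int i = 0; i < n; ++i) {
        double acc = 0.0;
        for (int k = pMatIr[i]; k < pMatIr[i + 1]; ++k)
          acc += pMatVal[k] * pVec[pMatJc[k]];
        pRes[i] = acc;
      }
    }

    int main() {
      const int n = 100000, p = 20;  // illustrative sizes
      std::vector<int> ir(n + 1), jc;
      std::vector<double> val;
      std::vector<Eigen::Triplet<double>> trip;
      for (int i = 0; i < n; ++i) {
        ir[i] = static_cast<int>(jc.size());
        for (int j = 0; j < p; ++j) {  // p nonzeros per row
          int col = (i + j) % n;
          jc.push_back(col);
          val.push_back(1.0);
          trip.emplace_back(i, col, 1.0);
        }
      }
      ir[n] = static_cast<int>(jc.size());

      Eigen::SparseMatrix<double, Eigen::RowMajor> A(n, n);  // V0-style matrix
      A.setFromTriplets(trip.begin(), trip.end());
      Eigen::VectorXd x = Eigen::VectorXd::Constant(n, 1.0);
      std::vector<double> xv(n, 1.0), y1(n, 0.0);

      // A real benchmark would repeat the products and wrap them in non-inline
      // functions, as advised earlier in the thread.
      auto t0 = std::chrono::steady_clock::now();
      Eigen::VectorXd y0 = A * x;         // V0: Eigen product
      auto t1 = std::chrono::steady_clock::now();
      csrMatVec(n, ir, jc, val, xv, y1);  // V1: naive CSR loop
      auto t2 = std::chrono::steady_clock::now();

      std::cout << "V0: " << std::chrono::duration<double>(t1 - t0).count() << " s, "
                << "V1: " << std::chrono::duration<double>(t2 - t1).count() << " s\n";
      return 0;
    }
|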
Registered Member
|
This is great, Gael. Where can we learn tips and tricks like this, please? I tried searching on Google but couldn't find a page that mentions your trick. |
Moderator
|
regarding multithreading, there is a dedicated page: http://eigen.tuxfamily.org/dox/TopicMultiThreading.html
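In short, once the program is built with -fopenmp, the thread count can be set either through the OMP_NUM_THREADS environment variable or from code; a minimal sketch (the value 4 is just an example):

    #include <Eigen/Core>
    #include <iostream>

    int main() {
      Eigen::setNbThreads(4);  // alternatively: export OMP_NUM_THREADS=4
      std::cout << "Eigen will use " << Eigen::nbThreads() << " thread(s)\n";
      return 0;
    }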
|
Registered Member
|
Thanks a lot, Gael! |