
Eigen MAGMA backend implementation project

bravegag (Registered Member, Posts: 52, Karma: 0)
Hello,

I have "created a fork" of Eigen 3.2 and incorporated some (small) progress I have preparing a MAGMA backend to best exploit GPU & CPU. This is an alternative to using MKL which indirectly uses MKL because MAGMA does use MKL in the back. I have been testing it using MAGMA 1.4.0-beta2 and so far all my project tests pass without having to change our Eigen-based code base which is great! Anyone who wants to contribute please contact me to bravegag@hotmail.com or via the GitHub account below.

The code base is available here:
https://github.com/bravegag/eigen-magma
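
Usage is meant to mirror Eigen's MKL backend: define a switch before including Eigen and leave the rest of your code untouched. Roughly like this (a sketch only; the EIGEN_USE_MAGMA_ALL name here is illustrative, check the repository README for the exact define):

Code: Select all
// Minimal usage sketch. Assumption: the fork mirrors the MKL backend's
// EIGEN_USE_MKL_ALL convention with an EIGEN_USE_MAGMA_ALL switch;
// see the README for the authoritative name.
#define EIGEN_USE_MAGMA_ALL
#include <Eigen/Dense>

int main() {
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(1024, 1024);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(1024, 1024);
    // Plain Eigen code; with the define active the product is
    // dispatched to the MAGMA/CUBLAS backend instead of Eigen's kernels.
    Eigen::MatrixXd C = A * B;
    return 0;
}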

The first working port is GeneralMatrixMatrix_MAGMA.h; in reality it goes through the MAGMA API but ends up invoking CUBLAS, which is slightly faster:
https://github.com/bravegag/eigen-magma ... ix_MAGMA.h

Another partial implementation (currently a work in progress) is ColPivHouseholderQR_MAGMA.h; it is still missing the macro that enables the float and complex types:
https://github.com/bravegag/eigen-magma ... QR_MAGMA.h

I have been adding implementations prioritizing the functions we use as part of our project.

The remaining *_MAGMA.h implementations are simply mock copies of their MKL counterparts with some basic preprocessing changes, i.e. an MKL -> MAGMA rename.

Best regards,
Giovanni
bravegag (Registered Member, Posts: 52, Karma: 0)
I have created a simple benchmark project to check the implementation:
https://github.com/bravegag/eigen-magma-benchmark

I have added documentation including the GFlop/s results for DGEMM and DGEQP3, the functions implemented so far.
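
For reference, the GFlop/s figures are derived the obvious way: time the call and divide the theoretical flop count by the elapsed time. A simplified sketch (not the exact benchmark code, which adds warm-ups and accumulators):

Code: Select all
// Simplified sketch of how a GFlop/s figure is derived for DGEMM.
#include <Eigen/Dense>
#include <sys/time.h>
#include <cstdio>

static double wall_time() {
    timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main() {
    const long N = 2048;
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(N, N);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(N, N);
    Eigen::MatrixXd C(N, N);

    double t0 = wall_time();
    C.noalias() = A * B;                 // dgemm through the active backend
    double elapsed = wall_time() - t0;

    double flops = 2.0 * N * N * N;      // lawn41.pdf, page 120
    printf("DGEMM: %.1f GFlop/s\n", flops / elapsed / 1e9);
    return 0;
}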
ggael (Moderator, Posts: 3447, Karma: 19)
Great work. I've updated the respective bugzilla entry: http://eigen.tuxfamily.org/bz/show_bug.cgi?id=461.
bravegag (Registered Member, Posts: 52, Karma: 0)
Hi ggael,

Thank you. I am currently on holiday, but next week I will continue expanding the project to other products and factorizations.

If you give me your GitHub account I can give you commit access to the project.

Best regards,
Giovanni
bravegag (Registered Member, Posts: 52, Karma: 0)
Quick update: I have additionally implemented:

- dgemv (matrix-vector multiplication)
- dtrsm (triangular solve)
- dpotrf (Cholesky decomposition)

The results are very disappointing: unless I have bugs (e.g. copying more memory Host <-> Device than needed), MAGMA, once memory transfers are taken into account, underperforms in these three cases; see:
https://github.com/bravegag/eigen-magma-benchmark
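
A back-of-the-envelope transfer model already predicts the dgemv result: the bytes moved grow as fast as the flops, so PCIe bandwidth dominates no matter how fast the device kernel is. A rough sketch with illustrative (assumed, not measured) numbers:

Code: Select all
// Rough Host <-> Device cost model for offloading a single dgemv.
// The 6 GB/s PCIe and 40 GFlop/s device figures are assumptions
// for illustration only.
#include <cstdio>

int main() {
    const double N = 8192;
    const double pcie_bw  = 6e9;             // bytes/s over PCIe (assumed)
    const double gpu_gemv = 40e9;            // device dgemv flops/s (assumed)

    double bytes = 8.0 * (N * N + 2.0 * N);  // matrix in, vectors in and out
    double flops = 2.0 * N * N - N;

    printf("copy: %.1f ms, compute: %.1f ms\n",
           1e3 * bytes / pcie_bw, 1e3 * flops / gpu_gemv);
    // Copy time comes out roughly 25x the compute time, so the transfer
    // alone erases any GPU advantage for an isolated dgemv.
    return 0;
}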

The Cholesky decomposition result is the most surprising, because Eigen beats both MKL and MAGMA (see the GFlop/s):
https://raw.github.com/bravegag/eigen-m ... gflops.png

If anyone is willing to donate a code review, I will be more than happy ;)

Best regards,
Giovanni
bravegag (Registered Member, Posts: 52, Karma: 0)
Quick update:

I have made some bugfixes and included ?gesvd as well. The benchmark results have been updated to also reflect when the backend is using MAGMA and when it is using CUBLAS. The benchmark results are now embedded here: https://github.com/bravegag/eigen-magma-benchmark

So far the MAGMA backend pays off for dgemm (matrix-matrix product), dgeqp3 (Householder QR decomposition with column pivoting), and dgesvd (SVD decomposition). Of course, results may differ with a different setup, e.g. a Tesla K20 card would improve the MAGMA results in all benchmarks, but currently I use the cheaper nVidia GTX Titan (where I can get three cards for the price of one Tesla K20).

Best regards,
Giovanni
ggael (Moderator, Posts: 3447, Karma: 19)
Hi, here is a fix for dpotrf (the input matrix was not SPD; A.adjoint()*A is symmetric positive definite for full-rank A, so the LLT now operates on a valid input):

Code: Select all
diff --git a/src/main/cpp/benchmark_main.cc b/src/main/cpp/benchmark_main.cc
index 4cdb5b5..1ab053e 100644
--- a/src/main/cpp/benchmark_main.cc
+++ b/src/main/cpp/benchmark_main.cc
@@ -104,34 +104,42 @@ static void run_benchmark(long N, int warm_ups, int num_runs, workload_type work
    }
 }
 
+EIGEN_DONT_INLINE
 static double dgemm(long N) {
    C = A*B;
    // flops see http://www.netlib.org/lapack/lawnspdf/lawn41.pdf page 120
    return 2*N*N*N;
 }
 
+EIGEN_DONT_INLINE
 static double dgeqp3(long N) {
    Eigen::ColPivHouseholderQR<MatrixXd> qr = A.colPivHouseholderQr();
    // flops see http://www.netlib.org/lapack/lawnspdf/lawn41.pdf page 121
   return N*N*N - (2.0/3.0)*N*N*N + N*N + N*N + (14.0/3.0)*N;
 }
 
+EIGEN_DONT_INLINE
 static double dgemv(long N) {
    C = A*b;
    return 2*N*N - N;
 }
 
+EIGEN_DONT_INLINE
 static double dtrsm(long N) {
    C = Aqr.solve(B);
    return N*N*N;
 }
 
+MatrixXd L;
+
+EIGEN_DONT_INLINE
 static double dpotrf(long N) {
    Eigen::LLT<MatrixXd> lltOfA(A);
-   MatrixXd L = lltOfA.matrixL();
+   L = lltOfA.matrixL();
    return N*N*N/3.0 + N*N/2.0 + N/6.0;
 }
 
+EIGEN_DONT_INLINE
 static double dgesvd(long N) {
    Eigen::JacobiSVD<MatrixXd> svd(A, Eigen::ComputeThinU | Eigen::ComputeThinV);
    return 22*N*N*N;
@@ -211,6 +219,10 @@ int main(int argc, char** argv) {
                // input data specific only to this function
                                    Aqr = A.colPivHouseholderQr();
             }
+
+            if (function == "dpotrf") {
+                A = A.adjoint()*A;
+            }
 
             real_time_acc = bench_accumulator();
             gflops_acc = bench_accumulator();
bravegag (Registered Member, Posts: 52, Karma: 0)
Thank you! I have applied the patch and am benchmarking now; when the matrix is SPD, they are all slower ... I will upload the results as soon as they are ready. By the way, if you have suggestions on how to improve the performance, let me know. A few thoughts on how to improve:

1) When MAGMA is enabled, use pinned host memory (see the sketch after this list).
2) Switch some methods to their mgpu, i.e. multi-GPU, variants; e.g. dpotrf has an mgpu version. Maybe this could be configured via another define.
3) Cache the data copied to the device, but that is kind of hard to do at this level.
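
For (1), this is roughly what I have in mind; a sketch using plain CUDA page-locked allocation (cudaMallocHost), though MAGMA's own pinned allocators could equally be used:

Code: Select all
// Sketch of idea (1): keep the benchmark matrices in pinned (page-locked)
// host memory so Host <-> Device copies can run at full DMA speed.
#include <cuda_runtime.h>
#include <Eigen/Dense>

int main() {
    const long N = 4096;
    double* data = 0;
    cudaMallocHost((void**)&data, sizeof(double) * N * N);  // pinned

    // Wrap the pinned buffer in an Eigen view; no copy is made, so the
    // MAGMA backend would transfer straight out of page-locked memory.
    Eigen::Map<Eigen::MatrixXd> A(data, N, N);
    A.setRandom();

    // ... run the MAGMA-backed products/decompositions on A here ...

    cudaFreeHost(data);
    return 0;
}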

Best regards,
Giovanni

UPDATE: results uploaded and MAGMA now shines!
https://github.com/bravegag/eigen-magma-benchmark
bravegag (Registered Member, Posts: 52, Karma: 0)
Major performance improvements in many of the factorizations, e.g.:

* DGEMM now tops out at ~800 GFlop/s, up from ~200 GFlop/s
* DPOTRF now tops out at ~250 GFlop/s, up from ~125 GFlop/s

The improvements are due to two aspects:

* Enabling CUDA double precision in the nVidia driver. The GTX Titan has this peculiarity: double-precision performance defaults to 1/24th of the single-precision performance, and this ratio can be raised to 1/3 of the SP performance through the nvidia driver (see the command sketch after this list).
* Upgrading to MAGMA official 1.4.0 release.
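
For anyone reproducing this, the switch can be flipped from the command line; the attribute name below is an assumption that may vary by driver version, so check nvidia-settings -q all first:

Code: Select all
# Enable full double-precision mode on the GTX Titan (the attribute
# name is an assumption; it may differ between driver versions):
nvidia-settings -a [gpu:0]/GPUDoublePrecisionBoostImmediate=1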

PS: still benchmarking
bravegag (Registered Member, Posts: 52, Karma: 0)
Benchmarks updated after setting nvidia.NVreg_EnablePCIeGen3=1, which enables PCIe v3.0 and raises the maximum link speed from 5 GT/s to 8 GT/s, i.e. it speeds up Host <-> Device memory transfers:
https://devtalk.nvidia.com/default/topi ... -on-titan/
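
In case it helps others, the parameter goes on the kernel command line; the snippet below assumes a GRUB-based distro:

Code: Select all
# /etc/default/grub (then run update-grub and reboot):
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvidia.NVreg_EnablePCIeGen3=1"

# Verify the negotiated link speed afterwards:
sudo lspci -vv | grep -i 'LnkSta:'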

This gave the Titan cards a performance edge worth updating the benchmarks for.
Royi (Registered Member, Posts: 34, Karma: 0)
Any progress on that?

By the way, why isn't Eigen on GitHub?
NeoPhix (Registered Member, Posts: 1, Karma: 0)
Joining in on the question: is there any progress here?

I am now working on a project which needs MAGMA (or any GPU BLAS and LAPACK implementation). MKL does not deliver the necessary performance (plus I need to build the project for ARM with an nVidia GPU, where I cannot use MKL).

The Eigen documentation tells me (https://eigen.tuxfamily.org/dox/TopicUs ... apack.html):

Since Eigen version 3.3 and later, any F77 compatible BLAS or LAPACK libraries can be used as backends for dense matrix products and dense matrix decompositions. For instance, one can use Intel® MKL, Apple's Accelerate framework on OSX, OpenBLAS, Netlib LAPACK, etc.


It works for me when I compile the project with OpenBLAS or MKL, but when I try to link the project against the MAGMA libraries (libmagma.a for example) with the flags EIGEN_USE_BLAS and EIGEN_USE_LAPACKE, I get many linker errors about undefined symbols in the LAPACK and BLAS calls.
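
For comparison, a build line of this shape links fine for me with OpenBLAS (paths are placeholders):

Code: Select all
# Build line that works with OpenBLAS (paths are placeholders):
g++ -O2 -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE \
    -I/path/to/eigen main.cpp -lopenblas -llapacke -o main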

In the Eigen 3 sources I see something like this:

Code: Select all
info = LAPACKE_##LAPACKE_PREFIX##getrf( matrix_order, m, n, (LAPACKE_TYPE*)a, lda, ipiv );


and in the library's symbols I see names like "magma_sgetrf_" instead, i.e. MAGMA exports magma_-prefixed entry points rather than the LAPACKE_ names Eigen calls.
Does that mean I have an invalid ##LAPACKE_PREFIX##?

Can you give me any tips for linking with MAGMA? Is this option supported?

