Registered Member
Hello,
I have been working on the MAGMA/CUBLAS backend https://github.com/bravegag/eigen-magma. Some factorizations and products really pay off, e.g. DGEMM, DGEQP3, DPOTRF, whereas other operations don't perform well, e.g. DGEMV, DTRSM. At the moment my project uses the MAGMA backend and benefits a lot from the fast operations, but as soon as I need, say, EigenSolver (DGEES), which is not yet available in MAGMA and performs much better in MKL than the native Eigen implementation, I'm stuck: I can't simply use MKL because that would hardwire the MAGMA backend to MKL. In our specific project we do have MKL, so it makes sense to combine the best of both backends, MAGMA and MKL, but I don't know how to do this without losing generality. For example, I could replace all the incomplete or non-performing MAGMA/CUBLAS operations with the MKL ones, but that would not be a good design choice for people who don't want MKL and choose a different MAGMA flavor, e.g. MAGMA-ATLAS with the GNU compilers. Another possibility is to maintain a separate branch where I pick the fastest between MAGMA-CUDA-MKL and MKL, but that would be a maintenance toll. Can anyone drop any thoughts on this?
TIA, best regards,
Giovanni
Moderator
You could simply offer an EIGEN_USE_MAGMA_AND_MKL token that, in addition to using MAGMA, would fall back to MKL. EIGEN_USE_MAGMA_AND_MKL would imply EIGEN_USE_MAGMA and include the few *_MKL.h files that have no MAGMA equivalent.
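From the user's side, such a token would presumably be used like the minimal sketch below, assuming eigen-magma with the new EIGEN_USE_MAGMA_AND_MKL token; everything else is standard Eigen API, and the routing comments describe the intended behavior rather than the current implementation.

```cpp
// Minimal usage sketch: define the combined token before including Eigen,
// assuming it implies EIGEN_USE_MAGMA and additionally enables the MKL
// specializations that have no MAGMA counterpart yet.
#define EIGEN_USE_MAGMA_AND_MKL
#include <Eigen/Dense>

int main() {
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(2048, 2048);
  Eigen::MatrixXd B = Eigen::MatrixXd::Random(2048, 2048);

  // GEMM is one of the operations that pays off on the GPU,
  // so it would be routed to the MAGMA/CUBLAS specialization.
  Eigen::MatrixXd C = A * B;

  // EigenSolver relies on the Schur decomposition (DGEES), which MAGMA
  // does not cover yet; with the combined token it would fall back to MKL.
  Eigen::EigenSolver<Eigen::MatrixXd> es(C);
  return es.info() == Eigen::Success ? 0 : 1;
}
```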
Registered Member
hi ggael,
I have implemented what you suggested, introducing the new token EIGEN_USE_MAGMA_AND_MKL. Both projects are updated now. Suggestions and bugfixes are always welcome.
Cheers,
Giovanni
Moderator
Nice work.
At some point we should try to find a way to merge it upstream. My initial plan for GPU support was to introduce a new matrix type, e.g. DeviceMatrix/DeviceVector, which would store its data in device memory. Custom kernels would be automatically generated for coefficient-wise operations, while products and solvers would be performed using cublas/magma (this is where your work will come in handy). That approach should be more efficient (fewer copies) and more flexible, as it allows mixing CPU and GPU computations. However, it requires much more work and might take time to appear, and your current approach would already be extremely useful, so why not have both? I'm a bit worried about offering official support for it, though: the MKL layer is already causing us many troubles that we are not able to reproduce and fix. To that end it would be nice to have all the code live in an unsupported/Eigen/MagmaSupport module.

As far as I know, EIGEN_USE_MAGMA mostly adds template specializations, so #including <unsupported/Eigen/MagmaSupport> should be enough to activate the use of MAGMA. Tell me if I'm wrong, but the only problem would be enabling pinned memory allocation, as it requires modifications in Eigen/Core. We could enforce Eigen/MagmaSupport being included before any other Eigen header with an #error directive in Eigen/MagmaSupport.

Perhaps an even better approach would be to make pinned memory allocation optional. Indeed, within a single project one might want to enable MAGMA for a given .cpp file only and keep using standard Eigen elsewhere. In that scenario matrices will likely be exchanged between MAGMA-enabled and non-MAGMA code, and if some use pinned allocation while others don't, I guess this might produce problems (at least if a matrix is allocated in one translation unit but deleted in another). Making pinned allocation optional would fix this issue too. What do you think?
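A rough sketch of how the include-order guard and an opt-in pinned-allocation switch could look; it relies on Eigen/Core defining the EIGEN_CORE_H guard, while the module path follows the unsupported/Eigen/MagmaSupport suggestion and the EIGEN_MAGMA_PINNED_ALLOC macro is purely hypothetical.

```cpp
// --- unsupported/Eigen/MagmaSupport (sketch, names illustrative) ---

// Enforce include order: Eigen/Core defines EIGEN_CORE_H once included,
// so if it is already set, MagmaSupport came too late to hook allocation.
#ifdef EIGEN_CORE_H
#error "MagmaSupport must be included before any other Eigen header."
#endif

// Opt-in rather than mandatory: a translation unit that does not define
// this keeps the standard (pageable) Eigen allocator, so its matrices can
// still be exchanged safely with non-MAGMA code elsewhere in the project.
// #define EIGEN_MAGMA_PINNED_ALLOC

#define EIGEN_USE_MAGMA
#include <Eigen/Core>

// ... template specializations mapping products and decompositions
//     to MAGMA/CUBLAS would follow here ...
```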
Registered Member
Hi ggael!
Thank you
I agree with merging upstream, just tell me what I need to do. I'm less familiar with the DeviceMatrix/DeviceVector approach you propose, but I will contribute as much as possible. Indeed, the current approach is somewhat of a losing strategy because of the copying, whereas some expressions could definitely take advantage of matrices/vectors that have previously been copied to the device.
I think it would be straightforward to do what you propose. AFAIK there is nothing fancy about pinned memory other than it being allocated as non-pageable, so it can be used for CPU-only computation without problems.
Best regards,
Giovanni
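To illustrate that point, here is a minimal sketch (not the eigen-magma internals): a pinned buffer is ordinary host memory that plain CPU-side Eigen code can use directly, while CUDA gets faster host/device transfers from it. cudaMallocHost/cudaFreeHost are the standard CUDA runtime calls; wrapping the buffer with Eigen::Map is just one assumed way to use it.

```cpp
#include <cuda_runtime.h>
#include <Eigen/Dense>
#include <cstdio>

int main() {
  const int n = 1024;
  double* buf = nullptr;

  // Allocate page-locked (non-pageable) host memory.
  if (cudaMallocHost(reinterpret_cast<void**>(&buf),
                     sizeof(double) * n * n) != cudaSuccess) {
    std::fprintf(stderr, "pinned allocation failed\n");
    return 1;
  }

  // The same buffer is perfectly usable for plain CPU-side Eigen work...
  Eigen::Map<Eigen::MatrixXd> A(buf, n, n);
  A.setRandom();
  std::printf("trace = %f\n", A.trace());

  // ...and can later be handed to cudaMemcpyAsync / MAGMA for fast transfers.
  cudaFreeHost(buf);
  return 0;
}
```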