Registered Member
|
Hello,
One possible optimization for the MAGMA backend is to have Eigen allocate all host memory as pinned using cudaMallocHost. It is not clear that this will always pay off, because AFAIK it depends on several factors, including how much memory is being transferred. I searched for the place where the allocation takes place in Eigen but found too many matches ... would I need to replace most of the occurrences of malloc or alloc for this purpose? By replace I mean introduce a #if defined(EIGEN_USE_MAGMA) block and do the allocation with page-locked cudaMallocHost (see the sketch after the file lists below). Running the search gives me too many matches; I expected the memory allocations to be isolated to one or maybe two files:

/Users/bravegag/code/eigen-magma$ find ./Eigen \( -type f -and -not -path "*git*" \) -exec grep -li "malloc" {} \;
./Eigen/Core
./Eigen/QtAlignedMalloc
./Eigen/src/Core/util/DisableStupidWarnings.h
./Eigen/src/Core/util/MAGMA_support.h
./Eigen/src/Core/util/Memory.h
./Eigen/src/Eigen2Support/Memory.h
./Eigen/src/QR/ColPivHouseholderQR_MAGMA.h
./Eigen/src/SparseCore/SparseMatrix.h

/Users/bravegag/code/eigen-magma$ find ./Eigen \( -type f -and -not -path "*git*" \) -exec grep -li "alloc" {} \;
./Eigen/Core
./Eigen/QtAlignedMalloc
./Eigen/src/Cholesky/LDLT.h
./Eigen/src/Cholesky/LLT.h
./Eigen/src/Cholesky/LLT_MAGMA.h
./Eigen/src/Core/arch/NEON/PacketMath.h
./Eigen/src/Core/Array.h
./Eigen/src/Core/Block.h
./Eigen/src/Core/DenseBase.h
./Eigen/src/Core/DenseStorage.h
./Eigen/src/Core/GeneralProduct.h
./Eigen/src/Core/Matrix.h
./Eigen/src/Core/PlainObjectBase.h
./Eigen/src/Core/products/GeneralMatrixMatrix.h
./Eigen/src/Core/products/GeneralMatrixMatrix_MAGMA.h
./Eigen/src/Core/products/GeneralMatrixMatrixTriangular.h
./Eigen/src/Core/products/GeneralMatrixVector_MAGMA.h
./Eigen/src/Core/products/SelfadjointMatrixMatrix.h
./Eigen/src/Core/products/TriangularSolverMatrix_MAGMA.h
./Eigen/src/Core/util/Constants.h
./Eigen/src/Core/util/DisableStupidWarnings.h
./Eigen/src/Core/util/Macros.h
./Eigen/src/Core/util/MAGMA_support.h
./Eigen/src/Core/util/Memory.h
./Eigen/src/Core/util/StaticAssert.h
./Eigen/src/Core/VectorBlock.h
./Eigen/src/Eigen2Support/Block.h
./Eigen/src/Eigen2Support/Memory.h
./Eigen/src/Eigenvalues/ComplexEigenSolver.h
./Eigen/src/Eigenvalues/ComplexSchur.h
./Eigen/src/Eigenvalues/EigenSolver.h
./Eigen/src/Eigenvalues/GeneralizedEigenSolver.h
./Eigen/src/Eigenvalues/GeneralizedSelfAdjointEigenSolver.h
./Eigen/src/Eigenvalues/HessenbergDecomposition.h
./Eigen/src/Eigenvalues/SelfAdjointEigenSolver.h
./Eigen/src/Eigenvalues/Tridiagonalization.h
./Eigen/src/LU/FullPivLU.h
./Eigen/src/LU/PartialPivLU.h
./Eigen/src/OrderingMethods/Eigen_Colamd.h
./Eigen/src/PardisoSupport/PardisoSupport.h
./Eigen/src/PaStiXSupport/PaStiXSupport.h
./Eigen/src/plugins/BlockMethods.h
./Eigen/src/QR/ColPivHouseholderQR.h
./Eigen/src/QR/ColPivHouseholderQR_MAGMA.h
./Eigen/src/QR/FullPivHouseholderQR.h
./Eigen/src/QR/HouseholderQR.h
./Eigen/src/SparseCore/AmbiVector.h
./Eigen/src/SparseCore/CompressedStorage.h
./Eigen/src/SparseCore/SparseBlock.h
./Eigen/src/SparseCore/SparseColEtree.h
./Eigen/src/SparseCore/SparseMatrix.h
./Eigen/src/SparseCore/SparseSparseProductWithPruning.h
./Eigen/src/SparseCore/SparseVector.h
./Eigen/src/SparseLU/SparseLU.h
./Eigen/src/SparseLU/SparseLU_column_bmod.h
./Eigen/src/SparseLU/SparseLU_column_dfs.h
./Eigen/src/SparseLU/SparseLU_copy_to_ucol.h
./Eigen/src/SparseLU/SparseLU_Memory.h
./Eigen/src/SparseQR/SparseQR.h
./Eigen/src/StlSupport/details.h
./Eigen/src/StlSupport/StdDeque.h
./Eigen/src/StlSupport/StdList.h
./Eigen/src/StlSupport/StdVector.h
./Eigen/src/SuperLUSupport/SuperLUSupport.h
./Eigen/src/SVD/JacobiSVD.h
./Eigen/src/SVD/JacobiSVD_MAGMA.h
./Eigen/src/SVD/JacobiSVD_MKL.h
./Eigen/src/UmfPackSupport/UmfPackSupport.h
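For concreteness, here is a minimal sketch of the kind of conditional wrapper I mean; the names pinned_malloc and pinned_free are made up for illustration and are not existing Eigen functions, and error handling is reduced to returning a null pointer:

// Hypothetical wrapper: pinned (page-locked) host allocation when the
// MAGMA backend is enabled, plain malloc otherwise.
#include <cstdlib>

#if defined(EIGEN_USE_MAGMA)
#include <cuda_runtime.h>
#endif

inline void* pinned_malloc(std::size_t size)
{
#if defined(EIGEN_USE_MAGMA)
  void* ptr = 0;
  // cudaMallocHost returns page-locked host memory, which allows faster
  // (and asynchronous) Host <-> Device transfers than pageable memory.
  if (cudaMallocHost(&ptr, size) != cudaSuccess)
    return 0;
  return ptr;
#else
  return std::malloc(size);
#endif
}

inline void pinned_free(void* ptr)
{
#if defined(EIGEN_USE_MAGMA)
  // Pinned memory must be released with cudaFreeHost, not free.
  cudaFreeHost(ptr);
#else
  std::free(ptr);
#endif
}

Note that cudaMallocHost returns page-aligned memory, so Eigen's alignment requirement should be satisfied; on the other hand, the allocation itself is considerably more expensive than a regular malloc, which is why it may not always pay off.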
Thanks in advance,
Best regards,
Giovanni |
Registered Member
|
Search for "alloc(" and you get a more manageable list. I thought all memory allocations went through util/Memory.h, but there seem to be more in SparseCore/SparseMatrix.h.
|
Registered Member
|
Hi,
I have implemented this in Memory.h, and I get about a 1.1x speed-up in my benchmarks; e.g., DGEMM now reaches 750 GFlop/s where it was about 690 GFlop/s before: https://github.com/bravegag/eigen-magma ... l/Memory.h
Just a quick note: unlike the MAGMA and CUBLAS samples (the CUBLAS 0_Simple/matrixMulCUBLAS sample, modified to double-precision DGEMM, reaches 1.3 TFlop/s, but the Host <-> Device transfer times are unaccounted for), my benchmark accounts for the Host <-> Device transfer times (though some people may argue that is unfair). This is the only way to tell whether there is an actual performance gain from using the given kernel via the MAGMA backend; a sketch of this kind of transfer-inclusive timing follows at the end of this post.
Cheers,
Giovanni |
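To make explicit what accounting for the transfers means, here is a minimal self-contained sketch (not my actual benchmark; the matrix dimension n, the all-ones inputs, and the omitted error checking are purely illustrative) that times cublasDgemm with the Host <-> Device copies inside the timed region:

// Sketch: time a DGEMM including the Host <-> Device transfers.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
  const int n = 1000;  // illustrative matrix dimension
  std::vector<double> hA(n * n, 1.0), hB(n * n, 1.0), hC(n * n, 0.0);

  double *dA, *dB, *dC;
  cudaMalloc(&dA, n * n * sizeof(double));
  cudaMalloc(&dB, n * n * sizeof(double));
  cudaMalloc(&dC, n * n * sizeof(double));

  cublasHandle_t handle;
  cublasCreate(&handle);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  // The copies are deliberately inside the timed region.
  cudaMemcpy(dA, hA.data(), n * n * sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(dB, hB.data(), n * n * sizeof(double), cudaMemcpyHostToDevice);
  const double alpha = 1.0, beta = 0.0;
  cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &alpha, dA, n, dB, n, &beta, dC, n);
  cudaMemcpy(hC.data(), dC, n * n * sizeof(double), cudaMemcpyDeviceToHost);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  // A square DGEMM performs 2*n^3 floating-point operations.
  std::printf("%.1f GFlop/s (transfers included)\n",
              2.0 * n * n * n / (ms * 1e6));

  cublasDestroy(handle);
  cudaFree(dA); cudaFree(dB); cudaFree(dC);
  return 0;
}

Moving the cudaMemcpy calls outside the start/stop events would time only the kernel, which is how figures like the 1.3 TFlop/s above come about.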
Moderator
|
Note that pinned memory improves performance only for large enough matrices (~16MB): http://www.cs.virginia.edu/~mwb7w/cuda_ ... deoff.html
|
Registered Member
|
Hi ggael!
Thank you. I saw that site too, but my benchmarks improved from matrix sizes N=1000 and up; e.g., for N=1000 the matrix occupies 1000 x 1000 x 8 = 8'000'000 bytes, i.e. 8'000'000 / 1'048'576 ≈ 7.63 MB, and pinned memory already gave me a speed-up there. The speed-up increases with the size, of course. I believe the RAM speed also plays a role, because the downside of pinned memory is the cost of allocating it, and with faster RAM this downside is smaller; e.g., the memory in the box where I benchmark runs at 1866 MHz.
Best regards,
Giovanni |