Registered Member
|
Hello,
One possible optimization for the MAGMA backend is to have Eigen allocate all host memory as pinned using cudaMallocHost. It is not clear that this will always pay off, because AFAIK it depends on several factors, including how much memory is being transferred. I searched for the place where the allocation takes place in Eigen but found too many matches ... would I need to replace most of the occurrences of malloc or alloc for this purpose? By replace I mean introduce a #if defined(EIGEN_USE_MAGMA) block and do the allocation with page-locked cudaMallocHost (see the sketch after the file lists below). Running the search gives me too many matches; I expected the memory allocations to be isolated to one or maybe two files:

/Users/bravegag/code/eigen-magma$ find ./Eigen \( -type f -and -not -path "*git*" \) -exec grep -li "malloc" {} \;
./Eigen/Core
./Eigen/QtAlignedMalloc
./Eigen/src/Core/util/DisableStupidWarnings.h
./Eigen/src/Core/util/MAGMA_support.h
./Eigen/src/Core/util/Memory.h
./Eigen/src/Eigen2Support/Memory.h
./Eigen/src/QR/ColPivHouseholderQR_MAGMA.h
./Eigen/src/SparseCore/SparseMatrix.h

/Users/bravegag/code/eigen-magma$ find ./Eigen \( -type f -and -not -path "*git*" \) -exec grep -li "alloc" {} \;
./Eigen/Core
./Eigen/QtAlignedMalloc
./Eigen/src/Cholesky/LDLT.h
./Eigen/src/Cholesky/LLT.h
./Eigen/src/Cholesky/LLT_MAGMA.h
./Eigen/src/Core/arch/NEON/PacketMath.h
./Eigen/src/Core/Array.h
./Eigen/src/Core/Block.h
./Eigen/src/Core/DenseBase.h
./Eigen/src/Core/DenseStorage.h
./Eigen/src/Core/GeneralProduct.h
./Eigen/src/Core/Matrix.h
./Eigen/src/Core/PlainObjectBase.h
./Eigen/src/Core/products/GeneralMatrixMatrix.h
./Eigen/src/Core/products/GeneralMatrixMatrix_MAGMA.h
./Eigen/src/Core/products/GeneralMatrixMatrixTriangular.h
./Eigen/src/Core/products/GeneralMatrixVector_MAGMA.h
./Eigen/src/Core/products/SelfadjointMatrixMatrix.h
./Eigen/src/Core/products/TriangularSolverMatrix_MAGMA.h
./Eigen/src/Core/util/Constants.h
./Eigen/src/Core/util/DisableStupidWarnings.h
./Eigen/src/Core/util/Macros.h
./Eigen/src/Core/util/MAGMA_support.h
./Eigen/src/Core/util/Memory.h
./Eigen/src/Core/util/StaticAssert.h
./Eigen/src/Core/VectorBlock.h
./Eigen/src/Eigen2Support/Block.h
./Eigen/src/Eigen2Support/Memory.h
./Eigen/src/Eigenvalues/ComplexEigenSolver.h
./Eigen/src/Eigenvalues/ComplexSchur.h
./Eigen/src/Eigenvalues/EigenSolver.h
./Eigen/src/Eigenvalues/GeneralizedEigenSolver.h
./Eigen/src/Eigenvalues/GeneralizedSelfAdjointEigenSolver.h
./Eigen/src/Eigenvalues/HessenbergDecomposition.h
./Eigen/src/Eigenvalues/SelfAdjointEigenSolver.h
./Eigen/src/Eigenvalues/Tridiagonalization.h
./Eigen/src/LU/FullPivLU.h
./Eigen/src/LU/PartialPivLU.h
./Eigen/src/OrderingMethods/Eigen_Colamd.h
./Eigen/src/PardisoSupport/PardisoSupport.h
./Eigen/src/PaStiXSupport/PaStiXSupport.h
./Eigen/src/plugins/BlockMethods.h
./Eigen/src/QR/ColPivHouseholderQR.h
./Eigen/src/QR/ColPivHouseholderQR_MAGMA.h
./Eigen/src/QR/FullPivHouseholderQR.h
./Eigen/src/QR/HouseholderQR.h
./Eigen/src/SparseCore/AmbiVector.h
./Eigen/src/SparseCore/CompressedStorage.h
./Eigen/src/SparseCore/SparseBlock.h
./Eigen/src/SparseCore/SparseColEtree.h
./Eigen/src/SparseCore/SparseMatrix.h
./Eigen/src/SparseCore/SparseSparseProductWithPruning.h
./Eigen/src/SparseCore/SparseVector.h
./Eigen/src/SparseLU/SparseLU.h
./Eigen/src/SparseLU/SparseLU_column_bmod.h
./Eigen/src/SparseLU/SparseLU_column_dfs.h
./Eigen/src/SparseLU/SparseLU_copy_to_ucol.h
./Eigen/src/SparseLU/SparseLU_Memory.h
./Eigen/src/SparseQR/SparseQR.h
./Eigen/src/StlSupport/details.h
./Eigen/src/StlSupport/StdDeque.h
./Eigen/src/StlSupport/StdList.h
./Eigen/src/StlSupport/StdVector.h
./Eigen/src/SuperLUSupport/SuperLUSupport.h
./Eigen/src/SVD/JacobiSVD.h
./Eigen/src/SVD/JacobiSVD_MAGMA.h
./Eigen/src/SVD/JacobiSVD_MKL.h
./Eigen/src/UmfPackSupport/UmfPackSupport.h
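For concreteness, here is a minimal sketch of the kind of conditional wrapper I mean; the names pinned_malloc and pinned_free are made up for illustration and are not existing Eigen functions, and error handling is reduced to returning a null pointer:

// Hypothetical wrapper: pinned (page-locked) host allocation when the
// MAGMA backend is enabled, plain malloc otherwise.
#include <cstdlib>

#if defined(EIGEN_USE_MAGMA)
#include <cuda_runtime.h>
#endif

inline void* pinned_malloc(std::size_t size)
{
#if defined(EIGEN_USE_MAGMA)
  void* ptr = 0;
  // cudaMallocHost returns page-locked host memory, which allows faster
  // (and asynchronous) Host <-> Device transfers than pageable memory.
  if (cudaMallocHost(&ptr, size) != cudaSuccess)
    return 0;
  return ptr;
#else
  return std::malloc(size);
#endif
}

inline void pinned_free(void* ptr)
{
#if defined(EIGEN_USE_MAGMA)
  // Pinned memory must be released with cudaFreeHost, not free.
  cudaFreeHost(ptr);
#else
  std::free(ptr);
#endif
}

Note that cudaMallocHost returns page-aligned memory, so Eigen's alignment requirement should be satisfied; on the other hand, the allocation itself is considerably more expensive than a regular malloc, which is why it may not always pay off.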
Thanks in advance,
Best regards,
Giovanni |
Registered Member
|
Search for "alloc(" and you get a more manageable list. I thought all memory allocations went through util/Memory.h, but there seem to be more in SparseCore/SparseMatrix.h.
|
Registered Member
|
Hi,
I have implemented this in Memory.h, and I get about a 1.1x speed-up in my benchmarks; e.g., DGEMM now reaches 750 GFlop/s where it was about 690 GFlop/s before: https://github.com/bravegag/eigen-magma ... l/Memory.h
Just a quick note: unlike the MAGMA and CUBLAS samples (the CUBLAS 0_Simple/matrixMulCUBLAS sample, modified to double-precision DGEMM, reaches 1.3 TFlop/s, but the Host <-> Device transfer times are unaccounted for), my benchmark accounts for the Host <-> Device transfer times (though some people may argue that is unfair). This is the only way to tell whether there is an actual performance gain from using the given kernel via the MAGMA backend; a sketch of this kind of transfer-inclusive timing follows at the end of this post.
Cheers,
Giovanni |
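To make explicit what accounting for the transfers means, here is a minimal self-contained sketch (not my actual benchmark; the matrix dimension n, the all-ones inputs, and the omitted error checking are purely illustrative) that times cublasDgemm with the Host <-> Device copies inside the timed region:

// Sketch: time a DGEMM including the Host <-> Device transfers.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
  const int n = 1000;  // illustrative matrix dimension
  std::vector<double> hA(n * n, 1.0), hB(n * n, 1.0), hC(n * n, 0.0);

  double *dA, *dB, *dC;
  cudaMalloc(&dA, n * n * sizeof(double));
  cudaMalloc(&dB, n * n * sizeof(double));
  cudaMalloc(&dC, n * n * sizeof(double));

  cublasHandle_t handle;
  cublasCreate(&handle);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  // The copies are deliberately inside the timed region.
  cudaMemcpy(dA, hA.data(), n * n * sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(dB, hB.data(), n * n * sizeof(double), cudaMemcpyHostToDevice);
  const double alpha = 1.0, beta = 0.0;
  cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &alpha, dA, n, dB, n, &beta, dC, n);
  cudaMemcpy(hC.data(), dC, n * n * sizeof(double), cudaMemcpyDeviceToHost);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  // A square DGEMM performs 2*n^3 floating-point operations.
  std::printf("%.1f GFlop/s (transfers included)\n",
              2.0 * n * n * n / (ms * 1e6));

  cublasDestroy(handle);
  cudaFree(dA); cudaFree(dB); cudaFree(dC);
  return 0;
}

Moving the cudaMemcpy calls outside the start/stop events would time only the kernel, which is how figures like the 1.3 TFlop/s above come about.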
Moderator
|
Note that pinned memory improves performance only for large enough matrices (~16MB): http://www.cs.virginia.edu/~mwb7w/cuda_ ... deoff.html
|
Registered Member
|
Hi ggael!
Thank you. I saw that site too, but my benchmarks improved from matrix sizes N=1000 and up; e.g., for N=1000 the matrix occupies 1000 x 1000 x 8 = 8'000'000 bytes, i.e. 8'000'000 / 1'048'576 ≈ 7.63 MB, and pinned memory already gave me a speed-up there. The speed-up increases with the size, of course. I believe the RAM speed also plays a role, because the downside of pinned memory is the cost of allocating it, and with faster RAM this downside is smaller; e.g., the memory in the box where I benchmark runs at 1866 MHz.
Best regards,
Giovanni |