
pinned memory allocation with cudaMallocHost

bravegag (Registered Member, Posts: 52, Karma: 0)
Hello,

One possible optimization for the MAGMA backend is to have Eigen allocate all host memory as pinned memory using cudaMallocHost. It is not clear that it will always pay off, because AFAIK it depends on several factors, including how much memory is being transferred, etc. I searched for the place where the allocation takes place in Eigen but found too many matches... would I need to replace most of the occurrences of malloc or alloc for this purpose? By replace I mean introduce a #if defined(EIGEN_USE_MAGMA) block and do the allocation using page-locked cudaMallocHost.
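To make it concrete, the kind of guarded allocation I have in mind would look roughly like this (a sketch only; the helper names are made up for illustration and do not exist in Eigen):

// Hypothetical sketch: a host allocator that returns pinned memory when the
// MAGMA backend is enabled, and falls back to plain malloc otherwise.
#include <cstdlib>
#if defined(EIGEN_USE_MAGMA)
#include <cuda_runtime.h>
#endif

inline void* host_malloc(std::size_t size)
{
#if defined(EIGEN_USE_MAGMA)
  void* ptr = 0;
  // cudaMallocHost allocates page-locked (pinned) host memory, which allows
  // faster (and asynchronous) host <-> device transfers.
  if (cudaMallocHost(&ptr, size) != cudaSuccess)
    return 0;
  return ptr;
#else
  return std::malloc(size);
#endif
}

inline void host_free(void* ptr)
{
#if defined(EIGEN_USE_MAGMA)
  cudaFreeHost(ptr);
#else
  std::free(ptr);
#endif
}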

Running the search gives me too many matches; I expected the memory allocations to be isolated to one or maybe two files:

/Users/bravegag/code/eigen-magma$ find ./Eigen \( -type f -and -not -path "*git*" \) -exec grep -li "malloc" {} \;
./Eigen/Core
./Eigen/QtAlignedMalloc
./Eigen/src/Core/util/DisableStupidWarnings.h
./Eigen/src/Core/util/MAGMA_support.h
./Eigen/src/Core/util/Memory.h
./Eigen/src/Eigen2Support/Memory.h
./Eigen/src/QR/ColPivHouseholderQR_MAGMA.h
./Eigen/src/SparseCore/SparseMatrix.h

/Users/bravegag/code/eigen-magma$ find ./Eigen \( -type f -and -not -path "*git*" \) -exec grep -li "alloc" {} \;
./Eigen/Core
./Eigen/QtAlignedMalloc
./Eigen/src/Cholesky/LDLT.h
./Eigen/src/Cholesky/LLT.h
./Eigen/src/Cholesky/LLT_MAGMA.h
./Eigen/src/Core/arch/NEON/PacketMath.h
./Eigen/src/Core/Array.h
./Eigen/src/Core/Block.h
./Eigen/src/Core/DenseBase.h
./Eigen/src/Core/DenseStorage.h
./Eigen/src/Core/GeneralProduct.h
./Eigen/src/Core/Matrix.h
./Eigen/src/Core/PlainObjectBase.h
./Eigen/src/Core/products/GeneralMatrixMatrix.h
./Eigen/src/Core/products/GeneralMatrixMatrix_MAGMA.h
./Eigen/src/Core/products/GeneralMatrixMatrixTriangular.h
./Eigen/src/Core/products/GeneralMatrixVector_MAGMA.h
./Eigen/src/Core/products/SelfadjointMatrixMatrix.h
./Eigen/src/Core/products/TriangularSolverMatrix_MAGMA.h
./Eigen/src/Core/util/Constants.h
./Eigen/src/Core/util/DisableStupidWarnings.h
./Eigen/src/Core/util/Macros.h
./Eigen/src/Core/util/MAGMA_support.h
./Eigen/src/Core/util/Memory.h
./Eigen/src/Core/util/StaticAssert.h
./Eigen/src/Core/VectorBlock.h
./Eigen/src/Eigen2Support/Block.h
./Eigen/src/Eigen2Support/Memory.h
./Eigen/src/Eigenvalues/ComplexEigenSolver.h
./Eigen/src/Eigenvalues/ComplexSchur.h
./Eigen/src/Eigenvalues/EigenSolver.h
./Eigen/src/Eigenvalues/GeneralizedEigenSolver.h
./Eigen/src/Eigenvalues/GeneralizedSelfAdjointEigenSolver.h
./Eigen/src/Eigenvalues/HessenbergDecomposition.h
./Eigen/src/Eigenvalues/SelfAdjointEigenSolver.h
./Eigen/src/Eigenvalues/Tridiagonalization.h
./Eigen/src/LU/FullPivLU.h
./Eigen/src/LU/PartialPivLU.h
./Eigen/src/OrderingMethods/Eigen_Colamd.h
./Eigen/src/PardisoSupport/PardisoSupport.h
./Eigen/src/PaStiXSupport/PaStiXSupport.h
./Eigen/src/plugins/BlockMethods.h
./Eigen/src/QR/ColPivHouseholderQR.h
./Eigen/src/QR/ColPivHouseholderQR_MAGMA.h
./Eigen/src/QR/FullPivHouseholderQR.h
./Eigen/src/QR/HouseholderQR.h
./Eigen/src/SparseCore/AmbiVector.h
./Eigen/src/SparseCore/CompressedStorage.h
./Eigen/src/SparseCore/SparseBlock.h
./Eigen/src/SparseCore/SparseColEtree.h
./Eigen/src/SparseCore/SparseMatrix.h
./Eigen/src/SparseCore/SparseSparseProductWithPruning.h
./Eigen/src/SparseCore/SparseVector.h
./Eigen/src/SparseLU/SparseLU.h
./Eigen/src/SparseLU/SparseLU_column_bmod.h
./Eigen/src/SparseLU/SparseLU_column_dfs.h
./Eigen/src/SparseLU/SparseLU_copy_to_ucol.h
./Eigen/src/SparseLU/SparseLU_Memory.h
./Eigen/src/SparseQR/SparseQR.h
./Eigen/src/StlSupport/details.h
./Eigen/src/StlSupport/StdDeque.h
./Eigen/src/StlSupport/StdList.h
./Eigen/src/StlSupport/StdVector.h
./Eigen/src/SuperLUSupport/SuperLUSupport.h
./Eigen/src/SVD/JacobiSVD.h
./Eigen/src/SVD/JacobiSVD_MAGMA.h
./Eigen/src/SVD/JacobiSVD_MKL.h
./Eigen/src/UmfPackSupport/UmfPackSupport.h

Thanks in advance,
Best regards,
Giovanni
jitseniesen (Registered Member, Posts: 204, Karma: 2)
Search for "alloc(" and you get a more manageable list. I thought all memory allocations go through util/Memory.h , but there seems to be more in SparseCore/SparseMatrix.h .
bravegag (Registered Member, Posts: 52, Karma: 0)
Hi,

I have implemented this in Memory.h and get about a 1.1x speed-up in my benchmarks, e.g. DGEMM now reaches 750 GFlop/s where it was about 690 GFlop/s before:
https://github.com/bravegag/eigen-magma ... l/Memory.h

Just a quick note: unlike the MAGMA and CUBLAS samples (the CUBLAS 0_Simple/matrixMulCUBLAS sample modified for double-precision DGEMM reaches 1.3 TFlop/s, but the Host <-> Device transfer times are unaccounted for), my benchmark accounts for the Host <-> Device transfer times. Some people may argue this is unfair, but it is the only way to tell whether there is an actual performance gain from using the given kernel via the MAGMA backend.
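Roughly, the measurement looks like this (a simplified sketch, not the exact benchmark code; with EIGEN_USE_MAGMA the host <-> device transfers happen inside the timed call, so they are included in the reported GFlop/s):

// Simplified timing sketch for a DGEMM benchmark: the timed region covers the
// whole Eigen product, so any backend transfers are part of the measured time.
#include <Eigen/Dense>
#include <chrono>
#include <iostream>

int main()
{
  const int N = 5000;
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(N, N);
  Eigen::MatrixXd B = Eigen::MatrixXd::Random(N, N);
  Eigen::MatrixXd C(N, N);

  auto start = std::chrono::high_resolution_clock::now();
  C.noalias() = A * B;   // DGEMM; dispatched to the MAGMA backend if enabled
  auto stop = std::chrono::high_resolution_clock::now();

  double seconds = std::chrono::duration<double>(stop - start).count();
  double gflops  = 2.0 * N * N * N / seconds * 1e-9;
  std::cout << "DGEMM " << N << "x" << N << ": " << gflops << " GFlop/s\n";
  return 0;
}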

Cheers,
Giovanni
ggael (Moderator, Posts: 3447, Karma: 19)
note that pinned memory improves performance only for large enough matrices (roughly 16MB and above): http://www.cs.virginia.edu/~mwb7w/cuda_ ... deoff.html
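you can measure the crossover on your own machine by comparing a host -> device copy from a pageable buffer against one from a pinned buffer, e.g. (a minimal sketch, single copy per case):

// Minimal sketch: compare host -> device copy time from pageable vs pinned
// host memory for one buffer size (in a real benchmark, warm up the context
// first and average over several copies).
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

static float copy_ms(const void* host, void* device, size_t bytes)
{
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  cudaMemcpy(device, host, bytes, cudaMemcpyHostToDevice);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}

int main()
{
  const size_t bytes = size_t(1000) * 1000 * sizeof(double); // ~7.6 MB

  void* pageable = std::malloc(bytes);
  void* pinned = 0;
  void* device = 0;
  cudaMallocHost(&pinned, bytes);   // page-locked host buffer
  cudaMalloc(&device, bytes);

  std::printf("pageable: %.3f ms\n", copy_ms(pageable, device, bytes));
  std::printf("pinned:   %.3f ms\n", copy_ms(pinned, device, bytes));

  cudaFree(device);
  cudaFreeHost(pinned);
  std::free(pageable);
  return 0;
}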
bravegag (Registered Member, Posts: 52, Karma: 0)
Hi ggael!

Thank you. I saw this site too, but my benchmarks already improve from matrix sizes N=1000 and up, e.g. for N=1000 the matrix takes 1000x1000x8 = 8'000'000 bytes / 1'048'576 ≈ 7.63 MB, and pinned memory already gave me a speed-up there. The speed-up increases with the size, of course. I believe the RAM speed also plays a role, because the downside of pinned memory is the cost of allocating it, and with fast RAM this downside is smaller, e.g. the memory in the box where I benchmark runs at 1866 MHz.

Best regards,
Giovanni

