Registered Member
Hello,
I bet this design problem is modeled somehow in the Eigen framework and I would like to understand exactly how. I would appreciate pointers on this. The design problem is the following. When using Expression Templates an "Expr<A>" type hierarchy is used where the atomic scalar operations take place e.g. the sum of two doubles using operator + e.g. "class DExprSum : public Expr< DExprSum< A,B > >". The type Matrix or Vector has an operator= defined as e.g. "void operator=(const Expr<A>& a)" which is where the actual single loop happens and this is where there is opportunity for exploiting vectorization/locality/threading with OpenMP etc. Conversely, trying to employ BLAS/LAPACK one ends up calling functions e.g. cblas_daxpy which will evaluate for every pair of resulting expressions e.g. x = a + b - c where they are all UDT vector would require by overloading operators + and - , two cblas_daxpy calls, two temporary objects with corresponding copy and one extra loop to copy the result to x. How can both solutions coexist without clashing in compile-time? How do you make it so that the choice of using Expression Templates over BLAS/LAPACK is set at runtime? The most optimal way would be to decide based on the expression at hand which approach offers higher performance e.g. an expression like x = a + b + c - d + e - f + g all UDT vectors will obviously be faster using Expression Templates i.e. only one loop and no temporary vector instances needed. TIA, Best regards, Giovanni |
Moderator
There is indeed no benefit at all in calling BLAS level-1 routines, i.e., coefficient-wise operations. On the other hand, the case of level 2 (matrix-vector) and level 3 (matrix-matrix) is very different, because to be fast such operations have to be evaluated into a temporary. This is what we do in Eigen. We have our own implementations/variants of gemv, gemm and the like, with smart analyzers that detect complex expressions which can be reduced to a pointer, a stride, a storage order, and a conjugate flag.
Look at the first two plots of our benchmark for an illustration of what you said: http://eigen.tuxfamily.org/index.php?title=Benchmark (the first one requires a single call, while the second requires two calls when using BLAS directly).
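As a user-level illustration of the kind of expression such an analyzer targets, here is a short example in ordinary Eigen code; the comments paraphrase the reduction described above, and the exact internals depend on the Eigen version.

```cpp
#include <Eigen/Dense>
#include <complex>

int main() {
    const int n = 512;
    Eigen::MatrixXcd A = Eigen::MatrixXcd::Random(n, n);
    Eigen::MatrixXcd B = Eigen::MatrixXcd::Random(n, n);
    Eigen::MatrixXcd C(n, n);
    std::complex<double> alpha(2.0, 0.0);

    // The adjoint and the scalar factor are not evaluated into temporaries:
    // each operand is reduced to a data pointer, a stride, a storage order
    // and a conjugate flag, plus the scale factor, and a single gemm-like
    // kernel call produces the result; .noalias() also skips the temporary
    // that would otherwise hold the product before assignment.
    C.noalias() = alpha * A.adjoint() * B;
    return 0;
}
```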
Registered Member
Hi ggael,
Nice! Thank you for your answer! What happens, then, when someone "enables MKL" in Eigen? Does this discard the expression templates and instead make multiple BLAS/LAPACK calls? TIA, best regards, Giovanni
Moderator
No no, MKL is only used when it makes sense, i.e., for products and decompositions.
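For reference, here is a minimal sketch of what that looks like on the user side. EIGEN_USE_MKL_ALL is Eigen's documented switch for the MKL backend; the example assumes MKL is installed and linked, and the comments reflect the behavior described above: only the product and the decomposition are rerouted, while the coefficient-wise expression still uses Eigen's own fused expression-template loop.

```cpp
#define EIGEN_USE_MKL_ALL   // must be defined before including any Eigen header
#include <Eigen/Dense>

int main() {
    const int n = 256;
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(n, n);
    Eigen::VectorXd v = Eigen::VectorXd::Random(n);

    Eigen::MatrixXd C = A * B;                      // matrix product: forwarded to MKL's gemm
    Eigen::VectorXd x = A.partialPivLu().solve(v);  // decomposition: candidate for the MKL/LAPACK backend
    Eigen::MatrixXd D = A + B - C;                  // still a single fused Eigen loop, no BLAS call
    return 0;
}
```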