
Design question: BLAS/LAPACK & Expression Templates

bravegag — Registered Member · Posts: 52 · Karma: 0
Hello,

I suspect this design problem is already solved somewhere in the Eigen framework, and I would like to understand exactly how. I would appreciate pointers on this.

The design problem is the following. When using expression templates, an "Expr<A>" type hierarchy represents the computation, and the atomic scalar operations are encoded as node types; e.g. the sum of two doubles via operator+ becomes "class DExprSum : public Expr< DExprSum<A,B> >". The Matrix or Vector type defines an assignment operator, e.g. "void operator=(const Expr<A>& a)", which is where the actual single loop runs, and that loop is where there is opportunity for exploiting vectorization, locality, threading with OpenMP, etc.

Conversely, when employing BLAS/LAPACK one ends up calling functions such as cblas_daxpy, which evaluate one binary operation at a time. For example, x = a + b - c, where all operands are user-defined vector types with overloaded operator+ and operator-, requires two cblas_daxpy calls, two temporary objects with their corresponding copies, and one extra loop to copy the result into x.

How can both solutions coexist without clashing at compile time? How can the choice between expression templates and BLAS/LAPACK be made at runtime? Ideally, one would decide per expression which approach offers higher performance: an expression like x = a + b + c - d + e - f + g, over user-defined vector types, will obviously be faster with expression templates, since it needs only one loop and no temporary vector instances.

TIA,
Best regards,
Giovanni
ggael — Moderator · Posts: 3447 · Karma: 19
There is indeed no benefit at all in calling BLAS level-1 operations, i.e., coefficient-wise operations. On the other hand, level-2 (matrix-vector) and level-3 (matrix-matrix) operations are very different because, to be fast, they have to be evaluated into a temporary. This is what we do in Eigen: we have our own implementations/variants of gemv, gemm and the like, with smart analyzers that detect complex expressions which can be reduced to a pointer, a stride, a storage order, and a conjugate flag.

Look at the first two plots of our benchmark for an illustration of what you said: http://eigen.tuxfamily.org/index.php?title=Benchmark (the first one requires a single call, while the second requires two calls when using BLAS directly).
bravegag — Registered Member · Posts: 52 · Karma: 0
Hi ggael,

Nice! Thank you for your answer! What happens, then, when someone "enables MKL" in Eigen? Does that discard the expression templates and instead perform multiple BLAS/LAPACK calls?

TIA,
Best regards,
Giovanni
ggael — Moderator · Posts: 3447 · Karma: 19
No, no: MKL is only used where it makes sense, i.e., for products and decompositions.
