Registered Member
So I wanted to compare Eigen against Theano (http://deeplearning.net/software/theano/) on the CPU. Theano is currently used mainly for its autodiff and GPU backend, so I was implementing some autodiff on top of Eigen; interestingly enough, Theano still came out ahead by a very significant margin. My thought was that I'm doing something wrong with Eigen, so I wanted to ask here for suggestions. Roughly speaking, the code for the Eigen computation is below:
Currently, all of the variables are held as "Eigen::ArrayXXf". The last step is needed because the interface requires returning a vector of Eigen arrays; shared_vars is a vector of global arrays as well. What you see is an autoencoder implementation. Both the Theano and the Eigen code are built with MKL and OpenMP. One very important detail is that this code is compiled as a shared dynamic library, and the code is then called from the main function using dlopen. I presume this might have some impact on performance, and perhaps there is a better way? Anyway, with input images coming in as a matrix of size (1000, 784), Eigen currently completes a single iteration in a mean time of 877 ms, while Theano does it in 220 ms, so Theano is significantly faster. Maybe someone could tell me why this is the case? PS: The command for compiling the dynamic library is:
The full code can now be found here: https://gist.github.com/Botev/d4edc36e7aedbdd311d9
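For reference, the dlopen pattern described above looks roughly like the sketch below. The library and symbol names ("libautoenc.so", "run_iteration") are placeholders for illustration, not the actual names from the gist:

```cpp
#include <dlfcn.h>
#include <cstdio>

int main() {
    // Placeholder names; the real library/symbol names are in the gist.
    // Build the host with -ldl on Linux.
    void* handle = dlopen("./libautoenc.so", RTLD_NOW);
    if (!handle) { std::fprintf(stderr, "dlopen: %s\n", dlerror()); return 1; }

    // Look up the entry point exported by the shared library.
    typedef void (*iter_fn)();
    iter_fn run_iteration = reinterpret_cast<iter_fn>(dlsym(handle, "run_iteration"));
    if (!run_iteration) { std::fprintf(stderr, "dlsym: %s\n", dlerror()); return 1; }

    run_iteration();  // one training iteration inside the shared library
    dlclose(handle);
    return 0;
}
```

Note that dlopen itself is a one-time cost at load time, so by itself it is unlikely to explain a per-iteration gap; code compiled with -fPIC can carry some indirection overhead, though.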
Moderator
Your function should be largely dominated by matrix-matrix products, and thus, in theory, you are mostly benchmarking MKL itself, not Eigen or Theano. To check that this assumption is correct and that Eigen is not doing something wrong when calling MKL, the best approach is to run it under a profiler (perf on Linux, Instruments on OS X, VTune, ...). You should see that sgemm is responsible for 90% or more of the computation time. If that's not the case, then please send us your findings!
Nonetheless, I've also tried your code without MKL, i.e., using pure Eigen. With Eigen 3.2 without OpenMP it takes 1.8 s, and 0.56 s with OpenMP (4 threads). When switching to 3.3-beta1 and enabling FMA (-mfma), I get 0.68 s without OpenMP and 0.29 s with OpenMP. I'm on a Core i7 @ 2.6 GHz. Edit: after removing the calls to tanh, I get 0.23 s, meaning that after all these optimizations the matrix products no longer represent 90% of the computation... so there is room for improvement there too. Self-contained code below (slightly modified to remove some temporaries):
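(The original self-contained listing is not preserved in this copy of the thread. As a stand-in, here is a minimal sketch of what one iteration of such a benchmark could look like; the tied weights, the hidden size k = 500, and all variable names are assumptions, not values taken from the gist.)

```cpp
#include <Eigen/Dense>
#include <cmath>
#include <functional>
#include <iostream>

using Eigen::ArrayXXf;

int main() {
    const int n = 1000, d = 784, k = 500;  // k is an assumed hidden size

    ArrayXXf X = ArrayXXf::Random(n, d);          // input batch
    ArrayXXf W = ArrayXXf::Random(d, k) * 0.01f;  // assumed tied encode/decode weights

    // Forward pass: encode, decode, reconstruction error.
    // (std::ptr_fun matches the thread's original style; it is removed in C++17.)
    ArrayXXf H = (X.matrix() * W.matrix()).array().unaryExpr(std::ptr_fun(tanhf));
    ArrayXXf R = (H.matrix() * W.matrix().transpose()).array().unaryExpr(std::ptr_fun(tanhf));
    ArrayXXf E = R - X;

    // Backward pass, using tanh'(a) = 1 - tanh(a)^2.
    ArrayXXf dA2 = E * (1.0f - R.square());                                    // decoder pre-activation gradient
    ArrayXXf dA1 = (dA2.matrix() * W.matrix()).array() * (1.0f - H.square());  // encoder pre-activation gradient
    ArrayXXf dW  = (X.matrix().transpose() * dA1.matrix()
                  + dA2.matrix().transpose() * H.matrix()).array();            // gradient w.r.t. the tied W

    std::cout << "loss: " << 0.5f * E.matrix().squaredNorm() / n << "\n";
    std::cout << "|dW|: " << dW.matrix().norm() << "\n";
    return 0;
}
```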
Registered Member
I upstreamed an efficient implementation of the tanh function recently. You can leverage it by replacing the calls to .unaryExpr(std::ptr_fun(tanhf)) with calls to .tanh(). This implementation takes advantage of the SSE and AVX instructions that are available on your CPU, so it should be 5 to 10x faster than calling the tanhf function directly.
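The swap itself is a one-liner; in a trivial sketch (where `h` stands for any intermediate activation array):

```cpp
#include <Eigen/Dense>
#include <cmath>
#include <functional>

int main() {
    Eigen::ArrayXXf h = Eigen::ArrayXXf::Random(1000, 500);

    // Before: one scalar libm call per coefficient.
    Eigen::ArrayXXf a = h.unaryExpr(std::ptr_fun(tanhf));

    // After: Eigen's vectorized tanh (SSE/AVX packet math).
    Eigen::ArrayXXf b = h.tanh();
    return 0;
}
```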
To speed things up a little more, you could also leverage the Tensor module to do your sum reductions. This module makes it possible to multithread every operation, so you should get a nice boost there as well.
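A minimal sketch of a multithreaded sum reduction with the (unsupported) Tensor module, assuming Eigen 3.3's CXX11 Tensor headers:

```cpp
#define EIGEN_USE_THREADS
#include <unsupported/Eigen/CXX11/Tensor>
#include <iostream>

int main() {
    Eigen::Tensor<float, 2> x(1000, 784);
    x.setRandom();

    // Evaluate tensor expressions on a pool of 4 threads.
    Eigen::ThreadPool pool(4);
    Eigen::ThreadPoolDevice device(&pool, 4);

    // Column-wise sum: reduce over dimension 0, leaving 784 entries.
    Eigen::Tensor<float, 1> colsum(784);
    Eigen::array<Eigen::Index, 1> dims = {{0}};
    colsum.device(device) = x.sum(dims);

    std::cout << "first column sum: " << colsum(0) << "\n";
    return 0;
}
```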