Eigen slower than Theano on CPU

botev (Registered Member)

Tue Jan 19, 2016 10:42 am
So I wanted to test Eigen against Theano (http://deeplearning.net/software/theano/) on the CPU. Theano is currently used mainly for its autodiff and its GPU backend, so I was building some autodiff on top of Eigen; interestingly enough, Theano still won by a very significant margin. My thought was that I must be doing something wrong with Eigen, so I wanted to ask here for suggestions. Roughly speaking, the code for the Eigen computation is below:
Code: Select all
   // Calculate all of the computation nodes
   Eigen::ArrayXXf node_20 = (((inputs[0]).matrix() * (shared_vars[0]->value).matrix()).array() + shared_vars[1]->value.replicate(1000, 1)).unaryExpr(std::ptr_fun(tanhf));
   Eigen::ArrayXXf node_24 = (((node_20).matrix() * (shared_vars[2]->value).matrix()).array() + shared_vars[3]->value.replicate(1000, 1)).unaryExpr(std::ptr_fun(tanhf));
   Eigen::ArrayXXf node_28 = (((node_24).matrix() * (shared_vars[4]->value).matrix()).array() + shared_vars[5]->value.replicate(1000, 1)).unaryExpr(std::ptr_fun(tanhf));
   Eigen::ArrayXXf node_32 = (((node_28).matrix() * (shared_vars[6]->value).matrix()).array() + shared_vars[7]->value.replicate(1000, 1)).unaryExpr(std::ptr_fun(tanhf));
   Eigen::ArrayXXf node_36 = (((node_32).matrix() * (shared_vars[8]->value).matrix()).array() + shared_vars[9]->value.replicate(1000, 1)).unaryExpr(std::ptr_fun(tanhf));
   Eigen::ArrayXXf node_40 = (((node_36).matrix() * (shared_vars[10]->value).matrix()).array() + shared_vars[11]->value.replicate(1000, 1)).unaryExpr(std::ptr_fun(tanhf));
   Eigen::ArrayXXf node_44 = (((node_40).matrix() * (shared_vars[12]->value).matrix()).array() + shared_vars[13]->value.replicate(1000, 1)).unaryExpr(std::ptr_fun(tanhf));
   Eigen::ArrayXXf node_47 = (((node_44).matrix() * (shared_vars[14]->value).matrix()).array() + shared_vars[15]->value.replicate(1000, 1));
   Eigen::ArrayXXf node_59 = 0.001000 * (1.0 / (1.0 + (-node_47).array().exp()).array() - inputs[0]);
   Eigen::ArrayXXf node_69 = ((node_59).matrix() * (shared_vars[14]->value.transpose()).matrix()).array() * (1.000000 - node_44.square());
   Eigen::ArrayXXf node_79 = ((node_69).matrix() * (shared_vars[12]->value.transpose()).matrix()).array() * (1.000000 - node_40.square());
   Eigen::ArrayXXf node_89 = ((node_79).matrix() * (shared_vars[10]->value.transpose()).matrix()).array() * (1.000000 - node_36.square());
   Eigen::ArrayXXf node_99 = ((node_89).matrix() * (shared_vars[8]->value.transpose()).matrix()).array() * (1.000000 - node_32.square());
   Eigen::ArrayXXf node_109 = ((node_99).matrix() * (shared_vars[6]->value.transpose()).matrix()).array() * (1.000000 - node_28.square());
   Eigen::ArrayXXf node_119 = ((node_109).matrix() * (shared_vars[4]->value.transpose()).matrix()).array() * (1.000000 - node_24.square());
   Eigen::ArrayXXf node_129 = ((node_119).matrix() * (shared_vars[2]->value.transpose()).matrix()).array() * (1.000000 - node_20.square());

   // Update all shared variables
   shared_vars[0]->value -= 0.010000 * ((inputs[0].transpose()).matrix() * (node_129).matrix()).array();
   shared_vars[1]->value -= 0.010000 * node_129.colwise().sum();
   shared_vars[2]->value -= 0.010000 * ((node_20.transpose()).matrix() * (node_119).matrix()).array();
   shared_vars[3]->value -= 0.010000 * node_119.colwise().sum();
   shared_vars[4]->value -= 0.010000 * ((node_24.transpose()).matrix() * (node_109).matrix()).array();
   shared_vars[5]->value -= 0.010000 * node_109.colwise().sum();
   shared_vars[6]->value -= 0.010000 * ((node_28.transpose()).matrix() * (node_99).matrix()).array();
   shared_vars[7]->value -= 0.010000 * node_99.colwise().sum();
   shared_vars[8]->value -= 0.010000 * ((node_32.transpose()).matrix() * (node_89).matrix()).array();
   shared_vars[9]->value -= 0.010000 * node_89.colwise().sum();
   shared_vars[10]->value -= 0.010000 * ((node_36.transpose()).matrix() * (node_79).matrix()).array();
   shared_vars[11]->value -= 0.010000 * node_79.colwise().sum();
   shared_vars[12]->value -= 0.010000 * ((node_40.transpose()).matrix() * (node_69).matrix()).array();
   shared_vars[13]->value -= 0.010000 * node_69.colwise().sum();
   shared_vars[14]->value -= 0.010000 * ((node_44.transpose()).matrix() * (node_59).matrix()).array();
   shared_vars[15]->value -= 0.010000 * node_59.colwise().sum();

   // Write all of the output nodes in correct order
   Eigen::ArrayXXf node_54(1,1);
   node_54 << (inputs[0] * (softplus(-node_47, 50) - softplus(node_47, 50)) + softplus(node_47, 50)).sum() * 0.001000;
   return {node_54};

Currently, all of the variables are held as Eigen::ArrayXXf. The last step is needed because the interface requires returning a vector of Eigen arrays; shared_vars is likewise a vector of global arrays. What you see is an autoencoder implementation. Both Theano and the Eigen code are built with MKL and OpenMP. One very important detail is that this code is compiled as a shared dynamic library, which is then called from the main function via dlopen. I presume this might have some impact on performance, and perhaps there is a better way? In any case, with input images coming in as a matrix of size (1000, 784), Eigen takes a mean of 877 ms per iteration, while Theano does it in 220 ms, so Eigen is significantly slower. Could someone tell me why this is the case?
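For context, the loading side looks roughly like this (a minimal sketch; the exported symbol name "eval" stands in for the real generated interface, which is in the gist linked below):
Code: Select all
#include <dlfcn.h>
#include <iostream>

int main() {
   // Load the generated shared library at runtime.
   void* handle = dlopen("mnist_hinton_e/mnist_hinton_e_optim.so", RTLD_NOW);
   if (!handle) { std::cerr << dlerror() << "\n"; return 1; }

   // Look up the entry point ("eval" is a placeholder symbol name).
   using eval_fn = void (*)();
   auto eval = reinterpret_cast<eval_fn>(dlsym(handle, "eval"));
   if (!eval) { std::cerr << dlerror() << "\n"; return 1; }

   // The dlopen/dlsym lookup is a one-time cost; each subsequent call is an
   // ordinary indirect call.
   eval();
   dlclose(handle);
   return 0;
}

As far as I understand, the costs of going through a shared object are -fPIC code generation and the loss of cross-boundary inlining, both of which should be minor next to the matrix products.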

PS: The command for compiling the dynamic library is:
Code: Select all
MKL_NUM_THREADS=4 OMP_NUM_THREADS=4 g++ -O3 -shared -fPIC -std=c++11 -I/opt/eigen -I/opt/intel/mkl/include -Wall -Werror=return-type -Wno-unused-variable -Wno-narrowing -m64 -fopenmp -L/opt/intel/mkl/lib/intel64 -Wl,--no-as-needed -lmkl_rt -lpthread -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lpthread -lm  -o mnist_hinton_e/mnist_hinton_e_optim.so mnist_hinton_e/mnist_hinton_e_optim.cpp


The full code can now be found here: https://gist.github.com/Botev/d4edc36e7aedbdd311d9
ggael (Moderator)

Re: Eigen slower than Theano on CPU

Wed Jan 20, 2016 9:34 am
Your function should be largely dominated by matrix-matrix products, and thus, in theory, you are mostly benchmarking MKL itself, not Eigen or Theano. To check that this assumption is correct and that Eigen is not doing something wrong when calling MKL, the best approach is to run it under a profiler (perf on Linux, Instruments on OS X, or VTune...). You should see that sgemm is responsible for 90% or more of the computation time. If that's not the case, then please send us your findings!
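As a quick back-of-the-envelope check, you can also time a single product of the dominant size with the BenchTimer helper used in the code below; each iteration of your function performs roughly two dozen products of about this shape, so scaling the number printed here gives a lower bound for one pass:
Code: Select all
#include <Eigen/Core>
#include <bench/BenchTimer.h>
#include <iostream>

int main() {
  Eigen::MatrixXf A = Eigen::MatrixXf::Random(1000, 1000);
  Eigen::MatrixXf B = Eigen::MatrixXf::Random(1000, 1000);
  Eigen::MatrixXf C(1000, 1000);

  Eigen::BenchTimer t;
  // Best of 4 trials of a single sgemm of the size appearing in the network.
  BENCH(t, 4, 1, C.noalias() = A * B);
  std::cout << t.best() << " s per 1000x1000 product\n";
}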

Nonetheless, I've also tried your code without MKL, i.e., using pure Eigen. With Eigen 3.2 it takes 1.8 s without OpenMP and 0.56 s with OpenMP (4 threads). When switching to 3.3beta1 and enabling FMA (-mfma), I get 0.68 s without OpenMP and 0.29 s with OpenMP. I'm on a Core i7 @ 2.6 GHz.

Edit: after removing the calls to tanh, I get 0.23 s, meaning that after all these optimizations the matrix products no longer represent 90% of the computation... so there is room for improvement there too.


Self-contained code below (slightly modified to remove some temporaries):

Code: Select all
#include <Eigen/Core>
#include <bench/BenchTimer.h>
#include <iostream>
using namespace Eigen;

ArrayXXf shared_vars[16];
#define TRACK
//std::cout << __LINE__ << "\n";
EIGEN_DONT_INLINE float foo(const ArrayXXf &input)
{

   // Calculate all of the computation nodes
   Eigen::ArrayXXf node_20 = ((input).matrix() * (shared_vars[0]).matrix());   node_20 = (node_20 + shared_vars[1].replicate(1000, 1)) .unaryExpr(std::ptr_fun(tanhf));TRACK
   Eigen::ArrayXXf node_24 = ((node_20).matrix() * (shared_vars[2]).matrix()); node_24 = (node_24 + shared_vars[3].replicate(1000, 1)) .unaryExpr(std::ptr_fun(tanhf));TRACK
   Eigen::ArrayXXf node_28 = ((node_24).matrix() * (shared_vars[4]).matrix()); node_28 = (node_28 + shared_vars[5].replicate(1000, 1)) .unaryExpr(std::ptr_fun(tanhf));TRACK
   Eigen::ArrayXXf node_32 = ((node_28).matrix() * (shared_vars[6]).matrix()); node_32 = (node_32 + shared_vars[7].replicate(1000, 1)) .unaryExpr(std::ptr_fun(tanhf));TRACK
   Eigen::ArrayXXf node_36 = ((node_32).matrix() * (shared_vars[8]).matrix()); node_36 = (node_36 + shared_vars[9].replicate(1000, 1)) .unaryExpr(std::ptr_fun(tanhf));TRACK
   Eigen::ArrayXXf node_40 = ((node_36).matrix() * (shared_vars[10]).matrix());node_40 = (node_40 + shared_vars[11].replicate(1000, 1)).unaryExpr(std::ptr_fun(tanhf));TRACK
   Eigen::ArrayXXf node_44 = ((node_40).matrix() * (shared_vars[12]).matrix());node_44 = (node_44 + shared_vars[13].replicate(1000, 1)).unaryExpr(std::ptr_fun(tanhf));TRACK
   Eigen::ArrayXXf node_47 = ((node_44).matrix() * (shared_vars[14]).matrix());node_47 += shared_vars[15].replicate(1000, 1);TRACK
   Eigen::ArrayXXf node_59 = 0.001000 * (1.0 / (1.0 + (-node_47).exp()) - input);TRACK
   Eigen::ArrayXXf node_69 = ((node_59).matrix() * (shared_vars[14].transpose()).matrix());  node_69 *= (1.000000 - node_44.square());TRACK
   Eigen::ArrayXXf node_79 = ((node_69).matrix() * (shared_vars[12].transpose()).matrix());  node_79 *= (1.000000 - node_40.square());TRACK
   Eigen::ArrayXXf node_89 = ((node_79).matrix() * (shared_vars[10].transpose()).matrix());  node_89 *= (1.000000 - node_36.square());TRACK
   Eigen::ArrayXXf node_99 = ((node_89).matrix() * (shared_vars[8].transpose()).matrix());   node_99 *= (1.000000 - node_32.square());TRACK
   Eigen::ArrayXXf node_109 = ((node_99).matrix() * (shared_vars[6].transpose()).matrix());  node_109 *= (1.000000 - node_28.square());TRACK
   Eigen::ArrayXXf node_119 = ((node_109).matrix() * (shared_vars[4].transpose()).matrix()); node_119 *= (1.000000 - node_24.square());TRACK
   Eigen::ArrayXXf node_129 = ((node_119).matrix() * (shared_vars[2].transpose()).matrix()); node_129 *= (1.000000 - node_20.square());TRACK

   // Update all shared variables
   shared_vars[0].matrix().noalias() -= 0.010000 * (input.transpose()).matrix() * (node_129).matrix();
   shared_vars[1] -= 0.010000 * node_129.colwise().sum();
   shared_vars[2].matrix().noalias() -= 0.010000 * (node_20.transpose()).matrix() * (node_119).matrix();
   shared_vars[3] -= 0.010000 * node_119.colwise().sum();
   shared_vars[4].matrix().noalias() -= 0.010000 * (node_24.transpose()).matrix() * (node_109).matrix();
   shared_vars[5] -= 0.010000 * node_109.colwise().sum();
   shared_vars[6].matrix().noalias() -= 0.010000 * (node_28.transpose()).matrix() * (node_99).matrix();
   shared_vars[7] -= 0.010000 * node_99.colwise().sum();
   shared_vars[8].matrix().noalias() -= 0.010000 * (node_32.transpose()).matrix() * (node_89).matrix();
   shared_vars[9] -= 0.010000 * node_89.colwise().sum();
   shared_vars[10].matrix().noalias() -= 0.010000 * (node_36.transpose()).matrix() * (node_79).matrix();
   shared_vars[11] -= 0.010000 * node_79.colwise().sum();
   shared_vars[12].matrix().noalias() -= 0.010000 * (node_40.transpose()).matrix() * (node_69).matrix();
   shared_vars[13] -= 0.010000 * node_69.colwise().sum();
   shared_vars[14].matrix().noalias() -= 0.010000 * (node_44.transpose()).matrix() * (node_59).matrix();
   shared_vars[15] -= 0.010000 * node_59.colwise().sum();

  return (input * node_47).sum() * 0.001000;
}

int main() {
  int n = 1000;
  int m = 784;
  shared_vars[0].setRandom(m,n);
  shared_vars[1].setRandom(1,n);
  shared_vars[2].setRandom(n,n);
  shared_vars[3].setRandom(1,n);
  shared_vars[4].setRandom(n,n);
  shared_vars[5].setRandom(1,n);
  shared_vars[6].setRandom(n,n);
  shared_vars[7].setRandom(1,n);
  shared_vars[8].setRandom(n,n);
  shared_vars[9].setRandom(1,n);
  shared_vars[10].setRandom(n,n);
  shared_vars[11].setRandom(1,n);
  shared_vars[12].setRandom(n,n);
  shared_vars[13].setRandom(1,n);
  shared_vars[14].setRandom(n,m);
  shared_vars[15].setRandom(1,m);
  ArrayXXf input; input.setRandom(n,m);

  BenchTimer t;
  BENCH(t,4,1,foo(input));
  std::cout << t.best() << "\n";
}
benoitsteiner (Registered Member)

Re: Eigen slower than Theano on CPU

Tue Mar 29, 2016 6:15 pm
I upstreamed an efficient implementation of the tanh function recently. You can leverage it by replacing the calls to .unaryExpr(std::ptr_fun(tanhf)) with calls to .tanh(). This implementation takes advantage of the SSE and AVX instructions that are available on your CPU, so it should be 5 to 10x faster than calling the tanhf function directly.
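For example, the first activation in the code above becomes:
Code: Select all
// Same first layer as above, but using Eigen's vectorized tanh(), which
// evaluates with SSE/AVX instead of one libm call per coefficient.
Eigen::ArrayXXf node_20 = ((input).matrix() * (shared_vars[0]).matrix());
node_20 = (node_20 + shared_vars[1].replicate(1000, 1)).tanh();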

To speed things up a little more, you could also leverage the tensor module for your sum reductions. This module makes it possible to multithread every operation, so you should get a nice boost there as well; see the sketch below.
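For instance, a reduction over the batch dimension could look like this (a rough sketch, assuming a recent development checkout with the unsupported Tensor module):
Code: Select all
#define EIGEN_USE_THREADS
#include <unsupported/Eigen/CXX11/Tensor>
#include <iostream>

int main() {
  // Device that dispatches tensor expressions across a thread pool.
  Eigen::ThreadPool pool(4);
  Eigen::ThreadPoolDevice device(&pool, /*num_cores=*/4);

  Eigen::Tensor<float, 2> grad(1000, 784);
  grad.setRandom();

  // Sum over dimension 0 (the batch), the analogue of colwise().sum().
  Eigen::Tensor<float, 1> bias_grad(784);
  Eigen::array<int, 1> dims = {{0}};
  bias_grad.device(device) = grad.sum(dims);

  std::cout << bias_grad(0) << "\n";
}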



