
New to Eigen3, simple example not vectorizing at all.

sayguh
I apologize for the simple example and question. I've looked around Stack Overflow and the forum for similar examples but haven't had much luck. I am just playing around with Eigen 3 and wanted to test an FIR filter using the fir_double_h method from dspguru. Since most of the work is dot products, I expected it to vectorize really well; however, my speed tests show a slowdown when compiled with -march=native.

My code is at the bottom; here is the output with the different compiler options.

ylb@Atlas:~/tmp$ g++ -std=c++11 -O3 -I/usr/include/eigen3 test.cc -o speedtest
ylb@Atlas:~/tmp$ ./speedtest
163.656ms
I'm outputing this so that the compiler doesn't outsmart me: 1.94693e+16

Then, with -march=native, I would expect a significant speedup:

ylb@Atlas:~/tmp$ g++ -std=c++11 -O3 -I/usr/include/eigen3 -march=native test.cc -o speedtest
ylb@Atlas:~/tmp$ ./speedtest
173.98ms
I'm outputing this so that the compiler doesn't outsmart me: 1.94693e+16

Clearly I am misunderstanding something about Eigen, GCC, or the vectorization process. Any tips before I start writing more complicated Eigen 3 based libraries?

#include <Eigen/Dense>
#include <vector>
#include <numeric>
#include <iostream>
#include <chrono>
#include <algorithm>  // std::generate
#include <cstdlib>    // rand
#include <unistd.h>   // sleep

using namespace std;
using namespace Eigen;

int main() {
  int numTaps = 1024;
  int numSamples = 10000000;

  // Create random input
  vector<float> input(numSamples);
  generate(input.begin(), input.end(), rand);

  // Generate taps, then create double taps, a vector of taps twice.
  VectorXf taps = VectorXf::Random(numTaps);

  VectorXf doubleTaps;
  doubleTaps.resize(2*numTaps);
  doubleTaps.head(numTaps) = taps;
  doubleTaps.tail(numTaps) = taps;

  // The delay line
  VectorXf delay = VectorXf::Zero(numTaps);

  float tot = 0;
  int state = 0;

  sleep(0.0);

  auto begin = chrono::high_resolution_clock::now();

  // I would expect this to vectorize really well. The bulk of the computation is a dot product.
  for (const float &i : input) {
    delay[state] = i;
    tot += doubleTaps.segment(numTaps - state, numTaps).dot(delay);

    if (--state < 0)
      state += numTaps;
  }

  auto end = chrono::high_resolution_clock::now();
  cerr << chrono::duration_cast<chrono::nanoseconds>(end-begin).count()/(10e6) << "ms" << endl;
  cout << "I'm outputing this so that the compiler doesn't outsmart me: " << tot << endl;
  return 0;
}
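
The doubled-taps indexing in the loop above can be checked against a plain circular-buffer filter. The following is a minimal sketch in standard C++ without Eigen; the function names firModulo and firDoubled are hypothetical and only illustrate the idea. It assumes, as in the loop above, that delay[state] holds the newest sample and that state lies in [0, numTaps - 1].

```cpp
#include <cstddef>
#include <vector>

// Reference version: circular delay line with explicit modulo indexing.
// delay[state] is the newest sample; older samples follow, wrapping around.
float firModulo(const std::vector<float>& taps,
                const std::vector<float>& delay,
                std::size_t state) {
    const std::size_t n = taps.size();
    float acc = 0.0f;
    for (std::size_t age = 0; age < n; ++age)
        acc += taps[age] * delay[(state + age) % n];
    return acc;
}

// Doubled-taps version: doubleTaps holds the taps twice in a row, so the
// window starting at (n - state) lines up with the delay line element by
// element. Both operands are contiguous and no modulo is needed.
float firDoubled(const std::vector<float>& doubleTaps,
                 const std::vector<float>& delay,
                 std::size_t state) {
    const std::size_t n = delay.size();
    const float* window = doubleTaps.data() + (n - state);
    float acc = 0.0f;
    for (std::size_t j = 0; j < n; ++j)
        acc += window[j] * delay[j];
    return acc;
}
```

Both functions add up the same products, only in a different order, so with general floating-point inputs the results may differ in the last bits. The contiguous inner loop of the second form is what makes a vectorized dot product possible.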
sayguh
I just wanted to add that my original post used Eigen 3.2.5 installed from the Ubuntu 14.04 repository. Installing Eigen 3.3 from source yields more positive results.

Eigen 3.3, without -march=native:
111.006ms
I'm outputing this so that the compiler doesn't outsmart me: 1.94693e+16

Eigen 3.3, with -march=native:
71.2423ms
I'm outputing this so that the compiler doesn't outsmart me: 1.94694e+16

Eigen 3.2, without -march=native:
162.678ms
I'm outputing this so that the compiler doesn't outsmart me: 1.94693e+16

Eigen 3.2, with -march=native:
175.795ms
I'm outputing this so that the compiler doesn't outsmart me: 1.94693e+16
ggael
The version without -march=native is already vectorized using SSE2. So at best, if your CPU supports AVX and you are using Eigen 3.3, you can expect a 2x gain with -march=native. Eigen 3.2 has no AVX support, so it cannot benefit from the wider registers.
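
The 2x figure follows from register width: an SSE register holds 4 floats, an AVX register holds 8. As a rough sketch, the compiler's predefined macros show which width a given set of flags targets. The helper name floatsPerRegister is hypothetical, and the check reflects only what the compiler is allowed to emit, not what Eigen actually selects.

```cpp
#include <cstdio>

// Floats per SIMD register implied by the compiler's target flags.
// Uses GCC/Clang predefined macros only; Eigen itself is not queried.
constexpr int floatsPerRegister() {
#if defined(__AVX__)
    return 8;   // 256-bit registers, e.g. with -march=native on an AVX CPU
#elif defined(__SSE2__)
    return 4;   // 128-bit registers, the x86-64 baseline
#else
    return 1;   // no x86 SIMD macro seen; treated as scalar in this sketch
#endif
}

// Print the width so two builds can be compared side by side.
void printSimdWidth() {
    std::printf("floats per SIMD register: %d\n", floatsPerRegister());
}
```

Building this once with -O3 and once with -O3 -march=native and calling printSimdWidth() in each build shows whether the second build can use AVX at all.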

Moreover, with GCC it is a bad idea to benchmark code inside the main() function. For some unknown reason, GCC usually does weird things there. Here is a more rigorous version with better guarantees of reproducibility:

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <vector>
#include <Eigen/Dense>
#include <bench/BenchTimer.h>
using namespace Eigen;
using namespace std;

// Kept out of line so the benchmarked loop is compiled as its own function.
// Returns float: tot is a float, and an int return type would truncate it.
EIGEN_DONT_INLINE
float foo(vector<float> &input,VectorXf &delay, VectorXf& doubleTaps)
{
  int numTaps = doubleTaps.size()/2;
  float tot = 0;
  int state = 0;
  // I would expect this to vectorize really well. The bulk of the computation is a dot product.
  for (const float &i : input) {
    delay[state] = i;
    tot += doubleTaps.segment(numTaps - state, numTaps).dot(delay);

    if (--state < 0)
      state += numTaps;
  }
  return tot;
}

int main()
{
  int tries = 2;
  int rep = 1;
  BenchTimer t;

  int numTaps = 1024;
  int numSamples = 10000000;

  // Create random input
  vector<float> input(numSamples);
  generate(input.begin(), input.end(), rand);

  // Generate taps, then create double taps, a vector of taps twice.
  VectorXf taps = VectorXf::Random(numTaps);

  VectorXf doubleTaps;
  doubleTaps.resize(2*numTaps);
  doubleTaps.head(numTaps) = taps;
  doubleTaps.tail(numTaps) = taps;
  VectorXf delay = VectorXf::Zero(numTaps);

  float tot = 0;
  BENCH(t, tries, rep, tot += foo(input, delay, doubleTaps));
  std::cout << "Time: " << t.best() << "s (" << tot << ")" << std::endl;
}
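
For readers who do not have Eigen's bench/BenchTimer.h on their include path, the same two ideas (a kernel kept out of main() and out of line, and the best time over several tries) can be sketched with std::chrono alone. This is an approximation under assumptions, not a reimplementation of BenchTimer: the names dotKernel and bestOf are hypothetical, and __attribute__((noinline)) is a GCC/Clang extension standing in for EIGEN_DONT_INLINE.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <limits>
#include <vector>

// The kernel under test, kept out of line so it is compiled as a
// standalone function rather than folded into the caller.
__attribute__((noinline))
float dotKernel(const std::vector<float>& a, const std::vector<float>& b) {
    const std::size_t n = std::min(a.size(), b.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];
    return acc;
}

// Run the kernel `tries` times and return the best wall-clock time in
// seconds. Results are accumulated into `sink` so the calls stay live.
double bestOf(int tries, const std::vector<float>& a,
              const std::vector<float>& b, float& sink) {
    double best = std::numeric_limits<double>::infinity();
    for (int t = 0; t < tries; ++t) {
        const auto begin = std::chrono::steady_clock::now();
        sink += dotKernel(a, b);
        const auto end = std::chrono::steady_clock::now();
        const double seconds = std::chrono::duration<double>(end - begin).count();
        best = std::min(best, seconds);
    }
    return best;
}
```

Taking the minimum over tries filters out runs disturbed by other processes. Printing sink afterwards serves the same purpose as printing tot in the code above.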
ggael
BTW, 10e6 == 10^7 not 10^6 ;)
sayguh
ggael wrote:BTW, 10e6 == 10^7 not 10^6 ;)


What's a factor of 10 between friends! haha.

Thanks for the quick reply. That makes a lot more sense now and explains the roughly 2x speedup I saw when using Eigen 3.3. I'll have to take a look at the bench utility you used; it seems very helpful and less error-prone than anything I'd come up with :)

Thanks again for the info.

