Registered Member
|
I apologize for the simple example/question. I've looked around Stack Overflow and the forum a little for similar examples but haven't had much luck. I am just playing around with Eigen3 now and wanted to test an FIR filter using the fir_double_h method from dspguru. Since most of the work is dot products, I expected it to vectorize really well; however, my speed tests show a slowdown when compiled with -march=native.
My code is at the bottom; here is the output with the different compiler options.

    ylb@Atlas:~/tmp$ g++ -std=c++11 -O3 -I/usr/include/eigen3 test.cc -o speedtest
    ylb@Atlas:~/tmp$ ./speedtest
    163.656ms
    I'm outputing this so that the compiler doesn't outsmart me: 1.94693e+16

Then with -march=native I would expect a significant speedup:

    ylb@Atlas:~/tmp$ g++ -std=c++11 -O3 -I/usr/include/eigen3 -march=native test.cc -o speedtest
    ylb@Atlas:~/tmp$ ./speedtest
    173.98ms
    I'm outputing this so that the compiler doesn't outsmart me: 1.94693e+16

Clearly I am misunderstanding something about Eigen, or gcc, or the vectorization process. Any tips before I start making more complicated Eigen3-based libraries?

    #include <Eigen/Dense>
    #include <vector>
    #include <numeric>
    #include <iostream>
    #include <chrono>

    using namespace std;
    using namespace Eigen;

    int main() {
        int numTaps = 1024;
        int numSamples = 10000000;

        // Create random input
        vector<float> input(numSamples);
        generate(input.begin(), input.end(), rand);

        // Generate taps, then create double taps, a vector of taps twice.
        VectorXf taps = VectorXf::Random(numTaps);
        VectorXf doubleTaps;
        doubleTaps.resize(2*numTaps);
        doubleTaps.head(numTaps) = taps;
        doubleTaps.tail(numTaps) = taps;

        // The delay line
        VectorXf delay = VectorXf::Zero(numTaps);

        float tot = 0;
        int state = 0;
        sleep(0.0);

        auto begin = chrono::high_resolution_clock::now();
        // I would expect this to vectorize really well. The bulk of the computation is a dot product.
        for (const float &i : input) {
            delay[state] = i;
            tot += doubleTaps.segment(numTaps - state, numTaps).dot(delay);
            if (--state < 0) state += numTaps;
        }
        auto end = chrono::high_resolution_clock::now();

        cerr << chrono::duration_cast<chrono::nanoseconds>(end-begin).count()/(10e6) << "ms" << endl;
        cout << "I'm outputing this so that the compiler doesn't outsmart me: " << tot << endl;
        return 0;
    }
|
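For readers unfamiliar with the doubled-taps trick used in the loop above, here is a minimal sketch in plain C++ without Eigen. The function names firDoubleTaps and firDirect are made up for illustration and are not taken from dspguru or Eigen. The idea, as far as the posted code shows it, is that storing the taps twice lets every output be one contiguous dot product against the circular delay line, with no wrap-around handling inside the dot product. The second function is an ordinary direct-form convolution used only as a reference.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper for illustration (not part of Eigen or dspguru).
// Filters `input` with `taps` using a circular delay line and a doubled
// tap array, so each output is a single contiguous dot product.
std::vector<float> firDoubleTaps(const std::vector<float>& taps,
                                 const std::vector<float>& input) {
    const std::size_t n = taps.size();
    assert(n > 0);

    // Two copies of the taps, back to back.
    std::vector<float> doubleTaps(taps);
    doubleTaps.insert(doubleTaps.end(), taps.begin(), taps.end());

    std::vector<float> delay(n, 0.0f);  // circular delay line
    std::vector<float> output;
    output.reserve(input.size());

    std::size_t state = 0;  // slot that receives the newest sample
    for (float x : input) {
        delay[state] = x;
        // Start the tap window so that taps[0] lines up with delay[state].
        const float* h = doubleTaps.data() + (n - state);
        output.push_back(
            std::inner_product(delay.begin(), delay.end(), h, 0.0f));
        // Move backwards through the delay line, wrapping at zero.
        state = (state == 0) ? n - 1 : state - 1;
    }
    return output;
}

// Reference: direct-form convolution y[k] = sum_j taps[j] * input[k - j].
std::vector<float> firDirect(const std::vector<float>& taps,
                             const std::vector<float>& input) {
    std::vector<float> output(input.size(), 0.0f);
    for (std::size_t k = 0; k < input.size(); ++k)
        for (std::size_t j = 0; j < taps.size() && j <= k; ++j)
            output[k] += taps[j] * input[k - j];
    return output;
}
```

With small integer-valued inputs the two functions should agree exactly; with arbitrary floats they can differ in the last bits because the summation order is not the same.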
Registered Member
|
I just wanted to add that my original post was using Eigen 3.2.5 installed from the Ubuntu 14.04 repository. Installing Eigen 3.3 from source yields more positive results.
Using eigen 3.3 and no march=native
    111.006ms
    I'm outputing this so that the compiler doesn't outsmart me: 1.94693e+16

Using eigen 3.3 with march=native
    71.2423ms
    I'm outputing this so that the compiler doesn't outsmart me: 1.94694e+16

Using eigen 3.2 and no march=native
    162.678ms
    I'm outputing this so that the compiler doesn't outsmart me: 1.94693e+16

Using eigen 3.2 with march=native
    175.795ms
    I'm outputing this so that the compiler doesn't outsmart me: 1.94693e+16
|
Moderator
|
The version without -march=native is already vectorized with SSE2, so at best, if your CPU supports AVX and you are using Eigen 3.3, you could expect a 2x gain with -march=native.
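The 2x figure follows from register width: an SSE register holds 4 floats and an AVX register holds 8. As a rough sketch, the helper below reports which width the compiler flags enabled. The name simdFloatLanes is made up for illustration, and it only checks x86 macros that GCC and Clang predefine; any other architecture is simply reported as 1 here.

```cpp
#include <cassert>

// Hypothetical helper: floats per SIMD register for the widest x86
// instruction set enabled at compile time. Relies on macros predefined
// by GCC and Clang; non-x86 targets fall through to 1 in this sketch.
constexpr int simdFloatLanes() {
#if defined(__AVX512F__)
    return 16;  // 512-bit registers
#elif defined(__AVX__)
    return 8;   // 256-bit registers
#elif defined(__SSE2__)
    return 4;   // 128-bit registers
#else
    return 1;   // no x86 SIMD macro seen
#endif
}
```

Built for x86-64 without -march=native this typically gives 4; with -march=native on an AVX-capable CPU it would give 8 or more. Whether Eigen actually uses the wider registers still depends on the Eigen version, as noted above.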
Moreover, with GCC it is a bad idea to benchmark code inside the main() function. For some unknown reason, GCC usually does weird things there. Here is a more proper version with better guarantees of reproducibility:
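The code from this reply is not preserved in the thread. The following is only a minimal sketch of the same advice, assuming nothing beyond the standard library: the timed work lives in its own non-inlined function rather than in main(), and the measurement is repeated with the best time kept. The names dotKernel and bestOfMs are hypothetical, the kernel is a plain dot product standing in for the FIR loop, and `__attribute__((noinline))` is a GCC/Clang extension.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>
#include <numeric>
#include <vector>

// Hypothetical kernel standing in for the FIR loop. The noinline attribute
// (GCC/Clang) keeps the body from being folded into the caller.
__attribute__((noinline))
float dotKernel(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    return std::inner_product(a.begin(), a.end(), b.begin(), 0.0f);
}

// Runs `fn` `tries` times and returns the best wall-clock time in
// milliseconds. Taking the minimum reduces noise from other processes.
template <typename Fn>
double bestOfMs(int tries, Fn&& fn) {
    double best = std::numeric_limits<double>::infinity();
    for (int t = 0; t < tries; ++t) {
        const auto begin = std::chrono::steady_clock::now();
        fn();
        const auto end = std::chrono::steady_clock::now();
        const std::chrono::duration<double, std::milli> elapsed = end - begin;
        best = std::min(best, elapsed.count());
    }
    return best;
}
```

A call might look like `bestOfMs(10, [&] { sink = dotKernel(a, b); })`, where `sink` is a variable that is printed afterwards so the compiler cannot discard the work.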
|
Moderator
|
BTW, 10e6 == 10^7 not 10^6
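In other words, dividing a nanosecond count by 10e6 gives a value ten times smaller than the true number of milliseconds, so a printed 163.656ms corresponds to roughly 1636.56 ms. The ratios between runs are unaffected. One way to avoid the hand-written divisor, sketched here with a made-up helper name toMilliseconds, is to let std::chrono do the conversion:

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: converts a nanosecond duration to a floating-point
// millisecond count using std::chrono's own ratio arithmetic.
double toMilliseconds(std::chrono::nanoseconds ns) {
    return std::chrono::duration<double, std::milli>(ns).count();
}
```

In the original program this could be applied to `end - begin` after a `duration_cast` to nanoseconds, in place of the division by 10e6.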
|
Registered Member
|
What's a factor of 10 between friends! Haha. Thanks for the quick reply; that makes a lot more sense now and explains the roughly 2x speedup I saw when using Eigen 3.3. I'll have to take a look at the bench utility you used. It seems very helpful, and less error-prone than anything I'd come up with. Thanks again for the info. |