Registered Member
|
Hi everyone
For my first post here, I'd like to say that I have been using Eigen on a couple of HPC projects, and I find it to be a gorgeous library with beautiful documentation. Deep down I have a thing for statistics, so I thought I would run a quick benchmark of Eigen against whatever libraries I could get running on Windows in less than 2 minutes. Naturally I started with BLAS level 1, axpy. However, my results are not really consistent with the ones given here http://eigen.tuxfamily.org/index.php?title=Benchmark or there https://code.google.com/p/blaze-lib/wiki/Benchmarks.

This was compiled with Intel Compiler 14 and run on an i7 4770k (the process had realtime priority):

Firstly, the other benchmarks seem to reach close to 10000 MFLOPS (at least with MKL), while I only reach ~1000 MFLOPS. Is my CPU somehow slower than the ones in the above benchmarks, or am I counting badly? I used MFLOPS = 1e-6 * 2 * N / t, with t the time in seconds.

Ooops, I did account for the number of repetitions, but I forgot to update its value when plotting. So the order of magnitude of the MFLOPS actually looks OK.

Secondly, in other benchmarks, Eigen reaches at least the first MKL plateau (around N=5000). Am I compiling it wrong? I used NDEBUG and O2. I also ran the Intel Performance Guide, which had me use O3, QxHost (to build for the host architecture) and profile-guided optimization; the results were similar. Here is a build log; there are a lot of options that I have no idea about
And the code I'm using looks something like this:
So all in all, am I missing something in the code/compile options, or is this the normal performance of Eigen on my machine?
Best, Romain
|
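To make the flop counting above concrete, here is a minimal sketch of the MFLOPS formula with the repetition count included. The function name `axpy_mflops` and its parameters are made up for illustration; they are not taken from the benchmark code in this thread. It assumes `seconds` is the wall time for all repetitions together.

```cpp
#include <cassert>
#include <cstddef>

// Throughput of `reps` axpy calls on vectors of length n, in MFLOPS.
// Each element of an axpy costs 2 flops (one multiply, one add).
// `seconds` is the total wall time of all `reps` calls (assumption).
double axpy_mflops(std::size_t n, std::size_t reps, double seconds) {
    assert(seconds > 0.0);
    const double flops =
        2.0 * static_cast<double>(n) * static_cast<double>(reps);
    return flops / (seconds * 1e6);
}
```

For instance, `axpy_mflops(5000, 1000, 1e-3)` is about 10000, while forgetting the repetitions (`reps = 1`) with the same time gives about 10, so a mismatch between the repetition count used for timing and the one used for plotting shifts the curve by exactly that factor.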
Moderator
|
It seems that you're compiling for a 32-bit system. Either compile in 64-bit mode or enable vectorization (e.g., SSE with Eigen 3.2). You might even enable AVX with the devel branch.
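One quick way to check what a given build actually targets is to print the pointer width and the SIMD macros the compiler predefines, since Eigen decides at compile time from macros of this kind whether to vectorize. The sketch below uses only standard C++ and common predefined macros; the helper names `pointer_bits` and `simd_summary` are hypothetical, and which macros the Intel compiler defines for a given `/Qx...` flag is an assumption to verify against its documentation.

```cpp
#include <cassert>
#include <string>

// Pointer width of this build: 32 or 64 on common desktop platforms.
int pointer_bits() { return static_cast<int>(sizeof(void*) * 8); }

// Human-readable summary of the target, based on predefined macros.
std::string simd_summary() {
    std::string s = std::to_string(pointer_bits()) + "-bit;";
#if defined(__AVX2__)
    s += " AVX2";
#endif
#if defined(__AVX__)
    s += " AVX";
#endif
    // On x64, SSE2 is always available; 32-bit MSVC reports it via _M_IX86_FP.
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    s += " SSE2";
#endif
    assert(!s.empty());
    return s;
}
```

Printing `simd_summary()` at the start of the benchmark shows whether changing the compiler flags changed the target at all; if the string is identical across builds, the flags are probably not reaching the compiler.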
|
Registered Member
|
In Visual Studio the platform is set to x64.
As far as I understand, the QxHost flag is supposed to enable optimizations based on the CPU of the compiling machine. When I use QxHost together with e.g. QxSSE3, I get a warning saying that QxHost overrides QxSSE3. However, something fishy seems to be going on here, because whether I compile with QxHost, SSE*, AVX or none of them, the performance stays the same. CPU-Z reports the instructions
For example, I get the same results when compiled with
|
Moderator
|
Can you paste the complete file so that I can try to reproduce this with your code? Thanks.
|
Registered Member
|
OK, so I put a 1-file version of the code below. I have now investigated a bit more, and it might be that my timer class is interfering somehow: depending on whether I put the tic/toc calls around the inner repetition loop or around the outer repetition loop, I get different results, especially between runs where the loops have 1 repetition and runs where they have several.
But I still don't have a clue about what's happening, so any comment is welcome. The purpose of the inner loop originally was to get a longer runtime, since the time it takes for small vectors can be close to the clock resolution. The outer loop is for averaging. In case it helps, the code of my timer class is here https://bitbucket.org/futrzynski/tic-toc-profiler/src/8abf3a0dfc26ad7aa1553b687be0efdd16e6ada2/tic-toc-profiler.hpp
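The two-level loop described above can be sketched as follows, with the timer calls placed around the whole inner loop so that the timer overhead is paid once per batch. This is not the code from this thread: it uses `std::chrono::steady_clock` instead of the tic-toc profiler, a plain loop stands in for the Eigen expression `y += a * x`, and the names `axpy` and `mean_batch_seconds` are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Plain-loop stand-in for the Eigen expression y += a * x.
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] += a * x[i];
}

// Times `outer` batches of `inner` axpy calls each and returns the
// mean time per batch in seconds. The inner loop lengthens the measured
// interval beyond the clock resolution; the outer loop averages.
double mean_batch_seconds(double a, const std::vector<double>& x,
                          std::vector<double>& y,
                          std::size_t inner, std::size_t outer) {
    assert(inner > 0 && outer > 0);
    using clock = std::chrono::steady_clock;
    double total = 0.0;
    for (std::size_t o = 0; o < outer; ++o) {
        const auto t0 = clock::now();  // timer overhead paid once per batch
        for (std::size_t r = 0; r < inner; ++r) axpy(a, x, y);
        const auto t1 = clock::now();
        total += std::chrono::duration<double>(t1 - t0).count();
    }
    return total / static_cast<double>(outer);
}
```

Dividing the returned value by `inner` gives the time of a single axpy. Timing each call individually inside the inner loop instead would add the timer overhead to every measurement, which could explain results that differ with where the tic/toc calls sit.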
|