
Odd performance on axpy

fromain
Registered Member


Sun May 18, 2014 8:50 pm
Hi everyone

For my first post here I'd like to say that I have been using Eigen on a couple of HPC projects, and I find it to be a gorgeous library with beautiful documentation.
Deep down inside I have a thing for statistics, so I thought I would run a quick benchmark of Eigen and whatever other libraries I can get to run on Windows in less than 2 minutes. Naturally I started with BLAS level 1, axpy.
However, the results I get are not really consistent with the ones given here http://eigen.tuxfamily.org/index.php?title=Benchmark or there https://code.google.com/p/blaze-lib/wiki/Benchmarks. This was compiled with Intel Compiler 14 and run on an i7 4770k (the process had realtime priority):

[Two images: benchmark result plots]

Firstly, the other benchmarks seem to reach nearly 10000 MFLOPS (at least with MKL) while I only reach ~1000 MFLOPS. Is my CPU somehow slower than the ones in the above benchmarks, or am I counting wrong? I used MFLOPS = 1e-6 * 2 * N / t, with t the time in seconds.
Oops, I did account for the number of repetitions, but I forgot to update its value when plotting. So the order of magnitude of the MFLOPS actually looks OK.
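
For reference, the MFLOPS formula above, with the repetition count included, can be sketched as follows. This is only an illustration: the function name axpy_mflops and its parameters are made up here and are not part of the benchmark code in this thread.

```cpp
#include <cassert>
#include <cstddef>

// MFLOPS for an axpy (y += alpha * x) on vectors of length n.
// Each element costs one multiply and one add, hence the factor 2.
// `repetitions` is how many times the kernel ran inside the timed region;
// `seconds` is the total elapsed time for all of those runs.
// (Hypothetical helper, not taken from the original benchmark.)
inline double axpy_mflops(std::size_t n, std::size_t repetitions, double seconds)
{
    assert(seconds > 0.0);
    const double flops = 2.0 * static_cast<double>(n) * static_cast<double>(repetitions);
    return 1e-6 * flops / seconds;
}
```

For example, 1000 repetitions on N = 5000 that take 1 ms in total come out at about 10000 MFLOPS, while leaving out the repetition count would report about 10.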

Secondly, in other benchmarks, Eigen reaches at least the first MKL plateau (around N=5000). Am I compiling it wrong? I used NDEBUG and O2. I also ran the Intel Performance Guide, which had me use O3, QxHost (to build for the host architecture) and profile-guided optimization; the results were similar.

Here is a build log; there are a lot of options that I have no idea about.
Code:
Build started 2014-05-18 22:31:16.
     1>Project "D:\Benching\Benching\Benching.vcxproj" on node 2 (Rebuild target(s)).
     1>ClCompile:
         C:\Program Files (x86)\Intel\Composer XE 2013 SP1\bin\Intel64\icl.exe /c /Qvc12 /Qlocation,link,"C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64" /Zi /nologo /W3 /O2 /Oi /Qipo /Qftz- /Qunroll:500 /D __INTEL_COMPILER=1400 /D WIN32 /D NDEBUG /D _CONSOLE /D _LIB /D _UNICODE /D UNICODE /EHsc /MD /GS /Gy /Zc:wchar_t /Zc:forScope /Yc"stdafx.h" /Fp"x64\Release\Benching.pch" /Fo"x64\Release\\" /Fd"x64\Release\vc120.pdb" /TP stdafx.cpp /QxHost
         stdafx.cpp
         C:\Program Files (x86)\Intel\Composer XE 2013 SP1\bin\Intel64\icl.exe /c /Qvc12 /Qlocation,link,"C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64" /Zi /nologo /W3 /O2 /Oi /Qipo /Qftz- /Qunroll:500 /D __INTEL_COMPILER=1400 /D WIN32 /D NDEBUG /D _CONSOLE /D _LIB /D _UNICODE /D UNICODE /EHsc /MD /GS /Gy /Zc:wchar_t /Zc:forScope /Yu"stdafx.h" /Fp"x64\Release\Benching.pch" /Fo"x64\Release\\" /Fd"x64\Release\vc120.pdb" /TP Benching.cpp /QxHost
         Benching.cpp
       Link:
         C:\Program Files (x86)\Intel\Composer XE 2013 SP1\bin\Intel64\xilink.exe /OUT:"D:\Benching\x64\Release\Benching.exe" /INCREMENTAL:NO /NOLOGO mkl_intel_lp64.lib mkl_core.lib mkl_sequential.lib "libyaml-cppmd.lib" kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib /MANIFEST /ManifestFile:"x64\Release\Benching.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /DEBUG /PDB:"D:\Benching\x64\Release\Benching.pdb" /SUBSYSTEM:CONSOLE /OPT:REF /OPT:ICF /TLBID:1 /DYNAMICBASE /NXCOMPAT /IMPLIB:"D:\Benching\x64\Release\Benching.lib" /MACHINE:X64 x64\Release\Benching.obj
         x64\Release\stdafx.obj
         xilink: executing 'link'
         Benching.vcxproj -> D:\Benching\x64\Release\Benching.exe
     1>Done Building Project "D:\Benching\Benching\Benching.vcxproj" (Rebuild target(s)).

Build succeeded.

Time Elapsed 00:00:10.83


And the code I'm using looks something like this:
Code:
namespace EigenBench
{
   using namespace Eigen;
   template<typename precision>
   inline void axpy(Matrix<precision, Dynamic, 1> x, Matrix<precision, Dynamic, 1> y, precision alpha, profiler ttp, string ss)
   {
      ttp.tic(ss);
      for (int inner_rep = 0; inner_rep < BENCH::inner_rep_max /*1000*/; ++inner_rep)
      {
         y.noalias() += alpha*x;
      }
      ttp.toc(ss);      
   }
}

---- main.cpp ----
   ttp.tic("float");
   for (int p = 0; p < BENCH::size_max; ++p)
   {
      for (int s = 1; s < 9; ++s)
      {
         const int size = s * std::pow(10, p);
         std::stringstream ss;
         ss << "s" << size;
         std::string size_text = ss.str();

         Eigen::Matrix<float, Eigen::Dynamic, 1> y = y_float.block(0, 0, size, 1);
         Eigen::Matrix<float, Eigen::Dynamic, 1> x = x_float.block(0, 0, size, 1);

         for (int rep = 0; rep < BENCH::rep_max /*50*/; ++rep)
         {
            EigenBench::axpy<float>(x, y, a_float, ttp, size_text);
         }
      }
   }
   ttp.toc("float", "EIGEN axpy float took (s): ");
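
A side note on the axpy function in the snippet above: as written, x, y and ttp are all taken by value, so each call works on copies. Whether that matters for the measurement depends on the real profiler class, which is not shown here. The sketch below only illustrates the C++ semantics, using std::vector as a stand-in for the Eigen vectors; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in kernel with both vectors taken BY VALUE, mirroring the signature
// in the snippet above: the update is applied to a local copy of y and is
// thrown away when the function returns.
inline void axpy_by_value(std::vector<double> x, std::vector<double> y, double alpha)
{
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] += alpha * x[i];
}

// Same kernel with x by const reference and y by reference: no copies are
// made and the caller's y is updated in place.
inline void axpy_by_ref(const std::vector<double>& x, std::vector<double>& y, double alpha)
{
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] += alpha * x[i];
}
```

With the by-value version the caller's y is left unchanged after the call, and the same would apply to any timing data stored inside a profiler object that is passed by value.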


So all in all, am I missing something in the code/compile options, or is this the normal performance of Eigen on my machine?

Best,
Romain
ggael
Moderator

Re: Odd performance on axpy

Mon May 19, 2014 10:43 am
It seems that you're compiling on a 32-bit system. Either compile in 64-bit mode or enable vectorization (e.g., SSE with Eigen 3.2). You might even enable AVX with the devel branch.
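
One quick way to check both points is to print what the compiler itself reports for the build. A minimal sketch, assuming a compiler that predefines the usual SIMD macros (which ones are defined varies by compiler and flags); the helper names pointer_bits and simd_macros_defined are hypothetical and the code does not use Eigen itself.

```cpp
#include <cassert>
#include <climits>
#include <string>

// Pointer width of the compiled binary: 64 for an x64 build, 32 for x86.
inline int pointer_bits()
{
    const int bits = static_cast<int>(sizeof(void*) * CHAR_BIT);
    assert(bits == 32 || bits == 64);
    return bits;
}

// Lists the SIMD-related macros that the compiler predefined for this build.
// _M_X64 and _M_IX86_FP are MSVC-style macros; __SSE2__, __AVX__, __AVX2__
// and __FMA__ are GCC/Clang-style (some are also set by other compilers).
inline std::string simd_macros_defined()
{
    std::string out;
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    out += "SSE2 ";
#endif
#if defined(__AVX__)
    out += "AVX ";
#endif
#if defined(__AVX2__)
    out += "AVX2 ";
#endif
#if defined(__FMA__)
    out += "FMA ";
#endif
    return out.empty() ? std::string("none") : out;
}
```

Printing pointer_bits() and simd_macros_defined() from main in the benchmark executable shows whether the build is really 64-bit and which instruction sets the compiler considers enabled.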
fromain
Registered Member

Re: Odd performance on axpy

Mon May 19, 2014 11:32 am
In Visual Studio the platform is set to x64.
As far as I understand, the QxHost flag is supposed to enable optimizations based on the compiling CPU. When I use QxHost together with e.g. QxSSE3, I get a warning saying that QxHost overrides QxSSE3.
However it seems that something fishy is going on here, because whether I compile with QxHost, SSE*, AVX or none of them, the performance stays the same.

CPU-Z reports the following instruction sets:
MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, EM64T, VT-x, AES, AVX, AVX2, FMA3


For example, I get the same results when compiling with:
Code:
1>------ Rebuild All started: Project: Benching, Configuration: Release x64 ------
1>  icl /Qvc12 "/Qlocation,link,C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64" /Zi /W3 /O3 /Oi /Qipo /Qftz- /Qunroll:0 -D __INTEL_COMPILER=1400 -D WIN32 -D NDEBUG -D _CONSOLE -D _LIB -D _UNICODE -D UNICODE /EHsc /MD /GS /Gy /QaxCORE-AVX2 /QxCORE-AVX2 /Zc:wchar_t /Zc:forScope /Qstd=c++11 /Ycstdafx.h /Fpx64\Release\Benching.pch /Fox64\Release\ /Fdx64\Release\vc120.pdb /TP stdafx.cpp
1> 
1>  Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.3.202 Build 20140422
1>  Copyright (C) 1985-2014 Intel Corporation.  All rights reserved.
1> 
1>  stdafx.cpp
1>  icl /Qvc12 "/Qlocation,link,C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64" /Zi /W3 /O3 /Oi /Qipo /Qftz- /Qunroll:0 -D __INTEL_COMPILER=1400 -D WIN32 -D NDEBUG -D _CONSOLE -D _LIB -D _UNICODE -D UNICODE /EHsc /MD /GS /Gy /QaxCORE-AVX2 /QxCORE-AVX2 /Zc:wchar_t /Zc:forScope /Qstd=c++11 /Yustdafx.h /Fpx64\Release\Benching.pch /Fox64\Release\ /Fdx64\Release\vc120.pdb /TP Benching.cpp
1> 
1>  Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.3.202 Build 20140422
1>  Copyright (C) 1985-2014 Intel Corporation.  All rights reserved.
1> 
1>  Benching.cpp
1>  xilink: executing 'link'
1>  Benching.vcxproj -> D:\Benching\x64\Release\Benching.exe
========== Rebuild All: 1 succeeded, 0 failed, 0 skipped ==========
ggael
Moderator

Re: Odd performance on axpy

Mon May 19, 2014 9:36 pm
Can you paste the complete file so that I can try to reproduce with your code? Thanks.
fromain
Registered Member

Re: Odd performance on axpy

Tue May 20, 2014 2:29 pm
OK, so I put a single-file version of the code below. I have now investigated a bit more, and it might be that my timer class is interfering somehow: depending on whether I put the tic/toc calls around the inner repetition loop or the outer repetition loop I get different results, especially when comparing loops with one repetition against loops with several.
But I still don't have a clue about what's happening, so any comment is welcome.

The original purpose of the inner loop was to get a longer runtime, since the time taken for small vectors can be close to the clock resolution. The outer loop is for averaging.
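
The inner/outer loop idea described above can be sketched without the profiler class. This is a minimal illustration using std::chrono::steady_clock, not the timer used in this thread; the function name average_seconds_per_call and its parameters are made up for the example.

```cpp
#include <cassert>
#include <chrono>

// Returns the average time, in seconds, of ONE call of `kernel`.
// The inner loop stretches each timed region so that it is long compared
// to the clock resolution; the outer loop repeats the measurement so the
// result is an average over several timed regions.
// (Hypothetical helper, not taken from the original benchmark.)
template <typename Kernel>
double average_seconds_per_call(Kernel kernel, int outer_reps, int inner_reps)
{
    assert(outer_reps > 0 && inner_reps > 0);
    typedef std::chrono::steady_clock clock;

    double total = 0.0;
    for (int rep = 0; rep < outer_reps; ++rep)
    {
        const clock::time_point start = clock::now();
        for (int inner = 0; inner < inner_reps; ++inner)
            kernel();
        const std::chrono::duration<double> elapsed = clock::now() - start;
        total += elapsed.count();
    }
    // Divide by the total number of kernel calls, not just the outer count.
    return total / (static_cast<double>(outer_reps) * static_cast<double>(inner_reps));
}
```

The clock is read only twice per outer repetition, so its own cost is spread over inner_reps kernel calls rather than being added to every call.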

In case it helps, the code for my timer class is here: https://bitbucket.org/futrzynski/tic-toc-profiler/src/8abf3a0dfc26ad7aa1553b687be0efdd16e6ada2/tic-toc-profiler.hpp

Code:
#include "stdafx.h"

#ifndef USE_PRECOMPILED_HEADER

#define USE_PROFILER
#define USE_BOOST_CHRONO
#include "tic-toc-profiler.hpp"

#include "targetver.h"

#include <cmath>     // std::pow
#include <fstream>
#include <iostream>
#include <sstream>   // std::stringstream
#include <string>

#include <Eigen/Dense>

#include <tchar.h>

namespace BENCH
{
   const int size_max = 5;
   const int rep_max = 10;
   const int inner_rep_max = 1000;
}
#endif


int _tmain(int argc, _TCHAR* argv[])
{
   Eigen::Matrix<double, Eigen::Dynamic, 1> y_double=Eigen::Matrix<double, Eigen::Dynamic, 1>::Random(9 * std::pow(10, BENCH::size_max), 1);
   Eigen::Matrix<double, Eigen::Dynamic, 1> x_double=Eigen::Matrix<double, Eigen::Dynamic, 1>::Random(9 * std::pow(10, BENCH::size_max), 1);

   Eigen::Matrix<double, 1, 1> a_random_double=Eigen::Matrix<double, 1, 1>::Random(1, 1);
   double a_double = a_random_double(0,0);

   profiler ttp;
   ttp.tic("axpy");
   ttp.tic("double");

   for (int p = 0; p < BENCH::size_max; ++p)
   {
      for (int s = 2; s < 18; ++s)
      {
         const int size = 0.5 * s * std::pow(10, p);
         
         Eigen::Matrix<double, Eigen::Dynamic, 1> y = y_double.block(0, 0, size, 1);
         Eigen::Matrix<double, Eigen::Dynamic, 1> x = x_double.block(0, 0, size, 1);

         std::stringstream ss;
         ss << "s" << size;
         std::string size_text = ss.str();

         for (int rep = 0; rep < BENCH::rep_max; ++rep)
         {
            ttp.tic(ss.str());
            for (int inner_rep = 0; inner_rep < BENCH::inner_rep_max; ++inner_rep)
            {               
               y.noalias() += a_double * x;
            }
            ttp.toc(ss.str());
         }
      }
   }
   ttp.toc("double");
   ttp.toc("axpy", "EIGEN axpy double took (s): ");
   ttp.dump("EigenBench.yml");

   return 0;
}

