improve aliasing analysis / restrict pointers?

hansecke
Hi there-

I've recently written a simple vector class which is as simple as can be underneath (*). An example function:

Code:
void sum(floatvec a, floatvec b, floatvec s) {
   const size_t n=a.size();
   for(size_t i=0; i<n; i++) {
      s[i]=a[i]+b[i];
   }
}


Looking at the generated assembler shows that the compiler cannot determine that a, b, and s refer to distinct memory areas. As a result, the loop contains unnecessary loads. If I change the function to the version below, those loads disappear and it performs better in benchmarks:

Code:
void sum(floatvec a, floatvec b, floatvec s) {
   float*__restrict__ va = a.memory();
   float*__restrict__ vb = b.memory();
   float*__restrict__ vs = s.memory();
   const size_t n=a.size();
   for(size_t i=0; i<n; i++) {
      vs[i]=va[i]+vb[i];
   }
}


In C you can fix issues like this by declaring pointers with restrict (GCC also accepts the __restrict__ spelling, including in C++), but I have not been able to figure out where to put that in my class or in the declaration of sum().
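For reference, one place the qualifier can go is on the pointer parameters of a small helper that the class-level function forwards to. The following is only a minimal sketch under stated assumptions: `sum_raw` is a hypothetical name, `std::vector<float>` stands in for the wrapper class, and `__restrict__` is a GCC/Clang extension rather than standard C++.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: the no-alias promise is attached to the pointer
// parameters. __restrict__ is a compiler extension (GCC/Clang), not
// standard C++.
inline void sum_raw(const float* __restrict__ a,
                    const float* __restrict__ b,
                    float* __restrict__ s,
                    std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        s[i] = a[i] + b[i];
    }
}

// Assumption: std::vector<float> stands in for the wrapper class here.
inline void sum(const std::vector<float>& a,
                const std::vector<float>& b,
                std::vector<float>& s) {
    // All three ranges must have the same length.
    assert(a.size() == b.size() && a.size() == s.size());
    sum_raw(a.data(), b.data(), s.data(), a.size());
}
```

The promise is the caller's responsibility: if s overlaps a or b, the behavior is undefined, so a wrapper like this is only safe when the output never shares storage with an input.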

Has somebody looked at the assembler generated by Eigen to see if it is as fast as possible? Has anybody worked with __restrict__ and aliasing analysis to make sure the compiler understands how to make the code as fast as possible?

Thanks!

Hans

(*) My vector class has other features as well, but I can reproduce the problem even with a super-simple class that does no more than wrap a heap-allocated float array.
ggael
Pointer aliasing cannot explain this. My bet is that floatvec::operator[] is expecting an int instead of a std::size_t. In Eigen we are careful to use the same type everywhere for indexes and sizes.
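To illustrate the index-type point, here is a minimal sketch of a wrapper whose operator[] takes the same type that size() returns, so the loop index needs no conversion at each access. The class name `FloatVec` and its layout are assumptions made for illustration; this is not the poster's class, and it passes arguments by reference rather than by value.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical minimal wrapper around a heap-allocated float array.
class FloatVec {
public:
    // Allocates n floats, value-initialized to zero.
    explicit FloatVec(std::size_t n) : data_(new float[n]()), size_(n) {}

    std::size_t size() const { return size_; }

    // The index type matches the type returned by size().
    float& operator[](std::size_t i) { return data_[i]; }
    const float& operator[](std::size_t i) const { return data_[i]; }

private:
    std::unique_ptr<float[]> data_;
    std::size_t size_;
};

inline void sum(const FloatVec& a, const FloatVec& b, FloatVec& s) {
    assert(a.size() == b.size() && a.size() == s.size());
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i) {
        s[i] = a[i] + b[i];
    }
}
```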
hansecke
Hi Gaël,

Thank you for your reply. That was not the reason, but I did finally figure it out:

  • If I compile with -Os (optimize for size) the generated assembler is a very clear 27 lines (13 lines of instructions). As described above, the inner loop of the first version of the sum() function has those superfluous loads.
  • If I compile with -O3 the sum() function gets compiled to an assembler file of 94 lines, most of which I do not understand. GCC creates the correct short inner loops for both versions of sum().
  • If I compile with -O2 the sum() function gets compiled to an assembler file of 31 lines (14 lines of instructions). Both versions have the correct short inner loop and both assembler programs are very understandable.

Takeaway:

  • -Os might generate nicer-looking and shorter code, but even for simple programs it can optimize significantly worse than -O2 or -O3.
  • -O2 might give the best balance of brevity and correct optimization.
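The comparison above can be repeated on a small standalone file. This is only a sketch: the file name sum.cpp and the function name `sum_floats` are hypothetical, and it uses raw pointers rather than the wrapper class. The `-S` flag asks g++ to stop after generating assembly.

```cpp
// sum.cpp (hypothetical file name)
// Emit assembly at each optimization level and compare the loop bodies:
//   g++ -Os -S sum.cpp -o sum_Os.s
//   g++ -O2 -S sum.cpp -o sum_O2.s
//   g++ -O3 -S sum.cpp -o sum_O3.s
#include <cstddef>

// extern "C" keeps the symbol unmangled, so the label sum_floats is easy
// to locate in the generated .s files.
extern "C" void sum_floats(const float* a, const float* b, float* s,
                           std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        s[i] = a[i] + b[i];
    }
}
```

Line counts and loop shapes will differ between compiler versions and targets, so the figures quoted in this thread should not be expected to reproduce exactly.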

Cheers

Hans

