
dot product not utilizing SSE

triedl

Tue Feb 10, 2015 7:43 pm
I compute the dot product of a vector and a vector segment. Both are complex<double>-valued. This is the code in question:

Code:
asm("#it begins here!");
output = w.dot(vec.segment(idxS, L));
asm("#it ends here!"); }


where L = w.size() and idxS is a non-negative integer less than or equal to vec.size()-L.

I compile my code with the flags "-std=c++11 -g -O3 -Wall -march=native". I am on the latest Mac OS X and compile with Apple LLVM version 6.0 (clang-600.0.56) (based on LLVM 3.5svn). The problem is that Eigen does not use packed SSE instructions for the dot product, which probably makes the code slower than necessary.

This is the assembly code I get for the above code snippet:

Code:
LBB11_9:                                ## =>This Inner Loop Header: Depth=1
   ##DEBUG_VALUE: multiplyVectorOnBufferSegment<Eigen::Block<Eigen::Matrix<std::__1::complex<double>, -1, -1, 0, -1, -1>, 1, -1, false> >:output <- RDX
   ##DEBUG_VALUE: i <- 1
   ##DEBUG_VALUE: coeffByOuterInner:outer <- 0
   ##DEBUG_VALUE: coeff:col <- 0
   ##DEBUG_VALUE: coeff:col <- 0
   ##DEBUG_VALUE: coeff:row <- 0
   .loc   45 275 0                ## ./Eigen/src/Core/CoreEvaluators.h:275:0
   vmovsd   -8(%rdi), %xmm1
   vmovsd   (%rdi), %xmm2
Ltmp810:
   ##DEBUG_VALUE: coeff:col <- 0
   .loc   11 394 0                ## /Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/complex:394:0
   vmovsd   -8(%rsi), %xmm3
Ltmp811:
   .loc   11 395 0                ## /Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/complex:395:0
   vmovsd   (%rsi), %xmm4
Ltmp812:
   .loc   70 80 35                ## ./Eigen/src/Core/util/BlasUtil.h:80:35
   vmulsd   %xmm3, %xmm1, %xmm5
   .loc   70 80 69                ## ./Eigen/src/Core/util/BlasUtil.h:80:69
   vmulsd   %xmm4, %xmm2, %xmm6
   vaddsd   %xmm6, %xmm5, %xmm5
Ltmp813:
   ##DEBUG_VALUE: complex:__re <- XMM5
   ##DEBUG_VALUE: complex:__re <- XMM5
   .loc   70 80 102               ## ./Eigen/src/Core/util/BlasUtil.h:80:102
   vmulsd   %xmm4, %xmm1, %xmm1
   .loc   70 80 136               ## ./Eigen/src/Core/util/BlasUtil.h:80:136
   vmulsd   %xmm3, %xmm2, %xmm2
   vsubsd   %xmm2, %xmm1, %xmm1
Ltmp814:
   ##DEBUG_VALUE: complex:__im <- XMM1
   ##DEBUG_VALUE: complex:__im <- XMM1
   .loc   52 202 0                ## ./Eigen/src/Core/Redux.h:202:0
   vunpcklpd   %xmm1, %xmm5, %xmm1 ## xmm1 = xmm5[0],xmm1[0]
Ltmp815:
   .loc   11 415 0                ## /Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/complex:415:0
   vaddpd   %xmm1, %xmm0, %xmm0
Ltmp816:
   .loc   52 201 0                ## ./Eigen/src/Core/Redux.h:201:0
   addq   $16, %rsi
   addq   %rcx, %rdi
   decq   %rax
   jne   LBB11_9
Ltmp817:
LBB11_10:                               ## %_ZN5Eigen8internal11dot_nocheckINS_5BlockINS_6MatrixINSt3__17complexIdEELin1ELin1ELi0ELin1ELin1EEELi1ELin1ELb0EEENS2_INS3_IS6_Lin1ELi1ELi0ELin1ELi1EEELin1ELi1ELb0EEELb1EE3runERKNS_10MatrixBaseIS8_EERKNSC_ISA_EE.exit
   ##DEBUG_VALUE: multiplyVectorOnBufferSegment<Eigen::Block<Eigen::Matrix<std::__1::complex<double>, -1, -1, 0, -1, -1>, 1, -1, false> >:output <- RDX
   .loc   68 252 0                ## ./RingBuffer.h:252:0
   vmovupd   %xmm0, (%rdx)
   .loc   68 253 0                ## ./RingBuffer.h:253:0
   ## InlineAsm Start
   #it ends here!
   ## InlineAsm End


Each complex<double> occupies 128 bits in memory, so I would not expect memory alignment to be an issue. I had thought something like the following would work and speed up the computation by a factor of two: http://wwweic.eri.u-tokyo.ac.jp/computer/manual/altix/compile/CC/Intel_Cdoc100/main_cls/mergedProjects/intref_cls/common/intref_sample_double_comp.htm
ggael

Re: dot product not utilizing SSE

Wed Feb 11, 2015 11:55 am
You probably did not look at the right part of the assembly, because here is what I got (same compiler, same OS):

clang++ -O3 -DNDEBUG -std=c++11 -mavx -S
Code:
LBB1_4:                                 ## %.lr.ph25
                                        ## =>This Inner Loop Header: Depth=1
   vmovupd   -32(%rdx), %xmm3
   vinsertf128   $1, -16(%rdx), %ymm3, %ymm3
   vmovupd   -32(%rsi), %xmm4
   vinsertf128   $1, -16(%rsi), %ymm4, %ymm4
   vxorpd   %ymm2, %ymm3, %ymm3
   vmovddup   %ymm3, %ymm5
   vmulpd   %ymm5, %ymm4, %ymm5
   vunpckhpd   %ymm3, %ymm3, %ymm3 ## ymm3 = ymm3[1,1,3,3]
   vpermilpd   $5, %ymm4, %ymm4 ## ymm4 = ymm4[1,0,3,2]
   vmulpd   %ymm3, %ymm4, %ymm3
   vaddsubpd   %ymm3, %ymm5, %ymm3
   vaddpd   %ymm3, %ymm0, %ymm0
   vmovupd   (%rdx), %xmm3
   vinsertf128   $1, 16(%rdx), %ymm3, %ymm3
   vmovupd   (%rsi), %xmm4
   vinsertf128   $1, 16(%rsi), %ymm4, %ymm4
   vxorpd   %ymm2, %ymm3, %ymm3
   vmovddup   %ymm3, %ymm5
   vmulpd   %ymm5, %ymm4, %ymm5
   vunpckhpd   %ymm3, %ymm3, %ymm3 ## ymm3 = ymm3[1,1,3,3]
   vpermilpd   $5, %ymm4, %ymm4 ## ymm4 = ymm4[1,0,3,2]
   vmulpd   %ymm3, %ymm4, %ymm3
   vaddsubpd   %ymm3, %ymm5, %ymm3
   vaddpd   %ymm3, %ymm1, %ymm1
   addq   $4, %rdi
   addq   $64, %rsi
   addq   $64, %rdx
   cmpq   %r11, %rdi
   jl   LBB1_4


and with:
clang++ -O3 -DNDEBUG -std=c++11 -march=native -S
Code:
LBB0_4:                                 ## %.lr.ph30.i
                                        ## =>This Inner Loop Header: Depth=1
   movdqu   -16(%rdx), %xmm3
   movupd   -16(%rdi), %xmm4
   pshufd   $68, %xmm3, %xmm5       ## xmm5 = xmm3[0,1,0,1]
   mulpd   %xmm4, %xmm5
   movhlps   %xmm3, %xmm3            ## xmm3 = xmm3[1,1]
   pshufd   $78, %xmm4, %xmm4       ## xmm4 = xmm4[2,3,0,1]
   mulpd   %xmm3, %xmm4
   xorpd   %xmm2, %xmm4
   addpd   %xmm5, %xmm4
   addpd   %xmm4, %xmm0
   movdqu   (%rdx), %xmm3
   movupd   (%rdi), %xmm4
   pshufd   $68, %xmm3, %xmm5       ## xmm5 = xmm3[0,1,0,1]
   mulpd   %xmm4, %xmm5
   movhlps   %xmm3, %xmm3            ## xmm3 = xmm3[1,1]
   pshufd   $78, %xmm4, %xmm4       ## xmm4 = xmm4[2,3,0,1]
   mulpd   %xmm3, %xmm4
   xorpd   %xmm2, %xmm4
   addpd   %xmm5, %xmm4
   addpd   %xmm4, %xmm1
   addq   $2, %rsi
   addq   $32, %rdi
   addq   $32, %rdx
   cmpq   %rcx, %rsi
   jl   LBB0_4
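
For anyone reading the SSE2 loop above, here is a rough scalar sketch of the arithmetic each packed step appears to perform: duplicate the real part of one operand and multiply, duplicate the imaginary part and multiply by the other operand with its halves swapped, flip the sign of one lane, then add. It is only an illustration and not Eigen's actual implementation. The names conj_mul_packed_style, dot_packed_style and dot_reference are made up for this example, and the two accumulators are meant to mirror xmm0 and xmm1 in the unrolled loop.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Hypothetical helper: conj(a) * b, spelled out lane by lane in the
// order the packed loop seems to use. Plain scalar code, no intrinsics.
cd conj_mul_packed_style(cd a, cd b) {
    // "Duplicate a.re into both lanes, multiply by b = [b.re, b.im]".
    double p0 = a.real() * b.real();
    double p1 = a.real() * b.imag();
    // "Duplicate a.im into both lanes, multiply by swapped b = [b.im, b.re]".
    double q0 = a.imag() * b.imag();
    double q1 = a.imag() * b.real();
    // Sign flip on the upper lane (the xorpd with a mask), then add.
    return cd(p0 + q0, p1 - q1);
}

// Hypothetical helper: sum of conj(a[i]) * b[start + i], unrolled by two
// with separate accumulators, like the two register accumulators above.
cd dot_packed_style(const std::vector<cd>& a, const std::vector<cd>& b,
                    std::size_t start) {
    assert(start + a.size() <= b.size());
    cd acc0(0.0, 0.0), acc1(0.0, 0.0);
    const std::size_t n = a.size();
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += conj_mul_packed_style(a[i], b[start + i]);
        acc1 += conj_mul_packed_style(a[i + 1], b[start + i + 1]);
    }
    if (i < n) {  // leftover element when the length is odd
        acc0 += conj_mul_packed_style(a[i], b[start + i]);
    }
    return acc0 + acc1;
}

// Straightforward reference using std::conj, for comparison.
cd dot_reference(const std::vector<cd>& a, const std::vector<cd>& b,
                 std::size_t start) {
    assert(start + a.size() <= b.size());
    cd sum(0.0, 0.0);
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += std::conj(a[i]) * b[start + i];
    }
    return sum;
}
```

Calling dot_packed_style(a, b, s) plays the role of a.dot(b.segment(s, a.size())) here, with the first argument conjugated, which matches the sign pattern in the scalar assembly from the first post (real part is a sum of products, imaginary part is a difference).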


C++ file to reproduce:
Code:
#include <Eigen/Dense>
using namespace Eigen;
std::complex<double> foo(VectorXcd &a, VectorXcd &b, int s, int l) {
  return a.dot(b.segment(s,l));
}

