This forum has been archived. All content is frozen. Please use KDE Discuss instead.

Tensor operation optimisation

kenette
Registered Member · Posts: 2 · Karma: 0

Tensor operation optimisation

Wed Sep 23, 2015 4:58 pm
I have been trying to use the Eigen Tensor class to implement a convolutional neural network in C++.

Currently I am trying to beat an existing implementation based on OpenCV (which should be suboptimal compared to Eigen), but my implementation with Eigen is slower. Could you give me some hints on why it could be slower?

Pseudo code

Code:
Eigen::Tensor final_maps
for o in kernel_tensor_dim_0:
    Eigen::Tensor output_maps
    for i in image_tensor_dim_0:
        K = extract_kernel(i)                    // dt_1 = 0.17 ms (with inits)
        I = extract_image(i)                     // dt_2 = 0.01 ms
        convolution_map = I * K                  // dt_3 = 0.01 ms
        output_maps.slice(i) = convolution_map   // dt_4 = 1.6 ms
    final_maps(o) = output_maps.sum(dim_0)



This set of operations is performed around 10^4 times, so the total runtime exceeds a second, whereas with cv::Mat and std::map I reach only 500 ms.
Any clue as to the cause of the bottleneck?
Could it be something with eval()? I let auto build a tensor expression, hoping it would optimize itself. ^^'

UPDATE:

If I assign every sub-result to an Eigen::Tensor, I get:
cost 1: 0.28 ms
cost 2: 0.37 ms
cost 3: 0.8 ms
cost 4: 0.06 ms
ggael
Moderator · Posts: 3447 · Karma: 19

Re: Tensor operation optimisation

Sat Sep 26, 2015 9:20 pm
What are the dimensions and sizes of your tensors? Compiler version and options? Also make sure you declare your intermediate tensors outside the loops.
kenette
Registered Member · Posts: 2 · Karma: 0

Re: Tensor operation optimisation

Mon Sep 28, 2015 11:44 am
The first set of operations performs a convolution of 128 sets of 3 kernels over the 3 input channels of the image.

I define the tensor variables (the intermediates) in the scope of the first loop (over the 128 sets of kernels).
Nothing unnecessary is assigned in the input-channel loop (I tried moving the Eigen::array variables outside the loop, but it made no difference).
The slices of the tensors are 64x64 for the input and 5x5 for the kernels.

The operations defined in the loop cost around 1 ms per iteration, for a total of 0.145794 s.

Compilation is done with clang++ and the flags -std=c++0x -msse2 -O3 (I also tried -msse3, with no difference).


The C++ code

Code:
// For a given set of kernels o
for (int j = 0; j < input.dimension(0); j++) {

    // Extract the kernel for this channel
    Eigen::array<int, 4> koffsets = {j, 0, 0, o};
    Eigen::array<int, 4> kextents = {1, int(current_conv_filter.dimension(1)), int(current_conv_filter.dimension(2)), 1};
    Eigen::array<Eigen::DenseIndex, 2> one_dim({int(current_conv_filter.dimension(1)), int(current_conv_filter.dimension(2))});
    kernel = current_conv_filter.slice(koffsets, kextents).reshape(one_dim);

    // Extract the corresponding input channel
    Eigen::array<int, 3> ioffsets = {j, 0, 0};
    Eigen::array<int, 3> iextents = {1, int(input.dimension(1)), int(input.dimension(2))};
    Eigen::array<Eigen::DenseIndex, 2> two_dim({int(input.dimension(1)), int(input.dimension(2))});
    map = input.slice(ioffsets, iextents).reshape(two_dim);

    // Do the convolution
    Eigen::array<ptrdiff_t, 2> dims({0, 1});
    convolved_map = map.convolve(kernel, dims);

    // Combine onto the current output map
    Eigen::array<int, 3> size_img = {1, int(current_output_map.dimension(1)), int(current_output_map.dimension(2))};
    current_output_map.slice(ioffsets, size_img) = convolved_map;
}

