This forum has been archived. All content is frozen. Please use KDE Discuss instead.

Need Urgent Help! Do not understand the problem!!!

oscarjgv (Registered Member, Posts: 1, Karma: 0)
I am benchmarking some code and I do not understand the computation times that I am measuring. They make absolutely NO sense to me, so here it goes.

(For the final test code, skip to the end!)

I have an optimisation problem with dimensions (Np=100, nx=4, nu=1) which requires the following matrices:
-Matrix H of dimensions Np*nx x Np*nu, to be calculated.
-Matrix F of dimensions Np*nu x Np*nu, to be calculated.
-Matrix Atotal of dimensions Np*nx x nx, filled with random values.
-Matrix Btotal of dimensions Np*nx x nu, filled with random values.
-Matrix Kxtotal of dimensions Np*nu x nx, filled with random values.

Also, I have the following "extra" variables to support the algorithm:
-Matrix Ak of dimensions nx x nx
-Matrix Kx of dimensions nu x nx
-Vector B0 of dimensions nx x 1
-Vector B1 of dimensions nx x 1
-Vector F1 of dimensions nu x 1

All matrices use Eigen's default storage order, which, if I've checked correctly, is column-major. They were declared as doubles and preallocated, e.g. as in the header file below.

Code:
------ Types.h Header File ------
typedef Matrix<double, Np*nx, Np*nu> Matrix_H;
extern Matrix_H H;
------ End of Header File ------


I am running this function in a loop around 12000 times to capture the minimum computation time that I obtain.

Code:
------ Function ------
int Calculate_HF(){
    //Dual Mode H F Matrices
    int row,row0,row_nu;
    for(int k=0;k<Np;k++){
        row=k*nx;     // row offset into H/Atotal/Btotal
        row0=row-nx;  // previous block's row offset (negative at k==0,
                      // but the inner loop below is empty in that case)
        row_nu=k*nu;  // row offset into F/Kxtotal
        H.block<nx,nu>(row,row_nu)=Btotal.block<nx,nu>(row,0);
        F.block<nu,nu>(row_nu,row_nu).setIdentity();
        Ak=Atotal.block<nx,nx>(row,0);
        Kx=Kxtotal.block<nu,nx>(row_nu,0);
        for(int i=0;i<row_nu;i++){
            // Map column segments in place; these shadow the global B0/B1/F1.
            Map<Vector_X> B1(&H(row,i));
            Map<Vector_X> B0(&H(row0,i));
            Map<Vector_U> F1(&F(row_nu,i));
            B1=Ak*B0;
            F1=-Kx*B0;
//          Equivalent block-based formulation:
//          H.block<nx,1>(row,i)=Ak*H.block<nx,1>(row0,i);
//          F.block<nu,1>(row_nu,i)=-Kx*H.block<nx,1>(row0,i);
        }
    }
    return 0;
}
------ End of Function -----


The minimum time that I get when computing this is 12 microseconds per call to the function.

What makes absolutely no sense to me is that if I comment out the lines:
Map<Vector_U>F1(&F(row_nu,i));
F1=-Kx*B0;

which removes the "vector" times "vector" multiplication and leaves only the "matrix" times "vector" one (B1=Ak*B0;), both of which are in the same loop, I measure only 6 microseconds, way less than I would expect.

I mean, just think about it like this: doing the operation (B1=Ak*B0;), which is (4 x 4) x (4 x 1), some number of times takes 6 microseconds.
In contrast, doing the operations (B1=Ak*B0; AND F1=-Kx*B0;), which are (4 x 4) x (4 x 1) AND (1 x 4) x (4 x 1), the same number of times takes 12 microseconds.

That by itself just makes no sense.

On top of that, if I duplicate the first operation as an accumulation onto the previous value (e.g. B1+=Ak*B0;), i.e. compute it twice and add a summation, using the alternative for loop below, I get 10 microseconds.... BELOW THE PREVIOUS, SIMPLER OPERATION!!! :( :(

Code:
for(int i=0;i<row_nu;i++){
    Map<Vector_X> B1(&H(row,i));
    Map<Vector_X> B0(&H(row0,i));
    Map<Vector_U> F1(&F(row_nu,i));
    B1=Ak*B0;
    B1+=Ak*B0;
//  F1=-Kx*B0;
//  H.block<nx,1>(row,i)=Ak*H.block<nx,1>(row0,i);
//  F.block<nu,1>(row_nu,i)=-Kx*H.block<nx,1>(row0,i);
}


I am completely lost and do not know what to make of this. It makes no sense to me whatsoever, and I cannot have an accurate discussion around it without first making sure the measurement is valid.

Can someone help or confirm this?

I am compiling the code with the -O3 -mavx -mfma optimisation flags, and running it on Ubuntu 20.04 with real-time priority (e.g. chrt -r 99 ./main).

Thanks a lot.

Look forward to hearing back from you.

I was able to reproduce the exact results I am getting with this standalone code:


Code:
------ C Code Start ------------
#include <cstdio>   // printf
#include <time.h>   // clock_gettime (POSIX)

#define EIGEN_STACK_ALLOCATION_LIMIT 0
#include "../Eigen/Dense"
using namespace Eigen;
using namespace std;


//General Type Definitions
const int Np=100;
const int nx=4;
const int nu=1;
typedef double Real_T;
typedef Matrix<Real_T, nx, 1> Vector_X;
typedef Matrix<Real_T, nu, 1> Vector_U;
typedef Matrix<Real_T, nx, nx> Matrix_Ak;
typedef Matrix<Real_T, nu, nx> Matrix_Kx;
typedef Matrix<Real_T, Np*nx, nx> Matrix_Atotal;
typedef Matrix<Real_T, Np*nx, nu> Matrix_Btotal;
typedef Matrix<Real_T, Np*nu, nx> Matrix_Kxtotal;
typedef Matrix<Real_T, Np*nx, Np*nu> Matrix_H;
typedef Matrix<Real_T, Np*nu, Np*nu> Matrix_F;

//Matrices declaration
Matrix_Ak Ak;
Matrix_Kx Kx;
Matrix_Atotal Atotal;
Matrix_Btotal Btotal;
Matrix_Kxtotal Kxtotal;
Matrix_H H;
Matrix_F F;
Vector_X B0;
Vector_X B1;
Vector_U F1;

//Time variables
struct timespec real_time_clock_start;
struct timespec real_time_clock_end;
double rt_clock_start=0;
double rt_clock_end=0;
double exec_time=0;
double min_exec_time=100000;   

int Calculate_HF(){
    //Dual Mode H F Matrices
    int row,row0,row_nu;
    for(int k=0;k<Np;k++){
        row=k*nx;
        row0=row-nx;
        row_nu=k*nu;
        H.block<nx,nu>(row,row_nu)=Btotal.block<nx,nu>(row,0);
        F.block<nu,nu>(row_nu,row_nu).setIdentity();
        Ak=Atotal.block<nx,nx>(row,0);
        Kx=Kxtotal.block<nu,nx>(row_nu,0);
        for(int i=0;i<row_nu;i++){
            Map<Vector_X>B1(&H(row,i));
            Map<Vector_X>B0(&H(row0,i));
            Map<Vector_U>F1(&F(row_nu,i));
            B1=Ak*B0;
            B1+=Ak*B0;
//             F1=-Kx*B0;
//             H.block<nx,1>(row,i)=Ak*H.block<nx,1>(row0,i);
//             F.block<nu,1>(row_nu,i)=-Kx*H.block<nx,1>(row0,i);
        }
    }
    return 0;
}

int main(int, const char * const [])
{   
    Atotal.setRandom();
    Btotal.setRandom();
    Kxtotal.setRandom();
    for(int N=0;N<12000;N++){
        clock_gettime(CLOCK_REALTIME,&real_time_clock_start);
        Calculate_HF();
        clock_gettime(CLOCK_REALTIME,&real_time_clock_end);
        rt_clock_start=real_time_clock_start.tv_sec+real_time_clock_start.tv_nsec/1e9;
        rt_clock_end=real_time_clock_end.tv_sec+real_time_clock_end.tv_nsec/1e9;
        exec_time=rt_clock_end-rt_clock_start;
        if(exec_time<min_exec_time)min_exec_time=exec_time;
    }   
    printf("Minimum time = %lf\n",min_exec_time);
    return 0;
}
------- C Code End ---------


And with these build commands:
g++ -std=c++11 -O3 -mavx -mfma -I. -I../Eigen -c -o main.o main.c
g++ -std=c++11 main.o -lm -lstdc++ -lrt -o main

Note: the Eigen library must be in the directory one level above the directory containing main.c (matching the #include "../Eigen/Dense" path).
andrew-dy (Registered Member, Posts: 15, Karma: 0)
A couple of things I know about optimizations (some you probably already know; "1." is the only concrete suggestion):

1. The -O3 flag is really smart.
If you don't do anything with the result of an operation, the -O3 flag won't translate it to machine code in your executable (i.e. it won't compute an unused value). The use of global variables in your code makes it difficult to track what the compiler considers to be "unused". You can always inspect the disassembly of your compiled code to see roughly what your compiler does with each line. You may also "use" the result of your function in a time-trivial way so the compiler doesn't get rid of it.

For example, an executable with only
for(int i = 0; i < 100000; ++i){do_expensive_operation();}
doesn't actually reflect how long it takes to perform "do_expensive_operation()" 100K times (it may as well be commented out). But if you store and use the results like so
int total = 0;
for(int i = 0; i < 100000; ++i){total += do_expensive_operation();}
std::cout << total;
your whole -O3 code will suddenly take a whole lot of time (around [time cost of function]*100k).

2. Expectations are not reality, and profiling reveals why.
The best advice you can get to determine if code will run faster with a change is to actually run it.
Real computers are messy, so differences from theory are to be expected -- especially with small working sets, since algorithmic complexity only describes large-scale behaviour.
You can use a line-by-line profiler (with a dig-down analysis of the call graph) to see where a small program spends its time. With larger programs, function sampling can catch ubiquitous, expensive, or unoptimized functions/operations that a call-graph aggregation of times may miss, so sampling via multiple methods can be useful.

3. There can be a lot of variability in the time it takes a function to run, and results are subject to interpretation.
Your function's runtime is within a few orders of magnitude of the clock period of modern CPUs (roughly 1000 times slower), so individual measurements are noisy.
I see that you account for this by giving your executable a high priority and taking the fastest time of 12K trials, so the total runtime is close to a second.
But I think an average is more representative of how fast a function typically runs, and it decreases variance in an intuitive and measurable way.

