
Matrix Multiplication Performance on Raspberry Pi 4 64-bit

doleron
Hi, everyone!

I have used Eigen and the Raspberry Pi for some time, but never both together. So I wrote a simple benchmark consisting of 30 runs, each a batch of 100 matrix multiplications, on my new Raspberry Pi 4 with a 64-bit OS. The following chart shows the execution time for 4 different setups (512x512 matrices and 460x460 matrices, each with and without vectorization):

Image

I compiled it using:

Code: Select all
$ g++ -O3 -DNDEBUG -I eigen/ main.cpp -o test

and
Code: Select all
$ g++ -O3 -DNDEBUG -DEIGEN_DONT_VECTORIZE -I eigen/ main.cpp -o test


My question is: can I do anything else to speed up matrix multiplication on this hardware and OS? Or have I already hit the best performance the RPi 4 can deliver?

Some notes:

0 - I would rather not use -fopenmp in this comparison, in order to avoid multithreading/parallelization concerns.

1 - I've followed the instructions on http://eigen.tuxfamily.org/index.php?ti ... ization.3F:
"On 64-bit ARM, SIMD is enabled by default, you don't have to do anything extra."

My GCC is:
Code: Select all
pi@raspberrypi:~ $ g++ --version
g++ (Debian 8.3.0-6) 8.3.0


The code used in this experiment is below:

Code: Select all
#include <iostream>
#include <Eigen/Dense>
#include <chrono>

using namespace Eigen;

const int size = 460; // or 512

MatrixXd A = 10 * MatrixXd::Random(size, size);
MatrixXd B = 10 * MatrixXd::Random(size, size);
MatrixXd C;
double test = 0;

void foo() {

    for (int i = 0; i < 100; ++i)
    {
        C.noalias() = A * B;

        int x = 0;
        int y = 0;

        test += C(x, y);
    }

}

int main()
{

    // Report whether Eigen's explicit vectorization is active in this build.
    #ifdef EIGEN_VECTORIZE
        std::cout << "vectorization is enabled" << "\n";
    #else
        std::cout << "vectorization is disabled" << "\n";
    #endif

    std::chrono::high_resolution_clock::time_point begin_time_ref;
    std::chrono::high_resolution_clock::time_point end_time_ref;

    std::cout << "size is " << size << "\n";

    for (int step = 0; step < 30; ++step) {
        begin_time_ref = std::chrono::high_resolution_clock::now();

        foo();

        end_time_ref = std::chrono::high_resolution_clock::now();
        std::chrono::milliseconds ms = std::chrono::duration_cast<std::chrono::milliseconds>(end_time_ref - begin_time_ref);
        std::cout << step << "\t" << ms.count() << "\n";
       
    }

    std::cout << "test value is:" << test << "\n";
}


I would really appreciate any advice and/or criticism!

