Registered Member
|
Hi,
I have a large number of small fixed size matrices (e.g. 4x4 or 16x16) that I want to multiply consecutively and only consecutively. Specifically I have N small matrices: W_i where i = 1 ... N and I wish to calculate every possible left and right partial product, i.e L(j) = product(i=1 to j) W_i and R(j) = product(i=j to N) W_i for all j. Then I want to modify the W_i in place and repeat, again and again and again. So I don't expect the initial overhead of memory allocation to be my bottleneck but I do want the matrix multiplications to be as efficient as possible. Obviously this means ensuring alignment for vectorization. But am I also correct in assuming that since matrices are only ever multiplied consecutively it would be most efficient to have the matrix data stored consecutively in memory? If this is the case, what is the best way to achieve this? Is it possible to just create a C-style array of Eigen Matrix objects: Matrix4d W[N]; while still preserving alignment for vectorization? And if so would this guarantee the actual matrix data being consecutive in memory? Or would it be better to do: double * Wmemory = aligned_allocator<double>().allocate(N); followed by replacing every reference to W[i] with a reference to Map<Matrix4d>(Wmemory+i*16*sizeof(double))? Although I'd prefer not to have to write something so cumbersome(!). I'm also prepared to sacrifice portability and assume that the architecture is intel 64-bit (and so malloc should already return aligned pointers) if this would grant me a simpler syntax in my code. (e.g. in this case could I just create a valarray<Matrix4d>(N) and preserve alignment? And would it have the matrix data consecutive or would it be interspersed with internal variables from the Matrix4d class?). Any ideas greatly appreciated... |
Moderator
|
is the simplest and best way to achieve what you want (alignment + sequential storage). |
Registered Member
|
Just out of curiosity. You said that you have a large number of small matrices. How big is N? I am just asking since
Regards, Hauke |
Registered Member
|
The exact value of N will be allowed to vary (optimization problem). However it is anticipated that it will be in the region of 40. I will definitely also want to be doing the problem with 16x16 matrices though, not only 4x4 ones. And I don't yet know whether I'll need double precision. The L's and R's need storing too, but are in fact post and pre multiplied by a row/column vector so they are actually lists of vectors. However I will need two variants of each, and I'll also maybe need to make an entire copy of the data for 'trial functions' when choosing how to update the W's. So we're looking at around maybe
40x16x16xsizeof(double) bytes of storage for the W's (80KB) 40x16xsizeof(double) for each of the L's and R's (gives 5KB * 4) + a full copy of the data => 200KB total Modern linuxes have a default stack of 8MB, I think(?). So hopefully should be okay. But if it turns out that I do want to explore much larger N, what would be the equivalent way to achieve alignment and sequential storage on the heap? |
Registered Member
|
Oh, and on the topic of stacks.
If I wrap my data inside a class/struct, e.g.
will W still be allocated on the stack if I instatiate an object of type Sequence on the stack? And what if I instantiate my object on the heap, presumably then the Eigen data will also be on the heap? Lastly, should I be worried: is there a performance penalty (other than at allocation time) for performing matrix calculations on heap data vs stack data? |
Registered Member
|
Regarding performance, you need to profile. There is only so much we can guess.
I would start with a
Your wrapper won't help, the type still resides on the stack if you use it as is, unless you allocate on the heap and then you have the same result as with a regular std::vector. - Hauke |
Registered users: Bing [Bot], Google [Bot], Yahoo [Bot]