Hi,
I am trying to multiply square matrices in parallel with MPI.
I use an MPI_Type_vector to send square submatrices (arrays of float) to the processes, so they can calculate subproducts. Then, for the next iterations, these submatrices are sent to neighbouring processes as MPI_Type_contiguous (the whole submatrix is sent). This part works as expected, and the local results are correct.
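For reference, the distribution type looks roughly like this (the names are mine; the sizes match the 4x4 / 2x2 example below):

    /* One 2x2 block of a 4x4 row-major matrix of floats:
       2 rows of 2 floats each, 4 floats apart in memory. */
    float A[16];
    MPI_Datatype blocktype;
    MPI_Type_vector(2, 2, 4, MPI_FLOAT, &blocktype);
    MPI_Type_commit(&blocktype);
    /* e.g. the root sends P1 its block, which starts at A[2] */
    MPI_Send(&A[2], 1, blocktype, 1, 0, MPI_COMM_WORLD);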
Then, I use MPI_Gather with the contiguous type to send all local results back to the root process. The problem is that the final matrix is built (obviously, given this method) line by line instead of submatrix by submatrix.
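The gather itself is just this (sketch, assuming 4 floats per local result as in the example below):

    /* subC holds the 4 local floats of subCi; C is the 4x4 result on the root.
       Each contribution is simply concatenated, so subC1 lands right
       after subC0, and so on. */
    MPI_Gather(subC, 4, MPI_FLOAT, C, 4, MPI_FLOAT, 0, MPI_COMM_WORLD);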
I wrote an ugly procedure that rearranges the final matrix, but I would like to know if there is a direct way of performing the "inverse" of sending MPI_Type_vectors (i.e., sending an array of values and having it arranged directly in subarray form in the receiving array).
An example, to try to clarify my long text:
A[16] and B[16] are 4x4 matrices to be multiplied; C[16] will contain the result; 4 processes are used (Pi with i from 0 to 3):
Pi gets two 2x2 submatrices: subAi[4] and subBi[4]; their product is stored locally in subCi[4].
For instance, P0 gets:
subA0[4] containing A[0], A[1], A[4] and A[5];
subB0[4] containing B[0], B[1], B[4] and B[5].
After everything is calculated, the root process gathers all subCi[4].
Then C[16] contains:
[
subC0[0], subC0[1], subC0[2], subC0[3],
subC1[0], subC1[1], subC1[2], subC1[3],
subC2[0], subC2[1], subC2[2], subC2[3],
subC3[0], subC3[1], subC3[2], subC3[3]]
and I would like it to be:
[
subC0[0], subC0[1], subC1[0], subC1[1],
subC0[2], subC0[3], subC1[2], subC1[3],
subC2[0], subC2[1], subC3[0], subC3[1],
subC2[2], subC2[3], subC3[2], subC3[3]]
without any further operation. Does anyone know a way?
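In case it helps the discussion, here is the kind of thing I am imagining, based on my (possibly wrong) understanding that MPI_Type_create_resized can shrink a type's extent so that consecutive blocks interleave; I have not verified this:

    /* Receive-side type on the root: a 2x2 block of the 4x4 result C. */
    MPI_Datatype block, blockresized;
    MPI_Type_vector(2, 2, 4, MPI_FLOAT, &block);
    /* Shrink the extent to 2 floats, so the next block can start only
       2 floats after the previous one instead of a full block later. */
    MPI_Type_create_resized(block, 0, 2 * sizeof(float), &blockresized);
    MPI_Type_commit(&blockresized);

    /* Starting offsets of each process's block in C, in units of the
       resized extent (2 floats): C[0], C[2], C[8], C[10]. */
    int counts[4] = {1, 1, 1, 1};
    int displs[4] = {0, 1, 4, 5};

    /* Every process sends its 4 contiguous floats; the root receives
       each contribution as one interleaved 2x2 block. */
    MPI_Gatherv(subC, 4, MPI_FLOAT,
                C, counts, displs, blockresized,
                0, MPI_COMM_WORLD);

If that is a legitimate use of the resized extent, it would replace my rearranging procedure entirely.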
Thanks for your advice.