Stack usage with MMX intrinsics and Microsoft C++
- by arik-funke
I have an inline assembler loop that cumulatively adds elements from an int32 data array with MMX instructions. In particular, it uses the fact that the MMX registers can accommodate 16 int32s to calculate 16 different cumulative sums in parallel.
I would now like to convert this piece of code to MMX intrinsics but I am afraid that I will suffer a performance penalty because one cannot explicitly intruct the compiler to use the 8 MMX registers to accomulate 16 independent sums.
Can anybody comment on this and maybe propose a solution on how to convert the piece of code below to use intrinsics?
== inline assembler (only part within the loop) ==
paddd mm0, [esi+edx+8*0] ; add first & second pair of int32 elements
paddd mm1, [esi+edx+8*1] ; add third & fourth pair of int32 elements ...
paddd mm2, [esi+edx+8*2]
paddd mm3, [esi+edx+8*3]
paddd mm4, [esi+edx+8*4]
paddd mm5, [esi+edx+8*5]
paddd mm6, [esi+edx+8*6]
paddd mm7, [esi+edx+8*7] ; add 15th & 16th pair of int32 elements
esi points to the beginning of the data array
edx provides the offset in the data array for the current loop iteration
the data array is arranged such that the elements for the 16 independent sums are interleaved.