Hello, I'm working on converting a bit of code to SSE, and while it produces the correct output, it turns out to be slower than the standard C++ code.
The bit of code that I need to do this for is:
float ox = p2x - (px * c - py * s)*m;
float oy = p2y - (px * s - py * c)*m;
What I've got for SSE code is:
void assemblycalc(vector4 &p, vector4 &sc, float &m, vector4 &xy)
{
    vector4 r;
    __m128 scale = _mm_set1_ps(m);
    __asm
    {
        mov eax, p              // load the address of p into a CPU register
        mov ebx, sc             // load the address of sc
        movups xmm0, [eax]      // move the vectors to SSE registers
        movups xmm1, [ebx]
        mulps xmm0, xmm1        // multiply the elements
        movaps xmm2, xmm0       // make a copy of the products
        shufps xmm2, xmm0, 0x1B // reverse the element order of the copy
        subps xmm0, xmm2        // subtract the elements
        mulps xmm0, scale       // multiply the vector by the scale
        mov ecx, xy             // load the address of xy into a CPU register
        movups xmm3, [ecx]      // move the vector to an SSE register
        subps xmm3, xmm0        // xmm3 = xmm3 - xmm0
        movups [r], xmm3        // save the return vector; elements 0 and 2 (zero-based) are used
    }
}
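For comparison, the same sequence of steps can be sketched with compiler intrinsics instead of inline assembly. This is only a sketch under a few assumptions: the `vector4` struct below is a hypothetical stand-in for my real type (four packed floats), `intrinsiccalc` is an illustrative name, and the function returns the result vector instead of storing it in a local. I have not measured whether it is any faster.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (_mm_loadu_ps, _mm_shuffle_ps, ...)

// Hypothetical stand-in for the vector4 type used above: four packed floats.
struct vector4 { float v[4]; };

// Expected layout of the inputs:
//   p  = [px,  py,  px,  py ]
//   sc = [c,   c,   s,   s  ]
//   xy = [p2x, p2x, p2y, p2y]
// Returns a vector whose lane 0 holds ox and lane 2 holds oy (zero-based);
// lanes 1 and 3 are by-products and are not meaningful.
inline vector4 intrinsiccalc(const vector4 &p, const vector4 &sc, float m, const vector4 &xy)
{
    // [px*c, py*c, px*s, py*s]
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(p.v), _mm_loadu_ps(sc.v));
    // 0x1B reverses the four lanes: [py*s, px*s, py*c, px*c]
    __m128 rev = _mm_shuffle_ps(prod, prod, 0x1B);
    // [px*c-py*s, py*c-px*s, px*s-py*c, py*s-px*c]
    __m128 diff = _mm_sub_ps(prod, rev);
    // Scale every lane by m, then subtract from [p2x, p2x, p2y, p2y].
    __m128 scaled = _mm_mul_ps(diff, _mm_set1_ps(m));
    __m128 out = _mm_sub_ps(_mm_loadu_ps(xy.v), scaled);

    vector4 r;
    _mm_storeu_ps(r.v, out);  // unaligned store, matching the movups above
    return r;
}
```

With this version the caller would read `r.v[0]` as ox and `r.v[2]` as oy.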
Since the code is very difficult to read, I'll explain what I did:
loaded vector4, xmm0 (p) = [px, py, px, py]
mult. by vector4, xmm1 (cs) = [c, c, s, s]
---------------- mult ----------------
result, xmm0 = [px*c, py*c, px*s, py*s]
reuse result, xmm0 = [px*c, py*c, px*s, py*s]
shuffle result, xmm2 = [py*s, px*s, py*c, px*c]
---------------- subtract ----------------
result, xmm0 = [px*c-py*s, py*c-px*s, px*s-py*c, py*s-px*c]
reuse result, xmm0 = [px*c-py*s, py*c-px*s, px*s-py*c, py*s-px*c]
load m vector4, scale = [m, m, m, m]
---------------- mult ----------------
result, xmm0 = [(px*c-py*s)*m, (py*c-px*s)*m, (px*s-py*c)*m, (py*s-px*c)*m]
load xy vector4, xmm3 = [p2x, p2x, p2y, p2y]
reuse, xmm0 = [(px*c-py*s)*m, (py*c-px*s)*m, (px*s-py*c)*m, (py*s-px*c)*m]
---------------- subtract ----------------
result, xmm3 = [p2x-(px*c-py*s)*m, p2x-(py*c-px*s)*m, p2y-(px*s-py*c)*m, p2y-(py*s-px*c)*m]
then, counting elements from zero, ox = xmm3[0] and oy = xmm3[2], so I essentially don't use xmm3[1] or xmm3[3]
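To double-check the walkthrough without any SSE involved, the lane arithmetic can be emulated in plain C++ and compared against the original two lines. This is a minimal sketch: `Lanes`, `scalarcalc` and `lanecalc` are names made up for illustration, and lanes are numbered from zero.

```cpp
#include <array>

using Lanes = std::array<float, 4>;

// Scalar reference: the two lines of standard C++ from the top of the post.
inline void scalarcalc(float px, float py, float c, float s, float m,
                       float p2x, float p2y, float &ox, float &oy)
{
    ox = p2x - (px * c - py * s) * m;
    oy = p2y - (px * s - py * c) * m;
}

// Emulates the SSE walkthrough one lane at a time.
inline Lanes lanecalc(float px, float py, float c, float s, float m,
                      float p2x, float p2y)
{
    const Lanes p  = {px, py, px, py};
    const Lanes cs = {c, c, s, s};
    const Lanes xy = {p2x, p2x, p2y, p2y};

    Lanes prod{};
    for (int i = 0; i < 4; ++i)
        prod[i] = p[i] * cs[i];           // [px*c, py*c, px*s, py*s]

    Lanes out{};
    for (int i = 0; i < 4; ++i) {
        float reversed = prod[3 - i];     // what the 0x1B shuffle puts in lane i
        out[i] = xy[i] - (prod[i] - reversed) * m;
    }
    return out;                           // lane 0 = ox, lane 2 = oy
}
```

Comparing `lanecalc(...)[0]` and `lanecalc(...)[2]` against the `ox` and `oy` written by `scalarcalc` for the same inputs is a quick way to confirm which lanes carry the results.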
I apologize for how difficult this is to read, but I'm hoping someone might be able to provide some guidance, as the standard C++ code runs in 0.001444 ms and the SSE code runs in 0.00198 ms.
Let me know if there is anything I can do to further explain/clean this up a bit. The reason I'm trying to use SSE is because I run this calculation millions of times, and it is a part of what is slowing down my current code.
Thanks in advance for any help!
Brett