Hello, I'm working on converting a bit of code to SSE, and while it produces the correct output, it turns out to be slower than the standard C++ code.
The bit of code that I need to do this for is:
float ox = p2x - (px * c - py * s)*m;
float oy = p2y - (px * s - py * c)*m;
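For reference, here are the same two lines wrapped in a self-contained function, so there is something concrete to compare the SSE version against. This is only a sketch: `Vec2` and `scalarCalc` are names made up for illustration and are not in my actual code.

```cpp
// Scalar reference for the two expressions above.
// `Vec2` and `scalarCalc` are hypothetical names used only for this sketch.
struct Vec2 {
    float x;
    float y;
};

inline Vec2 scalarCalc(float px, float py, float c, float s,
                       float m, float p2x, float p2y)
{
    Vec2 o;
    o.x = p2x - (px * c - py * s) * m;  // ox
    o.y = p2y - (px * s - py * c) * m;  // oy, signs kept exactly as written above
    return o;
}
```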
What I've got for SSE code is:
void assemblycalc(vector4 &p, vector4 &sc, float &m, vector4 &xy)
{
    vector4 r;
    __m128 scale = _mm_set1_ps(m);
    __asm
    {
    mov     eax,    p       //Load into CPU reg
    mov     ebx,    sc
    movups  xmm0,   [eax]   //move vectors to SSE regs
    movups  xmm1,   [ebx]
    mulps   xmm0,   xmm1    //Multiply the Elements
    movaps  xmm2,   xmm0    //make a copy of the array  
    shufps  xmm2,   xmm0,  0x1B //shuffle the array     
    subps   xmm0,   xmm2    //subtract the elements
    mulps   xmm0,   scale   //multiply the vector by the scale
    mov     ecx,    xy      //load the variable into cpu reg
    movups  xmm3,   [ecx]   //move the vector to the SSE regs
    subps   xmm3,   xmm0    //subtract xmm3 - xmm0
    movups  [r],    xmm3    //Save the return vector, and use elements 0 and 3
    }
}
Since the code is very difficult to read, I'll explain what I did:
loaded vector4 , xmm0 _ p = [px  , py  , px  , py  ]
mult. by vector4, xmm1 _ sc = [c   , c   , s   , s   ]
_____________mult----------------------------
result,______ xmm0 = [px*c, py*c, px*s, py*s]
reuse result, xmm0 = [px*c, py*c, px*s, py*s]
shuffle result, xmm2 = [py*s, px*s, py*c, px*c]
___________subtract----------------------------
result, xmm0 = [px*c-py*s, py*c-px*s, px*s-py*c, py*s-px*c]
reuse result, xmm0 = [px*c-py*s, py*c-px*s, px*s-py*c, py*s-px*c]
load m vector4, scale = [m, m, m, m]
______________mult----------------------------
result, xmm0 = [(px*c-py*s)*m, (py*c-px*s)*m, (px*s-py*c)*m, (py*s-px*c)*m]
load xy vector4, xmm3 = [p2x, p2x, p2y, p2y]
reuse, xmm0 = [(px*c-py*s)*m, (py*c-px*s)*m, (px*s-py*c)*m, (py*s-px*c)*m]
___________subtract----------------------------
result, xmm3 = [p2x-(px*c-py*s)*m, p2x-(py*c-px*s)*m, p2y-(px*s-py*c)*m, p2y-(py*s-px*c)*m]
then ox = xmm3[0] and oy = xmm3[3], so I essentially don't use xmm3[1] or xmm3[2]
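In case the inline assembly is hard to follow, the same sequence of steps can be sketched with compiler intrinsics from `<xmmintrin.h>`. This is a rough equivalent, not the code I timed: `sseCalc` and the plain float-array parameters are assumptions made for illustration, standing in for the vector4 type used above.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128 and the _mm_* functions

// Hypothetical helper mirroring the walkthrough above, one step per line.
//   p  = [px,  py,  px,  py ]
//   sc = [c,   c,   s,   s  ]
//   xy = [p2x, p2x, p2y, p2y]
// All four result lanes are written to `out`.
inline void sseCalc(const float p[4], const float sc[4], float m,
                    const float xy[4], float out[4])
{
    // [px*c, py*c, px*s, py*s]
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(p), _mm_loadu_ps(sc));
    // 0x1B reverses the lane order: [py*s, px*s, py*c, px*c]
    __m128 rev = _mm_shuffle_ps(prod, prod, 0x1B);
    // [px*c-py*s, py*c-px*s, px*s-py*c, py*s-px*c]
    __m128 diff = _mm_sub_ps(prod, rev);
    // multiply every lane by the scale m
    __m128 scaled = _mm_mul_ps(diff, _mm_set1_ps(m));
    // [p2x, p2x, p2y, p2y] minus the scaled differences
    __m128 res = _mm_sub_ps(_mm_loadu_ps(xy), scaled);
    _mm_storeu_ps(out, res);
}
```

Counting lanes from 0, out[0] holds p2x-(px*c-py*s)*m, out[2] holds p2y-(px*s-py*c)*m, and out[3] holds p2y-(py*s-px*c)*m, so which lane to read for oy depends on the sign convention needed.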
I apologize for the difficulty reading this, but I'm hoping someone can offer some guidance: the standard C++ code runs in 0.001444 ms, while the SSE code runs in 0.00198 ms.
Let me know if there is anything I can do to further explain/clean this up a bit. The reason I'm trying to use SSE is because I run this calculation millions of times, and it is a part of what is slowing down my current code.
Thanks in advance for any help!
Brett