I've been trying to speed up a couple of performance-critical lines in my code:
float x = a*b;
float y = c*d;
float z = e*f;
float w = g*h;
All of a, b, c, ..., h are floats.
I decided to look into using SSE, but I can't find any improvement; in fact, the SSE version turns out to be twice as slow. My SSE code is:
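Vector4 aceg = { a, c, e, g };   // left-hand operands; assumes Vector4 is a plain struct of four floats
Vector4 bdfh = { b, d, f, h };   // right-hand operands, lane-matched with aceg
Vector4 result;
_asm {
movups xmm1, aceg     // unaligned load of the left-hand operands
movups xmm2, bdfh     // unaligned load of the right-hand operands
mulps xmm1, xmm2      // four single-precision multiplies in one instruction
movups result, xmm1   // result = [a*b, c*d, e*f, g*h]
}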
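For comparison, the same packed multiply can be sketched with compiler intrinsics instead of an _asm block. This is only a sketch under assumptions: the helper name mul4 is hypothetical, plain float arrays stand in for Vector4, and the operands are expected to be packed as [a, c, e, g] and [b, d, f, h] so that each lane corresponds to one of the scalar products.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps

// Hypothetical helper (not from the original code): multiplies four pairs of
// floats lane by lane. lhs and rhs each point to four floats; out receives
// the four products. The unaligned load/store intrinsics mirror movups.
inline void mul4(const float* lhs, const float* rhs, float* out)
{
    __m128 l = _mm_loadu_ps(lhs);            // load lhs[0..3]
    __m128 r = _mm_loadu_ps(rhs);            // load rhs[0..3]
    _mm_storeu_ps(out, _mm_mul_ps(l, r));    // out[i] = lhs[i] * rhs[i]
}
```

With float lhs[4] = { a, c, e, g } and float rhs[4] = { b, d, f, h }, a call to mul4(lhs, rhs, out) leaves a*b, c*d, e*f and g*h in out[0] through out[3].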
I also tried standard (non-SSE) inline assembly, but it doesn't appear that I can pack a register with four floats the way I can with SSE.
Any comments or help would be greatly appreciated. Mainly, I need to understand why my calculations using SSE are slower than the serial C++ code.
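One way to narrow this down might be to time both versions on data that already sits in contiguous arrays, so that the cost of packing the operands stays out of the measurement. The sketch below is only an illustration: mul_scalar, mul_sse and time_it are hypothetical names, and it uses intrinsics and std::clock rather than the _asm block from the post. An optimizing compiler may vectorize or simplify the scalar loop, so the numbers should be read with care.

```cpp
#include <cstddef>      // std::size_t
#include <ctime>        // std::clock, CLOCKS_PER_SEC
#include <xmmintrin.h>  // SSE intrinsics

// Scalar baseline: out[i] = x[i] * y[i].
void mul_scalar(const float* x, const float* y, float* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = x[i] * y[i];
}

// SSE version: four products per iteration, then a scalar loop for any remainder.
void mul_sse(const float* x, const float* y, float* out, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i, _mm_mul_ps(_mm_loadu_ps(x + i), _mm_loadu_ps(y + i)));
    for (; i < n; ++i)
        out[i] = x[i] * y[i];
}

// Runs fn(x, y, out, n) `reps` times and returns the elapsed CPU time in seconds.
template <typename Fn>
double time_it(Fn fn, const float* x, const float* y, float* out, std::size_t n, int reps)
{
    std::clock_t start = std::clock();
    for (int r = 0; r < reps; ++r)
        fn(x, y, out, n);
    return double(std::clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_it(mul_scalar, x, y, out, n, reps) and time_it(mul_sse, x, y, out, n, reps) with the same inputs and a large reps value gives two figures that can be compared directly.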
I'm compiling with Visual Studio 2005 on Windows XP, on a Pentium 4 with HT, in case that information helps.
Thanks in advance!