VFP Unit Matrix Multiply problem on the iPhone
- by Ian Copland
Hi. I'm trying to write a Matrix3x3 multiply using the Vector Floating Point on the iPhone, however i'm encountering some problems. This is my first attempt at writing any ARM assembly, so it could be a faily simple solution that i'm not seeing.
I've currently got a small application running using a maths library that i've written. I'm investigating into the benifits using the Vector Floating Point Unit would provide so i've taken my matrix multiply and converted it to asm. Previously the application would run without a problem, however now my objects will all randomly disappear. This seems to be caused by the results from my matrix multiply becoming NAN at some point.
Heres the code
IMatrix3x3 operator*(IMatrix3x3 & _A, IMatrix3x3 & _B)
{
IMatrix3x3 C;
//C++ code for the simulator
#if TARGET_IPHONE_SIMULATOR == true
C.A0 = _A.A0 * _B.A0 + _A.A1 * _B.B0 + _A.A2 * _B.C0;
C.A1 = _A.A0 * _B.A1 + _A.A1 * _B.B1 + _A.A2 * _B.C1;
C.A2 = _A.A0 * _B.A2 + _A.A1 * _B.B2 + _A.A2 * _B.C2;
C.B0 = _A.B0 * _B.A0 + _A.B1 * _B.B0 + _A.B2 * _B.C0;
C.B1 = _A.B0 * _B.A1 + _A.B1 * _B.B1 + _A.B2 * _B.C1;
C.B2 = _A.B0 * _B.A2 + _A.B1 * _B.B2 + _A.B2 * _B.C2;
C.C0 = _A.C0 * _B.A0 + _A.C1 * _B.B0 + _A.C2 * _B.C0;
C.C1 = _A.C0 * _B.A1 + _A.C1 * _B.B1 + _A.C2 * _B.C1;
C.C2 = _A.C0 * _B.A2 + _A.C1 * _B.B2 + _A.C2 * _B.C2;
//VPU ARM asm for the device
#else
//create a pointer to the Matrices
IMatrix3x3 * pA = &_A;
IMatrix3x3 * pB = &_B;
IMatrix3x3 * pC = &C;
//asm code
asm volatile(
//turn on a vector depth of 3
"fmrx r0, fpscr \n\t"
"bic r0, r0, #0x00370000 \n\t"
"orr r0, r0, #0x00020000 \n\t"
"fmxr fpscr, r0 \n\t"
//load matrix B into the vector bank
"fldmias %1, {s8-s16} \n\t"
//load the first row of A into the scalar bank
"fldmias %0!, {s0-s2} \n\t"
//calulate C.A0, C.A1 and C.A2
"fmuls s17, s8, s0 \n\t"
"fmacs s17, s11, s1 \n\t"
"fmacs s17, s14, s2 \n\t"
//save this into the output
"fstmias %2!, {s17-s19} \n\t"
//load the second row of A into the scalar bank
"fldmias %0!, {s0-s2} \n\t"
//calulate C.B0, C.B1 and C.B2
"fmuls s17, s8, s0 \n\t"
"fmacs s17, s11, s1 \n\t"
"fmacs s17, s14, s2 \n\t"
//save this into the output
"fstmias %2!, {s17-s19} \n\t"
//load the third row of A into the scalar bank
"fldmias %0!, {s0-s2} \n\t"
//calulate C.C0, C.C1 and C.C2
"fmuls s17, s8, s0 \n\t"
"fmacs s17, s11, s1 \n\t"
"fmacs s17, s14, s2 \n\t"
//save this into the output
"fstmias %2!, {s17-s19} \n\t"
//set the vector depth back to 1
"fmrx r0, fpscr \n\t"
"bic r0, r0, #0x00370000 \n\t"
"orr r0, r0, #0x00000000 \n\t"
"fmxr fpscr, r0 \n\t"
//pass the inputs and set the clobber list
: "+r"(pA), "+r"(pB), "+r" (pC) :
:"cc", "memory","s0", "s1", "s2", "s8", "s9", "s10", "s11", "s12", "s13", "s14", "s15", "s16", "s17", "s18", "s19"
);
#endif
return C;
}
As far as i can see that makes sence. While debugging i've managed to notice that if i were to say _A = C prior to the return and after the ASM, _A will not necessarily be equal to C which has only increased my confusion. I had thought it was possibly due to the pointers I'm giving to the VFPU being incrimented by lines such as "fldmias %0!, {s0-s2} \n\t" however my understanding of asm is not good enough to properly understand the problem, nor to see an alternative approach to that line of code.
Anyway, I was hoping someone with a greater understanding than me would be able to see a solution, and any help would be greatly appreciated, thank you :-)