Implementing SIMD in C++
- by Hristo
I'm working on a bit of code and trying to optimize it as much as possible; basically, I need to get it running under a certain time limit.
The following makes the call...
static affinity_partitioner ap;
parallel_for(blocked_range<size_t>(0, T), LoopBody(score), ap);
... and the following is what is executed.
void operator()(const blocked_range<size_t> &r) const {
    int temp;
    int i;
    int j;
    size_t k;
    size_t begin = r.begin();
    size_t end = r.end();
    for(k = begin; k != end; ++k) { // for each trainee
        temp = 0;
        for(i = 0; i < N; ++i) { // for each sample
            int trr = trRating[k][i];
            int ei = E[i];
            for(j = 0; j < ei; ++j) { // for each expert
                temp += delta(i, trr, exRating[j][i]);
            }
        }
        myscore[k] = temp;
    }
}
I'm using Intel's TBB to optimize this, but I've also been reading about SIMD, SSE2, and things of that nature. So my question is: how do I store the variables (i, j, k) in registers so that the CPU can access them faster? I think the answer has to do with using SSE2 or some variation of it, but I have no idea how to do that. Any ideas?
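To make the question concrete, here is a minimal sketch of what I understand an SSE2 version of the innermost (expert) loop might look like. It rests on two assumptions that are not in my real code: delta() is replaced by a hypothetical stand-in, delta_abs(), that just returns the absolute difference of the two ratings and ignores i, and the expert ratings for one sample are stored contiguously (a hypothetical transposed layout, rather than exRating[j][i]). As far as I can tell, SSE2 works on the data, four 32-bit ints per instruction, rather than on the loop counters. The function name sum_delta is also made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define HAVE_SSE2 1
#endif

// Hypothetical stand-in for delta(): absolute difference of two ratings.
static inline int delta_abs(int a, int b) { return std::abs(a - b); }

// Sum of |trr - ex[j]| for j in [0, n).
// Assumption: ex points at the expert ratings for ONE sample, stored
// contiguously (a transposed layout, not the exRating[j][i] of the post).
int sum_delta(int trr, const int* ex, std::size_t n) {
    assert(ex != nullptr || n == 0);
    std::size_t j = 0;
    int total = 0;
#ifdef HAVE_SSE2
    const __m128i vtrr = _mm_set1_epi32(trr);  // trr copied into all 4 lanes
    __m128i acc = _mm_setzero_si128();         // 4 running partial sums
    for (; j + 4 <= n; j += 4) {
        // Unaligned load of 4 consecutive expert ratings.
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ex + j));
        __m128i d = _mm_sub_epi32(vtrr, v);
        // SSE2 has no 32-bit abs, so build it: sign is all ones for
        // negative lanes, and (d ^ sign) - sign negates exactly those lanes.
        __m128i sign = _mm_srai_epi32(d, 31);
        __m128i absd = _mm_sub_epi32(_mm_xor_si128(d, sign), sign);
        acc = _mm_add_epi32(acc, absd);
    }
    // Horizontal sum of the 4 lanes.
    int lanes[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
    total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    // Scalar tail (or the whole loop when SSE2 is unavailable).
    for (; j < n; ++j) total += delta_abs(trr, ex[j]);
    return total;
}
```

If something like this is the right idea, the inner loop over j would become one call such as temp += sum_delta(trr, exT[i], E[i]), where exT is the hypothetical transposed ratings array. Whether it helps at all depends on what delta() really does and on E[i] being large enough to fill whole groups of four.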
Thanks,
Hristo