I have a single line of code, that consumes 25% - 30% of the runtime of my application. It is a less-than comparator for an std::set (the set is implemented with a Red-Black-Tree).
It is called about 180 Million times within 52 seconds.
struct Entry {
const float _cost;
const long _id;
// some other vars
Entry(float cost, float id) : _cost(cost), _id(id) {
}
};
template<class T>
struct lt_entry: public binary_function <T, T, bool>
{
bool operator()(const T &l, const T &r) const
{
// Most readable shape
if(l._cost != r._cost) {
return r._cost < l._cost;
} else {
return l._id < r._id;
}
}
};
The entries should be sorted by cost and if the cost is the same by their id.
I have many insertions for each extraction of the minimum. I thought about using Fibonacci-Heaps, but I have been told that they are theoretically nice, but suffer from high constants and are pretty complicated to implement. And since insert is in O(log(n)) the runtime increase is nearly constant with large n. So I think its okay to stick to the set.
To improve performance I tried to express it in different shapes:
return l._cost < r._cost || r._cost > l._cost || l._id < r._id;
return l._cost < r._cost || (l._cost == r._cost && l._id < r._id);
Even this:
typedef union {
float _f;
int _i;
} flint;
//...
flint diff;
diff._f = (l._cost - r._cost);
return (diff._i && diff._i >> 31) || l._id < r._id;
But the compiler seems to be smart enough already, because I haven't been able to improve the runtime.
I also thought about SSE but this problem is really not very applicable for SSE...
The assembly looks somewhat like this:
movss (%rbx),%xmm1
mov $0x1,%r8d
movss 0x20(%rdx),%xmm0
ucomiss %xmm1,%xmm0
ja 0x410600 <_ZNSt8_Rb_tree[..]+96>
ucomiss %xmm0,%xmm1
jp 0x4105fd <_ZNSt8_Rb_[..]_+93>
jne 0x4105fd <_ZNSt8_Rb_[..]_+93>
mov 0x28(%rdx),%rax
cmp %rax,0x8(%rbx)
jb 0x410600 <_ZNSt8_Rb_[..]_+96>
xor %r8d,%r8d
I have a very tiny bit experience with assembly language, but not really much.
I thought it would be the best (only?) point to squeeze out some performance, but is it really worth the effort? Can you see any shortcuts that could save some cycles?
The platform the code will run on is an ubuntu 12 with gcc 4.6 (-stl=c++0x) on a many-core intel machine. Only libraries available are boost, openmp and tbb.
I am really stuck on this one, it seems so simple, but takes that much time. I have been crunching my head since days thinking how I could improve this line...
Can you give me a suggestion how to improve this part, or is it already at its best?