Plagued by multithreaded bugs
- by koncurrency
On my new team that I manage, the majority of our code is platform, TCP socket, and http networking code. All C++. Most of it originated from other developers that have left the team. The current developers on the team are very smart, but mostly junior in terms of experience.
Our biggest problem: multi-threaded concurrency bugs. Most of our class libraries are written to be asynchronous by use of some thread pool classes. Methods on the class libraries often enqueue long running taks onto the thread pool from one thread and then the callback methods of that class get invoked on a different thread. As a result, we have a lot of edge case bugs involving incorrect threading assumptions. This results in subtle bugs that go beyond just having critical sections and locks to guard against concurrency issues.
What makes these problems even harder is that the attempts to fix are often incorrect. Some mistakes I've observed the team attempting (or within the legacy code itself) includes something like the following:
Common mistake #1 - Fixing concurrency issue by just put a lock around the shared data, but forgetting about what happens when methods don't get called in an expected order. Here's a very simple example:
void Foo::OnHttpRequestComplete(statuscode status)
{
m_pBar->DoSomethingImportant(status);
}
void Foo::Shutdown()
{
m_pBar->Cleanup();
delete m_pBar;
m_pBar=nullptr;
}
So now we have a bug in which Shutdown could get called while OnHttpNetworkRequestComplete is occuring on. A tester finds the bug, captures the crash dump, and assigns the bug to a developer. He in turn fixes the bug like this.
void Foo::OnHttpRequestComplete(statuscode status)
{
AutoLock lock(m_cs);
m_pBar->DoSomethingImportant(status);
}
void Foo::Shutdown()
{
AutoLock lock(m_cs);
m_pBar->Cleanup();
delete m_pBar;
m_pBar=nullptr;
}
The above fix looks good until you realize there's an even more subtle edge case. What happens if Shutdown gets called before OnHttpRequestComplete gets called back? The real world examples my team has are even more complex, and the edge cases are even harder to spot during the code review process.
Common Mistake #2 - fixing deadlock issues by blindly exiting the lock, wait for the other thread to finish, then re-enter the lock - but without handling the case that the object just got updated by the other thread!
Common Mistake #3 - Even though the objects are reference counted, the shutdown sequence "releases" it's pointer. But forgets to wait for the thread that is still running to release it's instance. As such, components are shutdown cleanly, then spurious or late callbacks are invoked on an object in an state not expecting any more calls.
There are other edge cases, but the bottom line is this:
Multithreaded programming is just plain hard, even for smart people.
As I catch these mistakes, I spend time discussing the errors with each developer on developing a more appropriate fix. But I suspect they are often confused on how to solve each issue because of the enormous amount of legacy code that the "right" fix will involve touching.
We're going to be shipping soon, and I'm sure the patches we're applying will hold for the upcoming release. Afterwards, we're going to have some time to improve the code base and refactor where needed. We won't have time to just re-write everything. And the majority of the code isn't all that bad. But I'm looking to refactor code such that threading issues can be avoided altogether.
One approach I am considering is this. For each significant platform feature, have a dedicated single thread where all events and network callbacks get marshalled onto. Similar to COM apartment threading in Windows with use of a message loop. Long blocking operations could still get dispatched to a work pool thread, but the completion callback is invoked on on the component's thread. Components could possibly even share the same thread. Then all the class libraries running inside the thread can be written under the assumption of a single threaded world.
Before I go down that path, I am also very interested if there are other standard techniques or design patterns for dealing with multithreaded issues. And I have to emphasize - something beyond a book that describes the basics of mutexes and semaphores. What do you think?
I am also interested in any other approaches to take towards a refactoring process. Including any of the following:
Literature or papers on design patterns around threads. Something beyond an introduction to mutexes and semaphores. We don't need massive parallelism either, just ways to design an object model so as to handle asynchronous events from other threads correctly.
Ways to diagram the threading of various components, so that it will be easy to study and evolve solutions for. (That is, a UML equivalent for discussing threads across objects and classes)
Educating your development team on the issues with multithreaded code.
What would you do?