Lockless queue implementation ends up having a loop under stress
- by Fozi
I have lockless queues written in C in form of a linked list that contains requests from several threads posted to and handled in a single thread. After a few hours of stress I end up having the last request's next pointer pointing to itself, which creates an endless loop and locks up the handling thread.
The application runs (and fails) on both Linux and Windows. I'm debugging on Windows, where my COMPARE_EXCHANGE_PTR maps to InterlockedCompareExchangePointer.
This is the code that pushes a request to the head of the list, and is called from several threads:
void push_request(struct request * volatile * root, struct request * request)
{
assert(request);
do {
request->next = *root;
} while(COMPARE_EXCHANGE_PTR(root, request, request->next) != request->next);
}
This is the code that gets a request from the end of the list, and is only called by a single thread that is handling them:
struct request * pop_request(struct request * volatile * root)
{
struct request * volatile * p;
struct request * request;
do {
p = root;
while(*p && (*p)->next) p = &(*p)->next; // <- loops here
request = *p;
} while(COMPARE_EXCHANGE_PTR(p, NULL, request) != request);
assert(request->next == NULL);
return request;
}
Note that I'm not using a tail pointer because I wanted to avoid the complication of having to deal with the tail pointer in push_request. However I suspect that the problem might be in the way I find the end of the list.
There are several places that push a request into the queue, but they all look generaly like this:
// device->requests is defined as struct request * volatile requests;
struct request * request = malloc(sizeof(struct request));
if(request) {
// fill out request fields
push_request(&device->requests, request);
sem_post(device->request_sem);
}
The code that handles the request is doing more than that, but in essence does this in a loop:
if(sem_wait_timeout(device->request_sem, timeout) == sem_success) {
struct request * request = pop_request(&device->requests);
// handle request
free(request);
}
I also just added a function that is checking the list for duplicates before and after each operation, but I'm afraid that this check will change the timing so that I will never encounter the point where it fails. (I'm waiting for it to break as I'm writing this.)
When I break the hanging program the handler thread loops in pop_request at the marked position. I have a valid list of one or more requests and the last one's next pointer points to itself. The request queues are usually short, I've never seen more then 10, and only 1 and 3 the two times I could take a look at this failure in the debugger.
I thought this through as much as I could and I came to the conclusion that I should never be able to end up with a loop in my list unless I push the same request twice. I'm quite sure that this never happens. I'm also fairly sure (although not completely) that it's not the ABA problem.
I know that I might pop more than one request at the same time, but I believe this is irrelevant here, and I've never seen it happening. (I'll fix this as well)
I thought long and hard about how I can break my function, but I don't see a way to end up with a loop.
So the question is: Can someone see a way how this can break? Can someone prove that this can not?
Eventually I will solve this (maybe by using a tail pointer or some other solution - locking would be a problem because the threads that post should not be locked, I do have a RW lock at hand though) but I would like to make sure that changing the list actually solves my problem (as opposed to makes it just less likely because of different timing).