A site for solving at least some of your technical problems...
Once in a while, in a rather complicated application, I end up with a deadlock.
Looking for what is happening can be tedious since a deadlock doesn't tell you anything other than: this thread is waiting on a mutex (or possibly even a condition on a mutex).
In case of a deadlock, though, the mutex is already locked by another thread, and it is easy to find out which other thread locked the mutex.
In many cases, when a deadlock occurs, the two threads use the same two (or more) mutexes and at some point both try to lock those mutexes in a different order (A then B for one and B then A for the other). If you know the deadlock can happen, this can still work by using a trylock instead of a plain lock for the second mutex. If the trylock fails, you can't obtain both locks and you can't go on; the thread has to release the lock it already holds instead of blocking.
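As a rough illustration of the trylock idea, here is a minimal sketch. The helper name try_lock_both is hypothetical (it is not part of pthreads), and it assumes default, non-recursive mutexes. It makes a single attempt and leaves the decision to retry or give up to the caller.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper: take `first` with a plain lock, then only *try*
   to take `second`. If `second` is busy, release `first` again so this
   thread never blocks while holding one of the two mutexes.
   Returns 0 when both mutexes are held, EBUSY when `second` was already
   locked, or another errno value on error. */
int try_lock_both(pthread_mutex_t *first, pthread_mutex_t *second)
{
    int rc = pthread_mutex_lock(first);
    if (rc != 0)
        return rc;

    rc = pthread_mutex_trylock(second);
    if (rc != 0) {
        /* Could not get both: back off instead of waiting. */
        pthread_mutex_unlock(first);
        return rc;
    }

    return 0; /* caller must unlock `second`, then `first` */
}
```

A caller that gets EBUSY back can skip the work, or retry later. Retrying in a tight loop can keep two threads bouncing off each other, so some form of pause between attempts is usually wanted.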
Now, when looking at the locks using gdb, I've been wondering how to determine which thread actually owns a given lock (i.e. which thread is currently holding the lock and preventing this other thread from moving forward). The answer is found in your pthread_mutex_t variable (defined in sysdeps/nptl/bits/pthreadtypes.h in the glibc project):
typedef union
{
  struct __pthread_mutex_s __data;
  char __size[__SIZEOF_PTHREAD_MUTEX_T];
  long int __align;
} pthread_mutex_t;
As we can see here, the type is a union of three members: the __data structure, the __size character array, and the __align long integer.
The structure at the top is clearly defined as an internal structure. It is found in sysdeps/nptl/bits/thread-shared-types.h and looks as follows:
struct __pthread_mutex_s
{
  int __lock __LOCK_ALIGNMENT;
  unsigned int __count;
  int __owner;
#if !__PTHREAD_MUTEX_NUSERS_AFTER_KIND
  unsigned int __nusers;
#endif
  /* KIND must stay at this position in the structure to maintain
     binary compatibility with static initializers.

     Concurrency notes:
     The __kind of a mutex is initialized either by the static
     PTHREAD_MUTEX_INITIALIZER or by a call to pthread_mutex_init.

     After a mutex has been initialized, the __kind of a mutex is usually not
     changed. BUT it can be set to -1 in pthread_mutex_destroy or elision can
     be enabled. This is done concurrently in the pthread_mutex_*lock
     functions by using the macro FORCE_ELISION. This macro is only defined
     for architectures which supports lock elision.

     For elision, there are the flags PTHREAD_MUTEX_ELISION_NP and
     PTHREAD_MUTEX_NO_ELISION_NP which can be set in addition to the already
     set type of a mutex. Before a mutex is initialized, only
     PTHREAD_MUTEX_NO_ELISION_NP can be set with pthread_mutexattr_settype.

     After a mutex has been initialized, the functions pthread_mutex_*lock can
     enable elision - if the mutex-type and the machine supports it - by
     setting the flag PTHREAD_MUTEX_ELISION_NP. This is done concurrently.
     Afterwards the lock / unlock functions are using specific elision
     code-paths. */
  int __kind;
  __PTHREAD_COMPAT_PADDING_MID
#if __PTHREAD_MUTEX_NUSERS_AFTER_KIND
  unsigned int __nusers;
#endif
#if !__PTHREAD_MUTEX_USE_UNION
  __PTHREAD_SPINS_DATA;
  __pthread_list_t __list;
# define __PTHREAD_MUTEX_HAVE_PREV 1
#else
  __extension__ union
  {
    __PTHREAD_SPINS_DATA;
    __pthread_slist_t __list;
  };
# define __PTHREAD_MUTEX_HAVE_PREV 0
#endif
  __PTHREAD_COMPAT_PADDING_END
};
As we can see, the first three fields are __lock, __count, and __owner. What particularly interests us is the third field, which is the lock owner. All three are integers, so most likely 32 bits each.
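To see the __owner field change from inside a program, a small sketch like the following can help. It assumes Linux with glibc, a default (non-elided) mutex, and a program linked with the pthread library. The helper names current_tid and mutex_owner_tid are made up for this example. Reading __data.__owner relies on glibc internals and is not synchronized, so treat this as a debugging aid only.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: kernel thread ID of the calling thread,
   obtained with the gettid system call. */
long current_tid(void)
{
    return syscall(SYS_gettid);
}

/* Hypothetical helper: peek at glibc's internal owner field.
   Assumption: glibc layout as shown above; other C libraries
   lay out pthread_mutex_t differently. */
int mutex_owner_tid(pthread_mutex_t *m)
{
    return m->__data.__owner;
}
```

With glibc, calling mutex_owner_tid() on a mutex the current thread has just locked should return the same value as current_tid(), and it should go back to 0 once the mutex is unlocked.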
Now in gdb we can do:
(gdb) x/3d 0x558b16eed0
0x558b16eed0: 0 0 0
Where 0x558b16eed0 is the address of your mutex. If all three values are zero, as shown above, then your mutex is not currently locked, so this is not a deadlock on that mutex. Maybe your code is waiting on a condition, in which case that condition is not happening.
When another thread owns the mutex, the data is more likely to look like so:
(gdb) x/3d 0x558b16eed0
0x558b16eed0: 1 1 22022
The third value, 22022, is the identifier of the thread that owns the mutex; gdb shows it as the LWP number. Unfortunately, gdb's thread command doesn't accept that identifier. Instead, you have to look up the corresponding gdb thread number manually. At least, when listing your threads, they appear in order:
(gdb) info thread
...
  16   Thread 0x7f689efdb0 (LWP 22022) "NVMDecFrmStatsT" 0x0000007fb5d952a4 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x558b197e28) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
...
This says that the thread with identifier 22022 is number 16. Now we can switch to it and see what that thread is trying to do:
(gdb) thread 16
(gdb) where
The stack should tell you what happened and where to look in your code to fix the issue. In particular, it may show that this thread locked another mutex and is now waiting on one held by the first thread, which is what created the deadlock.