Debugging a mutex deadlock on Linux with gdb

Submitted by Alexis Wilke on Sun, 08/15/2021 - 10:05

Rusty lock... when a deadlock happens, locks tend to rust over time.

Once in a while, in a rather complicated application, I end up with a deadlock.

Finding out what is happening can be tedious, since a deadlock doesn't tell you anything other than: this thread is waiting on a mutex (or possibly on a condition variable tied to a mutex).

In case of a deadlock, though, the mutex is already locked by another thread, and it is easy to find out which thread that is.

In many cases, when a deadlock occurs, two threads use the same two (or more) mutexes and at some point each tries to lock them in a different order (A then B for one, B then A for the other). If you know the deadlock can happen, you can still make this work by using a trylock instead of a plain lock for the second mutex. If the trylock fails, you cannot obtain both locks and cannot go on; the thread has to release the lock it already holds and try again later.
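The trylock approach can be sketched as follows. This is a minimal illustration, not code from the application discussed here: lock_both() and unlock_both() are hypothetical helper names, and the release-and-retry policy with sched_yield() is only one of several possible back-off strategies.

```c
#include <errno.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical helper: take two distinct mutexes without risking an
 * "A then B" versus "B then A" deadlock.
 * Returns 0 with both mutexes held, or an error code with neither held. */
int lock_both(pthread_mutex_t *first, pthread_mutex_t *second)
{
    for (;;) {
        int rc = pthread_mutex_lock(first);
        if (rc != 0)
            return rc;

        rc = pthread_mutex_trylock(second);
        if (rc == 0)
            return 0;           /* both mutexes are now held */

        /* The second mutex is not available: release the first one so
         * the thread holding the second can make progress. */
        pthread_mutex_unlock(first);
        if (rc != EBUSY)
            return rc;          /* a real error, not just contention */

        sched_yield();          /* assumed back-off policy: yield, then retry */
    }
}

/* Hypothetical helper: release in the reverse order of acquisition. */
void unlock_both(pthread_mutex_t *first, pthread_mutex_t *second)
{
    pthread_mutex_unlock(second);
    pthread_mutex_unlock(first);
}
```

With this pattern, one thread can call lock_both(&a, &b) while another calls lock_both(&b, &a): neither ever blocks while holding a mutex, so the circular wait cannot form.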

Now, when looking at the locks using gdb, I've been wondering how to determine which thread is actually the owner of a lock (i.e. which thread is currently holding the lock and preventing the other thread from moving forward). The answer is found in your pthread_mutex_t variable (defined in sysdeps/nptl/bits/pthreadtypes.h in the glibc project):

typedef union
{
  struct __pthread_mutex_s __data;
  char __size[__SIZEOF_PTHREAD_MUTEX_T];
  long int __align;
} pthread_mutex_t;

As we can see here, the type is a union of three members:

One at the bottom to ensure alignment to a long integer.
One in the middle to ensure a certain size.
One at the top which is an actual structure.

The structure at the top is clearly defined as an internal structure. It is found in sysdeps/nptl/bits/thread-shared-types.h and looks as follows:

struct __pthread_mutex_s
{
  int __lock __LOCK_ALIGNMENT;
  unsigned int __count;
  int __owner;
#if !__PTHREAD_MUTEX_NUSERS_AFTER_KIND
  unsigned int __nusers;
#endif
  /* KIND must stay at this position in the structure to maintain
     binary compatibility with static initializers.

     Concurrency notes:
     The __kind of a mutex is initialized either by the static
     PTHREAD_MUTEX_INITIALIZER or by a call to pthread_mutex_init.

     After a mutex has been initialized, the __kind of a mutex is usually not
     changed.  BUT it can be set to -1 in pthread_mutex_destroy or elision can
     be enabled.  This is done concurrently in the pthread_mutex_*lock functions
     by using the macro FORCE_ELISION. This macro is only defined for
     architectures which supports lock elision.

     For elision, there are the flags PTHREAD_MUTEX_ELISION_NP and
     PTHREAD_MUTEX_NO_ELISION_NP which can be set in addition to the already set
     type of a mutex.
     Before a mutex is initialized, only PTHREAD_MUTEX_NO_ELISION_NP can be set
     with pthread_mutexattr_settype.
     After a mutex has been initialized, the functions pthread_mutex_*lock can
     enable elision - if the mutex-type and the machine supports it - by setting
     the flag PTHREAD_MUTEX_ELISION_NP. This is done concurrently. Afterwards
     the lock / unlock functions are using specific elision code-paths.  */
  int __kind;
  __PTHREAD_COMPAT_PADDING_MID
#if __PTHREAD_MUTEX_NUSERS_AFTER_KIND
  unsigned int __nusers;
#endif
#if !__PTHREAD_MUTEX_USE_UNION
  __PTHREAD_SPINS_DATA;
  __pthread_list_t __list;
# define __PTHREAD_MUTEX_HAVE_PREV      1
#else
  __extension__ union
  {
    __PTHREAD_SPINS_DATA;
    __pthread_slist_t __list;
  };
# define __PTHREAD_MUTEX_HAVE_PREV      0
#endif
  __PTHREAD_COMPAT_PADDING_END
};

As we can see, the first three members are __lock, __count, and __owner. What particularly interests us is the third member, which is the lock owner. All three are integers (__count is unsigned), so most likely 32 bits each.
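To get a feel for these fields before reaching for gdb, they can also be read from C. The sketch below is an illustration only: it relies on glibc's private layout of pthread_mutex_t, so it is not portable and is meant for experiments, not production code. The helper names current_tid(), dump_mutex() and mutex_owner() are hypothetical.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Kernel thread id of the calling thread (the value gdb shows as "LWP"). */
long current_tid(void)
{
    return syscall(SYS_gettid);
}

/* Hypothetical helper: print the three leading fields of the mutex.
 * Assumes glibc internals (the __data member); not portable. */
void dump_mutex(const char *label, const pthread_mutex_t *m)
{
    printf("%s: __lock=%d __count=%u __owner=%d\n",
           label, m->__data.__lock, m->__data.__count, m->__data.__owner);
}

/* Hypothetical helper: thread id recorded as owner, 0 when unlocked. */
int mutex_owner(const pthread_mutex_t *m)
{
    return m->__data.__owner;
}
```

Calling dump_mutex() right after pthread_mutex_lock() should show an __owner equal to current_tid() for a default mutex, and 0 again after pthread_mutex_unlock().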

Now in gdb we can do:

(gdb) x/3d 0x558b16eed0
0x558b16eed0:    0    0    0

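Where 0x558b16eed0 is the address of your mutex. (If an earlier x command used a different unit size, write x/3dw to force 32-bit words.) If all three values are zero as shown above, then your mutex is not currently locked, so it's not a deadlock. Maybe your code is waiting on a condition variable instead, in which case the condition it is waiting for never becomes true or is never signaled.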

When another thread owns the mutex, the data will look more like this:

(gdb) x/3d 0x558b16eed0
0x558b16eed0:    1    1    22022

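The 22022 is the identifier of the thread that owns the mutex. It is the kernel thread identifier, which gdb displays as the LWP number. Unfortunately, the gdb thread command doesn't accept that identifier; it expects gdb's own thread number, so you have to manually look for the corresponding entry. At least, when listing your threads, they appear in order: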

(gdb) info thread
...
  16   Thread 0x7f689efdb0 (LWP 22022) "NVMDecFrmStatsT" 0x0000007fb5d952a4 in futex_wait_cancelable (private=<optimized out>,
    expected=0, futex_word=0x558b197e28) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
...

This says that the thread with identifier 22022 is number 16. Now we can switch to it and see what that thread is trying to do:

(gdb) thread 16
(gdb) where

The stack should tell you what happened and where to look in your code to fix the issue. In particular, it may tell you that this thread is itself blocked on another mutex, and thus that the two threads created a deadlock.
