前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >[译]Debugging a futex crash

[译]Debugging a futex crash

作者头像
王很水
发布2024-07-30 15:05:49
1000
发布2024-07-30 15:05:49
举报
文章被收录于专栏:C++ 动态新闻推送

[1] Debugging a futex crash: https://rustylife.github.io/2023/08/15/futex-crash.html

很精彩的文章, 水友群一个朋友遇到了一个ceph挂掉的问题,现象非常像这个文章,分析后发现一样

水友的堆栈是这样的

这里分享给大家,已经获得原作者授权

首先core的堆栈仅有 futex相关

代码语言:javascript
复制
>>> bt
#0  0x00007f7430470fbb in raise () from /home/rusty/futex/sysroot/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f7430456864 in abort () from /home/rusty/futex/sysroot/lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f74304b949c in ?? () from /home/rusty/futex/sysroot/lib/x86_64-linux-gnu/libc.so.6
#3  0x00007f74304b97b0 in __libc_fatal () from /home/rusty/futex/sysroot/lib/x86_64-linux-gnu/libc.so.6
#4  0x00007f743062f01c in __lll_lock_wait () from /home/rusty/futex/sysroot/lib/x86_64-linux-gnu/libpthread.so.0
#5  0x00007f7430627733 in pthread_mutex_lock () from /home/rusty/futex/sysroot/lib/x86_64-linux-gnu/libpthread.so.0
#6  0x0000000001acedc0 in ?? ()
#7  0x0000000000000000 in ?? ()

怎么抓bug?

首先有__libc_fatal或者futex_fatal_error 基本能定位到是futex_wait上出问题了 ll_futex_wait基本包一层futex_wait

代码语言:javascript
复制
void
__lll_lock_wait (int *futex, int private)
{
  if (atomic_load_relaxed (futex) == 2)
    goto futex;

  while (atomic_exchange_acquire (futex, 2) != 0)
    {
    futex:
      LIBC_PROBE (lll_lock_wait, 1, futex);
      futex_wait ((unsigned int *) futex, 2, private); /* Wait if *futex == 2.  */
    }
}
libc_hidden_def (__lll_lock_wait)

futex_wait的代码如下

代码语言:javascript
复制
static __always_inline int
futex_wait (unsigned int *futex_word, unsigned int expected, int private)
{
  int err = lll_futex_timed_wait (futex_word, expected, NULL, private);
  switch (err)
    {
    case 0:
    case -EAGAIN:
    case -EINTR:
      return -err;

    case -ETIMEDOUT: /* Cannot have happened as we provided no timeout.  */
    case -EFAULT: /* Must have been caused by a glibc or application bug.  */
    case -EINVAL: /* Either due to wrong alignment or due to the timeout not
             being normalized.  Must have been caused by a glibc or
             application bug.  */
    case -ENOSYS: /* Must have been caused by a glibc bug.  */
    /* No other errors are documented at this time.  */
    default:
      futex_fatal_error ();
    }
}

这几个错误是最值得关注的,按图索骥就行了 首先不是ETIMEDOUT,其次EFAULT和ENOSYS是比较难查的,EINVAL反而比较好查,先确认是不是wrong alignment 对齐问题,简单来说,就是查地址,思路有了,怎么查?切到frame 5

代码语言:javascript
复制
>>> frame 5
#5  0x00007f7430627733 in pthread_mutex_lock () from /home/rusty/futex/sysroot/lib/x86_64-linux-gnu/libpthread.so.0

>>> disassemble pthread_mutex_lock
...
   0x00007f7430627723 <+211>: mov    %rdi,0x8(%rsp)
   0x00007f7430627728 <+216>: and    $0x80,%esi
   0x00007f743062772e <+222>: call   0x7f743062efc0 <__lll_lock_wait>
=> 0x00007f7430627733 <+227>: mov    0x8(%rsp),%rdi
   0x00007f7430627738 <+232>: jmp    0x7f7430627689 <pthread_mutex_lock+57>
...

>>> x/w ($rsp + 8)
0x7f7371ff1ce8: 0x08001ea1

抓到了不对齐??看懂怎么定位的了没?

__lll_lock_wait (int *futex, int private) 怎么传futex和private?

mov 0x8(%rsp),%rdi到lock_wait,说明 退回去就是futex和private,对吧

不记得的琢磨一下汇编哈,看看这个 https://godbolt.org/z/bbET88a1x

复现一下

代码语言:javascript
复制
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

pthread_mutex_t mutex __attribute__((section(".reproducer"))) __attribute__ ((aligned(1)));

int main(int argc, char **argv)
{
    printf("mutex %p\n", &mutex);
    pthread_mutex_lock(&mutex);
    pthread_mutex_lock(&mutex);
}

为什么锁两次,因为锁一次不能复现,正常来说顶多死锁

编译执行

代码语言:javascript
复制
rusty@nuc10:~/futex$ gcc -g futex0.c -lpthread  -Wl,--section-start=.reproducer=0x8001ea1
rusty@nuc10:~/futex$ ./a.out 
mutex 0x8001ea1
The futex facility returned an unexpected error code.
Aborted (core dumped)
rusty@nuc10:~/futex$

注意这个复现思路哈,指定地址

能复现,接下来就要确认futex哪里报错了

代码语言:javascript
复制
rusty@nuc10:~/code/kernel$ git grep -c "\-EINVAL" kernel/futex/
kernel/futex/core.c:1
kernel/futex/pi.c:4
kernel/futex/requeue.c:11
kernel/futex/syscalls.c:8
kernel/futex/waitwake.c:7

trace-cmd 录一下调用栈

代码语言:javascript
复制
rusty@nuc10:~/futex$ sudo trace-cmd record -q -F -p function_graph ./a.out
mutex 0x8001ea1
The futex facility returned an unexpected error code.

rusty@nuc10:~/futex$ sudo trace-cmd report
...
           a.out-867356 [004] 112365.616487: funcgraph_entry:        0.168 us   |  exit_to_user_mode_prepare();
           a.out-867356 [004] 112365.616487: funcgraph_entry:                   |  __x64_sys_futex() {
           a.out-867356 [004] 112365.616488: funcgraph_entry:                   |    do_futex() {
           a.out-867356 [004] 112365.616488: funcgraph_entry:                   |      futex_wait() {
           a.out-867356 [004] 112365.616488: funcgraph_entry:        0.150 us   |        futex_setup_timer();
           a.out-867356 [004] 112365.616488: funcgraph_entry:                   |        futex_wait_setup() {
           a.out-867356 [004] 112365.616488: funcgraph_entry:        0.199 us   |          get_futex_key();
           a.out-867356 [004] 112365.616489: funcgraph_exit:         0.425 us   |        }       
           a.out-867356 [004] 112365.616489: funcgraph_exit:         0.954 us   |      }         
           a.out-867356 [004] 112365.616489: funcgraph_exit:         1.205 us   |    }         
           a.out-867356 [004] 112365.616489: funcgraph_exit:         1.581 us   |  }

能看到死在get_futex_key里,看一下代码

代码语言:javascript
复制
         /*
          * The futex address must be "naturally" aligned.
          */
         key->both.offset = address % PAGE_SIZE;
         if (unlikely((address % sizeof(u32)) != 0))
                  return -EINVAL;

就是你了

本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。
原始发表:2024-04-10,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 CPP每周推送 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档