我的flamingo服务器(关于flamingo可以参看这里)最近在杀掉进程(如使用Ctrl + C或者kill + 程序pid)偶尔会出现崩溃问题,虽然这个问题没多大影响,因为进程本来就马上要死了,在退出的过程中崩溃也就无所谓了,但是本着严谨和求知的态度,我还是排查了一下。下面记录一下debug的过程,希望对读者有所启发。
正常情况下,我的程序处理了Ctrl+C信号时,会走正常的退出流程,预想的程序不会崩溃的,但实际还是崩溃了。
主线程是一个EventLoop无限循环,同时程序接收到Ctrl+C信号时,设置主线程退出标志。代码如下:
int main(int argc, char* argv[])
2 {
3 //设置信号处理
4 signal(SIGCHLD, SIG_DFL);
5 signal(SIGPIPE, SIG_IGN);
6 signal(SIGINT, prog_exit);
7 signal(SIGTERM, prog_exit);
8
9 //省略无关代码...
10
11 g_mainLoop.loop();
12
13 return 0;
14}
信号处理程序如下:
1void prog_exit(int signo)
2 {
3 std::cout << "program recv signal [" << signo << "] to exit." << std::endl;
4
5 g_mainLoop.quit();
6
7}
通过日志也看不到对于排查崩溃现象的有任何帮助的日志信息,于是启用linux的linux的coredump文件生成机制,某次产生了如下coredump文件:
于是使用gdb调试查看一下崩溃调用堆栈(第一步使用命令gdb 可执行文件名称 core文件名,第二步使用bt命令查看崩溃堆栈):
1[zhangyl@iZ238vnojlyZ myimserver]$ gdb mychatserver core.9798
2GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
3 Copyright (C) 2013 Free Software Foundation, Inc.
4 License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
5 This is free software: you are free to change and redistribute it.
6 There is NO WARRANTY, to the extent permitted by law. Type "show copying"
7 and "show warranty" for details.
8 This GDB was configured as "x86_64-redhat-linux-gnu".
9 For bug reporting instructions, please see:
10 <http://www.gnu.org/software/gdb/bugs/>...
11 Reading symbols from /home/zhangyl/myimserver/mychatserver...done.
12 [New LWP 9798]
13 [New LWP 9802]
14 [New LWP 9804]
15 [New LWP 9800]
16 [New LWP 9803]
17 [New LWP 9801]
18 [New LWP 9805]
19 [Thread debugging using libthread_db enabled]
20 Using host libthread_db library "/lib64/libthread_db.so.1".
21 Core was generated by `./mychatserver -d'.
22 Program terminated with signal 11, Segmentation fault.
23 #0 0x00000000004f5d3f in FixedBuffer<4000000>::avail (this=0x7f7067564010) at /home/zhangyl/myimserver/base/asynclogging.h:45
24 45 int avail() const { return static_cast<int>(end() - cur_); }
25Missing separate debuginfos, use: debuginfo-install glibc-2.17-106.el7_2.6.x86_64 libgcc-4.8.5-11.el7.x86_64 libstdc++-4.8.5-11.el7.x86_64
26(gdb) bt
27#0 0x00000000004f5d3f in FixedBuffer<4000000>::avail (this=0x7f7067564010) at /home/zhangyl/myimserver/base/asynclogging.h:45
28#1 0x00000000004f4e20 in AsyncLogging::append (this=0x7ffd5d58ad50,
29 logline=0x7ffd5d58fe70 "20180820 15:46:06 DEBUG ~EventLoop EventLoop destructs in other thread - eventloop.cpp:87\n", len=90)
30 at /home/zhangyl/myimserver/base/asynclogging.cpp:98
31#2 0x000000000053bfc8 in asyncOutput (msg=0x7ffd5d58fe70 "20180820 15:46:06 DEBUG ~EventLoop EventLoop destructs in other thread - eventloop.cpp:87\n", len=90)
32 at /home/zhangyl/myimserver/chatserversrc/main.cpp:33
33#3 0x00000000004f205d in Logger::~Logger (this=0x7ffd5d58fe60, __in_chrg=<optimized out>) at /home/zhangyl/myimserver/base/logging.cpp:149
34#4 0x0000000000504bf1 in net::EventLoop::~EventLoop (this=0x7d1c00 <g_mainLoop>, __in_chrg=<optimized out>) at /home/zhangyl/myimserver/net/eventloop.cpp:87
35#5 0x00007f7067d79e69 in __run_exit_handlers () from /lib64/libc.so.6
36#6 0x00007f7067d79eb5 in exit () from /lib64/libc.so.6
37#7 0x00007f7067d62b1c in __libc_start_main () from /lib64/libc.so.6
38#8 0x00000000004f0ba9 in _start ()
39(gdb)
崩溃的根源在于堆栈#0,使用frame命令(frame 堆栈编号,如frame 0,表示切换到堆栈#0)切换到堆栈#0:
1(gdb) frame 0
2#0 0x00000000004f5d3f in FixedBuffer<4000000>::avail (this=0x7f7067564010) at /home/zhangyl/myimserver/base/asynclogging.h:45
345 int avail() const { return static_cast<int>(end() - cur_); }
4(gdb) p this
5$1 = (const FixedBuffer<4000000> * const) 0x7f7067564010
6(gdb)
使用print命令(简写成p)查看下当前对象的this指针,this指针时一个地址值,但是不代表这个对象有效,再使用print命令查看一下这个对象本身的数据(p *this),发现这个对象已经失效:
1(gdb) p *this
2Cannot access memory at address 0x7f7067564010
3(gdb)
这个对象为啥失效呢?而且堆栈#3和堆栈#4显示的是在EventLoop里面调用日志类Logger的析构函数。先看下堆栈#4处的函数代码:
1EventLoop::~EventLoop()
2{
3 LOG_DEBUG << "EventLoop destructs in other thread";
4
5 //std::stringstream ss;
6 //ss << "eventloop destructs threadid = " << threadId_;
7 //std::cout << ss.str() << std::endl;
8
9 wakeupChannel_->disableAll();
10 wakeupChannel_->remove();
11 ::close(wakeupFd_);
12 t_loopInThisThread = NULL;
13}
除了打印一行日志,也没有和日志Logger类相关的代码,且EventLoop也没有与Logger类相关的成员变量(也就是无隐式析构)。我们再看下堆栈#3处的代码:
1Logger::~Logger()
2{
3 impl_.finish();
4 const LogStream::Buffer& buf(stream().buffer());
5 g_output(buf.data(), buf.length());
6 if (impl_.level_ == FATAL)
7 {
8 g_flush();
9 abort();
10 }
11}
崩溃的地方就是g_output(buf.data(), buf.length()); 而这个buf的内容正好就是EventLoop析构函数里面打印的一行日志:
EventLoop destructs in other thread
那么这个g_output为啥会调用AsyncLogging::append函数呢?(参考堆栈#1)。我们通过全局搜索发现g_output是一个全局变量,并且在定义时已经有初始值:
Logger::OutputFunc g_output = defaultOutput;
而初始值defaultOutput的行为就是将日志输出到控制台:
1void defaultOutput(const char* msg, int len)
2 {
3 size_t n = fwrite(msg, 1, len, stdout);
4 //FIXME check n
5 (void)n;
6}
fwrite中的第二个参数是1,在linux中0代表标准输入控制台,1代表标准输出控制台,2代表错误输出控制台。
我们其实在main函数中,改变了日志的输出行为,让日志即可以输出到控制台也能输出到日志文件中:
1EventLoop g_mainLoop;
2AsyncLogging* g_asyncLog = NULL;
3void asyncOutput(const char* msg, int len)
4 {
5 if (g_asyncLog != NULL)
6 {
7 g_asyncLog->append(msg, len);
8 std::cout << msg << std::endl;
9 }
10}
11
12void asyncOutput(const char* msg, int len)
13 {
14 if (g_asyncLog != NULL)
15 {
16 g_asyncLog->append(msg, len);
17 std::cout << msg << std::endl;
18 }
19}
20
21int main(int argc, char* argv[])
22 {
23 //省略无关代码
24 std::string strLogFileFullPath(logfilepath);
25 strLogFileFullPath += logfilename;
26 Logger::setLogLevel(Logger::DEBUG);
27 int kRollSize = 500 * 1000 * 1000;
28 AsyncLogging log(strLogFileFullPath.c_str(), kRollSize);
29 log.start();
30 g_asyncLog = &log;
31 Logger::setOutput(asyncOutput);
32
33 //省略无关代码
34
35 g_mainLoop.loop();
36
37 return 0;
38}
看到这里问题就来了,当main函数执行完毕,即出了main函数的作用域,log对象已经销毁。此时正好正在销毁全局对象g_mainLoop,此时调用EventLoop的析构函数,在其析构函数中打印“EventLoop destructs in other thread”这行日志,仍然会调用asyncOutput函数,但是此时g_asyncLog 已经是一个无效的指针了,且不是空指针(野指针),所以你调用它的append方法必然会引起内存问题。
如何改正这个问题呢? 方法有多种,方法一就是在g_mainLoop.loop()后面加上一行,将全局指针对象g_asyncLog显式设置为空指针,这样EventLoop里面的析构函数打印的日志就不会打印了,因为asyncOutput函数中已经做了空指针判断。代码改成如下:
1int main(int argc, char* argv[])
2 {
3 //省略无关代码
4 std::string strLogFileFullPath(logfilepath);
5 strLogFileFullPath += logfilename;
6 Logger::setLogLevel(Logger::DEBUG);
7 int kRollSize = 500 * 1000 * 1000;
8 AsyncLogging log(strLogFileFullPath.c_str(), kRollSize);
9 log.start();
10 g_asyncLog = &log;
11 Logger::setOutput(asyncOutput);
12
13 //省略无关代码
14
15 g_mainLoop.loop();
16 g_asyncLog = NULL;
17
18 return 0;
19}
这种解决方案缺点是丢了部分日志,方法二就是将这个log对象也申明成全局变量,即将g_asyncLog改成对象而不是指针。我们这里采用方法二,修改后的代码如下:
1EventLoop g_mainLoop;
2
3AsyncLogging g_asyncLog;
4
5void asyncOutput(const char* msg, int len)
6 {
7 g_asyncLog.append(msg, len);
8 std::cout << msg << std::endl;
9}
10
11int main(int argc, char* argv[])
12 {
13 //无关代码省略...
14 std::string strLogFileFullPath(logfilepath);
15 strLogFileFullPath += logfilename;
16 Logger::setLogLevel(Logger::DEBUG);
17 int kRollSize = 1024 * 1024 * 1024;
18 //AsyncLogging log(strLogFileFullPath.c_str(), kRollSize);
19 g_asyncLog.setBaseName(strLogFileFullPath.c_str());
20 g_asyncLog.setRollSize(kRollSize);
21 g_asyncLog.start();
22 Logger::setOutput(asyncOutput);
23
24 //无关代码省略...
25
26 g_mainLoop.loop();
27
28
29 return 0;
30}
改完之后,重新编译程序,我们要验证一下,我们让程序接收Ctrl + C信号,由于gdb默认会自己处理Ctrl + C指令(行为是让gdb中断下来接收用户输入),我们可以修改gdb的设置让gdb不要处理这个信号,而是把这个信号传给我们的程序,在gdb中执行如下指令:
handle SIGINT nostop print pass
这样程序就能响应Ctrl + C了(SIGINT是Ctrl + C产生的信号值)。
1(gdb) handle SIGINT nostop print pass
2SIGINT is used by the debugger.
3Are you sure you want to change it? (y or n) y
4Signal Stop Print Pass to program Description
5SIGINT No Yes Yes Interrupt
6(gdb) r
7Starting program: /home/zhangyl/flamingoserver/chatserver
8[Thread debugging using libthread_db enabled]
9Using host libthread_db library "/lib64/libthread_db.so.1".
10[New Thread 0x7ffff442b700 (LWP 11460)]
11[New Thread 0x7ffff3c2a700 (LWP 11461)]
12[Thread 0x7ffff3c2a700 (LWP 11461) exited]
1320180820 22:00:27 INFO show databases - DatabaseMysql.cpp:93
14
1520180820 22:00:27 INFO CMysqlManager::_IsDBExist, find database(flamingo) - MysqlManager.cpp:195
16
1720180820 22:00:27 INFO desc t_user - DatabaseMysql.cpp:93
18
1920180820 22:00:27 INFO desc t_user_relationship - DatabaseMysql.cpp:93
20
2120180820 22:00:27 INFO desc t_chatmsg - DatabaseMysql.cpp:93
22
2320180820 22:00:27 INFO SELECT f_user_id, f_username, f_nickname, f_password, f_facetype, f_customface, f_gender, f_birthday, f_signature, f_address, f_phonenumber, f_mail, f_teaminfo FROM t_user ORDER BY f_user_id DESC - DatabaseMysql.cpp:93
24
2520180820 22:00:27 INFO current base userid: 0, current base group id: 268435455 - UserManager.cpp:111
26
27[New Thread 0x7ffff3c2a700 (LWP 11462)]
28[New Thread 0x7ffff2a74700 (LWP 11463)]
29[New Thread 0x7ffff2273700 (LWP 11464)]
30[New Thread 0x7ffff1a72700 (LWP 11465)]
31[New Thread 0x7ffff1271700 (LWP 11466)]
32[New Thread 0x7ffff0a70700 (LWP 11467)]
33[New Thread 0x7fffd3fff700 (LWP 11468)]
34[New Thread 0x7fffd37fe700 (LWP 11469)]
3520180820 22:00:27 INFO chatserver initialization completed, now you can use client to connect it. - main.cpp:139
36
37^C
38Program received signal SIGINT, Interrupt.
39program recv signal [2] to exit.
40Exit loop...
41Exit chatserver....
42[Thread 0x7ffff442b700 (LWP 11460) exited]
43[Thread 0x7fffd3fff700 (LWP 11468) exited]
44[Thread 0x7fffd37fe700 (LWP 11469) exited]
45[Thread 0x7ffff0a70700 (LWP 11467) exited]
46[Thread 0x7ffff1271700 (LWP 11466) exited]
47[Thread 0x7ffff1a72700 (LWP 11465) exited]
48[Thread 0x7ffff2273700 (LWP 11464) exited]
49[Thread 0x7ffff3c2a700 (LWP 11462) exited]
50[Thread 0x7ffff7fe5a80 (LWP 11459) exited]
51[Inferior 1 (process 11459) exited normally]
52(gdb)
至此,我们验证并修复了该bug,这样程序在响应Ctrl + C或者用kill + pid杀死进程就可以走正常退出流程了,而不再崩溃。不知道你学到了没有?
小结一下: 通过上面的例子我们可以发现,作为一个合格的linux后台开发人员,我们不仅要熟悉业务代码本身,还要熟练适用gdb各种命令,同时对操作系统的一些机制也要了解(例如:如何设置程序崩溃以后产生core文件)。如果你还没有掌握,建议一定要好好练习一下gdb的使用。