Debugging a rarely occurring bug under Debian
For months now I have been trying to figure out a bug. This sounds like a crazy amount of time, but most of that time I've been waiting for the bug to happen. The bug happens very rarely and I do not know how to trigger it!
The problem manifests as Irssi hanging and/or consuming 100% CPU. The particular Irssi instance runs a script to poll Strava's API for recent activities (bike rides). I've seen the problem about three or four times in the past year.
The prior times it happened I've
strace'd Irssi. Each time I've found
the process stuck in the
recv(2) system call trying to receive from the
Strava API. In an asynchronous variant of the script, I found the process stuck
recv(2) in an infinite loop (I believe
recv() was returning
EAGAIN). Unfortunately I don't have
strace output to show.
As Irssi is a C project, ideally I would attach to the process with gdb and debug it there. However I had no debugging symbols available as Irssi and the libraries it depends on were from the regular Debian packages.
I suspect the bug is somewhere in
tcl-tls (both dependencies of
Attempt #1: Build Irssi in a Docker container and wait for the problem
I wanted to figure out the problem. I also did not want to build and install
hand compiled versions of Irssi and various libraries on my main system. I
decided to move the Irssi client to a Docker container where I would build
everything from source and have debugging symbols available. Then when the
problem occurred, I could attach to Irssi with
gdb and figure it out.
I downloaded the most recent versions of
irssi and built and installed them in a container. Something like these
wget http://prdownloads.sourceforge.net/tcl/tcl8.6.6-src.tar.gz tar zxvf tcl8.6.6-src.tar.gz cd tcl8.6.6/unix ./configure --enable-symbols time make -j4 make install cd wget 'http://downloads.sourceforge.net/project/tls/tls/1.6.7/tls1.6.7-src.tar.gz?r=https%3A%2F%2Fsourceforge.net%2Fprojects%2Ftls%2Ffiles%2F&ts=1476221428&use_mirror=pilotfiber' tar zxvf tls* cd tls1.6.7 ./configure --enable-symbols time make -j4 make install cd wget https://github.com/tcltk/tcllib/archive/tcllib_1_18.tar.gz tar zxvf tcllib_1_18.tar.gz cd tcllib-tcllib_1_18 ./configure make install cd wget https://github.com/irssi/irssi/releases/download/0.8.20/irssi-0.8.20.tar.gz tar zxvf irssi-0.8.20.tar.gz cd irssi-0.8.20 ./configure time make -j4 make install
Then I built and installed irssi-tcl, a module for Irssi which lets it run Tcl scripts. Afterwards I installed irssi-strava and configured it.
I ran Irssi in this container and waited. For months. It's now been about four months. The bug never re-occurred!
Attempt #2: Back to the bare metal
I'm not sure why the bug never appeared when running in the Docker container. Maybe it's just bad (or good, depending on your viewpoint) luck. Maybe it is because of differences between the two systems. The container ran on my 64-bit desktop computer. Every time I've seen the bug it has been on my 32-bit server. The desktop is far more powerful, and the server is usually overloaded (to the point of triggering the oom killer). Also, given I built Irssi and its dependencies myself, it's very possible different flags or versions changed the behaviour.
I wanted to go back to running Irssi on my server. I still did not want to build and install hand compiled versions of Irssi or the libraries it depends on though. I knew it should be possible to get a core dump from Irssi running on the server and then take that core dump and use it on a different system with debug symbols available. This is a bit of a round about way but not unusual.
gdb includes a program called
gcore [pid] generates a core
dump from a running process (and leaves that process intact!). I tested
generating a core from a running Irssi process. Then I wanted to verify I could
load it on another system with debugging symbols.
To use the core dump, I believed I needed to build the same version of Irssi
and its libraries, but with debugging symbols. I set up a new Docker container
to work in. I came across the Debian wiki page
HowToGetABackTrace. This page
talks about how to build packages with debugging symbols. Most package
maintainers have transitioned their packages so that a
-dbgsym version of
their package automatically builds when you build their package.
For example, building Irssi like this:
apt-get build-dep irssi apt-get source irssi cd irssi-1.0.0 dpkg-buildpackage -us -uc
Results in these
.deb files (among others):
irssi-dbgsym depends on
irssi. Installing both means we have debugging
symbols available for Irssi.
Even better, Debian provides repositories where one can find
already built. This means to do what I wanted to do I didn't need to build any
packages. I could do a regular
-dbgsym packages are a fairly
new addition. A hugely useful one!
All of the libraries/dependencies I needed debugging symbols for (save one) have such packages:
The one that does not is
libglib2.0-0. However there is a
-dbg package for
-dbg was a precursor to
-dbgsym it seems):
tcllib includes only Tcl scripts, so debugging symbols are not necessary for
After installing these packages, I could see the functions and lines when
loading the core with
# gcore 12934 # ... Saved corefile core.12934 $ gdb /usr/bin/irssi core.12934 # ... (gdb) bt #0 0xb7740cf9 in __kernel_vsyscall () #1 0xb6f49b0f in poll () at ../sysdeps/unix/syscall-template.S:84 #2 0xb7373190 in poll (__timeout=1000, __nfds=3, __fds=0x82d297a8) at /usr/include/i386-linux-gnu/bits/poll2.h:46 #3 g_poll (fds=0x82d297a8, nfds=3, timeout=1000) at ././glib/gpoll.c:124 #4 0xb736371c in g_main_context_poll (priority=<optimized out>, n_fds=3, fds=0x82d297a8, timeout=<optimized out>, context=0x81dac000) at ././glib/gmain.c:4228 #5 g_main_context_iterate (context=context@entry=0x81dac000, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at ././glib/gmain.c:3924 #6 0xb7363854 in g_main_context_iteration (context=0x81dac000, may_block=1) at ././glib/gmain.c:3990 #7 0x800ed3a5 in main (argc=<optimized out>, argv=<optimized out>) at irssi.c:327
This means I don't need to use a separate system to inspect the core dumps. I
can install the
-dbg packages on my main system and debug
Now it's back to waiting for the bug to occur! Here's hoping a bug kills my IRC client?