Debugging a rarely occurring bug under Debian
For months now I have been trying to figure out a bug. This sounds like a crazy amount of time, but most of that time I've been waiting for the bug to happen. The bug happens very rarely and I do not know how to trigger it!
The problem manifests as Irssi hanging and/or consuming 100% CPU. The particular Irssi instance runs a script to poll Strava's API for recent activities (bike rides). I've seen the problem about three or four times in the past year.
The prior times it happened I've
strace'd Irssi. Each time I've found
the process stuck in the recv(2)
system call trying to receive from the
Strava API. In an asynchronous variant of the script, I found the process stuck
calling recv(2)
in an infinite loop (I believe recv()
was returning
EAGAIN
). Unfortunately I don't have strace
output to show.
As Irssi is a C project, ideally I would attach to the process with gdb and debug it there. However I had no debugging symbols available as Irssi and the libraries it depends on were from the regular Debian packages.
I suspect the bug is somewhere in tcl
or tcl-tls
(both dependencies of
strava.tcl).
Attempt #1: Build Irssi in a Docker container and wait for the problem
I wanted to figure out the problem. I also did not want to build and install
hand compiled versions of Irssi and various libraries on my main system. I
decided to move the Irssi client to a Docker container where I would build
everything from source and have debugging symbols available. Then when the
problem occurred, I could attach to Irssi with gdb
and figure it out.
I downloaded the most recent versions of tcl
, tcl-tls
, tcllib
, and
irssi
and built and installed them in a container. Something like these
commands:
wget http://prdownloads.sourceforge.net/tcl/tcl8.6.6-src.tar.gz
tar zxvf tcl8.6.6-src.tar.gz
cd tcl8.6.6/unix
./configure --enable-symbols
time make -j4
make install
cd
wget 'http://downloads.sourceforge.net/project/tls/tls/1.6.7/tls1.6.7-src.tar.gz?r=https%3A%2F%2Fsourceforge.net%2Fprojects%2Ftls%2Ffiles%2F&ts=1476221428&use_mirror=pilotfiber'
tar zxvf tls*
cd tls1.6.7
./configure --enable-symbols
time make -j4
make install
cd
wget https://github.com/tcltk/tcllib/archive/tcllib_1_18.tar.gz
tar zxvf tcllib_1_18.tar.gz
cd tcllib-tcllib_1_18
./configure
make install
cd
wget https://github.com/irssi/irssi/releases/download/0.8.20/irssi-0.8.20.tar.gz
tar zxvf irssi-0.8.20.tar.gz
cd irssi-0.8.20
./configure
time make -j4
make install
Then I built and installed irssi-tcl, a module for Irssi which lets it run Tcl scripts. Afterwards I installed irssi-strava and configured it.
I ran Irssi in this container and waited. For months. It's now been about four months. The bug never re-occurred!
Attempt #2: Back to the bare metal
I'm not sure why the bug never appeared when running in the Docker container. Maybe it's just bad (or good, depending on your viewpoint) luck. Maybe it is because of differences between the two systems. The container ran on my 64-bit desktop computer. Every time I've seen the bug it has been on my 32-bit server. The desktop is far more powerful, and the server is usually overloaded (to the point of triggering the oom killer). Also, given I built Irssi and its dependencies myself, it's very possible different flags or versions changed the behaviour.
I wanted to go back to running Irssi on my server. I still did not want to build and install hand compiled versions of Irssi or the libraries it depends on though. I knew it should be possible to get a core dump from Irssi running on the server and then take that core dump and use it on a different system with debug symbols available. This is a bit of a round about way but not unusual.
gdb
includes a program called gcore
. Calling gcore [pid]
generates a core
dump from a running process (and leaves that process intact!). I tested
generating a core from a running Irssi process. Then I wanted to verify I could
load it on another system with debugging symbols.
To use the core dump, I believed I needed to build the same version of Irssi
and its libraries, but with debugging symbols. I set up a new Docker container
to work in. I came across the Debian wiki page
HowToGetABackTrace. This page
talks about how to build packages with debugging symbols. Most package
maintainers have transitioned their packages so that a -dbgsym
version of
their package automatically builds when you build their package.
For example, building Irssi like this:
apt-get build-dep irssi
apt-get source irssi
cd irssi-1.0.0
dpkg-buildpackage -us -uc
Results in these .deb
files (among others):
irssi_1.0.0-1_amd64.deb
irssi-dbgsym_1.0.0-1_amd64.deb
irssi-dbgsym
depends on irssi
. Installing both means we have debugging
symbols available for Irssi.
Even better, Debian provides repositories where one can find -dbgsym
packages
already built. This means to do what I wanted to do I didn't need to build any
packages. I could do a regular apt-get install
.
Apparently
these -dbgsym
packages are a fairly
new addition. A hugely useful one!
All of the libraries/dependencies I needed debugging symbols for (save one) have such packages:
tcl8.6-dbgsym
tcl-tls-dbgsym
The one that does not is libglib2.0-0
. However there is a -dbg
package for
it (-dbg
was a precursor to -dbgsym
it seems):
libglib2.0-0-dbg
tcllib
includes only Tcl scripts, so debugging symbols are not necessary for
it.
After installing these packages, I could see the functions and lines when
loading the core with gdb
:
# gcore 12934
# ...
Saved corefile core.12934
$ gdb /usr/bin/irssi core.12934
# ...
(gdb) bt
#0 0xb7740cf9 in __kernel_vsyscall ()
#1 0xb6f49b0f in poll () at ../sysdeps/unix/syscall-template.S:84
#2 0xb7373190 in poll (__timeout=1000, __nfds=3, __fds=0x82d297a8)
at /usr/include/i386-linux-gnu/bits/poll2.h:46
#3 g_poll (fds=0x82d297a8, nfds=3, timeout=1000) at ././glib/gpoll.c:124
#4 0xb736371c in g_main_context_poll (priority=<optimized out>, n_fds=3, fds=0x82d297a8,
timeout=<optimized out>, context=0x81dac000) at ././glib/gmain.c:4228
#5 g_main_context_iterate (context=context@entry=0x81dac000, block=block@entry=1,
dispatch=dispatch@entry=1, self=<optimized out>) at ././glib/gmain.c:3924
#6 0xb7363854 in g_main_context_iteration (context=0x81dac000, may_block=1)
at ././glib/gmain.c:3990
#7 0x800ed3a5 in main (argc=<optimized out>, argv=<optimized out>) at irssi.c:327
This means I don't need to use a separate system to inspect the core dumps. I
can install the -dbgsym
and -dbg
packages on my main system and debug
there.
Now it's back to waiting for the bug to occur! Here's hoping a bug kills my IRC client?