The One and the Many

Debugging a rarely occurring bug under Debian

For months now I have been trying to figure out a bug. This sounds like a crazy amount of time, but most of that time I've been waiting for the bug to happen. The bug happens very rarely and I do not know how to trigger it!

The problem manifests as Irssi hanging and/or consuming 100% CPU. The particular Irssi instance runs a script to poll Strava's API for recent activities (bike rides). I've seen the problem about three or four times in the past year.

The prior times it happened I've strace'd Irssi. Each time I've found the process stuck in the recv(2) system call trying to receive from the Strava API. In an asynchronous variant of the script, I found the process stuck calling recv(2) in an infinite loop (I believe recv() was returning EAGAIN). Unfortunately I don't have strace output to show.

As Irssi is a C project, ideally I would attach to the process with gdb and debug it there. However I had no debugging symbols available as Irssi and the libraries it depends on were from the regular Debian packages.

I suspect the bug is somewhere in tcl or tcl-tls (both dependencies of strava.tcl).

Attempt #1: Build Irssi in a Docker container and wait for the problem

I wanted to figure out the problem. I also did not want to build and install hand compiled versions of Irssi and various libraries on my main system. I decided to move the Irssi client to a Docker container where I would build everything from source and have debugging symbols available. Then when the problem occurred, I could attach to Irssi with gdb and figure it out.

I downloaded the most recent versions of tcl, tcl-tls, tcllib, and irssi and built and installed them in a container. Something like these commands:

wget http://prdownloads.sourceforge.net/tcl/tcl8.6.6-src.tar.gz
tar zxvf tcl8.6.6-src.tar.gz
cd tcl8.6.6/unix
./configure --enable-symbols
time make -j4
make install

cd
wget 'http://downloads.sourceforge.net/project/tls/tls/1.6.7/tls1.6.7-src.tar.gz?r=https%3A%2F%2Fsourceforge.net%2Fprojects%2Ftls%2Ffiles%2F&ts=1476221428&use_mirror=pilotfiber'
tar zxvf tls*
cd tls1.6.7
./configure --enable-symbols
time make -j4
make install

cd
wget https://github.com/tcltk/tcllib/archive/tcllib_1_18.tar.gz
tar zxvf tcllib_1_18.tar.gz
cd tcllib-tcllib_1_18
./configure
make install

cd
wget https://github.com/irssi/irssi/releases/download/0.8.20/irssi-0.8.20.tar.gz
tar zxvf irssi-0.8.20.tar.gz
cd irssi-0.8.20
./configure
time make -j4
make install

Then I built and installed irssi-tcl, a module for Irssi which lets it run Tcl scripts. Afterwards I installed irssi-strava and configured it.

I ran Irssi in this container and waited. For months. It's now been about four months. The bug never re-occurred!

Attempt #2: Back to the bare metal

I'm not sure why the bug never appeared when running in the Docker container. Maybe it's just bad (or good, depending on your viewpoint) luck. Maybe it is because of differences between the two systems. The container ran on my 64-bit desktop computer. Every time I've seen the bug it has been on my 32-bit server. The desktop is far more powerful, and the server is usually overloaded (to the point of triggering the oom killer). Also, given I built Irssi and its dependencies myself, it's very possible different flags or versions changed the behaviour.

I wanted to go back to running Irssi on my server. I still did not want to build and install hand compiled versions of Irssi or the libraries it depends on though. I knew it should be possible to get a core dump from Irssi running on the server and then take that core dump and use it on a different system with debug symbols available. This is a bit of a round about way but not unusual.

gdb includes a program called gcore. Calling gcore [pid] generates a core dump from a running process (and leaves that process intact!). I tested generating a core from a running Irssi process. Then I wanted to verify I could load it on another system with debugging symbols.

To use the core dump, I believed I needed to build the same version of Irssi and its libraries, but with debugging symbols. I set up a new Docker container to work in. I came across the Debian wiki page HowToGetABackTrace. This page talks about how to build packages with debugging symbols. Most package maintainers have transitioned their packages so that a -dbgsym version of their package automatically builds when you build their package.

For example, building Irssi like this:

apt-get build-dep irssi
apt-get source irssi
cd irssi-1.0.0
dpkg-buildpackage -us -uc

Results in these .deb files (among others):

irssi_1.0.0-1_amd64.deb
irssi-dbgsym_1.0.0-1_amd64.deb

irssi-dbgsym depends on irssi. Installing both means we have debugging symbols available for Irssi.

Even better, Debian provides repositories where one can find -dbgsym packages already built. This means to do what I wanted to do I didn't need to build any packages. I could do a regular apt-get install. Apparently these -dbgsym packages are a fairly new addition. A hugely useful one!

All of the libraries/dependencies I needed debugging symbols for (save one) have such packages:

tcl8.6-dbgsym
tcl-tls-dbgsym

The one that does not is libglib2.0-0. However there is a -dbg package for it (-dbg was a precursor to -dbgsym it seems):

libglib2.0-0-dbg

tcllib includes only Tcl scripts, so debugging symbols are not necessary for it.

After installing these packages, I could see the functions and lines when loading the core with gdb:

# gcore 12934
# ...
Saved corefile core.12934

$ gdb /usr/bin/irssi core.12934
# ...
(gdb) bt                              
#0  0xb7740cf9 in __kernel_vsyscall ()               
#1  0xb6f49b0f in poll () at ../sysdeps/unix/syscall-template.S:84
#2  0xb7373190 in poll (__timeout=1000, __nfds=3, __fds=0x82d297a8)
    at /usr/include/i386-linux-gnu/bits/poll2.h:46
#3  g_poll (fds=0x82d297a8, nfds=3, timeout=1000) at ././glib/gpoll.c:124
#4  0xb736371c in g_main_context_poll (priority=<optimized out>, n_fds=3, fds=0x82d297a8,
    timeout=<optimized out>, context=0x81dac000) at ././glib/gmain.c:4228
#5  g_main_context_iterate (context=context@entry=0x81dac000, block=block@entry=1,
    dispatch=dispatch@entry=1, self=<optimized out>) at ././glib/gmain.c:3924
#6  0xb7363854 in g_main_context_iteration (context=0x81dac000, may_block=1)             
    at ././glib/gmain.c:3990                                             
#7  0x800ed3a5 in main (argc=<optimized out>, argv=<optimized out>) at irssi.c:327

This means I don't need to use a separate system to inspect the core dumps. I can install the -dbgsym and -dbg packages on my main system and debug there.

Now it's back to waiting for the bug to occur! Here's hoping a bug kills my IRC client?