The One and the Many

tcl-tls, OpenSSL, threads, and Irssi crashing

Yesterday I was testing an irssi-tcl script of mine would segfault when I unloaded the irssi-tcl module. I was concerned there was a bug in this module (which I wrote), so I decided to figure out what was going on.

This post is about my investigation, why it happened, and how I fixed it.

On the weekend I updated my system to Debian Testing. I was certain that unloading the module did not crash Irssi before this update, so I expected that there was something in a newer version of Irssi that was causing the crash. I initially thought that there may have been changes in Irssi's module support that would require an update, as I knew that there had been some work done on Irssi in that area. I had already needed to update the module to add an ABI check function.

I also wondered whether the issue could be something fixed upstream in Irssi. I took a quick look at the differences between my Irssi version (the latest, 0.8.20) and what was in the repository. I noticed commits about NULL pointer issues in invalid config files which I thought could be another possible cause.

My first step was to try to reproduce the problem reliably. I reset to a default Irssi configuration (mv .irssi .irssi.backup), and re-introduced my configuration, scripts, and modules one by one. Eventually I narrowed down reproducing the crash to this:

Each of these is key, or the crash won't happen.

I tried to debug this on my main system, but as there were no debug symbols, debugging with gdb was difficult. I started up a new Compute VM on the Google Cloud Platform, and downloaded and compiled the following (primarily so I would have debug symbols available, and easy access to changing the source):

I first verified I could reproduce the crash again (to rule out that it was an issue with how the package was created in Debian). I could:

Program received signal SIGSEGV, Segmentation fault.
0x00007f936110022d in CryptoThreadIdCallback () at tls.c:194
194     return (unsigned long) Tcl_GetCurrentThread();

I could see a useful backtrace. The backtrace varies, but here is an excerpt showing the most relevant parts:

#0  0x00007fc04c96722d in CryptoThreadIdCallback () at tls.c:191
#1  0x00007fc04f64c622 in CRYPTO_THREADID_current ()
   from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#2  0x00007fc04f6da888 in ERR_get_state () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#3  0x00007fc04f6daadf in ERR_clear_error ()
   from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#4  0x00000000004913ac in irssi_ssl_write (handle=0x184adb0,
    buf=0x7ffeed92d0c0 "WHOIS will\r\n", len=12, ret=0x7ffeed92d000, gerr=0x7ffeed92d058)
    at network-openssl.c:338
#5  0x00007fc04fc75e74 in g_io_channel_write_chars ()
   from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#6  0x0000000000484378 in net_transmit (handle=<optimized out>,
    data=data@entry=0x7ffeed92d0c0 "WHOIS will\r\n", len=len@entry=12) at network.c:366

This shows that the crash occurs in tls.c, which is part of tcl-tls. This is the function:

static unsigned long
CryptoThreadIdCallback(void)
{
    return (unsigned long) Tcl_GetCurrentThread();
}

This told me that a Tcl function was being called. However, at this point there was no more Tcl interpreter - unloading the irssi-tcl module deletes the interpreter. This callback function should not be calling this function after we unload the module.

I was looking at a bug in tcl-tls. This is a coincidence since I recently blamed it for a bug and it turned out tcl-tls was not at fault!

After comparing the differences between the tcl-tls version without the crash (1.6) and the current one (1.6.7), I found that this function was added in 1.6.3. In that release, there was a change made to make the use of OpenSSL thread safe. You can read more about how to do that here. You can see the patch the tcl-tls team applied here.

What the change does is tell OpenSSL to call this function to find out its current thread id. OpenSSL remembered this callback even when we removed the module that contained it. The next time there was any interaction with OpenSSL, it would try to call this callback, and segfault.

This also explains why each of the steps I listed to reproduce the bug were critical. If we don't have a connection to a TLS server, then there would be no problematic interaction with OpenSSL. If we didn't load urltitle.tcl, then we wouldn't load tcl-tls (so really any script that loaded the tls package would suffice for loading urltitle.tcl).

To resolve this I thought the ideal solution would be to clear the callbacks when we unload the Tcl interpreter. After looking at the Tcl C API for a while, I came across the function Tcl_CallWhenDeleted() (docs here). This function registers a callback in the Tcl interpreter to be called when we delete the interpreter.

I added a function to set the OpenSSL callbacks to NULL:

static void
TlsInterpDeleteCallback(ClientData clientData, Tcl_Interp *interp)
{
#if defined(OPENSSL_THREADS) && defined(TCL_THREADS)
    CRYPTO_set_locking_callback(NULL);
    CRYPTO_set_id_callback(NULL);
#endif
}

And call it in tcl-tls when initializing the package:

Tcl_CallWhenDeleted(interp, TlsInterpDeleteCallback, (ClientData) 0);

This fixed the problem. No more crashes!

I wondered about if this would be sufficient in all cases. What about applications that already register these OpenSSL callbacks? In the case of irssi-tcl, the interpreter is in the same thread as the main Irssi process, and Irssi does not set these functions, but there's no guarantee this will be the case in all applications. It is possible for an application to already be using OpenSSL safely with threads and have these callbacks registered. In that case, loading and then unloading the tcl-tls module would destroy the thread safety guarantee as the callbacks would be unset. It may be more appropriate to have these callbacks be defined in the application itself rather than inside this library, or to perhaps to at least provide a configurable way inside tcl-tls to allow that.

Another interesting case is if an application loads tcl-tls, and then afterwards itself (or another library) sets these OpenSSL callback functions. In that case, OpenSSL may not be safe to use in any Tcl threads (though if no Tcl threads were used, presumably it would be safe). Even if Tcl threads were used, depending on the callbacks set, it may still be safe, as the callbacks set by the application may be sufficient to recognize an executing Tcl thread.

For the first of these cases, I realized we could know the pre-existing callbacks at the time tcl-tls loads, and set them back to when we delete the interpreter. I updated my patch to include this when initializing tcl-tls:

old_crypto_thread_lock_callback = CRYPTO_get_locking_callback();
old_crypto_thread_id_callback = CRYPTO_get_id_callback();

CRYPTO_set_locking_callback(CryptoThreadLockCallback);
CRYPTO_set_id_callback(CryptoThreadIdCallback);

And in the interpreter deletion callback:

CRYPTO_set_locking_callback(old_crypto_thread_lock_callback);
CRYPTO_set_id_callback(old_crypto_thread_id_callback);

This does not address the second of the two possible problems, but I'm not sure of a good way to do so.

One more point to mention: These CRYPTO_ functions were deprecated in OpenSSL 1.0.0, so replacing them all together with the newer equivalents will probably eventually be necessary.

I created a ticket about this bug, and included my patch.

Comments