The One and the Many

Google's blog and TLS SNI

I heard about Google's new blog, The Keyword, last night and showed it to some friends in a chat channel. In this channel, a bot visits each linked site, pulls out its title, and writes the title into the channel, which gives everyone an idea of what a link points to. The bot was unable to retrieve the title from Google's blog. This post is about how I tracked down the problem and why it was happening.

TL;DR: The bot's TLS connection to blog.google was not using the TLS SNI extension. blog.google's server closes your connection without replying if you do not use this extension (which I believe is against the TLS spec).

Initial debugging

I enabled debug mode on the bot's script that performs the retrieval.

Oddly, the request was completing without error, but it was not finding a title. I added more debug logging and discovered the response was 0 bytes.

I tested using curl, and of course I saw the response body.

My first theory was that Google was doing some kind of user agent sniffing, and rejecting the bot's request for that reason. I tried adjusting the user agent on the bot and with curl, but this made no difference.

I reviewed the script for anything that could cause this, but nothing stood out. I could not solve the problem here.

Library/interpreter bug?

I did some searching and found reported issues with the TLS wrapper library in use, tcl-tls. I decided to try a newer version to see if the problem was fixed there.

With the latest version, 1.6.7, the behaviour changed. Instead of getting back 0 bytes and no errors, I was now seeing this:

connect failed due to unexpected EOF
    while executing
"::http::geturl $url -binary 1 -headers [list Accept text/html] -method GET"

I found reference to this error showing up as a bug in prior versions of this library, so I thought this might be another case that was not yet fixed.

Debugging tcl-tls

This morning I decided I would take a shot at figuring out the bug in tcl-tls.

I booted up a VM on Google's Compute Engine, downloaded the latest Tcl and tcl-tls, and compiled and installed them. (I wanted to be able to run make install without causing chaos on my desktop, and I don't have VMs set up on it yet. A VM I can blow away was quick to set up on GCP.)

I decided to rule out the Tcl http library by writing a small HTTP client using tcl-tls's socket directly. It looks like this:

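package require tls

set host blog.google
# Allow only TLS 1.2 so every run negotiates the same protocol version.
set sock [::tls::socket -ssl2 0 -ssl3 0 -tls1 0 -tls1.1 0 -tls1.2 1 $host 443]
fconfigure $sock -translation binary
# Send a minimal HTTP/1.1 request by hand, bypassing the http package.
puts -nonewline $sock "GET / HTTP/1.1\r\n"
puts -nonewline $sock "Host: $host\r\n"
puts -nonewline $sock "Accept: text/html\r\n"
puts -nonewline $sock "\r\n"
flush $sock
# Keep a running byte count of what comes back. gets strips the trailing
# newline from each line, so the count slightly undercounts.
set sz 0
while {[gets $sock line] >= 0} {
    incr sz [string length $line]
    puts "Read $sz bytes"
}
puts "Total: Read $sz bytes"
close $sock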

This client doesn't completely receive the response, but it is enough to get part of the body back.

This showed that the problem was not specific to the http library. This program wasn't able to make a connection either.

I thought: Maybe there are commits not released in tcl-tls that might make a difference. After checking it out from its CVS (!) repository, I found the released version identical. There were no new changes in the repository.

I dug into the tcl-tls code. I found that in tls.tcl we open a regular ::socket, and then pass it to the tcl-tls library with ::tls::import which takes over and sets up the connection in C.

I found in Tls_WaitForConnect() that the SSL_connect() call was returning an error. I added a bunch of verbose error code checking and print statements, and discovered the condition was this (as described in the OpenSSL documentation):

SSL_ERROR_SYSCALL

"Some I/O error occurred. The OpenSSL error queue may contain more information on the error. If the error queue is empty (i.e. ERR_get_error() returns 0), ret can be used to find out more about the error: If ret == 0, an EOF was observed that violates the protocol. If ret == -1, the underlying BIO reported an I/O error (for socket I/O on Unix systems, consult errno for details)."

In my case, ret was 0. This means there was an EOF that violated the protocol. This matched up with what I was seeing reported by the Tcl http library.

But why? And was what this error description said actually true?

strace revealed the following:

connect(3, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("216.239.32.21")}, 16) = 0
sendto(3, "\26\3\1\1\34\1\0\1\30\3\3\205$\0\313\254\376\360t\301\1\223\333\211kmsl\rvH\232"..., 289, 0, NULL, 0) = 289
recvfrom(3, "", 7, 0, NULL, NULL)       = 0

We were connecting, sending our TLS ClientHello (presumably), and the other side was closing the connection without replying.
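
The "presumably" can be checked from the bytes strace printed. The following is a minimal sketch, not part of the original debugging session, that decodes the TLS record header and handshake header at the start of that sendto() payload. The proc name decode_hello_prefix is made up for illustration, and the hex string is my transcription of the first 11 bytes from strace's octal escapes.

```tcl
# Decode the TLS record header and handshake header found at the start of
# a captured ClientHello. Returns a dict of the decoded fields.
proc decode_hello_prefix {bytes} {
    # Layout: 1-byte record type, 2-byte record version, 2-byte record
    # length, 1-byte handshake type, 3-byte handshake length (read here as
    # 1 + 2 bytes), 2-byte client version. All multi-byte fields are
    # big-endian.
    set n [binary scan $bytes "cu Su Su cu cu Su Su" \
        recType recVersion recLen hsType hsLenHi hsLenLo clientVersion]
    if {$n != 7} {
        error "need at least 11 bytes"
    }
    return [dict create \
        record_type      $recType \
        record_version   [format 0x%04x $recVersion] \
        record_length    $recLen \
        handshake_type   $hsType \
        handshake_length [expr {($hsLenHi << 16) | $hsLenLo}] \
        client_version   [format 0x%04x $clientVersion]]
}

# First 11 bytes of the sendto() payload from the strace output,
# transcribed from octal escapes (\26\3\1\1\34\1\0\1\30\3\3) into hex.
set prefix [binary format H* 160301011c010001180303]
puts [decode_hello_prefix $prefix]
```

Record type 22 is a handshake record and handshake type 1 is a ClientHello, with client version 0x0303 (TLS 1.2). The record length of 284 plus the 5-byte record header gives 289, which matches the byte count strace reported for sendto().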

I figured there could be a problem with how tcl-tls was creating its TLS connection. Maybe it was doing something silly. I read about and adjusted the options used during its SSL_CTX creation. Nothing changed the behaviour.

The discovery

At this point I was feeling stuck. I started thinking about whether I should try to reverse engineer the ClientHello to see if there was something in it that would tell me more.
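
For illustration, here is roughly what such an inspection could look like. This is a sketch under assumptions, not something run during the original debugging: it walks the body of a ClientHello (everything after the 4-byte handshake header) and reports whether a server_name extension (extension type 0) is present. All proc names are hypothetical, the test message is synthetic rather than a real capture, and the parser assumes a well-formed message.

```tcl
# Return 1 if the ClientHello body carries a server_name extension
# (extension type 0), otherwise 0.
proc client_hello_has_sni {body} {
    # Skip client_version (2 bytes) and random (32 bytes).
    set pos 34
    # session_id: 1-byte length prefix.
    if {[binary scan $body "@${pos}cu" n] != 1} { return 0 }
    set pos [expr {$pos + 1 + $n}]
    # cipher_suites: 2-byte length prefix.
    if {[binary scan $body "@${pos}Su" n] != 1} { return 0 }
    set pos [expr {$pos + 2 + $n}]
    # compression_methods: 1-byte length prefix.
    if {[binary scan $body "@${pos}cu" n] != 1} { return 0 }
    set pos [expr {$pos + 1 + $n}]
    # The extensions block is optional and has a 2-byte length prefix.
    if {[binary scan $body "@${pos}Su" extLen] != 1} { return 0 }
    incr pos 2
    set end [expr {$pos + $extLen}]
    # Each extension is a 2-byte type, a 2-byte length, then the data.
    while {$pos + 4 <= $end} {
        if {[binary scan $body "@${pos}SuSu" type len] != 2} { break }
        if {$type == 0} { return 1 }
        set pos [expr {$pos + 4 + $len}]
    }
    return 0
}

# Build a server_name extension for HOST (synthetic test data).
proc sni_extension {host} {
    set n [string length $host]
    # type 0, extension length, list length, name_type 0 (host_name),
    # name length, name.
    return [binary format SSScSa* 0 [expr {$n + 5}] [expr {$n + 3}] 0 $n $host]
}

# Build a synthetic ClientHello body around the given extension bytes.
proc demo_hello {extensions} {
    set body [binary format S 0x0303]        ;# client_version: TLS 1.2
    append body [binary format x32]          ;# random (zeros for the demo)
    append body [binary format c 0]          ;# empty session_id
    append body [binary format SS 2 0x002f]  ;# one cipher suite
    append body [binary format cc 1 0]       ;# null compression only
    append body [binary format S [string length $extensions]] $extensions
    return $body
}

puts "with SNI:    [client_hello_has_sni [demo_hello [sni_extension blog.google]]]"
puts "without SNI: [client_hello_has_sni [demo_hello {}]]"
```

Run against a real capture, a check like this would have pointed at the missing extension directly.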

Then I decided to try a few more sites with my test HTTP client. I found that another one was failing, but with a different error. I was seeing this instead of the unexpected EOF condition:

error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error

I knew this site worked elsewhere too. I pulled out openssl s_client. It showed the same error.

Looking this error up, I found references to needing -servername, the s_client option that sends the SNI extension. After adding that, the connection worked fine.

I decided to try connecting to Google's blog with this option. What do you know, it worked!

Indeed, adding a -servername into my test HTTP client made the request work there as well:

set sock [::tls::socket -ssl2 0 -ssl3 0 -tls1 0 -tls1.1 0 -tls1.2 1 -servername $host $host 443]

I plan to update the bot's script to pass a -servername.
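
The bot's script is not shown here, so the following is only a sketch of one way that change could look when the script goes through Tcl's http package. It registers a small wrapper around ::tls::socket that picks the host out of the arguments and passes it as -servername. The proc name tls_socket_with_sni is made up, and the sketch assumes Tcl 8.5 or later (for the {*} syntax) and no HTTP proxy in between.

```tcl
package require http
package require tls

# Hypothetical wrapper. The http package invokes the registered command as
#   command ?options...? host port
# so the host is the second-to-last argument. If a proxy is configured,
# that argument is the proxy's address, which this sketch does not handle.
proc tls_socket_with_sni {args} {
    set host [lindex $args end-1]
    return [::tls::socket -servername $host {*}$args]
}

# Use the wrapper for every https:// URL fetched through ::http::geturl.
::http::register https 443 tls_socket_with_sni
```

With this registered, the existing ::http::geturl call would not need to change, since the wrapper supplies the server name for each connection.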

For reference, here's what openssl s_client showed when connecting to blog.google without -servername:

140445991155344:error:140790E5:SSL routines:SSL23_WRITE:ssl handshake failure:s23_lib.c:177:

Is Google violating the TLS spec?

As far as I can tell from the TLS 1.2 RFC and the TLS extensions RFC that defines server_name, what Google's HTTP server is doing is in violation of the spec. It should not be closing the connection without replying in some way to the ClientHello.

It may be acceptable for them to require SNI, but how they are dealing with it not being there is problematic. Cloudflare also requires SNI (for their free tier at least), but in their case they send back a response to the ClientHello (a fatal internal error alert).

Indeed, RFC 5246 says this about ServerHello:

"The server will send this message in response to a ClientHello message when it was able to find an acceptable set of algorithms. If it cannot find such a match, it will respond with a handshake failure alert."

The TLS 1.3 draft says much the same.

The 1.2 RFC also says:

"The client sends a ClientHello message to which the server must respond with a ServerHello message, or else a fatal error will occur and the connection will fail."

And further:

"A server MUST accept ClientHello messages both with and without the extensions field, and (as for all other messages) it MUST check that the amount of data in the message precisely matches one of these formats; if not, then it MUST send a fatal "decode_error" alert."

Together, these imply to me that a response is required.

The end

This shows how much you can learn from playing with a little IRC bot!

What is most surprising to me (other than Google doing something incorrect like this) is how many sites work without SNI, and that the bot was not using it. I suppose I assumed the HTTP and TLS libraries I was using did it for me.