The One and the Many

Perl and character encoding

One of the challenges when working with Perl is understanding its approach to character encoding. I've struggled several times with it, and after a recent debugging session I thought I'd write up my thoughts and some tricks I've found helpful.

Why is it challenging?

The main complexity is that Perl has an internal character encoding. This encoding allows Perl to hold Unicode characters in strings and treat them as single characters. This is useful because a Unicode character may be represented by multiple bytes in a Unicode encoding such as UTF-8. It means there are two types of strings: those with characters in Perl's internal encoding, and those without. The latter can typically be treated as binary data ("octets"), though strictly speaking they are octets only if they come from outside the program, i.e., they are not strings defined in source code.
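As a minimal sketch of the two kinds of string, assume the input is the UTF-8 octets for "hi þere" (the variable names are only for illustration). The same text has a different length depending on whether the string holds octets or decoded characters:

```perl
use strict;
use warnings;

use Encode ();

# "hi þere" as UTF-8 octets: "þ" (U+00FE) takes two bytes, 0xc3 0xbe.
my $octets = "hi \x{c3}\x{be}ere";

# Decode the octets into a string of characters.
my $chars = Encode::decode('UTF-8', $octets);

printf "octets: %d, characters: %d\n", length($octets), length($chars);
# prints "octets: 8, characters: 7"
```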

Programmers need to juggle back and forth between the internal encoding and an external encoding. Unfortunately it is often not clear at a particular point whether you have a string with octets or the internal encoding, and it is not easy to check. You have to trace through the program and examine the transformations done to each string. Tracing through is difficult as it is common for functions to not state whether you should provide them octets or strings in the internal encoding, nor what they return.

The existence of the internal character encoding has other side effects that lead to complexity. One of those is Perl's I/O layers. These exist to make translating back and forth between the internal encoding and an external encoding automatic. For example, you can say that stdout is UTF-8, and then when you print a string in the internal encoding to stdout, Perl automatically encodes it to UTF-8 before writing it out. This means that not only do you need to think about what is in each string, but also what settings you have on each filehandle.
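Here is a small sketch of an I/O layer at work. To keep it self-contained it writes to an in-memory filehandle rather than stdout; for stdout the equivalent setting would be binmode(STDOUT, ':encoding(UTF-8)'). The variable names are illustrative.

```perl
use strict;
use warnings;

# A string of characters: "þ" is the single character U+00FE here.
my $chars = "hi \x{fe}ere\n";

# Write to an in-memory filehandle so the result is easy to inspect.
# The :encoding(UTF-8) layer encodes characters on the way out.
my $buffer = '';
open(my $fh, '>:encoding(UTF-8)', \$buffer) or die "open failed: $!";
print {$fh} $chars;
close($fh) or die "close failed: $!";

# $buffer now holds the UTF-8 octets; show them as hex.
print unpack('H*', $buffer), "\n";
# prints "686920c3be6572650a"
```

The one-character "þ" went in, and the two octets c3 be came out, without any explicit call to an encoding function.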

Here are some docs to show how complex the situation is:

What is in my string?

If you have a string and you're not entirely sure if it contains octets or Perl's internal encoding, what do you do? Suppose you have the string "hi þere" encoded as UTF-8, but you're not sure whether somewhere earlier in the program it was decoded to the internal encoding.

The surest way is to trace through starting from where the data enters the program. But can you tell without doing that?

Try this program:


use strict;
use warnings;

use Data::Dumper qw( Dumper );

# Use double quotes for strings. This enables showing octal and the \x{} output
# we rely on below.
$Data::Dumper::Useqq = 1;

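# Example value: "hi þere" as UTF-8 octets. Replace this with the string you
# want to inspect.
my $s = "hi \x{c3}\x{be}ere";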

print Dumper($s), "\n";

If it's in Perl's internal encoding, any non-ASCII characters display with the syntax \x{} (the contents of {} indicate the Unicode code point):


$VAR1 = "hi \x{fe}ere";

If it contains UTF-8 octets:


$VAR1 = "hi \303\276ere";

Yes, octal:


$ perl -e 'my ($a, $b) = (oct(303), oct(276)); printf "%x %x\n", $a, $b'
c3 be

The UTF-8 encoding of "þ" (U+00FE) is 0xc3 0xbe.

Why Data::Dumper?

I like Data::Dumper because it is a core module.

However, other modules can accomplish something similar:

Data::Dumper::Concise: This wraps around Data::Dumper to enable some options for you (including the Useqq option).

Data::Printer: This can be useful and is what I recommend, but you can't use its default options to answer the question. You'll probably want at least these options:


use Data::Printer {
    show_unicode => 1,
    escape_chars => 'nonascii',
};

print np($s), "\n";

With that, we'll see this if the string is in Perl's internal encoding:


"\x{fe}" (U)

Or this if it's UTF-8 octets:


"\x{c3}\x{be}"

As you can see, with this module the way to tell whether we have the internal encoding is the presence or absence of "(U)".

It is also nice because it shows hex instead of octal.

What other ways can you examine what you have?


my ($hexchars) = unpack('H*', $octets);
print "$hexchars\n";

This will show:


c3be

Of course you have to already know that you have octets for this to be useful.
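If you don't know whether you have octets, one alternative is to list the code point of each character with ord(), which works on either kind of string. This is only a sketch, and code_points is a hypothetical helper name, not something from a module:

```perl
use strict;
use warnings;

# Hypothetical helper: list each character of a string as a code point.
sub code_points {
    my ($s) = @_;
    return join(' ', map { sprintf('U+%04X', ord($_)) } split(//, $s));
}

print code_points("\x{c3}\x{be}"), "\n";  # UTF-8 octets for "þ": U+00C3 U+00BE
print code_points("\x{fe}"), "\n";        # the character "þ":    U+00FE
```

Two code points where you expected one character is a hint that you are looking at undecoded octets.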

Can you rely on these tricks?

Unfortunately none of the above is entirely reliable.

Primarily this is because these differences are a result of the internal "UTF8 flag" being on or off. By using these tricks we're essentially looking at this flag indirectly. This can be helpful in debugging, but you should not rely on it. For one thing, it says nothing about correctness. Consider a program that decodes a string twice. The flag will be on yet you will have garbage.
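A sketch of the double-decode case, assuming the data is the UTF-8 octets for "þ". It prints the flag and the code points after one decode and after two. I have deliberately not included the output of the second line, since exactly what garbage you get is up to Encode:

```perl
use strict;
use warnings;

use Encode ();

my $octets = "\x{c3}\x{be}";                    # UTF-8 octets for "þ"
my $once   = Encode::decode('UTF-8', $octets);  # correct: one character, U+00FE
my $twice  = Encode::decode('UTF-8', $once);    # bug: decoding decoded text

for my $s ($once, $twice) {
    printf "flag=%s code points=%s\n",
        (utf8::is_utf8($s) ? 'on' : 'off'),
        join(' ', map { sprintf('U+%04X', ord($_)) } split(//, $s));
}
```

Both lines report the flag as on, but only the first string is still "þ".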

Another reason is that Encode's documentation states this flag will not be on after decoding if the octets were all ASCII. Note the current behaviour does set this flag in this case (the docs are wrong). However, one day this behaviour may change, and if you happen to decode an ASCII string, there would be no way to tell.

As well, while you may see a string without any internal characters (\x{}) or octal (\000), there's nothing to say whether it was decoded or not. You might happen to look at a piece of data that is all ASCII. You can currently get around this by checking for the UTF8 flag (Data::Printer is particularly useful here).
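If you want to look at the flag directly rather than through a dumper, the core function utf8::is_utf8() reports it. A minimal sketch with an all-ASCII string (variable names are illustrative):

```perl
use strict;
use warnings;

use Encode ();

my $literal = "hello";                           # never decoded
my $decoded = Encode::decode('UTF-8', "hello");  # decoded from ASCII octets

# Both hold the same text, so printing or comparing them shows no difference.
print "equal\n" if $literal eq $decoded;

# The flag is the only thing that can differ, and only for as long as Encode
# keeps setting it for all-ASCII input.
printf "literal flag: %s\n", utf8::is_utf8($literal) ? 'on' : 'off';
printf "decoded flag: %s\n", utf8::is_utf8($decoded) ? 'on' : 'off';
```

This is a debugging aid only; for the reasons above, program logic should not branch on it.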

There are also gotchas when writing the character in source code rather than reading it from an external source. Not every character behaves the same way. Consider this program:


use strict;
use warnings;

use Data::Printer {
    show_unicode => 1,
    escape_chars => 'nonascii',
};
use Encode ();

my $h = {
    a => {
        decoded => Encode::decode('UTF-8', "\x41"),
        inline  => "A",
        unicode => "\x{41}",
        utf8    => "\x41",
    },
    grinning => {
        decoded => Encode::decode('UTF-8', "\xf0\x9f\x98\x80"),
        inline  => "😀",
        unicode => "\x{1f600}",
        utf8    => "\xf0\x9f\x98\x80",
    },
    thorn => {
        decoded => Encode::decode('UTF-8', "\xc3\xbe"),
        inline  => "þ",
        unicode => "\x{fe}",
        utf8    => "\xc3\xbe",
    },
};

for my $k (sort keys %$h) {
    p $h->{$k}
}

This outputs:


\ {
    decoded   "A" (U),
    inline    "A",
    unicode   "A",
    utf8      "A"
}
\ {
    decoded   "\x{1f600}" (U),
    inline    "\x{f0}\x{9f}\x{98}\x{80}",
    unicode   "\x{1f600}" (U),
    utf8      "\x{f0}\x{9f}\x{98}\x{80}"
}
\ {
    decoded   "\x{fe}" (U),
    inline    "\x{c3}\x{be}",
    unicode   "\x{fe}",
    utf8      "\x{c3}\x{be}"
}

Look at the unicode keys, where we write the characters by Unicode code point using the \x{} escape sequence. In the "grinning" case, we see we have a string in the internal encoding. In the "A" and "thorn" cases we don't, though we do have the character we specified. This difference is because both 0x41 and 0xfe are <= 255: Perl can store each such character in a single byte, so it has no need to switch the string to its internal encoding.
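The <= 255 distinction can be checked without Data::Printer. This sketch uses the core utf8::is_utf8() and utf8::upgrade() functions, and shows that forcing the internal encoding on a string changes its storage but not its value:

```perl
use strict;
use warnings;

my $thorn    = "\x{fe}";      # code point <= 255
my $grinning = "\x{1f600}";   # code point > 255

printf "thorn flag: %s\n",    utf8::is_utf8($thorn)    ? 'on' : 'off';
printf "grinning flag: %s\n", utf8::is_utf8($grinning) ? 'on' : 'off';

# utf8::upgrade() changes the internal storage only; the string still
# compares equal and still has length 1.
my $upgraded = $thorn;
utf8::upgrade($upgraded);
printf "upgraded flag: %s, same string: %s, length: %d\n",
    (utf8::is_utf8($upgraded) ? 'on' : 'off'),
    ($upgraded eq $thorn ? 'yes' : 'no'),
    length($upgraded);
```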

Another complication is the utf8 pragma. If you had that on, you'd see this output:


\ {
    decoded   "A" (U),
    inline    "A",
    unicode   "A",
    utf8      "A"
}
\ {
    decoded   "\x{1f600}" (U),
    inline    "\x{1f600}" (U),
    unicode   "\x{1f600}" (U),
    utf8      "\x{f0}\x{9f}\x{98}\x{80}"
}
\ {
    decoded   "\x{fe}" (U),
    inline    "\x{fe}" (U),
    unicode   "\x{fe}",
    utf8      "\x{c3}\x{be}"
}

Here the inline key changed. This is the key where we have UTF-8 encoded characters in the source code.
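The effect of the pragma on a literal can be seen in isolation. This sketch assumes the file itself is saved as UTF-8, and relies on use utf8 being lexically scoped so that both cases fit in one file:

```perl
use strict;
use warnings;

# This file must itself be saved as UTF-8 for the comparison to make sense.
my $without = "þ";            # source bytes taken as-is: two octets

my $with = do {
    use utf8;                 # lexically scoped: applies to this block only
    "þ";                      # source decoded by Perl: one character
};

printf "without utf8: length %d\n", length($without);
printf "with utf8:    length %d\n", length($with);
```

Under that assumption the first length is 2 and the second is 1, matching the inline rows in the two outputs above.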

What should you do?

You need to follow the same principle in any language: Track how data gets transformed in your program as it passes through. In Perl you just need to be extra careful.

If you're writing a new program, one approach that can work is to call binmode() on every filehandle, including STDIN and STDOUT, so that all I/O is in raw octets. Then call Encode::decode() at the point data enters the program, and Encode::encode() on its way out. This lets you avoid the complexity of I/O layers, and means you have a defined boundary where translation occurs.
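A sketch of that boundary, using in-memory filehandles as stand-ins for real input and output so it runs on its own. The variable names and the choice of reversing each line as the "work" are only for illustration:

```perl
use strict;
use warnings;

use Encode ();

# Stand-ins for real input and output: in-memory filehandles opened in
# :raw mode, so no I/O layer translates anything for us.
my $input_octets  = "hi \x{c3}\x{be}ere\n";
my $output_octets = '';
open(my $in,  '<:raw', \$input_octets)  or die "open failed: $!";
open(my $out, '>:raw', \$output_octets) or die "open failed: $!";

while (my $line = <$in>) {
    # Boundary in: octets become characters.
    my $text = Encode::decode('UTF-8', $line);
    chomp $text;

    # Work on characters. Reversing the octets instead would split "þ".
    my $reversed = reverse $text;

    # Boundary out: characters become octets again.
    print {$out} Encode::encode('UTF-8', "$reversed\n");
}
close($in);
close($out) or die "close failed: $!";

print unpack('H*', $output_octets), "\n";
# prints "657265c3be2069680a"
```

Everything between the decode and the encode can assume it is dealing with characters, which is the point of having a single boundary.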

Gotchas and notes

Summary

Perl can be fun to work with and is a powerful tool. However, its approach to character encoding is one of its weak points. Even after working with it for years I still end up in situations where I spend a lot of time debugging what is going on. Writing this post led to me learning a few things, and I wrote it because I thought I had a good grasp on the situation!

In general I vastly prefer the approach to character encoding of languages like C and Go. They do not have any special internal encoding. Instead, when you need to think about encoding or characters, you use the appropriate libraries and functions and are explicit about it. (Go does have some special treatment of UTF-8, and C does have wide character support in its standard library.) Even PHP is superior to Perl in this regard, mostly by ignoring the problem and assuming octets everywhere.

Further reading

Tags: Perl, programming, charsets, encoding, UTF-8

Comments

Posted by Mark Fowler at
The best way, using core modules, to see what a string (well, a scalar) actually contains is to use Devel::Peek:

$ perl -e 'my $string = "I \N{BLACK HEART SUIT} NY"; use Devel::Peek; Dump($string)'
SV = PV(0x7fbd3e004c80) at 0x7fbd3e016610
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x7fbd3d5027c0 "I \342\231\245 NY"\0 [UTF8 "I \x{2665} NY"]
  CUR = 8
  LEN = 16

Here you can see the PV ("Pointer Value") that contains the string. In quotes are the actual bytes, and you can also see the UTF8 representation. You can also see FLAGS=UTF8, which is the internal flag Perl uses to keep track of whether it's using UTF-8 encoding internally or not. For example:

$ perl -e 'my $string = "L\x{e9}on"; use Devel::Peek; Dump($string)'
SV = PV(0x7fba53803c80) at 0x7fba53815610
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x7fba52405660 "L\351on"\0
  CUR = 4
  LEN = 16

Internally this is a Latin-1 byte sequence. Perl is smart enough to upgrade these as it needs to:

$ perl -e 'my $string = "L\x{e9}on"; $string .= "\N{DOUBLE EXCLAMATION MARK}"; use Devel::Peek; Dump($string)'
SV = PV(0x7fd8c2004c80) at 0x7fd8c2016610
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x7fd8c1f00430 "L\303\251on\342\200\274"\0 [UTF8 "L\x{e9}on\x{203c}"]
  CUR = 8
  LEN = 16

But this can be a source of bugs. For example, concatenating a scalar that contains a UTF-8 byte sequence with a scalar that contains a string:

$ perl -e 'my $string = "L\x{c3}\x{a9}on"; $string .= "\N{DOUBLE EXCLAMATION MARK}"; use Devel::Peek; Dump($string)'
SV = PV(0x7ff9dc004c80) at 0x7ff9dd001e10
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x7ff9dbc05360 "L\303\203\302\251on\342\200\274"\0 [UTF8 "L\x{c3}\x{a9}on\x{203c}"]
  CUR = 10
  LEN = 16

Now Perl thinks this is a Unicode string that legitimately contains an Ã and a ©. Ooops.
