Perl and character encoding
One of the challenges when working with Perl is understanding its approach to character encoding. I've struggled several times with it, and after a recent debugging session I thought I'd write up my thoughts and some tricks I've found helpful.
Why is it challenging?
The main complexity is that Perl has an internal character encoding. This encoding allows Perl to hold Unicode characters in strings and treat them as single characters, which is useful because a Unicode character may be represented by multiple bytes in a Unicode encoding such as UTF-8. This means there are two kinds of strings: those whose characters are in the internal encoding, and those that are not. The latter can typically be treated as binary data ("octets"), though strictly speaking they are octets only if they come from outside the program, i.e. not strings defined in source code.
Programmers need to juggle back and forth between the internal encoding and an external encoding. Unfortunately it is often not clear at a particular point whether you have a string with octets or the internal encoding, and it is not easy to check. You have to trace through the program and examine the transformations done to each string. Tracing through is difficult as it is common for functions to not state whether you should provide them octets or strings in the internal encoding, nor what they return.
The existence of the internal character encoding has other side effects that lead to complexity. One of those is Perl's I/O layers. These exist to make translating back and forth between the internal encoding and an external encoding automatic. For example, you can say that stdout is UTF-8, and then when you print a string in the internal encoding to stdout, Perl automatically encodes it to UTF-8 before writing it out. This means that not only do you need to think about what is in each string, but also what settings you have on each filehandle.
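As a sketch of that automatic translation, here's a small program using an in-memory filehandle instead of stdout so the resulting octets can be inspected (the filehandle-on-a-scalar trick is just for demonstration):

```perl
use strict;
use warnings;
use Encode qw( decode );

# A string in Perl's internal encoding ("hi þere").
my $text = decode('UTF-8', "hi \xc3\xbeere");

# An in-memory filehandle with a UTF-8 encoding layer: when we print the
# internal-encoding string, Perl encodes it to UTF-8 octets on the way out.
open my $fh, '>:encoding(UTF-8)', \my $buf or die $!;
print $fh $text;
close $fh;

# $buf now holds raw octets, including 0xc3 0xbe for "þ".
```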
Here are some docs to show how complex the situation is:
- Quote and Quote-like Operators: Pay special attention to the section about escape sequences.
- Encode: Functions for decoding to Perl's internal encoding, and for encoding from that to another encoding. Check out the sections about the UTF8 flag and UTF-8 vs. utf8 (note these sections are mainly useful for gaining insight into Perl's design rather than anything else).
- Cpanel::JSON::XS: This module's documentation has two separate sections dedicated to character encoding: "A few notes on Unicode and Perl" and "Encoding/Codeset flag notes". One arguable point is the statement "Perl does not associate an encoding with your strings". That is true in that Perl never stores whether the string is UTF-8 or UTF-16, etc, but it does distinguish between its internal encoding and not.
What is in my string?
If you have a string and you're not entirely sure whether it contains octets or Perl's internal encoding, what do you do? Suppose you have the string "hi þere\n" encoded as UTF-8, but you're not sure whether somewhere earlier in the program it was decoded to the internal encoding.
The surest way is to trace through starting from where the data enters the program. But can you tell without doing that?
Try this program:
use strict;
use warnings;
use Data::Dumper qw( Dumper );
# Use double quotes for strings. This enables showing octal and the \x{} output
# we rely on below.
$Data::Dumper::Useqq = 1;
my $s = ... # set somehow
print Dumper($s), "\n";
If it's in Perl's internal encoding, any non-ASCII characters display with the syntax \x{} (the contents of {} indicate the Unicode code point):
$VAR1 = "hi \x{fe}ere";
If it contains UTF-8 octets:
$VAR1 = "hi \303\276ere";
Yes, octal:
$ perl -e 'my ($a, $b) = (oct(303), oct(276)); printf "%x %x\n", $a, $b'
c3 be
The UTF-8 encoding of "þ" (U+00FE) is 0xc3 0xbe.
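To see both outputs side by side, here's a self-contained version of the program above (the expected-output comments match the Dumper output shown earlier):

```perl
use strict;
use warnings;
use Data::Dumper qw( Dumper );
use Encode qw( decode );

# Use double quotes so non-ASCII shows as octal or \x{} escapes.
$Data::Dumper::Useqq = 1;

my $octets  = "hi \xc3\xbeere";          # UTF-8 octets
my $decoded = decode('UTF-8', $octets);  # internal encoding

print Dumper($octets);   # $VAR1 = "hi \303\276ere";
print Dumper($decoded);  # $VAR1 = "hi \x{fe}ere";
```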
Why Data::Dumper?
I like Data::Dumper because it is a core module. However, other modules can accomplish something similar:
- Data::Dumper::Concise: This wraps around Data::Dumper to enable some options for you (including the Useqq option).
- Data::Printer: This can be useful and is what I recommend, but you can't use its default options to answer the question. You'll probably want at least these options:
use Data::Printer {
    show_unicode => 1,
    escape_chars => 'nonascii',
};
print np($s), "\n";
With that, we'll see this if the string is in Perl's internal encoding:
"\x{fe}" (U)
Or this if it's UTF-8 octets:
"\x{c3}\x{be}"
As you can see, the way to tell whether we have the internal encoding with this module is with the presence or lack of "(U)".
It is also nice because it shows hex instead of octal.
What other ways can you examine what you have?
my ($hexchars) = unpack('H*', $octets);
print "$hexchars\n";
This will show:
c3be
Of course you have to already know that you have octets for this to be useful.
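You can also peek at the scalar's internals directly. Devel::Peek is a core module that dumps a scalar's guts to stderr, and utf8::is_utf8() reports the internal UTF8 flag programmatically. As covered in the next section, that flag is an implementation detail, so treat this strictly as a debugging aid:

```perl
use strict;
use warnings;
use Devel::Peek qw( Dump );
use Encode qw( decode );

my $octets  = "\xc3\xbe";
my $decoded = decode('UTF-8', $octets);

# Dump() writes the scalar's internals to stderr. For $decoded the FLAGS
# line includes UTF8; for $octets it does not.
Dump($octets);
Dump($decoded);

# utf8::is_utf8() reports the same flag as a boolean.
print utf8::is_utf8($decoded) ? "internal\n" : "octets\n";
```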
Can you rely on these tricks?
Unfortunately none of the above is entirely reliable.
Primarily this is because these differences are a result of the internal "UTF8 flag" being on or off. By using these tricks we're essentially looking at this flag indirectly. This can be helpful in debugging, but you should not rely on it. For one thing, it says nothing about correctness. Consider a program that decodes a string twice. The flag will be on yet you will have garbage.
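Here's a contrived sketch of that double-decode situation. The exact garbage you end up with may vary (typically the Unicode replacement character), but the point is that the flag is on while the content is wrong:

```perl
use strict;
use warnings;
use Encode qw( decode );

my $octets = "\xc3\xbe";               # UTF-8 for "þ"
my $once   = decode('UTF-8', $octets); # correct: "\x{fe}"

# Decoding a second time treats the decoded character 0xfe as an octet.
# 0xfe is not valid UTF-8 on its own, so we get garbage (typically the
# replacement character), yet the string is in the internal encoding.
my $twice = decode('UTF-8', $once);
```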
Another reason is that Encode's documentation states this flag will not be on after decoding if the octets were all ASCII. Note the current behaviour does set this flag in this case (the docs are wrong). However, one day this behaviour may change, and if you happen to decode an ASCII string, there would be no way to tell.
As well, while you may see a string without any internal characters (\x{}) or octal (\000), there's nothing to say whether it was decoded or not. You might happen to look at a piece of data that is all ASCII. You can currently get around this by checking for the UTF8 flag (Data::Printer is particularly useful here).
There are also gotchas when writing the character in source code rather than reading it from an external source. Not every character behaves the same way. Consider this program:
use strict;
use warnings;
use Data::Printer {
    show_unicode => 1,
    escape_chars => 'nonascii',
};
use Encode ();

my $h = {
    a => {
        decoded => Encode::decode('UTF-8', "\x41"),
        inline  => "A",
        unicode => "\x{41}",
        utf8    => "\x41",
    },
    grinning => {
        decoded => Encode::decode('UTF-8', "\xf0\x9f\x98\x80"),
        inline  => "😀",
        unicode => "\x{1f600}",
        utf8    => "\xf0\x9f\x98\x80",
    },
    thorn => {
        decoded => Encode::decode('UTF-8', "\xc3\xbe"),
        inline  => "þ",
        unicode => "\x{fe}",
        utf8    => "\xc3\xbe",
    },
};

for my $k (sort keys %$h) {
    p $h->{$k};
}
This outputs:
\ {
    decoded "A" (U),
    inline "A",
    unicode "A",
    utf8 "A"
}
\ {
    decoded "\x{1f600}" (U),
    inline "\x{f0}\x{9f}\x{98}\x{80}",
    unicode "\x{1f600}" (U),
    utf8 "\x{f0}\x{9f}\x{98}\x{80}"
}
\ {
    decoded "\x{fe}" (U),
    inline "\x{c3}\x{be}",
    unicode "\x{fe}",
    utf8 "\x{c3}\x{be}"
}
Look at the unicode keys, where we write the characters by Unicode code point using the \x{} escape sequence. In the "grinning" case, we see we have a string in the internal encoding. In the "A" and "thorn" cases we don't, though we indeed have the character we specified. This difference is because both 0x41 and 0xfe are <= 255.
Another complication is the utf8 pragma. If you had that on, you'd see this output:
\ {
    decoded "A" (U),
    inline "A",
    unicode "A",
    utf8 "A"
}
\ {
    decoded "\x{1f600}" (U),
    inline "\x{1f600}" (U),
    unicode "\x{1f600}" (U),
    utf8 "\x{f0}\x{9f}\x{98}\x{80}"
}
\ {
    decoded "\x{fe}" (U),
    inline "\x{fe}" (U),
    unicode "\x{fe}",
    utf8 "\x{c3}\x{be}"
}
Here the inline key changed. This is the key where we have UTF-8 encoded characters in the source code.
What should you do?
You need to follow the same principle in any language: Track how data gets transformed in your program as it passes through. In Perl you just need to be extra careful.
If you're writing a new program, one approach that can work is to have binmode set on all filehandles, including stdin/stdout. Then Encode::decode() at the point data enters the program, and Encode::encode() on its way out. This lets you avoid the complexity of I/O layers, and means you have a defined boundary where translation occurs.
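A sketch of that boundary pattern, using a hypothetical shout() function instead of real filehandles so the result can be inspected:

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Decode at the boundary where octets enter, work in the internal
# encoding, and encode again on the way out.
sub shout {
    my ($octets_in) = @_;
    my $text  = decode('UTF-8', $octets_in); # in: octets -> internal
    my $upper = uc $text;                    # Unicode-aware: "þ" -> "Þ"
    return encode('UTF-8', $upper);          # out: internal -> octets
}

# uc on the raw octets would have left 0xc3 0xbe untouched; decoding
# first lets it map þ (U+00FE) to Þ (U+00DE).
my $out = shout("hi \xc3\xbeere");  # octets "HI \xc3\x9eERE"
```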
Gotchas and notes
- Your string can be either binary or in Perl's internal encoding.
- If you see a warning about "wide character in print", then you are trying to write a string in Perl's internal encoding to a filehandle that does not expect it. Use Encode::encode() to encode it first.
- Always use UTF-8 rather than utf8 as the encoding in the Encode module. The former produces correct UTF-8. The latter is something Perl specific and is not UTF-8.
- The \xXX or \x{XX} syntax puts a character by Unicode code point into the string, except when it is <= 255, in which case it uses the native encoding (which will usually but not always be ASCII).
- Because of the prior point, in practice you can create octet strings with quoted strings by sticking to characters <= 255. However, the most universal way is to use pack('C*', ...). Note this is only applicable to strings you write in source code. For example: my $internal = decode('UTF-8', pack('C*', 0xc3, 0xbe)).
- If you have a string in Perl's internal encoding, don't try to unpack('H*', ...) it. This will give you surprising results. For example, if you decoded the octets 0xc3 0xbe from UTF-8 to the internal encoding, you will have the character \x{fe}. Unpacking it will yield fe. However, if the internal string had a character > 255, you'd see an error. This can lead you to think you have 0xfe, depending on how confused you are while debugging!
- The Encode module talks about a "UTF8 flag". You should generally ignore this. It is an implementation detail and a misnomer. The correct approach is to decode()/encode() as appropriate. As well, you may read that Perl's internal encoding is essentially UTF-8. That is also an implementation detail that you should not rely on.
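As a sketch of the first two gotchas (an in-memory filehandle stands in for stdout):

```perl
use strict;
use warnings;
use Encode qw( encode );

my $text = "\x{1f600}";  # internal encoding, code point > 255

open my $fh, '>', \my $buf or die $!;
# print $fh $text;       # would warn "Wide character in print"
print $fh encode('UTF-8', $text);  # encode first: no warning
close $fh;

# $buf holds the octets 0xf0 0x9f 0x98 0x80.
```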
Summary
Perl can be fun to work with and is a powerful tool. However, its approach to character encoding is one of its weak points. Even after working with it for years I still end up in situations where I spend a lot of time debugging what is going on. Writing this post led to me learning a few things, and I wrote it because I thought I had a good grasp on the situation!
In general I vastly prefer the approach to character encoding of languages like C and Go. They do not have any special internal encoding. Instead, when you need to think about encoding or characters, you use the appropriate libraries and functions and are explicit about it. (Go does have some special treatment of UTF-8, and C does have wide character support in its standard library). Even PHP is superior to Perl in this regard, mostly by ignoring the problem and assuming octets everywhere.
Further reading
Tags: Perl, programming, charsets, encoding, UTF-8