Unicode errors in Python 2

Unicode Wars Episode V: UnicodeEncodeError Strikes Back
Summary
In this sequel to my Unicode strings in Python article, I explain a few common errors that arise when handling Unicode strings in Python 2. 猫猫猫

This article is a sequel to Unicode strings in Python: A basic tutorial. Read that article first if you want to get up to speed on Unicode basics. This one focuses exclusively on Unicode-related errors in Python 2. (Python 3 eliminates many of these errors.)

tl;dr 1: In Python 2, never directly write a unicode object to the terminal, to a file, or to a database. Always convert (encode) it to a plain str using .encode('utf-8') before writing it anywhere.

tl;dr 2: Never mix str and unicode objects together in expressions such as concatenation (+) or format operations (%). Always first convert (decode) the str object to unicode by calling .decode('utf-8') and then work only with unicode objects.

tl;dr 3: To convert unicode to str, use the .encode('utf-8') method. Think of a unicode object as an instance of an abstract data type that you need to encode into concrete bytes before, say, writing it to a file. To convert str to unicode, use the .decode('utf-8') method. Think of this as you being given a sequence of bytes in some garbled (encoded) format, and you now need to decode those bytes into an instance of the Unicode abstract data type.
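
Here's a minimal round-trip sketch of those three rules (the variable names are just for illustration):

>>> cat = u'\u732b'             # a unicode object: the abstract data type
>>> raw = cat.encode('utf-8')   # encode: unicode -> str (concrete bytes)
>>> raw
'\xe7\x8c\xab'
>>> raw.decode('utf-8') == cat  # decode: str -> unicode round-trips
True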

Pop quiz: same code, different outputs

OK, I just started Python 2.7 on my Mac and typed in the following in a Unicode-enabled terminal:

>>> x = u'\u732b'
>>> type(x)
<type 'unicode'>
>>> print x
猫

Looks fine, right? I created a Unicode string literal u'\u732b', which represents the Chinese character 猫, assigned it to x, and printed x to the terminal.

OK now I'm going to change some mysterious settings on my computer, start the exact same version of Python 2.7 again in the same terminal, and type in the exact same commands:

>>> x = u'\u732b'
>>> type(x)
<type 'unicode'>
>>> print x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character
u'\u732b' in position 0: ordinal not in range(128)

What did I change? Well, I'm not going to tell you, because it doesn't matter. What matters is that the exact same Python interpreter running on the exact same computer in the exact same terminal gave us different outputs. That's troubling!

Solution: what does print do?

The solution to this conundrum is to figure out what print x does. Since x is a unicode object, the print statement tries to encode it into bytes before printing it to stdout. What encoding scheme is used? The default encoding of the stdout stream, as given by sys.stdout.encoding. Thus, print x roughly translates into:

sys.stdout.write(x.encode(sys.stdout.encoding) + '\n')

Note that print appends a newline to the end.

When I ran Python the first time, sys.stdout.encoding was 'UTF-8', and when I ran it the second time, it was 'US-ASCII'. Encoding the 猫 character as UTF-8 works fine, but encoding 猫 as ASCII fails since 猫 is obviously not an ASCII character.

Let's recap those sessions. The working one:

>>> x = u'\u732b'
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> x.encode(sys.stdout.encoding)
'\xe7\x8c\xab'
>>> sys.stdout.write(x.encode(sys.stdout.encoding) + '\n')
猫
>>> print x
猫

And the broken one:

>>> x = u'\u732b'
>>> import sys
>>> sys.stdout.encoding
'US-ASCII'
>>> x.encode(sys.stdout.encoding)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character
u'\u732b' in position 0: ordinal not in range(128)

The moral of this story?

Don't ever directly print a unicode object, since you don't know what encoding will be used. The output could differ depending on the state of your command-line environment.

What's the solution? Always explicitly encode your unicode object into a plain str before printing it. I recommend using UTF-8 as the encoding scheme since it's currently the dominant one.

OK, I'm going to run the broken session again, except this time I'll explicitly call x.encode('utf-8') to encode x into a UTF-8 str before printing it.

>>> x = u'\u732b'
>>> import sys
>>> sys.stdout.encoding
'US-ASCII'
>>> x.encode('utf-8')
'\xe7\x8c\xab'
>>> print x.encode('utf-8')
猫

Even though sys.stdout.encoding is still 'US-ASCII', it doesn't matter since I've explicitly converted x into a UTF-8 encoded string. My code is now immune to whatever my computer happened to have magically set for the default encoding of stdout.
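
If you'd rather not sprinkle .encode('utf-8') before every print statement, one alternative (sketched below using the standard codecs module) is to wrap sys.stdout in a UTF-8 writer, so that printing unicode objects encodes them as UTF-8 regardless of the terminal's default:

>>> import codecs, sys
>>> sys.stdout = codecs.getwriter('utf-8')(sys.stdout)  # wrapper encodes unicode on write
>>> print u'\u732b'
猫

(One caveat: after this wrapping, printing a str that contains non-ASCII bytes will itself fail with a UnicodeDecodeError, so pick one convention and stick with it.)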

This same recommendation applies not only for printing to the terminal, but also whenever you're writing a Unicode string to a file or database. Always call .encode('utf-8') first to be safe!
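
For example, here's a minimal sketch of writing x (a unicode object) to a file, first by encoding manually and then with codecs.open from the standard library, which encodes every write for you (the filename is made up):

>>> with open('cats.txt', 'w') as f:
...     f.write(x.encode('utf-8') + '\n')   # we encode; the file gets plain bytes
...
>>> import codecs
>>> with codecs.open('cats.txt', 'a', encoding='utf-8') as f:
...     f.write(x + u'\n')                  # the file object encodes unicode for us
...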

Don't mix str and unicode objects

Also, don't mix str and unicode objects together in expressions such as concatenation (+) or format operations (%). Python 2 will automatically coerce str objects to unicode, but the problem (again!) is that this implicit coercion uses Python's default encoding, sys.getdefaultencoding(), which is almost always ASCII. (As long as your str objects contain only ASCII characters, this coercion will always work; but as soon as someone enters a character like 猫, your code will break.)
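
You can verify this default yourself; on a stock CPython 2 install, it's ASCII:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'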

For instance, things start out innocently enough:

>>> x = '猫'
>>> y = u'猫'
>>> type(x)
<type 'str'>
>>> type(y)
<type 'unicode'>

Adding two str objects works fine, as does adding two unicode objects:

>>> x
'\xe7\x8c\xab'
>>> y
u'\u732b'
>>> x + x
'\xe7\x8c\xab\xe7\x8c\xab'
>>> y + y
u'\u732b\u732b'

However, if you try to mix them, something explodes:

>>> x + y
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte
0xe7 in position 0: ordinal not in range(128)

What went wrong? When Python sees x + y, it automatically converts x from str to unicode:

unicode(x) + y

This operation calls decode with no parameters:

x.decode() + y

And the decode fails since Python tries using the default encoding scheme (most likely ASCII) to decode x:

>>> x.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte
0xe7 in position 0: ordinal not in range(128)

To avoid this failure, explicitly decode x using utf-8 before mixing it with y. Now this should work since we're just adding two unicode objects:

>>> x.decode('utf-8') + y
u'\u732b\u732b'

Note that the same problem arises with format operations:

>>> u'hello there, %s' % x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte
0xe7 in position 0: ordinal not in range(128)

Again, to fix, explicitly decode x:

>>> u'hello there, %s' % x.decode('utf-8')
u'hello there, \u732b'
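
If your code accepts both kinds of objects, a small normalizing helper keeps the conversion in one place. Here's a sketch of that pattern (to_unicode is a made-up name, and it assumes any str you receive holds UTF-8 bytes):

>>> def to_unicode(obj, encoding='utf-8'):
...     if isinstance(obj, str):
...         return obj.decode(encoding)   # str -> unicode
...     return obj                        # already unicode; pass through
...
>>> u'hello there, %s' % to_unicode(x)
u'hello there, \u732b'
>>> to_unicode(x) + y
u'\u732b\u732b'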

Don't call encode on a str object

Finally, one common confusion arises when you remember to explicitly convert str to unicode but instinctively call encode instead of decode. I've made this mistake many times by thinking to myself, “Oh, I'm going to encode my string into Unicode!” What happens if you try to do so? You get the weirdest error ever:

>>> x.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte
0xe7 in position 0: ordinal not in range(128)

Wait, wtf?!? Why do I get a UnicodeDecodeError when I'm trying to encode a string?!? This is super super confusing.

The explanation is that encode is a method intended for unicode objects, not for str objects. Also, Python 2 “helpfully” casts str into unicode for you, so when it sees that x is a str, it tries to cast it first before calling encode on it. Your code translates into:

unicode(x).encode('utf-8')

which further translates into:

x.decode().encode('utf-8')

Now do you see where the UnicodeDecodeError comes from? It comes from x.decode() because it uses the ASCII encoding by default, and x holds the bytes for a UTF-8 string.
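
A corollary: if you ever legitimately need to transcode a str from one encoding to another, decode first and then encode. (Since x already holds UTF-8 bytes, there's nothing to do if UTF-8 is what you want.) A sketch, going from UTF-8 to UTF-16:

>>> x.decode('utf-8').encode('utf-16')   # str -> unicode -> str in a new encoding
'\xff\xfe+s'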

A similarly confusing error occurs when you try to call decode on a unicode object:

>>> y.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/pgbovine/anaconda/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character
u'\u732b' in position 0: ordinal not in range(128)

Again, it's because decode is a str method, but y is a unicode object, so Python first automatically tries to cast it to a str before calling decode. The cast translates into y.encode(), which fails with a UnicodeEncodeError (not a UnicodeDecodeError!):

>>> y.encode().decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character
u'\u732b' in position 0: ordinal not in range(128)

Just remember to call the right method and you'll be fine. To convert unicode to str, use the .encode('utf-8') method. Think of a unicode object as an instance of an abstract data type that you need to encode into concrete bytes before, say, writing it to a file. To convert str to unicode, use the .decode('utf-8') method. Think of this as you being given a sequence of bytes in some garbled (encoded) format, and you now need to decode those bytes into an instance of the Unicode abstract data type.
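
As a companion to the to_unicode sketch above, here's the mirror-image helper for the encode direction (again a made-up name, again assuming UTF-8):

>>> def to_str(obj, encoding='utf-8'):
...     if isinstance(obj, unicode):
...         return obj.encode(encoding)   # unicode -> str
...     return obj                        # already str; pass through
...
>>> to_str(u'\u732b')
'\xe7\x8c\xab'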

Created: 2015-12-02
Last modified: 2015-12-02