Encoding in Python
Posted on Sat 29 September 2018 in coding • 6 min read
Encoding in Python has a reputation for causing problems, mostly because one essential point is often overlooked: whenever you read an input, you have to transform a sequence of bytes into the internal representation of the language, and whenever you write to an output, you have to transform that internal representation back into a sequence of bytes.
For this transformation, you have to choose an encoding which specifies how the content is represented as a sequence of bytes.
The transition from Python 2 to Python 3 caused some problems since the two versions handle text differently. First, we will see how text is represented in Python 2 and Python 3, then how to convert between the different representations, and finally the different places where encoding comes into play:
- the encoding of the source code,
- the implicit conversions,
- the encoding of the input and the output,
- and the file system encoding.
Types to represent text
Text in Python can be represented in several ways, especially as:
- a sequence of bytes
- a sequence of symbols (i.e. characters)
Python 2 and 3 use different types that are summarized in this table:
Python 2 | Python 3 | Elements |
---|---|---|
str (ex: "my string") | bytes (ex: b"my string") | byte |
unicode (ex: u"my string") | str (ex: "my string") | symbol |
This table is a simplification, since Python offers both mutable and immutable types: for instance, Python 3 uses the bytes class for immutable sequences of single bytes and the bytearray class for its mutable counterpart.
The main point is: you can handle text either as a sequence of bytes or as a sequence of symbols. The conversion between these two types is discussed in the following section.
To make "my string" a unicode object in Python 2 (so that it behaves like str in Python 3), you can use:
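As a small illustration of the mutable and immutable byte types mentioned above, here is a minimal Python 3 sketch. The variable names are only illustrative.

```python
# Python 3: bytes is immutable, bytearray is its mutable counterpart.
immutable = b"my string"
mutable = bytearray(immutable)

mutable[0] = ord("M")  # in-place modification is allowed on a bytearray

try:
    immutable[0] = ord("M")  # item assignment is not supported by bytes
    bytes_assignment_failed = False
except TypeError:
    bytes_assignment_failed = True

print(mutable)                  # bytearray(b'My string')
print(bytes_assignment_failed)  # True
print(mutable.decode("ascii"))  # My string
```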
from __future__ import unicode_literals
Transformations between text representations
Let's consider a small example in Python 3:
s = "chaîne"
b = s.encode("utf-8")
print(f"type of s: {type(s)}")
print(f"type of b: {type(b)}")
print(f"len(s): {len(s)}")
print(f"len(b): {len(b)}")
print(f"s: {s}")
print(f"b: {b}")
We store the string chaîne in a variable s and store its encoding in utf-8 in the variable b. We then print the types of the two variables, their lengths and their representations. It outputs:
type of s: <class 'str'>
type of b: <class 'bytes'>
len(s): 6
len(b): 7
s: chaîne
b: b'cha\xc3\xaene'
First, we can see that the string s (i.e. a sequence of symbols) is of length 6, which seems intuitive, whereas b, of type bytes (i.e. a sequence of bytes), is of length 7, which may seem surprising. Their representations show what happens: the symbol î is transformed into two bytes (represented by \xc3 and \xae).
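To see where the extra byte comes from, one can count the UTF-8 bytes of each symbol separately. This is only a sketch in Python 3, and the names sizes and total are illustrative.

```python
s = "chaîne"

# Number of bytes each symbol occupies once encoded in UTF-8.
sizes = [(symbol, len(symbol.encode("utf-8"))) for symbol in s]
total = sum(size for _, size in sizes)

print(sizes)  # only 'î' needs 2 bytes, every other symbol needs 1
print(total)  # 7
```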
Note that when printing the representations of s and b, the method __repr__ of the classes str and bytes is applied to transform the sequence of symbols and the sequence of bytes respectively. For b, this method transforms the sequence of bytes into a string. The following code illustrates how the content is stored in b:
print(type(b[0]))
print(b[0])
It outputs:
<class 'int'>
99
It is the wrapper, here the bytes class, that knows how to represent itself appropriately; the individual elements are bytes, which Python exposes as int values.
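The same idea can be explored a little further in Python 3: iterating over a bytes object yields integers, while slicing returns bytes again. A short sketch (the name values is illustrative):

```python
b = "chaîne".encode("utf-8")

values = list(b)           # iterating over bytes yields integers in range(256)
print(values)              # [99, 104, 97, 195, 174, 110, 101]
print(b.hex())             # 636861c3ae6e65
print(b[0:1])              # slicing, unlike indexing, returns bytes: b'c'
print(bytes(values) == b)  # True: bytes can be rebuilt from the integers
```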
If you try to transform the string s into bytes encoded with ASCII:
s.encode('ascii')
you get an exception since you cannot represent î with ASCII:
UnicodeEncodeError: 'ascii' codec can't encode character '\xee' in position 3: ordinal not in range(128)
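If raising an exception is not the desired behavior, str.encode accepts an errors argument that selects another strategy. The sketch below, in Python 3, simply tries a few of the standard error handlers; the results dictionary is only there for illustration.

```python
s = "chaîne"

# Each handler deals differently with the symbol that ASCII cannot represent.
results = {}
for handler in ("ignore", "replace", "backslashreplace", "xmlcharrefreplace"):
    results[handler] = s.encode("ascii", errors=handler)
    print(handler, results[handler])

# ignore            b'chane'
# replace           b'cha?ne'
# backslashreplace  b'cha\\xeene'
# xmlcharrefreplace b'cha&#238;ne'
```

Note that "ignore" and "replace" lose information: the original string cannot be recovered from the resulting bytes.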
We saw how to encode a string into bytes. We can also transform bytes to a string:
b.decode('utf-8')
You have to be careful when selecting the encoding:
- When encoding, the encoding should support all the symbols of the string.
- When decoding, the encoding should correspond to the one you used for encoding it.
The code:
b.decode('latin-1')
produces the string "chaÃ®ne" instead of "chaîne", showing how important the second point is.
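Decoding with the wrong encoding does not necessarily fail; it can silently produce garbled text. The following Python 3 sketch reproduces the problem and, because Latin-1 maps every byte to a symbol, also shows that this particular mistake can be undone. The variable names are illustrative.

```python
original = "chaîne"

# UTF-8 bytes wrongly decoded as Latin-1: each byte becomes one symbol.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)       # chaÃ®ne
print(len(garbled))  # 7

# Reversing the two steps recovers the original string.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```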
Source code encoding
We previously saw that some symbols cannot be represented in some encodings (ASCII cannot represent the symbol î, for instance). If you want to use such symbols in your Python source code, you should therefore use an encoding that supports them.
By default, the encoding of the source code is:
- utf-8 for Python 3 (PEP 3120)
- ASCII for Python 2
To have the same behavior as Python 3 with Python 2, you can use at the beginning of the file (PEP 263):
# -*- coding: utf-8 -*-
You should of course also check that your editor is using the correct encoding.
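In Python 3, the standard tokenize module exposes the logic used to find the encoding declaration of a source file. The sketch below applies it to two in-memory sources; declared_encoding is a hypothetical helper written for this example, not something from the standard library.

```python
import io
import tokenize


def declared_encoding(source: bytes) -> str:
    """Return the encoding Python 3 would use to decode this source code."""
    encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


# A source with a PEP 263 declaration, and one without any declaration.
with_cookie = b"# -*- coding: latin-1 -*-\nname = 'cha\xeene'\n"
without_cookie = b"name = 'chaine'\n"

print(declared_encoding(with_cookie))     # iso-8859-1 (normalized name of latin-1)
print(declared_encoding(without_cookie))  # utf-8 (the Python 3 default)
```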
Implicit conversions
Sometimes, Python uses implicit conversions. This is the case for instance if you try (in Python 3):
"{}".format(b"my bytes")
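The bytes object b"my bytes" is converted implicitly into a string (in Python 3, the result is its representation, "b'my bytes'", not decoded text). This is also the case with the built-in print or str functions: they apply the __str__ method of the object's class to convert it into a string.
The default encoding, which Python 2 uses for implicit conversions between str and unicode and which Python 3 uses as the default argument of encode and decode, is: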
- utf-8 for Python 3
- ASCII for Python 2
You can check it with:
import sys
sys.getdefaultencoding()
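To make the Python 3 behavior concrete, here is a minimal sketch: formatting bytes yields their representation, mixing bytes and str is rejected, and actual decoding has to be requested explicitly. The variable names are illustrative.

```python
data = b"my bytes"

formatted = "{}".format(data)
print(formatted)           # b'my bytes'  (the representation, not decoded text)
print(str(data, "ascii"))  # my bytes     (explicit decoding)

try:
    "text: " + data        # mixing str and bytes is rejected in Python 3
    mixing_failed = False
except TypeError:
    mixing_failed = True
print(mixing_failed)       # True
```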
Input and output encoding
Python can read from the standard input sys.stdin, for instance with the input function. It can write to the standard output sys.stdout with the print function and to the standard error sys.stderr for its error messages. You can check their encodings with (ioencoding.py):
import sys
print("stdin: {}".format(sys.stdin.encoding))
print("stdout: {}".format(sys.stdout.encoding))
print("stderr: {}".format(sys.stderr.encoding))
You can define the encoding with the PYTHONIOENCODING environment variable.
Python behaves differently depending on whether it runs directly in a terminal or whether you use pipes:
python ioencoding.py
outputs (it can differ depending on your environment) both in Python 2 and 3:
stdin: UTF-8
stdout: UTF-8
stderr: UTF-8
However:
python ioencoding.py | cat
outputs the same thing with Python 3 but with Python 2 it outputs:
stdin: UTF-8
stdout: None
stderr: UTF-8
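The effect of PYTHONIOENCODING on a piped process can also be checked from Python 3 itself by launching a child interpreter. This is only a sketch: child_stdout_encoding is a hypothetical helper, and the value reported without the variable depends on your environment.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding=None):
    """Run a child interpreter with a piped stdout and return its sys.stdout.encoding."""
    env = dict(os.environ)
    env.pop("PYTHONIOENCODING", None)  # start from a clean state
    if io_encoding is not None:
        env["PYTHONIOENCODING"] = io_encoding
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        stdout=subprocess.PIPE,
        env=env,
        check=True,
    )
    return result.stdout.decode("ascii").strip()


print(child_stdout_encoding())  # depends on the environment
forced = child_stdout_encoding("latin-1")
# Compare through the codec registry, since encoding names have several aliases.
print(codecs.lookup(forced).name == codecs.lookup("latin-1").name)
```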
Here is a broader example (encodingtest.py) you can experiment with:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import sys, locale, os
print("IO encodings:")
print("stdin: {}".format(sys.stdin.encoding))
print("stdout: {}".format(sys.stdout.encoding))
print("stderr: {}".format(sys.stderr.encoding))
print("\ndefault encoding: {}".format(sys.getdefaultencoding()))
print("\ntty(-like) device?")
print("stdin: {}".format(sys.stdin.isatty()))
print("stdout: {}".format(sys.stdout.isatty()))
print("stderr: {}".format(sys.stderr.isatty()))
print("\nPYTHONIOENCODING: {}".format(os.environ.get("PYTHONIOENCODING")))
print("î")
print(chr(238))
print("Ā")
print(chr(256))
For instance with Python 3:
echo "" | python encodingtest.py
outputs:
IO encodings:
stdin: UTF-8
stdout: UTF-8
stderr: UTF-8
default encoding: utf-8
tty(-like) device?
stdin: False
stdout: True
stderr: True
PYTHONIOENCODING: None
î
î
Ā
Ā
and with Python 2:
IO encodings:
stdin: None
stdout: UTF-8
stderr: UTF-8
default encoding: ascii
tty(-like) device?
stdin: False
stdout: True
stderr: True
PYTHONIOENCODING: None
î
�
Ā
Traceback (most recent call last):
File "encodingtest.py", line 24, in <module>
print(chr(256))
ValueError: chr() arg not in range(256)
It produces an error because in Python 2 chr creates a Python 2 str for values between 0 and 255 (a byte is 8 bits and can therefore take \(2^8 = 256\) values, and a Python 2 str is a sequence of bytes). For 0 to 127, the value corresponds to the ASCII code. In Python 3, chr creates a Python 3 str from a Unicode code point (hence the broader range).
Note also that Python 2 prints � instead of î: chr(238) is the single byte 0xEE, which is written to the output as is and is not a valid UTF-8 sequence on its own, so the terminal shows a replacement character (238 is the Unicode code point of î, hence it works as intended in Python 3).
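The difference can be reproduced in Python 3 by building the single byte by hand, which is roughly what chr(238) yields in Python 2. A minimal sketch (lone_byte is an illustrative name):

```python
# Python 3: chr and ord work on Unicode code points.
print(ord("î"))         # 238
print(chr(238) == "î")  # True
print(chr(256) == "Ā")  # True

# The single byte 0xEE is not valid UTF-8 on its own.
lone_byte = bytes([238])
print(lone_byte.decode("utf-8", errors="replace") == "\ufffd")  # True: replacement character
print(lone_byte.decode("latin-1"))                              # î (0xEE is î in Latin-1)
```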
If we specify the encoding for the input and outputs to be ASCII (it was previously undefined for the input with Python 2 because of the pipe):
echo "" | PYTHONIOENCODING=ascii python2 encodingtest.py
it outputs:
IO encodings:
stdin: ascii
stdout: ascii
stderr: ascii
default encoding: ascii
tty(-like) device?
stdin: False
stdout: True
stderr: True
PYTHONIOENCODING: ascii
Traceback (most recent call last):
File "encodingtest.py", line 20, in <module>
print("î")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xee' in position 0: ordinal not in range(128)
Here we have an error because the print function encodes the unicode string (unicode because of unicode_literals with Python 2) to a sequence of bytes in ASCII (the encoding of sys.stdout).
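The same failure can be simulated in Python 3 without touching the real standard output, by wrapping an in-memory byte buffer in a text stream with a chosen encoding. This is a sketch under that assumption; write_as is a hypothetical helper, not the mechanism Python itself uses to set up sys.stdout.

```python
import io


def write_as(text, encoding, errors="strict"):
    """Print text to an in-memory stream with the given encoding; return the bytes written."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding, errors=errors, newline="\n")
    print(text, file=stream)
    stream.flush()
    return buffer.getvalue()


print(write_as("î", "utf-8"))                              # b'\xc3\xae\n'
print(write_as("î", "ascii", errors="backslashreplace"))   # b'\\xee\n'

try:
    write_as("î", "ascii")  # strict ASCII cannot represent î
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)  # True
```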
File system encoding
The last encoding setting to be distinguished is the encoding used for the file system. It is used to convert between Unicode filenames and bytes filenames. You can check it with:
import sys
sys.getfilesystemencoding()
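In Python 3, the os module provides fsdecode and fsencode to perform this conversion explicitly. The sketch below starts from a bytes filename (an arbitrary example name) and converts it back and forth; the exact text obtained depends on the file system encoding of your environment.

```python
import os
import sys

raw_name = b"cha\xc3\xaene.txt"     # a filename as bytes (here, UTF-8 encoded)
text_name = os.fsdecode(raw_name)   # bytes filename -> str filename

print(sys.getfilesystemencoding())  # depends on the environment
print(repr(text_name))              # 'chaîne.txt' when the encoding is UTF-8
print(os.fsencode(text_name) == raw_name)  # the round trip gives the bytes back
```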
Summary
In this blog post, we saw that Python has two main types to handle text:
- a sequence of symbols, which lets you handle text in an intuitive way;
- a sequence of bytes, used when you have to transfer text between Python and other software (especially the file system).
We also saw that Python 2 and 3 behave differently, with Python 3 fixing some bad choices of Python 2.
Encoding appears at different places, when:
- writing the source code itself,
- implicitly converting the two main types,
- reading and writing to the inputs and outputs,
- and handling the filenames.
I hope this post clarifies some misconceptions about Python encoding.