Chapter 7: Mangle data (encode, decode, format)

Text strings

Unicode

ASCII is a plain old text encoding system. The basic unit of computer storage is the byte, which can store 256 unique values in its eight bits. For various reasons, ASCII only used 7 bits (128 unique values): 26 uppercase letters, 26 lowercase letters, 10 digits, some punctuation symbols, some spacing characters, and some nonprinting control codes.

Other encodings include Latin-1 and Windows code page 1252.

Each of these uses all eight bits, but even that’s not enough, especially when you need non-European languages.

Unicode is an ongoing international standard to define the characters of all the world’s languages, plus symbols from mathematics and other fields. Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

Unicode code points are divided into 17 planes of 65,536 (2^16) code points each. The first, plane 0, is the Basic Multilingual Plane (BMP); it holds the characters of most modern scripts.
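
As a rough illustration, the plane of a character can be computed from its code point. This is a minimal sketch that assumes the standard Unicode layout of 17 planes with 65,536 code points each; the helper name plane_of is made up for this example.

```python
def plane_of(char):
    """Return the Unicode plane number (0 to 16) of a single character."""
    # Each plane holds 0x10000 (65,536) code points, so the plane number
    # is the code point divided by 0x10000, rounded down.
    return ord(char) // 0x10000


bmp_char = '\u00e9'          # e with acute accent, code point U+00E9
higher_char = '\U0001f47b'   # ghost emoji, code point U+1F47B

print(plane_of(bmp_char))     # 0
print(plane_of(higher_char))  # 1
```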

Python 3 Unicode strings

Python 3 strings are Unicode strings, not byte arrays. This is the single largest change from Python 2. You can use a character in a Python string if you know its Unicode ID or name.

A \u followed by four hex digits specifies a character in the Basic Multilingual Plane, which covers code points U+0000 to U+FFFF. The first 128 code points (U+0000 to U+007F) are good old ASCII, and the character positions within that range are the same as in ASCII.
For characters in the higher planes, we need more bits. The Python escape sequence for these is \U followed by eight hex digits; because the highest code point is U+10FFFF, the leftmost two digits must be 00.
For all characters, \N{name} lets you specify it by its standard name.
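
The three escape forms can be compared side by side. The sketch below is only an illustration; the variable names are chosen for this example.

```python
# Three ways to write the same character (code point U+00E9).
short_form = '\u00e9'                               # \u + 4 hex digits
long_form = '\U000000e9'                            # \U + 8 hex digits
named_form = '\N{LATIN SMALL LETTER E WITH ACUTE}'  # \N{name}

print(short_form == long_form == named_form)  # True

# A character above U+FFFF does not fit in four hex digits,
# so it needs the \U form (or its name).
ghost = '\U0001f47b'
print(hex(ord(ghost)))  # 0x1f47b
```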

The Python unicodedata module has functions that translate in both directions:

lookup(): takes a case-insensitive name and returns a Unicode character.
name(): takes a Unicode character and returns an uppercase name.
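
A short sketch of both functions, including the optional default argument of name() for characters that have no name (an extra detail not covered above):

```python
import unicodedata

# lookup() accepts the name in any letter case.
e_acute = unicodedata.lookup('latin small letter e with acute')
print(e_acute)                    # é

# name() returns the standard name in uppercase.
print(unicodedata.name(e_acute))  # LATIN SMALL LETTER E WITH ACUTE

# Control characters such as newline have no name. name() raises
# ValueError for them unless a default value is supplied.
print(unicodedata.name('\n', 'no name'))  # no name
```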

We can specify the string café by code or by name:

>>> place = 'caf\u00e9'
>>> place
'café'
>>> place = 'caf\N{LATIN SMALL LETTER E WITH ACUTE}'
>>> place
'café'

Note that the string len function counts Unicode characters, not bytes:

>>> len('$')
1
>>> len('\U0001f47b')
1

To create a function that takes a character, looks up its name, and then looks up the character again from that name:

import unicodedata

def unicode_test(value):
    # Get the character's name, then look the character up again by name.
    name = unicodedata.name(value)
    value2 = unicodedata.lookup(name)
    print('value="%s", name="%s", value2="%s"' % (value, name, value2))


unicode_test('\u00e9')

Encode and decode with UTF-8

You don’t need to worry about how Python stores each Unicode character when you do normal string processing. When you exchange data with the outside world, however, you need a couple of things:

• A way to encode character strings to bytes
• A way to decode bytes to character strings

The UTF-8 dynamic encoding scheme uses one to four bytes per Unicode character:
• One byte for ASCII
• Two bytes for code points U+0080 to U+07FF, which covers most Latin-derived alphabets as well as Greek, Cyrillic, Hebrew, and Arabic
• Three bytes for the rest of the basic multilingual plane
• Four bytes for the rest, including some Asian languages and symbols
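
The byte counts in this list can be checked by encoding one sample character from each range. This is a small sketch; the sample characters are picked for this example.

```python
samples = {
    'A': 'ASCII letter',
    '\u00e9': 'Latin small letter e with acute',
    '\u2603': 'snowman, elsewhere in the Basic Multilingual Plane',
    '\U0001f47b': 'ghost emoji, above U+FFFF',
}

for char, description in samples.items():
    encoded = char.encode('utf-8')
    print(f'U+{ord(char):04X} ({description}): {len(encoded)} byte(s)')

# Number of UTF-8 bytes for each sample, in the order given above.
byte_counts = [len(char.encode('utf-8')) for char in samples]
print(byte_counts)  # [1, 2, 3, 4]
```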

UTF-8 is the standard text encoding in Python, Linux, and HTML. It’s fast, complete, and works well. If you use UTF-8 encoding throughout your code, life will be much easier than trying to hop in and out of various encodings.

If you create a Python string by copying and pasting from another source such as a web page, be sure the source is encoded in the UTF-8 format.

Encoding

You encode a string to bytes. The string encode() function’s first argument is the encoding name.

The choices include:

ascii	Good old seven-bit ASCII
utf-8	Eight-bit variable-length encoding, and what you almost always want to use
latin-1	Also known as ISO 8859-1
cp1252	A common Windows encoding
unicode-escape	Python Unicode literal format, \uxxxx or \Uxxxxxxxx
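
To see how these choices differ, the same string can be encoded with several of them. This is only a sketch; ascii is left out because it cannot represent the accented character and would raise an error.

```python
place = 'caf\u00e9'  # 'café'

print(place.encode('utf-8'))           # b'caf\xc3\xa9'
print(place.encode('latin-1'))         # b'caf\xe9'
print(place.encode('cp1252'))          # b'caf\xe9'
print(place.encode('unicode-escape'))  # b'caf\\xe9'
```

Note that unicode-escape writes code points below 256 as \xhh rather than \uxxxx.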

For example:

snowman='\u2603'

ds=snowman.encode('utf-8')
print(ds)

output:

b'\xe2\x98\x83'

Handling errors

You can use encodings other than UTF-8, but you will get errors if the Unicode string can’t be handled by the encoding.

E.g. if you try to encode snowman with ASCII,

snowman.encode('ascii')

UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 0: ordinal not in range(128)

The encode() function takes a second argument to help you avoid encoding exceptions. Its default value, which you can see in the previous example, is 'strict'; it raises a UnicodeEncodeError if it sees a character that the chosen encoding cannot represent.

There are other error handlers. Use 'ignore' to throw away anything that won’t encode:

>>> snowman.encode('ascii', 'ignore')
b''

Use ‘replace’ to substitute ? for unknown characters:

>>> snowman.encode('ascii', 'replace')
b'?'

Use 'backslashreplace' to produce a Python Unicode escape sequence, like unicode-escape:

>>> snowman.encode('ascii', 'backslashreplace')
b'\\u2603'

Use 'xmlcharrefreplace' to produce a character reference that you can use in web pages:

>>> snowman.encode('ascii', 'xmlcharrefreplace')
b'&#9731;'

Decoding

You decode byte strings to Unicode strings. The problem is that nothing in the byte string itself says which encoding was used.
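
The risk is that a wrong guess does not always raise an error. The sketch below uses Latin-1 as the wrong guess, because Latin-1 maps every possible byte value to some character and therefore never fails to decode.

```python
snowman = '\u2603'
raw = snowman.encode('utf-8')   # b'\xe2\x98\x83'

# Wrong guess: no exception, but the result is three unrelated characters.
wrong = raw.decode('latin-1')
# Right guess: the original single character comes back.
right = raw.decode('utf-8')

print(len(wrong))        # 3
print(len(right))        # 1
print(right == snowman)  # True
```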

Continuing with the last example, we decode the bytes with the same encoding that produced them (utf-8):

snowman = '\u2603'
ds = snowman.encode('utf-8')
print(ds)    # b'\xe2\x98\x83'
ds2 = ds.decode('utf-8')
print(ds2)   # ☃

There is no problem in the above example, because we know which encoding was used. If we don’t know it and try the wrong one:

ds3=ds.decode('ascii')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
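
Like encode(), decode() accepts a second argument that selects an error handler. The following sketch shows three of the standard handlers applied to the same undecodable bytes.

```python
raw = '\u2603'.encode('utf-8')   # b'\xe2\x98\x83'

# 'replace' substitutes U+FFFD (the replacement character) for each
# byte that cannot be decoded.
replaced = raw.decode('ascii', 'replace')
print(len(replaced))             # 3

# 'ignore' drops the undecodable bytes.
ignored = raw.decode('ascii', 'ignore')
print(repr(ignored))             # ''

# 'backslashreplace' keeps the byte values visible as escape sequences.
escaped = raw.decode('ascii', 'backslashreplace')
print(escaped)                   # \xe2\x98\x83
```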

Best practice:

Whenever possible, use UTF-8 encoding. It works, is supported everywhere, can express every Unicode character, and is quickly decoded and encoded.

Work with Chinese characters

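There are various encoding standards for Simplified and Traditional Chinese characters, and the latest one is gb18030. A Python 3 string is already Unicode, so to produce bytes for a system that expects gb18030, encode the string directly:

"业务质量周报（2016第五周）.xlsx".encode("gb18030")

If you start with UTF-8 bytes instead (here a hypothetical variable utf8_bytes), decode them from UTF-8 first and then encode to gb18030:

utf8_bytes.decode("utf-8").encode("gb18030")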

Codecs

The codecs module defines the following functions for encoding and decoding with any codec:

codecs.encode(obj, encoding='utf-8', errors='strict')

Encodes obj using the codec registered for encoding.

Errors may be given to set the desired error handling scheme. The default error handler is 'strict' meaning that encoding errors raise ValueError (or a more codec specific subclass, such as UnicodeEncodeError). Refer to Codec Base Classes for more information on codec error handling.

codecs.decode(obj, encoding='utf-8', errors='strict')

Decodes obj using the codec registered for encoding.

Errors may be given to set the desired error handling scheme. The default error handler is 'strict' meaning that decoding errors raise ValueError (or a more codec specific subclass, such as UnicodeDecodeError). Refer to Codec Base Classes for more information on codec error handling.
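
A minimal sketch of both functions, reusing the snowman character from the earlier examples:

```python
import codecs

snowman = '\u2603'

encoded = codecs.encode(snowman, 'utf-8')
print(encoded)                          # b'\xe2\x98\x83'

decoded = codecs.decode(encoded, 'utf-8')
print(decoded == snowman)               # True

# errors selects the error handler, as with str.encode().
replaced = codecs.encode(snowman, 'ascii', errors='replace')
print(replaced)                         # b'?'

# With the default 'strict' handler the failure can be caught as
# ValueError, because UnicodeEncodeError is a subclass of it.
try:
    codecs.encode(snowman, 'ascii')
except ValueError as exc:
    error_name = type(exc).__name__
    print(error_name)                   # UnicodeEncodeError
```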

The full details for each codec can also be looked up directly:

codecs.lookup(encoding)

Looks up the codec info in the Python codec registry and returns a CodecInfo object as defined below.

Encodings are first looked up in the registry’s cache. If not found, the list of registered search functions is scanned. If no CodecInfo object is found, a LookupError is raised. Otherwise, the CodecInfo object is stored in the cache and returned to the caller.
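
The lookup can be tried out as follows. This is a sketch; the unknown encoding name is made up to trigger the error.

```python
import codecs

# Aliases and letter case are normalized during the lookup.
info = codecs.lookup('UTF8')
print(info.name)                 # utf-8

# The CodecInfo object carries the codec's encode and decode functions.
# Each returns a tuple of (result, length of input consumed).
data, consumed = info.encode('\u2603')
print(data, consumed)            # b'\xe2\x98\x83' 1

try:
    codecs.lookup('no-such-encoding')   # hypothetical, unregistered name
except LookupError as exc:
    lookup_failed = True
    print(type(exc).__name__)    # LookupError
```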
