Chapter 7: Mangle data (encode, decode, format) – Page 3

Regular expressions

Regular expressions are similar to wildcard patterns on the command line, such as ls *.py, which lists all filenames ending in .py, but they can express far more complex patterns.

They are provided by the standard module re, which we’ll import. You define a string pattern that you want to match and the source string to match against.

A successful match returns a match object. Match objects have several methods and attributes; the most important ones are:

Method/Attribute	Purpose
group()	Return the string matched by the RE
start()	Return the starting position of the match
end()	Return the ending position of the match
span()	Return a tuple containing the (start, end) positions of the match
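
As a small sketch of these methods, the snippet below runs a search on the string 'Young Frankenstein' (the sample string and the variable names are chosen here purely for illustration) and reads each value from the resulting match object:

```python
import re

# re.search() returns a match object, or None if nothing matched
m = re.search('Frank', 'Young Frankenstein')

matched = m.group()  # the text that matched: 'Frank'
start = m.start()    # index where the match begins: 6
end = m.end()        # index just past the match: 11
span = m.span()      # both positions as a tuple: (6, 11)

print(matched, start, end, span)
```

Note that end() points one past the last matched character, so source[start:end] gives back the matched text.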

Exact Match with match()

match() works only if the pattern is at the beginning of the source.

For simple matches, usage looks like this:
import re
result = re.match('You', 'Young Frankenstein')
Here, ‘You’ is the pattern and ‘Young Frankenstein’ is the source—the string you want to check.

match() checks whether the source begins with the pattern.
For more complex matches, you can compile your pattern first to speed up the match later:
youpattern = re.compile('You')

Then, you can perform your match against the compiled pattern:

>>> result = youpattern.match('Young Frankenstein')
>>> if result:
...     print(result.group())
...
You

If the pattern is not at the beginning of the source, match() can still find it if you put a wildcard in front of it:

• . means any single character.
• * means any number of the preceding thing. Together, .* means any number of characters (even zero).

>>> source = 'Young Frankenstein'
>>> m = re.match('.*Frank', source)
>>> if m:  # match returns an object
...     print(m.group())
...
Young Frank

This time it returns everything up to and including “Frank”.

group() and groups()

When using match() or search(), the whole match is returned from the result object m as m.group(). If you enclose part of a pattern in parentheses, that part is saved to its own group, and a tuple of the groups is available as m.groups(), as shown here:

>>> source = '''I wish I may, I wish I might
... Have a dish of fish tonight.'''
>>> m = re.search(r'(. dish\b).*(\bfish)', source)
>>> m.group()
'a dish of fish'
>>> m.groups()
('a dish', 'fish')

First match with search()

match() works only if the pattern is at the beginning of the source. search() finds the first match wherever the pattern occurs in the source.
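
The difference can be sketched with the 'Young Frankenstein' string used earlier (the variable names at_start and anywhere are just for illustration):

```python
import re

source = 'Young Frankenstein'

# match() only looks at the start of the source, so this finds nothing
at_start = re.match('Frank', source)

# search() scans the whole source and stops at the first match
anywhere = re.search('Frank', source)

print(at_start)          # None
print(anywhere.group())  # Frank
```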

All matches with findall()

findall() returns a list of all non-overlapping matches in the source. You can use len(list_name) to find out how many were found.

E.g., how many instances of the single-letter string ‘n’ are in the string ‘Young Frankenstein’?

>>> source = 'Young Frankenstein'
>>> m = re.findall('n', source)
>>> m   # findall returns a list
['n', 'n', 'n', 'n']
>>> print('Found', len(m), 'matches')
Found 4 matches

How about ‘n’ followed by any single character?

>>> m = re.findall('n.', source)
>>> m
['ng', 'nk', 'ns']

Notice that it did not match that final ‘n’ because there is no character after the n. We need to say that the character after ‘n’ is optional with ?:

>>> m = re.findall('n.?', source)
>>> m
['ng', 'nk', 'ns', 'n']

Split at matches with split()

We can split a string into a list at every match of a pattern, such as a space, comma, semicolon, or any other symbol or letter.

string1='this is a test'

m=re.split(' ',string1)

print(m)

Output:

['this', 'is', 'a', 'test']
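
Because the separator is a pattern, one call can split on several kinds of separator at once. A minimal sketch, using a made-up sample string:

```python
import re

# Hypothetical sample string that mixes several separators
line = 'a, b;c  d'

# One or more commas, semicolons, or whitespace characters act as a separator
parts = re.split(r'[,;\s]+', line)

print(parts)  # ['a', 'b', 'c', 'd']
```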

Replace at matches with sub()

This is like the string replace() method, but for patterns rather than literal strings:

>>> source = 'Young Frankenstein'
>>> m = re.sub('n', '?', source)
>>> m # sub returns a string
'You?g Fra?ke?stei?'

Special characters

The basics:

• Literal matches with any non-special characters
• Any single character except \n with .
• Any number (including zero) with *
• Optional (zero or one) with ?

Pattern	Matches
\d	a single digit
\D	a single non-digit
\w	an alphanumeric character (0-9, a-z, A-Z, and _)
\W	a non-alphanumeric character
\s	a whitespace character
\S	a non-whitespace character
\b	a word boundary (between a \w and a \W, in either order)
\B	a non-word boundary
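
A short sketch of a few of these character classes, run against a made-up sample string (the string and variable names are illustrative only):

```python
import re

text = 'Room 101, floor 7'

digits = re.findall(r'\d', text)         # every single digit
non_space_runs = re.findall(r'\S+', text)  # runs of non-whitespace characters
word_starts = re.findall(r'\b\w', text)  # first character of each word

print(digits)          # ['1', '0', '1', '7']
print(non_space_runs)  # ['Room', '101,', 'floor', '7']
print(word_starts)     # ['R', '1', 'f', '7']
```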

string module

Python's string module has predefined string constants that are handy for testing.

e.g.

import string

printable_str=string.printable

print(len(printable_str))

print(printable_str[0:10],printable_str[10:36])

output

100

0123456789 abcdefghijklmnopqrstuvwxyz

To find all digits:

digits_all=re.findall(r'\d',printable_str)

print(digits_all)

To find all digits, letters, or underscores:

re.findall(r'\w', printable_str)
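
As a quick sanity check of these classes, the sketch below counts how many of the 100 printable characters each class matches (the variable names are illustrative):

```python
import re
import string

digits = re.findall(r'\d', string.printable)
word_chars = re.findall(r'\w', string.printable)
whitespace = re.findall(r'\s', string.printable)

print(len(digits))      # 10
print(len(word_chars))  # 63: 10 digits + 52 letters + the underscore
print(len(whitespace))  # 6: space, tab, newline, return, and two others
```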

Pattern specifiers

The characters ^ and $ are called anchors:

^ anchors the search to the beginning of the search string.
$ anchors it to the end. For example, .$ matches any character at the end of the line, including a period.

Pattern : Matches
 abc : literal abc
 ( expr ) : expr
 expr1 | expr2 : expr1 or expr2
 . : any character except \n
 ^ : start of source string
 $ : end of source string
 prev ? : zero or one prev
 prev * : zero or more prev, as many as possible
 prev *? : zero or more prev, as few as possible
 prev + : one or more prev, as many as possible
 prev +? : one or more prev, as few as possible
 prev { m } : m consecutive prev
 prev { m, n } : m to n consecutive prev, as many as possible
 prev { m, n }? : m to n consecutive prev, as few as possible
 [ abc ] : a or b or c (same as a|b|c)
 [^ abc ] : not (a or b or c)
 prev (?= next ) : prev if followed by next
 prev (?! next ) : prev if not followed by next
 (?<= prev ) next : next if preceded by prev
 (?<! prev ) next : next if not preceded by prev

E.g.

>>> source = '''I wish I may, I wish I might
... Have a dish of fish tonight.'''

Find wish or fish anywhere:

>>> re.findall('wish|fish', source)
['wish', 'wish', 'fish']

Or you can use another way:

>>> re.findall('[wf]ish', source)
['wish', 'wish', 'fish']

Find wish at the beginning:

>>> re.findall('^wish', source)
[]

Find fish at the end:

>>> re.findall('fish$', source)
[]

Find ght followed by a non-alphanumeric:

>>> re.findall('ght\W', source)
['ght\n', 'ght.']

Find I followed by wish:

>>> re.findall('I (?=wish)', source)
['I ', 'I ']
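
The lazy quantifiers, counts, and lookbehind rows of the table are not exercised above. A rough sketch on the same source string (the variable names are chosen here for illustration):

```python
import re

source = '''I wish I may, I wish I might
Have a dish of fish tonight.'''

# Greedy .* grabs as much as it can; lazy .*? stops at the first chance
greedy = re.search(r'I.*wish', source).group()
lazy = re.search(r'I.*?wish', source).group()

# {4} asks for exactly four word characters between word boundaries
four_letter_words = re.findall(r'\b\w{4}\b', source)

# Lookbehind: wish only if preceded by 'I '
after_i = re.findall(r'(?<=I )wish', source)

# Negative lookahead: 'I ' plus the next word, unless that word starts with wish
not_wish = re.findall(r'I (?!wish)\w+', source)

print(greedy)             # I wish I may, I wish
print(lazy)               # I wish
print(four_letter_words)  # ['wish', 'wish', 'Have', 'dish', 'fish']
print(after_i)            # ['wish', 'wish']
print(not_wish)           # ['I may', 'I might']
```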

Dealing with conflicts between string escape characters and regular expressions

In the example above, the following pattern should match any word that begins with fish:

>>> re.findall('\bfish', source)
[]

In Python strings, \b is an escape character that means backspace, but in the mini-language of regular expressions it means a word boundary.

To avoid this conflict, always put an r character before your regular expression pattern string (making it a raw string), and Python escape characters will be disabled.

>>> re.findall(r'\bfish', source)
['fish']
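
To see what actually reaches the regular expression engine in each case, this small sketch compares the two string literals directly (plain and raw are illustrative names):

```python
import re

source = '''I wish I may, I wish I might
Have a dish of fish tonight.'''

plain = '\bfish'  # Python turns \b into one backspace character (\x08)
raw = r'\bfish'   # the raw string keeps the backslash and the b

print(len(plain))  # 5
print(len(raw))    # 6

# The plain pattern looks for a literal backspace followed by fish
print(re.findall(plain, source))  # []
print(re.findall(raw, source))    # ['fish']
```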

Binary data

Bytes: immutable, like a tuple of bytes.

Bytearray: mutable, like a list of bytes.

blist=[1,2,3,255]

the_bytes=bytes(blist)

print(the_bytes)

print(the_bytes[1])

print(bytearray(the_bytes))

Output:

b'\x01\x02\x03\xff'
2
bytearray(b'\x01\x02\x03\xff')

You cannot change a bytes object; this assignment raises a TypeError:

the_bytes[1]=124

Bytearray:

the_byte_array=bytearray(blist)

print(the_byte_array)

output

bytearray(b'\x01\x02\x03\xff')

You can change a value in a bytearray:

the_byte_array[1]=127
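
The two behaviours can be put side by side in one runnable sketch. It reuses the names from the examples above; error_message is an extra name introduced here to capture the exception text:

```python
blist = [1, 2, 3, 255]

the_bytes = bytes(blist)
try:
    the_bytes[1] = 127           # bytes is immutable, so this fails
except TypeError as err:
    error_message = str(err)
    print('TypeError:', error_message)

the_byte_array = bytearray(blist)
the_byte_array[1] = 127          # bytearray is mutable, so this works
print(the_byte_array)            # bytearray(b'\x01\x7f\x03\xff')
```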

Convert Binary Data with struct

With the struct module, you can convert binary data to and from Python data structures.

In the following example, we will extract the width and height of an image from some PNG data.

For this example, I downloaded a PNG from http://oldsite.polycode.org/img/polycode_logo.png and saved it as /Volumes/Data/colors.png.

import struct

file=open('/Volumes/Data/colors.png','rb') # this line and the next one will be covered in chapter 8

data=file.read()

valid_png=b'\x89PNG\r\n\x1a\n'

if data[:8]==valid_png:

    print("It's a valid png file")

    width,height=struct.unpack('>2L',data[16:24])

    print('the width is {} and the height is {}'.format(width,height))

else:

    print('invalid png file')

Output:

It's a valid png file

the width is 204 and the height is 204

• valid_png contains the 8-byte sequence that marks the start of a valid PNG file.
• width is extracted from bytes 16-19 (the slice data[16:20]), and height from bytes 20-23 (the slice data[20:24]).

The format string >2L (equivalent to >LL) instructs unpack() how to interpret its input byte sequences and assemble them into Python data types. Here’s the breakdown:

• The > means that integers are stored in big-endian format.
• Each L specifies a 4-byte unsigned long integer.

The endian specifiers go first in the format string.
Endian specifiers:

Specifier	Byte order
<	little endian
>	big endian

Big-endian integers have the most significant bytes to the left. For more about endian, check here http://frankfu.click/python/endian-big-and-little/.
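
A minimal sketch of the difference: the same integer packed with each specifier, and what happens if the bytes are read back with the wrong one (big, little, and misread are illustrative names):

```python
import struct

big = struct.pack('>L', 154)     # most significant byte first
little = struct.pack('<L', 154)  # least significant byte first

print(big)     # b'\x00\x00\x00\x9a'
print(little)  # b'\x9a\x00\x00\x00'

# Reading big-endian bytes as little-endian gives a very different number
misread = struct.unpack('<L', big)[0]
print(hex(misread))  # 0x9a000000
```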

You can examine each 4-byte value directly. The values below come from a different PNG, one that is 154 pixels wide and 141 pixels high:

>>> data[16:20]
b'\x00\x00\x00\x9a'
>>> data[20:24]
b'\x00\x00\x00\x8d'

Because the width and height are each less than 256, they fit into the last byte of each sequence. You can verify that these hex values match the expected decimal values:

>>> 0x9a
154
>>> 0x8d
141

When you want to go in the other direction and convert Python data to bytes, use the struct pack() function:

>>> import struct
>>> struct.pack('>L', 154)
b'\x00\x00\x00\x9a'
>>> struct.pack('>L', 141)
b'\x00\x00\x00\x8d'

Specifier	Description	Bytes
x	skip a byte	1
b	signed byte	1
B	unsigned byte	1
h	signed short integer	2
H	unsigned short integer	2
i	signed integer	4
I	unsigned integer	4
l	signed long integer	4
L	unsigned long integer	4
Q	unsigned long long integer	8
f	single precision float	4
d	double precision float	8
p	count and characters	1+ count
s	characters	count

The type specifiers follow the endian character. Any specifier may be preceded by a number that indicates the count; 5B is the same as BBBBB.

E.g. You can use a count prefix instead of >LL:

>>> struct.unpack('>2L', data[16:24])
(154, 141)

You could also use the x specifier to skip the uninteresting parts. The data here is 30 bytes long, but only bytes 16-23 are needed:
• Use big-endian integer format (>)
• Skip 16 bytes (16x)
• Read 8 bytes—two unsigned long integers (2L)
• Skip the final 6 bytes (6x)

>>> struct.unpack('>16x2L6x', data)
(154, 141)
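
For a version that runs without a file on disk, the sketch below assumes the same 30 header bytes that the construct example later in this post uses, and checks that the format string accounts for every byte:

```python
import struct

# First 30 bytes of a PNG: signature, IHDR chunk header, width, height, ...
data = (b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR'
        b'\x00\x00\x00\x9a\x00\x00\x00\x8d\x08\x02\x00\x00\x00\xc0')

fmt = '>16x2L6x'  # skip 16 bytes, read two unsigned longs, skip 6 bytes

# unpack() requires the data length to equal the size the format describes
print(len(data), struct.calcsize(fmt))  # 30 30

width, height = struct.unpack(fmt, data)
print(width, height)  # 154 141
```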

Other Binary Data tools

• bitstring
• construct
• hachoir
• binio

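E.g., we can use construct to extract the PNG info. This example uses an older construct API; newer releases of the library have renamed or removed some of these classes, so it may not run as-is: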

>>> from construct import Struct, Magic, UBInt32, Const, String
>>> # adapted from code at https://github.com/construct
>>> fmt = Struct('png',
... Magic(b'\x89PNG\r\n\x1a\n'),
... UBInt32('length'),
... Const(String('type', 4), b'IHDR'),
... UBInt32('width'),
... UBInt32('height')
... )
>>> data = b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR' + \
... b'\x00\x00\x00\x9a\x00\x00\x00\x8d\x08\x02\x00\x00\x00\xc0'
>>> result = fmt.parse(data)
>>> print(result)

Output:

Container:
length=13
type = b'IHDR'
width = 154
height = 141

Convert bytes/strings with binascii

The binascii module converts between binary data and various string representations: hex (base 16), base 64, uuencoded, and others.

Use binascii.hexlify(binary_data) to convert from binary to hex.

Use binascii.unhexlify(hex_data) to convert from hex to binary.
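
A short round trip, using the 8-byte PNG signature from the struct section as sample data (hex_data and round_trip are illustrative names):

```python
import binascii

valid_png = b'\x89PNG\r\n\x1a\n'

# hexlify() returns a bytes object with two hex digits per input byte
hex_data = binascii.hexlify(valid_png)
print(hex_data)  # b'89504e470d0a1a0a'

# unhexlify() reverses the conversion
round_trip = binascii.unhexlify(hex_data)
print(round_trip == valid_png)  # True
```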

Bit operators

The examples below use the integers a=5 (binary 0b0101) and b=1 (binary 0b0001).

Operator	Description	Example	Decimal result	Binary result
&	and	a&b	1	0b0001
|	or	a|b	5	0b0101
^	exclusive or	a^b	4	0b0100
~	flip bits	~a	-6	binary representation depends on int size
<<	left shift	a<<1	10	0b1010
>>	right shift	a>>1	2	0b0010

Explanation:

The & operator returns bits that are set in both arguments, and | returns bits that are set in either of them. The ^ operator returns bits that are in one or the other, but not both. The ~ operator reverses all the bits in its single argument; this also reverses the sign because an integer’s highest bit indicates its sign (1 = negative) in two’s complement arithmetic, used in all modern computers. The << and >> operators just move bits to the left or right.
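
The table rows can be reproduced directly; this sketch stores each result in a dictionary (results is an illustrative name) and prints it in decimal and binary:

```python
a = 5  # 0b0101
b = 1  # 0b0001

results = {
    'a & b': a & b,    # bits set in both
    'a | b': a | b,    # bits set in either
    'a ^ b': a ^ b,    # bits set in exactly one
    '~a': ~a,          # all bits flipped, same as -a - 1
    'a << 1': a << 1,  # shift left: multiply by 2
    'a >> 1': a >> 1,  # shift right: integer-divide by 2
}

for expression, value in results.items():
    print(expression, value, bin(value))
```

For the negative result, bin() shows a minus sign followed by the magnitude (-0b110) rather than a two's complement bit pattern, because Python integers have no fixed width.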

Reference

Format: http://www.python-course.eu/python3_formatted_output.php

Regular expression: https://docs.python.org/3/library/re.html

Endian (Big and little )
