Structured Text Files

With simple text files, the only level of organization is the line. Sometimes, you want more structure than that. You might want to save data for your program to use later, or send data to another program.

There are many formats, and here’s how you can distinguish them:

• A separator, or delimiter, character like tab (‘\t’), comma (‘,’), or vertical bar (‘|’). The comma-separated values (CSV) format is an example.

• ‘<‘ and ‘>’ around tags. Examples include XML and HTML.

• Punctuation. An example is JavaScript Object Notation (JSON).

• Indentation. An example is YAML (which, depending on the source, stands for “YAML Ain’t Markup Language” or “Yet Another Markup Language”).

• Miscellaneous, such as configuration files for programs.

CSV

 

Delimited files are often used as an exchange format for spreadsheets and databases. You could read CSV files manually, a line at a time, splitting each line into fields at comma separators, and adding the results to data structures such as lists and dictionaries. But it’s better to use the standard csv module, because parsing these files can get more complicated than you think.

• Some have alternate delimiters besides a comma: ‘|’ and ‘\t’ (tab) are common.
• Some have escape sequences. If the delimiter character can occur within a field, the entire field might be surrounded by quote characters or preceded by some escape character.
• Files have different line-ending characters. Unix uses ‘\n’, Microsoft uses ‘\r\n’, and Apple used to use ‘\r’ but now uses ‘\n’.
• There can be column names in the first line.
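
The csv module covers most of these variations through arguments to reader() and writer(). The following is a minimal sketch, assuming a made-up two-row data set and an in-memory io.StringIO buffer in place of a real file. It writes with a vertical bar delimiter, lets the writer quote a field that contains the delimiter, and reads the rows back.

```python
import csv
import io

# Hypothetical sample rows; the second field of the first row contains the delimiter.
rows = [['Doctor', 'No|Never'], ['Rosa', 'Klebb']]

# io.StringIO stands in for a file opened with newline=''.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer, delimiter='|')
writer.writerows(rows)
text = buffer.getvalue()
print(repr(text))

# Reading back with the same delimiter recovers the original fields.
buffer.seek(0)
round_trip = list(csv.reader(buffer, delimiter='|'))
print(round_trip)
```

The field containing ‘|’ is wrapped in double quotes on output, so the reader does not mistake it for two fields.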

Write a CSV file:

First, import the csv module:

>>> import csv

Then, build the data as a list of rows, where each row is a list of fields. Note the comma after each row:

>>> villains = [
 ... ['Doctor', 'No'],
 ... ['Rosa', 'Klebb'],
 ... ['Mister', 'Big'],
 ... ['Auric', 'Goldfinger'],
 ... ]

Finally, open the file and write the rows to it:

>>> with open('villains', 'wt') as fout: # a context manager
...     csvout = csv.writer(fout)
...     csvout.writerows(villains)

To verify the file has been correctly written:

$ cat villains

Doctor,No
Rosa,Klebb
Mister,Big
Auric,Goldfinger

Another way is to write the CSV file line by line. Here we first download a CSV file from finance.yahoo.com into memory, and then write it to disk one line at a time.

from urllib import request

telstra_stock = "http://real-chart.finance.yahoo.com/table.csv?s=TLS.AX&d=11&e=2&f=2016&g=d&a=10&b=28&c=1997&ignore=.csv"

def download_stock_data(csv_url):
    response = request.urlopen(csv_url)
    csv_text = response.read().decode('utf-8')  # decode the downloaded bytes into a str
    lines = csv_text.splitlines()  # split the downloaded data into lines
    dest_path = 'telstra.csv'  # the path to save the file
    with open(dest_path, 'w') as fx:  # the context manager closes the file for us
        for line in lines:
            fx.write(line + "\n")  # write the file line by line; "\n" is a line feed

download_stock_data(telstra_stock)

 

Read from a CSV file:

 

Read as lists with the reader() function:

The reader() function returns a reader object whose rows, each a list of strings, can be extracted in a for loop.
With their default options, reader() and writer() separate the columns with commas and put each row on its own line.
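
A minimal sketch of reading the rows back as lists. To stay self-contained, it first rewrites the four-row villains file in a temporary directory; that location is an assumption made for this example, not part of the original session.

```python
import csv
import os
import tempfile

villains = [
    ['Doctor', 'No'],
    ['Rosa', 'Klebb'],
    ['Mister', 'Big'],
    ['Auric', 'Goldfinger'],
]

# Write the sample file into a temporary directory so the example is repeatable.
# newline='' is what the csv documentation recommends for files handed to the module.
path = os.path.join(tempfile.mkdtemp(), 'villains')
with open(path, 'wt', newline='') as fout:
    csv.writer(fout).writerows(villains)

# csv.reader() yields each row as a list of strings.
with open(path, 'rt', newline='') as fin:
    cin = csv.reader(fin)
    rows_read = [row for row in cin]

print(rows_read)
```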

Read as dictionaries with DictReader() and write with DictWriter():

import csv

villains = [
    {'first': 'Doctor', 'last': 'No'},
    {'first': 'Rosa', 'last': 'Klebb'},
    {'first': 'Mister', 'last': 'Big'},
    {'first': 'Auric', 'last': 'Goldfinger'},
    {'first': 'Ernst', 'last': 'Blofeld'},
]
with open('villains', 'wt') as fout:
    cout = csv.DictWriter(fout, ['first', 'last'])
    cout.writeheader()
    cout.writerows(villains)

That creates a villains file with a header line:

 first,last
 Doctor,No
 Rosa,Klebb
 Mister,Big
 Auric,Goldfinger
 Ernst,Blofeld

Now we’ll read it back. By omitting the fieldnames argument in the DictReader() call, we instruct it to use the values in the first line of the file (first,last) as column labels and matching dictionary keys:

 >>> import csv
 >>> with open('villains', 'rt') as fin:
 ...     cin = csv.DictReader(fin)
 ...     villains = [row for row in cin]
 ...
 >>> print(villains)
 [{'last': 'No', 'first': 'Doctor'},
 {'last': 'Klebb', 'first': 'Rosa'},
 {'last': 'Big', 'first': 'Mister'},
 {'last': 'Goldfinger', 'first': 'Auric'},
 {'last': 'Blofeld', 'first': 'Ernst'}]
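
For a file without a header line, the column labels can be passed in explicitly. This is a small sketch, assuming two header-less rows held in an io.StringIO buffer rather than a file on disk:

```python
import csv
import io

# Header-less CSV text standing in for a file without column names.
data = io.StringIO('Doctor,No\nRosa,Klebb\n', newline='')

# With fieldnames supplied, every line (including the first) is treated as data.
reader = csv.DictReader(data, fieldnames=['first', 'last'])
people = [dict(row) for row in reader]
print(people)
```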

 

XML

 


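Delimited files convey only rows and columns. To exchange nested data structures between programs, you need a format that can encode hierarchies as text, and XML is the most prominent markup format that fits the bill. It uses tags to delimit data, as in this sample menu.xml file: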

<?xml version="1.0"?>
<menu>
<breakfast hours="7-11">
<item price="$6.00">breakfast burritos</item>
<item price="$4.00">pancakes</item>
</breakfast>
<lunch hours="11-3">
<item price="$5.00">hamburger</item>
</lunch>
<dinner hours="3-10">
<item price="8.00">spaghetti</item>
</dinner>
</menu>

Following are a few important characteristics of XML:

• Tags begin with a < character. The tags in this sample were menu, breakfast, lunch, dinner, and item.
• Whitespace is ignored.
• Usually a start tag such as <menu> is followed by other content and then a final matching end tag such as </menu>.
• Tags can nest within other tags to any level. In this example, item tags are children of the breakfast, lunch, and dinner tags; they, in turn, are children of menu.
• Optional attributes can occur within the start tag. In this example, price is an attribute of item.
• Tags can contain values. In this example, each item has a value, such as pancakes for the second breakfast item.
• If a tag named thing has no values or children, it can be expressed as a single tag by including a forward slash just before the closing angle bracket, such as <thing/>, rather than a start and end tag, like <thing></thing>.
• The choice of where to put data (attributes, values, or child tags) is somewhat arbitrary. For instance, we could have written the last item tag as <item price="$8.00" food="spaghetti"/>.

 >>> import xml.etree.ElementTree as et
 >>> tree = et.ElementTree(file='menu.xml')
 >>> root = tree.getroot()
 >>> root.tag
 'menu'
 >>> for child in root:
 ...     print('tag:', child.tag, 'attributes:', child.attrib)
 ...     for grandchild in child:
 ...         print('\ttag:', grandchild.tag, 'attributes:', grandchild.attrib)
 ...
 tag: breakfast attributes: {'hours': '7-11'}
     tag: item attributes: {'price': '$6.00'}
     tag: item attributes: {'price': '$4.00'}
 tag: lunch attributes: {'hours': '11-3'}
     tag: item attributes: {'price': '$5.00'}
 tag: dinner attributes: {'hours': '3-10'}
     tag: item attributes: {'price': '8.00'}

>>> len(root) # number of menu sections
 3
>>> len(root[0]) # number of breakfast items
 2
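
Beyond tags and attributes, the value inside each tag is available as the element's text. The sketch below assumes a shortened copy of the sample menu, parsed from a string with fromstring() instead of from menu.xml, and collects each item's value and price:

```python
import xml.etree.ElementTree as et

# A shortened copy of the sample menu, parsed from a string instead of menu.xml.
menu_xml = """<?xml version="1.0"?>
<menu>
<breakfast hours="7-11">
<item price="$6.00">breakfast burritos</item>
<item price="$4.00">pancakes</item>
</breakfast>
<lunch hours="11-3">
<item price="$5.00">hamburger</item>
</lunch>
</menu>"""

root = et.fromstring(menu_xml)

# iter('item') walks the whole tree and yields every item element.
prices = {item.text: item.get('price') for item in root.iter('item')}
print(prices)

# find() returns the first matching direct child of the element.
lunch_hours = root.find('lunch').get('hours')
print(lunch_hours)
```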

For more about xml.etree:

https://docs.python.org/3.3/library/xml.etree.elementtree.html

 

Other standard Python XML libraries include:

xml.dom
The Document Object Model (DOM), familiar to JavaScript developers, represents Web documents as hierarchical structures. This module loads the entire XML file into memory and lets you access all the pieces equally.
xml.sax
Simple API for XML, or SAX, parses XML on the fly, so it does not have to load everything into memory at once. Therefore, it can be a good choice if you need to process very large streams of XML.
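
To make the difference concrete, here is a rough sketch that reads the same one-item document both ways. The document string and the ItemCounter handler class are invented for this illustration; xml.dom.minidom is the lightweight DOM implementation in the standard library.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical one-item menu used only for this comparison.
menu_xml = '<menu><lunch hours="11-3"><item price="$5.00">hamburger</item></lunch></menu>'

# DOM: the whole document is parsed into a tree first, then queried.
dom = minidom.parseString(menu_xml)
dom_items = [node.firstChild.data for node in dom.getElementsByTagName('item')]

# SAX: a handler receives events while the parser streams through the input.
class ItemCounter(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called once for every opening tag the parser encounters.
        if name == 'item':
            self.count += 1

counter = ItemCounter()
xml.sax.parseString(menu_xml.encode('utf-8'), counter)
print(dom_items, counter.count)
```

The DOM version can revisit any node after parsing, while the SAX handler only sees each tag once as it streams past.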

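XML security

XML from untrusted sources can be crafted to attack a parser, for example with entity definitions that expand until memory is exhausted. Use the third-party defusedxml library as a security front end for the other libraries:

 >>> # insecure: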
 >>> from xml.etree.ElementTree import parse
 >>> et = parse(xmlfile)
 >>> # protected:
 >>> from defusedxml.ElementTree import parse
 >>> et = parse(xmlfile)

JSON

 

Unlike the variety of XML modules, there’s one main JSON module, with the unforgettable name json. Its dumps() function encodes data to a JSON string, and its loads() function decodes a JSON string back to data.

Example:

menu = {
    "breakfast": {
        "hours": "7-11",
        "items": {
            "breakfast burritos": "$6.00",
            "pancakes": "$4.00"
        }
    },
    "lunch": {
        "hours": "11-3",
        "items": {
            "hamburger": "$5.00"
        }
    },
    "dinner": {
        "hours": "3-10",
        "items": {
            "spaghetti": "$8.00"
        }
    }
}

import json
menu_json = json.dumps(menu)
print(menu_json)
{"lunch": {"hours": "11-3", "items": {"hamburger": "$5.00"}}, 
"breakfast": {"hours": "7-11", "items": {"pancakes": "$4.00", "breakfast burritos": "$6.00"}}, 
"dinner": {"hours": "3-10", "items": {"spaghetti": "$8.00"}}}

To turn the JSON string menu_json back into a Python data structure (menu2), use loads():

 >>> menu2 = json.loads(menu_json)
 >>> menu2
 {'breakfast': {'items': {'breakfast burritos': '$6.00', 'pancakes':
 '$4.00'}, 'hours': '7-11'}, 'lunch': {'items': {'hamburger': '$5.00'},
 'hours': '11-3'}, 'dinner': {'items': {'spaghetti': '$8.00'}, 'hours': '3-10'}}
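
The same module also works directly with files through dump() and load(). A minimal sketch, assuming a cut-down menu and a hypothetical menu.json path in a temporary directory:

```python
import json
import os
import tempfile

# A cut-down version of the menu, kept short for the example.
menu = {"lunch": {"hours": "11-3", "items": {"hamburger": "$5.00"}}}

# Hypothetical file location in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'menu.json')

# dump() writes JSON text to an open file; indent=2 makes it easier to read.
with open(path, 'wt') as fout:
    json.dump(menu, fout, indent=2)

# load() parses JSON text from an open file back into Python data.
with open(path, 'rt') as fin:
    menu_from_file = json.load(fin)

print(menu_from_file == menu)
```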

Handling dates and times in JSON

The JSON standard does not define date or time types; it expects you to define how to handle them. Passing a datetime directly to dumps() therefore raises an error:

 >>> import datetime
 >>> now = datetime.datetime.utcnow()
 >>> now
 datetime.datetime(2013, 2, 22, 3, 49, 27, 483336)
 >>> json.dumps(now)
 Traceback (most recent call last):
 # ... (deleted stack trace to save trees)
 TypeError: datetime.datetime(2013, 2, 22, 3, 49, 27, 483336) is not JSON serializable

You can convert the datetime to something JSON understands, such as a string or an epoch value (the number of seconds since January 1, 1970):

 >>> now_str = str(now)
 >>> json.dumps(now_str)
 '"2013-02-22 03:49:27.483336"'
 >>> from time import mktime
 >>> now_epoch = int(mktime(now.timetuple()))
 >>> json.dumps(now_epoch)
 '1361526567'

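If datetime values are nested inside otherwise serializable data, converting each one by hand is tedious. Instead, you can subclass json.JSONEncoder and override its default() method to handle datetime: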

 >>> class DTEncoder(json.JSONEncoder):
 ...     def default(self, obj):
 ...         # isinstance() checks the type of obj
 ...         if isinstance(obj, datetime.datetime):
 ...             return int(mktime(obj.timetuple()))
 ...         # else it's something the normal encoder knows:
 ...         return json.JSONEncoder.default(self, obj)
 ...
 >>> json.dumps(now, cls=DTEncoder)
 '1361526567'
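
A lighter alternative to subclassing is the default argument of dumps(), which takes a function to call for any object the encoder cannot handle. The sketch below is one possible approach, not the only one: the encode_datetime helper and the event dictionary are invented for illustration, and it stores the datetime as an ISO 8601 string so it can be parsed back with fromisoformat().

```python
import datetime
import json

def encode_datetime(obj):
    """Fallback called by json.dumps() for objects it cannot serialize."""
    if isinstance(obj, datetime.datetime):
        return obj.isoformat()
    raise TypeError(f'{type(obj).__name__} is not JSON serializable')

# A fixed timestamp (taken from the session above) keeps the output repeatable.
event = {'name': 'backup', 'when': datetime.datetime(2013, 2, 22, 3, 49, 27, 483336)}

event_json = json.dumps(event, default=encode_datetime)
print(event_json)

# Decoding returns a plain string; converting it back to a datetime is up to the reader.
decoded = json.loads(event_json)
when = datetime.datetime.fromisoformat(decoded['when'])
print(when == event['when'])
```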

The isinstance() function checks whether the object obj is of the class datetime.datetime.

 >>> type(234)
 <class 'int'>
 >>> isinstance(234, int)
 True
 >>> type('hey')
 <class 'str'>
 >>> isinstance('hey', str)
 True
 >>> isinstance('234', int)
 False