Web crawler by Python

Example1: To find the links of title in a webpage list:

The target website is http://www.yeeyi.com/bbs/forum-304-1.html

1. Analyse on the url:

if we poke around the page 1, page 2, or others , we can find the rule:

the common part is http://www.yeeyi.com/bbs/forum-304-x.html, the x represent the page number.

2. Analyse on the element we need

We only want to craw the links of the titles, so if we click here, we get into the page.

yeeyi_title1

At the same Time, there are a lot of other links, such as the author name:

yeeyi_author

So we need to check the source code of the webpage and find the unique tag for the target data, right click in a blank area, click “inspect element” or similar name according to your web browser. Usually web developer use class or id to identify elements in webpage.

class_tag

Now we need to create a loop to craw from page 1 to the desired page (max page).

import requests

from bs4 import BeautifulSoup

def trade_spider(max_pages):

    page=1

    while page <= max_pages:

        url1="http://www.yeeyi.com/bbs/forum-304-"+ str(page)+".html"

        print("the page is %s" % url1)

        source_code=requests.get(url1)

        plain_text=source_code.text

        soup=BeautifulSoup(plain_text)

        for link in soup.findAll('a', {'class':'s xst'}):
            href1=link.get('href')
            title=link.string
            print(title +":")
            print(href1)

        page += 1  #the page number will increment after each loop.

Call the crawling function:

trade_spider(3) #we craw from page 1 to page 3

If you want to add the title of the post in front of the link, replace the last 3 lines with following codes:

            title=link.string

            print(title +":")

            print(href1)

            print()

If we call it again

trade_spider(3)

The result :

二手储物架壁橱格子柜出售:

http://www.yeeyi.com/bbs/thread-3641546-1-3.html

出售一台外星人x51打包benq显示器送键盘:

http://www.yeeyi.com/bbs/thread-3636075-1-3.html

………………..

EXAMPLE2: TO FIND THE phone numbers IN A WEBPAGE:

Sometimes we also want to get the phone number from a page.

Target page: http://www.yeeyi.com/bbs/forum.php?mod=viewthread&tid=3645129

In this page, we can find the class of <td>( table data) in every row is same, the title column is “tit”, and the value column is “con” or “con2”. So how we can extract only the phone number out of this table?

After some analysis, I notice that the content of the td before the phone number td is always “电话:”, so I can find the td “电话:” first, then extract the content of the next sibling td ( phone number).

Step1, We begin by reading in the source code for a given web page and creating a Beautiful Soup object with the BeautifulSoup function.

import requests

from bs4 import BeautifulSoup

def get_single_item_data(item_url):

    source_code2=requests.get(item_url)

    plain_text=source_code2.text

    soup=BeautifulSoup(plain_text,"lxml")
    content=soup.prettify() # get the web page code.

Step2, find the table data of “电话:”

The related part of the html code is :

<tr>
                  <td class="tit">
                   电话:            # this is the text we are looking for, the phone number will be the next one.
                  </td>
                  <td class="con2">

                   12345678        # this is the text we actually needed.
                  </td>
                  <td class="tit">
                   微信:
                  </td>
                  <td class="con">
                  </td>

</tr>

In the first example, we find the content according to the class which was easy to find the desired content, this time we are going to find the content between the open and closing tag. So we will use the property “contents“, then use the “nextSibling” go get the content we need.

for hit in soup.findAll(attrs={'class':'tit'}):     # Use for loop to read all the class tit
       if hit.contents[0]=="电话:":                 # Check if the content of the previous tag matches ”电话“
            phone=hit.nextSibling               # If it matched, the next sibling tag content will be the phone number.
            print(phone.contents[0].strip())

url="http://www.yeeyi.com/bbs/forum.php?mod=viewthread&tid=3645129"
get_single_item_data(url)

Step3. Correct the error

By now, if we test the code, we can get the phone number, but there will be some error code:

    if hit.contents[0]=="电话:":

IndexError: list index out of range

This is because we forgot to test if the hit is empty or not, and if think further the hit.nextSibling should also be tested before we print the result. we can add a line like following to correct this:

if hit is not None and len(hit) > 0

    for hit in soup.findAll(attrs={'class':'tit'}):
        if hit is not None and len(hit) > 0:
            if hit.contents[0]=="电话:":
                phone=hit.nextSibling
                If phone is not None and Len(phone) > 0
                   print(phone.contents[0].strip())

Example 3 : combine example 1 and 2, we can call function

All we need is to call the function “get_single_item_data()” after print the title and url:

...
          print(href1)
          get_single_item_data(url)
      page+=1
...

and pass the url crawled from example 1. so the program will crawl all the url and title, contact numbers according to the max page you needed.