Tutorial: Download a Web Page

2015-03-02

In this tutorial, we download a web page using Python.

Basics

A browser is software that displays web pages. There is a standard protocol, HTTP, between browser and server that lets them understand each other. The browser finds the server address from the web page address, the so-called URL, opens a connection to the server, and writes request information accordingly. The server acts as a listener. When a connection request with data arrives, the server interprets it and replies with content accordingly.
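To see what this exchange looks like on the wire, here is a minimal sketch that sends an HTTP GET request over a raw socket and reads back the raw response. The function name and arguments are just illustrative; a real client would also handle timeouts and errors:

```python
import socket

def raw_http_get(host, port=80, path="/"):
  """Send a minimal HTTP/1.0 GET request and return the raw response bytes."""
  request = ("GET {} HTTP/1.0\r\n"
             "Host: {}\r\n"
             "\r\n").format(path, host)
  with socket.create_connection((host, port)) as sock:
    sock.sendall(request.encode("ascii"))
    chunks = []
    # With HTTP/1.0 the server closes the connection when it is done,
    # so we read until recv returns an empty bytes object.
    while True:
      chunk = sock.recv(4096)
      if not chunk:
        break
      chunks.append(chunk)
  return b"".join(chunks)
```

Try it against the local server started below, e.g. `raw_http_get("localhost", 8128)`; the first line of the response will be a status line such as `HTTP/1.0 200 OK`, followed by headers, a blank line, and the content.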

Since an HTTP server is already implemented in a standard Python module, we can start a simple one with a single command line:

python -m http.server 8128

After that, open your browser and type the address localhost:8128; you will see a list of files in the current working directory. Note that if you have a public IP address, your current working directory is exposed to the whole world, and anybody can access your files now.

The server side is not covered by this tutorial, but we may need this server to verify that our downloading program behaves correctly.

We usually call the program that initiates the connection to the remote side, the server, a client. The browser is a client. A downloading client is much simpler than a browser. As it is already implemented in a standard Python module, we just need to understand the meaning of its parts to use it.

Downloading Client

import urllib.request

def Download(url):
  """Download url and return its content as bytes."""
  with urllib.request.urlopen(url) as fp:
    return fp.read()


def DownloadToFile(url, filename):
  """Download url and write its content to a file."""
  with open(filename, "wb") as wfp:
    wfp.write(Download(url))
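One thing to notice: urlopen returns bytes, not text. If we want the page as a string, we have to decode it. A sketch of a text variant (the function name and the utf-8 fallback are assumptions, not part of urllib):

```python
import urllib.request

def DownloadText(url, fallback_encoding="utf-8"):
  """Download url and return its content decoded to text.

  The charset is taken from the Content-Type response header when the
  server declares one; otherwise fallback_encoding is assumed.
  """
  with urllib.request.urlopen(url) as fp:
    charset = fp.headers.get_content_charset() or fallback_encoding
    return fp.read().decode(charset)
```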

To verify it, we can check for a token that we know appears in the page (Download returns bytes, so we compare against a bytes literal):

assert (b"Download url and write its content"
        in Download("http://note.weizi.org/ariel/download_page.html"))

Some servers return different content to different clients. A client can tell the server what it is through the HTTP request header User-Agent. We can configure our client to tell the server that it is a real browser:

def DownloadAs(url, user_agent):
  """Download url and return its content, requesting with user_agent."""
  request = urllib.request.Request(url, None, {"User-Agent": user_agent})
  with urllib.request.urlopen(request) as fp:
    return fp.read()

To verify the user agent, we can use a server that echoes the client's user agent back in the page:

url = "http://www.useragentstring.com/"
default = Download(url)
chrome_ua = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.43 Safari/537.36"
chrome = DownloadAs(url, chrome_ua)
assert chrome_ua.encode() in chrome
assert chrome_ua.encode() not in default
# Print them if you want to check.
# print("===============Default===============\n", default)
# print("===============Chrome================\n", chrome)

Problems

Problem 1 The arguments after the Python script name can be accessed through sys.argv. Please write a Python file:

import sys
print(sys.argv)

Then run it with different arguments to learn how sys.argv works.

Problem 2 Write a simple command line program which accepts two arguments, a URL and a filename, downloads the URL, and writes its content to the file.
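One possible sketch for Problem 2, reusing the download code from above (the script name download.py and the main structure are just one way to do it):

```python
import sys
import urllib.request

def DownloadToFile(url, filename):
  """Download url and write its content to a file."""
  with urllib.request.urlopen(url) as fp:
    data = fp.read()
  with open(filename, "wb") as wfp:
    wfp.write(data)

def main(argv):
  """argv is expected to be [script_name, url, filename]."""
  if len(argv) != 3:
    print("usage: {} url filename".format(argv[0]))
    return 1
  DownloadToFile(argv[1], argv[2])
  return 0
```

Wiring `main` up with `sys.exit(main(sys.argv))` under an `if __name__ == "__main__":` guard completes the program; then it can be invoked as, for example, `python download.py http://localhost:8128/ index.html`.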