Python 2 offers the urllib module and its more advanced counterpart, urllib2, for downloading files from a given URL. The following script shows three different ways to fetch data from the internet.

#!/usr/bin/env python

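url = ""  # set this to the URL of the file to download

# Method 1: read the whole response with urllib2 and write it to a file
import urllib2
robots = urllib2.urlopen(url)
output = open("c:\\robots1.txt", "wb")
output.write(robots.read())
output.close()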

# Method 2: let urllib download straight to a file
import urllib
urllib.urlretrieve(url, "c:\\robots2.txt")

# Method 3: download block by block and report progress
# (adapted from Stack Overflow)
file_name = url.split("/")[-1]
u = urllib2.urlopen(url)
f = open("c:\\robots3.txt", "wb")
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)

file_size_dl = 0
block_sz = 16384
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break
    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    # append backspaces so the next status overwrites this one
    status = status + chr(8) * (len(status) + 1)
    print status,

f.close()

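The block-by-block method assumes the server sends a Content-Length header; if it does not, getheaders("Content-Length")[0] raises an IndexError. One possible way around this is to move the loop into a small helper that reports a percentage only when the total size is known. The sketch below is an illustration, not part of urllib2: the name copy_in_blocks and its parameters are hypothetical, io.BytesIO objects stand in for the network response and the output file, and it prints one line per block instead of overwriting the status in place.

```python
import io


def copy_in_blocks(source, target, total_size=None, block_size=16384):
    """Copy source to target block by block; return the number of bytes copied.

    copy_in_blocks is a hypothetical helper. source and target can be any
    file-like objects, e.g. the result of urllib2.urlopen(url) and a file
    opened with mode "wb".
    """
    copied = 0
    while True:
        block = source.read(block_size)
        if not block:
            break
        target.write(block)
        copied += len(block)
        if total_size:
            # Total size known: report bytes copied and a percentage.
            print("%10d  [%3.2f%%]" % (copied, copied * 100.0 / total_size))
        else:
            # No Content-Length available: report the byte count only.
            print("%10d" % copied)
    return copied


# Usage with in-memory stand-ins for the response and the output file.
source = io.BytesIO(b"x" * 40000)
target = io.BytesIO()
copy_in_blocks(source, target, total_size=40000)
```

With a real download, total_size would come from the Content-Length header when it is present and stay None otherwise.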

The above script downloads the file robots.txt from my website and creates three identical copies on the C:\ drive. However, the script has no exception handling: if you try to download a file that does not exist, or use a non-existent domain, an exception is raised. For example,

Traceback (most recent call last):
  File "C:\Python27\", line 6, in <module>
    robots = urllib2.urlopen(url)
  File "C:\Python27\lib\", line 126, in urlopen
    return, data, timeout)
  File "C:\Python27\lib\", line 400, in open
    response = self._open(req, data)
  File "C:\Python27\lib\", line 418, in _open
    "_open", req)
  File "C:\Python27\lib\", line 378, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\", line 1207, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Python27\lib\", line 1177, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno ...] getaddrinfo failed>
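
The missing exception handling could be added roughly as follows. This is a minimal sketch, not a definitive implementation: download is a hypothetical helper name, and the import fallback is only there so the same code also runs on Python 3, where urllib2 was split into urllib.request and urllib.error. HTTPError is a subclass of URLError, so it has to be caught first.

```python
# Python 2 names, with a fallback so the sketch also runs on Python 3.
try:
    from urllib2 import urlopen, URLError, HTTPError
except ImportError:
    from urllib.request import urlopen
    from urllib.error import URLError, HTTPError


def download(url, path):
    """Save url to path; return True on success, False on a handled error.

    download is a hypothetical helper, not part of urllib2.
    """
    try:
        response = urlopen(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404.
        print("HTTP error %s for %s" % (e.code, url))
        return False
    except URLError as e:
        # No usable answer at all, e.g. the domain does not resolve.
        print("Failed to reach %s: %s" % (url, e.reason))
        return False
    except ValueError as e:
        # Raised for malformed URLs such as the empty string.
        print("Invalid URL %r: %s" % (url, e))
        return False
    try:
        data = response.read()
    finally:
        response.close()
    with open(path, "wb") as output:
        output.write(data)
    return True
```

A call such as download(url, "c:\\robots1.txt") then returns False and prints a short message instead of ending the script with a traceback.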

Downloading files is an essential technique for processing internet data, e.g. in spiders.

