Downloading URL using Python


Python 2 offers the module urllib and its more capable companion urllib2 for downloading files from given URLs. The following script shows three different ways to fetch data from the internet.

#!/usr/bin/env python

url = "https://steakovercooked.com/robots.txt"

# Method 1: urllib2.urlopen returns a file-like object to read from
import urllib2
robots = urllib2.urlopen(url)
output = open("c:\\robots1.txt", "wb")
output.write(robots.read())
output.close()

# Method 2: urllib.urlretrieve saves the URL straight to a local file
import urllib
urllib.urlretrieve(url, "c:\\robots2.txt")

# Method 3: a more sophisticated way (adapted from Stack Overflow),
# reading in chunks and printing a simple progress indicator
file_name = url.split("/")[-1]
u = urllib2.urlopen(url)
f = open("c:\\robots3.txt", "wb")
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)

file_size_dl = 0
block_sz = 16384
while True:
    chunk = u.read(block_sz)
    if not chunk:
        break
    file_size_dl += len(chunk)
    f.write(chunk)
    status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8) * (len(status) + 1)  # backspaces rewind the cursor
    print status,

f.close()
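
Note that in Python 3, urllib and urllib2 were merged into the urllib package. A rough sketch of the first two methods in Python 3 (assuming the same URL and output paths as above) would look like this:

#!/usr/bin/env python3
import urllib.request

url = "https://steakovercooked.com/robots.txt"

# Equivalent of urllib2.urlopen: read the response and write the bytes yourself
with urllib.request.urlopen(url) as response:
    with open("c:\\robots1.txt", "wb") as output:
        output.write(response.read())

# urlretrieve survives as urllib.request.urlretrieve
urllib.request.urlretrieve(url, "c:\\robots2.txt")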

The above script downloads the file robots.txt from my website and creates three identical copies on the C:\ drive. However, exception handling is missing from the script: if you try to download something that does not exist, or from a non-existent domain, an exception will be thrown. For example,

Traceback (most recent call last):
  File "C:\Python27\test.py", line 6, in <module>
    robots = urllib2.urlopen(url)
  File "C:\Python27\lib\urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 400, in open
    response = self._open(req, data)
  File "C:\Python27\lib\urllib2.py", line 418, in _open
    "_open", req)
  File "C:\Python27\lib\urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 1207, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Python27\lib\urllib2.py", line 1177, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 11004] getaddrinfo failed>
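A minimal sketch of how these failures could be guarded against (the download helper and the robots4.txt path below are illustrative, not part of the original script):

import urllib2

def download(url, path):
    # Catch HTTP-level errors (404, 500, ...) and network-level
    # errors (unknown host, no connection) instead of crashing
    try:
        response = urllib2.urlopen(url, timeout=10)
    except urllib2.HTTPError as e:
        print "HTTP error %d while fetching %s" % (e.code, url)
        return False
    except urllib2.URLError as e:
        print "Failed to reach %s: %s" % (url, e.reason)
        return False
    f = open(path, "wb")
    f.write(response.read())
    f.close()
    return True

download("https://steakovercooked.com/robots.txt", "c:\\robots4.txt")

With such a wrapper, a bad URL or a non-existent domain produces a readable one-line message instead of the traceback above.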

Downloading files is one of the essential techniques in processing internet data, e.g. for spiders and crawlers.

–EOF (The Ultimate Computing & Technology Blog) —
