Python 2 ships with the urllib module and its more capable counterpart urllib2, both of which can download files from a given URL. The following script shows three different ways to fetch data from the internet.
    #!/usr/bin/env python

    url = "https://steakovercooked.com/robots.txt"

    # Method 1: read the whole response with urllib2 and write it to a file
    import urllib2
    robots = urllib2.urlopen(url)
    output = open("c:\\robots1.txt", "wb")
    output.write(robots.read())
    output.close()

    # Method 2: let urllib fetch and save the file in one call
    import urllib
    urllib.urlretrieve(url, "c:\\robots2.txt")

    # Method 3: or more sophisticated way, reading in blocks with progress
    # from stackoverflow
    file_name = url.split("/")[-1]
    u = urllib2.urlopen(url)
    f = open("c:\\robots3.txt", "wb")
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (file_name, file_size)

    file_size_dl = 0
    block_sz = 16384
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break

        file_size_dl += len(buffer)
        f.write(buffer)
        status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
        # Backspace characters move the cursor back so the next status overwrites this one
        status = status + chr(8) * (len(status) + 1)
        print status,

    f.close()
The above script downloads the file robots.txt from my website and creates three identical copies on the C:\ drive. However, the script has no exception handling: if you try to download a file that does not exist, or request a non-existent domain, an exception is raised. For example:
    Traceback (most recent call last):
      File "C:\Python27\test.py", line 6, in <module>
        robots = urllib2.urlopen(url)
      File "C:\Python27\lib\urllib2.py", line 126, in urlopen
        return _opener.open(url, data, timeout)
      File "C:\Python27\lib\urllib2.py", line 400, in open
        response = self._open(req, data)
      File "C:\Python27\lib\urllib2.py", line 418, in _open
        "_open", req)
      File "C:\Python27\lib\urllib2.py", line 378, in _call_chain
        result = func(*args)
      File "C:\Python27\lib\urllib2.py", line 1207, in http_open
        return self.do_open(httplib.HTTPConnection, req)
      File "C:\Python27\lib\urllib2.py", line 1177, in do_open
        raise URLError(err)
    URLError: <urlopen error [Errno ...] getaddrinfo failed>
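One way to add the missing exception handling is sketched below. It is only a minimal sketch and not part of the original script: the function name safe_download is hypothetical, and the import fallback is an assumption added so that the same code also runs on Python 3, where the urllib2 names live in urllib.request and urllib.error. HTTPError is caught before URLError because it is a subclass of URLError.

```python
try:
    # Python 2, as used in the script above
    from urllib2 import urlopen, HTTPError, URLError
except ImportError:
    # Python 3 moved these names (assumption: you may be on Python 3)
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError


def safe_download(url, path):
    """Save url to path. Return True on success, False on a download error.

    safe_download is a hypothetical helper name, not a library function.
    """
    try:
        response = urlopen(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404.
        print("HTTP error %d for %s" % (e.code, url))
        return False
    except URLError as e:
        # The server could not be reached (bad domain, no network, ...).
        print("Failed to reach %s: %s" % (url, e.reason))
        return False

    try:
        with open(path, "wb") as output:
            output.write(response.read())
    finally:
        response.close()
    return True
```

With this helper, a call such as safe_download(url, "c:\\robots1.txt") prints a short message and returns False instead of ending the script with a traceback.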
Downloading files is an essential technique for processing internet data, e.g. in web spiders.
–EOF (The Ultimate Computing & Technology Blog) —