Assuming you have followed the main write-up in order and have Python installed (Python 2, as urllib2 is not available in Python 3), you will be OK to follow along below in IDLE. This was originally going to cover the whole scraping process but it was getting a little long, so I have just shown how to contain the urllib2 code and will do a separate blog purely on the scrape itself.

The first step is to contain the urllib2 code that retrieves HTML from a website inside a def. A def (a function) allows you to use this code over and over again in your programs by simply using the name that you put after def and then including in brackets any information you need to send to it for it to work, e.g.

import urllib2

def openurl(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    the_page = response.read()

print the_page

If you run this code in IDLE it will give the following error:

Traceback (most recent call last):
  File "C:\Users\*\Desktop\*.py", line 8, in <module>
    print the_page
NameError: name 'the_page' is not defined

This is because the code inside def openurl only runs when you call the def by name, and nothing here calls it. Even if you did call it, the_page is created inside the def and only exists there, so the print line outside the def cannot see it. You can just remove the print line for now. That leaves another problem: def openurl opens a url and reads/stores the information, but it keeps the result to itself and has no influence on the rest of your program. To fix this, simply ask it to return that information for you to use, and the command for this (strangely enough) is return.

import urllib2

def openurl(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    the_page = response.read()
    return the_page

This will now return the_page. Here req is the initial request for the HTML, response is the opened url and the_page is the final read of the HTML. You can name req/response/the_page as you wish, just so long as you use the same name when you refer to them on the following lines.
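To see what return changes without needing a website at all, here is a minimal sketch. The two def names and the example HTML string are made up for illustration and are not part of the scraper.

```python
def without_return():
    # the_page only exists while this def is running,
    # and nothing is handed back to the caller
    the_page = "<html>example</html>"

def with_return():
    # same line as above, but the value is handed back
    the_page = "<html>example</html>"
    return the_page

print(without_return())  # None
print(with_return())     # <html>example</html>
```

A def with no return hands back None, which is why the first print shows None even though the_page was set inside it.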

Now that is set up, just run a quick test to make sure it's returning properly. At the bottom add:

print openurl('http://www.google.co.uk')

This (when you press run, F5 for me in IDLE) will then print out the HTML code from Google. The code is the same as before, but now it's set up so you can open any url by editing what is inside the brackets. Now that we know that is working, the next part is how to use it, and I will continue that in a separate blog to save this one going on too long.
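Printing is only one thing you can do with what openurl returns. As a rough sketch, the returned HTML can also be stored in a variable and passed on. The page_size helper below is hypothetical, and the try/except import is my own assumption so that the same code also runs on Python 3, where urllib2's functions live in urllib.request.

```python
# Assumption: fall back to urllib.request when urllib2 is missing (Python 3)
try:
    import urllib2 as urlrequest          # Python 2
except ImportError:
    import urllib.request as urlrequest   # Python 3

def openurl(url):
    req = urlrequest.Request(url)
    response = urlrequest.urlopen(req)
    the_page = response.read()
    return the_page

def page_size(url):
    # Hypothetical helper: keep the returned HTML in a variable and reuse it
    html = openurl(url)
    return len(html)
```

Calling page_size('http://www.google.co.uk') should then give the length of the HTML that came back, rather than printing all of it.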
