This is the first step in scraping a webpage: first you retrieve the information, then you process it later on. There are two ways that I know of to do this (there are probably more, but I’ve only used urllib2 and requests, as stated in the main intro write-up under “How do I ‘request’ the HTML from a website?”):

The great thing about Python is that a lot of things have already been done for you. One option is to use urllib2:

import urllib2

req = urllib2.Request('http://www.google.co.uk')
response = urllib2.urlopen(req)
the_page = response.read()

This opens the URL given in “req =” and then returns the HTML code so you can ‘go to work on it’. An alternative is requests, which (in my opinion) is just as quick but cuts down the number of lines of code you have to remember:

import requests

the_page = requests.get('http://www.google.co.uk').content

That’s it. That will get the HTML code from Google and store it in Python, ready for you to do what you want with it. The only issue I found some people had was that, when using IDLE on their computer, they had to install requests, whereas urllib2 comes as standard with Python 2. Both, however, are available to use in Kodi, so feel free to use either.

To run either of these, just open up your software for running Python (such as Python’s IDLE) and put in the code required, followed by:

print the_page

Then hit run and it will print out all the HTML from your chosen site into the shell, or whatever the equivalent is in the program you use to run Python. The full code for either will be:

Using requests:-

import requests

the_page = requests.get('http://www.google.co.uk').content

print the_page
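If you want the requests version to be a bit sturdier, here is a rough sketch of one way to do it. The function name fetch_with_requests and the 10 second timeout are just my own picks for illustration, and it assumes requests is installed. It stops waiting if the site hangs, and it returns None instead of crashing if the request fails or the site answers with an error page.

```python
import requests


def fetch_with_requests(url, timeout=10):
    """Return the page body, or None if the request fails.

    The name and the default timeout are illustrative assumptions.
    """
    try:
        # timeout stops the script hanging forever on a slow site
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx answers into an exception rather than
        # quietly returning the error page's HTML
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print('Request failed: %s' % error)
        return None
    return response.content
```

Calling fetch_with_requests('http://www.google.co.uk') should give you the same HTML as the short version above, just with a safety net around it.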

Using urllib2:-

import urllib2

req = urllib2.Request('http://www.google.co.uk')
response = urllib2.urlopen(req)
the_page = response.read()

print the_page
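One thing to watch if you are on a newer Python: urllib2 only exists in Python 2. In Python 3 the same functions live in urllib.request, and print needs brackets. Below is a minimal sketch that should run on either version. The name fetch_html and the 10 second timeout are my own choices for illustration, not part of the code above.

```python
try:
    import urllib2 as urlrequest  # Python 2
except ImportError:
    import urllib.request as urlrequest  # Python 3


def fetch_html(url, timeout=10):
    """Open the url and return whatever the server sends back.

    fetch_html and the timeout default are illustrative assumptions.
    On Python 3 the result is bytes, not text.
    """
    req = urlrequest.Request(url)
    response = urlrequest.urlopen(req, timeout=timeout)
    try:
        return response.read()
    finally:
        # Always release the connection, even if read() fails
        response.close()
```

Running print(fetch_html('http://www.google.co.uk')) should then print the page the same way the script above does.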

Either requests or urllib2 will return the HTML code from the majority of websites. However, there are exceptions, such as sites that use Cloudflare. Cloudflare sets ‘tasks’ for anything trying to open the URL (a timeout and some other response), and if those tasks are not completed it will not return the HTML. So if you view the source code on a site, use that URL in the code above, and the printout does not match the source code, chances are you will need to put in a bit more work to get the HTML returned. That is something you will have to research beyond basic scraping.

For Cloudflare, check out this code that has been written (cfscrape). For it to work for me, I had to install node.js and pyexecjs, create a folder on my desktop containing another folder (with the __init__.py inside) and a new Python file to test it out, and then run the following code:

import cfscrape

scraper = cfscrape.create_scraper()
the_page = scraper.get('http://www.google.co.uk').content

print the_page

I tried this code on a site running Cloudflare (strictly for educational purposes, obviously) and it returned the HTML just as I wanted. It took a little while to return because of the extra wait and the tasks it has to complete, but it worked, and worked well. Unfortunately I only tried it briefly for an afternoon, and most of that was spent working out how to get it running, so I never got around to seeing how to implement it in Kodi. When I have some time I will have a look and update this post.
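If you are not sure whether a mismatch between the printout and the page source is down to Cloudflare, a rough check like the one below might help. This is only a guess-based sketch. The status codes (403, 429, 503) and the ‘cloudflare’ value in the Server header are assumptions based on what these pages commonly look like, not something Cloudflare promises, and the function name is made up for this example.

```python
def looks_like_cloudflare_block(status_code, headers):
    """Guess whether a response is a Cloudflare block/challenge page.

    Heuristic only: the status codes and the Server header value
    are assumptions and may not hold for every site.
    """
    server = ''
    for name, value in headers.items():
        # Header names can arrive in any mix of upper and lower case
        if name.lower() == 'server':
            server = value.lower()
    return status_code in (403, 429, 503) and server.startswith('cloudflare')
```

With requests you could pass in response.status_code and response.headers. If it comes back True, that is a hint the plain request is not enough for that site.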

I hope this explains what is happening with the request: Python is simply requesting the HTML code from the URL you provide, and then you can move on to the next stage, which is parsing it (doing what you want with it).
