
How to parse HTML with regular expressions

This is the final step in understanding and creating a web scraper, and it is the part that could see you losing hours or days fine-tuning and grabbing information from an HTML page. If you have not checked out my blog on HTML, it would be a good idea to have a little read so you get a rough idea of it before you carry on. I will do another blog or two to explain a bit more about the Python commands we will be using. Fortunately, regular expressions come as standard with Python (the re module), so you will not have to install anything to run the code in this example.

For this example I will scrape info from the main write-up explaining web scraping. It should remain fairly static, so what I show should not change; if there are slight differences then something may have been altered, but you should be OK to follow along. Hopefully you’re not too bored just yet and this is reasonably easy to follow.

There are a lot of different commands in regular expressions, but the ones I have pretty much always used (because they are what I was first shown and I kind of got stuck in my ways) are (.+?) and .+? which tell Python two things: (.+?) means I want this information, and .+? means ignore this information. You may decide that you find something else better, or even prefer Beautiful Soup (which is probably a better long-term bet, but I will do a separate blog on that). Regular expressions basically match patterns in text such as:

<html>

<body>

<p>my name is bob</p>

</body>

</html>

So to get ‘my name is bob’ your regex would be <p>(.+?)</p>. How I’ve come to understand it, Python reads the HTML like a book until it hits a character that matches the first character in your regex (being < in this case), then checks that the next character in the HTML matches the next one in the regex, and so on until it hits a command such as (.+?). That tells it you want the information from there up until it hits whatever comes after the command in your regex (being < again, the start of </p>). You would contain this in a regex call; again, I’ve always used re.compile as the command with .findall, as it has always worked. Have a practice this way to get a little understanding, then have a look at the cheat sheet and have a play with some other variations. The regex for this would be:

re.compile('<p>(.+?)</p>').findall()

You will have something in the brackets after .findall telling it where to find all the matches. This code will find all the patterns that match your regex, i.e. all the information contained inside <p> </p> tags in whatever you wish to scrape (normally HTML, but you can also scrape text files and pretty much any file you can read in a text viewer like Notepad++; that’s another blog for another day, maybe).
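To see what ends up in the brackets after .findall, here is a minimal sketch run against a small HTML string rather than a fetched page. The second paragraph (rita) is added purely for illustration, and the variable names are my own. It also compares (.+?) with (.+), which is not covered above but shows why the question mark matters.

```python
import re

# Sample HTML with two paragraphs (the second one is added for illustration).
html = "<html><body><p>my name is bob</p><p>my name is rita</p></body></html>"

# Lazy capture: (.+?) stops at the first </p> it can reach.
lazy = re.compile('<p>(.+?)</p>').findall(html)
print(lazy)    # ['my name is bob', 'my name is rita']

# Greedy capture: (.+) runs on to the last </p> on the line.
greedy = re.compile('<p>(.+)</p>').findall(html)
print(greedy)  # ['my name is bob</p><p>my name is rita']
```

Without the question mark the capture swallows everything up to the last closing tag on the line, which is rarely what you want when scraping.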

Hopefully you’re still awake reading this, and now I’ll show an example of a scrape from the main write-up post. First you need to import re (for regular expressions) and urllib2 (for fetching the HTML; this is a Python 2 module), then fetch the HTML and finally process it like so:

import urllib2, re

url_to_open = 'http://kodification.co.uk/wp/2017/04/28/web-scrapers-the-third-party-developers-holy-grail/'

def openurl(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    the_page = response.read()
    return the_page

fetch_html = openurl(url_to_open)
regex = re.compile('<p>(.+?)</p>').findall(fetch_html)

This sets up your def to open a URL (I defined url_to_open up at the top so you can change it and try another site quickly and easily just by altering that). It then runs the two lines of code that send the URL to the def to get the HTML returned and run the regex over it. Next you will need to use ‘for’ to tell it what you want to do with the matches. Simply put:

for item in regex:

This tells Python that for every match the regex makes (against the HTML code) you wish to do something. You can have more than one name after the for, but the number of names needs to match the amount of (.+?) groups in the regex, e.g.:

<p>my name is bob<p><span>this is bobs age : 23 !</span>
<p>my name is rita<p><span>this is ritas age : 24 !</span>
<p>my name is sue<p><span>this is sues age : 25 !</span>

The regex for grabbing both parts of the information could be

regex = re.compile('<p>(.+?)<p><span>(.+?)</span>').findall()

Or, using ignore, you could grab just the name and the age. The .+? in the second part ignores the differences in the names (bobs, ritas and sues) because the rest of the text is an exact match, and as they all say ‘my name is’ you could leave that in the regex as well and just grab the names.

regex = re.compile('<p>my name is (.+?)<p><span>this is .+? age : (.+?) !</span>').findall()
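As a rough check of the two patterns, this sketch runs them against the three sample lines joined into one string. The variable names (sample, broad, narrow) are just for illustration. With more than one (.+?) in the pattern, findall returns one tuple per match.

```python
import re

# The three sample lines from the text, joined into one string.
sample = (
    "<p>my name is bob<p><span>this is bobs age : 23 !</span>\n"
    "<p>my name is rita<p><span>this is ritas age : 24 !</span>\n"
    "<p>my name is sue<p><span>this is sues age : 25 !</span>\n"
)

# Two capture groups: findall returns one (group1, group2) tuple per match.
broad = re.compile('<p>(.+?)<p><span>(.+?)</span>').findall(sample)
print(broad[0])  # ('my name is bob', 'this is bobs age : 23 !')

# Fixed text stays in the pattern; the bare .+? skips "bobs", "ritas", "sues".
narrow = re.compile(
    '<p>my name is (.+?)<p><span>this is .+? age : (.+?) !</span>'
).findall(sample)
print(narrow)    # [('bob', '23'), ('rita', '24'), ('sue', '25')]

# Each tuple unpacks straight into the loop variables.
for name, age in narrow:
    print(name + " is " + age)
```

Each match stays on its own line here because the dot does not match a newline by default.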

Don’t forget that spaces also count as characters! As well as tabs! The exclamation marks are from hours of frustration when first learning. The regex above will grab each person’s name and age; then with for you can tell Python you wish to do something with them:

for name, age in regex:

Then finally you need to tell Python what you want to do with it. We will simply print out into IDLE, as that is nice and simple but shows results. The final code for scraping the page and printing results is:

import urllib2, re

url_to_open = 'http://kodification.co.uk/wp/2017/04/28/web-scrapers-the-third-party-developers-holy-grail/'

def openurl(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    the_page = response.read()
    return the_page

fetch_html = openurl(url_to_open)
regex = re.compile('<p>(.+?)</p>').findall(fetch_html)
for item in regex:
    print item
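The code above is Python 2: urllib2 and the print statement do not exist in Python 3. If you are on Python 3, a rough equivalent might look like the sketch below. The find_paragraphs helper is a name I made up for this example, and decoding the page as UTF-8 is an assumption about the site.

```python
import re
import urllib.request  # Python 3 home of what urllib2 used to provide


def openurl(url):
    """Fetch a page and return its HTML as text."""
    req = urllib.request.Request(url)
    with urllib.request.urlopen(req) as response:
        # read() returns bytes in Python 3; UTF-8 is an assumption here.
        return response.read().decode('utf-8', errors='replace')


def find_paragraphs(html):
    """Return the text found between each <p> and </p> pair."""
    return re.compile('<p>(.+?)</p>').findall(html)
```

Looping over find_paragraphs(openurl(url_to_open)) and calling print(item) on each result then does the same job as the script above.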

You can change the <p> and </p> in the regex to different things in the HTML to test it out. Have a little play with printing stuff out and trying to grab more than one thing, and next I’ll show how to stick it into a Kodi add-on. Like I’ve said, this is not ‘the way’ to scrape; there isn’t one really, except what works. This is just a way that I have used and found pretty easy to understand at the start. Hopefully you are enjoying the blogs, and sorry this was a long one. Feel free to comment any edits or questions on any of my blogs.


2 Comments

  1. Petes

    How do I print item in kodi?

    • Origin

      xbmc.log('thing to print') below Kodi 17 and xbmc.log('thing to print', xbmc.LOGNOTICE) for Kodi 17. This will print to the Kodi log from code.
