This is the final step to understanding and creating a web scraper, and it is the part that could see you losing hours or days fine-tuning and grabbing information from HTML. If you have not checked out my blog on HTML it would be a good idea to have a little read first so you can get a rough idea before you carry on. I will do another blog or two to explain a bit more about the Python commands we will be using. Unlike regular expressions, Beautiful Soup will need to be installed, and there are a few ways you can do this: you can download it and extract it into C:/Python27/Lib like you would a zip file, you can install it with pip (I've never used it myself, so best to have a little google), or you can install it using Windows PowerShell (be sure to right click and open as administrator) and then type:

easy_install BeautifulSoup4

When I installed Python it also installed easy_install, which I have found to be a great way to install new packages quickly and easily.

While you are there you may as well install requests so we can see it in action too, so type:

easy_install requests

That's it, you are now good to go, and in the future if you find you need to install a module or package this is what you will need to do. For this example I will also scrape info from the main write-up explaining web scraping. It should remain fairly static so whatever I show should not change; if there are slight differences then something may have altered, but you should be OK to follow on.
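If you do fancy giving pip a go instead of easy_install, I believe the equivalent commands are along these lines (beautifulsoup4 is the name the package goes by on PyPI):

pip install beautifulsoup4
pip install requests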

First off you need to import the modules that you will be using:

import requests
from bs4 import BeautifulSoup

Because both requests and Beautiful Soup are imported code there is a lot that they do, but to use them you just need to import them and then use certain commands to run them. Nice and easy really. With requests you just use requests.get('url in here') to fetch the HTML, much like urllib2 requests HTML, then you add .text or .content to tell Python that you want the plain text of the HTML (you can check the response's .status_code to see if a url exists, and there's a quick example of that further down; some other bits may be worth a google as well). Then with Beautiful Soup you ask it to grab you all the tags of a certain type in the HTML (check the HTML write-up if you are not sure what a tag is). So something like this:

import requests
from bs4 import BeautifulSoup

url = 'http://kodification.co.uk/wp/2017/04/28/web-scrapers-the-third-party-developers-holy-grail/'

info = requests.get(url).text
soup = BeautifulSoup(info, "html.parser")
# loop through every <a> tag and print whatever is inside its href=""
for link in soup.find_all('a'):
    print(link.get('href'))

This will open up the url (which I defined up at the top so it can be easily changed for another site while testing) for the main web scraping write-up, return the text, then use Beautiful Soup to parse the result. The 'for' part is saying that you want everything with an <a> tag, naming each one as link. Then the line underneath retrieves the url from each one, so the tag itself would be <a href=" "></a> and it will find that and print out what's inside the " " after href=. For a look at some other commands for Beautiful Soup check out this doc. You may have noticed the "html.parser" in 'soup =': this is pretty much to tell Beautiful Soup how to parse things (or at least that's what it said in the error message I got telling me to add it).
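As a quick follow-on, here is a rough sketch of the .status_code check I mentioned earlier plus a couple of the other Beautiful Soup commands, using the same url as before. Treat it as a sketch of one way to do it rather than the only way:

import requests
from bs4 import BeautifulSoup

url = 'http://kodification.co.uk/wp/2017/04/28/web-scrapers-the-third-party-developers-holy-grail/'

response = requests.get(url)
# 200 means the page exists and loaded ok
print(response.status_code)

soup = BeautifulSoup(response.text, "html.parser")
# find() grabs just the first matching tag instead of all of them
print(soup.find('title').get_text())
# you can also narrow find_all() down to links that actually have an href
for link in soup.find_all('a', href=True):
    print(link['href'])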

The great advantage of Beautiful Soup over regular expressions is that with regular expressions the page only needs to change a single character and you're pretty screwed; you need to take time out and redo your scrape. But with Beautiful Soup, as long as the tag (the a) and its attribute (the href, or other bits after the a such as title=" ") stay the same then your scrape will still work, which with 2 years of editing things behind me I'm obviously a little annoyed I didn't pick up from the start. There is also parseDOM (parse the DOM), which once you've had a go with Beautiful Soup looks like it's going to be pretty similar, but I've not tried it myself so feel free to comment any info that may help others.
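To make that a bit more concrete, here is a little sketch using some made-up HTML (so purely an illustration) of why the regex version is more fragile. Imagine the site owner later adds a class to the tag:

import re
from bs4 import BeautifulSoup

# made-up snippet; a class="new" has been added since the scrape was written
html = '<a class="new" href="http://example.com" title="example">example</a>'

# a regex written for the old <a href="..."> layout now finds nothing
print(re.findall('<a href="(.+?)"', html))

# beautiful soup still works because the tag (a) and attribute (href) are unchanged
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))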
