In this tutorial, we are going to scrape the web step by step using Python’s Beautiful Soup library. Python 3 makes scrapers quick to write, and Beautiful Soup, a third-party library, gives you a clean way to parse the HTML you download.
What is Web Scraping?
When you want to extract data from a website, you can use web scraping. According to Wikipedia’s definition, web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.
Usually, the recommended way to get data from a website is through its API. But when no API is available, we fall back on web scraping.
Is Web Scraping Legally Allowed?
Web scraping is a legal grey area. Many websites restrict or forbid it in their terms of service, so check the website’s policies or ask the website owner first.
So, make sure you are fully aware of what you are doing, and scrape only websites that allow it. You can certainly scrape your own website. But don’t scrape or crawl someone else’s website without obtaining their permission.
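One place where a website often states its crawling policy in machine-readable form is its robots.txt file. Here is a minimal sketch of checking such a file with Python’s built-in urllib.robotparser. The robots.txt content and the URLs are made up for illustration, and a robots.txt rule is not legal permission, so still read the site’s terms.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://mywebsite.com/blog/post"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://mywebsite.com/private/data"))
```

For a live site, you would call set_url() with the site’s robots.txt address and then read() instead of parse().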
Why Use Python For Web Scraping?
Python 3 is one of the most popular programming languages for web scraping. Scrapers are quick and easy to write in it. Also, many of the web scraping tools that ship with Kali Linux are written in Python.
Enough theory; let’s start scraping the web using the Beautiful Soup library.
Web Scraping using Python’s Beautiful Soup
The first thing to do when you set out to scrape a website is to open it and analyze it. Web scraping is all about understanding the website: its HTML structure, where the data sits on the page, and so on.
The next thing you need to do is get the necessary tools and packages. I’m using Python IDLE to do the scraping, so you should have that ready on your system.
You can also write the code in your Python shell if you prefer. After that, we need to install the necessary packages: ‘bs4’ (which is Beautiful Soup), ‘requests’, and ‘lxml’.
So go to your command line (CMD) and install them one by one, if you don’t have them already. If you are on macOS or Linux, use pip3 instead of pip in the following commands.
pip install bs4
pip install lxml
‘requests’ is not part of Python’s standard library, although some Python distributions bundle it. If you don’t have it on your system, install it too.
pip install requests
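If you want to confirm that the installs worked before moving on, a small check like the one below can help. It is only a sketch: the helper name missing_packages is my own, and it simply asks Python whether each package can be found.

```python
import importlib.util

# The three packages this tutorial relies on.
REQUIRED = ["bs4", "lxml", "requests"]


def missing_packages(names):
    """Return the names from `names` that Python cannot find on this system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means everything is installed.
print(missing_packages(REQUIRED))
```

Any name that shows up in the printed list still needs a pip install.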
Now, all your packages are ready. Go to your Python IDLE or Python Shell and let’s write some code.
First of all, we need to import all three packages. So, let’s do that.
import requests
import bs4
import lxml
Next, you have to make a request to the website that you want to scrape. Let’s create a variable ‘res’ to hold the response.
res = requests.get('https://mywebsite.com')
Replace mywebsite.com, which is just a placeholder, with your own URL.
The ‘res’ variable now holds the server’s response, including the full HTML of the page. If you type ‘res.text’ and hit enter, you can see that HTML.
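A request can fail or return an error page, so it is worth checking the response before parsing it. The sketch below wraps the request in a small helper. The function name fetch_html and the 10-second timeout are my own choices, not something the requests library prescribes.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the HTML of `url`, raising an exception on HTTP errors."""
    res = requests.get(url, timeout=timeout)
    res.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return res.text


# Usage (performs a real network request, so it is left commented out):
# html = fetch_html('https://mywebsite.com')
# print(html[:200])
```

You can also look at ‘res.status_code’ yourself; a value of 200 means the request succeeded.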
We need to extract information from this variable. Here comes the use of the beautiful soup library.
We are going to create an object called ‘soup’. For that, we use bs4 and its class called ‘BeautifulSoup’.
Its constructor takes two arguments here: the first is ‘res.text’, and the second is the parser that turns the HTML into a tree. In this case, we are using lxml.
soup = bs4.BeautifulSoup(res.text,'lxml')
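If lxml is not installed, Beautiful Soup raises bs4.FeatureNotFound when you ask for it. Here is a hedged sketch of a helper that falls back to Python’s built-in ‘html.parser’ in that case. The name make_soup and the sample HTML string are made up for illustration.

```python
import bs4


def make_soup(html: str) -> bs4.BeautifulSoup:
    """Parse `html` with lxml if available, else with the built-in parser."""
    try:
        return bs4.BeautifulSoup(html, "lxml")
    except bs4.FeatureNotFound:
        # lxml is not installed; html.parser ships with Python itself.
        return bs4.BeautifulSoup(html, "html.parser")


# A tiny inline page stands in for res.text so the sketch runs offline.
soup = make_soup("<html><head><title>Demo page</title></head><body></body></html>")
print(soup.title)
```

The two parsers can treat badly broken HTML slightly differently, so stick to one of them within a project where you can.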
For example, let’s say we want to extract the information about the title tag of that website. So, let’s create a new variable.
title = soup.select('title')
You can pass any HTML tag name instead of ‘title’. Now, let’s check what is inside this ‘title’ variable by typing ‘title’ and hitting enter.
You will see a list containing the title tag of the website as the output, and ‘title[0].get_text()’ gives just the text inside it. You have just scraped the title of that website using Python.
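To make the shape of the result concrete, here is a small offline sketch. The HTML string is made up, and I use the built-in ‘html.parser’ so it runs even without lxml; you can swap ‘lxml’ back in.

```python
import bs4

# Made-up page content standing in for res.text.
SAMPLE_HTML = """
<html>
  <head><title>My Website</title></head>
  <body><h1>Welcome</h1></body>
</html>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

title = soup.select("title")       # always a list of matching tags
print(title)                       # [<title>My Website</title>]
print(title[0].get_text())         # My Website

first = soup.select_one("title")   # first match, or None if nothing matches
print(first.get_text())            # My Website
```

If the selector matches nothing, select() returns an empty list, so check its length before indexing into it.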
You can also scrape data based on a CSS class or id, using ‘.classname’ or ‘#idname’ respectively. Let’s see an example.
title = soup.select('.classname')
# or
title = soup.select('#idname')
Replace ‘classname’ and ‘idname’ with the name of the class or id you want to scrape.
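Here is the same idea on a small made-up snippet, so you can see what comes back. The class name ‘intro’ and the id ‘main’ are invented for this sketch, and the built-in ‘html.parser’ is used so it runs without lxml.

```python
import bs4

# Made-up markup with one id and one repeated class.
SAMPLE_HTML = """
<div id="main">
  <p class="intro">First paragraph.</p>
  <p class="intro">Second paragraph.</p>
  <p>Unstyled paragraph.</p>
</div>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

intro = soup.select(".intro")   # every tag with class="intro"
main = soup.select("#main")     # the tag with id="main"

intro_texts = [p.get_text() for p in intro]
print(intro_texts)              # ['First paragraph.', 'Second paragraph.']
print(main[0].name)             # div
```

A class can match many tags while an id should match only one, but select() returns a list in both cases.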
Finding all the Links from a Website
If you want to find all the links on a web page, you can do that too. For that, we use a ‘for’ loop and a method called ‘find_all’. Passing href=True skips any ‘a’ tag that has no href attribute.
for link in soup.find_all('a', href=True):
    print(link['href'])
Then, you can see all the links listed in your IDLE or shell as output.
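Many of the links you get back will be relative, like ‘/about’. A common follow-up, sketched below, is to turn them into full URLs with urljoin from the standard library. The base URL and the HTML are made up, and ‘html.parser’ is used so the sketch runs without lxml.

```python
from urllib.parse import urljoin

import bs4

# Hypothetical address of the page the HTML came from.
BASE_URL = "https://mywebsite.com/blog/"

# Made-up markup with relative, absolute, and href-less anchors.
SAMPLE_HTML = """
<a href="/about">About</a>
<a href="post-1.html">Post 1</a>
<a href="https://example.org/">External</a>
<a name="top">No href, so it is skipped</a>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# Resolve every href against the page address to get absolute URLs.
links = [urljoin(BASE_URL, a["href"]) for a in soup.find_all("a", href=True)]

for link in links:
    print(link)
```

This prints https://mywebsite.com/about, https://mywebsite.com/blog/post-1.html, and https://example.org/, one per line.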
That covers the basics of web scraping using Python. If you have any questions, feel free to let me know in the comments section below.
If you enjoyed this article, share it with your friends.