Updated April 13, 2023
Introduction to Python BeautifulSoup
BeautifulSoup is a Python library for pulling data out of markup files such as HTML and XML. It provides idiomatic ways to navigate, search, and modify the parse tree, and it works with your favorite parser. In this tutorial, let’s learn how BeautifulSoup works and how to extract exactly the data you want from a page. Some web pages show data relevant to your research but offer no direct download; BeautifulSoup helps you overcome that problem, which is essentially web scraping, and it saves a great deal of human effort and time.
Installation of Python BeautifulSoup
The steps for installing Python BeautifulSoup are given below:
pip install beautifulsoup4
pip install lxml
sudo pip install lxml
pip install future
sudo pip install future
Accessing the HTML of a Webpage
import requests
URL = "https://www.educba.com/software-development/"
r = requests.get(URL)
print(r.content)
Let me elaborate on each piece of the code for you:
- Import the requests library.
- Specify the URL of the webpage you want to scrape.
- Send an HTTP GET request to the specified URL and save the response in an object called r.
- Print r.content, the raw HTML content of the webpage. Note that it is of type bytes, not str.
Parsing the HTML Content
import requests
from bs4 import BeautifulSoup
URL = "https://www.educba.com/software-development/"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib' )
print(soup.prettify())
A really nice thing about the BeautifulSoup library is that it is built on top of HTML parsing libraries such as html.parser, lxml, and html5lib, so you can choose whichever parser suits your needs.
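The following minimal sketch shows how the parser choice is passed as the second argument to the BeautifulSoup constructor. The HTML snippet here is made up purely for illustration; html.parser ships with the standard library, while lxml and html5lib must be installed separately.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document used purely for illustration.
html_doc = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# The second argument selects the backing parser; swap in 'lxml'
# or 'html5lib' here if those libraries are installed.
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title.string)   # Demo
print(soup.p.text)         # Hello
```

All three parsers expose the same BeautifulSoup API; they differ mainly in speed and in how leniently they handle malformed HTML.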
Understanding the Python BeautifulSoup with Examples
The following examples show Python BeautifulSoup in action:
A Simple Quick Scrape: This is nothing more than using requests to fetch the data from the URL of the HTML file in question, then supplying some regular expressions to extract data out of that HTML. Note that this HTML file is full of names, emails, and phone numbers, but it is all just generated data; none of it is real.
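The exact patterns are not shown in the original, so here is a hedged sketch of what such a quick regex scrape might look like. The sample text, the email pattern, and the phone pattern are all assumptions made for illustration, and both patterns are deliberately basic rather than robust validators.

```python
import re

# Sample generated text; the name, email, and number are made up.
data = "Jane Doe, jane.doe@example.com, 555-123-4567"

# A really basic email pattern and a simple US-style phone pattern.
# Both are illustrative only, not production-grade validators.
email_re = re.compile(r'[\w.+-]+@[\w-]+\.[\w.]+')
phone_re = re.compile(r'\d{3}-\d{3}-\d{4}')

print(email_re.findall(data))  # ['jane.doe@example.com']
print(phone_re.findall(data))  # ['555-123-4567']
```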
All that is done here is to create one regex that matches the phone numbers found in the file and another that matches just a really basic email format; it is not a great regex. Next, let us write code to grab HTML from a webpage and see how to parse through it. The code below sends a GET request to the desired webpage and creates a BeautifulSoup object from the HTML.
import requests
from bs4 import BeautifulSoup
vgm_url = 'https://www.vgmusic.com/music/console/nintendo/nes/'
html_text = requests.get(vgm_url).text
soup = BeautifulSoup(html_text,'html.parser')
The soup object is used to search and navigate the HTML for your desired data.
find() and find_all() are among the most powerful methods. Use soup.find() when you know there is only one matching element, such as the body tag, whereas soup.find_all() is the workhorse of web scraping: with it you can iterate over all the hyperlinks and print their URLs. You can also filter on tag attributes and pass different arguments to find_all(), such as regular expressions, to be as specific as you want.
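The difference between the two methods can be sketched on a tiny made-up page (the links below are assumptions, not from the real site):

```python
from bs4 import BeautifulSoup

# A tiny page standing in for a real one; the links are made up.
html_doc = """
<body>
  <a href="https://example.com/a">First</a>
  <a href="https://example.com/b">Second</a>
</body>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# find() returns only the first matching element (or None).
print(soup.find('a')['href'])        # https://example.com/a

# find_all() returns every match, so you can iterate the hyperlinks.
for link in soup.find_all('a'):
    print(link['href'])
```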
Before writing the parsing code, we will look at the HTML that the browser renders. Every webpage is different, so a little web scraping requires pattern recognition and experimentation. Our goal is to download a bunch of MIDI files. The developer tools available in modern browsers usually help when writing code to parse a webpage; inspecting the HTML will help you figure out whether you can access the data programmatically.
We are going to use the find_all() method with regular expressions, because our goal is to get only the links pointing to MIDI files, filtering out those whose link text contains parentheses. This allows us to exclude all the remixes and duplicates.
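The no-parentheses filter can be sketched in isolation. The negative-lookahead pattern only matches a string if no character in it begins a '(' (the sample titles below are made up for illustration):

```python
import re

# Match only strings with no '(' anywhere, which filters out
# remix/duplicate titles like "Theme (Remix)".
no_parens = re.compile(r'^((?!\().)*$')

titles = ['Overworld', 'Overworld (Remix)', 'Dungeon']
kept = [t for t in titles if no_parens.match(t)]
print(kept)  # ['Overworld', 'Dungeon']
```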
import re
import requests
from bs4 import BeautifulSoup

vgm_url = 'https://www.vgmusic.com/music/console/nintendo/nes/'
html_text = requests.get(vgm_url).text
soup = BeautifulSoup(html_text, 'html.parser')

if __name__ == '__main__':
    attrs = {
        'href': re.compile(r'\.mid$')
    }
    tracks = soup.find_all('a', attrs=attrs, string=re.compile(r'^((?!\().)*$'))
    count = 0
    for track in tracks:
        print(track)
        count += 1
    print(len(tracks))
This filters the page down to just the MIDI files we want. Next, we need to iterate over them and understand how to download each one. Adding a little download_track function to the code above, and calling it while iterating over the tracks, downloads the files.
import re
import requests
from bs4 import BeautifulSoup

vgm_url = 'https://www.vgmusic.com/music/console/nintendo/nes/'
html_text = requests.get(vgm_url).text
soup = BeautifulSoup(html_text, 'html.parser')

def download_track(count, track_element):
    # Get the title of the track from the HTML element
    track_title = track_element.text.strip().replace('/', '-')
    download_url = '{}{}'.format(vgm_url, track_element['href'])
    file_name = '{}_{}.mid'.format(count, track_title)
    # Download the track
    r = requests.get(download_url, allow_redirects=True)
    with open(file_name, 'wb') as f:
        f.write(r.content)
    # Print to the console to keep track of how the scraping is coming along.
    print('Downloaded: {}'.format(track_title))

if __name__ == '__main__':
    attrs = {
        'href': re.compile(r'\.mid$')
    }
    tracks = soup.find_all('a', attrs=attrs, string=re.compile(r'^((?!\().)*$'))
    count = 0
    for track in tracks:
        download_track(count, track)
        count += 1
    print(len(tracks))
We pass each BeautifulSoup element representing a link to a MIDI file into download_track(), along with a unique number that goes into the file name, overcoming possible naming collisions.
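The filename scheme can be sketched on its own, without any network access. This is a minimal illustration of the same two ideas used in download_track: a count prefix so two tracks with the same title do not collide, and replacing '/' so a title cannot point into a subdirectory (the sample titles are made up):

```python
# Build a safe, collision-free file name from a count and a track title.
def make_file_name(count, track_title):
    # Replace '/' so the title can't escape into a subdirectory.
    safe_title = track_title.strip().replace('/', '-')
    # Prefix with the count so identical titles get distinct names.
    return '{}_{}.mid'.format(count, safe_title)

print(make_file_name(0, 'Overworld Theme'))   # 0_Overworld Theme.mid
print(make_file_name(1, 'Castle/Dungeon'))    # 1_Castle-Dungeon.mid
```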
Conclusion
If you want to get some data out of any webpage, BeautifulSoup is here for you. It helps you overcome the hurdles of web scraping. It is a Python library that extracts data from markup languages such as XML and HTML, and parsing the content is as simple as creating a BeautifulSoup object.
Recommended Articles
We hope that this EDUCBA information on “Python BeautifulSoup” was beneficial to you. You can view EDUCBA’s recommended articles for more information.