Extract all the external links or URLs from a webpage using Python
In this tutorial, we will see how to extract all the external links or URLs from a webpage using Python. We will do this with web scraping, one of Python's most powerful techniques, so let us learn and explore the process of extracting external links and URLs from a webpage.
The first and most important part of this article is installing the required modules and packages from your terminal.
Installations
1. requests module:
This module of Python allows you to make HTTP requests. You can install this module by using the following command.
pip install requests
2. Beautiful Soup (bs4) module:
The bs4 module of Python allows you to pull or extract data out of HTML and XML files. You can install it using the command given below:
pip install beautifulsoup4
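To see what Beautiful Soup gives us before the full example, here is a minimal sketch that parses a small inline HTML snippet and reads the href of every anchor tag (the snippet itself is only for illustration):

```python
from bs4 import BeautifulSoup

# a small HTML snippet used only for illustration
html = '<a href="https://example.com/">Example</a><a href="/about">About</a>'
soup = BeautifulSoup(html, "html.parser")

# findAll("a") returns every anchor tag; .get('href') reads its link
hrefs = [tag.get('href') for tag in soup.findAll("a")]
print(hrefs)  # ['https://example.com/', '/about']
```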
As we are interested in extracting the external URLs of the webpage, we will need to define an empty Python set, namely external_urls. Below is an implementation that extracts the external links or URLs using an example:
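A set is used here because it automatically discards duplicate links, so each external URL is counted only once. A tiny illustration (the sample URL is just for demonstration):

```python
# sets keep only one copy of each link
external_urls = set()
external_urls.add("https://www.python.org/")
external_urls.add("https://www.python.org/")  # duplicate is ignored
print(len(external_urls))  # 1
```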
Example
Let us see the code given below to understand the concept of extracting the external links or URLs from a webpage using Python:
# import the modules
import requests
from bs4 import BeautifulSoup

# get the page url
url = r"https://www.codespeedy.com/"

# send a GET request and parse the HTML page
response = requests.get(url)
html_page = BeautifulSoup(response.text, "html.parser")

# collect every href that points outside the site's own domain
external_urls = set()
for link in html_page.findAll("a"):
    href = link.get('href')
    if href and href.startswith("http") and "codespeedy.com" not in href:
        external_urls.add(href)

print(f"\n\nTotal External URLs: {len(external_urls)}\n")
for url in external_urls:
    print(f"External URL: {url}")
As you can see, the first step is to import the necessary modules, set the page URL, and send a GET request. Next, the response is parsed into an HTML page using the BeautifulSoup module. A for loop then goes over every anchor tag, reads its href, and adds it to the external_urls set when it points outside the site. Finally, the terminal prints the number of external links or URLs found on the webpage.
Output:
Total External URLs: 1
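Note that a plain substring check on the domain can misfire (for example, a link to codespeedy.com.evil.net would be treated as internal). A more robust sketch compares hostnames using only the standard library's urllib.parse; the helper name and sample links below are illustrative:

```python
from urllib.parse import urlparse

def is_external(href, own_host="www.codespeedy.com"):
    # relative links such as "/about" have no hostname and are internal
    host = urlparse(href).netloc
    return bool(host) and host != own_host

print(is_external("https://www.python.org/"))        # True
print(is_external("/extract-links"))                 # False
print(is_external("https://www.codespeedy.com/"))    # False
```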