Extract all the external links or URLs from a webpage using Python

In this tutorial, we will see how to extract all the external links or URLs from a webpage using Python. We can do this with one of Python's most powerful techniques, known as web scraping: fetch the page, parse its HTML, and collect every link that points to another site. So, with the help of web scraping, let us learn and explore the process of extracting external links and URLs from a webpage.

This article’s first and most important step is installing the required modules and packages from your terminal.

Installations

1. requests module:

This module of Python allows you to make HTTP requests. You can install this module by using the following command.

pip install requests
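As a quick sketch of what this module does (using example.com as a stand-in URL; any reachable page works the same way), a GET request returns a response object carrying the status code and the raw HTML:

```python
import requests

# send a GET request to a placeholder page
response = requests.get("https://example.com/")

# the response object exposes the HTTP status code and the page source
print(response.status_code)  # 200 on success
print(response.text[:50])    # first characters of the returned HTML
```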

2. Beautiful Soup (bs4) module:

The bs4 module of Python allows you to pull or extract data out of HTML and XML files. It can be installed using the command given below:

pip install beautifulsoup4
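To see what bs4 gives us, here is a minimal sketch that parses a hard-coded HTML snippet (a stand-in for a real page) and reads the href attribute of every anchor tag; `find_all` is the modern name for `findAll`, and both work:

```python
from bs4 import BeautifulSoup

# a tiny hard-coded HTML snippet standing in for a downloaded page
html = '<p><a href="https://example.com/">one</a> <a href="/about">two</a></p>'

soup = BeautifulSoup(html, "html.parser")

# find every <a> tag and read its href attribute
hrefs = [a.get("href") for a in soup.find_all("a")]
print(hrefs)  # ['https://example.com/', '/about']
```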

As we are interested in extracting the external URLs of the web page, we will define an empty Python set, named external_urls, to hold them (a set automatically discards duplicate links). Below is an example implementation of extracting the external links or URLs:

Example

Let us see the code given below to understand the concept of extracting the external links or URLs from a webpage using Python:

#import the modules
import requests
from bs4 import BeautifulSoup


# page URL
url = "https://www.codespeedy.com/"

# send a GET request
response = requests.get(url)

# parse the HTML page
html_page = BeautifulSoup(response.text, "html.parser")

# get all <a> tags
all_links = html_page.findAll("a")

# collect the external URLs in a set (a set avoids duplicates)
external_urls = set()

for link in all_links:
    href = link.get('href')
    # treat a link as external when it does not point back to the same site
    if href and "codespeedy.com" not in href:
        external_urls.add(href)

print(f"\n\nTotal External URLs: {len(external_urls)}\n")
for url in external_urls:
    print(f"External URL {url}")

In this tutorial, as you can see, the first step is to import the necessary modules, then get the page URL and send a GET request. Next, parse the HTML page using the BeautifulSoup module and collect all the <a> tags. Then loop over those tags, read the href of each link, and add it to the external_urls set when it does not point back to the same site. Finally, your terminal prints the number of external links or URLs found on the webpage.
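One caveat worth noting: a plain substring check on the href can misclassify links such as javascript:void(0); or mailto: addresses as external URLs. A more robust sketch (not part of the original tutorial) compares hostnames using the standard library's urllib.parse; BASE_DOMAIN here is assumed to be the domain of the page being scraped:

```python
from urllib.parse import urlparse

BASE_DOMAIN = "codespeedy.com"  # domain of the page being scraped

def is_external(href):
    """Return True only for absolute http(s) links that point off-site."""
    parsed = urlparse(href)
    if parsed.scheme not in ("http", "https"):
        return False  # skips javascript:, mailto:, and relative links
    return BASE_DOMAIN not in parsed.netloc

print(is_external("https://www.python.org/"))       # True
print(is_external("https://www.codespeedy.com/x"))  # False
print(is_external("javascript:void(0);"))           # False
```

This version would exclude the javascript:void(0); entry seen in the output above, so use it when you only want real off-site URLs.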

Output:

Total External URLs: 1

javascript:void(0);

 
