Get most frequent words from a web page in Python

In this tutorial, we will scrape a web page using Python and find the words that occur most frequently on it. We will be using the PyCharm IDE here.

Prerequisites

In order to build this program, we’ll require the following packages:

requests : Fetches the content of a web page
BeautifulSoup : Parses the HTML and provides functions for searching it
collections : Provides Counter, which counts how often each element occurs

import requests
from bs4 import BeautifulSoup
from collections import Counter

We will be following these 4 steps in order to solve this problem:

Retrieve the HTML
Parse the HTML
HTML tree traversal
Count the top 5 most frequent words

Retrieve the HTML

Here, we use the requests.get() function to fetch the URL and then read .content to get the raw HTML of the page.

#Step 1: Retrieve The HTML
url = "https://www.codespeedy.com/about-us/"
r = requests.get(url)
htmlContent = r.content
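A request can hang or return an error page, so it can help to wrap the download in a small function. The following is only a sketch: the name fetch_html and the 10-second default timeout are assumptions made for illustration, not part of the original program.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    # timeout stops the call from waiting forever on a slow server
    response = requests.get(url, timeout=timeout)
    # raise an exception for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text
```

Calling fetch_html("https://www.codespeedy.com/about-us/") would then return the decoded HTML, which can be passed to BeautifulSoup in the same way as r.content.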

Parse the HTML

We use Beautiful Soup with Python's built-in html.parser to parse the HTML page and convert it into a usable form.

soup = BeautifulSoup(htmlContent, 'html.parser')
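To see what parsing gives us without touching the network, here is a small sketch that parses a hand-written HTML string. The sample markup and the variable names are made up for illustration. It also shows that get_text() returns the text of nested tags as well.

```python
from bs4 import BeautifulSoup

# A tiny made-up page used only for this illustration
sample_html = (
    "<html><body><h1>About</h1>"
    "<p>First paragraph.</p>"
    "<p>Second <b>bold</b> paragraph.</p>"
    "</body></html>"
)

soup = BeautifulSoup(sample_html, "html.parser")

# find_all('p') returns every <p> tag; get_text() drops the markup inside it
paragraph_texts = [p.get_text() for p in soup.find_all("p")]
print(paragraph_texts)
# → ['First paragraph.', 'Second bold paragraph.']
```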

HTML Tree Traversal

Here, we use the find_all() function to find all paragraphs on this webpage

Once we have all the paragraphs, we lowercase their text, split it into words, and strip the punctuation marks so that the words can be stored in a list.

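page_text = []
# Get all paragraphs from the page
paras = soup.find_all('p')
for each in paras:
    page_text.append(each.get_text())

# Lowercase the text of every paragraph, split it into words,
# and strip punctuation from both ends of each word
punctuation = "!@#$%^&*()_=+[{]}\\|;:'\",<.>/?"
res = [word.strip(punctuation) for word in " ".join(page_text).lower().split()]
res = [word for word in res if word]  # drop tokens that were only punctuation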

Count the Top 5 most frequent words using Counter

We create a dictionary that stores the words as keys and their occurrences as values.

Once the dictionary is filled with all the keys and values, we pass it to Counter and call its most_common() method to get the 5 most common words along with their counts.

counts = dict()
for word in res:
    if word in counts:
        counts[word] += 1
    else:
        counts[word] = 1

counter = Counter(counts)

# Top 5 words from the website
top_5 = counter.most_common(5)
print(top_5)
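Counter can also count a list of words directly, which makes the manual dictionary optional. Here is a short sketch with a made-up word list (the words are sample data, not taken from the page):

```python
from collections import Counter

# Made-up sample data standing in for the words scraped from a page
words = ["we", "build", "and", "we", "teach", "and", "we", "learn"]

# Counter tallies each distinct word in one pass over the list
counter = Counter(words)

# most_common(n) returns the n highest counts as (word, count) tuples
print(counter.most_common(2))
# → [('we', 3), ('and', 2)]
```

Both approaches give the same counts, so the choice between them is mostly a matter of how much of the counting you want to spell out.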

Complete Code: Get and count most frequent words from a web page in Python

Here is the complete code for this problem

#requests       #Fetches the content of a website
#beautifulsoup4 #Parses the HTML and provides functions for searching it
#html.parser    #Python's built-in parser, used by BeautifulSoup to convert the HTML into a useful form

import requests
from bs4 import BeautifulSoup
from collections import Counter



#Step 1: Retrieve The HTML
url = "https://www.codespeedy.com/about-us/"
r = requests.get(url)
htmlContent = r.content

#Step 2: Parse the HTML
soup = BeautifulSoup(htmlContent, 'html.parser')


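#Step 3: HTML tree traversal
page_text = []
#Get all paragraphs from the page
paras = soup.find_all('p')
for each in paras:
    page_text.append(each.get_text())

#Lowercase the text of every paragraph, split it into words,
#and strip punctuation from both ends of each word
punctuation = "!@#$%^&*()_=+[{]}\\|;:'\",<.>/?"
res = [word.strip(punctuation) for word in " ".join(page_text).lower().split()]
res = [word for word in res if word]  #drop tokens that were only punctuation
#print(res)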

#Step 4: Count the top 5 most frequent words
counts = dict()
for word in res:
    if word in counts:
        counts[word] += 1
    else:
        counts[word] = 1


counter = Counter(counts)
# Top 5 words from the website
top_5 = counter.most_common(5)
print(top_5)

Output (sample run; the exact words and counts depend on the page's current content):

[('and', 16), ('our', 10), ('we', 10), ('to', 8), ('the', 7)]