Get most frequent words from a web page in Python
In this tutorial, we will scrape a website using Python to find the most frequent words on a web page. We will be using the PyCharm IDE here.
Prerequisites
In order to build this program, we’ll require the following packages:
- requests : Fetches the content of a web page
- BeautifulSoup : Provides functions for parsing the HTML and working with the parsed result
- collections : Provides Counter, which counts how often each element occurs
import requests
from bs4 import BeautifulSoup
from collections import Counter
We will be following these 4 steps in order to solve this problem:
- Retrieve the HTML
- Parse the HTML
- HTML tree traversal
- Count the top 5 most frequent words
Retrieve the HTML
Here, we’ll use the requests.get() function to fetch the URL and then use .content to get the raw HTML of the page.
# Step 1: Retrieve the HTML
url = "https://www.codespeedy.com/about-us/"
r = requests.get(url)
htmlContent = r.content
Parse the HTML
We use Beautiful Soup with html.parser to parse the HTML page and convert it into a usable form.
soup=BeautifulSoup(htmlContent,'html.parser')
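To see what "a usable form" means without hitting the network, here is a small sketch that parses a made-up HTML string (the sample_html content is purely illustrative) and reads values from the resulting tree.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only for illustration
sample_html = """
<html>
  <head><title>About us</title></head>
  <body>
    <h1>Who we are</h1>
    <p>We write <b>Python</b> tutorials.</p>
  </body>
</html>
"""

demo_soup = BeautifulSoup(sample_html, "html.parser")

# Tags can be reached as attributes of the parsed tree
page_title = demo_soup.title.string
heading = demo_soup.h1.get_text()
# get_text() returns the text of a tag with nested tags such as <b> removed
first_paragraph = demo_soup.p.get_text()

print(page_title)
print(heading)
print(first_paragraph)
```

The parsed object behaves like a tree of tags, which is what makes the traversal in the next step possible.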
HTML Tree Traversal
Here, we use the find_all() function to find all the paragraphs on this web page.
Once we have all the paragraphs, we convert the text to lowercase, strip the punctuation marks, and split it into a list of words.
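# Step 3: HTML tree traversal
# Get the text of every paragraph on the page
paras = soup.find_all('p')
paragraphs = [each.get_text() for each in paras]

# Lowercase the text, split it into words, and strip punctuation from each word
punctuation = "!@#$%^&*()_=+[{]}\\|;:',<.>/?"
res = [word.strip(punctuation) for word in " ".join(paragraphs).lower().split()]
res = [word for word in res if word]  # drop tokens that were only punctuation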
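One detail worth keeping in mind: str.strip(chars) only removes characters from the two ends of a string, so punctuation has to be stripped word by word rather than from the whole text at once. The sketch below illustrates the difference on a made-up sentence, using string.punctuation from the standard library as a stand-in punctuation set.

```python
import string

text = "Hello, world! Hello again... (world)"  # made-up sample text

# Stripping the whole string only touches its two ends
whole_stripped = text.lower().strip(string.punctuation)
print(whole_stripped)

# Stripping each word separately removes punctuation around every word
words = [w.strip(string.punctuation) for w in text.lower().split()]
words = [w for w in words if w]  # drop tokens that were pure punctuation
print(words)
```

In the first result the commas, exclamation mark, and opening parenthesis inside the text survive; in the second, every word is clean and ready to be counted.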
Count the Top 5 most frequent words using Counter
We create a dictionary that stores the words as keys and their occurrences as values.
Once the dictionary is filled, we wrap it in a Counter and call its most_common() method to get the 5 most common words.
counts = dict()
for word in res:
    if word in counts:
        counts[word] += 1
    else:
        counts[word] = 1

counter = Counter(counts)

# Top 5 words from the website
top_5 = counter.most_common(5)
print(top_5)
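As a side note, Counter can also count the words directly from a list, which makes the manual dictionary optional. A minimal sketch on a made-up word list (the sample words are illustrative only):

```python
from collections import Counter

words = ["we", "code", "and", "we", "teach", "and", "we", "learn"]  # made-up sample

# Counter counts the occurrences of each element in the iterable
counter = Counter(words)

# most_common(n) returns the n most frequent (word, count) pairs
top_2 = counter.most_common(2)
print(top_2)  # [('we', 3), ('and', 2)]
```

Both approaches give the same counts; the explicit loop simply makes the counting step visible.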
Complete Code: Get and count most frequent words from a web page in Python
Here is the complete code for this problem:
# requests: fetches the content of the web page
# html.parser: parses the HTML, i.e. converts it into a form we can extract data from
# beautifulsoup4: provides functions for working with the parsed HTML
import requests
from bs4 import BeautifulSoup
from collections import Counter

# Step 1: Retrieve the HTML
url = "https://www.codespeedy.com/about-us/"
r = requests.get(url)
htmlContent = r.content

# Step 2: Parse the HTML
soup = BeautifulSoup(htmlContent, 'html.parser')

# Step 3: HTML tree traversal
# Get the text of every paragraph on the page
paras = soup.find_all('p')
paragraphs = [each.get_text() for each in paras]

# Lowercase the text, split it into words, and strip punctuation from each word
punctuation = "!@#$%^&*()_=+[{]}\\|;:',<.>/?"
res = [word.strip(punctuation) for word in " ".join(paragraphs).lower().split()]
res = [word for word in res if word]  # drop tokens that were only punctuation
# print(res)

# Step 4: Count the top 5 most frequent words
counts = dict()
for word in res:
    if word in counts:
        counts[word] += 1
    else:
        counts[word] = 1

counter = Counter(counts)

# Top 5 words from the website
top_5 = counter.most_common(5)
print(top_5)
Output (the exact words and counts depend on the content of the page when the script is run):
[('and', 16), ('our', 10), ('we', 10), ('to', 8), ('the', 7)]