Email Extractor Application Project in Python
Fellow coders, in this tutorial we are going to create a project in the Python programming language that extracts email addresses from a given website. This is a really interesting project which involves the concept of web scraping. We are going to use the "Beautiful Soup" library to extract information from a website.
Before we begin, please check that you have "beautifulsoup4" installed in your Python environment. If it is missing, install it with pip using the following command:
pip install beautifulsoup4
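To confirm the installation worked, you can run a tiny sanity check that parses a one-line HTML snippet (the snippet itself is just an illustration):

```python
# Sanity check: parse a trivial HTML fragment with Beautiful Soup
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.p.text)  # prints: hello
```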
After installing this library, we can proceed further with our project.
Everyone visits webpages in their day-to-day life. In practice, most websites list their contact emails on a handful of pages: about, careers, contact, or services. So we will look in these particular pages for any emails. In this tutorial, we are going to scrape 'linkedin.com' and extract all the emails that our bot can find.
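The page-selection idea above boils down to keyword-filtering the links we collect. Here is a minimal sketch with made-up hrefs (the link list is hypothetical, just to show the filter):

```python
# Keep only links whose path mentions a likely "contact" page.
# The hrefs below are hypothetical examples.
keywords = ("about", "career", "contact", "services")
links = ["/about-us", "/blog/post-1", "/contact", "/jobs/careers", "/pricing"]

filtered = [link for link in links if any(k in link.lower() for k in keywords)]
print(filtered)  # prints: ['/about-us', '/contact', '/jobs/careers']
```

Lower-casing each href before checking lets one lowercase keyword cover spellings like "Contact" and "contact" without listing every variant.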
Working with the code:
import requests
import re
from bs4 import BeautifulSoup

allLinks = []
mails = []

url = 'https://linkedin.com/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Collect every href on the page
links = [a.attrs.get('href') for a in soup.select('a[href]')]

# Keep only links that look like about/careers/contact/services pages
for i in links:
    if (('contact' in i or 'Contact' in i)
            or ('career' in i or 'Career' in i)
            or ('about' in i or 'About' in i)
            or ('services' in i or 'Services' in i)):
        allLinks.append(i)

allLinks = set(allLinks)

def findMails(soup):
    for name in soup.find_all('a'):
        if name is not None:
            emailText = name.text
            match = bool(re.match(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$', emailText))
            if '@' in emailText and match:
                # Strip whitespace characters from the matched text
                emailText = emailText.replace(' ', '').replace('\r', '')
                emailText = emailText.replace('\n', '').replace('\t', '')
                if emailText not in mails:
                    print(emailText)
                    mails.append(emailText)

for link in allLinks:
    # Absolute links can be fetched directly; relative links are
    # joined onto the base URL first
    if link.startswith('http') or link.startswith('www'):
        r = requests.get(link)
    else:
        r = requests.get(url + link)
    soup = BeautifulSoup(r.text, 'html.parser')
    findMails(soup)

if len(mails) == 0:
    print('NO MAILS FOUND')
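The heart of the program is the email regex, which the code above applies to the text of each anchor tag. You can see the same pattern in isolation by running it with re.findall over a raw string; the snippet below uses a made-up HTML fragment and example.com addresses purely for illustration:

```python
import re

# Hypothetical page text containing two email addresses
html = "Reach us at support@example.com or sales@example.org for help."

# Same pattern used in the main script (without the end anchor,
# so it can match emails embedded in running text)
pattern = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'

emails = sorted(set(re.findall(pattern, html)))
print(emails)  # prints: ['sales@example.org', 'support@example.com']
```

Wrapping the results in set() before sorting mirrors what the main script does with allLinks and mails: it deduplicates addresses that appear on more than one page.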