Email Extractor Application Project in Python

Fellow coders, in this tutorial we are going to build a Python project that extracts the email addresses from a given website. It is an interesting project built on the concept of web scraping: we will use the requests library to download pages and Beautiful Soup to pull information out of their HTML.
Before we begin, please check that the “beautifulsoup4” and “requests” packages are installed in your Python environment. If they are not, install them with the following command, which works the same way on Mac, Linux and Windows:

pip install beautifulsoup4 requests

After installing these libraries, we can proceed further with our project.

Everyone visits web pages in their day-to-day life. Websites that publish contact emails usually place them on one of a few pages: about, careers, contact or services. So we will only follow links from the home page whose address mentions one of these words, and look for emails on those pages. In this tutorial, we are going to scrape ‘linkedin.com’ and print every email that our bot can find in the link text of those pages.
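To make the page-selection step concrete, here is a minimal sketch that needs no network access. The helper name candidate_links, the KEYWORDS tuple and the example.com links are made up for illustration; the sketch only assumes that the links are available as plain strings.

```python
from urllib.parse import urljoin

# Words that usually appear in the address of a page listing contact details.
KEYWORDS = ("about", "career", "contact", "services")


def candidate_links(base_url, hrefs):
    """Return absolute URLs for the hrefs that mention one of KEYWORDS."""
    selected = set()
    for href in hrefs:
        # Lowercase first so "About" and "about" are treated alike.
        if any(word in href.lower() for word in KEYWORDS):
            # urljoin resolves relative links and leaves absolute ones alone.
            selected.add(urljoin(base_url, href))
    return sorted(selected)


# Hypothetical links, as they might be read from <a href="..."> attributes.
hrefs = ["/About-us", "contact.html", "/pricing", "https://example.com/careers"]
for link in candidate_links("https://example.com/", hrefs):
    print(link)
```

Collecting the results in a set drops duplicate links, so each page would be downloaded only once.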

Working with the code:

import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = 'https://linkedin.com/'
keywords = ('contact', 'career', 'about', 'services')
# Anchored pattern: the link text must start with an email address and end with it.
emailPattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+$')
mails = []

# Collect every link on the home page.
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')
links = [a.attrs.get('href') for a in soup.select('a[href]')]

# Keep only the links that mention one of the keywords (case-insensitive).
# urljoin resolves relative links against the base URL.
allLinks = set()
for i in links:
    if any(word in i.lower() for word in keywords):
        allLinks.add(urljoin(url, i))


def findMails(soup):
    # Check the text of every <a> tag against the email pattern.
    for name in soup.find_all('a'):
        # Remove whitespace before matching, so padded text still matches.
        emailText = re.sub(r'\s+', '', name.text)
        if emailPattern.match(emailText) and emailText not in mails:
            print(emailText)
            mails.append(emailText)


for link in allLinks:
    try:
        r = requests.get(link, timeout=10)
    except requests.RequestException:
        # Skip links that cannot be fetched (for example mailto: links).
        continue
    findMails(BeautifulSoup(r.text, 'html.parser'))

if len(mails) == 0:
    print("NO MAILS FOUND")

Output:

trademark@linkedin.com
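
The script above only inspects the text of link tags. As a complementary sketch, a similar pattern can be applied to any block of text with re.findall. The function name extract_emails and the sample sentence are made up for illustration, and the pattern is a simplified one that will not cover every valid email address.

```python
import re

# Unanchored variant of the pattern, so it can match anywhere in a string.
EMAIL_RE = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+')


def extract_emails(text):
    """Return the unique email-like strings in text, in order of first appearance."""
    seen = []
    for candidate in EMAIL_RE.findall(text):
        if candidate not in seen:
            seen.append(candidate)
    return seen


# Hypothetical sample text with one repeated address.
sample = "Write to info@example.com or sales@example.co.uk. Again: info@example.com"
print(extract_emails(sample))
```

Keeping the results in a list rather than a set preserves the order in which the addresses were found.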
