Company name matching from csv with matching score – Python

I have finally come up with a solution for matching company names using Python. This time, we are going to match company names from two CSV datasets even when the names are misspelled or inaccurate. Exact string comparison can't handle this, so we need string similarity measures instead.

Thanks to the great mathematicians who built nice algorithms.

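We can use a similarity ratio based on Levenshtein distance to score how similar two strings are, and set a minimum score that a pair must reach to count as a match. The Levenshtein distance is simply the number of single-character edits (insertions, deletions, or substitutions) needed to turn the first string into the second. The ratio rescales that distance to a score from 0 to 100.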
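To make the scoring concrete, here is a minimal standard-library sketch of both ideas. The function names `levenshtein` and `indel_ratio` are my own, not part of any library. One caveat: as far as I know, `fuzz.ratio` in rapidfuzz is a normalized Indel similarity, a close relative of Levenshtein in which a substitution counts as one deletion plus one insertion, so the second function follows that definition.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep if equal)
        prev = curr
    return prev[-1]


def indel_ratio(a: str, b: str) -> float:
    """0-100 similarity where only insertions and deletions are allowed."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    # Length of the longest common subsequence, by dynamic programming
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    distance = total - 2 * prev[-1]  # characters that must be deleted or inserted
    return 100.0 * (1 - distance / total)


print(levenshtein("Bloomberg", "Bloomberg LP"))            # 3 edits: add " LP"
print(round(indel_ratio("Bloomberg", "Bloomberg LP"), 2))  # 85.71
```

A pair such as "Bloomberg" and "Bloomberg LP" clears a cutoff of 80, while unrelated names score far lower.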

I have two CSV files: one has more than 100k rows of company names, and the other has 300 company names. The issue is that the company names in the second file might be misspelled, have spacing issues, or contain typos.

So my task is to match the company names in the second file against the first file. When matching company names are found, the program creates a new file and stores all of them in it.

A naive approach checks each of the 300 company names against every row of the large dataset, which is roughly 30 million comparisons (300 × 100,000). That is time-consuming even on high-end machines.

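I have come up with a solution to reduce the execution time by using tokens: each name is split into 3-character tokens (q-grams), and only names that share enough tokens with the query, measured by Jaccard similarity, go on to the more expensive fuzzy comparison.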
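Here is a small standard-library sketch of what the token step computes. The helpers `qgrams` and `jaccard` are hypothetical names of mine. The `#` and `$` padding characters are an assumption about py_stringmatching's default tokenizer behaviour, so treat the exact numbers as illustrative.

```python
def qgrams(text: str, q: int = 3) -> set:
    """Set of overlapping q-character tokens, with assumed '#'/'$' padding."""
    padded = "#" * (q - 1) + text + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


query = qgrams("Bloomberg")
good = qgrams("Bloomberg LP")
bad = qgrams("AlphaSights Ltd.")

print(sorted(qgrams("abc")))  # ['##a', '#ab', 'abc', 'bc$', 'c$$']
print(jaccard(query, good))   # 0.5625, passes a 0.3 threshold
print(jaccard(query, bad))    # 0.0, filtered out before fuzzy matching
```

Because the check only uses set operations, it can discard most of the large file before any edit-distance scoring happens.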

Let's say my large dataset is eligible_company.csv

and the second dataset file is not_picked_by_matching.csv

Python company name matching code from dataframes

My final program looks like this. It uses pandas, rapidfuzz, and py_stringmatching.

import pandas as pd
from rapidfuzz import process, fuzz
from py_stringmatching.tokenizer.qgram_tokenizer import QgramTokenizer
from py_stringmatching.similarity_measure.jaccard import Jaccard


not_picked = pd.read_csv('not_picked_by_matching.csv')
eligible = pd.read_csv('eligible_company.csv')


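# Query names: first column of not_picked_by_matching.csv (blank cells dropped, values forced to str)
not_picked_list = not_picked.iloc[:, 0].dropna().astype(str).tolist()
# Reference names: the 'Organisation Name' column of eligible_company.csv
eligible_list = eligible['Organisation Name'].dropna().astype(str).tolist()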


# 3-character q-gram tokenizer and Jaccard set-similarity measure used for blocking
tokenizer = QgramTokenizer(qval=3)
jaccard = Jaccard()

# Blocking step: keep only the choices whose q-gram Jaccard similarity
# with the query is at least `threshold`
def get_blocking_candidates(query, choices, threshold=0.3):
    query_tokens = tokenizer.tokenize(query)
    candidates = []
    for choice in choices:
        choice_tokens = tokenizer.tokenize(choice)
        if jaccard.get_raw_score(query_tokens, choice_tokens) >= threshold:
            candidates.append(choice)
    return candidates

# Fuzzy step: return up to 5 (choice, score, index) tuples scoring at least score_cutoff
def fuzzy_match(query, choices, score_cutoff=80):
    matches = process.extract(query, choices, scorer=fuzz.ratio, score_cutoff=score_cutoff, limit=5)
    return matches


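# Collect [query name, matched name, score] rows
results = []


for company in not_picked_list:
    # Narrow the large list down to plausible candidates before fuzzy scoring
    candidates = get_blocking_candidates(company, eligible_list)
    if candidates:
        matches = fuzzy_match(company, candidates)
        for match in matches:
            # Each match is a (choice, score, index) tuple
            results.append([company, match[0], match[1]])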


results_df = pd.DataFrame(results, columns=['Not Picked Company', 'Matched Company', 'Match Score'])

# Save the results to a CSV file
results_df.to_csv('matching_results.csv', index=False)

print("Results have been saved to matching_results.csv")

If you run this program, it will create a new CSV file, matching_results.csv, containing only the matching company names along with their matching scores.

The rows look like this:

AlphaSights,AlphaSights Ltd.,81.4814814814815
AlphaSights,AlphaSights Ltd.,81.4814814814815
Associated British Ports,Associated British Ports (ABP),88.88888888888889
Associated British Ports,Associated British Foods Plc,80.76923076923077
Associated British Ports,Associated British Foods Plc,80.76923076923077
Bloomberg,Bloomberg LP,85.71428571428572
Bloomberg,Bloomberg LP,85.71428571428572

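Duplicate rows like these typically appear when the same name occurs more than once in one of the input files. If you wish, you can remove them too, for example by calling results_df.drop_duplicates() before saving.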
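One possible refinement, which is not part of the program above: the blocking function tokenizes every name in the large file again for each query. The sketch below tokenizes the large list once and builds an inverted index from q-gram to row positions, so each query is scored only against names that share at least one token. It uses the standard library only. All names here (`qgrams`, `build_index`, `blocking_candidates`) are hypothetical, and the `#`/`$` padding is an assumption about the tokenizer's defaults.

```python
from collections import defaultdict


def qgrams(text, q=3):
    """Set of overlapping q-character tokens, with assumed '#'/'$' padding."""
    padded = "#" * (q - 1) + text + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def build_index(choices, q=3):
    """Tokenize every choice once and map each token to the positions using it."""
    token_sets = [qgrams(choice, q) for choice in choices]
    index = defaultdict(set)
    for pos, tokens in enumerate(token_sets):
        for token in tokens:
            index[token].add(pos)
    return token_sets, index


def blocking_candidates(query, choices, token_sets, index, threshold=0.3, q=3):
    """Return choices whose Jaccard similarity with the query is >= threshold."""
    query_tokens = qgrams(query, q)
    # Only names sharing at least one token can have a non-zero Jaccard score
    positions = set()
    for token in query_tokens:
        positions |= index.get(token, set())
    candidates = []
    for pos in sorted(positions):
        tokens = token_sets[pos]
        score = len(query_tokens & tokens) / len(query_tokens | tokens)
        if score >= threshold:
            candidates.append(choices[pos])
    return candidates


choices = ["Bloomberg LP", "Associated British Foods Plc", "AlphaSights Ltd."]
token_sets, index = build_index(choices)
print(blocking_candidates("Bloomberg", choices, token_sets, index))
# ['Bloomberg LP']
```

For any threshold above zero this should return the same candidates as a full scan, since a name with no shared token has a Jaccard score of 0.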

My main goal with this tutorial is to show that messy, real-world data like misspelled company names can be handled with Python and a few well-chosen libraries.

I have been working in the data analysis and ML field (Python) since 2018. If you want any kind of help with this type of work, feel free to contact me. I run this website, so you can click the contact button and send us a message, or email me directly at contact@codespeedy.com

Find URL of official website from company name

I have also created a Python program that will fetch the official websites of all the companies in a single click.

To let you know how to fetch a single website from a single company name, I have created a demo tutorial:

Get official URL of any company using Python

If you want custom code to do this in a more advanced way, feel free to reach out to us.
