Company name matching from CSV with matching score – Python
I have finally come up with a solution for matching company names using Python. This time, we are going to match company names from two CSV datasets even when the names are misspelled or inaccurate. Simple string operations such as exact comparison can't do this, so we need an algorithm that measures how similar two strings are.
Thanks to the great mathematicians who built nice algorithms.
We can use a Levenshtein-style similarity ratio to score how similar two strings are, and set a minimum score that a pair must reach to count as a match. The underlying edit distance is simply the number of single-character edits required to turn the first string into the second; the ratio rescales it to a score between 0 and 100.
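To make the score concrete, here is a minimal pure-Python sketch of one such similarity ratio. It assumes the insertion/deletion ("Indel") variant of edit distance, which is what the rapidfuzz documentation describes for `fuzz.ratio`. The function names `lcs_length` and `similarity_ratio` are made up for this illustration, and the real library is far faster.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity_ratio(a: str, b: str) -> float:
    """Score from 0 to 100; 100 means the strings are identical."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    # Characters to delete from a plus characters to insert to reach b.
    edits = total - 2 * lcs_length(a, b)
    return 100.0 * (1 - edits / total)


print(similarity_ratio("AlphaSights", "AlphaSights Ltd."))
print(similarity_ratio("Bloomberg", "Bloomberg LP"))
```

The two printed scores come out at about 81.48 and 85.71, the same values that appear for these pairs in the sample output of this tutorial.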
I have two CSV files: one has more than 100k rows of company names, and the other has 300 company names. The issue is that the company names in the second file might be misspelled, or have spacing issues or typos.
So my task is to match the company names of the second file with the first file. If matching company names are found, the program creates a new file and stores all the matching company names in it.
I need to compare each company name in the second file with every row of the large first dataset. That is roughly 300 × 100,000 = 30 million comparisons, so it will be time-consuming even on high-end machines.
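I have come up with a solution to reduce the execution time by using tokens: each name is split into short character tokens (q-grams), and only the names whose tokens overlap enough with the query go on to the slower fuzzy comparison.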
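The token idea can be sketched in plain Python as follows. This is only an illustration under some assumptions: the helper names (`qgrams`, `jaccard`, `build_index`, `blocking_candidates`) are hypothetical, the tokens are lowercased, unpadded 3-character substrings (a library tokenizer may differ, for example in padding), and the large list is tokenised once up front so each query can reuse the token sets.

```python
def qgrams(text, q=3):
    """Set of overlapping q-character substrings of text, lowercased."""
    text = text.lower()
    if len(text) < q:
        return {text} if text else set()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def jaccard(a, b):
    """Shared tokens divided by all distinct tokens (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def build_index(choices, q=3):
    """Tokenise every choice once so repeated queries reuse the sets."""
    return [(choice, qgrams(choice, q)) for choice in choices]


def blocking_candidates(query, index, threshold=0.3, q=3):
    """Keep only the choices whose tokens overlap enough with the query."""
    query_grams = qgrams(query, q)
    return [choice for choice, grams in index
            if jaccard(query_grams, grams) >= threshold]


index = build_index(["Bloomberg LP", "Associated British Foods Plc", "AlphaSights Ltd."])
print(blocking_candidates("Bloomberg", index))
```

Only the surviving candidates then need the more expensive edit-distance comparison, which is where the time saving comes from.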
Let's say my large dataset is eligible_company.csv and the second dataset file is not_picked_by_matching.csv.
Python company name matching code from dataframes
My final program looks like this.
import pandas as pd
from rapidfuzz import process, fuzz
from py_stringmatching.tokenizer.qgram_tokenizer import QgramTokenizer
from py_stringmatching.similarity_measure.jaccard import Jaccard

# Load both datasets
not_picked = pd.read_csv('not_picked_by_matching.csv')
eligible = pd.read_csv('eligible_company.csv')

not_picked_list = not_picked.iloc[:, 0].tolist()  # First column of not_picked_by_matching.csv
eligible_list = eligible['Organisation Name'].tolist()

# 3-character tokens and Jaccard similarity are used for the blocking step
tokenizer = QgramTokenizer(qval=3)
jaccard = Jaccard()

# Function to get blocking candidates based on Jaccard similarity
def get_blocking_candidates(query, choices, threshold=0.3):
    query_tokens = tokenizer.tokenize(query)
    candidates = []
    for choice in choices:
        choice_tokens = tokenizer.tokenize(choice)
        if jaccard.get_raw_score(query_tokens, choice_tokens) >= threshold:
            candidates.append(choice)
    return candidates

# Return up to 5 candidates whose similarity score is at least score_cutoff
def fuzzy_match(query, choices, score_cutoff=80):
    matches = process.extract(query, choices, scorer=fuzz.ratio, score_cutoff=score_cutoff, limit=5)
    return matches

results = []
for company in not_picked_list:
    candidates = get_blocking_candidates(company, eligible_list)
    if candidates:
        matches = fuzzy_match(company, candidates)
        for match in matches:
            # match[0] is the matched name, match[1] is its score
            results.append([company, match[0], match[1]])

results_df = pd.DataFrame(results, columns=['Not Picked Company', 'Matched Company', 'Match Score'])

# Save the results to a CSV file
results_df.to_csv('matching_results.csv', index=False)
print("Results have been saved to matching_results.csv")
If you run this program, it will create a new CSV file, matching_results.csv, containing only the matching company names along with their matching scores.
Like this:
AlphaSights,AlphaSights Ltd.,81.4814814814815
AlphaSights,AlphaSights Ltd.,81.4814814814815
Associated British Ports,Associated British Ports (ABP),88.88888888888889
Associated British Ports,Associated British Foods Plc,80.76923076923077
Associated British Ports,Associated British Foods Plc,80.76923076923077
Bloomberg,Bloomberg LP,85.71428571428572
Bloomberg,Bloomberg LP,85.71428571428572
If you wish, you can remove the duplicate rows too.
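One way to drop the repeated rows is sketched below in plain Python, working on a `results` list shaped like the one the program builds. The helper name `drop_duplicate_rows` is hypothetical, and the sample rows are copied from the output above. The repeats most likely come from names that occur more than once in the large file, though that is an assumption about the data.

```python
results = [
    ["AlphaSights", "AlphaSights Ltd.", 81.4814814814815],
    ["AlphaSights", "AlphaSights Ltd.", 81.4814814814815],
    ["Bloomberg", "Bloomberg LP", 85.71428571428572],
    ["Bloomberg", "Bloomberg LP", 85.71428571428572],
]


def drop_duplicate_rows(rows):
    """Return rows with exact repeats removed, keeping first-seen order."""
    # dict keys are unique and remember insertion order.
    unique = dict.fromkeys(tuple(row) for row in rows)
    return [list(row) for row in unique]


unique_results = drop_duplicate_rows(results)
for row in unique_results:
    print(row)
```

If the results are already in a pandas DataFrame, calling `drop_duplicates()` on it before `to_csv` should give the same effect.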
My main goal behind this tutorial is to show how much we can do with messy data using Python and a few good string-matching algorithms.
I have been working in the data analysis and ML field (Python) since 2018. If you want any kind of help related to this type of work, feel free to contact me. I run this website, so you can click on the contact button and send us a message, or email me directly at contact@codespeedy.com
Find URL of official website from company name
I have also created a Python program that will fetch the official websites of all the companies in a single click.
To let you know how to fetch a single website from a single company name, I have created a demo tutorial:
Get official URL of any company using Python
If you want custom code for doing this in a more advanced way, feel free to reach out to us.