How to extract dates from a text file using Python

In this article, we will discuss how to extract dates from a text file using Python. The text may contain several thousand lines, and you may need only the dates. We will do this using regular expressions.

Extract date from text using Python

Since we are using regular expressions for this purpose, we first need to know some basics.
A regular expression is a pattern that describes a set of strings; any string that fits the pattern is a match. There are several ways to write patterns, and they can look complicated at first, but the basics are simple. If you are new to regular expressions, it is recommended that you read an introductory article on how they work before continuing.

From here on, it is assumed that you know the basics of regular expressions.

We will use only the basic notations for creating a regex pattern for dates. Our target is to match dates that follow the format day/month/year or day-month-year with the day and month containing 2 digits and the year containing 4 digits. Let’s now construct the pattern step by step.

As you may know, \d matches a single digit. To match exactly 2 digits in a row, we specify the value 2 inside {}. So \d{2} matches any run of 2 consecutive digits, wherever it appears in the text. The pattern for the day is \d{2}, for the month \d{2}, and for the year \d{4}. We need to join these 3 with a '/' or a '-', which the character class [/-] matches.

The final regex pattern looks like "\d{2}[/-]\d{2}[/-]\d{4}".
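Before touching a file, the pattern can be tried on a short string. The snippet below is a minimal sketch; the sample sentence is made up for illustration. Note that the last item in the sample is skipped because its day, month and year do not have 2, 2 and 4 digits.

```python
import re

# Raw string, so the backslashes reach the regex engine unchanged
pattern = r"\d{2}[/-]\d{2}[/-]\d{4}"

# Hypothetical sample text, not taken from a real file
sample = "Invoice dated 12/03/2021, due 15-04-2021, ref 1/2/21."

matches = re.findall(pattern, sample)
print(matches)  # ['12/03/2021', '15-04-2021']
```

Because each separator is matched independently by [/-], a string with mixed separators such as 12/03-2021 would also match.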

The hard part is over and the rest of the work is simple.

import re

# Open the file that you want to search and read its
# entire content as a string; the file is closed automatically
with open("doc.txt", "r") as f:
    content = f.read()

# The regex pattern that we created (a raw string, so the
# backslashes are passed to the regex engine unchanged)
pattern = r"\d{2}[/-]\d{2}[/-]\d{4}"

# Will return all the strings that are matched
dates = re.findall(pattern, content)
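When the file has several thousand lines, it can be handy to know which line each date came from. The following is a sketch of one way to do that, scanning line by line instead of reading the whole file at once. The function name find_dates_by_line and the sample lines are assumptions made for this example.

```python
import re

# Compile once so the pattern is not re-parsed for every line
DATE_RE = re.compile(r"\d{2}[/-]\d{2}[/-]\d{4}")

def find_dates_by_line(lines):
    """Yield (line_number, date_string) for every match in an iterable of lines."""
    for number, line in enumerate(lines, start=1):
        for match in DATE_RE.finditer(line):
            yield number, match.group()

# Hypothetical stand-in for the lines of a file
sample_lines = [
    "Meeting moved from 01/02/2020 to 03/02/2020.\n",
    "No dates on this line.\n",
    "Deadline: 15-06-2021\n",
]

results = list(find_dates_by_line(sample_lines))
print(results)
# [(1, '01/02/2020'), (1, '03/02/2020'), (3, '15-06-2021')]
```

An open file object is itself an iterable of lines, so it can be passed to the function directly in place of sample_lines.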

Note that invalid dates such as 40/32/2019 will also be extracted by our regex pattern. We need to filter those out with a range check on the day and the month, and the final code looks as follows:

 

import re

# Open the file that you want to search and read its
# entire content as a string; the file is closed automatically
with open("doc.txt", "r") as f:
    content = f.read()

# The regex pattern that we created
pattern = r"\d{2}[/-]\d{2}[/-]\d{4}"

# Will return all the strings that are matched
dates = re.findall(pattern, content)

# Keep only the dates whose day and month are in range
for date in dates:
    # Normalise the separator so a single split handles both formats
    day, month, year = map(int, date.replace("-", "/").split("/"))
    if 1 <= day <= 31 and 1 <= month <= 12:
        print(date)
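The range check above still accepts dates that do not exist on the calendar, such as 31/02/2020. If that matters, one option is to let the standard datetime module decide. This is a sketch of an alternative check, not part of the code above; the helper name is_real_date and the candidate list are assumptions made for this example.

```python
from datetime import datetime

def is_real_date(text):
    """Return True if text is a valid dd/mm/yyyy or dd-mm-yyyy calendar date."""
    for fmt in ("%d/%m/%Y", "%d-%m-%Y"):
        try:
            # strptime raises ValueError for a wrong format or an impossible date
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False

# Hypothetical strings of the kind re.findall would return
candidates = ["07/04/1998", "31/02/2020", "40/32/2019", "29-02-2020"]

valid = [d for d in candidates if is_real_date(d)]
print(valid)  # ['07/04/1998', '29-02-2020']
```

This check also rejects strings with mixed separators, such as 12/03-2021, because neither format string fits them.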

For example, suppose the content of the text file is as follows:

My name is XXX. I was born on 07/04/1998 in YYY city. 
I graduated from ZZZ college on 09-05-2019.

The output for the above text file is:

07/04/1998
09-05-2019

I hope the article was useful in helping you to extract dates from a text file using Python.
