Detect all the URLs from a .txt file in Python

By VENKATA DINESH USARTI

In this tutorial, we will learn how to detect the URLs from a text file in Python with an example.

Relax and read the tutorial.

The purpose:

The main purpose of the tutorial is to detect the URLs in a text file.

Let’s dive deep with the help of code snippets.

Importing the ‘re’ module and defining ‘extract_urls’ function

import re
def extract_urls(text):

“re” is Python's built-in module for regular expressions.
It provides functions for working with regular expressions, which are powerful tools for pattern matching and string manipulation.
The extract_urls function takes the text to search as input (the contents of the file, not the file path) and returns the URLs it finds.

Creating the regular expression pattern

    url_pattern = re.compile(
        r'\b(?:'
        r'https?|ftp|sftp|ssh|rsync|git|svn|file|smb|nfs|afp|mailto|data|irc|gopher|news|nntp|telnet|wais|ldap|bluetooth|dns-sd|feed|rss|atom|ipfs|magnet|webcal|bitcoin|ethereum|storj|tor|dat|gemini|lbry|zeronet|ygg|amazon|wss|matrix|xmpp|dict|elasticsearch|dynamodb|aerospike|cassandra|mongodb|kafka|redis|couchbase|postgres|mysql|sqlite|oracle|db2|mssql|sybase|informix|couchdb|mariadb|voltdb|cratedb|clickhouse|greenplum|firebird|snowflake|teradata|presto|bigquery|redshift|hive|impala|sparksql|drill|phoenix|calcite|flink|pulsar|kinesis|dremio'
        r'):'
        r'(?://[^\s]*|[^\s]*)'
    )

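This part defines the regular expression pattern, which is designed to match many types of URLs.
The ‘re.compile’ function compiles the pattern into a reusable pattern object.
The ‘\b’ is a word boundary; it ensures that the scheme starts at the beginning of a word, so a scheme name embedded in a longer word (such as ‘xhttp:’) is not matched.
‘|’ is the OR operator; it separates the alternative URL schemes.
‘(?:...)’ is a non-capturing group: it groups the alternatives without storing them separately.
The pattern matches a URL scheme followed by ‘:’, then the rest of the URL, with or without a leading ‘//’.
‘(?://[^\s]*|[^\s]*)’ matches the actual URL content: either ‘//’ followed by non-whitespace characters (as in ‘https://www.example.com’) or non-whitespace characters directly (as in ‘mailto:someone@example.com’).
‘[^\s]’ matches a single character that is not whitespace, and the ‘*’ repeats it zero or more times; this captures the components of the URL such as the host name and path.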
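To see how these pieces behave together, here is a small sketch that uses a shortened scheme list (an assumption made for readability; the tutorial's pattern lists many more schemes) and a made-up sample sentence. It also shows a side effect of ‘[^\s]*’: punctuation that directly follows a URL is treated as part of the match.

```python
import re

# Shortened scheme list, for illustration only.
demo_pattern = re.compile(
    r'\b(?:https?|ftp|mailto|news):'
    r'(?://[^\s]*|[^\s]*)'
)

# Hypothetical sample sentence; note the comma and full stop after two of the URLs.
sample = (
    "Docs: https://www.example.com/a?b=1 and "
    "mailto:someone@example.com, see news:comp.lang.python."
)

matches = demo_pattern.findall(sample)
for match in matches:
    print(match)
# https://www.example.com/a?b=1
# mailto:someone@example.com,
# news:comp.lang.python.
```

In a file where every URL sits on its own line, as in the sample file used here, this side effect does not appear. For running prose, one option is to strip trailing punctuation from each match afterwards.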

Finding URLs using ‘re.findall’

    return re.findall(url_pattern, text)

The function “re.findall” is called with url_pattern and text as arguments.
It finds all the non-overlapping occurrences of the pattern in the text and returns them as a list of strings. Because the pattern uses only non-capturing groups, each item in the list is the full matched URL.
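The choice of a non-capturing group matters for “re.findall”. The following sketch, with two simplified patterns that are assumptions for illustration, compares what is returned when the scheme group is non-capturing and when it is capturing.

```python
import re

text = "Visit https://www.example.com or ftp://ftp.example.com"

# Simplified patterns for illustration; they differ only in the group type.
non_capturing = re.compile(r'\b(?:https?|ftp):[^\s]*')
capturing = re.compile(r'\b(https?|ftp):[^\s]*')

# With no capturing group, findall returns the whole match.
full_matches = non_capturing.findall(text)
print(full_matches)   # ['https://www.example.com', 'ftp://ftp.example.com']

# With one capturing group, findall returns only the group's text.
scheme_only = capturing.findall(text)
print(scheme_only)    # ['https', 'ftp']
```

If the ‘(?:’ in the tutorial's pattern were changed to ‘(’, the function would return scheme names instead of URLs.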

Reading the text file

# Read the text file
file_path = "ulstemp.txt"
with open(file_path, 'r') as file:
    text = file.read()

“file_path” holds the path of the text file that contains the URLs.
“ulstemp.txt” is the text file we used to store the URLs.
The ‘r’ argument opens the file in read-only mode, and the “with open( )” statement closes it automatically when the block ends.
“read()” returns the entire content of the file as a single string.
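As a possible variation, the reading step could be wrapped in a small helper that sets the encoding explicitly and copes with a missing file. The helper name “read_text_file”, the UTF-8 default, and the temporary demo file below are assumptions for this sketch and are not part of the tutorial's code.

```python
import tempfile
from pathlib import Path

def read_text_file(file_path, encoding="utf-8"):
    """Return the file's contents, or an empty string if the file is missing."""
    try:
        with open(file_path, "r", encoding=encoding) as file:
            return file.read()
    except FileNotFoundError:
        print(f"File not found: {file_path}")
        return ""

# Demonstrate with a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_urls.txt"
    demo_path.write_text("Standard web URL:\nhttps://www.example.com\n", encoding="utf-8")

    text = read_text_file(demo_path)
    missing = read_text_file(Path(tmp_dir) / "does_not_exist.txt")

print(repr(text))
print(repr(missing))
```

Setting the encoding avoids depending on the platform default, which can differ between systems.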

Extracting URLs and printing the results

# Extract URLs from the text
urls = extract_urls(text)

# Print the URLs
for idx, url in enumerate(urls, start=1):
    print(f"URL {idx}: {url}")

# Count the URLs
print(f"\nTotal URLs found: {len(urls)}")

urls = extract_urls(text)

The “extract_urls” function performs the regular expression matching and returns the list of extracted URLs, which is then stored in the “urls” variable.

for idx, url in enumerate(urls, start=1):

This line starts a for loop that iterates over the items in “urls”.
“enumerate” is a built-in function that provides the index ‘idx’ and the value ‘url’ in each iteration.
‘start=1’ is an optional argument that makes the index start from 1 instead of the default 0.
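The effect of ‘start=1’ is easy to check on a short list. The two-item list below is sample data made up for this sketch.

```python
urls = ["https://www.example.com", "ftp://ftp.example.com"]

# Default numbering starts at 0.
default_pairs = list(enumerate(urls))
print(default_pairs)
# [(0, 'https://www.example.com'), (1, 'ftp://ftp.example.com')]

# start=1 shifts only the counter; the items themselves are unchanged.
one_based_pairs = list(enumerate(urls, start=1))
print(one_based_pairs)
# [(1, 'https://www.example.com'), (2, 'ftp://ftp.example.com')]
```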

print(f"URL {idx}: {url}")

This is an f-string, which allows embedding variables within a string.
“idx” and “url” insert the current index and the current URL into the string.

print(f"\nTotal URLs found: {len(urls)}")

After the loop finishes iterating over all URLs, this line prints another formatted string.
“len(urls)” returns the number of elements in the list, which is the number of URLs found.
Once this is done, the complete output has been printed.

Here are the complete code, its output, and the text file used:

Code

import re

def extract_urls(text):
    # Regular expression pattern to match various types of URLs
    url_pattern = re.compile(
        r'\b(?:'
        r'https?|ftp|sftp|ssh|rsync|git|svn|file|smb|nfs|afp|mailto|data|irc|gopher|news|nntp|telnet|wais|ldap|bluetooth|dns-sd|feed|rss|atom|ipfs|magnet|webcal|bitcoin|ethereum|storj|tor|dat|gemini|lbry|zeronet|ygg|amazon|wss|matrix|xmpp|dict|elasticsearch|dynamodb|aerospike|cassandra|mongodb|kafka|redis|couchbase|postgres|mysql|sqlite|oracle|db2|mssql|sybase|informix|couchdb|mariadb|voltdb|cratedb|clickhouse|greenplum|firebird|snowflake|teradata|presto|bigquery|redshift|hive|impala|sparksql|drill|phoenix|calcite|flink|pulsar|kinesis|dremio'
        r'):'
        r'(?://[^\s]*|[^\s]*)'
    )
    return re.findall(url_pattern, text)

# Read the text file
file_path = "ulstemp.txt"
with open(file_path, 'r') as file:
    text = file.read()

# Extract URLs from the text
urls = extract_urls(text)

# Print the URLs
for idx, url in enumerate(urls, start=1):
    print(f"URL {idx}: {url}")

# Count the URLs
print(f"\nTotal URLs found: {len(urls)}")

Output:

URL 1: https://www.example.com
URL 2: ftp://ftp.example.com
URL 3: sftp://sftp.example.com
URL 4: file:///path/to/file.txt
URL 5: mailto:someone@example.com
URL 6: data:text/plain;base64,SGVsbG8sIFdvcmxkIQ%3D%3D
URL 7: irc://irc.example.com/channel
URL 8: gopher://gopher.example.com/0/test
URL 9: news:comp.lang.python
URL 10: nntp://news.example.com/public/comp.lang.python
URL 11: telnet://telnet.example.com
URL 12: wais://wais.example.com/database
URL 13: ldap://ldap.example.com/o=Example%20Company,c=US

Total URLs found: 13

Text file:

Sample text file

Standard web URL:
https://www.example.com

FTP URL:
ftp://ftp.example.com

SFTP URL:
sftp://sftp.example.com

File URL:
file:///path/to/file.txt

Email URL:
mailto:someone@example.com

Data URL:
data:text/plain;base64,SGVsbG8sIFdvcmxkIQ%3D%3D

IRC URL:
irc://irc.example.com/channel

Gopher URL:
gopher://gopher.example.com/0/test

News URL:
news:comp.lang.python

NNTP URL:
nntp://news.example.com/public/comp.lang.python

Telnet URL:
telnet://telnet.example.com

Wais URL:
wais://wais.example.com/database

LDAP URL:
ldap://ldap.example.com/o=Example%20Company,c=US