How to get URL from HTML using lxml in Python
In this tutorial, we will see two simple methods to get a URL from HTML in Python using the lxml library.
Method 1
First, we need to know about the lxml library.
lxml:
lxml is a third-party Python library for handling XML and HTML files. It makes parsing HTML an easy task.
Installation:
pip install lxml
Let’s look at the program.
from lxml import html

def LinkExtract(str_document):
    # iterlinks() yields one (element, attribute, link, pos) tuple per link
    links = list(str_document.iterlinks())
    # Unpack the first link found in the document
    (element, attr, link, position) = links[0]
    print("attribute:", attr)
    print("link:", link)
    print("Position:", position)
    print("Length of the link:", len(link))

# Parse the HTML string into an element
str_document = html.fromstring('Welcome <a href="codespeedy.com">CodeSpeedy</a>')
LinkExtract(str_document)
Functions used:
1. fromstring():
- It parses an HTML string and returns a single element or document.
- Syntax: html.fromstring(html_string)
2. iterlinks():
- The iterlinks() method yields one tuple of four values for every link in the document.
- element: the parsed node that contains the link, here the anchor tag.
- attr: the name of the attribute that holds the link, such as href.
- link: the actual URL that is extracted from the anchor tag.
- position: the offset at which the link starts inside the attribute value. It is 0 when the whole attribute is the link, as in an href.
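The program above reads only the first tuple, so it fails with an IndexError when the document has no links. As a minimal sketch of how every link could be collected instead, the snippet below loops over iterlinks(). The helper name extract_links and the sample markup are assumptions made for illustration; they are not part of the original program or of lxml.

```python
from lxml import html


def extract_links(html_string):
    """Return (tag, attribute, link, pos) for every link in html_string.

    extract_links is a hypothetical helper, not an lxml function.
    """
    document = html.fromstring(html_string)
    results = []
    # Each item yielded by iterlinks() is an (element, attribute, link, pos) tuple
    for element, attr, link, pos in document.iterlinks():
        results.append((element.tag, attr, link, pos))
    return results


# Sample markup invented for this sketch
sample = (
    '<div>'
    '<a href="codespeedy.com">CodeSpeedy</a>'
    '<img src="logo.png">'
    '</div>'
)

for tag, attr, link, pos in extract_links(sample):
    print(tag, attr, link, pos)
```

For this sample, the loop prints the href of the a tag followed by the src of the img tag, each with position 0. A document without links simply produces an empty list.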
Output:
attribute: href
link: codespeedy.com
Position: 0
Length of the link: 14
Method 2
In this method, we import the codecs module in addition to the lxml library so that the HTML can be read from a file.
codecs:
The codecs module is part of Python's standard library. It provides encoders and decoders along with stream and file interfaces for transcoding data. Here we use it to open the HTML file.
Let’s take a look at the program.
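from lxml import html
import codecs

def LinkExtract(str_document):
    # iterlinks() yields one (element, attribute, link, pos) tuple per link
    links = list(str_document.iterlinks())
    # Unpack the first link found in the document
    (element, attr, link, position) = links[0]
    print("attribute:", attr)
    print("link:", link)
    print("Length of the link:", len(link))
    print("Position:", position)

# Read the file as text (assuming it is UTF-8 encoded); the with block closes it
with codecs.open("link.html", "r", encoding="utf-8") as f:
    doc = f.read()
str_document = html.fromstring(doc)
LinkExtract(str_document)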
Methods used:
1. codecs.open():
- We can use codecs.open() to open an HTML file within Python.
- Syntax: codecs.open(filename, mode, encoding)
2. read():
- It reads the entire content of the file and returns it as a string.
- Syntax: file_object.read()
Our HTML file will look like below.

Output:
attribute: href
link: www.google.com
Length of the link: 14
Position: 0
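To try Method 2 without preparing a file by hand, a self-contained sketch along the following lines may help. The file contents here are hypothetical, chosen only to match the link in the output above, and the file is written to a temporary directory so that no existing link.html is overwritten. UTF-8 is assumed as the encoding.

```python
import codecs
import os
import tempfile

from lxml import html

# Hypothetical file contents; chosen only to match the link in the output above
sample_html = '<html><body><a href="www.google.com">Google</a></body></html>'

# Write the sample into a temporary directory
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "link.html")
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write(sample_html)

# Read the file back as text and parse it
with codecs.open(path, "r", encoding="utf-8") as f:
    doc = f.read()

document = html.fromstring(doc)

# Take the first (element, attribute, link, pos) tuple from the document
element, attr, link, pos = next(document.iterlinks())
print("attribute:", attr)
print("link:", link)
print("Length of the link:", len(link))
print("Position:", pos)
```

Python's built-in open(path, encoding="utf-8") could be used in place of codecs.open() here and would read the same text.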
I hope that this tutorial is useful for everyone.