How to get URL from HTML using lxml in Python
In this tutorial, we will look at two simple methods to get a URL from HTML in Python using the lxml library.
Method 1
First, we need to know about the lxml library.
lxml:
lxml is a Python library for handling XML and HTML files. With it, parsing HTML becomes an easy task.
Installation:
pip install lxml
Let’s look at the program.
from lxml import html

def LinkExtract(str_document):
    # iterlinks() yields one (element, attribute, link, position) tuple per link
    links = list(str_document.iterlinks())
    # Unpack the first link found in the document
    (element, attr, link, position) = links[0]
    print("attribute:", attr)
    print("link:", link)
    print("Position:", position)
    print("Length of the link:", len(link))

# Parse the HTML string into an element
str_document = html.fromstring('Welcome <a href="codespeedy.com">CodeSpeedy</a>')
LinkExtract(str_document)
Functions used:
1.fromstring():
- It is used to parse an HTML string. It returns a single element or document.
- Syntax: html.fromstring(html_string)
2.iterlinks():
- The iterlinks() method yields a tuple of four items for every link in the document.
- element– The parsed node that contains the link, here the anchor tag.
- attr– The name of the attribute that holds the link, such as href.
- link– The actual URL that is extracted from the anchor tag.
- position– The position at which the link occurs within the attribute value or text. It is usually 0.
Output:
attribute: href
link: codespeedy.com
Position: 0
Length of the link: 14
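The program above prints only the first link. As a minimal sketch of how fromstring() and iterlinks() could be combined to list every link in a document, consider the following. The helper name extract_links and the sample HTML string are assumptions made for illustration and are not part of the original program.

```python
from lxml import html


def extract_links(html_string):
    """Return (tag, attribute, link, position) for every link found."""
    document = html.fromstring(html_string)
    results = []
    # iterlinks() yields one (element, attribute, link, pos) tuple per link.
    for element, attr, link, pos in document.iterlinks():
        results.append((element.tag, attr, link, pos))
    return results


# Hypothetical sample markup with two anchors and one image.
sample = (
    '<div>'
    '<a href="codespeedy.com">CodeSpeedy</a>'
    '<img src="logo.png">'
    '<a href="www.google.com">Google</a>'
    '</div>'
)

for tag, attr, link, pos in extract_links(sample):
    print(tag, attr, link, pos)
```

Because iterlinks() also reports attributes such as src, the image appears in the result alongside the anchors. Filtering on the tag or attribute name would keep only the href links.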
Method 2
In this method, we import the codecs module in addition to the lxml library.
codecs:
The codecs module provides stream and file interfaces that we can use to transcode the data in our program.
Let’s take a look at the program.
from lxml import html
import codecs

def LinkExtract(str_document):
    links = list(str_document.iterlinks())
    (element, attr, link, position) = links[0]
    print("attribute:", attr)
    print("link:", link)
    print("Length of the link:", len(link))
    print("Position:", position)

# Open the HTML file and read its content; the with block closes the file
with codecs.open("link.html", 'r') as f:
    doc = f.read()
str_document = html.fromstring(doc)
LinkExtract(str_document)
Methods used:
1.codecs.open():
- We can use codecs.open() to open an HTML file in Python.
- Syntax: codecs.open(filename, mode, encoding)
2.read():
- It reads the content of the file.
- Syntax: file_object.read()
Our HTML file, link.html, contains a link whose href attribute is www.google.com.
Output:
attribute: href
link: www.google.com
Length of the link: 14
Position: 0
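To try Method 2 without preparing a file by hand, the sketch below writes a small stand-in for link.html to a temporary directory and then reads it back with codecs.open(). The file content, the helper name first_link_from_file and the choice of UTF-8 are assumptions for illustration; the original link.html is not shown in this tutorial.

```python
import codecs
import os
import tempfile

from lxml import html


def first_link_from_file(path, encoding="utf-8"):
    """Return the first link found in an HTML file, or None if there is none."""
    # The with statement closes the file even if reading fails.
    with codecs.open(path, "r", encoding) as f:
        document = html.fromstring(f.read())
    for element, attr, link, pos in document.iterlinks():
        return link
    return None


# Hypothetical stand-in for link.html, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "link.html")
    with codecs.open(path, "w", "utf-8") as f:
        f.write('<html><body><a href="www.google.com">Google</a></body></html>')
    url = first_link_from_file(path)
    print("link:", url)
```

Passing the encoding explicitly means the file is decoded the same way on every platform, and returning None avoids an IndexError when the file contains no links.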
I hope that this tutorial is useful for everyone.