How to get URL from HTML using lxml in Python


In this tutorial, we will look at two simple ways to get a URL from HTML in Python using the lxml library. The first parses an HTML string and the second reads the HTML from a file.

Method 1

First, we need to know about the lxml library.

lxml:

lxml is a Python library for processing XML and HTML. With it, parsing HTML becomes an easy task.

Installation:

pip install lxml

Let’s look at the program.

from lxml import html

def LinkExtract(str_document):
    # iterlinks() yields one (element, attribute, link, pos) tuple per link
    links = list(str_document.iterlinks())
    # Unpack the first link found in the document
    (element, attr, link, position) = links[0]
    print("attribute:", attr)
    print("link:", link)
    print("Position:", position)
    print("Length of the link:", len(link))

# Parse the HTML string into an element
str_document = html.fromstring('Welcome <a href="codespeedy.com">CodeSpeedy</a>')
LinkExtract(str_document)

Functions used:

1.fromstring():

It parses an HTML string and returns a single element or a whole document, depending on the input.
Syntax: html.fromstring(html_string)
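
To see what fromstring() gives back, here is a minimal sketch that parses a single anchor tag and inspects the result. The URL and the variable name are made up for illustration.

```python
from lxml import html

# Parse a fragment that contains exactly one element (hypothetical URL).
element = html.fromstring('<a href="https://example.com/page">Example</a>')

# For a single-element fragment, fromstring() returns that element itself.
print(element.tag)             # a
print(element.get("href"))     # https://example.com/page
print(element.text_content())  # Example
```

Because the result is an ordinary element, methods such as get() and iterlinks() can be called on it directly.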

2.iterlinks():

iterlinks() takes no arguments. It yields one tuple of four items for every link in the document.
element - The parsed node that contains the link (here, the anchor tag).
attr - The name of the attribute that holds the link (here, href).
link - The actual URL that is extracted from the anchor tag.
position - The character offset at which the link starts inside the attribute value or text. It is 0 when the link is the whole attribute value.
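
iterlinks() reports every link in a document, not only anchor tags. The sketch below collects the four values for each link it finds. The markup and URLs are made up for illustration.

```python
from lxml import html

# Hypothetical markup with two different kinds of links.
page = html.fromstring(
    '<div>'
    '<a href="https://example.com/a">First</a>'
    '<img src="https://example.com/logo.png">'
    '</div>'
)

# Each item yielded by iterlinks() is an (element, attribute, link, pos) tuple.
# The element's tag name is stored so that the result is easy to print.
links = [(el.tag, attr, url, pos) for el, attr, url, pos in page.iterlinks()]
for item in links:
    print(item)
```

Here the attribute name tells the two links apart: href for the anchor tag and src for the image.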

Output:

attribute: href
link: codespeedy.com
Position: 0
Length of the link: 14
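
The program above takes the first item of the list, which raises an IndexError if the document contains no links at all. One possible way to guard against that is sketched below. The helper name first_link is hypothetical and not part of lxml.

```python
from lxml import html

def first_link(document):
    """Return the first URL found by iterlinks(), or None if there is none."""
    for _element, _attr, url, _pos in document.iterlinks():
        return url
    return None

with_link = html.fromstring('Welcome <a href="codespeedy.com">CodeSpeedy</a>')
without_link = html.fromstring('<p>No links here</p>')

print(first_link(with_link))     # codespeedy.com
print(first_link(without_link))  # None
```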

Method 2

In this method, we import the codecs module in addition to the lxml library and read the HTML from a file.

codecs:

The codecs module provides encoders and decoders together with stream and file interfaces, so we can use it to read a file and convert its bytes to text.

Let’s take a look at the program.

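from lxml import html
import codecs

def LinkExtract(str_document):
    # iterlinks() yields one (element, attribute, link, pos) tuple per link
    links = list(str_document.iterlinks())
    # Unpack the first link found in the document
    (element, attr, link, position) = links[0]
    print("attribute:", attr)
    print("link:", link)
    print("Length of the link:", len(link))
    print("Position:", position)

# Open the file and read its content; the with block closes the file afterwards.
# The encoding is an assumption: it should match how link.html was saved.
with codecs.open("link.html", 'r', encoding='utf-8') as f:
    doc = f.read()
str_document = html.fromstring(doc)
LinkExtract(str_document)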

Methods used:

1.codecs.open():

We can use codecs.open() to open the HTML file in Python with a given encoding.
Syntax: codecs.open(filename, mode, encoding)

2.read():

It reads the whole content of the file and returns it as a string.
Syntax: file_object.read()
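
The two calls can be tried out without preparing a file by hand. The sketch below first writes a small HTML file to a temporary folder and then reads it back with an explicit encoding. The file name, content, and URL are made up for illustration, and UTF-8 is an assumption.

```python
import codecs
import os
import tempfile

from lxml import html

# Create a small HTML file to read back (hypothetical name and content).
folder = tempfile.mkdtemp()
path = os.path.join(folder, "sample.html")
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write('<a href="https://example.com/menu">Caf\u00e9 menu</a>')

# Read the file as text; the with block closes it even if reading fails.
with codecs.open(path, "r", encoding="utf-8") as f:
    doc = f.read()

# Take the first (element, attribute, link, pos) tuple from the parsed HTML.
element, attr, url, pos = next(html.fromstring(doc).iterlinks())
print(attr, url)
```

The built-in open() accepts the same encoding argument and is the more common choice in Python 3.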

Our HTML file, link.html, looks like this:

<a href="www.google.com">anchor link</a>

Output:

attribute: href
link: www.google.com
Length of the link: 14
Position: 0

I hope that this tutorial is useful for everyone.
