How to get URL from HTML using lxml in Python

In this tutorial, we will look at two simple ways to get a URL from HTML in Python using the lxml library.

Method 1

First, we need to know about the lxml library.

lxml:

lxml is a Python library for processing XML and HTML files. With it, parsing HTML becomes an easy task.

Installation:

pip install lxml

Let’s look at the program.

from lxml import html

def LinkExtract(str_document):
    # iterlinks() yields one (element, attribute, link, pos) tuple per link
    links = list(str_document.iterlinks())
    # Unpack the first link found in the document
    (element, attr, link, position) = links[0]
    print("attribute:", attr)
    print("link:", link)
    print("Position:", position)
    print("Length of the link:", len(link))

# Parse the HTML string into an element tree
str_document = html.fromstring('Welcome <a href="codespeedy.com">CodeSpeedy</a>')
LinkExtract(str_document)

Functions used:

1.fromstring():

  • It is used to parse an HTML string. It parses the HTML and returns a single element or document.
  • Syntax: html.fromstring(html_string)
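To see what fromstring() hands back, here is a minimal sketch that parses a fragment holding only one anchor tag and inspects the returned element. The variable name element is just an illustration and is not part of the program above.

```python
from lxml import html

# Parse a fragment that contains a single anchor tag
element = html.fromstring('<a href="codespeedy.com">CodeSpeedy</a>')

print(element.tag)             # name of the returned element
print(element.get("href"))     # value of the href attribute
print(element.text_content())  # visible text inside the element
```

Because the fragment holds exactly one tag, the returned element is that tag itself, so its attributes can be read directly with get().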

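2.iterlinks():

  • The iterlinks() method yields one tuple of four values, (element, attr, link, position), for each link in the document.
  • element: The parsed node that contains the link, here the anchor tag.
  • attr: The name of the attribute that holds the link, such as href. It is None when the link sits in the element's text.
  • link: The actual URL that is extracted from the anchor tag.
  • position: The numeric position of the link within the attribute value or text. It is usually 0, but it can differ for links inside stylesheets or style tags.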
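The program above only looks at the first link. As a rough sketch, the same iterlinks() call can be used to walk over every link in a document. The SAMPLE markup and the collect_links() helper below are made up for illustration.

```python
from lxml import html

# Hypothetical sample markup with two anchors and one image
SAMPLE = (
    '<div>'
    '<a href="codespeedy.com">CodeSpeedy</a>'
    '<a href="www.google.com">Google</a>'
    '<img src="logo.png">'
    '</div>'
)

def collect_links(markup):
    """Return (tag, attribute, link, pos) for every link in the markup."""
    document = html.fromstring(markup)
    return [
        (element.tag, attr, link, pos)
        for element, attr, link, pos in document.iterlinks()
    ]

links = collect_links(SAMPLE)
for row in links:
    print(row)
```

Note that iterlinks() is not limited to anchor tags: the src attribute of the img tag is reported as a link too, so filter on the tag or attribute name if only href values are wanted.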

Output:

attribute: href
link: codespeedy.com
Position: 0
Length of the link: 14
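The program takes the first item from the list of links, which raises an IndexError when the HTML contains no link at all. One possible way around this, shown here only as a sketch, is to ask the iterator for its first item with a default value. The first_link() helper is a hypothetical name.

```python
from lxml import html

def first_link(markup):
    """Return the first link in the markup, or None if there is none."""
    document = html.fromstring(markup)
    # next() with a default avoids the IndexError that indexing [0] would raise
    first = next(document.iterlinks(), None)
    if first is None:
        return None
    element, attr, link, position = first
    return link

print(first_link('Welcome <a href="codespeedy.com">CodeSpeedy</a>'))
print(first_link('<p>No links here</p>'))
```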

Method 2

In this method, we import the codecs module in addition to the lxml library and read the HTML from a file instead of a string.

codecs:

The codecs module provides stream and file interfaces for transcoding data, so we can use it to read a file in a given encoding.

Let’s take a look at the program.

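from lxml import html
import codecs

def LinkExtract(str_document):
    # iterlinks() yields one (element, attribute, link, pos) tuple per link
    links = list(str_document.iterlinks())
    (element, attr, link, position) = links[0]
    print("attribute:", attr)
    print("link:", link)
    print("Length of the link:", len(link))
    print("Position:", position)

# Open the file for reading (assuming it is UTF-8 encoded);
# the with block closes the file once we are done
with codecs.open("link.html", 'r', encoding="utf-8") as f:
    doc = f.read()
str_document = html.fromstring(doc)
LinkExtract(str_document)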

Methods used:

1.codecs.open():

  • We can use codecs.open() to open an HTML file in Python.
  • Syntax: codecs.open(filename, mode, encoding)

2.read():

  • It reads the content of the file.
  • Syntax: file_object.read(), where file_object is the object returned by codecs.open()
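The two calls can be put together in a small helper. The sketch below is only an illustration: links_from_file() and the temporary sample.html file are hypothetical, and UTF-8 is an assumed encoding. It writes a throwaway HTML file first so that it runs on its own.

```python
import codecs
import os
import tempfile

from lxml import html

def links_from_file(path, encoding="utf-8"):
    """Read an HTML file and return every link found in it."""
    # The with block closes the file even if reading fails
    with codecs.open(path, "r", encoding=encoding) as f:
        document = html.fromstring(f.read())
    return [link for _element, _attr, link, _pos in document.iterlinks()]

# Build a throwaway file so the sketch does not depend on link.html
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample.html")
    with codecs.open(sample_path, "w", encoding="utf-8") as out:
        out.write('<a href="www.google.com">anchor link</a>')
    found = links_from_file(sample_path)

print(found)
```

Passing the encoding explicitly means the file is decoded the same way on every machine, instead of depending on the system default.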

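Our HTML file, link.html, contains a single anchor tag along these lines (reconstructed from the output below):

<a href="www.google.com">anchor link</a>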

Output:

attribute: href
link: www.google.com
Length of the link: 14
Position: 0

I hope that this tutorial is useful for everyone.
