How to remove HTML tags from a string in Python

When text is saved to a database, it is sometimes stored together with its HTML markup. Some websites, however, need to render that text in plain form, without any HTML tags. In this tutorial, we will look at different ways to remove HTML tags from a string in Python.

Remove HTML tags from a string using regex in Python

A regular expression is a sequence of characters that defines a search pattern. Python's re module provides the sub() function, which replaces every part of a string that matches a given pattern with another string. The code for removing HTML tags from a string using regex is shown below.

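import re

# Matches '<', then one or more characters that are not '>', then '>'.
regex = re.compile(r'<[^>]+>')

def remove_html(string):
    # Replace every tag-like match with an empty string.
    return regex.sub('', string)

text = input("Enter String:")
new_text = remove_html(text)
print(f"Text without html tags: {new_text}")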

Output 1:

Enter String:<div class="header"> Welcome to my website </div>
Text without html tags:  Welcome to my website

Output 2:

Enter String:<h1> Hello </h1>
Text without html tags:  Hello

How does the above code work?

  1. First, we import Python's regular expression module, ‘re’.
  2. Then we call re.compile(), which builds a reusable pattern object from the pattern string passed to it. The pattern object provides methods such as sub() for finding and replacing matches in any target string. Here the pattern is ‘<[^>]+>’: a ‘<’, followed by one or more characters that are not ‘>’, followed by a ‘>’. This matches both opening and closing tags.
  3. A common alternative is ‘<.*>’, where ‘.*’ means zero or more of any character. Regex quantifiers are greedy by default: they match as many characters as possible and backtrack only if the rest of the pattern fails, so ‘<.*>’ would match from the first ‘<’ to the last ‘>’ in the string. Adding ‘?’ after the quantifier (‘<.*?>’) makes it non-greedy, so it matches as few characters as possible. The character class ‘[^>]’ used here has a similar effect without relying on ‘?’, because it can never run past a ‘>’.
  4. Inside remove_html, the pattern object's sub() method replaces every match with an empty string.
  5. Finally, we call remove_html, which returns the input string with the HTML tags removed.
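
The difference between greedy, non-greedy and negated-class matching is easier to see side by side. The following is a minimal sketch, not part of the tutorial's code: the variable names greedy, lazy and negated and the sample string are made up for illustration.

```python
import re

sample = '<h1> Hello </h1> and <b>bye</b>'

# Greedy: '.*' runs to the LAST '>' in the string, so a single match
# swallows the tags and all of the text between them.
greedy = re.compile(r'<.*>')
# Non-greedy: '.*?' stops at the first '>' after each '<'.
lazy = re.compile(r'<.*?>')
# Negated class (the pattern used in the code above): one or more
# characters that are not '>', so a match cannot run past a '>'.
negated = re.compile(r'<[^>]+>')

greedy_result = greedy.sub('', sample)
lazy_result = lazy.sub('', sample)
negated_result = negated.sub('', sample)

print(repr(greedy_result))   # ''
print(repr(lazy_result))     # ' Hello  and bye'
print(repr(negated_result))  # ' Hello  and bye'
```

On this input the greedy pattern removes the entire string, while the other two remove only the tags.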

 

Remove HTML tags from a string without using a built-in function

The code for removing HTML tags from a string without using any built-in function is shown below.

def remove_html(string):
    tag = False    # True while inside a tag, between '<' and '>'
    quote = False  # True while inside a quoted attribute value
    output = ""

    for ch in string:
        if ch == '<' and not quote:
            tag = True
        elif ch == '>' and not quote:
            tag = False
        elif (ch == '"' or ch == "'") and tag:
            quote = not quote
        elif not tag:
            # Keep only characters that are outside of any tag.
            output = output + ch

    return output

text = input("Enter String:")
new_text = remove_html(text)
print(f"Text without html tags: {new_text}")

Output:

Enter String:<div class="header"> Welcome to my website </div>
Text without html tags:  Welcome to my website

 

How does the above code work?

In the above code, we keep two Boolean flags called tag and quote. The tag flag records whether we are currently inside a tag, and the quote flag records whether we are inside a quoted attribute value. We use a for loop to iterate over every character of the string. If the character is '<' and we are not inside quotes, tag is set to True; if it is '>' and we are not inside quotes, tag is set to False. If the character is a single or double quote inside a tag, the quote flag is toggled, so a '>' that appears within an attribute value does not end the tag early. Otherwise, if we are not inside a tag, the character is appended to the output string. Thus, in the output of the above code, the div tags are removed, leaving only the plain text.
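
The quote flag matters when an attribute value itself contains a '>'. The sketch below compares the character loop with the regex pattern from the first section. It assumes the loop function with tag initialised to False, and the sample string is made up for illustration.

```python
import re

def remove_html(string):
    tag = False    # True while between '<' and '>'
    quote = False  # True while inside a quoted attribute value
    output = ""
    for ch in string:
        if ch == '<' and not quote:
            tag = True
        elif ch == '>' and not quote:
            tag = False
        elif (ch == '"' or ch == "'") and tag:
            quote = not quote
        elif not tag:
            output = output + ch
    return output

# The title attribute contains a '>' inside its quotes.
sample = '<a title="a > b">link</a>'

regex_result = re.sub(r'<[^>]+>', '', sample)
loop_result = remove_html(sample)

print(repr(regex_result))  # ' b">link'
print(repr(loop_result))   # 'link'
```

The regex stops at the first '>' and leaves part of the attribute behind, whereas the loop ignores the '>' because it is inside quotes. Since quote is a simple toggle, an attribute value that mixes quote types, such as title="it's", can still confuse the loop.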

 

Remove HTML tags from a string using the XML module in Python

XML is a markup language used to store and transport data. Python has built-in modules that help us parse XML documents. XML documents are made up of individual units called elements, each delimited by an opening and a closing tag. Whatever lies between the opening and closing tags is the element’s content, and an element can contain sub-elements called child elements. Using the ElementTree module in Python, we can easily manipulate these documents. Note that ElementTree expects well-formed XML, so this method works only when the input has a single root element and every tag is properly closed. The code for removing HTML tags from a string using the XML module is shown below.

import xml.etree.ElementTree

def remove_html(string):
    # Parse the string as XML, then join the text of the root element and all its descendants.
    return ''.join(xml.etree.ElementTree.fromstring(string).itertext())

text = input("Enter String:")
new_text = remove_html(text)
print(f"Text without html tags: {new_text}")

Output:

Enter String:<p class="intro"> I love Coding </p>
Text without html tags:  I love Coding

 

How does the above code work?

  1. First, we import the xml.etree.ElementTree module.
  2. We use the fromstring() function to parse the string as XML; it returns the root element. To collect the text, we call itertext() on that element. It walks the root element and all of its sub-elements in document order and yields the inner text of each.
  3. We concatenate these text fragments by calling join() on an empty string and return the final output string.
  4. Finally, we call the remove_html function, which returns the input string with the HTML tags removed.
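
One caveat is worth illustrating: ElementTree is an XML parser, so fromstring() raises ParseError on input that is not well-formed XML, such as HTML with an unclosed <br>. The sketch below is an illustration under assumptions, not part of the original tutorial. The names remove_html_safe and TextCollector are hypothetical, and the fallback uses the standard library's html.parser.HTMLParser, which is a different technique from the ElementTree approach above.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper that collects the text found between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # HTMLParser calls this for each run of text outside of tags.
        self.parts.append(data)


def remove_html_safe(string):
    """Try the ElementTree approach first; fall back to HTMLParser."""
    try:
        return ''.join(ET.fromstring(string).itertext())
    except ET.ParseError:
        # Not well-formed XML, so use the more tolerant HTML parser.
        collector = TextCollector()
        collector.feed(string)
        collector.close()
        return ''.join(collector.parts)


well_formed = remove_html_safe('<p class="intro"> I love <b>Coding</b> </p>')
malformed = remove_html_safe('<p>Hello<br>world</p>')

print(repr(well_formed))  # ' I love Coding '
print(repr(malformed))    # 'Helloworld'
```

The first call succeeds through ElementTree and also shows itertext() collecting text from a child element. The second call contains an unclosed <br>, so it triggers the fallback.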

Thus, we have reached the end of the tutorial on how to remove HTML tags from a string in Python. You can use the following link to learn more about regex in Python.
Regex In Python: Regular Expression in Python
