Python HTML Module Tutorial


A Python module is a Python file that contains functions and variables you can use in your program. A module can be of two types: built-in or user-defined. The HTML module in Python is a built-in module that ships with the standard library.

In this article, let’s look at what the HTML module is and the different methods it has to offer, with suitable example code for better understanding and clarity.

Let’s dive right in.

What Is the Python HTML Module?

The html module in Python is built for programmers who need to work with HTML. It defines utilities for manipulating HTML text.

You do not need to install the html module because it is part of the standard library. To use it, import it with the import keyword (import html).

The module's top-level functions handle the encoding (escaping) and decoding (unescaping) of HTML text. [Reference]

The html module in Python contains two functions: escape() and unescape().
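As a quick preview of how the two functions relate, here is a minimal round-trip sketch. The sample string and variable names are made up for illustration.

```python
import html

# Illustrative sample text; any string containing &, <, or > would do.
original = '<b>Fish & Chips</b>'

escaped = html.escape(original)    # special characters -> HTML-safe sequences
restored = html.unescape(escaped)  # HTML-safe sequences -> original characters

print(escaped)               # &lt;b&gt;Fish &amp; Chips&lt;/b&gt;
print(restored == original)  # True
```

Escaping and then unescaping returns the original text, which is why the two functions are usually described as encoding and decoding.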

html.escape()

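The html.escape() function converts the characters that have a special meaning in HTML (&, <, and >) into HTML-safe sequences, so the text can be displayed on a web page literally instead of being interpreted as markup.

  • Syntax: html.escape(s, quote=True)
  • It returns a new string in which the special characters are replaced with HTML-safe sequences.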

For example, consider the following HTML code.

<!DOCTYPE html>
<html>
    <head>
        <title>Python HTML Module</title>
    </head>
    <body>
        <h2 style="text-align:center;">Hi , I am the HTML code</h2>
    </body>
</html>

Let’s look at how to use the html.escape() function to encode the HTML code.

import html

html_code = '<!DOCTYPE html> <html> <head> <title>Python HTML Module</title></head><body><h2 style="text-align:center;">Hi , I am the HTML code</h2></body></html>'

encoded_code = html.escape(html_code)
print(encoded_code)

Output:

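&lt;!DOCTYPE html&gt; &lt;html&gt; &lt;head&gt; &lt;title&gt;Python HTML Module&lt;/title&gt;&lt;/head&gt;&lt;body&gt;&lt;h2 style=&quot;text-align:center;&quot;&gt;Hi , I am the HTML code&lt;/h2&gt;&lt;/body&gt;&lt;/html&gt;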

The escape() function in the Python html module is used to encode HTML code. It changes the special characters &, <, and > to the HTML-safe sequences &amp;, &lt;, and &gt;.

If the optional quote flag is True (the default), the double quote (") and single quote (') characters are also translated, to &quot; and &#x27; respectively.
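Translating quotes matters mostly when the text ends up inside an HTML attribute value. The sketch below shows the default behavior; the sample value and the input tag are hypothetical.

```python
import html

# Hypothetical untrusted text containing both kinds of quotes and an ampersand.
user_value = 'Tom "T-Bone" O\'Neil & Sons'

# quote=True is the default, so " and ' are translated along with &.
safe_value = html.escape(user_value)
print(safe_value)  # Tom &quot;T-Bone&quot; O&#x27;Neil &amp; Sons

# The escaped text can now sit inside a double-quoted attribute
# without ending the attribute early.
tag = '<input value="' + safe_value + '">'
print(tag)
```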

You can also manually change the quote flag to False.

import html

html_code = '<&!DOCTYPE html> <html> <head> <title>Python HTML Module</title></head><body><h2 style="text-align:center;">Hi , I am the HTML code</h2></body></html>'

encoded_code = html.escape(html_code,quote=False)
print(encoded_code)

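Output:

&lt;&amp;!DOCTYPE html&gt; &lt;html&gt; &lt;head&gt; &lt;title&gt;Python HTML Module&lt;/title&gt;&lt;/head&gt;&lt;body&gt;&lt;h2 style="text-align:center;"&gt;Hi , I am the HTML code&lt;/h2&gt;&lt;/body&gt;&lt;/html&gt;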

html.unescape()

In simple terms, the html.unescape() function does the reverse of html.escape(): it converts the character references in a string (such as &lt;, &#62;, or &#x3e;) back into the characters they represent.

  • Syntax: html.unescape(s)
  • It returns a string in which the character references are replaced with the corresponding characters.

The html.unescape() function accepts only one parameter, an encoded string. It converts all named and numeric character references in strings to the corresponding Unicode characters.

For example, let’s consider the following encoded HTML code.

&lt;&amp;!DOCTYPE html&gt; &lt;html&gt; &lt;head&gt; &lt;title&gt;Python HTML Module&lt;/title&gt;&lt;/head&gt;&lt;body&gt;&lt;h2 style=&quot;text-align:center;&quot;&gt;Hi , I am the HTML code&lt;/h2&gt;&lt;/body&gt;&lt;/html&gt;

Let’s convert this into HTML using the unescape() function.

import html

encoded_code = '&lt;&amp;!DOCTYPE html&gt; &lt;html&gt; &lt;head&gt; &lt;title&gt;Python HTML Module&lt;/title&gt;&lt;/head&gt;&lt;body&gt;&lt;h2 style=&quot;text-align:center;&quot;&gt;Hi , I am the HTML code&lt;/h2&gt;&lt;/body&gt;&lt;/html&gt;'

html_code = html.unescape(encoded_code)
print(html_code)

Output:

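<&!DOCTYPE html> <html> <head> <title>Python HTML Module</title></head><body><h2 style="text-align:center;">Hi , I am the HTML code</h2></body></html>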

The unescape() function follows the rules defined by the HTML5 standard for both valid and invalid character references, and it uses the list of HTML5 named character references defined in html.entities.html5.
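To make the different kinds of character references concrete, here is a short sketch with made-up sample strings. It covers named, decimal, and hexadecimal references, plus a named reference with its trailing semicolon missing, which the HTML5 rules still resolve.

```python
import html

# Named, decimal, and hexadecimal references for the same character.
same_char = html.unescape('&gt; &#62; &#x3e;')
print(same_char)  # > > >

# Named references mixed with plain text.
mixed = html.unescape('&copy; Pythonista &amp; friends')
print(mixed)  # © Pythonista & friends

# "&lt" without the trailing semicolon is still recognized.
lenient = html.unescape('5 &lt 10')
print(lenient)  # 5 < 10
```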

HTML Module in Older Versions of Python

If you are using an older version of Python (Python 2), you may encounter this error when trying to import the html module.

ImportError: No module named html

The html module described above is part of the standard library only in Python 3, so the usual fix is to upgrade Python. Alternatively, you can install a third-party package named html from PyPI by running the following command in your terminal or command prompt.

pip install html

Note that this package is a separate library for generating HTML. It is not a backport of the standard-library module, so the code below uses its own HTML class rather than escape() and unescape().

from html import HTML
obj = HTML()
obj.p('Hello, world!')
print(obj)

Those who have newer versions of Python can skip this.

Submodules in Python HTML Module

Submodules in the HTML package are:

  • parser
  • entities

html.parser

The HTML parser is a tool for parsing structured markup. It is used to parse HTML files.

These are some of the parser methods available in this submodule.

Method | Use case
HTMLParser.handle_data(data) | Handles the text contained between HTML tags.
HTMLParser.handle_comment(data) | Handles HTML comments.
HTMLParser.handle_starttag(tag, attrs) | Handles HTML start tags. The name of the opening tag is passed in the tag parameter, and the tag's attributes are passed in attrs as a list of (name, value) pairs.
HTMLParser.handle_endtag(tag) | Handles HTML end tags. The name of the closing tag is passed in the tag parameter. End tags carry no attributes, so there is no attrs parameter.
HTMLParser.feed(data) | Supplies data to the HTML parser.

Here is an example of an HTML parser application:

from html.parser import HTMLParser

class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Found a start tag:", tag)

    def handle_endtag(self, tag):
        print("Found an end tag :", tag)

    def handle_data(self, data):
        print("Found some data  :", data)

parserObject = MyParser()
parserObject.feed('<!DOCTYPE html><html><head><title>Python HTML Module</title></head><body><h2 style="text-align:center;">Hi , I am the HTML code</h2></body></html>')

Output:

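Found a start tag: html
Found a start tag: head
Found a start tag: title
Found some data  : Python HTML Module
Found an end tag : title
Found an end tag : head
Found a start tag: body
Found a start tag: h2
Found some data  : Hi , I am the HTML code
Found an end tag : h2
Found an end tag : body
Found an end tag : html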
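The example above ignores the attrs parameter and HTML comments. As a further sketch, the parser below collects link targets and comments. The class name LinkCollector and the sample markup are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Hypothetical helper that records href values and comments."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

    def handle_comment(self, data):
        # data is the text between <!-- and -->.
        self.comments.append(data.strip())

collector = LinkCollector()
collector.feed('<!-- nav --><a href="https://example.com/a">A</a><a class="x" href="/b">B</a>')
collector.close()  # signal the end of input

print(collector.links)     # ['https://example.com/a', '/b']
print(collector.comments)  # ['nav']
```

Storing results on the parser object, instead of printing inside the handlers, makes the parsed data easy to reuse after feed() returns.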

html.entities

The html.entities submodule includes HTML general entity definitions. There are 4 dictionaries defined in this module: html5, name2codepoint, codepoint2name, and entitydefs.

Dictionary | Description
html.entities.html5 | A dictionary that maps HTML5 named character references to the corresponding Unicode character(s), for example, html5['lt;'] == '<'.
html.entities.entitydefs | A dictionary that maps XHTML 1.0 entity definitions to their replacement text in ISO Latin-1.
html.entities.name2codepoint | A dictionary that maps HTML entity names to Unicode code points.
html.entities.codepoint2name | A dictionary that maps Unicode code points to HTML entity names.
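A few lookups show how these dictionaries relate to each other. This is a minimal sketch, and the entities picked here (lt, copy, euro) are arbitrary examples.

```python
from html.entities import html5, entitydefs, name2codepoint, codepoint2name

# html5 keys are reference names, here with the trailing semicolon.
less_than = html5['lt;']
print(less_than)  # <

# name2codepoint and codepoint2name are inverse lookups.
copy_code = name2codepoint['copy']
copy_name = codepoint2name[copy_code]
print(copy_code, copy_name)  # 169 copy

# entitydefs gives the replacement text for an entity name.
copy_text = entitydefs['copy']
print(copy_text)  # ©

# A code point can be turned into its character with chr().
euro_sign = chr(name2codepoint['euro'])
print(euro_sign)  # €
```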

[Reference]

Final Thoughts

Now you know what the Python HTML module is and how to use its features, such as encoding and decoding HTML code. You can handle HTML with ease using Python.

I hope you find this tutorial useful. If you like this article, please leave a comment and share it with your friends who wish to learn Python topics.

Ashwin Joy

I'm the face behind Pythonista Planet. I learned my first programming language back in 2015. Ever since then, I've been learning programming and immersing myself in technology. On this site, I share everything that I've learned about computer programming.
