
Pygments Fails to Highlight UTF-8 Encoded Text

How to make Pygments highlight code containing non-ASCII UTF-8 characters.


Edited: 2019-12-15 08:15


Rather than using an existing wrapper, I wrote my own Pygments wrapper script to highlight code examples on Beamtic. However, I recently noticed that it failed to highlight text containing the Danish characters Æ, Ø, and Å, even though they are properly UTF-8 encoded in the database.

This was a problem because I also write tutorials in Danish occasionally.

I have taken great care to make sure everything is UTF-8, so I was fairly sure the problem was not with my CMS. Both my database and the CMS itself are set up to use Unicode.

After some Googling, I realized the problem was with Pygments. Luckily, the solution was simple: passing the encoding='utf-8' option to the HtmlFormatter class appears to solve the problem:

print(highlight(code, lexer, HtmlFormatter(encoding='utf-8')))

Doing this makes the formatter encode its HTML output as UTF-8, rather than leaving the encoding to whatever prints the result.
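
One caveat, assuming the script runs under Python 3: when encoding is set on the formatter, highlight() returns bytes rather than str, so print() would show a b'...' literal instead of plain HTML. A minimal sketch of writing those bytes straight to standard output is shown here; the helper name highlight_to_bytes and the sample string are illustrative and not part of the wrapper script.

```python
import sys

from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_by_name


def highlight_to_bytes(code, lang):
    """Return highlighted HTML as UTF-8 encoded bytes (hypothetical helper)."""
    lexer = get_lexer_by_name(lang)
    # With `encoding` set, the formatter encodes its own output,
    # so highlight() returns bytes instead of str on Python 3.
    return highlight(code, lexer, HtmlFormatter(encoding='utf-8'))


if __name__ == '__main__':
    sample = 'print("Æ Ø Å")'  # illustrative input, not from the original post
    html = highlight_to_bytes(sample, 'python')
    # Write the raw bytes; print() would show their b'...' representation.
    sys.stdout.buffer.write(html)
```

On Python 2, where print() writes byte strings unchanged, the original one-liner already behaves this way.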

The Pygments wrapper script

My Pygments wrapper script is included below:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Author JacobSeated

# To generate a stylesheet:
# pygmentize -S default -f html -a .highlight > default.css

# argparse parses the command-line arguments,
# so there is no need to read sys.argv directly
import argparse

from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import guess_lexer, get_lexer_by_name

# Parse CLI arguments
parser = argparse.ArgumentParser()
parser.add_argument(
    "--file", help="Path for file to highlight. The file should only contain code.", type=str, required=True)
parser.add_argument(
    "--lang", help="Language to highlight, e.g. php, html, css", type=str)
args = parser.parse_args()

# Try to open the file; print 0 and stop if it cannot be opened
try:
    f = open(args.file, 'r')
except Exception:
    print(0)
    exit()
else:
    with f:
        file_contents = f.read()


# Check if lang was provided
if args.lang:
    # print('Using provided language')
    lexer = get_lexer_by_name(args.lang)
else:
    # print('Trying to guess the language')
    lexer = guess_lexer(file_contents)

code = file_contents
print(highlight(code, lexer, HtmlFormatter(encoding='utf-8')))
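
The script above opens the file with the system's default encoding and lets get_lexer_by_name raise an error when the language name is unknown. A rough sketch of how those two steps could be made more defensive is shown here, assuming the files are always stored as UTF-8; the helper names read_utf8 and pick_lexer are hypothetical and not part of the script.

```python
import io

from pygments.lexers import get_lexer_by_name, guess_lexer
from pygments.util import ClassNotFound


def read_utf8(path):
    """Read a file as UTF-8 text, regardless of the system locale."""
    with io.open(path, 'r', encoding='utf-8') as f:
        return f.read()


def pick_lexer(code, lang=None):
    """Use the named lexer when it exists, otherwise guess from the code."""
    if lang:
        try:
            return get_lexer_by_name(lang)
        except ClassNotFound:
            pass  # unknown language name: fall through to guessing
    return guess_lexer(code)
```

With these in place, the tail of the script could become lexer = pick_lexer(read_utf8(args.file), args.lang), followed by the same highlight() call as before.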
