Pygments Fails to Highlight UTF-8 Encoded Text
How to make Pygments highlight code containing non-ASCII UTF-8 characters.
By Jacob
Edited: 2019-12-15 08:15
Rather than using an existing wrapper, I wrote my own Pygments wrapper script to highlight code examples on Beamtic. However, I recently noticed that it failed to highlight text containing the Danish characters Æ, Ø, and Å, even though they are properly UTF-8 encoded in the database.
This was a problem because I also write tutorials in Danish occasionally.
I have taken great care to make sure everything is UTF-8, so I was fairly sure the problem was not with my CMS. Both my database and the CMS itself are set up to use Unicode.
After some Googling, I realized the problem was with Pygments. Luckily, the solution was simple: passing the encoding='utf-8' option to the HtmlFormatter constructor appears to solve the problem:
print(highlight(code, lexer, HtmlFormatter(encoding='utf-8')))
With this option set, Pygments should encode the HTML it generates as UTF-8.
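To see what the option changes, here is a minimal sketch, assuming Python 3 and an installed Pygments. The sample string and variable names are made up for illustration. It highlights the same snippet with and without the encoding option and compares the results.

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_by_name

# Illustrative sample containing the Danish characters from the article.
code = 'print("Æ Ø Å")\n'
lexer = get_lexer_by_name("python")

# Without an encoding, highlight() returns text (str in Python 3).
as_text = highlight(code, lexer, HtmlFormatter())

# With encoding='utf-8', the formatter encodes its output,
# so highlight() returns bytes instead of text.
as_bytes = highlight(code, lexer, HtmlFormatter(encoding="utf-8"))

# Decoding the bytes gives back the same markup as the text version.
same_markup = as_bytes.decode("utf-8") == as_text
```

Note that under Python 3, print() on a bytes object shows its b'...' representation rather than the markup itself. If the script runs under Python 3, writing the bytes to a binary stream such as sys.stdout.buffer is one way to emit the HTML unchanged.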
The Pygments wrapper script
My Pygments wrapper script is included below:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Author JacobSeated
# To generate a stylesheet:
# pygmentize -S default -f html -a .highlight > default.css
# argparse is used for named arguments instead of reading them by position, i.e.:
# print(sys.argv[1])
from pygments.formatters import HtmlFormatter
from pygments.lexers import guess_lexer, get_lexer_by_name
from pygments import highlight
import argparse
# Parse CLI arguments
parser = argparse.ArgumentParser()
parser.add_argument(
    "--file", help="Path for file to highlight. The file should only contain code.", type=str, required=True)
parser.add_argument(
    "--lang", help="Language to highlight, e.g. php, html, css", type=str)
args = parser.parse_args()
# Try to open the file; print 0 and exit if it cannot be opened
try:
    f = open(args.file, 'r')
except Exception:
    print(0)
    exit()
else:
    with f:
        file_contents = f.read()
# Use the provided language if there is one, otherwise guess it
if args.lang:
    # print('Using provided language')
    lexer = get_lexer_by_name(args.lang)
else:
    # print('Trying to guess the language')
    lexer = guess_lexer(file_contents)
code = file_contents
# encoding='utf-8' makes the formatter emit UTF-8 encoded output
print(highlight(code, lexer, HtmlFormatter(encoding='utf-8')))
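The core of the script above can also be wrapped in functions, which makes it easier to reuse and test. The following is only a sketch under a few assumptions: Python 3, an input file saved as UTF-8, and the hypothetical helper names pick_lexer and highlight_file. The fallback to guessing when the language name is unknown is my own addition and is not something the original script does.

```python
import io

from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_by_name, guess_lexer
from pygments.util import ClassNotFound


def pick_lexer(source, lang=None):
    """Return a lexer for `lang`, or guess one from the source text."""
    if lang:
        try:
            return get_lexer_by_name(lang)
        except ClassNotFound:
            # Assumption: an unknown language name falls back to guessing
            # instead of aborting. The original script does not do this.
            pass
    return guess_lexer(source)


def highlight_file(path, lang=None):
    """Read `path` as UTF-8 and return the highlighted HTML as UTF-8 bytes."""
    # Decode explicitly so the result does not depend on the system locale.
    with io.open(path, "r", encoding="utf-8") as handle:
        source = handle.read()
    formatter = HtmlFormatter(encoding="utf-8")
    return highlight(source, pick_lexer(source, lang), formatter)
```

Calling highlight_file("example.php", "php") would return the markup as bytes, ready to be written to a binary stream or decoded with .decode("utf-8") before further processing. The file name here is just a placeholder.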