Edited: 2020-07-27 17:46

To count the number of characters in a string in PHP we can use a regular expression. The strlen function is also available, but it counts bytes rather than characters, so it will not accurately count multi-byte characters.

Multi-byte characters can appear in strings in any application that supports UTF-8.

Just to give an example, the alphabetic characters [a-z] only take up one byte each, so strlen counts them correctly; but if we used strlen on a string that contained multi-byte characters, we would end up with an inaccurate result.

This happens because some characters take up more than one byte. For example, the shitty emoji character "💩" takes up 4 bytes rather than one; that is no fun at all. In fact, you might call it extremely crappy.
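To make the byte sizes concrete, here is a small sketch that prints how many bytes a few characters occupy in UTF-8. The helper name byte_lengths and the sample characters are just illustrative assumptions; the characters are written as Unicode escapes (PHP 7.0 or later), so the result should not depend on how the script file itself is saved.

```php
<?php
// Hypothetical helper for illustration: maps each character to its byte length.
// array_map keeps the string keys because only one array is passed in.
function byte_lengths(array $chars): array {
    return array_map('strlen', $chars);
}

$samples = [
    'a'           => "a",          // U+0061, plain ASCII letter
    'e-acute'     => "\u{00E9}",   // U+00E9
    'euro sign'   => "\u{20AC}",   // U+20AC
    'pile of poo' => "\u{1F4A9}",  // U+1F4A9, the emoji from the text
];

foreach (byte_lengths($samples) as $name => $bytes) {
    echo $name, ': ', $bytes, " byte(s)\n";
}
```

The four samples should come out as 1, 2, 3 and 4 bytes respectively, which covers the whole range a UTF-8 character can take.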

To also count multi-byte characters, we can use the preg_split function with the u modifier, and then count the resulting array:

function count_characters(string $string) {
  // The u modifier makes the empty pattern split on UTF-8 characters, not bytes
  return count(preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY));
}

Then, to count characters, we would simply call this function:

echo count_characters('abcd'); // Should result in "4"
echo count_characters('abcd💩'); // Should result in "5"
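One thing to be aware of: with the u modifier, preg_split returns false instead of an array when the input is not valid UTF-8, and in PHP 8 passing false to count raises a TypeError. The sketch below is one possible way to guard against that. The name count_characters_safe and the choice to return null for invalid input are assumptions made for illustration, not the only way to handle it.

```php
<?php
// Hypothetical variant for illustration: returns null when the input
// is not valid UTF-8 instead of handing false to count().
function count_characters_safe(string $string): ?int {
    $parts = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

    if ($parts === false) {
        // preg_split failed, most likely because of malformed UTF-8
        return null;
    }

    return count($parts);
}

var_dump(count_characters_safe("abcd\u{1F4A9}")); // valid UTF-8
var_dump(count_characters_safe("\xFF"));          // a lone 0xFF byte is never valid UTF-8
```

Whether null, an exception, or a fallback to strlen is the right reaction depends on where the string comes from, so treat this as a starting point.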

When working with UTF-8, counting the number of characters in a string is not as simple as calling strlen; this is because strlen only counts the bytes in a string, and not the characters themselves. It still works for single-byte encodings, such as ISO-8859-1, but not for UTF-8-aware applications.

Using strlen to count the characters in a string that contains even a single 4-byte character will result in a highly inaccurate character count:

echo strlen('abcd💩'); // Should result in "8"

Here, the first four characters are 1-byte characters, making up a total of 4 bytes; but the "crap" emoji at the end will, itself, also take up 4 bytes. The result is eight, which is an inaccurate count; the correct count would be five.
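To see both numbers side by side, here is a minimal sketch that reports the byte count and the character count for the same string. It assumes that mb_strlen from the mbstring extension is an acceptable alternative where that extension is loaded, and falls back to the preg_split approach otherwise; the function name count_bytes_and_characters is hypothetical.

```php
<?php
// Hypothetical helper for illustration: returns both counts for one string.
function count_bytes_and_characters(string $string): array {
    $bytes = strlen($string); // always counts bytes

    if (function_exists('mb_strlen')) {
        // Assumption: the mbstring extension is available
        $characters = mb_strlen($string, 'UTF-8');
    } else {
        // Fallback: the regular expression approach from this article
        $parts = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
        $characters = ($parts === false) ? null : count($parts);
    }

    return ['bytes' => $bytes, 'characters' => $characters];
}

$result = count_bytes_and_characters("abcd\u{1F4A9}");
echo $result['bytes'], ' bytes, ', $result['characters'], " characters\n";
```

For the example string this should report 8 bytes and 5 characters, matching the numbers worked out above.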

UTF-8 characters take up between 1 and 4 bytes in a string; the alphabetic characters [a-z] only take up one byte each, so the difference between strlen and the real character count only shows up in strings that actually do contain multi-byte characters.
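The 1-to-4-byte layout can also be used to count characters by hand. In UTF-8, every byte after the first byte of a multi-byte character has the bit pattern 10xxxxxx, so counting only the bytes that do not match that pattern gives the number of characters. The sketch below is a swapped-in technique for illustration, not the preg_split approach above, and it assumes the input is already valid UTF-8; the name count_utf8_lead_bytes is hypothetical.

```php
<?php
// Hypothetical helper for illustration: counts every byte that is NOT
// a UTF-8 continuation byte (10xxxxxx). Assumes valid UTF-8 input.
function count_utf8_lead_bytes(string $string): int {
    $count = 0;
    $length = strlen($string); // number of bytes to walk through

    for ($i = 0; $i < $length; $i++) {
        // Mask the top two bits; 0x80 means "continuation byte"
        if ((ord($string[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }

    return $count;
}

echo count_utf8_lead_bytes("abcd\u{1F4A9}"), "\n"; // four letters plus one emoji
```

Because the emoji contributes one leading byte and three continuation bytes, it is counted once, and the example string comes out as five characters.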


