Count the Number of Characters in a String in PHP
How to count the number of characters in a multi-byte string using PHP.
By Jacob
Edited: 2021-02-10 15:42
To count the number of characters in a string in PHP, we can use a regular expression. The strlen function might look like the obvious choice, but it counts bytes rather than characters, so it will not accurately count multi-byte characters.
Multi-byte characters can appear in strings in any application that supports UTF-8.
For example, the alphabetic characters [a-z] only take up one byte each, so strlen counts them correctly; but if we used it on a string that contained multi-byte characters, we would end up with an inaccurate result.
This happens because some characters take up more than one byte. For example, the pile of poo emoji "💩" takes up 4 bytes rather than one; that is no fun at all. In fact, you might call it extremely crappy.
To also count multi-byte characters, we can use the preg_split function with the u modifier, and then count the resulting array:
function count_characters(string $string): int {
    // The empty pattern with the u modifier splits the string between UTF-8 characters
    return count(preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY));
}
Then, to count characters, we simply call this function:
echo count_characters('abcd'); // Should result in "4"
echo count_characters('abcd💩'); // Should result in "5" (strlen would output "8")
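As a point of comparison that is not part of the approach above: when the mbstring extension is available, mb_strlen should give the same character count. The sketch below is only an illustration under that assumption. It repeats the count_characters helper so that it runs on its own, and it writes the emoji as the escape "\u{1F4A9}" (PHP 7.0 or later) so the result does not depend on how the file is saved.

```php
<?php
// Same helper as above: split into UTF-8 characters and count them.
function count_characters(string $string): int {
    return count(preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY));
}

$text = "abcd\u{1F4A9}"; // "abcd" followed by the 4-byte emoji U+1F4A9

$bytes    = strlen($text);           // counts bytes
$viaRegex = count_characters($text); // counts characters

// Assumption: mb_strlen only exists when the mbstring extension is loaded.
$viaMbstring = function_exists('mb_strlen') ? mb_strlen($text, 'UTF-8') : null;

echo $bytes, "\n";      // 8
echo $viaRegex, "\n";   // 5
var_dump($viaMbstring); // int(5) with mbstring loaded, NULL without it
```

Which of the two to use is mostly a matter of what is installed: the regular expression only needs the PCRE functions, while mb_strlen needs mbstring.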
Counting multi-byte characters in PHP
When working with UTF-8, counting the number of characters in a string is not as simple as calling strlen, because strlen counts the bytes in a string rather than the characters themselves. It still works for single-byte encodings, such as ISO-8859-1, but not for UTF-8-aware applications.
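To make the difference between the two encodings concrete, here is a small sketch that stores the same character, "é", once as ISO-8859-1 and once as UTF-8. The variable names are made up for this illustration; the byte values are written as escapes so the example does not depend on the encoding of the file itself.

```php
<?php
$latin1 = "\xE9";   // "é" in ISO-8859-1: the single byte 0xE9
$utf8   = "\u{E9}"; // "é" in UTF-8: the two bytes 0xC3 0xA9

echo strlen($latin1), "\n"; // 1
echo strlen($utf8), "\n";   // 2
echo bin2hex($utf8), "\n";  // c3a9
```

In both cases strlen reports the number of bytes; it only happens to match the character count for the single-byte encoding.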
Calling strlen on a string that contains even a single 4-byte character will result in a highly inaccurate character count:
echo strlen('abcd💩'); // Should result in "8"
Here, the first four characters are 1-byte characters, making up a total of 4 bytes; but the "crap" emoji at the end will, by itself, also take up 4 bytes. The result is eight, which is an inaccurate count; the correct count would be five.
UTF-8 characters take up between 1 and 4 bytes in a string. The alphabetic characters [a-z] only take up one byte each, so the problem only arises for strings that actually contain multi-byte characters.
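The 1 to 4 byte range can be checked with one sample character per width. This is just a sketch: the four characters and the $samples and $widths names are chosen for illustration, and the "\u{...}" escapes require PHP 7.0 or later.

```php
<?php
// One sample character for each possible UTF-8 width.
$samples = [
    'U+0061 (letter a)'       => "a",
    'U+00E9 (e with acute)'   => "\u{E9}",
    'U+20AC (euro sign)'      => "\u{20AC}",
    'U+1F4A9 (pile of poo)'   => "\u{1F4A9}",
];

$widths = [];
foreach ($samples as $label => $char) {
    $widths[$label] = strlen($char); // strlen reports the UTF-8 byte length
    echo $label, ': ', $widths[$label], " byte(s)\n";
}
// Prints 1, 2, 3 and 4 byte(s) respectively.
```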