How to replace parts of a string with str_replace and preg_replace
Tutorial on how to replace parts of a string with str_replace and regular expressions.
By. Jacob
Edited: 2020-11-08 12:17
To replace a substring within a string we can either use the string replacement functions, or we can create a regular expression for more complex replacements.
A simple way to replace a string is to use str_replace (for a case-sensitive replacement) or str_ireplace (for a case-insensitive replacement), but we can also use preg_replace to perform regular expression replacements.
The native str_replace function of PHP is used to replace all the occurrences of a given string with a replacement string; but using a regular expression will allow for more complex pattern-based replacements, which is useful for working with HTML, CSS and JavaScript content.
While the string functions are said to be faster than the regular expression preg_* functions, this rarely seems to matter in practice. Last time I tested this, I was able to do more than 1 million replacements in about a second on an i3 laptop; it will only matter for high-performance applications. Now, this does not mean that I recommend one over the other. If it is possible to use the string version, then I think you should aim to do so, since it is such an easy thing to do. When multiple calls to str_replace are needed, that is probably a good time to use regular expressions instead, since that is when preg_replace may actually be the faster option.
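For anyone who wants to repeat this kind of measurement, a rough timing sketch could look like the following. The subject string, the iteration count and the variable names are made up for illustration, and the numbers will vary with hardware and PHP version, so treat it as a starting point and not as a proper benchmark.

```php
<?php
// Hypothetical test data: a short subject that contains the target ten times.
$subject = str_repeat('...target...', 10);
$iterations = 100000;

// Time the plain string function.
$start = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $str_result = str_replace('target', 'replacement', $subject);
}
$str_seconds = microtime(true) - $start;

// Time the equivalent regular expression.
$start = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $preg_result = preg_replace('/target/', 'replacement', $subject);
}
$preg_seconds = microtime(true) - $start;

printf("str_replace:  %.4f s\n", $str_seconds);
printf("preg_replace: %.4f s\n", $preg_seconds);
```

Both loops produce the same string, so only the timings differ.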
Replacing string with str_replace
The str_replace function takes a target string, a replacement string, and a source string, in that order. Each of these can be given as a bare string, a variable or an array. The first example uses bare strings:
$source_str = "...target1...target2...target3...target4";
echo str_replace("target2", "replacement2", $source_str);
It is also possible to use arrays containing different strings to be replaced. The replacements are performed in the order of the array:
$source_str = "...target1...target2...target3...target4";
$targets = array("target1", "target2", "target3");
$replacements = array("replacement1", "replacement2", "replacement3");
$new_str = str_replace($targets, $replacements, $source_str);
echo $new_str;
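Because the pairs are applied one after another, a later pair can also act on text that an earlier pair inserted. A minimal sketch of that effect, using made-up one-letter strings:

```php
<?php
// Pair 1 turns "a" into "b"; pair 2 then turns every "b" into "c",
// including the "b" that pair 1 just inserted.
$result = str_replace(array("a", "b"), array("b", "c"), "ab");
echo $result, "\n"; // cc

// Reversing the order of the pairs avoids the double replacement here.
$reordered = str_replace(array("b", "a"), array("c", "b"), "ab");
echo $reordered, "\n"; // bc
```

If a replacement string contains one of the later targets, it is worth checking that the order of the arrays gives the result you expect.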
Note. To perform a case-insensitive match, the str_ireplace function may be used.
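The case-insensitive function is spelled str_ireplace in PHP and takes the same arguments as str_replace. A short sketch of the difference, with a made-up source string:

```php
<?php
$source_str = "Target, TARGET and target";

// Case sensitive: only the exact lowercase "target" is replaced.
$sensitive = str_replace("target", "x", $source_str);
echo $sensitive, "\n";   // Target, TARGET and x

// Case insensitive: all three spellings are replaced.
$insensitive = str_ireplace("target", "x", $source_str);
echo $insensitive, "\n"; // x, x and x
```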
Using regular expressions
Regular expressions may be used for more complicated replacements, such as when replacing HTML elements. In PHP we may use the preg_replace function to perform replacements with regex. Some may find it more complicated than performing replacements with str_replace, and it is, in the sense that you need to learn how to write regular expressions. The effort is worth it, however, as you can use regular expressions in many different cases.
Below is a simple beginner regex that replaces runs of two or more whitespace characters with a single space.
$str = preg_replace('/\s\s+/', ' ', $str);
In this case, "\s" is shorthand for any whitespace character (spaces, tabs and line breaks), and the "+" sign can be translated to "one or more". The expression means something along the lines of: where one whitespace character is followed by one or more further whitespace characters, replace this "matched pattern" with a single space.
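To see what the pattern does and does not match, here is a small sketch with a made-up input string. It also shows the shorter pattern \s+, which replaces single whitespace characters too:

```php
<?php
$str = "Hello   world\n\n  again\tand again";

// Runs of two or more whitespace characters collapse into one space.
// The single tab is left alone because the pattern needs at least two.
$collapsed = preg_replace('/\s\s+/', ' ', $str);
var_dump($collapsed === "Hello world again\tand again"); // bool(true)

// With \s+ every run of whitespace, including the single tab, becomes a space.
$normalized = preg_replace('/\s+/', ' ', $str);
var_dump($normalized === "Hello world again and again"); // bool(true)
```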
Sometimes you may want to "remember" certain parts of your source string while only replacing other parts. For example, you can remember the content of an HTML element and only replace the element tags. In the example below we want to replace the paragraph tags with div tags. To do this we can "remember" the content and replace the <p> tags around it:
$str = '<!DOCTYPE html>
<html>
<head>
<title>My first Website</title>
</head>
<body>
<p>My first Website.</p>
</body>
</html>';
$new_str = preg_replace('|<p>([^<]*)</p>|su', "<div>$1</div>", $str);
echo $new_str;
The regex used to match the pattern in the $str variable above is relatively simple:
|<p>([^<]*)</p>|su
The part located inside the parentheses matches the content; it matches any character except the less-than sign (<), hence the "[^<]" part. The star (*) works like the plus sign explained earlier, except that it means "zero or more", so an empty <p></p> is matched as well.
The part at the end, su, holds the modifiers. The u modifier causes the pattern and subject strings to be treated as UTF-8. The s modifier makes the dot (.) match line breaks as well; since this pattern contains no dot, it has no effect here.
The square brackets ([]) define a character class, which matches a single character out of the set listed inside them, in any order; in this case we used the caret/circumflex/hat sign at the start to match any character that is not listed instead.
The parentheses are used to remember the match as a "back reference", which allows us to insert it into the replacement string. Back references are accessed in the replacement string as $1, $2, $3, and so on. For nested parentheses, the groups are numbered by the position of their opening parenthesis, counting from left to right, so the outermost group comes first and the innermost last, like peeling the layers of an onion. This can be shown by a visual representation:
(        // $1
  (      // $2
    (    // $3
    )
  )
)
So, whenever you work with expressions and nested parentheses, keep this in mind.
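A tiny sketch makes the numbering concrete. The pattern and input are made up; the point is only that groups are counted by their opening parenthesis, so $1 is the outermost group and $3 the innermost:

```php
<?php
// Three nested groups around the letters a, b and c.
$result = preg_replace('/(a(b(c)))/', '[$1] [$2] [$3]', 'abc');
echo $result, "\n"; // [abc] [bc] [c]
```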
It is often a good idea to use reverse-logic in your expressions. For instance, instead of listing all the characters that you allow, it is often much easier to simply list those that you do not allow using the caret inside square brackets. However, when working with HTML, this might prevent nested elements — more on this in the section on nested elements!
Finally, preg_replace also works with arrays in the same way that str_replace does.
$source_str = "...<b>target1</b>...<i>target2</i>...<u>target3</u>...";
$targets = array('#<b>([^<]*)</b>#', '#<i>([^<]*)</i>#', '#<u>([^<]*)</u>#');
$replacements = array("<strong>$1</strong>", "<em>$1</em>", "<span>$1</span>");
$new_str = preg_replace($targets, $replacements, $source_str);
echo $new_str;
Allowing Nested HTML elements
Replacing HTML elements using the pattern [^<]* has a drawback: the match stops at the first less-than sign, so elements that contain nested elements are not matched. There is an elegant solution to this, which is covered in this section.
By default the dot (.) matches any character except line breaks, so a pattern like (.*?) cannot match everything in an HTML element that spans several lines. It can if we add the s modifier to the pattern:
$str = '<p>My <b>first</b> Website.</p>';
$new_str = preg_replace('|<p>(.*?)</p>|su', "<div>$1</div>", $str);
echo $new_str;
Output:
<div>My <b>first</b> Website.</div>
The question mark in the pattern makes the expression non-greedy, meaning it will only match up until the first closing </p> that it finds.
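The difference only shows when there is more than one paragraph in the string. A small comparison, using a made-up input with two paragraphs:

```php
<?php
$str = '<p>One</p><p>Two</p>';

// Greedy: .* runs to the LAST </p>, so both paragraphs end up in one match.
$greedy = preg_replace('|<p>(.*)</p>|su', '<div>$1</div>', $str);
echo $greedy, "\n"; // <div>One</p><p>Two</div>

// Non-greedy: .*? stops at the FIRST </p>, so each paragraph is matched on its own.
$lazy = preg_replace('|<p>(.*?)</p>|su', '<div>$1</div>', $str);
echo $lazy, "\n";   // <div>One</div><div>Two</div>
```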
Using regular expressions for HTML
You will often be told not to use regular expressions to work on HTML code, and instead use DOM tools — but let me be absolutely clear — there is nothing wrong with using regular expressions to alter HTML code!
What you cannot expect is to create a "parser" that will work on all conceivable HTML with regexes. The problem with using regexes for HTML is that HTML is irregular. There are just too many different, valid ways to write HTML, and then you probably also need to support some degree of invalid HTML.
If you are crawling web pages, then you also need to account for pages with invalid character sets, since some pages do not match the character set that the server says is used.
If something suddenly changes, for example the HTML gains an unexpected extra space or line break somewhere, your code might stop working.
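As an illustration of how fragile an exact pattern is, the sketch below compares the strict pattern from earlier with a more tolerant variant. The tolerant pattern is only one possible approach and is an assumption of this example: it accepts upper-case tags, attributes and stray whitespace, but it drops the attributes and would still fail on an attribute value that contains a greater-than sign.

```php
<?php
// Made-up input: upper-case tag, an attribute, and extra spaces inside the tags.
$str = '<P class="intro" >Hello</p >';

// The strict pattern finds nothing, so the string comes back unchanged.
$strict = '|<p>(.*?)</p>|su';
$unchanged = preg_replace($strict, '<div>$1</div>', $str);

// Tolerant variant: i ignores case, (?:\s[^>]*)? allows optional attributes,
// and \s* allows whitespace before the closing bracket of the end tag.
$tolerant = '|<p(?:\s[^>]*)?>(.*?)</p\s*>|isu';
$converted = preg_replace($tolerant, '<div>$1</div>', $str);
echo $converted, "\n"; // <div>Hello</div>
```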
It is important to note that it is actually possible to account for all special cases. It is just very hard to do.