PHP: Using Preg_match_all to Track Keywords and Keyphrases

How to track keywords and keyphrases with PHP's preg_match_all.

Edited: 2017-09-07 06:42

The preg_match_all function performs a global regular expression match: it returns every match of a pattern in a string, rather than just the first one. That makes it well suited to finding HTML elements, and to tracking keywords in articles. This tutorial shows how to use a regular expression in PHP to track the occurrence of words and phrases in articles.

There are many reasons why you might want to gather information about the words and phrases in articles. You could be building a search engine, or simply analyzing the data to learn something. Whatever your reason might be, preg_match_all is a good fit for this type of task.

Finding words and phrases

First we need to remove all the HTML tags from the source, so that we only count actual words. The steps required to do this depend on your source, but this tutorial assumes you are dealing with articles stored in a database, which mainly contains HTML for the body of your pages. If your source contains script elements, or other special elements, you will need to filter those out first. The PHP code below removes all HTML tags without removing the textual content of the elements. It also collapses all consecutive whitespace into a single space.

$source = '<p>Here we are testing the <img src="example"> <i>preg_replace_all</i> function of PHP</p>';
// Remove HTML elements from body
$source = preg_replace("/<[^>]+>/", ' ', $source);
// Fix removed HTML (collapse whitespace into single space)
$source = preg_replace("/\s+/", ' ', $source);
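If the source may contain script or style elements, their content has to be removed before the tags are stripped, otherwise the code inside them ends up being counted as words. The following is only a sketch of how that could be done; html_to_text() is a hypothetical helper name, and the choice to decode HTML entities is an assumption about what you want to count.

```php
<?php
// Sketch: turn an HTML fragment into plain text before counting words.
// html_to_text() is a hypothetical helper name, not a PHP built-in.
function html_to_text(string $html): string
{
    // Drop <script> and <style> elements together with their content.
    $text = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', ' ', $html);
    // Remove the remaining tags, but keep the text between them.
    $text = preg_replace('/<[^>]+>/', ' ', $text);
    // Turn entities such as &amp; back into plain characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace into one space and trim the ends.
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo html_to_text('<p>Keep <b>this</b></p><script>var x = "<p>not this</p>";</script>');
// Prints: Keep this
```

Regular expressions are good enough for reasonably clean article markup like this. For arbitrary or malformed HTML, a real parser is likely to be more reliable.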

Now, to find words and phrases, we may use preg_match_all with the PREG_SET_ORDER flag. With this flag, $matches[0] contains the first match, $matches[1] the second match, and so on. Within each match, index 0 holds the full matched text and index 1 holds the first capture group of the pattern. This makes it easy to loop through the results using a counter variable with a while loop.

// Assumed pattern (none is shown here); the capture group puts each word at index 1
$pattern = '/(\w+)/u';
preg_match_all($pattern, $source, $matches, PREG_SET_ORDER);
$number_of_words = (count($matches) -1); // Index of the last match
$i = 0; // Counter variable
$output = '<ul>'; // Just some HTML to format the output ;-)
while ($i <= $number_of_words) {
  $output .= '<li>' . htmlspecialchars($matches[$i][1]) . "</li>";
  ++$i; // Increment the counter
}
$output .= '</ul>';
echo $output; // Shows a list of words found in $source
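Listing the words is only the first step; to track occurrences, the matches also need to be counted. One possible way to do that is sketched below. The helper name count_words() and the pattern are assumptions made for illustration. The sketch uses the default flag (PREG_PATTERN_ORDER), so that $matches[0] is simply the list of all matched words.

```php
<?php
// Sketch: count how often each word occurs in a plain-text string.
// count_words() and the pattern are illustrative assumptions.
function count_words(string $text): array
{
    // \p{L} matches any letter and \p{N} any digit; the u modifier enables UTF-8.
    preg_match_all('/[\p{L}\p{N}_]+/u', $text, $matches);
    // Lowercase so that "PHP" and "php" are counted together.
    // strtolower() only handles ASCII; mb_strtolower() may suit other text better.
    $words = array_map('strtolower', $matches[0]);
    $counts = array_count_values($words);
    arsort($counts); // Most frequent words first
    return $counts;
}

print_r(count_words('PHP is fun and php is fast'));
```

In this example, "php" and "is" are each counted twice, and every other word once.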

The above is useful to get single words, but what about phrases? It turns out phrases are equally easy to obtain. In this script, phrases consisting of up to four words will be accounted for, and you can easily add more if you need to. Just keep in mind that if you have a lot of articles of fair length (let's say 500-1000 words apiece), and you are inserting the words into a database, running this script could take a long time. We will deal with execution times later.

To get phrases, we simply need to look ahead in the $matches array. The word that follows $matches[$i] is found at $matches[$i+1], the one after that at $matches[$i+2], and so on. In PHP, this looks like:

$first_word = $second_word = $third_word = $fourth_word = array();
while ($i <= $number_of_words) {
  $first_word[] = $matches[$i][1];
  // Only look ahead when the following match exists
  if (isset($matches[$i+1][1])) {$second_word[] = $matches[$i+1][1];}
  if (isset($matches[$i+2][1])) {$third_word[] = $matches[$i+2][1];}
  if (isset($matches[$i+3][1])) {$fourth_word[] = $matches[$i+3][1];}
  ++$i; // Increment the counter
}

Note the isset() check. It prevents the loop from reading past the end of the $matches array, which would otherwise result in undefined offset errors.

In the above while loop, we are collecting all single words, along with the following words needed to form phrases of up to four words. Everything is placed inside arrays, which can later be handled with a simple loop. This may not be ideal, but it is just to show how the phrases may be obtained.

The complete script is included below:

<?php
$source = '<p>Here we are testing the <img src="example"> <i>preg_replace_all</i> function of PHP</p>';
// Remove HTML elements from body
$source = preg_replace("/<[^>]+>/", ' ', $source);
// Fix removed HTML (collapse whitespace into single space)
$source = preg_replace("/\s+/", ' ', $source);

// Assumed pattern (none is shown above); the capture group puts each word at index 1
$pattern = '/(\w+)/u';
preg_match_all($pattern, $source, $matches, PREG_SET_ORDER);

$first_word = $second_word = $third_word = $fourth_word = array();
$number_of_words = (count($matches) -1); // Index of the last match
$i = 0; // Counter variable
while ($i <= $number_of_words) {
  $first_word[] = $matches[$i][1];
  // Only look ahead when the following match exists
  if (isset($matches[$i+1][1])) {$second_word[] = $matches[$i+1][1];}
  if (isset($matches[$i+2][1])) {$third_word[] = $matches[$i+2][1];}
  if (isset($matches[$i+3][1])) {$fourth_word[] = $matches[$i+3][1];}
  ++$i; // Increment the counter
}
echo '<pre>';
print_r($first_word); // example output
echo '</pre>';
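The script collects the words that make up each phrase in separate arrays. If you would rather have the phrases as ready-made strings, one alternative is to slice the word list directly. This is only a sketch: build_phrases() is a hypothetical helper, and it assumes the words have already been extracted into a flat array and that $length is at least 1.

```php
<?php
// Sketch: build every phrase of $length consecutive words.
// build_phrases() is a hypothetical helper, not part of the script above.
function build_phrases(array $words, int $length): array
{
    $phrases = [];
    // Stop early enough that a full phrase always fits in the array.
    $last_start = count($words) - $length;
    for ($i = 0; $i <= $last_start; ++$i) {
        $phrases[] = implode(' ', array_slice($words, $i, $length));
    }
    return $phrases;
}

$words = ['here', 'we', 'are', 'testing', 'here', 'we', 'are'];
$counts = array_count_values(build_phrases($words, 2));
print_r($counts); // "here we" and "we are" each occur twice
```

Calling the helper with lengths 2, 3 and 4 gives all the phrase sizes covered in this tutorial, and array_count_values() turns each list into occurrence counts.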

Performance

The performance of your script will mostly depend on what you are doing with the data. If you are saving the words in a database, for statistical purposes, or making a search engine, then you may find that the script can take a long time to finish. It will ultimately depend on the number of inserts and updates you are performing, and there are ways to limit those.

The most obvious solution is to not save data you do not need. But, it can sometimes be hard to know in advance if you are going to need the data, and if you are doing research on it, you might want to save everything.
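As an illustration of not saving data you do not need, very short words and common filler words could be dropped before anything is written to the database. The sketch below is one possible approach; filter_words() is a hypothetical helper, and both the stop-word list and the minimum length of 3 are arbitrary assumptions that you would need to tune for your own data.

```php
<?php
// Sketch: drop words that are unlikely to be useful before saving them.
// The stop-word list and the minimum length are arbitrary assumptions.
function filter_words(array $words, array $stop_words, int $min_length = 3): array
{
    $kept = array_filter($words, function (string $word) use ($stop_words, $min_length) {
        // strlen() counts bytes, which equals characters for ASCII words.
        return strlen($word) >= $min_length && !in_array($word, $stop_words, true);
    });
    return array_values($kept); // Re-index from 0
}

$words = ['here', 'we', 'are', 'testing', 'the', 'function', 'of', 'php'];
print_r(filter_words($words, ['the', 'are', 'here']));
// Keeps: testing, function, php
```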

Another solution involves optimizing your database queries. This can be done by combining multiple inserts so that they fit into a single query. But by doing this, you may run into other roadblocks, depending on the type of database you are using, and they can be hard to debug. MySQL seems to run into strange limits around this, even when the queries appear to fit within the configured limits.
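Combining inserts could look roughly like the sketch below, which builds one multi-row INSERT statement with placeholders for a batch of words. The table name "keywords", its columns, and the helper build_batch_insert() are all hypothetical. Splitting the rows into smaller chunks is one way to keep each query modest in size, which may help avoid the kind of limits mentioned above.

```php
<?php
// Sketch: build one multi-row INSERT for a batch of words.
// The table "keywords" and its columns are hypothetical.
function build_batch_insert(array $rows): array
{
    // One "(?, ?)" placeholder group per row.
    $placeholders = implode(', ', array_fill(0, count($rows), '(?, ?)'));
    $sql = 'INSERT INTO keywords (word, occurrences) VALUES ' . $placeholders;
    $params = [];
    foreach ($rows as $row) {
        $params[] = $row[0]; // word
        $params[] = $row[1]; // occurrences
    }
    return [$sql, $params];
}

$rows = [['php', 4], ['regex', 2], ['keyword', 7]];
// A chunk size of 2 is only for demonstration; real batches would be larger.
foreach (array_chunk($rows, 2) as $chunk) {
    [$sql, $params] = build_batch_insert($chunk);
    echo $sql, "\n";
    // With an existing PDO connection in $pdo, the batch could be run as:
    // $pdo->prepare($sql)->execute($params);
}
```

The helper only builds the statement and its parameter list, so it can be tried without a database. Using placeholders rather than concatenating the words into the query also avoids quoting problems.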

A very easy solution, which may be sufficient for the majority of use cases, is to use an SSD instead of a mechanical hard disk. A script inserting words into a database, which runs for 2+ hours on a mechanical drive, can finish in just 5 minutes on an SSD!
