Cleaning up a string to use as a URL

As part of a recent project, we had a requirement to take the headline of a news article and clean it up such that it could be used as part of a URL. 

The quickest and most obvious way to do this is to use PHP's native urlencode() function. Whilst this would have worked, its output can put a lot of 'noise' in the URL:

$string = "It's a news item. We're pleased with it";
echo urlencode($string);

This gives the following result:

It%27s+a+news+item.+We%27re+pleased+with+it

Obviously not very pretty. So, first things first: define what we want the URL slug to look like.

For this project we had a few simple rules:

  • The resulting string should be in lowercase
  • It shouldn't contain special characters, spaces or any encoded text
  • We should try to avoid introducing new meaning to the string by altering words
  • We should have the option of removing words we don't need (stopwords)

With a clear set of goals, the process becomes a simple step-by-step, but the order of operations also comes into play. If we were to start by stripping out all the non-alphanumeric characters from the original string, we would run the risk of changing the meaning of the sentence: "I'll" becomes "ill", "we're" becomes "were" and so forth. (If this were not a concern, we could have done this as step 1 and saved ourselves a step later on.)
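To make the ordering problem concrete, here is a minimal sketch. The headline is a made-up example, and the regex is just one possible way of stripping punctuation. It compares removing the characters outright with replacing them by a dash:

```php
<?php
// Hypothetical headline, used only for illustration.
$headline = "I'll say we're done";

// Stripping punctuation outright: the apostrophes vanish and the words change.
$stripped = preg_replace('/[^a-z0-9 ]/', '', strtolower($headline));
echo $stripped, PHP_EOL; // ill say were done

// Replacing punctuation with a dash keeps the original word boundary visible.
$dashed = preg_replace('/[^a-z0-9 ]/', '-', strtolower($headline));
echo $dashed, PHP_EOL; // i-ll say we-re done
```

The second form is not pretty, but it avoids turning one word into a different one.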

Step 1

The first order of business was to remove words we didn't need (so-called stopwords).  At the same time, we force the input string to lower case to make the job a little simpler.   PHP has an excellent suite of functions for manipulating arrays and strings, so the job is fairly simple:

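// Example input (the headline from earlier)
$in = "It's a news item. We're pleased with it";
$stopwords = ['in', 'a', 'on', 'and', '&', "'", 'the', 'to', 'it', 'with', 'at', 'be', 'up', 'of', 'one', 'for'];
$words = explode(' ', strtolower($in)); // lower-case the input, then split on spaces
$out = array_diff($words, $stopwords);  // keep only the words not in the stopword list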

So, we define an array of the words we want to remove. Next we convert the input string into an array using explode(), which gives us all the words in the original string. (Note that the string is forced to lower case with strtolower().)

Finally we use array_diff() to give us a new array containing words that are not in our stopwords list.
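Two details are worth knowing at this point. First, explode(' ', ...) produces empty strings wherever the input has consecutive spaces. Second, array_diff() preserves the original keys, so the result may have gaps in its numbering. Neither breaks the approach here, but the sketch below shows one possible way around both, using a made-up input and a shortened stopword list. Treat it as an optional variation rather than part of the original method:

```php
<?php
// Hypothetical input with repeated spaces, for illustration only.
$in = "The  cat sat on   the mat";
$stopwords = ['the', 'on'];

// Split on runs of whitespace and drop any empty pieces.
$words = preg_split('/\s+/', strtolower($in), -1, PREG_SPLIT_NO_EMPTY);

// array_diff() keeps the original keys of the surviving words.
$out = array_diff($words, $stopwords);
print_r($out);               // keys 1, 2 and 5 remain

// Re-index from 0 if the gaps matter for later processing.
print_r(array_values($out)); // cat, sat, mat
```

Since implode() ignores keys, the re-indexing step is only needed if you go on to access the words by position.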

 

Step 2

Now we have removed the words we don't want, we can set about removing unwanted characters.  Since we are currently working with an array, we have a couple of choices:  We could convert the array back to a string now and perform operations on it, or we can iterate the array and perform an operation on each element.

If we take the second option, we can use the PHP array_walk() function to iterate the array and modify it using a simple little anonymous function:

array_walk($out, function (&$value, $key) {
 // $value is taken by reference so the change is written back to $out
 $value = preg_replace('/[^a-zA-Z0-9]/', '-', $value);
});

If you're unfamiliar with array_walk(), it iterates an array and passes each value and its key into another function you define. We're using an anonymous function here, and most importantly we're taking the value by reference. This means that whatever we do to the value in the function will affect the original array. (Only values can be changed this way; the key is passed by value, and declaring it as a reference triggers a warning in recent PHP versions.) Inside our function we have a simple regex operation which replaces any non-alphanumeric character with a dash.
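As a small illustration of the by-reference behaviour, the sketch below runs array_walk() on a sample array and then does the same job with array_map(), which returns a new array instead of modifying the original. The sample arrays are made up for the example:

```php
<?php
$out = ["it's", 'news', 'item.'];

// array_walk(): the value is taken by reference, so $out changes in place.
array_walk($out, function (&$value, $key) {
    $value = preg_replace('/[^a-zA-Z0-9]/', '-', $value);
});
print_r($out); // it-s, news, item-

// array_map(): returns a new array and leaves its input untouched.
$original = ["we're", 'pleased'];
$mapped = array_map(function ($value) {
    return preg_replace('/[^a-zA-Z0-9]/', '-', $value);
}, $original);
print_r($mapped);   // we-re, pleased
print_r($original); // still we're, pleased
```

Which one to use is largely a matter of taste: array_walk() suits in-place changes, whilst array_map() is handy when you want to keep the original array as well.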

For what we're doing here, the above method is actually a bit too complex for our needs but it's a handy technique to know and it does allow us to perform all kinds of operations on the elements in our array.  The other thing to consider is that performing actions inside loops may not always be very efficient.  Since we only need to remove a few characters, we can ignore the above and instead use the simpler method of converting back to a string.  Unlike the above which calls a regex function in every iteration of the loop, the alternative gives the same result with only a single call: 

$cleaned = implode('-', $out);

Now that we have a lower-case string with our stopwords removed, we can use the regex from above to tidy it up:

$slug = preg_replace('/[^a-zA-Z0-9]/', '-', $cleaned);

This gives us something very close to the output we want (note the double dash where 'item.' was followed by a space):

it-s-news-item--we-re-pleased

 

All good so far, and it fulfils most of our requirements. There's one final step that might be needed: when a non-alphanumeric character is followed by a space, the resulting string will have two dashes in a row. So a string such as "Orange: it's the new black" will return 'orange--it-s-new-black' ('the' having been removed as a stopword). If this is a concern you can perform a final operation on the string. It's another regex, which finds runs of multiple dashes and replaces each with a single dash:

$slug2 = preg_replace('/-{2,}/', '-', $slug);
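One related case the steps above don't cover is a headline that ends in punctuation, which leaves a trailing dash. The sketch below, using a made-up intermediate value, shows the dash-collapsing regex followed by an optional trim(). The trim is an extra suggestion rather than part of the original steps:

```php
<?php
// Hypothetical slug as it might look before the final tidy-up.
$slug = 'hello--world-';

// Collapse runs of two or more dashes into a single dash.
$slug2 = preg_replace('/-{2,}/', '-', $slug);
echo $slug2, PHP_EOL; // hello-world-

// Optional extra: drop any leading or trailing dashes.
$slug3 = trim($slug2, '-');
echo $slug3, PHP_EOL; // hello-world
```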

So, there it is:  we have converted our string and fulfilled all our criteria.   In the end the whole thing only takes a couple of lines of code.  There are doubtless other ways to do the same job, and the code snippet here obviously isn't doing any sanity-checking on the data.   Here's the code wrapped up in a nice little function:

function cleanString($string) {
 $stopwords = ['in', 'a', 'on', 'and', "&", "'", 'the', 'to', 'it', 'with', 'at', 'be', 'up', 'of', 'one', 'for'];
 $words = explode(' ', strtolower($string));
 $out = array_diff($words, $stopwords);
 $cleaned = implode('-', $out);
 $slug = preg_replace('/[^a-zA-Z0-9]/', '-', $cleaned);
 return preg_replace('/-{2,}/', '-', $slug);
}
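For a quick check, here is the function run against the two example headlines used in this post. The function body is repeated so that the snippet runs on its own:

```php
<?php
// Same function as above, repeated so the example is self-contained.
function cleanString($string) {
    $stopwords = ['in', 'a', 'on', 'and', '&', "'", 'the', 'to', 'it', 'with', 'at', 'be', 'up', 'of', 'one', 'for'];
    $words = explode(' ', strtolower($string));
    $out = array_diff($words, $stopwords);
    $cleaned = implode('-', $out);
    $slug = preg_replace('/[^a-zA-Z0-9]/', '-', $cleaned);
    return preg_replace('/-{2,}/', '-', $slug);
}

echo cleanString("It's a news item. We're pleased with it"), PHP_EOL; // it-s-news-item-we-re-pleased
echo cleanString("Orange: it's the new black"), PHP_EOL;              // orange-it-s-new-black
```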

And here's the more complex version, which would allow more advanced operations on the words:

function cleanStringAdvanced($string) {
 $stopwords = ['in', 'a', 'on', 'and', "&", "'", 'the', 'to', 'it', 'with', 'at', 'be', 'up', 'of', 'one', 'for'];
 $words = explode(' ', strtolower($string));
 $out = array_diff($words, $stopwords);
 array_walk($out, function (&$value, $key) {
  $value = preg_replace('/[^a-zA-Z0-9]/', '-', $value);
  //Do anything else clever here
 });
 $cleaned = implode('-', $out);
 return preg_replace('/-{2,}/', '-', $cleaned);
}

Code snippets on this page are covered by the MIT licence