How to be Good at Writing Regular Expressions in PHP?

A regular expression is a sequence of characters that forms a search pattern. Regular expressions are powerful, but they are often hard to read, and maintaining one is rarely simple.

PHP uses PCRE (Perl-Compatible Regular Expressions), which offers several advanced features that help in writing regular expressions that are readable, self-explanatory, and easy to maintain. PHP's filter and ctype functions already provide validations for URLs, email addresses, and alphanumeric values. For these common checks, there is no need to reach for a regular expression in the first place.
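As a rough sketch of that last point, the built-in functions below cover a few common checks without any pattern at all. The sample inputs and variable names are invented for illustration.

```php
<?php
// Sample inputs, invented for this illustration.
$email    = 'someone@example.com';
$url      = 'https://example.com/path';
$username = 'user42';

// filter_var() returns the filtered value on success and false on failure.
$emailOk = filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
$urlOk   = filter_var($url, FILTER_VALIDATE_URL) !== false;

// ctype_alnum() is true only when every character is a letter or a digit.
$alnumOk  = ctype_alnum($username);
$alnumBad = ctype_alnum('user 42'); // contains a space

var_dump($emailOk, $urlOk, $alnumOk, $alnumBad);
```

These functions say what is being checked by name, which a hand-written pattern rarely does.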

IDEs can provide syntax highlighting that makes an existing regular expression easier to understand, and some can help fix issues quickly. Nevertheless, you benefit more in the long run if the regular expression is readable and self-explanatory to begin with.

Below are some tips for writing regular expressions in PHP more effectively. Keep in mind that some of them may not work in PHP versions before PHP 7.3. They may also make a regular expression less portable to other languages.

Tips to Write Better Regular Expressions in PHP

Choice of Delimiter

Every regular expression consists of two parts: the expression and the flags. The expression is enclosed between two delimiter characters, and the flags follow the closing delimiter. For example,

/(foo|bar)/i

Here “(foo|bar)” is the expression, “i” is the flag, and the “/” character is the delimiter. The delimiter does not have to be “/”. It can be one of many other characters, such as ~, !, @, #, $, and so on. Bracket-style pairs can also be used as delimiters: {}, (), [], or <>. Using these characters might make regular expressions more readable. However, the following characters cannot be used as delimiters –

  • Alphanumeric characters
  • Multi-byte characters
  • Backslashes

Choosing the right delimiter matters because every occurrence of the delimiter character inside the expression must be escaped. The fewer escaped characters an expression contains, the easier it is to read. Picking a delimiter that is not a meta-character (a character with a special meaning in regular expressions) and that does not appear in the expression keeps the number of escaped characters down.
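To illustrate that the delimiter is only packaging, here is a small sketch that runs the same expression and flag through four different delimiters. The sample text and variable names are made up for the example.

```php
<?php
// Sample text, invented for this illustration.
$text = 'FOO fighters';

// Same expression and flag, four different delimiters.
$patterns = [
    '/(foo|bar)/i',
    '~(foo|bar)~i',
    '#(foo|bar)#i',
    '{(foo|bar)}i', // bracket-style: opening and closing characters differ
];

$results = [];
foreach ($patterns as $pattern) {
    // preg_match() returns 1 on a match and 0 otherwise.
    $results[] = preg_match($pattern, $text);
}

var_dump($results);
```

Every pattern should report a match, so the choice can be made purely on which delimiter needs the fewest escapes.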

Forward slashes are the most common delimiter, but they are a poor fit for regular expressions that contain URIs. For example,

preg_match('/^https:\/\/example\.com\/path/i', $uri);

In the above example, the forward slash is a poor choice of delimiter because the forward slashes already in the expression must be escaped, which makes the pattern hard to read. Simply changing the delimiter from / to # makes the regular expression easier to read, for example:

preg_match('#^https://example\.com/path#i', $uri);

Reducing Escape Characters

Another way to reduce the number of escaped characters in a regular expression is to take advantage of square brackets. Some meta-characters lose their special meaning when used within square brackets. For instance, characters like ., *, +, and $ have a special function in regular expressions, but not inside square brackets.

/Username: @[a-z\.0-9]/

In the above example, the dot character is escaped with a backslash. The escape is unnecessary because the dot is not a meta-character within square brackets. A few other characters need no escaping as long as they are not part of a range.

For example, a dash character (-) placed between two characters denotes a character range, but it has no special meaning anywhere else. In the regular expression /[A-Z]/, the dash builds a range that matches any character from A to Z.

If the dash is escaped (/[A\-Z]/), the regular expression matches only the characters A, Z, and -. Rather than escaping the dash (\-), move it to the end of the square brackets to avoid the escape. The regular expression /[A\-Z]/ is equivalent to /[AZ-]/, but the latter is easier to read. Unnecessary escapes do not break a regular expression, but they do hurt readability.
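The equivalences described above can be checked with a short sketch. The patterns are anchored variants of the ones in the text, and the sample strings are invented for illustration.

```php
<?php
// Each pair holds an escaped form and its unescaped equivalent.
$dotEscaped  = '/^@[a-z\.0-9]+$/';
$dotPlain    = '/^@[a-z.0-9]+$/';
$dashEscaped = '/^[A\-Z]+$/';
$dashAtEnd   = '/^[AZ-]+$/';

// Both dot forms accept the same input.
$dotResults = [
    preg_match($dotEscaped, '@joe.42'),
    preg_match($dotPlain, '@joe.42'),
];

// Both dash forms accept "A-Z" and reject "M", because neither is a range.
$dashResults = [
    preg_match($dashEscaped, 'A-Z'), preg_match($dashAtEnd, 'A-Z'),
    preg_match($dashEscaped, 'M'),   preg_match($dashAtEnd, 'M'),
];

var_dump($dotResults, $dashResults);
```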

Non-Capturing Groups

Parentheses () in a regular expression start a capturing group. Consider the following regular expression, which extracts the price from the text “Cost: ¥24”:

$pattern = '/Cost: (\$|¥)(\d+)/';
$text = 'Cost: ¥24';
preg_match($pattern, $text, $matches);

The above snippet has two capturing groups: the first captures the currency symbol and the second captures the numeric value.

The $matches variable holds the results matched by both capturing groups:

var_dump($matches);
array(3) {
  [0]=> string(10) "Cost: ¥24"
  [1]=> string(2) "¥"
  [2]=> string(2) "24"
}

A non-capturing group is useful when a group is needed for the expression itself but its match does not need to be captured, which limits the number of results passed to the $matches variable.
A non-capturing group starts with (?: and ends with ).

If only the numeric value is needed from the expression above, the (\$|¥) capturing group can be changed into a non-capturing group, i.e., (?:\$|¥).

$pattern = '/Cost: (?:\$|¥)(\d+)/';
$text = 'Cost: ¥24';
preg_match($pattern, $text, $matches);
var_dump($matches);

array(2) {
  [0]=> string(10) "Cost: ¥24"
  [1]=> string(2) "24"
}

Changing unused capturing groups to non-capturing groups limits the data assigned to the $matches array.

Named Captures

Named captures are capturing groups that are given a specific name. They name not only the returned values but also the parts of the regular expression itself.

In the same price-matching example, named capture groups give a name to each capture group:

/Cost: (?<currency>\$|¥)(?<cost>\d+)/

A named capture group starts with (?<, followed by the name of the group and a closing >, and ends with ).

In the above example, (?<currency>\$|¥) is a named capture group with the name currency, whereas (?<cost>\d+) is named cost. These names give context when reading the regular expression, and they also serve as keys for the values in the $matches variable.

$pattern = '/Cost: (?<currency>\$|¥)(?<cost>\d+)/';
$text = 'Cost: ¥24';
preg_match($pattern, $text, $matches);
var_dump($matches);

array(5) {
  [0]=> string(10) "Cost: ¥24"
  ["currency"]=> string(2) "¥"
  [1]=> string(2) "¥"
  ["cost"]=> string(2) "24"
  [2]=> string(2) "24"
}

The $matches array now contains the matched values under their names as well as their positions.

Named capture groups make the $matches values easier to understand, and the regular expression can later be changed safely as long as the names of the capture groups are preserved.
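A minimal sketch of why names are safer than positions when a pattern changes: the second pattern below adds an optional group at the front, which shifts every positional index but leaves the names alone. The sign group and the variable names are hypothetical, added only for this illustration.

```php
<?php
$text = 'Cost: ¥24';

// Original pattern: currency is group 1, cost is group 2.
$before = '/Cost: (?<currency>\$|¥)(?<cost>\d+)/';

// Revised pattern: a hypothetical optional "sign" group is inserted first,
// so the positional index of every later group shifts by one.
$after = '/Cost: (?<sign>-)?(?<currency>\$|¥)(?<cost>\d+)/';

preg_match($before, $text, $m1);
preg_match($after, $text, $m2);

var_dump($m1['cost'], $m2['cost']); // same key in both patterns
var_dump($m1[2], $m2[2]);           // same index, different groups
```

With this sample text, $m1[2] holds the cost while $m2[2] holds the currency symbol, whereas the cost key returns 24 in both cases.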

However, using duplicate names for capture groups results in an error by default: PHP Warning: preg_match(): Compilation failed: two named subpatterns have the same name (PCRE2_DUPNAMES not set) at offset … in … on line... You can use the J modifier to permit duplicate named capture groups, for example:

/Cost: (?<currency>\$|¥)?(?<cost>\d+)(?<currency>\$|¥)?/J

This regular expression has two capturing groups with the name currency, explicitly permitted by the J flag. When it is matched against a string, the named capture value holds only the last match, whereas the positional values (0, 1, 2, …) contain all matches.

$pattern = '/Cost: (?<currency>\$|¥)?(?<cost>\d+)(?<currency>\$|¥)?/J';
$text = 'Cost: ¥24$';
preg_match($pattern, $text, $matches);
var_dump($matches);

array(6) {
  [0]=> string(11) "Cost: ¥24$"
  ["currency"]=> string(1) "$"
  [1]=> string(2) "¥"
  ["cost"]=> string(2) "24"
  [2]=> string(2) "24"
  [3]=> string(1) "$"
}

Using Comments

Some regular expressions are long and extend over multiple lines. Building the regular expression in pieces, with a comment on each sub-pattern or assertion, improves readability and produces smaller diffs when reviewing commits:

– $pattern = '/Cost: (?<currency>\$|¥)(?<cost>\d+)/i';
+ $pattern = '/Cost: ';
+ $pattern .= '(?<currency>\$|¥)'; // Capture currency symbols $ or ¥
+ $pattern .= '(?<cost>\d+)'; // Capture price without decimals.
+ $pattern .= '/i'; // Flags: Case-insensitive

Alternatively, comments can be included within the regular expression itself. The x flag makes the engine ignore white space characters in the pattern, so the expression can be spaced out, aligned, or split into multiple lines:

– /Cost: (?<currency>\$|¥)(?<cost>\d+)/i
+ /Cost: \s (?<currency>\$|¥) (?<cost>\d+) /ix

In /Cost: (?<currency>\$|¥)(?<cost>\d+)/i, the space after “Cost:” is matched literally. With the x flag, white space in the pattern is ignored, so matching a white space requires the special character \s.
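The effect of the x flag on literal white space can be seen in a small sketch. The patterns use unnamed groups to keep the lines short, and the variable names are invented for the example.

```php
<?php
$text = 'Cost: ¥24';

// With the x flag the literal space after "Cost:" is ignored,
// so this pattern effectively looks for "Cost:¥24" and does not match.
$ignored = preg_match('/Cost: (\$|¥)(\d+)/x', $text);

// \s puts the required white space back into the pattern.
$explicit = preg_match('/Cost: \s (\$|¥) (\d+)/x', $text);

var_dump($ignored, $explicit);
```

The first call should report no match and the second a match, which is why any space that matters has to be written as \s once the x flag is on.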

More spacing around logical groups of sub-patterns makes the pattern easier to read. An even more effective approach is to break the expression into multiple lines and add comments:

– /Cost: (?<currency>\$|¥)(?<cost>\d+)/i
+ /Cost: # Check for the label "Cost:"
+ \s # Ensure a white-space after.
+ (?<currency>\$|¥) # Capture currency symbols $ or ¥
+ (?<cost>\d+) # Capture price without decimals.
+ /ix

// A nowdoc (quoted label) keeps \$ and the other backslashes untouched.
// Nothing but flags may follow the closing delimiter: i (case-insensitive), x (extended).
$pattern = <<<'PATTERN'
/Cost: # Check for the label "Cost:"
\s # Ensure a white-space after.
(?<currency>\$|¥) # Capture currency symbols $ or ¥
(?<cost>\d+) # Capture price without decimals.
/ix
PATTERN;
preg_match($pattern, 'Cost: $42', $matches);

Named Character Classes

Finally, regular expressions support character classes, which remove the need to pick apart a list of characters and make an expression easier to understand.

\d is one of the most frequently used character classes. \d denotes a single digit and is equivalent to [0-9], whereas \D is the inverse of \d and is equivalent to [^0-9].

A regular expression that looks for a digit followed by a non-digit can be simplified without changing its behavior:

– /Number: [0-9][^0-9]/
+ /Number: \d\D/

Regular expressions support various other character classes that make the difference even more apparent:

\w is equivalent to [A-Za-z0-9_]:
– /[A-Za-z0-9_]/
+ /\w/

[:xdigit:] is equivalent to [a-fA-F0-9]:
– /[a-fA-F0-9]/
+ /[[:xdigit:]]/

\s is equivalent to [ \t\r\n\v\f]:
– /[ \t\r\n\v\f]/
+ /\s/

More character classes become available when a regular expression is used with Unicode support (the /u flag). Unicode named character classes follow the pattern \p{EXAMPLE}, where EXAMPLE is the name of the character class. Written with an uppercase P, for example \P{FOO}, it is the inverse of that character class.

Character classes can match a whole class of characters without listing the characters themselves. For example, \p{Sc} is the named character class for currency symbols. New currency symbols will automatically begin to match as soon as they are added in a future Unicode database update.
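As a hedged illustration, the sketch below uses \p{Sc}, the Unicode category for currency symbols, together with its inverse \P{Sc}. The sample text and variable names are made up, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Sample text, invented for this illustration; assumes a UTF-8 source file.
$text = 'Cost: $5, ¥24, €3 or £1';

// \p{Sc} matches any character in the Unicode "Symbol, currency" category.
preg_match_all('/\p{Sc}/u', $text, $matches);

// \P{Sc} is the inverse: remove everything that is not a currency symbol.
$symbolsOnly = preg_replace('/\P{Sc}/u', '', $text);

var_dump($matches[0], $symbolsOnly);
```

No currency symbol is listed in either pattern, yet all four in the sample text are picked up.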

The Unicode character classes also include a very useful list of script classes for all Unicode scripts. For example, \p{Tamil} denotes all characters of the Tamil script and is roughly equivalent to [\x{0B80}-\x{0BFF}].

– $pattern = '/[\x{0B80}-\x{0BFF}]/u';
+ $pattern = '/\p{Tamil}/u';
$text = 'வணக்கம்';
$contains_tamil = preg_match($pattern, $text);

Conclusion

The next time you write a regular expression in PHP, consider the tips above. As new PHP versions keep arriving, writing readable, easy-to-understand regular expressions remains an important skill.
