There are many skills a programmer should have to be effective and valuable. Some are very general (the ability to learn and to think abstractly, for instance), while others are more specific: the various tools and tricks of skilled programming.

Among those tools are several non-programming languages all programmers should know. Those include HTML, XML, SQL, and an old one whose name doesn’t end with “L” — Regular Expressions (aka REs, aka RegEx or RegExp).

A regular expression (an RE from now on) is a pattern of letters and special characters that we use to match a piece of text. As a very simple pattern:

foobar matches “foobar”

A string of literal characters matches the same string of characters in the text. An RE with just a literal string works the same way a regular text search works. The search engine looks for bits of text that match exactly.

The power of REs comes from what the special characters can do. REs are powerful enough to define patterns that match a range of text strings. Below I’ll show you one that matches any phone number in the USA, but we’ll start simple. Let’s imagine we wanted to match both “foobar” and “foo bar” — a search string won’t match up with the space in the latter, but an RE can:

foo( )?bar matches foobar and foo bar

Parentheses enclose a group: a portion of the RE of special interest to us, in this case a single space. The question mark after the group says the group is optional. Since the parentheses enclose just one character, they aren’t necessary here. The question mark could follow the space directly, making that space optional (foo ?bar). The question mark makes whatever single thing it follows optional, be it a single character or a group.

Some consider it bad form to use an actual space in an RE, though, because some engines have a mode that ignores whitespace in the pattern (to make REs easier to read; it’s not the most readable language). For those who feel that way, the preferred form is:

foo\s?bar matches foobar and foo bar

One could also use foo(\s)?bar, but it seems unnecessarily verbose.
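To see that the three spellings behave the same on our two targets, here is a minimal sketch in Python, assuming its standard re module as the engine (the helper name matches_whole is mine, purely for illustration). It also shows the one place they differ: \s accepts any whitespace, so it lets a tab through where a literal space does not.

```python
import re

# Three spellings of "foo, an optional space, then bar".
patterns = [r"foo( )?bar", r"foo ?bar", r"foo\s?bar"]

def matches_whole(pattern, text):
    """Return True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

for pattern in patterns:
    for text in ["foobar", "foo bar", "foo  bar", "foo\tbar"]:
        # Two spaces never match: the question mark allows at most one.
        print(pattern, repr(text), matches_whole(pattern, text))
```

I used fullmatch so that each pattern has to account for the whole string; a plain search would also find foobar buried inside longer text.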

There is a rich set of backslash characters. You may recognize the \t for the Tab character (aka [Tab] and Ctrl+I), \n for the New-Line character (aka Ctrl+J), and \r for the Carriage-Return character (aka [Enter] and Ctrl+M).

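Many engines define others that are more generic: \d matches any digit, \s any whitespace character, and \w any “word” character (a letter, digit, or underscore). Beyond those, the set depends on the engine. Some (Vim’s, for one) define \a for any alphabetic character, \l for any lowercase character, \u for any uppercase one, and \x for any hex digit (all the digits plus “A-F”). Another common convention is that capitalizing the letter means “not”: \D is any non-digit, \S is any non-space character, and, where \a is supported, \A is any non-letter. Check your engine’s documentation before relying on the less common ones; in Perl-style engines, for example, \A is an anchor for the start of the text rather than a character class.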
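As one concrete data point, here is a small sketch of the most portable of these classes in Python’s re module (the sample string and variable names are made up for illustration). Other engines may differ on the less common letters, so treat this as a look at one engine rather than a universal rule.

```python
import re

sample = "Room 42, Floor 7"

# \d picks out each digit individually.
digits = re.findall(r"\d", sample)

# \S+ grabs runs of non-whitespace characters.
non_space_runs = re.findall(r"\S+", sample)

# \D+ grabs runs of anything that is not a digit.
non_digit_runs = re.findall(r"\D+", sample)

print(digits)
print(non_space_runs)
print(non_digit_runs)
```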

Let’s get more sophisticated. Let’s suppose we want to match “foobar” and “foo bar” and “foo-bar” (for fun, let’s toss in “foo_bar” and “foo:bar”). It might seem a challenge to match five different (albeit similar) strings, but an RE does it with ease:

foo[-_:\s]?bar matches foobar, foo-bar, foo_bar, foo:bar, and foo bar

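Square brackets enclose a list of characters, any one of which can match. The question mark makes that match optional. Note that square brackets can also enclose a range: for instance, [1-8] matches the digits 1 through 8, and [u-z] matches the lowercase letters u through z. That is why the hyphen comes first in our list; in that position it is taken as a literal hyphen rather than as part of a range.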
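The bracket list is easy to try out. Here is a minimal sketch in Python (the candidate strings and the name SEPARATOR_RE are my own) that keeps only the strings the pattern accepts in full.

```python
import re

# One optional separator between foo and bar: hyphen, underscore,
# colon, or any whitespace character.
SEPARATOR_RE = re.compile(r"foo[-_:\s]?bar")

candidates = [
    "foobar", "foo-bar", "foo_bar", "foo:bar", "foo bar",
    "foo+bar",   # plus sign is not in the list
    "foo--bar",  # two separators is one too many
]

accepted = [text for text in candidates if SEPARATOR_RE.fullmatch(text)]
print(accepted)

# Ranges work the same way: [1-8] accepts 8 but not 9.
print(bool(re.fullmatch(r"[1-8]", "8")), bool(re.fullmatch(r"[1-8]", "9")))
```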

If we wanted to be extremely lax about what character, if any, comes between “foo” and “bar” we can be even more generic:

foo.?bar matches foo<any-character-or-none>bar

The period (.) matches any single character (in most engines, any character except a newline).

So far we have literal characters that each match a single character in the text, and various special characters that each stand for a kind or range of single characters. What if we want four digits? We could use \d\d\d\d, but a simpler RE is:

\d{4} matches any four digits

The curly braces can do more than specify an exact length. The full form, {n,m}, specifies any length from n to m (inclusive). You can also use {n,} to mean n or more and, in engines that support it, {,n} to mean zero to n. The {n} form means exactly n.

We can use {0,} to mean zero-or-more, but it’s cleaner to use the asterisk (*):

\s*foobar\s* matches foobar and any whitespace around it
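The counting forms are easier to trust once you have watched them accept and reject a few strings. A rough sketch in Python follows (the helper whole and the labels are mine). Python happens to accept the {,n} form; I believe some engines, JavaScript among them, read it as literal text instead, so consider that line an assumption about this one engine.

```python
import re

def whole(pattern, text):
    """True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

checks = {
    "exactly four": whole(r"\d{4}", "2024"),
    "three is too few": whole(r"\d{4}", "202"),
    "two to four": whole(r"\d{2,4}", "123"),
    "five is too many": whole(r"\d{2,4}", "12345"),
    "two or more": whole(r"\d{2,}", "1234567"),
    "zero to two, given none": whole(r"\d{,2}", ""),
    # The asterisk and {0,} are two spellings of zero-or-more.
    "star": whole(r"\s*foobar\s*", "  foobar\t"),
    "brace form of star": whole(r"\s{0,}foobar\s{0,}", "  foobar\t"),
}

for label, result in checks.items():
    print(label, result)
```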

Sometimes we care about the beginning or ending of a line. The tophat (^) matches the beginning of a line:

^foobar matches foobar at the beginning of the line

The dollar sign ($) matches the end of the line:

foobar$ matches foobar at the end of the line

Which means that:

^foobar$ matches a line with only foobar in it

And:

^$ matches a blank line with nothing in it

A final example:

^foo.*bar$ matches a line beginning with foo and ending with bar and nothing or anything in between
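Here are the anchor patterns run against a handful of made-up lines, as a minimal sketch in Python (the list of lines and the helper lines_matching are mine). Each line is searched on its own, so ^ and $ mean the start and end of that line.

```python
import re

lines = ["foobar", "foobar baz", "baz foobar", "", "foo and then bar"]

def lines_matching(pattern):
    """Return the lines in which the pattern is found."""
    return [line for line in lines if re.search(pattern, line)]

starts = lines_matching(r"^foobar")      # foobar at the beginning
ends = lines_matching(r"foobar$")        # foobar at the end
only = lines_matching(r"^foobar$")       # nothing but foobar
blank = lines_matching(r"^$")            # empty lines
wrapped = lines_matching(r"^foo.*bar$")  # foo ... bar with anything between

for name, found in [("starts", starts), ("ends", ends), ("only", only),
                    ("blank", blank), ("wrapped", wrapped)]:
    print(name, found)
```

If you search one big multi-line string instead of separate lines, Python needs the re.MULTILINE flag before ^ and $ will match at every line boundary; other engines have their own switch for this.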

As you can see, REs pack a fair amount of power under the hood (and we’ve only just scratched the surface).

Now the phone number pattern should make some sense:

(1-)?([0-9]{3}-)?[0-9]{3}-[0-9]{4} matches a hyphen-separated USA telephone number

It’s a bit to unpack, but take a moment and try it. Can you see how it works?
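One way to check your reading of it is to run it. Below is a minimal sketch in Python (the sample numbers and the name PHONE_RE are my own inventions) that reports which strings the pattern accepts in full. Note that it only knows the hyphenated style; numbers written with parentheses, spaces, or no punctuation at all are rejected.

```python
import re

PHONE_RE = re.compile(r"(1-)?([0-9]{3}-)?[0-9]{3}-[0-9]{4}")

samples = [
    "555-1234",          # seven digits only
    "800-555-1234",      # with area code
    "1-800-555-1234",    # with leading 1- and area code
    "5551234",           # no hyphen
    "(800) 555-1234",    # a different punctuation style
]

results = {s: PHONE_RE.fullmatch(s) is not None for s in samples}
for number, ok in results.items():
    print(number, ok)
```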

§ §

Intermission

It has been said, “If you have a problem, and you solve it with a regular expression, now you have two problems.” As you may see by now, there is some justification in the saying. REs do look a little like line noise, and they are infamously hard to maintain (because first you have to figure out what a long, complicated one even does; they are anything but self-documenting).

So, a double bottom line: Every programmer should know regular expressions; and every programmer should avoid using them as much as possible. But used sparingly and judiciously, they are a powerful tool. And many tools use them, so knowing them gives you access to powerful search (and replace) capabilities.

§ §

One thing that makes them so tempting for programmers is that, in most programming languages that implement regular expressions, text from groups (bits of the RE enclosed in parentheses) is available to the programmer.

We can use that phone number pattern, for instance, to extract the digits into specific variables. First, in case you didn’t bother trying to parse it, let me quickly explain what it does. Here it is again with some minor changes that will turn out to be helpful:

(1-)? ([0-9]{3}-)? ([0-9]{3}-) ([0-9]{4})

The first group matches a literal 1- sequence (a one followed by a hyphen); the question mark after the group makes it optional. The second group matches any three digits followed by a hyphen; its question mark makes it optional too (a possible area code). Together they form the 1-### area code prefix that might precede a phone number (and the 1- may or may not be there).

The change to this version adds parentheses to create two more explicit groups. For clarity I’ve added spaces between the groups. They should be ignored; they are not part of the pattern.

The third group, like the second, matches any three digits followed by a hyphen. There is no question mark, so this group is not optional. It must find a match for the RE to succeed. The fourth group, also required, matches any four digits. These last two groups comprise the standard USA seven-digit phone number.
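Some engines will ignore those spaces for you. As a sketch, assuming Python’s re module and its re.VERBOSE flag (the constant names are mine), the spaced-out pattern compiles as written, and the same mode allows a comment beside each group, which goes some way toward documenting the thing.

```python
import re

# One group per line, each with a comment. VERBOSE mode drops the
# whitespace and everything from # to the end of the line.
PHONE_VERBOSE_RE = re.compile(
    r"""
    (1-)?           # group 1: optional leading 1-
    ([0-9]{3}-)?    # group 2: optional area code plus hyphen
    ([0-9]{3}-)     # group 3: required three digits plus hyphen
    ([0-9]{4})      # group 4: required final four digits
    """,
    re.VERBOSE,
)

# The single-line spaced form behaves identically under the same flag.
PHONE_SPACED_RE = re.compile(
    r"(1-)? ([0-9]{3}-)? ([0-9]{3}-) ([0-9]{4})", re.VERBOSE
)

print(PHONE_VERBOSE_RE.fullmatch("1-800-555-1234").groups())
print(PHONE_SPACED_RE.fullmatch("555-1234").groups())
```

Without the flag, the spaces are literal characters and the pattern would demand them in the phone number itself.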

§

In most programming languages with regular expressions, if a line of text matches the phone number RE, the text in the groups is available, usually in some kind of groups array. The code can check the first two slots to see if an area code was used (and the leading one). A successful match guarantees the third and fourth slots will have the two parts of the matched phone number.
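Here is roughly what that looks like in Python, as a minimal sketch (the function split_phone, the dictionary keys, and the sample lines are all hypothetical). In Python the groups come back as a tuple from match.groups(), with None in the slot of any optional group that did not take part in the match.

```python
import re

PHONE_GROUPS_RE = re.compile(r"(1-)?([0-9]{3}-)?([0-9]{3}-)([0-9]{4})")

def split_phone(line):
    """Pull the parts of the first phone number found in a line of text."""
    match = PHONE_GROUPS_RE.search(line)
    if match is None:
        return None  # no phone number anywhere in the line
    one, area, first3, last4 = match.groups()
    return {
        "has_leading_one": one is not None,
        # Optional groups are None when absent; strip the trailing hyphen otherwise.
        "area_code": area.rstrip("-") if area else None,
        "prefix": first3.rstrip("-"),
        "line": last4,
    }

print(split_phone("call 1-800-555-1234 today"))
print(split_phone("local: 555-1234"))
print(split_phone("no number here"))
```

Checking for None before touching the groups matters: the guarantee about the third and fourth slots only holds once the match has succeeded.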

Point is, regular expressions are great for extracting desired text from varying inputs (log files, for instance). But such patterns tend to be nightmares to maintain, so use them with caution and document them heavily.

§ §

As with most powerful tools, regular expressions are problematic. They can be serious workhorses in the right hands, but utter disasters (and inspiration for that saying about two problems) in the wrong ones.

Ø