

Regular Expressions

One of Perl's original applications was text processing (see section A Brief History of Perl). So far, we have seen how easy manipulation of scalar and list data is in Perl, but we have yet to explore the core of Perl's text processing facilities: regular expressions. To remedy that, this chapter is devoted completely to regular expressions.

The Theory Behind It All

Regular expressions are a concept borrowed from automata theory. Regular expressions provide a way to describe a "language" of strings.

The term "language", when used in the sense borrowed from automata theory, can be a bit confusing. A language in automata theory is simply some (possibly infinite) set of strings. Each string (which may be empty) is a sequence of characters drawn from a fixed, finite set. In our case, this set will be all the possible ASCII characters (10).

When we write a regular expression, we are writing a description of some set of possible strings. For the regular expression to have meaning, this set of possible strings that we are defining should have some meaning to us.

Regular expressions give us a great deal of power to do pattern matching on text documents. We can use the regular expression syntax to write a succinct description of the entire, possibly infinite class of strings that fit our specification. In addition, anyone else who understands the description language of regular expressions can easily read our description and determine what set of strings we want to match. Regular expressions are a universal description for matching regular strings.

When we discuss regular expressions, we discuss "matching". If a regular expression "matches" a given string, then that string is in the class we described with the regular expression. If it does not match, then the string is not in the desired class.

The Simple

We can start our discussion of regular expressions by considering the simplest operators, which can be used to create all possible regular expressions (11). All the other regular expression operators can be reduced to combinations of these simple operators.

Simple Characters

In regular expressions, generally, a character matches itself. The only exceptions are regular expression special characters. To match one of these special characters, you must put a \ before the character.

For example, the regular expression abc matches a set of strings that contain abc somewhere in them. Since * happens to be a regular expression special character, the regular expression \* matches any string that contains the * character.
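As a minimal sketch of these two patterns, the following uses Perl's =~ match operator, which is covered in the Pattern Matching section later in this chapter. The helper names and sample strings are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;

# Each helper returns 1 if its pattern matches anywhere in the string, else 0.
# The helper names are hypothetical, not part of Perl itself.
sub contains_abc  { my ($s) = @_; return $s =~ /abc/ ? 1 : 0; }
sub contains_star { my ($s) = @_; return $s =~ /\*/  ? 1 : 0; }

# Hypothetical sample strings.
for my $s ('xxabcxx', 'ab c', '2 * 3', '2 + 3') {
    printf "%-8s abc:%d star:%d\n", $s, contains_abc($s), contains_star($s);
}
```

Only the first sample contains abc, and only the third contains a literal * character.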

The * Special Character

As we mentioned, * is a regular expression special character. The * is used to indicate that zero or more occurrences of the previous character should be matched. Thus, the regular expression a* will match any string that contains zero or more a's.

Note that since a* will match any string with zero or more a's, a* will match all strings, since all strings (including the empty string) contain at least zero a's. So, a* is not a very useful regular expression.

A more useful regular expression might be baa*. This regular expression will match any string that has a b, followed by one or more a's. Thus, the set of strings we are matching is those that contain ba, baa, baaa, etc. In other words, we are looking to see if there is any "sheep speech" hidden in our text.
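The difference between a* and baa* can be checked with a short sketch. It relies on Perl's =~ match operator, which is covered in the Pattern Matching section later in this chapter. The helper names and test strings are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers: each returns 1 on a match and 0 otherwise.
sub matches_a_star   { my ($s) = @_; return $s =~ /a*/   ? 1 : 0; }
sub has_sheep_speech { my ($s) = @_; return $s =~ /baa*/ ? 1 : 0; }

# Hypothetical sample strings, including the empty string.
for my $s ('', 'xyz', 'b', 'ba', 'the sheep said baaaa') {
    printf "'%s' a*:%d baa*:%d\n", $s, matches_a_star($s), has_sheep_speech($s);
}
```

Every sample, even the empty string, matches a*, while baa* matches only the samples in which a b is immediately followed by at least one a.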

The . Character

The next special character we will consider is the . character. The . will match any single character (in Perl, by default, any character except a newline). As an example, consider the regular expression a.c. This regular expression will match any string that contains an a, followed by any one character, followed by a c. Thus, strings that contain abc, acc, amc, etc. are all in the class of strings that this regular expression matches.
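Here is a small sketch of a.c in action, using Perl's =~ match operator (covered in the Pattern Matching section later in this chapter). The helper name and the samples are hypothetical. The last sample illustrates that, in Perl, the . does not match a newline unless the /s modifier is used.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if a.c matches somewhere in the string.
sub matches_a_dot_c { my ($s) = @_; return $s =~ /a.c/ ? 1 : 0; }

# Each entry pairs a printable label with the actual string to test.
my @samples = (
    [ 'abc',     'abc'     ],
    [ 'acc',     'acc'     ],
    [ 'xxamcxx', 'xxamcxx' ],
    [ 'ac',      'ac'      ],   # nothing between the a and the c
    [ 'a\nc',    "a\nc"    ],   # a newline between the a and the c
);

for my $pair (@samples) {
    my ($label, $string) = @$pair;
    printf "%-8s %d\n", $label, matches_a_dot_c($string);
}
```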

The | Character

The | special character is equivalent to an "or" in regular expressions. This character is used to give a choice. So, the regular expression abc|def will match any string that contains either abc or def.
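The choice can be sketched as follows, again using Perl's =~ match operator (covered in the Pattern Matching section later in this chapter). The helper name and sample strings are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the string contains abc or def.
sub has_abc_or_def { my ($s) = @_; return $s =~ /abc|def/ ? 1 : 0; }

# Hypothetical sample strings.
for my $s ('xxabcxx', 'xxdefxx', 'abdef', 'abde') {
    printf "%-8s %d\n", $s, has_abc_or_def($s);
}
```

The last sample contains neither abc nor def in full, so it is the only one that does not match.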

Grouping with ()s

Sometimes, within regular expressions, we want to group things together. Doing this allows building of larger regular expressions based on smaller components. The ()'s are used for grouping.

For example, if we want to match any string that contains abc or def, zero or more times, surrounded by an xx on either side, we could write the regular expression xx(abc|def)*xx. This applies the * character to everything that is in the parentheses. Thus we can match strings such as xxabcxx, xxabcdefxx, etc.
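A short sketch of the grouped expression, using Perl's =~ match operator (covered in the Pattern Matching section later in this chapter). The helper name and samples are hypothetical. Note that xxxx also matches, since the group may occur zero times.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if xx(abc|def)*xx matches in the string.
sub matches_grouped { my ($s) = @_; return $s =~ /xx(abc|def)*xx/ ? 1 : 0; }

# Hypothetical sample strings.
for my $s ('xxabcxx', 'xxabcdefxx', 'xxxx', 'xxabxx') {
    printf "%-11s %d\n", $s, matches_grouped($s);
}
```

The last sample fails because ab is neither abc nor def, so nothing can bridge the two xx pairs.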

The Anchor Characters

Sometimes, we want to apply the regular expression from a defined point. In other words, we want to anchor the regular expression so it is not permitted to match anywhere in the string, just from a certain point.

The anchor operators allow us to do this. When we start a regular expression with a ^, it anchors the regular expression to the beginning of the string. This means that whatever the regular expression starts with must be matched at the beginning of the string. For example, ^aa* does not match every string that contains one or more a's; it matches only strings that start with one or more a's.

We can also use the $ at the end of the regular expression to anchor it at the end of the string. If we apply this to our last regular expression, we have ^aa*$, which now matches only those strings that consist of one or more a's (in Perl, the $ also permits a single trailing newline). This makes it clear that the regular expression cannot just look anywhere in the string; rather, it must be able to match the entire string exactly, or it will not match at all.

In most cases, you will want to anchor a regular expression to the start of the string, the end of the string, or both. Using a regular expression without some sort of anchor can produce confusing and strange results, although it is occasionally useful.
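To compare the unanchored and anchored forms side by side, here is a minimal sketch using Perl's =~ match operator (covered in the Pattern Matching section later in this chapter). The helper names and sample strings are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers: each returns 1 on a match and 0 otherwise.
sub contains_as { my ($s) = @_; return $s =~ /aa*/   ? 1 : 0; }  # no anchor
sub starts_as   { my ($s) = @_; return $s =~ /^aa*/  ? 1 : 0; }  # anchored at start
sub only_as     { my ($s) = @_; return $s =~ /^aa*$/ ? 1 : 0; }  # anchored at both ends

# Hypothetical sample strings.
for my $s ('banana', 'aab', 'aaa') {
    printf "%-7s contains:%d starts:%d only:%d\n",
        $s, contains_as($s), starts_as($s), only_as($s);
}
```

All three samples contain an a, only aab and aaa start with one, and only aaa consists of nothing but a's.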

Pattern Matching

Now that you are familiar with some of the basics of regular expressions, you probably want to know how to use them in Perl. Doing so is very easy. There is an operator, =~, that you can use to match a regular expression against scalar variables. Regular expressions in Perl are placed between two forward slashes (i.e., //). The whole $scalar =~ // expression will evaluate to a true value (1) if a match occurs, and to a false value (the empty string) if it does not.

Consider the following code sample:

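use strict;
# Read standard input one line at a time; "my" declares the variable,
# which "use strict" requires.
while ( defined(my $currentLine = <STDIN>) ) {
    # Keep only lines that begin with "JMS speaks:" or "RMS speaks:".
    if ($currentLine =~ /^(J|R)MS speaks:/) {
        print $currentLine;
    }
}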

This code will go through each line of the input, and print only those lines that start with "JMS speaks:" or "RMS speaks:".
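The same test can be tried without typing input by hand. The sketch below applies the same regular expression to a hypothetical in-memory list of lines, using Perl's built-in grep to keep the elements for which the match is true. The sample lines are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical input lines standing in for what <STDIN> would supply.
my @lines = (
    "JMS speaks: hello\n",
    "RMS speaks: hi\n",
    "Larry speaks: hey\n",
    "And then JMS speaks: again\n",
);

# Keep only the lines for which the match expression is true.
my @kept = grep { $_ =~ /^(J|R)MS speaks:/ } @lines;

print @kept;
```

The fourth line contains "JMS speaks:" but does not start with it, so the ^ anchor excludes it.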

Regular Expression Shortcuts

Writing out regular expressions can be problematic. For example, if we want to have a regular expression that matches any single digit, we have to write:

(0|1|2|3|4|5|6|7|8|9)

It would be terribly annoying to have to write such things out. So, Perl gives an incredible number of shortcuts for writing regular expressions. These are largely syntactic sugar, since we could write out regular expressions in the same way we did above. However, that is too cumbersome.

For example, for ranges of values, we can use the brackets, []'s. So, for our digit expression above, we can write [0-9]. In fact, it is even easier in Perl, because \d will match that very same thing (at least for ASCII text).

There are lots of these kinds of shortcuts. They are listed in the `perlre' manual page and in many other places, so there is no need to list them again here.
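As a quick check that the three spellings agree, the following sketch tries each one against a few hypothetical ASCII strings. The helper names are invented for illustration, and the comparison assumes plain ASCII input.

```perl
use strict;
use warnings;

# Three hypothetical helpers, one per way of writing "a digit".
sub digit_by_alternation { my ($s) = @_; return $s =~ /(0|1|2|3|4|5|6|7|8|9)/ ? 1 : 0; }
sub digit_by_range       { my ($s) = @_; return $s =~ /[0-9]/ ? 1 : 0; }
sub digit_by_shortcut    { my ($s) = @_; return $s =~ /\d/    ? 1 : 0; }

# Hypothetical ASCII sample strings.
for my $s ('7', 'room 42', 'no digits') {
    printf "%-10s alternation:%d range:%d shortcut:%d\n",
        $s, digit_by_alternation($s), digit_by_range($s), digit_by_shortcut($s);
}
```

For each of these samples, all three helpers give the same answer.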

However, as you learn about all the regular expression shortcuts, remember that they can all be reduced to the original operators we discussed above. They are simply short ways of saying things that can be built with regular characters, *, (), and |.

