Using Regular Expressions


RText uses Java regular expressions, so the definitive reference on this topic is Sun's Javadoc. Go to the Java home page at http://java.sun.com, search for the latest API Specification, and open the documentation for the Pattern class.

However, for the eager, what follows is a brief tutorial containing everything you probably want to know. This tutorial assumes you already understand the basic concepts of regular expressions.

I. The Basics
Java regular expressions provide the constructs you are likely to need when searching for text, including character classes, greedy and reluctant quantifiers, and back references.

The basics are all here:


Regular Expression     Matches
ab                     ab
a*b                    0 or more a's, followed by a b
a+b                    1 or more a's, followed by a b
a{n}b                  Exactly n a's, followed by a b
a{n,}b                 n or more a's, followed by a b
a{n,m}b                Between n and m a's (inclusive), followed by a b
a?b                    0 or 1 a's, followed by a b
car[sd]                Either cars or card
ca(rt|talog|nary)      Any of cart, catalog, or canary
a.c                    A three-character string: a, followed by any character, followed by c
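
The constructs listed above can be tried out directly with the java.util.regex package. The following is a minimal sketch, not part of RText itself; the class name BasicsDemo, the helper fullMatch, and the sample strings are made up for illustration. Note that Pattern.matches tests the entire input, whereas a search in an editor looks for a match anywhere in the text.

```java
import java.util.regex.Pattern;

public class BasicsDemo {
    // Hypothetical helper: true if the ENTIRE input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("a*b", "b"));                     // true: zero a's are allowed
        System.out.println(fullMatch("a+b", "b"));                     // false: at least one a is required
        System.out.println(fullMatch("a{2,3}b", "aaab"));              // true: three a's is within 2..3
        System.out.println(fullMatch("a?b", "aab"));                   // false: at most one a
        System.out.println(fullMatch("car[sd]", "card"));              // true
        System.out.println(fullMatch("ca(rt|talog|nary)", "catalog")); // true
        System.out.println(fullMatch("a.c", "a-c"));                   // true: . matches the hyphen
    }
}
```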

II. Characters and Character Classes
\\          Backslash
\t          Tab
\n          Newline
[abc]       Any of a, b, or c
[^abc]      Any character except a, b, or c
.           Any character (by default, Java excludes line terminators)
\d          Any digit (equivalent to [0-9])
\D          Any non-digit (equivalent to [^0-9])
\s          A whitespace character
\S          A non-whitespace character
\w          A word character (equivalent to [a-zA-Z_0-9]; note that the underscore is included)
\W          A non-word character (equivalent to [^\w])
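
As a rough illustration of these character classes, here is a small sketch using java.util.regex. The class name CharClassDemo and the helper fullMatch are hypothetical. The doubled backslashes are only Java string-literal escaping; the regular expressions themselves are the ones listed above.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // Hypothetical helper: true if the ENTIRE input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // In Java source code, each regex backslash is written as "\\".
        System.out.println(fullMatch("\\d\\d", "42"));      // true: two digits
        System.out.println(fullMatch("\\D", "7"));          // false: 7 is a digit
        System.out.println(fullMatch("\\w+", "foo_bar1"));  // true: underscore is a word character
        System.out.println(fullMatch("\\w+", "foo-bar"));   // false: hyphen is not a word character
        System.out.println(fullMatch("[^abc]", "d"));       // true
        System.out.println(fullMatch("\\s", "\t"));         // true: a tab is whitespace
        System.out.println(fullMatch("\\\\", "\\"));        // true: regex \\ matches one backslash
    }
}
```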

III. Boundary Characters
^           The beginning of a line
$           The end of a line
\b          A word boundary
\B          A non-word boundary
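
The boundary matchers can be sketched as follows. In plain java.util.regex, ^ and $ match at line boundaries only when the Pattern.MULTILINE flag is set (otherwise they match at the start and end of the whole input); whether RText sets this flag is not stated here, so the sketch sets it explicitly. The class name BoundaryDemo, the helper findAll, and the sample text are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryDemo {
    // Hypothetical helper: collects every match of the pattern in the text.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "cat concat\ncatalog cat";

        // \b: "cat" only as a whole word, not inside "concat" or "catalog".
        System.out.println(findAll(Pattern.compile("\\bcat\\b"), text));                   // [cat, cat]

        // ^ with MULTILINE: a word starting with "cat" at the beginning of each line.
        System.out.println(findAll(Pattern.compile("^cat\\w*", Pattern.MULTILINE), text)); // [cat, catalog]

        // $ with MULTILINE: "cat" at the end of each line (including the tail of "concat").
        System.out.println(findAll(Pattern.compile("cat$", Pattern.MULTILINE), text));     // [cat, cat]
    }
}
```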

IV. Back References
Back references let you capture a subsequence of a regex match and reuse that subsequence later in the same regex. The subsequence is captured by a capturing group, which is enclosed in parentheses; a back reference then refers to that group by number. For example, in the regular expression:

     Fred([ \t])Joe\1Sue\1Tommy,

the first capturing group is ([ \t]). Anywhere after it in the regular expression, you can refer back to it with \1; a match must then contain, at that location, the exact text that the capturing group matched. Thus, in the example, Fred, Joe, Sue, and Tommy are all separated by the same character, either a space or a tab.
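
The Fred/Joe/Sue/Tommy example can be checked with a short sketch like the one below. It treats the trailing comma after the expression as sentence punctuation rather than part of the regex; the class name BackRefDemo and the helper sameSeparator are hypothetical.

```java
import java.util.regex.Pattern;

public class BackRefDemo {
    // The regex from the example; "\\t" and "\\1" are Java escapes for \t and \1.
    static final Pattern NAMES = Pattern.compile("Fred([ \\t])Joe\\1Sue\\1Tommy");

    // Hypothetical helper: true if the whole input matches the pattern.
    static boolean sameSeparator(String input) {
        return NAMES.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(sameSeparator("Fred Joe Sue Tommy"));     // true: all spaces
        System.out.println(sameSeparator("Fred\tJoe\tSue\tTommy")); // true: all tabs
        System.out.println(sameSeparator("Fred Joe\tSue Tommy"));    // false: mixed separators
    }
}
```

The last call fails because \1 must repeat whatever the group matched the first time (a space), and a tab appears there instead.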

Note that a regular expression can contain multiple capturing groups. Groups are numbered by the order of their opening parentheses, from left to right: the first group is referred to as \1, the second as \2, and so on. Note also that capturing groups can be nested inside one another; that is,

     ((A)B)\1\2

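is valid. Here \1 refers to the outer group (which matches AB) and \2 to the inner group (which matches A), so the expression matches ABABA.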
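
To see how nested groups are numbered, the following sketch prints what each group captures for the expression above. The class name NestedGroupDemo and the helper groupOf are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedGroupDemo {
    // Group 1 is the outer ((A)B); group 2 is the inner (A).
    static final Pattern NESTED = Pattern.compile("((A)B)\\1\\2");

    // Hypothetical helper: text captured by the given group,
    // or null if the whole input does not match.
    static String groupOf(String input, int group) {
        Matcher m = NESTED.matcher(input);
        return m.matches() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        System.out.println(groupOf("ABABA", 1)); // AB
        System.out.println(groupOf("ABABA", 2)); // A
        System.out.println(groupOf("ABAB", 1));  // null: the trailing \2 (A) is missing
    }
}
```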

See also: Find, Find Next, Replace