Here is a real-life example of a complex regular expression that covers the operators we have seen so far: sentence-end, a variable Emacs uses to recognize the ends of sentences for sentence motion commands like forward-sentence (M-e). Its value is:
"[.?!][]\"')}]*\\($\\|\t\\|  \\)[ \t\n]*"
Let's look at this piece by piece. The first character set, [.?!], matches a period, question mark, or exclamation mark (the first two of these are regular expression operators, but they have no special meaning within character sets). The next part, []\"')}]*, consists of a character set containing right bracket, double quote, single quote, right parenthesis, and right curly brace. A * follows the set, meaning that zero or more occurrences of any of the characters in the set match. So far, then, this regexp matches a sentence-ending punctuation mark followed by zero or more closing brackets, quotes, parentheses, or curly braces. Next, there is the group \\($\\|\t\\|  \\), which matches any of the three alternatives $ (end of line), Tab, or two spaces. Finally, [ \t\n]* matches zero or more spaces, tabs, or newlines. Thus the sentence-ending characters can be followed by end-of-line or a combination of spaces (at least two), tabs, and newlines.
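To see the pieces working together, here is a small sketch that tries the regexp on a few strings with string-match. The function name demo-sentence-end-position is hypothetical and exists only for this illustration; the regexp is copied into a local variable so that the real sentence-end setting is left alone.

```elisp
;; Hypothetical helper for illustration; it is not part of Emacs.
;; The regexp from the text is bound locally, so the real
;; `sentence-end' variable is never touched.
(defun demo-sentence-end-position (text)
  "Return the index in TEXT where the first sentence end matches, or nil."
  (let ((demo-regexp "[.?!][]\"')}]*\\($\\|\t\\|  \\)[ \t\n]*"))
    (string-match demo-regexp text)))

;; Period, closing quote, then two spaces: matches at the period.
(demo-sentence-end-position "He said \"stop.\"  Then he left.")

;; Periods followed by a letter or by a single space: no match (nil).
(demo-sentence-end-position "e.g. this")

;; Question mark at the very end of the string: `$' matches there.
(demo-sentence-end-position "Really?")
```

The second call shows why the two-space alternative matters: a period followed by only one space is not treated as the end of a sentence.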
There are other context operators besides ^ and $; two of them can be used to make regular expression search act like word search. The operators \\< and \\> match the beginning and end of a word, respectively. With these we can go part of the way toward solving Example 3. The regular expression \\<program\\> matches "program" but not "programmer" or "programming" (it also won't match "microprogram"). So far so good; however, it won't match "program's" or "programs." For this, we need a more complex regular expression:
\\<program\\('s\\|s\\)?\\>
This expression means, "a word beginning with program followed optionally by apostrophe s or just s." This does the trick as far as matching the right words goes.
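A quick way to check both expressions is to run them over a list of candidate words and keep only those that are matched in full. The sketch below does that; demo-matching-words is a hypothetical helper written for this illustration, and binding case-fold-search to nil is an assumption made so that only lowercase "program" counts.

```elisp
;; Hypothetical helper for illustration; it is not part of Emacs.
(defun demo-matching-words (regexp words)
  "Return the elements of WORDS that REGEXP matches from start to end."
  (let ((case-fold-search nil)          ; assumption: match case exactly
        (matches '()))
    (dolist (word words (nreverse matches))
      (when (and (string-match regexp word)
                 (= (match-beginning 0) 0)
                 (= (match-end 0) (length word)))
        (push word matches)))))

(defvar demo-candidate-words
  '("program" "programs" "program's"
    "programmer" "programming" "microprogram")
  "Sample words for the demonstration.")

;; The simple expression accepts only the bare word.
(demo-matching-words "\\<program\\>" demo-candidate-words)

;; The longer expression also accepts the plural and the possessive.
(demo-matching-words "\\<program\\('s\\|s\\)?\\>" demo-candidate-words)
```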
11.3.2.4 Retrieving portions of matches
There is still one piece missing: the ability to replace "program" with "module" while leaving any s or 's untouched. This leads to the final regular expression feature we will cover here: the ability to retrieve portions of the matched string for later use. The preceding regular expression is indeed the correct one to give as the search string for replace-regexp. As for the replace string, the answer is module\\1; in other words, the required Lisp code is:
(replace-regexp "\\<program\\('s\\|s\\)?\\>" "module\\1")
The \\1 means, in effect, "substitute the portion of the matched string that matched the subexpression inside the \\( and \\)." It is one of the few regular-expression-related operators that can be used in replacements (another is \\&, which stands for the entire matched string). In this case, it means to use 's in the replace string if the match was "program's," s if the match was "programs," or nothing if the match was just "program." The result is the correct substitution of "module" for "program," "modules" for "programs," and "module's" for "program's."
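To try the substitution without touching a real file, the same search and replace strings can be applied to text in a temporary buffer. This is only a sketch: demo-program-to-module is a hypothetical name, and it uses the re-search-forward and replace-match loop that Lisp programs commonly use in place of replace-regexp.

```elisp
;; Hypothetical helper for illustration; it is not part of Emacs.
(defun demo-program-to-module (text)
  "Return TEXT with program, programs, and program's renamed to module forms."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (let ((case-fold-search nil))       ; assumption: match case exactly
      (while (re-search-forward "\\<program\\('s\\|s\\)?\\>" nil t)
        ;; \\1 inserts whatever group 1 matched, or nothing if the
        ;; optional group did not take part in the match.
        (replace-match "module\\1" t)))
    (buffer-string)))

(demo-program-to-module
 "The program's programs call a programmer's program.")
```

Under these assumptions, "programmer's" is left alone because the word does not end right after "program", while the other three forms are rewritten with their endings intact.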
Another example of this feature solves Example 4. To match filenames <filename>.c and replace them with <filename>.java, use the Lisp code:
(replace-regexp "\\([a-zA-Z0-9_]+\\)\\.c" "\\1.java")
Remember that \\. means a literal dot (.). Note also that the filename pattern (which matches a series of one or more alphanumerics or underscores) was surrounded by \\( and \\) in the search string for the sole purpose of retrieving it later with \\1.
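The same replacement can be tried on a plain string with replace-regexp-in-string, which accepts the same search and replace strings. This is a minimal sketch; demo-c-to-java is a hypothetical name used only here.

```elisp
;; Hypothetical helper for illustration; it is not part of Emacs.
(defun demo-c-to-java (text)
  "Return TEXT with each <filename>.c rewritten as <filename>.java."
  ;; The final t keeps the replacement's letter case exactly as written.
  (replace-regexp-in-string "\\([a-zA-Z0-9_]+\\)\\.c" "\\1.java" text t))

(demo-c-to-java "Compile main.c and util_2.c")
```

One thing to watch for: as written, the search string says nothing about what follows the c, so a name such as foo.cpp would also be affected (it would come out as foo.javapp). Appending \\> to the search string is one possible way to guard against that.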
Actually, the \\1 operator is only a special case of a more powerful facility (as you may have guessed). In general, if you surround a portion of a regular expression with \\( and \\), the string matching the parenthesized subexpression is saved. When you specify the replace string, you can retrieve the saved substrings with \\n, where n is the number of the parenthesized subexpression from left to right, starting with 1. Parenthesized expressions can be nested; their corresponding \\n numbers are assigned in order of their \\( delimiter from left to right.
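The numbering of nested groups is easiest to see with a small experiment. The sketch below uses an invented date-like pattern (not one from Emacs itself) in which group 1 encloses groups 2 and 3, with group 4 following; both function names are hypothetical.

```elisp
;; Hypothetical helpers for illustration; they are not part of Emacs.
;; Groups are numbered by the position of their opening \\( :
;;   1 = year-month, 2 = year, 3 = month, 4 = day
(defvar demo-date-regexp
  "\\(\\([0-9]+\\)-\\([0-9]+\\)\\)-\\([0-9]+\\)"
  "Invented pattern with one nested pair of groups.")

(defun demo-nested-groups (text)
  "Return the substrings saved by groups 1 through 4 when matching TEXT."
  (when (string-match demo-date-regexp text)
    (list (match-string 1 text)
          (match-string 2 text)
          (match-string 3 text)
          (match-string 4 text))))

(defun demo-reorder-date (text)
  "Return TEXT with year-month-day dates rewritten as day/month/year."
  (replace-regexp-in-string demo-date-regexp "\\4/\\3/\\2" text t))

(demo-nested-groups "2024-06-15")
(demo-reorder-date "2024-06-15")
```

The first call returns the saved substrings in group order, with the outer group ("2024-06") first because its \\( comes first; the second call shows that the replace string is free to use the groups in any order.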
Lisp code that takes full advantage of this feature tends to contain complicated regular expressions. The best example of this in Emacs's own Lisp code is compilation-error-regexp-alist, the list of regular expressions the compile package (discussed in Chapter 9) uses to parse error messages from compilers. Here is an excerpt, adapted from the Emacs source code (it's become much too long to reproduce in its entirety; see below for some hints on how to find the actual file to study in its full glory):
(defvar compilation-error-regexp-alist
'(
;; NOTE! See also grep-regexp-alist, below.
;; 4.3BSD grep, cc, lint pass 1:
;; /usr/src/foo/foo.c(8): warning: w may be used before set
;; or GNU utilities:
;; foo.c:8: error message
;; or HP-UX 7.0 fc:
;; foo.f :16 some horrible error message
;; or GNU utilities with column (GNAT 1.82):
;; foo.adb:2:1: Unit name does not match file name
;; or with column and program name:
;; jade:dbcommon.dsl:133:17:E: missing argument for function call
;;
;; We'll insist that the number be followed by a colon or closing
;; paren, because otherwise this matches just about anything
;; containing a number with spaces around it.
;; We insist on a non-digit in the file name
;; so that we don't mistake the file name for a command name
;; and take the line number as the file name.
("\\([a-zA-Z][-a-zA-Z._0-9]+: ?\\)?\
\\([a-zA-Z]?:?[^:( \t\n]*[^:( \t\n0-9][^:( \t\n]*\\)[:(][ \t]*\\([0-9]+\\)\
\\([) \t]\\|:\\(\\([0-9]+:\\)\\|[0-9]*[^:0-9]\\)\\)" 2 3 6)
;; Microsoft C/C++:
;; keyboard.c(537) : warning C4005: 'min' : macro redefinition
;; d:\tmp\test.c(23) : error C2143: syntax error : missing ';' before 'if'
;; This used to be less selective and allow characters other than
;; parens around the line number, but that caused confusion for