RegExp.com
Wide range of in-depth information

Submatches, Groups and Backreferences

Regular Expressions - User guide

Submatches (or groups or backreferences)

Some regular expression implementations save the text matched by each part of the expression enclosed in parentheses (called a submatch, group or backreference) in variables that may subsequently be used or substituted in an expression. Because there may be more than one such part, the variables are numbered, usually $1 to $9: $1 will contain the first submatch, $2 will contain the second submatch and so on.

Example:

# assume target string = "cat"
search expression = (c|a)(t|z)
$1 will contain "a"
# $1 contains "a" because the match found is "at":
# (c|a) must be followed immediately by (t|z),
# and the "c" in "cat" is followed by "a"
# if the target string was "act"
# the match would be "ct" and $1 would contain "c"
$2 will contain "t"
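
The example can be tried in any language that exposes submatches. The following is a minimal sketch in Python, chosen here only for illustration (the original text names Perl and Ruby); the helper name submatches is hypothetical, and Python addresses the groups as group(1) and group(2) rather than $1 and $2.

```python
import re

# The search expression from the example above.
pattern = re.compile(r"(c|a)(t|z)")

def submatches(target):
    """Return (whole match, first submatch, second submatch), or None."""
    m = pattern.search(target)
    if m is None:
        return None
    return (m.group(0), m.group(1), m.group(2))

print(submatches("cat"))  # ('at', 'a', 't')
print(submatches("act"))  # ('ct', 'c', 't')
```

In both cases the whole match starts at the second character, because that is the first position where (c|a) is immediately followed by (t|z).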

Perl, Ruby and the LDAP access directive support submatches.

When used in regular expression utilities, such as grep, these submatches are typically called groups or backreferences and are placed in numeric variables (typically addressed as \1 to \9). These groups or backreferences (variables) may also be used inside the regular expression itself. The following demonstrates usage:

# the following expression finds double characters
(.)\1
# the parentheses create the grouping
# (or submatch or backreference), in this case the first and only one (\1)
# the . (dot) finds any character and the \1 matches whatever
# character was found by the dot
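
As a rough illustration of the same expression outside grep, here is a short Python sketch. The function name doubled_characters and the sample word are assumptions made for this example only.

```python
import re

# (.) captures any one character; \1 requires that same character again.
double = re.compile(r"(.)\1")

def doubled_characters(text):
    """Return every doubled pair found in text, left to right."""
    return [m.group(0) for m in double.finditer(text)]

print(doubled_characters("bookkeeper"))  # ['oo', 'kk', 'ee']
```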

Apache Browser Identification - an Example

All we ever wanted to do was find out enough about the browsers visiting our Apache server to decide what code to supply or not for our pop-out menus. The Apache BrowserMatch directives will set a variable if the expression matches the User-Agent string.

We want to know:

* If we have any browser that supports JavaScript (isJS).
* If we have any browser that supports the MSIE DHTML Object Model (isIE).
* If we have any browser that supports the W3C DOM (isW3C).

Here in their glory are the Apache regular expression statements we used (maybe you can understand them now):

BrowserMatchNoCase [Mm]ozilla/[4-6] isJS
BrowserMatchNoCase MSIE isIE
BrowserMatchNoCase [Gg]ecko isW3C
BrowserMatchNoCase MSIE.((5\.[5-9])|([6-9])) isW3C
BrowserMatchNoCase W3C_ isW3C

Notes:

* Line 1 checks for any upper or lower case variant of Mozilla/4 to Mozilla/6 (MSIE also identifies itself this way). This test sets the variable isJS for all version 4-6 browsers (we assume that version 3 and lower do not support JavaScript, or at least not a sensible JavaScript).
* Line 2 checks for MSIE only (MSIE 1-3 browsers are still excluded because line 1 does not set isJS for them, even if this variable is set).
* Line 3 checks for any upper or lower case variant of Gecko, which covers the Gecko-based browsers: Firefox, Netscape 6, 7 and now 8 and the Moz clones (all of which are Mozilla/5).
* Line 4 checks for MSIE 5.5 (or greater) OR MSIE 6+.

NOTE about binding: This expression does not work:

BrowserMatchNoCase MSIE.(5\.[5-9])|([6-9]) isW3C

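* It incorrectly sets the variable isW3C if any digit from 6 to 9 appears anywhere in the string. Alternation (|) has the lowest precedence in a regular expression, so the expression is read as MSIE.(5\.[5-9]) OR ([6-9]): the first parenthesis binds directly to the MSIE expression, and the second parenthesis is treated as a separate expression. Enclosing the whole alternation in an outer pair of parentheses, as in line 4, fixed the problem.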

* Line 5 checks for W3C_ in any part of the line. This allows us to identify the W3C validation services (either CSS or HTML/XHTML page validation).
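
The binding note for line 4 can be checked outside Apache. The sketch below uses Python's re module, which is not the engine Apache uses but treats alternation and grouping the same way for these two expressions. The User-Agent strings are illustrative samples chosen for this example, not taken from the original text.

```python
import re

# re.IGNORECASE mirrors BrowserMatchNoCase, which ignores case.
broken = re.compile(r"MSIE.(5\.[5-9])|([6-9])", re.IGNORECASE)
fixed = re.compile(r"MSIE.((5\.[5-9])|([6-9]))", re.IGNORECASE)

# Sample User-Agent strings (assumed for illustration).
firefox = "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7) Gecko/20040707 Firefox/0.9"
msie6 = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
msie5 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"

# The broken form matches any lone digit 6-9, so it fires on strings
# that contain no MSIE at all, and on MSIE 5.0 because of the "98".
print(bool(broken.search(firefox)))  # True
print(bool(broken.search(msie5)))    # True

# The fixed form only matches a version number that directly follows MSIE.
print(bool(fixed.search(firefox)))   # False
print(bool(fixed.search(msie5)))     # False
print(bool(fixed.search(msie6)))     # True
```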

Some of the above checks may be a bit excessive. For example, is Mozilla ever spelled mozilla? BrowserMatchNoCase already ignores case, so the [Mm] and [Gg] classes are redundant, but it is also pretty silly to have code fail just because of this 'easy to prevent' condition. There is apparently no final consensus that all Gecko browsers will have to use Gecko in their 'user-agent' string, but it would be extremely foolish not to: this would force guys like us to make huge numbers of tests for branded products, and the more likely outcome is that we would not bother.
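
To experiment with the five rules without touching a server, they can be approximated in a few lines of Python. This is only a sketch under stated assumptions: Python's re module stands in for Apache's regular expression engine, the function name browser_flags is hypothetical, and the User-Agent strings are illustrative samples.

```python
import re

# (pattern, variable) pairs copied from the BrowserMatchNoCase lines above.
RULES = [
    (r"[Mm]ozilla/[4-6]", "isJS"),
    (r"MSIE", "isIE"),
    (r"[Gg]ecko", "isW3C"),
    (r"MSIE.((5\.[5-9])|([6-9]))", "isW3C"),
    (r"W3C_", "isW3C"),
]

def browser_flags(user_agent):
    """Return the set of variables the rules would set for this string."""
    return {var for pat, var in RULES
            if re.search(pat, user_agent, re.IGNORECASE)}

# Sample User-Agent strings (assumed for illustration).
samples = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0",
    "W3C_Validator/1.3",
]
for ua in samples:
    print(sorted(browser_flags(ua)))
# ['isIE', 'isJS', 'isW3C']
# ['isJS', 'isW3C']
# ['isW3C']
```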


