Simple Matching
We are going to try some simple matching against our example target strings:
Search for | Target | Result | Explanation
m | STRING1 | match | Finds the m in compatible.
m | STRING2 | no match | There is no lower case m in this string. Searches are case sensitive unless you take special action.
a/4 | STRING1 | match | Found in Mozilla/4.0. Any combination of characters can be used for the match.
a/4 | STRING2 | match | Found in the same place as in STRING1.
5 [ | STRING1 | no match | The search is looking for a pattern of '5 [' and this does NOT exist in STRING1. Spaces are valid in searches. In most regular expression software the [ has to be escaped as \[ to be treated as a literal, since it is a metacharacter (see the next section).
5 [ | STRING2 | match | Found in Mozilla/4.75 [en].
in | STRING1 | match | Found in Windows.
in | STRING2 | match | Found in Linux.
le | STRING1 | match | Found in compatible.
le | STRING2 | no match | There is an l and an e in this string but they are not adjacent (or contiguous).
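The simple searches above can be reproduced with a short Python sketch. The two target strings are defined outside this section, so the values of STRING1 and STRING2 below are assumptions chosen to agree with the fragments quoted in the table. The helper name simple_search is purely illustrative. re.escape is used so that every character, including the [ in '5 [', is treated as a literal.

```python
import re

# Assumed stand-ins for the target strings (they are defined outside this section).
STRING1 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
STRING2 = "Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)"
TARGETS = {"STRING1": STRING1, "STRING2": STRING2}

def simple_search(literal, target):
    """Return True if the literal text occurs anywhere in target (case sensitive)."""
    # re.escape() neutralises metacharacters such as '[' so the text is matched as-is.
    return re.search(re.escape(literal), target) is not None

results = {
    (literal, name): simple_search(literal, target)
    for literal in ["m", "a/4", "5 [", "in", "le"]
    for name, target in TARGETS.items()
}

for (literal, name), hit in results.items():
    print(f"{literal!r:7} {name}: {'match' if hit else 'no match'}")
```

Passing flags=re.IGNORECASE to re.search is one way of taking the "special action" needed for a case-insensitive search.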
Brackets, Ranges and Negation
Bracket expressions introduce our first metacharacters, in this case the square brackets, which allow us to define a list of things to test for rather than the single characters we have been checking up until now. These lists can be grouped into what are known as Character Classes, typically comprising well-known groups such as all numbers.
Metacharacter Meaning
[ ] Match anything inside the square brackets for one character position once and only once, for example, [12] means match the target to either 1 or 2 while [0123456789] means match to any character in the range 0 to 9.
- The - (dash) inside square brackets is the 'range separator' and allows us to define a range, in our example above of [0123456789] we could rewrite it as [0-9].
You can define more than one range inside a list e.g. [0-9A-C] means check for 0 to 9 and A to C (but not a to c).
NOTE: To test for - inside brackets (as a literal) it must come first or last, that is, [-0-9] will test for - and 0 to 9.
^ The ^ (circumflex or caret) inside square brackets negates the expression (we will see an alternate use for the circumflex/caret outside square brackets later), for example, [^Ff] means anything except upper or lower case F and [^a-z] means everything except lower case a to z.
NOTE: There are some special range values (Character Classes) that are built in to most regular expression software, and must be supported by any software that claims POSIX 1003.2 compliance for either BRE or ERE.
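A minimal Python sketch of the bracket rules listed above. The helper accepts (a hypothetical name) tests a single character against a bracket expression, which makes the "one character position" behaviour explicit.

```python
import re

def accepts(bracket, ch):
    """Return True if the single character ch is matched by the bracket expression."""
    return re.fullmatch(bracket, ch) is not None

list_hit = accepts("[12]", "1")          # True: 1 is in the list
list_miss = accepts("[12]", "3")         # False: 3 is not in the list
range_hit = accepts("[0-9A-C]", "B")     # True: B lies in the range A to C
range_miss = accepts("[0-9A-C]", "b")    # False: ranges are case sensitive
dash_hit = accepts("[-0-9]", "-")        # True: a leading - is a literal dash
neg_miss = accepts("[^Ff]", "F")         # False: F is excluded by the ^
neg_hit = accepts("[^a-z]", "Q")         # True: Q is not a lower case letter

print(list_hit, list_miss, range_hit, range_miss, dash_hit, neg_miss, neg_hit)
```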
So let's try this new stuff with our target strings.
Search for | Target | Result | Explanation
in[du] | STRING1 | match | Finds ind in Windows.
in[du] | STRING2 | match | Finds inu in Linux.
x[0-9A-Z] | STRING1 | no match | Again, the tests are case sensitive. To find the xt in DigExt we would need to use [0-9a-z] or [0-9A-Zt]. We can also use this format for testing upper and lower case, e.g. [Ff] will check for lower and upper case F.
x[0-9A-Z] | STRING2 | match | Finds x2 in Linux2.
[^A-M]in | STRING1 | match | Finds Win in Windows.
[^A-M]in | STRING2 | no match | We have excluded the range A to M in our search so Linux is not found but linux (if it were present) would be found.
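The bracket searches above can be checked in Python, this time returning the matched text so that the "finds ind" style of explanation can be seen directly. As before, the values of STRING1 and STRING2 are assumptions consistent with the fragments quoted in this section, and matched_text is an illustrative helper name.

```python
import re

# Assumed stand-ins for the target strings (they are defined outside this section).
STRING1 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
STRING2 = "Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)"

def matched_text(pattern, target):
    """Return the first substring matched by pattern, or None if there is no match."""
    m = re.search(pattern, target)
    return m.group(0) if m else None

ind = matched_text(r"in[du]", STRING1)          # 'ind' from Windows
inu = matched_text(r"in[du]", STRING2)          # 'inu' from Linux
xt_upper = matched_text(r"x[0-9A-Z]", STRING1)  # None: t is lower case
xt_lower = matched_text(r"x[0-9a-z]", STRING1)  # 'xt' from DigExt
x2 = matched_text(r"x[0-9A-Z]", STRING2)        # 'x2' from Linux2
win = matched_text(r"[^A-M]in", STRING1)        # 'Win': W is outside A to M
lin = matched_text(r"[^A-M]in", STRING2)        # None: L is inside A to M

print(ind, inu, xt_upper, xt_lower, x2, win, lin)
```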
Positioning (or Anchors)
We can control where in our target strings the matches are valid. The following is a list of metacharacters that affect the position of the search:
Metacharacter Meaning
^ The ^ (circumflex or caret) outside square brackets means look only at the beginning of the target string, for example, ^Win will not find Windows in STRING1 but ^Moz will find Mozilla.
$ The $ (dollar) means look only at the end of the target string, for example, fox$ will find a match in 'silver fox' since it appears at the end of the string but not in 'the fox jumped over the moon'.
. The . (period) means any single character in this position, for example, ton. will find tons and tonneau but not wanton because it has no following character.
NOTE: Many systems and utilities, but not all, support special positioning macros, for example, \< matches at the beginning of a word, \> matches at the end of a word, \b matches at the beginning OR end of a word, and \B matches anywhere except at the beginning or end of a word.
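The positioning metacharacters can be tried out with the sketch below. The short test strings are illustrative only, and found is a hypothetical helper name. Python's re module is one of the tools that supports \b and \B but not \< and \>, so only the former pair is shown.

```python
import re

def found(pattern, target):
    """Return the matched text, or None when the pattern does not match."""
    m = re.search(pattern, target)
    return m.group(0) if m else None

# ^ restricts the match to the beginning of the target string.
caret_hit = found(r"^Moz", "Mozilla/4.0 (Windows NT)")     # 'Moz'
caret_miss = found(r"^Win", "Mozilla/4.0 (Windows NT)")    # None

# $ restricts the match to the end of the target string.
dollar_hit = found(r"fox$", "silver fox")                      # 'fox'
dollar_miss = found(r"fox$", "the fox jumped over the moon")   # None

# . stands for exactly one character in that position.
dot_tons = found(r"ton.", "tons")         # 'tons'
dot_tonneau = found(r"ton.", "tonneau")   # 'tonn'
dot_wanton = found(r"ton.", "wanton")     # None: nothing follows 'ton'

# \b matches at a word boundary, \B everywhere else.
word_hit = found(r"\bcat", "black cat")       # 'cat'
word_miss = found(r"\bcat", "concatenate")    # None
inner_hit = found(r"\Bcat", "concatenate")    # 'cat'

print(caret_hit, caret_miss, dollar_hit, dollar_miss)
print(dot_tons, dot_tonneau, dot_wanton, word_hit, word_miss, inner_hit)
```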
So let's try this new stuff with our target strings.
Search for | Target | Result | Explanation
[a-z]\)$ | STRING1 | match | Finds t) in DigExt). Note: the \ is an escape character and is required to treat the ) as a literal.
[a-z]\)$ | STRING2 | no match | We have a numeric value before the closing ) at the end of this string, so we would need [0-9a-z]\)$ to find it.
.in | STRING1 | match | Finds Win in Windows.
.in | STRING2 | match | Finds Lin in Linux.
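A short Python check of the anchored searches above, again under the assumption that STRING1 and STRING2 have the values shown (they are defined outside this section and were chosen to agree with the quoted fragments). The helper name tail_or_first is illustrative.

```python
import re

# Assumed stand-ins for the target strings (they are defined outside this section).
STRING1 = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
STRING2 = "Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)"

def tail_or_first(pattern, target):
    """Return the text matched by pattern in target, or None if nothing matches."""
    m = re.search(pattern, target)
    return m.group(0) if m else None

# The backslash makes ) a literal; $ pins the match to the end of the string.
end1 = tail_or_first(r"[a-z]\)$", STRING1)           # 't)'
end2 = tail_or_first(r"[a-z]\)$", STRING2)           # None: a digit precedes the )
end2_fixed = tail_or_first(r"[0-9a-z]\)$", STRING2)  # '6)'

# . accepts any one character before 'in'.
dot1 = tail_or_first(r".in", STRING1)   # 'Win'
dot2 = tail_or_first(r".in", STRING2)   # 'Lin'

print(end1, end2, end2_fixed, dot1, dot2)
```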