This question is categorized "General Linux" because it has been tested on Linux with "grep". Which of the following strings is NOT matched by the regular expression 'ca*t'
- cart
- cat
- caat
- ct
EXPLANATION
My Terminal output.[root@sanbox ~]# cat file | grep 'ca*t'
cat
caat
ctx`
Being able to match varying sets of characters is the first thing regular expressions can do that isn’t already possible with the methods available on strings. However, if that was the only additional capability of regexes, they wouldn’t be much of an advance. Another capability is that you can specify that portions of the RE must be repeated a certain number of times.
The first metacharacter for repeating things that we’ll look at is
*
. *
doesn’t match the literal character *
; instead, it specifies that the
previous character can be matched zero or more times, instead of exactly once.For example,
ca*t
will match ct
(0 a
characters), cat
(1 a
),
caaat
(3 a
characters), and so forth. The RE engine has various
internal limitations stemming from the size of C’s int
type that will
prevent it from matching over 2 billion a
characters; you probably don’t
have enough memory to construct a string that large, so you shouldn’t run into
that limit.Repetitions such as
*
are greedy; when repeating a RE, the matching
engine will try to repeat it as many times as possible. If later portions of the
pattern don’t match, the matching engine will then back up and try again with
fewer repetitions.A step-by-step example will make this more obvious. Let’s consider the expression
a[bcd]*b
. This matches the letter 'a'
, zero or more letters
from the class [bcd]
, and finally ends with a 'b'
. Now imagine matching
this RE against the string abcbd
.Step | Matched | Explanation |
---|---|---|
1 | a |
The a in the RE matches. |
2 | abcbd |
The engine matches [bcd]* ,
going as far as it can, which
is to the end of the string. |
3 | Failure | The engine tries to match
b , but the current position
is at the end of the string, so
it fails. |
4 | abcb |
Back up, so that [bcd]*
matches one less character. |
5 | Failure | Try b again, but the
current position is at the last
character, which is a 'd' . |
6 | abc |
Back up again, so that
[bcd]* is only matching
bc . |
6 | abcb |
Try b again. This time
the character at the
current position is 'b' , so
it succeeds. |
abcb
. This
demonstrates how the matching engine goes as far as it can at first, and if no
match is found it will then progressively back up and retry the rest of the RE
again and again. It will back up until it has tried zero matches for
[bcd]*
, and if that subsequently fails, the engine will conclude that the
string doesn’t match the RE at all.Another repeating metacharacter is
+
, which matches one or more times. Pay
careful attention to the difference between *
and +
; *
matches
zero or more times, so whatever’s being repeated may not be present at all,
while +
requires at least one occurrence. To use a similar example,
ca+t
will match cat
(1 a
), caaat
(3 a
’s), but won’t match
ct
.There are two more repeating qualifiers. The question mark character,
?
,
matches either once or zero times; you can think of it as marking something as
being optional. For example, home-?brew
matches either homebrew
or
home-brew
.The most complicated repeated qualifier is
{m,n}
, where m and n are
decimal integers. This qualifier means there must be at least m repetitions,
and at most n. For example, a/{1,3}b
will match a/b
, a//b
, and
a///b
. It won’t match ab
, which has no slashes, or a////b
, which
has four.You can omit either m or n; in that case, a reasonable value is assumed for the missing value. Omitting m is interpreted as a lower limit of 0, while omitting n results in an upper bound of infinity — actually, the upper bound is the 2-billion limit mentioned earlier, but that might as well be infinity.
Readers of a reductionist bent may notice that the three other qualifiers can all be expressed using this notation.
{0,}
is the same as *
, {1,}
is equivalent to +
, and {0,1}
is the same as ?
. It’s better to use
*
, +
, or ?
when you can, simply because they’re shorter and easier
to read.
0 comments:
Post a Comment