Regular Expressions: How Greedy is Greedy?
According to Python's how-to documentation on regular expressions, "Repetitions such as
* are greedy; when repeating a RE, the matching engine will try to repeat it as many times as possible. If later portions of the pattern don’t match, the matching engine will then back up and try again with fewer repetitions."
One way to think of this is that they are greedy, but not so greedy that it will throw away the whole match just to grab a few more characters for themselves.
To illustrate, consider this regular expression:
p = re.compile(r"([ab]*)bb") p.search('aabbaabbaacc')
This matches 'aabbaabb'. The group matched by
([ab]*) is "aabbaa". How's it work?
- As a first check, the
.*does go all the way through the string matching "a"s or "b"s until it gets to the last "a" or "b". It finds "aabbaabbaa", but since that is not followed with "bb", this check fails to match: aabbaabbaacc
- It then tries again, but this time drops off the last character. That also fails because the match isn't followed by a "bb". aabbaabbaacc
- And so it continues. From the beginning, it looks like this:
- That last one matches: a series of "a"s or "b"s followed by "bb".