2019-08-14
Problem with positive lookbehind and repeating pattern
stackoverflow
Question

Consider the following string:

ab(cd.xz) e(ab(fg).xz)) ab(hi.xz)

I want to match every substring that starts after ab( and ends with z. So I wrote the following regular expression:

(?<=a.*?\().*?z

This should attempt to do the following according to RegexBuddy:

Assert that the regex below can be matched, with the match ending at this position (positive lookbehind) «(?<=a.*?\()»
   Match the character “a” literally «a»
   Match any single character that is not a line break character «.*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
   Match the character “(” literally «\(»
Match any single character that is not a line break character «.*?»
   Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character “z” literally «z»

RegexBuddy reports three matches, but the middle one is wrong: it should be fg).xz. What am I doing wrong?

Answer

The regex is working as designed :)

For the second match, the lookbehind expression matches ab(cd.xz) e(. The lookbehind is always attempted from the start of the string onward (moving ahead if necessary), so the .*? matches more than you think: being lazy does not stop it from running across ) and ( until it reaches the ( just before the current position. It is not (as one might expect) actually performed backwards from the current position.

So for the third match, the lookbehind even matches ab(cd.xz) e(ab(fg).xz)) ab(. It only appears to work correctly because the actual match happens to start right after another ab(.
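
To see what the lookbehind is swallowing, here is a minimal Python sketch. Python's built-in re module does not accept variable-length lookbehind, so this only emulates the assertion: it searches the text before a given position for a.*?\( anchored at the end of that prefix. The helper name lookbehind_text and the hard-coded positions are made up for illustration, and RegexBuddy's own engine may work differently internally.

```python
import re

text = "ab(cd.xz) e(ab(fg).xz)) ab(hi.xz)"

# The lookbehind body from the question, anchored so that it must end
# exactly at the end of the prefix being searched.
behind = re.compile(r"a.*?\(\Z")

def lookbehind_text(pos):
    """Return the text the lookbehind body can match ending at pos, or None."""
    hit = behind.search(text[:pos])
    return hit.group() if hit else None

# Positions right after a "(" where the original regex starts its matches.
for pos in (3, 12, 27):
    print(pos, repr(lookbehind_text(pos)))
# 3 'ab('
# 12 'ab(cd.xz) e('
# 27 'ab(cd.xz) e(ab(fg).xz)) ab('
```

At position 12 (right after e() the assertion succeeds only because .*? is allowed to cross the first closing parenthesis.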

Solution: Be more specific about what you allow to match. I suggest taking parentheses out of the allowed characters:

(?<=a[^()]*\().*?z
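
A rough way to compare the two patterns in Python: since the standard re module rejects variable-length lookbehind, the hypothetical helper find_with_lookbehind below emulates (?<=behind)body by checking the text before each position. Treat it as a sketch of the matching behaviour, not as how RegexBuddy implements it.

```python
import re

def find_with_lookbehind(behind, body, text):
    """Emulate (?<=behind)body using only the standard re module."""
    # The assertion holds at pos if `behind` matches some text ending at pos.
    behind_re = re.compile("(?:" + behind + r")\Z")
    body_re = re.compile(body)
    matches = []
    pos = 0
    while pos <= len(text):
        hit = body_re.match(text, pos) if behind_re.search(text[:pos]) else None
        if hit:
            matches.append(hit.group())
            # Resume after the match, always advancing at least one character.
            pos = max(hit.end(), pos + 1)
        else:
            pos += 1
    return matches

text = "ab(cd.xz) e(ab(fg).xz)) ab(hi.xz)"

print(find_with_lookbehind(r"a.*?\(", r".*?z", text))
# ['cd.xz', 'ab(fg).xz', 'hi.xz']
print(find_with_lookbehind(r"a[^()]*\(", r".*?z", text))
# ['cd.xz', 'fg).xz', 'hi.xz']
```

With parentheses excluded from the lookbehind, the assertion fails right after e( and the second match starts after the inner ab( instead.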