Simplifying Lookarounds in Regex | Better Programming

comic.  someone swoops in screaming “I know regular expressions!”  Taps on the computer, some Perl, then leaves.  Other people cheer.
Image by https://xkcd.com/

One of the concepts in Regular Expressions (Regex) that I’ve always found difficult to wrap my head around is look-arounds — which comprise look-aheads and look-behinds.

While there are plenty of articles and tutorials online explaining this concept, few do it in a way that is easy to understand, at least not to my satisfaction. Many use jargon such as “consuming groups” and “zero-width assertions,” which doesn’t help those who are new to this advanced topic.

Furthermore, there is a lack of clarity over how to interpret the names of the look-arounds. For instance, for look-behind, what is “behind” relative to? What are we “looking” for? The same goes for look-ahead. As if they are not confusing enough, there are two sub-types — positive and negative — for each type of look-around.

Comic: two people talking.  one has sunglasses on.  sunglasses: if you're havin' Perl problems, I feel bad for you son.  I got 99 problems, so I used regular expressions.  Now, I have 100 problems.
Image by https://xkcd.com/

In this article, I attempt to demystify the concepts of look-ahead and look-behind once and for all. I will avoid technical jargon and instead explain in simple terms, supported by animated GIFs.

My explanation will be programming language-agnostic, although my code snippets will be in Python. I hope this article will be useful for you. Let’s begin!

The code snippets and animated GIFs shown in this article can be found at this GitHub repo.

Before we dive deeper, let’s first get some high-level intuition of what look-arounds are trying to achieve and how they work. Let’s do this with a simple analogy.

Suppose you’re a tourist in another country and you wish to visit a local museum. You’re on foot and you’re lost. You ask a passerby for directions to the museum. They tell you, “Go straight ahead, and once you see the French café on your left, you’ll see the museum.” You follow their instructions and voilà, you find the museum!

Photo by Ehud Neuhaus on Unsplash

Here, you successfully find the museum because you’re given a landmark (“French café”), the direction to walk (“straight ahead”), and where to find the landmark relative to the direction of motion (“on your left”). Look-arounds work in a similar way: instead of walking past buildings, you are “walking through” a text string.

With look-arounds, you are looking for one or more parts of a text string. To find them, you need to know the “landmark” (also known as the pattern) and where the landmark sits relative to the text you want (either before or after it).

The direction to “walk” is fixed because text strings read from left to right, at least in the English language. Whenever the pattern is found in the expected position, you have a match.

To keep things interesting, I will be illustrating the concepts of look-arounds using the following quote from the Spider-Man movies:

“With great power comes great responsibility.”

Photo by Road Trip with Raj on Unsplash

Look-ahead is the type of look-around where the pattern appears ahead of the desired match. We are “looking ahead” to see if a certain string of text has a specific pattern ahead of it. If it does, then that string of text is a match.

3.1. Positive look-ahead

In a positive look-ahead, you want to find an expression A that has an expression B (i.e., the pattern) after it. Its syntax is A(?=B).

Figure 1: Definition of Positive Look-ahead (GIF by Author)

Let’s contextualise this with our example text. Suppose you want to find any complete word which has the pattern " great" after it. Since this is our first example in this article, let’s break it down and walk through it step by step… quite literally.

Using Spiderman's saying “with great power comes great responsibility,” find a complete word that has “great” after it.  Words found are “with” and “comes.”  This is a positive look-ahead.
Figure 2: Positive Look-ahead Example (GIF by Author)

Imagine that you are the animated walking man in Figure 2. You are first standing at the beginning of the example text. Then, you start walking character by character sequentially towards the end of the text.

As you walk, you are always looking ahead to find the “landmark,” which in this case is the pattern " great".

Each time you find " great" just after a complete word, that word is a match.

In this case, the successful matches are "With" and "comes". The corresponding code snippet in Python is as follows:

>>> import re
>>> text = "With great power comes great responsibility."
>>> pattern = r'\b\w+\b(?= great)'
>>> matches = re.finditer(pattern, text)
>>> for match in matches:
...     print(f'Match: "{match.group()}" => Span: {match.span()}')

Match: "With" => Span: (0, 4)
Match: "comes" => Span: (17, 22)
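To make the “walking” picture concrete, here is a minimal sketch that does the same job by hand: it visits each complete word and peeks at the characters right after it, without stepping past them. The names `landmark`, `manual_matches` and `lookahead_matches` are my own for illustration, and this is only meant to mirror the example above, not to show how the regex engine works internally.

```python
import re

text = "With great power comes great responsibility."
landmark = " great"  # the pattern we expect to see right after a word

# Visit every complete word, then peek at what follows its end position.
manual_matches = []
for word in re.finditer(r'\b\w+\b', text):
    if text.startswith(landmark, word.end()):
        manual_matches.append((word.group(), word.span()))

# The same search written as a positive look-ahead.
lookahead_matches = [(m.group(), m.span())
                     for m in re.finditer(r'\b\w+\b(?= great)', text)]

print(manual_matches)                       # [('With', (0, 4)), ('comes', (17, 22))]
print(manual_matches == lookahead_matches)  # True
```

Notice that the spans stop at the end of each word: the landmark is checked, but it never becomes part of the match.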

3.2. Negative look-ahead

A negative look-ahead, on the other hand, is when you want to find an expression A that does not have an expression B (i.e., the pattern) after it. Its syntax is A(?!B). In a way, it is the opposite of a positive look-ahead.

Figure 3: Definition of Negative Look-ahead (GIF by Author)

Now, let’s say you want to find any complete word which does not have the pattern " great" after it.

Example of a negative look-ahead.
Figure 4: Negative Look-ahead Example (GIF by Author)

This time round, you are looking ahead to find any word that does not have the pattern " great" after it.

  • The first word, "With", has " great" after it, so it is not a match.
  • The next word, "great", does not have " great" after it, so it is a match.
  • The third word, "power", also does not have " great" after it, so it is a match.
  • This goes on until you reach the end of the string. The successful matches are therefore "great", "power", "great" and "responsibility".

Let’s see this in code:

>>> text = "With great power comes great responsibility."
>>> pattern = r'\b\w+\b(?! great)'
>>> matches = re.finditer(pattern, text)
>>> for match in matches:
...     print(f'Match: "{match.group()}" => Span: {match.span()}')

Match: "great" => Span: (5, 10)
Match: "power" => Span: (11, 16)
Match: "great" => Span: (23, 28)
Match: "responsibility" => Span: (29, 43)
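One detail worth seeing for yourself is why the pattern insists on a complete word (the \b on either side of \w+). The sketch below compares the pattern with and without those word boundaries, using Python's `re` module as in the rest of this article. Without them, the engine is free to give back a letter so that " great" no longer comes directly after the match, which is probably not what you want.

```python
import re

text = "With great power comes great responsibility."

# With word boundaries: only complete words can match.
with_boundary = re.findall(r'\b\w+\b(?! great)', text)

# Without word boundaries: "With" and "comes" shrink by one letter,
# because "Wit" and "come" are not directly followed by " great".
without_boundary = re.findall(r'\w+(?! great)', text)

print(with_boundary)     # ['great', 'power', 'great', 'responsibility']
print(without_boundary)  # ['Wit', 'great', 'power', 'come', 'great', 'responsibility']
```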

Let’s turn our attention now to look-behind. Unlike look-ahead, look-behind is used when the pattern appears before a desired match. You’re “looking behind” to see if a certain string of text has the desired pattern behind it. If it does, then that string of text is a match.

4.1. Positive look-behind

In a positive look-behind, you want to find an expression A that has the expression B (i.e., the pattern) before it. Its syntax is (?<=B)A.

Figure 5: Definition of Positive Look-behind (Animated GIF by Author)

Let’s understand this better with our example text. Suppose you now want to find any complete word that has the pattern "great " before it.

Figure 6: Positive Look-behind Example (Animated GIF by Author)

Once again, you walk from the start of the text string to the end. The difference now is that as you walk, you “turn around” to “look behind” instead of just looking ahead. Notice that the animated man in Figure 6 always turns his head around!

You “look behind” to find any word that has the pattern "great " before it.

  • The first word "With" has no characters before it, thus it is not a match.
  • The second word "great" has "With " before it and is not a match.
  • The third word "power" has "great " before it and it is a match.
  • At the end, the successful matches are "power" and "responsibility".

Here’s the code snippet:

>>> text = "With great power comes great responsibility."
>>> pattern = r'(?<=great )\b\w+\b'
>>> matches = re.finditer(pattern, text)
>>> for match in matches:
...     print(f'Match: "{match.group()}" => Span: {match.span()}')

Match: "power" => Span: (11, 16)
Match: "responsibility" => Span: (29, 43)

4.2. Negative look-behind

Finally, in a negative look-behind, you are interested in finding an expression A that does not have the expression B (i.e., the pattern) before it. Its syntax is (?<!B)A. It is the opposite of a positive look-behind.

Figure 7: Definition of Negative Look-behind (Animated GIF by Author)

Now, let’s say you want to find any complete word which does not have the pattern "great " before it in our example text string. This time, as you walk from the start to the end of the string, you are “looking behind” for words that do not have "great " before them.

By a similar “walking through” process, you find that the successful matches are "With", "great", "comes" and "great".

Figure 8: Negative Look-behind Example (Animated GIF by Author)

The code is as follows:

>>> text = "With great power comes great responsibility."
>>> pattern = r'(?<!great )\b\w+\b'
>>> matches = re.finditer(pattern, text)
>>> for match in matches:
...     print(f'Match: "{match.group()}" => Span: {match.span()}')

Match: "With" => Span: (0, 4)
Match: "great" => Span: (5, 10)
Match: "comes" => Span: (17, 22)
Match: "great" => Span: (23, 28)
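If “opposite” feels a bit abstract, here is a quick sanity check, sketched for this example text only. For complete words, every word is picked up by exactly one of the two look-behinds, so together they account for all the words in the quote. The helper `spans` and the variable names are my own.

```python
import re

text = "With great power comes great responsibility."

def spans(pattern):
    """Return the (start, end) span of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]

all_words = spans(r'\b\w+\b')
preceded = spans(r'(?<=great )\b\w+\b')      # positive look-behind
not_preceded = spans(r'(?<!great )\b\w+\b')  # negative look-behind

# Together the two lists cover every word, and no word is in both.
print(sorted(preceded + not_preceded) == all_words)  # True
print(set(preceded) & set(not_preceded))             # set()
```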

You may encounter situations where you want to find matches in a text string that start after one pattern and end before another. In such cases, you can combine look-ahead and look-behind.

For example, if you want to find any characters between the two “great” words in the example text, you can combine a positive look-behind (?<=great).* and a positive look-ahead .*(?=great) in the following way:

>>> text = "With great power comes great responsibility."
>>> pattern = r'(?<=great).*(?=great)'
>>> matches = re.finditer(pattern, text)
>>> for match in matches:
...     print(f'Match: "{match.group()}" => Span: {match.span()}')

Match: " power comes " => Span: (10, 23)
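To see what the look-arounds are buying you here, it may help to compare them with a plain capture group. In the sketch below (variable names are mine), both approaches find the same text between the two landmarks, but the plain pattern's overall match also swallows the two “great” words, whereas the look-around version leaves them out.

```python
import re

text = "With great power comes great responsibility."

lookaround = re.search(r'(?<=great).*(?=great)', text)
captured = re.search(r'great(.*)great', text)

# The look-around match contains only the text between the landmarks.
print(repr(lookaround.group()))  # ' power comes '

# The plain pattern matches the landmarks too; the middle part is in group 1.
print(repr(captured.group()))    # 'great power comes great'
print(repr(captured.group(1)))   # ' power comes '
```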

Let’s zoom out a little and wrap things up before you go. We have covered four types of look-arounds in Regex. Here’s a cheat sheet that summarises their definitions and syntaxes. Feel free to save a copy for your future reference.

3x3 chart.  positive and negative columns, look-ahead and look-behind rows.
Figure 9: Cheatsheet for Regex Look-arounds (Image by Author)

Here are a few more observations to note to further cement your understanding:

  • The syntaxes for the two types of positive look-arounds both contain an equals sign, =.
  • The syntaxes for the two types of negative look-arounds both contain an exclamation mark, !.
  • Look-aheads are associated with the preposition “after”: finding a match that has a specific pattern after it.
  • Look-behinds are associated with the preposition “before”: finding a match that has a specific pattern before it.
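If you prefer a cheat sheet you can run, here is a small sketch that puts the four patterns from this article side by side on the example quote. The dictionary `cheat_sheet` and the labels in it are my own naming, not a standard.

```python
import re

text = "With great power comes great responsibility."

# The four look-arounds from this article, applied to complete words.
cheat_sheet = {
    "positive look-ahead A(?=B)": r'\b\w+\b(?= great)',
    "negative look-ahead A(?!B)": r'\b\w+\b(?! great)',
    "positive look-behind (?<=B)A": r'(?<=great )\b\w+\b',
    "negative look-behind (?<!B)A": r'(?<!great )\b\w+\b',
}

results = {name: re.findall(pattern, text)
           for name, pattern in cheat_sheet.items()}

for name, words in results.items():
    print(f"{name}: {words}")
# positive look-ahead A(?=B): ['With', 'comes']
# negative look-ahead A(?!B): ['great', 'power', 'great', 'responsibility']
# positive look-behind (?<=B)A: ['power', 'responsibility']
# negative look-behind (?<!B)A: ['With', 'great', 'comes', 'great']
```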

Congratulations! I hope this article has helped you gain a better understanding of look-arounds in Regex. Do not worry if you still struggle to make sense of these concepts — they are confusing to begin with. Feel free to bookmark this article, and come back here if you need a refresher.

I have also kept my explanation simple and used layman’s language, but if you need to take your understanding to the next level, here are some resources you should check out:

That’s it for now. Have a great day!

Let's connect! Reach out to me via LinkedIn or Twitter.
