
Regex Guide for SEO – Regular Expression Fundamentals and Use Cases

SEO Professional’s Guide to Regular Expressions  

Regular expressions, or regex, are used to find text matching a pre-defined pattern. They are a handy tool for any SEO professional whose typical day at work involves crawling large websites, pulling reports from GA/GSC, or analyzing data in Google Sheets. This guide explains the basic building blocks of regular expressions, shows how to write one, and walks you through various examples.

 

What are Regular Expressions (Regex)?  

Regex is used to find text matching a pre-defined pattern. But what exactly does that mean? Think of a situation where you want to find only email addresses or phone numbers in a block of text. A typical email address has two text strings on either side of the “@” character, and a typical phone number in India has 10 digits, with or without a preceding country code. Each of these is a pattern that, if defined correctly, lets you pick out exactly the text you are after. A pattern used to describe text in this way is known as a regular expression.
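To make the email and phone number idea concrete, here is a minimal Python sketch. Both patterns are deliberately simplified assumptions, not complete validators: the email pattern only checks for text on either side of “@” with a dot in the domain, and the phone pattern only checks for an optional “+91” followed by 10 digits. The sample text is made up for illustration.

```python
import re

# Made-up sample text, for illustration only.
text = "Write to seo.team@example.com or call +91 9876543210 or 9123456780."

# Simplified email pattern: text without spaces or "@" on both sides of "@",
# with at least one dot in the part after "@".
email_pattern = re.compile(r"[^\s@]+@[^\s@]+\.[^\s@]+")

# Simplified Indian phone pattern: optional "+91" (optionally followed by a
# space or hyphen), then exactly 10 digits.
phone_pattern = re.compile(r"(?:\+91[\s-]?)?\d{10}")

emails = email_pattern.findall(text)
phones = phone_pattern.findall(text)

print(emails)
print(phones)
```

The building blocks used in these patterns (sets, quantifiers, escapes, groups) are explained in the sections that follow.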

 

Why Learn About Regular Expressions?  

As an SEO professional, you might be wondering why you need to learn about regex or how this skill is going to benefit you in your work. The short answer is that being hands-on with regex can speed up many of the tasks an SEO professional performs on a regular basis. To elaborate, a regex can help you create advanced filters in analytics, apply custom filters in Google Search Console to analyze data more accurately, run custom crawls with tools such as Screaming Frog, manually extract information from any website, rewrite URLs or filenames, or simply supercharge your search and replace operations in Word. Once you understand the concept and get comfortable with writing regex, you will find many more areas where your newfound skill will be useful.

 

Building Blocks of a Regex  

A regex consists of literal characters and metacharacters (special characters). As the names suggest, a literal character is simply any letter, number, symbol, space, or punctuation mark, while a metacharacter is a character with a special meaning. Let’s explore both character types in detail.

 

Literal Characters  

As mentioned above, a literal character simply matches itself, i.e., the letter “a” in a regex will match the letter “a” in the given string. A search pattern made up entirely of literal characters is the simplest form of regex and is the same as searching for that specific sequence of characters in a given string. For example, the regex /search/ will match the exact word “search” or the first six letters of the word “searching”. Keep in mind that a regex is inherently eager, meaning it returns the first (leftmost) match instead of the most accurate match, and that it is case sensitive by default. Therefore, in most cases a simple regex will not be enough for advanced operations, and you will need to employ metacharacters, word boundaries, character sets, etc., to achieve a perfect match.
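The behaviour of a literal pattern can be checked with a short Python sketch using the standard `re` module. The sample strings are made up for illustration; other regex engines behave the same way for plain literals, though their flags for case-insensitive matching differ.

```python
import re

pattern = re.compile(r"search")

# A literal pattern matches itself anywhere in the string,
# including the first six letters of "searching".
in_searching = pattern.search("searching for keywords")

# Matching is case sensitive by default, so the capital "S" prevents a match.
capitalised = pattern.search("Search Console")

# Python's re.IGNORECASE flag switches case sensitivity off.
ignore_case = re.search(r"search", "Search Console", re.IGNORECASE)

# Eager matching: the leftmost occurrence wins, even though it sits inside
# "research" and the standalone word "search" appears later.
leftmost = pattern.search("research and search")

print(in_searching.group(0), capitalised, ignore_case.group(0), leftmost.start())
```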

 

Metacharacters  

The characters with special meaning inside a regex are called metacharacters. As you might have guessed, they are not used to match their literal versions. Instead, they give the literal characters around them different meanings and transform your regex from a simple text search into a powerful tool. The table below lists some of the metacharacters with their meanings.

Metacharacter  Name                   Meaning                                                     Example
.              Dot                    Matches any single character except a newline               T.O matches both “TWO” and “TOO”
^              Caret                  Matches the beginning of the input string                   ^play matches “play” but not “display”
$              Dollar                 Matches the end of the input string                         play$ matches “play” and “display” but not “plays”
+              Plus                   Matches the previous character one or more times            110+ matches “110”, “1100” and so on, but not “11”
*              Asterisk               Matches the previous character zero or more times           110* matches “11”, “110”, “1100” and so on
?              Question Mark          Matches the previous character zero or one time             110? matches only “11” or “110”
|              OR (Pipe)              Matches either of the strings on the two sides of the pipe  a|b matches “a” or “b”
[]             Set (Square Brackets)  Matches any one of the characters within the brackets       go[og]gle matches “google” or “goggle”
{}             Repetition (Braces)    Sets the minimum and maximum number of repetitions*         [0-9]{10} matches any 10-digit number
\              Escape (Backslash)     Turns a metacharacter into a literal character              product\.html matches “product.html”
()             Group (Parentheses)    Groups characters so they can be matched as a unit          Sun(ny)? matches “Sun” and “Sunny”

The characters *, ?, +, and {} are also known as quantifiers.  

*The braces specify the minimum and maximum number of repetitions in the following combinations:

  • {n} – Matches the previous character exactly n times 

  • {n,m} – Matches the previous character at least n and at most m times. 

  • {n,} – Matches the previous character at least n times, with no upper limit. Writing x{0,} is the same as writing x*, and writing x{1,} is the same as writing x+ 
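The examples in the metacharacter table can be verified with a small Python sketch. One assumption to note: the helper `full_matches` (a name made up for this example) uses `re.fullmatch`, so a candidate counts only if the pattern covers the whole string. A partial search would also find “110” inside “1100” for the pattern 110?, which is why the whole-string check is used here.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

numbers = ["11", "110", "1100"]

dot = full_matches(r"T.O", ["TWO", "TOO", "TO"])    # any one character between T and O
plus = full_matches(r"110+", numbers)                # one or more 0s after "11"
star = full_matches(r"110*", numbers)                # zero or more 0s after "11"
optional = full_matches(r"110?", numbers)            # zero or one 0 after "11"
ten_digits = full_matches(r"[0-9]{10}", ["9876543210", "12345"])
escaped = full_matches(r"product\.html", ["product.html", "productxhtml"])

# Anchors are easiest to see with re.search, which looks anywhere in the string.
starts_with_play = [w for w in ["play", "display"] if re.search(r"^play", w)]
ends_with_play = [w for w in ["play", "display", "plays"] if re.search(r"play$", w)]

print(dot, plus, star, optional, ten_digits, escaped)
print(starts_with_play, ends_with_play)
```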

While discussing metacharacters we also introduced concepts like character sets and grouping.  

 

Character Set  

A character set, defined within square brackets […], matches any one of the characters inside the set, irrespective of the order in which they are listed. For example, the regex /h[oi]t/ will match both “hit” and “hot”. Since a character set matches only one character and every allowed character has to be listed, a set covering every digit from 0 to 9 or every lower/upper-case letter would be unnecessarily long and too bothersome to manage. This is where character ranges come into the picture. 

 

Character ranges  

Character ranges make it easier to define character sets where there is a known order between two characters. The hyphen is used as a metacharacter to define a range within a character set. For example:

  • [0-9] – Any digit from 0 to 9 

  • [a-z] – All lower-case letters from a to z 

  • [A-Z] – All upper-case letters from A to Z 
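Character sets and ranges can be tried out with a few lines of Python. This is only an illustrative sketch with made-up sample strings; `re.findall` returns every non-overlapping match in the text.

```python
import re

# A set matches exactly one of the listed characters.
hit_or_hot = re.findall(r"h[oi]t", "hit hot hat hut")

# A range is shorthand for listing every character between its two ends.
digits = re.findall(r"[0-9]", "page 42")

# Adding a quantifier matches runs of characters from the set.
lowercase_words = re.findall(r"[a-z]+", "SEO tips 2024")

# Several ranges can be combined inside one set.
alphanumeric = re.findall(r"[A-Za-z0-9]+", "SEO tips 2024")

print(hit_or_hot, digits, lowercase_words, alphanumeric)
```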

At this point you might be wondering if there’s an even better way to write character sets. Shorthand character sets make the task even easier and result in cleaner, simpler regular expressions. 

Shorthand character sets 

Notation  Meaning                                                                              Negative
\d        Any digit from 0 to 9; same as [0-9]                                                 \D – Not a digit, same as [^0-9]
\w        A word character (letter, digit, or underscore); same as [a-zA-Z0-9_]                \W – Not a word character, same as [^a-zA-Z0-9_]
\s        Whitespace (space, tab, carriage return, line feed, form feed); same as [ \t\r\n\f]  \S – Not whitespace, same as [^ \t\r\n\f]
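The shorthand sets can be tried out in Python as sketched below. One caveat worth labeling: in Python 3 the shorthands are Unicode-aware by default (for example, \d also matches digits from other scripts), so the `re.ASCII` flag is passed here to keep them equivalent to the ASCII sets described above. The sample strings are made up.

```python
import re

text = "Order_42 shipped on 2024-01-15"

# \d+ : runs of digits
numbers = re.findall(r"\d+", text, flags=re.ASCII)

# \w+ : runs of letters, digits, and underscores (the hyphen is not included)
words = re.findall(r"\w+", text, flags=re.ASCII)

# \s+ : split wherever one or more whitespace characters occur
parts = re.split(r"\s+", text, flags=re.ASCII)

# \D+ : runs of anything that is not a digit
non_digits = re.findall(r"\D+", "a1b22c", flags=re.ASCII)

print(numbers, words, parts, non_digits)
```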

 

Grouping Characters  

Opening and closing parentheses are metacharacters used for grouping parts of a regex, applying repetition to a whole group, and performing match-and-replace operations. Unlike character sets, which match only one character at a time, grouping, as the name suggests, allows matching more than one character. Combined with quantifiers, this results in very powerful regular expressions. For example:

  • rank(ing)? – Can match with rank and ranking both. 

  • stra(ight|it) – Can match with straight or strait 
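The two grouping examples above can be run in Python as follows. This is a minimal sketch with made-up sample sentences; `group(0)` is the whole match and `group(1)` is the text captured by the first pair of parentheses.

```python
import re

# The optional group (ing)? lets one pattern cover "rank" and "ranking".
rank_pattern = re.compile(r"rank(ing)?")
rank_matches = [m.group(0) for m in rank_pattern.finditer("rank ranking ranked")]

# The alternation inside the group chooses between two endings.
strait_pattern = re.compile(r"stra(ight|it)")
sentence = "a straight road through the strait"
whole_matches = [m.group(0) for m in strait_pattern.finditer(sentence)]
captured_endings = [m.group(1) for m in strait_pattern.finditer(sentence)]

# Match and replace: every "rank" or "ranking" becomes "position".
replaced = rank_pattern.sub("position", "rank and ranking")

print(rank_matches, whole_matches, captured_endings, replaced)
```

Note that “ranked” still produces a match of “rank”, because nothing in the pattern requires the word to end there.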

This covers the basics of regex that you, as an SEO, need to understand. Let’s dive into some use cases and put our new skills to use. 

 

Regex Use Cases for SEO  

  1. Screaming Frog 

When working with large sites, it is extremely important to balance the time spent crawling against the need for regular crawls. Screaming Frog offers two very useful configurations to customize crawls based on whether you want to include or exclude certain section(s) of the site.  

  • Include Site Section(s) 
    A regex can help in limiting crawl to specific section(s) of the site thereby allowing you to prioritize most important sections/pages for regular audits. For example, if we wanted to crawl only the blog section on https://www.merkleinc.com/in/, we can easily do that using a simple regex as shown below.

Screaming Frog Include

  • Exclude Site Section(s) 
    The exclude option under the configuration tab allows us to, you guessed it, exclude certain site section(s) or URLs with parameters. Let’s assume for this example that we want to exclude the blog section of Merkle’s website and crawl the remaining pages. Here’s how we can do that. 

Screaming Frog Exclude
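The screenshots are not reproduced here, so the exact patterns entered in Screaming Frog are unknown. As a rough sketch, a pattern along the following lines would cover both cases, assuming the blog lives under /in/blog/ (an assumption, as are the sample URLs below). The Python code only simulates the filtering on a list of URLs; it does not call Screaming Frog itself.

```python
import re

# Hypothetical pattern: assumes the blog section lives under /in/blog/.
# The dots are escaped so they match literal dots only.
BLOG_PATTERN = re.compile(r"https://www\.merkleinc\.com/in/blog/.*")

# Hypothetical sample URLs, for illustration only.
urls = [
    "https://www.merkleinc.com/in/blog/some-post",
    "https://www.merkleinc.com/in/blog/",
    "https://www.merkleinc.com/in/what-we-do",
]

# "Include" behaviour: keep only the URLs the pattern matches.
include_crawl = [u for u in urls if BLOG_PATTERN.fullmatch(u)]

# "Exclude" behaviour: keep everything the pattern does not match.
exclude_crawl = [u for u in urls if not BLOG_PATTERN.fullmatch(u)]

print(include_crawl)
print(exclude_crawl)
```

Ending the pattern with .* means it still works if a tool requires the regex to match the entire URL rather than just part of it.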

  2. Google Analytics  

Regex can be very useful in Google Analytics right from setting up custom filters to creating advanced segments or setting up precise filters on the fly while analyzing reports.  

  • Advanced Segments Using Regex 
    The OR (|) operator can be used to look for multiple matching strings at once. This is very useful when we need to set up advanced segments to analyze traffic coming from multiple sources or across any other dimension. The example below shows a segment created to track traffic coming from Facebook, Twitter, LinkedIn, and YouTube. It is a simple OR operation that looks for each string in the list of sources and pulls data for the matches found. 

Google Analytics Segment Example
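The segment screenshot is not reproduced here, but the OR pattern it describes can be sketched as facebook|twitter|linkedin|youtube. The Python code below applies that pattern to a made-up list of source values to show which rows such a segment would keep; the source names are assumptions for illustration and not real report data.

```python
import re

# One alternative per social network, separated by the OR (pipe) operator.
SOCIAL_PATTERN = re.compile(r"facebook|twitter|linkedin|youtube")

# Hypothetical source values, for illustration only.
sources = [
    "google",
    "facebook.com",
    "m.facebook.com",
    "t.co",
    "linkedin.com",
    "youtube.com",
    "bing",
]

# search() looks for the pattern anywhere in the value, so subdomains match too.
social_sources = [s for s in sources if SOCIAL_PATTERN.search(s)]

print(social_sources)
```

The value “t.co” is not picked up because it does not contain the string “twitter”. If shortened domains like this show up in your reports, they need their own alternative in the pattern.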

  3. Google Search Console 

Google recently added regex support to GSC, and it has already become one of the platform’s popular features. When filtering with a regex, you can choose to keep either the rows that match the pattern or the rows that do not match it.  

  • Filtering for Search Intent Using Regex 
    If you want to look for questions that people are asking about your products/services, you can filter performance data at query level using regex as shown below. 

Regex Filter

You can set up similar filters to analyze other types of search intent, such as transactional or navigational. 
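Since the filter screenshot is not reproduced here, the sketch below uses assumed patterns: one for question queries that start with a question word, and one for transactional queries. Both the patterns and the sample queries are illustrative assumptions, and the Python code only simulates the filter on a list of strings.

```python
import re

# Assumed question-intent pattern: the query starts with a question word
# followed by whitespace, so "whatsapp" is not mistaken for "what".
QUESTION_PATTERN = re.compile(r"^(who|what|where|when|why|how)\s")

# Assumed transactional pattern: a buying-related word anywhere in the query.
TRANSACTIONAL_PATTERN = re.compile(r"\b(buy|price|pricing)\b")

# Hypothetical queries, for illustration only.
queries = [
    "what is regex",
    "regex cheat sheet",
    "how to use regex in gsc",
    "whatsapp marketing",
    "buy seo tools",
]

question_queries = [q for q in queries if QUESTION_PATTERN.search(q)]
transactional_queries = [q for q in queries if TRANSACTIONAL_PATTERN.search(q)]

print(question_queries)
print(transactional_queries)
```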

  • Grouping Pages by Type or Characteristics 
    Let’s say you wanted to analyze how your top navigation is doing or how your most important product pages across multiple categories are performing. You can easily do this by setting up a regex. Below is an example of a regex filter to track how Merkle’s services pages are performing.

Grouping Pages by Types

^https://www.merkleinc.com/in/.*/(amazon-and-eretail|search-engine-marketing-sem-ppc|programmatic-media-display-advertising|social-media-marketing-advertising|shopping-feed-management-google-bing-plas)$ 
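Before saving a filter like this, it can help to test the pattern against a few URLs. The Python sketch below uses the regex exactly as written above; the sample URLs, including the “services” path segment, are hypothetical and chosen only to show what does and does not match.

```python
import re

# The pattern from the article, split across lines for readability.
SERVICES_PATTERN = re.compile(
    r"^https://www.merkleinc.com/in/.*/"
    r"(amazon-and-eretail|search-engine-marketing-sem-ppc|"
    r"programmatic-media-display-advertising|social-media-marketing-advertising|"
    r"shopping-feed-management-google-bing-plas)$"
)

# Hypothetical URLs, for illustration only.
pages = [
    "https://www.merkleinc.com/in/services/amazon-and-eretail",
    "https://www.merkleinc.com/in/services/amazon-and-eretail/",
    "https://www.merkleinc.com/in/amazon-and-eretail",
    "https://www.merkleinc.com/in/blog/regex-guide",
]

service_pages = [p for p in pages if SERVICES_PATTERN.search(p)]

# group(1) holds whichever alternative inside the parentheses matched.
matched_slug = SERVICES_PATTERN.search(pages[0]).group(1)

print(service_pages, matched_slug)
```

Two details are visible here. The $ anchor rejects the URL with a trailing slash, and the .*/ part requires at least one path segment between /in/ and the service slug. The unescaped dots in the domain also match any character, so writing them as \. would make the pattern stricter.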

 

Conclusion  

This guide aims to help you understand what regular expressions are and how you can start defining patterns to suit your requirements. Kudos if you have made it this far. Now is the time to put what you have learned into practice and start experimenting with different expressions and the types of data you might want to match. 

 
