SEO Professional’s Guide to Regular Expressions
Regular expressions, or regex, are used to find text matching a predefined pattern. They can be a handy tool for any SEO professional whose typical day at work involves crawling large websites, pulling reports from GA/GSC, or analyzing data in Google Sheets. This guide will help you understand the basic building blocks of a regular expression, show you how to write one, and walk you through various examples.
What are Regular Expressions (Regex)?
Regex is used to find text matching a predefined pattern. But what exactly does that mean? Think of a situation where you want to find only email addresses or phone numbers in a block of text. As you know, a typical email ID has two text strings on either side of the “@” character, and a typical phone number in India has 10 digits, with or without a preceding country code. There is a pattern here that, if defined correctly, lets you pick out exactly the text you are after. A pattern used to describe text in this way is known as a regular expression.
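To make the idea concrete, here is a minimal Python sketch of the two patterns described above. Both patterns are deliberately simplified assumptions for illustration (real email addresses and phone formats vary far more), and the sample text is made up. The individual symbols are explained in the sections that follow.

```python
import re

# Simplified illustration only: text, "@", then a dotted domain name.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Optional "+91" country code (with an optional space or hyphen),
# followed by exactly 10 digits.
PHONE_IN = re.compile(r"(?:\+91[ -]?)?\d{10}\b")

# Made-up sample text.
text = "Write to seo.team@example.com or call +91 9876543210 / 9123456789."

emails = EMAIL.findall(text)
phones = PHONE_IN.findall(text)

print(emails)  # ['seo.team@example.com']
print(phones)  # ['+91 9876543210', '9123456789']
```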
Why Learn About Regular Expressions?
As an SEO professional, you might be wondering why you need to learn regex or how this skill will benefit your work. The short answer is that being hands-on with regex can speed up many of the tasks an SEO professional performs on a regular basis. To elaborate, regex can help you create advanced filters in analytics, apply custom filters in Google Search Console to analyze data more accurately, run custom crawls with tools such as Screaming Frog, manually extract information from any website, rewrite URLs or filenames, or simply supercharge your search-and-replace operations in Word. Once you understand the concept and get comfortable with writing regex, you will find many more areas where your newfound skill will be useful.
Building Blocks of a Regex
A regex consists of characters (literal characters) and metacharacters (special characters). As the names suggest, a literal character is simply any letter, symbol, number, space, or punctuation mark, while a metacharacter is a character with a special meaning. Let’s explore both character types in detail.
Literal Characters
As mentioned above, a literal character simply matches itself, i.e., the letter “a” in a regex will match the letter “a” in the given string. A search pattern made up entirely of literal characters is the simplest form of regex and is the same as searching for that specific sequence of characters in a given string. For example, the regex /search/ will match the exact word “search” or the first six letters of the word “searching”. Keep in mind that a regex is inherently eager, meaning it returns the first match it finds rather than the most accurate one, and is case sensitive by default. Therefore, in most cases a purely literal regex will not be enough for advanced operations, and you will need to employ metacharacters, word boundaries, character sets, etc., to achieve a precise match.
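The behavior of a literal pattern can be checked with a short Python sketch. The sample strings are made up for illustration, and Python's built-in `re` module stands in for whichever tool you actually use.

```python
import re

literal = re.compile(r"search")

exact = literal.search("search")             # matches the whole word
partial = literal.search("searching")        # matches the first six letters
case_miss = literal.search("Search Console")  # None: case sensitive by default
case_hit = re.search(r"search", "Search Console", re.IGNORECASE)

# Eager matching: the first occurrence wins, even inside another word.
eager = literal.search("research search")

print(partial.group())   # search
print(case_miss)         # None
print(case_hit.group())  # Search
print(eager.span())      # (2, 8), i.e. inside "research"
```

The last line shows why a literal pattern alone is often not precise enough: the match lands inside “research” rather than on the standalone word.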
Metacharacters
Characters with a special meaning inside a regex are called metacharacters. As you might have guessed, these are not used to match their literal versions. Instead, they give the surrounding literal characters different meanings and transform your regex from a simple text search into a powerful tool. The table below lists some of the metacharacters with their meanings.
| Metacharacter | Name | Meaning | Example |
| --- | --- | --- | --- |
| . | Dot | Matches any single character except a newline | T.O will match both “TWO” and “TOO” |
| ^ | Caret | Matches the beginning of the input string | ^play will match “play” but not “display” |
| $ | Dollar | Matches the end of the input string | play$ will match “play” or “display” but not “plays” |
| + | Plus | Matches the previous character one or more times | 110+ will match “110”, “1100” and so on, but not “11” |
| * | Asterisk | Matches the previous character zero or more times | 110* will match “11”, “110”, “1100” and so on |
| ? | Question Mark | Matches the previous character zero or one time | 110? will match “11” or “110” only |
| \| | OR (Pipe) | Matches either of the expressions on the two sides of the pipe | a\|b will match “a” or “b” |
| [] | Set (Square Brackets) | Defines a character set; any one of the characters within the brackets can be matched | go[og]gle will match “google” or “goggle” |
| {} | Repetition (Braces) | Defines the number of repetitions by specifying a minimum and maximum within the braces* | [0-9]{10} will match any 10-digit number |
| \ | Escape (Backslash) | Turns a metacharacter into a literal character | product\.html will match “product.html” |
| () | Group (Parentheses) | Groups characters so they can be matched as a unit | Sun(ny)? will match “Sun” and “Sunny” |
The characters *, ?, +, and {} are also known as quantifiers.
*The braces are used to specify the minimum and maximum number of repetitions, in the following combinations:
- {n} – Matches the previous character exactly n times.
- {n,m} – Matches the previous character at least n and at most m times.
- {n,} – Matches the previous character at least n times, with no upper limit. Writing x{0,} is the same as writing x*, and writing x{1,} is the same as writing x+.
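The metacharacter examples above can be verified with a small Python sketch. The helper `full_matches` is a hypothetical name introduced here; it uses `re.fullmatch` so that each candidate must match the pattern from start to end.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

numbers = ["11", "110", "1100", "11000"]

plus = full_matches(r"110+", numbers)        # ['110', '1100', '11000']
star = full_matches(r"110*", numbers)        # ['11', '110', '1100', '11000']
optional = full_matches(r"110?", numbers)    # ['11', '110']
exactly = full_matches(r"110{2}", numbers)   # ['1100']
between = full_matches(r"110{1,2}", numbers)  # ['110', '1100']
at_least = full_matches(r"110{2,}", numbers)  # ['1100', '11000']

# Anchors: ^ ties the match to the start, $ to the end.
starts = [w for w in ["play", "display"] if re.search(r"^play", w)]
ends = [w for w in ["play", "display", "plays"] if re.search(r"play$", w)]

# The dot matches any character; escaping it makes it a literal dot.
loose = bool(re.search(r"product.html", "product-html"))    # True
strict = bool(re.search(r"product\.html", "product-html"))  # False

print(starts)  # ['play']
print(ends)    # ['play', 'display']
```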
While discussing metacharacters we also introduced concepts like character sets and grouping.
Character Set
A character set, defined within square brackets […], matches any one of the characters inside the set, irrespective of the order in which they are listed. For example, the regex /h[oi]t/ will match both “hit” and “hot”. Since a character set matches only one character at a time, listing every digit from 0 to 9 or every lower- or upper-case letter would make the set unnecessarily long and too bothersome to manage. This is where character ranges come into the picture.
Character ranges
Character ranges make it easier to define character sets where there is a known order between two characters. The hyphen is used as a metacharacter to define a range within a character set. For example:
- [0-9] – All digits from 0 to 9
- [a-z] – All lower-case letters from a to z
- [A-Z] – All upper-case letters from A to Z
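A short Python sketch of character sets and ranges in action follows. The sample strings and the slug rule (lower-case letters, digits, and hyphens) are assumptions chosen for illustration.

```python
import re

# A set matches exactly one of the listed characters.
set_hits = re.findall(r"h[oi]t", "hit hot hat hut")

# A range inside a set, repeated with + to capture whole numbers.
numbers = re.findall(r"[0-9]+", "Top 10 tips for 2024")

# Ranges can be combined in one set; a hyphen placed last is a literal hyphen.
SLUG = re.compile(r"[a-z0-9-]+")
valid_slug = bool(SLUG.fullmatch("seo-guide-2024"))
invalid_slug = bool(SLUG.fullmatch("SEO_Guide"))

print(set_hits)      # ['hit', 'hot']
print(numbers)       # ['10', '2024']
print(valid_slug)    # True
print(invalid_slug)  # False
```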
At this point you might be wondering if there’s an even better way to write character sets. Shorthand character sets make the task even easier and lead to cleaner, simpler regular expressions.
Shorthand character sets
| Notation | Meaning | Negative |
| --- | --- | --- |
| \d | Any digit from 0 to 9. Same as writing [0-9] | \D – Not a digit [^0-9] |
| \w | A word character: any upper- or lower-case letter, any digit from 0 to 9, or the underscore. Same as writing [a-zA-Z0-9_] | \W – Not a word character [^a-zA-Z0-9_] |
| \s | A whitespace character: space, tab, carriage return, line feed, or form feed. Same as writing [ \t\r\n\f] | \S – Not a whitespace character [^ \t\r\n\f] |
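The shorthand sets can be tried out with a few lines of Python; the sample strings are made up for illustration.

```python
import re

digits = re.findall(r"\d+", "Order 66 shipped in 3 days")
words = re.findall(r"\w+", "meta_title: SEO-101")
tokens = re.split(r"\s+", "best  seo\ttools\n2024")

# The negated shorthand \D removes everything that is not a digit.
digits_only = re.sub(r"\D", "", "+91 98765-43210")

print(digits)       # ['66', '3']
print(words)        # ['meta_title', 'SEO', '101']
print(tokens)       # ['best', 'seo', 'tools', '2024']
print(digits_only)  # 919876543210
```

One caveat specific to Python 3: for text strings, \d, \w, and \s are Unicode-aware and match more than the ASCII sets in the table. Passing the `re.ASCII` flag restricts them to the ASCII equivalents.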
Grouping Characters
Opening and closing parentheses are metacharacters used for grouping parts of a regex, applying repetition to the group as a whole, and performing match-and-replace operations. Unlike character sets, which match only one character at a time, grouping, as the name suggests, allows matching more than one character. Combined with quantifiers, this results in very powerful regular expressions. For example:
- rank(ing)? – Matches both “rank” and “ranking”.
- stra(ight|it) – Matches “straight” or “strait”.
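The sketch below runs the two grouping examples in Python and adds a match-and-replace case. The URL structure in the replace example is hypothetical and only serves to show how captured groups can be reused.

```python
import re

# Optional group: "ing" may or may not follow "rank".
# finditer is used because findall would return only the group contents.
optional = [m.group() for m in re.finditer(r"rank(ing)?", "rank ranking ranked")]

# Alternation inside a group.
variants = [w for w in ["straight", "strait", "strat"]
            if re.fullmatch(r"stra(ight|it)", w)]

# Match and replace: \1 and \2 refer to the first and second captured groups.
# Hypothetical URL layout: /blog/<year>/<month>/<slug>
rewritten = re.sub(r"/blog/(\d{4})/(\d{2})/", r"/blog/\2-\1/",
                   "/blog/2024/05/regex-guide")

print(optional)   # ['rank', 'ranking', 'rank']
print(variants)   # ['straight', 'strait']
print(rewritten)  # /blog/05-2024/regex-guide
```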
This covers the basics of regex that you, as an SEO, need to understand. Let’s dive into some use cases and put our new skills to use.
Regex Use Cases for SEO
- Screaming Frog
While working with large sites, effectively managing the conflict between the time spent crawling and the need for regular crawls is extremely important. Screaming Frog offers two very useful configurations to customize crawls based on whether you want to include or exclude certain section(s) of the site.
- Include Site Section(s)
A regex can help limit the crawl to specific section(s) of the site, thereby allowing you to prioritize the most important sections/pages for regular audits. For example, if we wanted to crawl only the blog section on https://www.merkleinc.com/in/, we can easily do that with a simple regex that matches only URLs under the blog path.
- Exclude Site Section(s)
The exclude option under the configuration tab allows us to, you guessed it, exclude certain site section(s) or URLs with parameters. Let’s assume for this example we wanted to exclude the blog section on Merkle’s website and crawl the remaining pages. An exclude pattern that matches the blog path does exactly that.
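The include and exclude logic can be simulated in Python before a pattern is pasted into a crawler. This is only a sketch: the `/in/blog/` path and the sample URLs are assumptions, and how a given crawler applies the pattern (full or partial match) should be checked in its own documentation.

```python
import re

# Hypothetical URLs; the real blog path on the site may differ.
urls = [
    "https://www.merkleinc.com/in/",
    "https://www.merkleinc.com/in/blog/",
    "https://www.merkleinc.com/in/blog/seo-regex-guide",
    "https://www.merkleinc.com/in/services/seo",
]

# Dots in the domain are escaped so they match literal dots only.
BLOG = re.compile(r"https://www\.merkleinc\.com/in/blog/.*")

# Include: keep only URLs that match the pattern.
included = [u for u in urls if BLOG.fullmatch(u)]

# Exclude: keep only URLs that do not match the pattern.
remaining = [u for u in urls if not BLOG.fullmatch(u)]

print(included)
print(remaining)
```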
- Google Analytics
Regex can be very useful in Google Analytics, from setting up custom filters to creating advanced segments or applying precise filters on the fly while analyzing reports.
- Advanced Segments Using Regex
The OR (|) operator can be used to look for multiple matching strings at once. This is very useful when we need to set up advanced segments to analyze traffic coming from multiple sources or across any other dimension. For example, a segment that tracks traffic coming from Facebook, Twitter, LinkedIn, and YouTube is a simple OR operation that looks for those strings in the list of sources and pulls data for the matches found.
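As a rough sketch, the OR pattern for such a segment could look like the one below. The list of source values is invented for illustration and real source names in your reports may differ.

```python
import re

# One alternative per social network, separated by the OR operator.
SOCIAL = re.compile(r"facebook|twitter|linkedin|youtube")

# Hypothetical traffic sources as they might appear in a report.
sources = [
    "google",
    "m.facebook.com",
    "twitter.com",
    "bing",
    "linkedin.com",
    "youtube.com",
    "newsletter",
]

# search() looks for the pattern anywhere in the source string.
social_sources = [s for s in sources if SOCIAL.search(s)]

print(social_sources)
# ['m.facebook.com', 'twitter.com', 'linkedin.com', 'youtube.com']
```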
- Google Search Console
Google recently added regex support to GSC, and it has already become one of the platform’s popular features. The feature supports both matching and non-matching filter options when using a regex.
- Filtering for Search Intent Using Regex
If you want to look for questions that people are asking about your products/services, you can filter performance data at the query level using a regex that matches queries starting with question words.
You can set up similar filters to analyze other types of search intent, such as transactional or navigational.
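A possible starting point for such intent filters is sketched below in Python. Both patterns and all sample queries are assumptions, not patterns taken from GSC itself, so adjust the word lists to your own market.

```python
import re

# Queries that start with a question word; \b stops "what" matching "whatsapp".
QUESTION = re.compile(r"^(who|what|when|where|why|how)\b")

# A rough transactional filter built the same way.
TRANSACTIONAL = re.compile(r"\b(buy|price|pricing|hire)\b")

# Made-up search queries.
queries = [
    "how to write regex for seo",
    "regex cheat sheet",
    "what is a canonical tag",
    "whatsapp marketing agency",
    "buy seo audit",
]

questions = [q for q in queries if QUESTION.search(q)]
transactional = [q for q in queries if TRANSACTIONAL.search(q)]

print(questions)      # ['how to write regex for seo', 'what is a canonical tag']
print(transactional)  # ['buy seo audit']
```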
- Grouping Pages by Type or Characteristics
Let’s say you wanted to analyze how your top navigation is doing or how your most important product pages across multiple categories are performing. You can easily do this by setting up a regex. Below is an example of a regex filter to track how Merkle’s services pages are performing.
^https://www\.merkleinc\.com/in/.*/(amazon-and-eretail|search-engine-marketing-sem-ppc|programmatic-media-display-advertising|social-media-marketing-advertising|shopping-feed-management-google-bing-plas)$
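Before relying on a long filter like this, it helps to test it against a few URLs. The Python sketch below rebuilds the pattern from the list of slugs, with the dots in the domain escaped, and runs it on sample URLs. The intermediate path segments in those URLs are hypothetical.

```python
import re

SERVICE_SLUGS = [
    "amazon-and-eretail",
    "search-engine-marketing-sem-ppc",
    "programmatic-media-display-advertising",
    "social-media-marketing-advertising",
    "shopping-feed-management-google-bing-plas",
]

# ^ and $ anchor the match; the group lists the allowed final path segments.
SERVICES = re.compile(
    r"^https://www\.merkleinc\.com/in/.*/(" + "|".join(SERVICE_SLUGS) + r")$"
)

# Hypothetical URLs used only to exercise the pattern.
urls = [
    "https://www.merkleinc.com/in/what-we-do/digital-media/amazon-and-eretail",
    "https://www.merkleinc.com/in/what-we-do/digital-media/amazon-and-eretail/faq",
    "https://www.merkleinc.com/in/blog/social-media-marketing-advertising-tips",
]

service_pages = [u for u in urls if SERVICES.search(u)]

print(service_pages)
# ['https://www.merkleinc.com/in/what-we-do/digital-media/amazon-and-eretail']
```

Only the first URL passes, because the dollar sign requires the URL to end with one of the listed slugs.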
Conclusion
The aim of this guide is to help you understand what regular expressions are and how you can start defining patterns to suit your requirements. Kudos if you have made it this far. Now is the time to put the knowledge you have gained into practice and start experimenting with different forms of expressions and the types of data you might want to match.