
Crawl Budget Optimization

There have been many myths surrounding Crawl Budget, ranging from site owners having no control over the crawl limit to crawl frequency being a ranking signal. Google has been very clear in its webmaster guidelines about optimizing the crawl budget.

 

What is Crawl Budget? 

Crawl Budget can be defined as the number of pages Google will crawl on a site within a given timeframe. Crawl capacity and crawl demand are the two primary factors driving the crawl budget.  

Google wants to crawl a site without overloading its servers, so it calculates the crawl capacity of a site by assessing its crawl health and any crawl limit set by the site owner.  

Needless to say, Google’s own limited crawling resources also play an important role in managing the crawl budget.
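
As a rough, unofficial mental model (Google does not publish a formula for this), the effective budget is bounded by whichever of the two factors is smaller. The tiny Python sketch below uses hypothetical numbers purely to illustrate that relationship.

```python
# Conceptual sketch only: Google publishes no crawl-budget formula.
# The min() relationship and the numbers are illustrative assumptions.

def effective_crawl_budget(crawl_capacity: int, crawl_demand: int) -> int:
    """Pages likely crawled per day, limited by the smaller of the two factors."""
    return min(crawl_capacity, crawl_demand)

# Hypothetical example: the server could comfortably handle 5,000 Googlebot
# requests a day, but only 1,200 URLs are popular or fresh enough to be wanted.
print(effective_crawl_budget(crawl_capacity=5000, crawl_demand=1200))  # -> 1200
```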

 

Who should care about Crawl Budget? 

Google has made it clear that crawl budget matters mainly for large sites with millions of pages and for medium-sized websites whose content changes or updates frequently, such as – 

  • Large e-commerce sites 

  • News publishers 

  • News aggregators 

  • Any other sites with huge & frequent content changes 

On the contrary, any site with only a few hundred pages shouldn’t worry too much about Crawl Budget. A simple check of site health – internal redirects, use of canonical URLs for internal linking, sitemap health, etc. – is enough to ensure you are not wasting the allocated crawl budget.
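
For a small site, that check can be as simple as spot-checking a handful of internal URLs. Below is a minimal, hypothetical Python sketch (it assumes the third-party `requests` package is installed, and the URLs are placeholders) that flags internal links hitting redirects or pages whose canonical tag points elsewhere.

```python
"""Minimal site-health spot check. Assumes the `requests` package is installed;
the URLs below are placeholders to swap for a sample of your own internal links."""
import re
import requests

SAMPLE_URLS = [
    "https://www.example.com/",
    "https://www.example.com/category/widgets",   # hypothetical paths
    "https://www.example.com/old-landing-page",
]

for url in SAMPLE_URLS:
    resp = requests.get(url, allow_redirects=False, timeout=10)
    if 300 <= resp.status_code < 400:
        # Internal links pointing at redirects waste crawl requests.
        print(f"{url} -> {resp.status_code} redirect to {resp.headers.get('Location')}")
        continue
    # Naive canonical check: flag pages whose canonical tag points elsewhere.
    match = re.search(r'<link[^>]*rel=["\']canonical["\'][^>]*href=["\']([^"\']+)',
                      resp.text, re.I)
    canonical = match.group(1) if match else None
    if canonical and canonical.rstrip("/") != url.rstrip("/"):
        print(f"{url} declares canonical {canonical}")
```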

 

Can you increase/decrease the crawl budget? 

There isn’t any way to ask Google to increase or decrease the budget for your site. However, you can certainly control how Google spends its resources crawling the pages across your site.  

As Google does not want to overwhelm your web servers, it becomes vital to make index-worthy pages available to Google for crawling.  

Additionally, webmasters can also limit the crawl rate within Google Search Console if they find Googlebot slowing down their servers.

GSC crawl rate setting – the default in Search Console allows Google to optimize the crawl rate.

 

Hacks to optimize the crawl budget

  1. Effective use of Robots.txt – Remember, robots.txt is an effective crawl management tool. Disallowing unwanted sections of the website from crawling prevents the crawl budget from being wasted (an illustrative sketch follows this list). Identify whether any part of the website is not meant to be indexed, which typically includes – 

  • Duplicate content 

  • Parameterized URLs 

  • Low quality/thin content 

  • Unimportant, non-critical resources that Googlebot does not need to load

Robots.txt example

More importantly, do not rely on the noindex tag to save crawl budget: Google still has to request the page before it can see the noindex tag and drop it.  

It is pertinent to note that the Crawl-Delay directive is ignored by Google, so it is best avoided. 
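
To make the idea concrete, here is a small, hypothetical sketch: a few example Disallow rules (the paths are assumptions for a typical site) checked with Python’s built-in urllib.robotparser. Note that Googlebot honours * wildcards in robots.txt, while the standard-library parser used here does not, so the example sticks to plain path prefixes.

```python
"""Illustrative robots.txt rules checked with Python's built-in urllib.robotparser.
The disallowed paths are assumptions for a typical site, not recommendations."""
from urllib import robotparser

EXAMPLE_ROBOTS_TXT = """\
User-agent: *
# Keep crawlers out of internal search and checkout sections
Disallow: /search
Disallow: /cart/
Sitemap: https://www.example.com/sitemap.xml
"""

parser = robotparser.RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())

for url in ["https://www.example.com/products/blue-widget",
            "https://www.example.com/search?q=widgets"]:
    verdict = "ALLOW" if parser.can_fetch("Googlebot", url) else "BLOCK"
    print(f"{verdict}  {url}")
```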

  

  2. Non-indexable URLs –  

    • 301/302 Redirects – Let’s not forget that one of the factors affecting crawl budget is “Crawl Health”. If your site has loads of internal redirects, it will simply end up hurting the crawl limit. Run an in-depth audit using tools like DeepCrawl, Screaming Frog or OnCrawl to identify any redirect loops or chains on the site (a spot-check sketch follows this list).  

    • Permanently Removed Pages – Ensure that pages which have been deleted or removed return 404/410 status codes. Search engines use a 404 status as an indication not to crawl that particular URL in the future.

    • Soft 404s – Although these pages have no real significance, search engines will keep spending resources crawling them, leaving less budget to crawl and index other important pages.
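
As a quick supplement to a full crawler audit, the hypothetical sketch below (assuming the `requests` package is installed; the starting URL is a placeholder) follows redirects hop by hop so chains and loops become visible.

```python
"""Spot check for redirect chains and loops. Assumes the `requests` package is
installed; the starting URL is a placeholder."""
from urllib.parse import urljoin
import requests

def redirect_chain(url, max_hops=10):
    """Follow redirects hop by hop and return a list of (url, status) pairs."""
    hops, seen = [], set()
    while len(hops) < max_hops:
        resp = requests.head(url, allow_redirects=False, timeout=10)
        hops.append((url, resp.status_code))
        if resp.status_code not in (301, 302, 303, 307, 308):
            break
        if url in seen:                     # loop detected
            break
        seen.add(url)
        url = urljoin(url, resp.headers["Location"])
    return hops

for hop_url, status in redirect_chain("https://www.example.com/old-page"):
    print(status, hop_url)
```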

  3. Page Loading Speed – Reducing the page loading time has a positive impact on Googlebot’s crawling: the quicker the server responds, the more pages Google can crawl. Here are quick fixes to consider for faster loading times – 

    • Robots.txt – as explained earlier, block non-critical resources from being loaded. 

    • Leverage Browser Caching – Cache JS/CSS/images wherever possible so Google does not have to crawl them over and over again, saving its resources to crawl the more important content on your site (a quick header check follows this list).  

    • Increase the server capacity – Google crawls without wanting to overload your site’s servers. It is therefore important to have strong servers that do not get easily overloaded. 
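
A quick way to sanity-check the caching and responsiveness points above is to inspect the headers and response time of a few static assets. The sketch below is only a spot check (it assumes the `requests` package is installed; the asset URLs are placeholders).

```python
"""Spot check of caching headers and server response time for static assets.
Assumes the `requests` package is installed; asset URLs are placeholders."""
import requests

ASSETS = [
    "https://www.example.com/static/app.js",
    "https://www.example.com/static/styles.css",
]

for asset in ASSETS:
    resp = requests.head(asset, timeout=10)
    cache_control = resp.headers.get("Cache-Control", "<missing>")
    # A slow or header-less response hints at caching/server-capacity work to do.
    print(f"{asset}\n  status={resp.status_code}  "
          f"time={resp.elapsed.total_seconds():.2f}s  Cache-Control={cache_control}")
```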

  

  4. Sitemap Errors – Minimising sitemap errors will go a long way in ensuring optimal use of the allocated crawl budget. Some common sitemap errors include (a rough audit sketch follows this list) – 

    • Submitted URLs marked noindex (shown in the GSC Coverage report)

    • Non-200 status code URLs present in the sitemap (Sitemap Crawl) 

    • Non-canonical URLs in the sitemap (Sitemap crawl) 

    • URLs with low or no content (Sitemap Crawl)
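
Dedicated crawlers report these errors, but a rough self-check is easy to script. The hypothetical sketch below (assuming the `requests` package, a placeholder sitemap URL, and a plain urlset rather than a sitemap index) flags non-200 URLs, pages that look noindexed, and pages with very little content.

```python
"""Rough sitemap audit. Assumes the `requests` package is installed, a plain
urlset sitemap (not a sitemap index), and a placeholder sitemap URL."""
import re
import xml.etree.ElementTree as ET
import requests

SITEMAP_URL = "https://www.example.com/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(requests.get(SITEMAP_URL, timeout=10).content)
for loc in root.findall("sm:url/sm:loc", NS):
    url = loc.text.strip()
    resp = requests.get(url, allow_redirects=False, timeout=10)
    problems = []
    if resp.status_code != 200:
        problems.append(f"non-200 ({resp.status_code})")
    elif re.search(r'<meta[^>]*name=["\']robots["\'][^>]*noindex', resp.text, re.I):
        problems.append("noindex meta tag")
    elif len(resp.text) < 2000:
        problems.append("very little content")   # arbitrary threshold
    if problems:
        print(url, "->", ", ".join(problems))
```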

  5. Content – The popularity or staleness of the content dictates crawl demand. Googlebot wants to keep its index up to date, and to do so, it frequently crawls the content that attracts the most traffic. If you analyze the server log files, you will see Googlebot’s activity concentrated on high-traffic pages. 
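
A quick, hypothetical way to see that in your own data is to count Googlebot requests per URL in the access log. The sketch below assumes a combined-format log at a placeholder path; a stricter check would also verify that the requests really come from Google’s IP ranges.

```python
"""Minimal server-log check of Googlebot activity. The log path and the
combined log format are assumptions; adjust the parsing to your setup."""
import re
from collections import Counter

LOG_FILE = "access.log"   # placeholder path
hits = Counter()

with open(LOG_FILE, encoding="utf-8", errors="ignore") as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        match = re.search(r'"(?:GET|POST) (\S+)', line)
        if match:
            hits[match.group(1)] += 1

for path, count in hits.most_common(10):
    print(f"{count:6d}  {path}")
```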

Google uses hints from certain on-page elements to identify content changes/updates on a page. These elements chiefly include – 

  • The <lastmod> element within sitemaps 

  • Structured data dates  

Some other technical signals you can consider updating include ETags and the Last-Modified HTTP header, to pass the content-change signal on to Google.  

Note that unless there is a substantial change on the page, Google may overlook or ignore these signals entirely. It is therefore important to use the above signals wisely.
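
For illustration, the sketch below shows one way such headers can be produced, using only Python’s standard library: a toy server that hashes its (hard-coded) page into an ETag, sends a Last-Modified date, and answers 304 when the ETag still matches. It is a minimal demonstration of the mechanism, not a production setup.

```python
"""Toy demonstration of ETag / Last-Modified / 304 handling with Python's
built-in http.server. The page content and hashing scheme are illustrative."""
import hashlib
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><body><h1>Hello</h1><p>Updated content goes here.</p></body></html>"
ETAG = '"%s"' % hashlib.md5(PAGE).hexdigest()   # changes only when the content changes
LAST_MODIFIED = formatdate(usegmt=True)          # in practice, set when the page really changes

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("If-None-Match") == ETAG:
            self.send_response(304)              # nothing changed: no body re-sent
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("ETag", ETAG)
        self.send_header("Last-Modified", LAST_MODIFIED)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write(PAGE)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Handler).serve_forever()
```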

 

Common Misconceptions

  • Crawl Frequency is a ranking signal – Crawling is what gets pages into the SERPs; it is not a ranking factor. Search engines use a multitude of signals to determine rankings.  

  • JavaScript files have no impact on the crawl budget – Unless they are cached, JS files require an additional rendering step, which demands more of Google’s resources on every crawl and therefore does affect the crawl budget.  

  • Only Google has a crawl budget – All search engines have limited resources to crawl the humongous world wide web, so they all have crawl budgets too. The steps to optimize the crawl budget, however, remain largely the same.

  • Small edits to the content keep its freshness alive – Not true. If Google identifies no major changes, it will simply crawl those pages less often.  

  

To summarize – 

  • Crawl Budget mainly affects huge sites with millions of pages or medium-sized sites with frequent content changes 

  • Unless there is a substantial content change, <lastmod> attributes or other content-change signals will have no impact on the crawl frequency 

  • To make the most of the allocated crawl budget, a few things to check are –

    • Strong servers 

    • Improving the page loading time 

    • Crawl health, etc. 

 

Watch this SEO myth-busting video with Martin Splitt and Alexis Sanders, https://www.youtube.com/watch?v=am4g0hXAA8Q, which covers some of the unanswered questions on “Crawl Budget Optimization”.
