Avoiding Inaccurate URL Counts in Google Webmaster Tools Sitemaps

The Sitemaps page in Google Webmaster Tools (GWT) is extremely useful for webmasters who want better insight into how Google is crawling their XML sitemaps. However, the report can mislead you if you have inadvertently inflated the number of URLs submitted and indexed, which is an easy mistake to make because of how Google handles sitemaps submitted on their own versus those submitted within a sitemap index. Let’s explore the problem and how webmasters can avoid it.

Double-Counting of URLs Submitted and Indexed

The problem arises when a site submits a particular XML sitemap through this tool both by itself and within a sitemap index. Rather than recognizing that the sitemap is duplicated, GWT counts its URLs once for each submission. This means that if a 1,500-URL XML sitemap with 1,000 of those URLs indexed is submitted by itself and within an index, Google will report 3,000 URLs submitted and 2,000 indexed.

Unfortunately, the same issue carries over to individual URLs within a sitemap. If you submit an XML sitemap with 1,500 URL entries and one URL is repeated 20 times, GWT will still report 1,500 URLs submitted rather than the 1,481 unique URLs the sitemap actually contains. Furthermore, if this URL is indexed, it will also be counted 20 times in the indexed figure.

Lastly, if a particular URL is repeated across different XML sitemaps, and those sitemaps have all been submitted to GWT, it will be counted once per occurrence towards both the submitted and indexed figures.
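To make the arithmetic concrete, here is a minimal Python sketch of the counting behavior described above. It is only a model of what the report appears to do, not Google's actual logic; the function names and the example.com URLs are invented for illustration.

```python
def reported_counts(submissions, indexed):
    """Count every URL entry in every submission, duplicates included."""
    submitted = sum(len(urls) for urls in submissions)
    indexed_total = sum(
        1 for urls in submissions for url in urls if url in indexed
    )
    return submitted, indexed_total


def unique_counts(submissions, indexed):
    """Count each distinct URL once, however many times it was submitted."""
    unique = {url for urls in submissions for url in urls}
    return len(unique), len(unique & indexed)


# Hypothetical 1,500-URL sitemap with its first 1,000 URLs indexed.
sitemap = [f"https://example.com/page-{i}" for i in range(1500)]
indexed = set(sitemap[:1000])

# The same sitemap submitted twice: once by itself, once via the index.
submissions = [sitemap, sitemap]

print(reported_counts(submissions, indexed))  # (3000, 2000)
print(unique_counts(submissions, indexed))    # (1500, 1000)
```

The gap between the two results is the inflation you would expect to see in the report when the same sitemap reaches GWT by two routes.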

Preventing Inaccurate URL Counts in GWT

As you can see, this issue can cause not only confusion but also inaccurate data for webmasters. To ensure this duplication issue does not affect your sitemaps data, we recommend the following:
  • If a particular XML sitemap is contained within your sitemap index, do NOT submit it by itself outside of this sitemap index
  • If you have multiple XML sitemaps, ensure that no URLs are duplicated between sitemaps
  • URLs submitted in an XML sitemap should:
    • Return a 200 status code
    • Have self-referencing canonical tags, or at least not have canonical tags pointing to other pages
    • Not be excluded in your robots.txt file
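
Several of these checks can be automated before you submit anything. The following is a rough Python sketch, using only the standard library, that looks for URLs repeated within or across sitemaps, sitemaps submitted on their own that the sitemap index already lists, and URLs disallowed by robots.txt. The function names, file names, and sample data are hypothetical, the status code and canonical tag checks are left out because they require fetching each page, and Python's robots.txt parser may not match Google's rule handling in every case.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from urllib.robotparser import RobotFileParser

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def locs(xml_text):
    """Return every <loc> value in a sitemap or sitemap index, in order."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iterfind(".//sm:loc", NS) if el.text]


def find_duplicate_urls(sitemaps):
    """Map each URL listed more than once to the sitemap names that list it."""
    seen = defaultdict(list)
    for name, xml_text in sitemaps.items():
        for url in locs(xml_text):
            seen[url].append(name)
    return {url: names for url, names in seen.items() if len(names) > 1}


def doubly_submitted(index_xml, standalone):
    """Return sitemaps submitted on their own that the index already lists."""
    return sorted(set(locs(index_xml)) & set(standalone))


def blocked_by_robots(robots_txt, urls, agent="Googlebot"):
    """Return the URLs that the given robots.txt rules disallow for agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


# Hypothetical example data; in practice you would load your real files.
XMLNS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
sitemaps = {
    "https://example.com/sitemap-a.xml": f"""<urlset {XMLNS}>
      <url><loc>https://example.com/</loc></url>
      <url><loc>https://example.com/about</loc></url>
      <url><loc>https://example.com/about</loc></url>
    </urlset>""",
    "https://example.com/sitemap-b.xml": f"""<urlset {XMLNS}>
      <url><loc>https://example.com/</loc></url>
      <url><loc>https://example.com/private/report</loc></url>
    </urlset>""",
}
index_xml = f"""<sitemapindex {XMLNS}>
  <sitemap><loc>https://example.com/sitemap-a.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-b.xml</loc></sitemap>
</sitemapindex>"""
standalone = ["https://example.com/sitemap-a.xml"]
robots_txt = "User-agent: *\nDisallow: /private/"

all_urls = [url for xml_text in sitemaps.values() for url in locs(xml_text)]
print(find_duplicate_urls(sitemaps))
print(doubly_submitted(index_xml, standalone))
print(blocked_by_robots(robots_txt, all_urls))
```

Any URL or sitemap that one of these functions returns is a candidate for cleanup before the sitemaps are submitted.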
Cleaning up your sitemap submission can have an immediate impact on the number of URLs shown as submitted and indexed, as shown by this screenshot from GWT for a site that corrected its practices.

[Screenshot: gwt_sitemap_correction]

Importance of GWT Sitemaps

XML sitemaps are an important aspect of digital strategy because they give search engines a better understanding of the content on a website. A sitemap is a file that lists the important pages on your site; search engines read this file so they can crawl your site more efficiently and intelligently. A sitemap can also include metadata about each page, such as when it was last updated and its relative importance. A sitemap is not required, but having one can help improve the crawling of your site.

Google Webmaster Tools and Bing Webmaster Tools both report on submitted sitemaps, which helps you better understand how each search engine is crawling your site. Keeping your sitemaps accurate and free of duplication ensures that these statistics are as accurate as possible. While the search engine index will not reflect this duplication, GWT will, so making certain that each URL submitted is unique will give you the best insight into how Google is crawling your site. Check out your sitemaps in Google Webmaster Tools today, and make sure you are making the most of this great tool!