By Bilal · Updated · 10 min read

Sitemap Audit: 12 Things to Check Before You Submit to Google

This is the checklist I run on every sitemap before recommending it gets submitted to Google. It catches the issues that turn into "Submitted URL not selected as canonical" or "Couldn't fetch" weeks later, when fixing them means undoing a deploy. None of these checks require special tools. Most are a curl command and a careful read.

Run through this list before your next sitemap submission. Items 1 through 6 are mandatory; 7 through 12 are quality-of-life improvements that prevent slow-burn issues.

1. Every URL in the sitemap returns HTTP 200

The single biggest cause of crawl waste is sitemaps full of URLs that 404, redirect, or return 5xx. Pick a random sample of 20 URLs and run:

curl -s -o /dev/null -w "%{http_code}\n" https://example.com/some-page

If anything other than a 200 comes back (the command above does not follow redirects, so a redirecting URL will show its 301 or 302), that URL does not belong in the sitemap. Common offenders:

  • Pages deleted from the CMS but still in the sitemap because the export job runs on stale data
  • URLs that redirect to a canonical version (you should be listing the canonical, not the source)
  • URLs that 200 but render a "this page no longer exists" template (these are soft 404s, see item 5)

For larger sites, do not check by hand. Run a tool like Screaming Frog in list mode on your sitemap, or write a 30-line script that reads the sitemap and HEAD-checks every URL.
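If you want to script it, here is a minimal sketch of that kind of script in Python (assuming Python 3 with the requests library installed; the sitemap URL is a placeholder):

import xml.etree.ElementTree as ET
import requests

SITEMAP_URL = "https://example.com/sitemap.xml"  # placeholder: your sitemap

# Pull every <loc> value out of the sitemap.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(requests.get(SITEMAP_URL, timeout=10).content)
urls = [loc.text.strip() for loc in root.findall(".//sm:loc", ns)]

# HEAD-check each URL without following redirects; anything other than 200 is flagged.
for url in urls:
    try:
        code = requests.head(url, allow_redirects=False, timeout=10).status_code
    except requests.RequestException as exc:
        print(f"ERROR {url} ({exc})")
        continue
    if code != 200:
        print(f"{code}  {url}")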

2. Every URL is the canonical version

For each sample URL, fetch the page and check what its rel="canonical" tag declares. The URL in your sitemap and the URL declared as canonical should be byte-for-byte identical.

curl -s https://example.com/page | grep -i 'rel="canonical"'

If the page lists a different canonical, your sitemap is wrong. Common mismatches:

  • Sitemap has trailing slash, canonical does not (or vice versa)
  • Sitemap uses https://www., canonical uses https://
  • Sitemap includes tracking parameters, canonical does not
  • Sitemap uses uppercase characters, canonical uses lowercase

These mismatches produce "Submitted URL not selected as canonical" in Search Console and gain you nothing. Fix the generator so it emits the canonical URLs.
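To spot-check canonicals in bulk, a rough sketch (same assumptions as the script above: Python 3 with requests; the sample URLs are placeholders, and the regex is a simple heuristic, not a full HTML parser):

import re
import requests

sample = [  # placeholder sample of URLs taken from your sitemap
    "https://example.com/page",
    "https://example.com/blog/post-1",
]

# Heuristic: assumes rel="canonical" appears before href inside the <link> tag.
canonical_re = re.compile(
    r'<link[^>]+rel=["\']canonical["\'][^>]+href=["\']([^"\']+)["\']', re.I
)

for url in sample:
    html = requests.get(url, timeout=10).text
    m = canonical_re.search(html)
    declared = m.group(1) if m else "(no canonical tag found)"
    print("OK" if declared == url else "MISMATCH", url, "->", declared)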

3. No URL has a noindex tag

Sample 10 URLs from the sitemap and check for noindex:

curl -sI https://example.com/page | grep -i x-robots-tag
curl -s  https://example.com/page | grep -i 'name="robots"'

If any of them have a meta robots noindex tag or an X-Robots-Tag noindex header, the URL should not be in the sitemap. The two are contradictory: the sitemap says "please index this," the page says "do not index this." Google logs this as a warning and ignores the URL.

The fix is either to remove the noindex (if you do want the page indexed) or remove the URL from the sitemap (if you do not).
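A sketch of the same check scripted over a sample (Python 3 with requests assumed; the sample URL is a placeholder, and the meta regex is a simple heuristic that assumes name comes before content):

import re
import requests

sample = ["https://example.com/page"]  # placeholder sample from your sitemap

meta_robots_re = re.compile(
    r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)["\']', re.I
)

for url in sample:
    resp = requests.get(url, timeout=10)
    header = resp.headers.get("X-Robots-Tag", "")
    m = meta_robots_re.search(resp.text)
    meta = m.group(1) if m else ""
    if "noindex" in header.lower() or "noindex" in meta.lower():
        print(f"NOINDEX  {url}  header={header!r}  meta={meta!r}")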

4. robots.txt does not block any URL in the sitemap

Open https://example.com/robots.txt and read every Disallow rule. For each rule, mentally check whether any URL in your sitemap matches. If a sitemap URL is blocked by robots.txt, both signals are wasted: the URL gets logged as "Blocked by robots.txt" in Search Console, and search engines cannot crawl it to evaluate the sitemap signal.

The most common version of this bug: a Disallow: /search rule meant to block internal search results also blocks /search-engine-optimization-guide, because robots.txt rules are prefix matches. Test specific paths in Google's robots.txt tester.
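For more than a few URLs, Python's standard-library robots.txt parser gives a rough approximation of the matching; it is not a perfect replica of Google's rules, so keep the tester for the final word. The sample URLs below are placeholders:

from urllib.robotparser import RobotFileParser

sample = [  # placeholder sample of sitemap URLs
    "https://example.com/search-engine-optimization-guide",
    "https://example.com/blog/post-1",
]

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

for url in sample:
    if not rp.can_fetch("Googlebot", url):
        print("Blocked by robots.txt:", url)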

5. No URL in the sitemap is a soft 404

A soft 404 is a page that returns 200 but contains content like "this product is no longer available" or "no results found." Google detects these heuristically and reports them in Search Console.

To check before submission: scan your sitemap for known empty or placeholder URLs. Common sources:

  • E-commerce: products with zero inventory rendering an "out of stock" template
  • Listings: filtered URLs returning "no results match your filters"
  • Blogs: tag pages with zero posts under the tag
  • News sites: archived content rewritten to "this article has been moved"

If you find any, either return a real 404, redirect to a relevant page, or remove the URL from the sitemap.
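One way to scan for them in bulk is a crude phrase match. It is only a heuristic, and the phrases and URL below are placeholders you should replace with the wording your own empty-state templates use (Python 3 with requests assumed):

import requests

PLACEHOLDER_PHRASES = [  # placeholders: match your own templates' wording
    "no longer available",
    "no results found",
    "out of stock",
]

sample = ["https://example.com/product/discontinued-widget"]  # placeholder

for url in sample:
    resp = requests.get(url, timeout=10)
    body = resp.text.lower()
    if resp.status_code == 200 and any(p in body for p in PLACEHOLDER_PHRASES):
        print("Possible soft 404:", url)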

6. The sitemap file itself is reachable and well-formed

Open the sitemap URL in a browser and view source. Three checks:

  • The Content-Type header is application/xml (or application/gzip for compressed sitemaps), not text/html
  • The XML namespace declaration is present (xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
  • The file parses as valid XML (paste it into an online XML validator if you are not sure)

To check the Content-Type header from the command line:

curl -I https://example.com/sitemap.xml

If the Content-Type comes back as text/html, your server is misconfigured or your routing is sending the sitemap path through your HTML template engine. Both are common with frameworks that have catch-all routing rules.
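All three checks can also be scripted in a few lines (Python 3 with requests assumed; the sitemap URL is a placeholder):

import xml.etree.ElementTree as ET
import requests

resp = requests.get("https://example.com/sitemap.xml", timeout=10)  # placeholder
print("Content-Type:", resp.headers.get("Content-Type"))

# Raises xml.etree.ElementTree.ParseError if the file is not well-formed XML.
root = ET.fromstring(resp.content)

# ElementTree folds the namespace into the tag name, e.g.
# "{http://www.sitemaps.org/schemas/sitemap/0.9}urlset".
if not root.tag.startswith("{http://www.sitemaps.org/schemas/sitemap/0.9}"):
    print("Warning: missing or unexpected sitemap namespace:", root.tag)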

7. URLs are absolute, not relative

The sitemap protocol requires absolute URLs. Each URL must include the protocol and host. Relative URLs (/about, /blog/post-1) are silently dropped by most search engines without an error in Search Console.

If you find relative URLs in your sitemap, your generator is broken. Fix it before submitting.
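A quick way to catch them, assuming a local copy of the sitemap (the filename is a placeholder):

from urllib.parse import urlparse
import xml.etree.ElementTree as ET

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.parse("sitemap.xml").getroot()  # placeholder: local copy of your sitemap

for loc in root.findall(".//sm:loc", ns):
    url = (loc.text or "").strip()
    parsed = urlparse(url)
    if not parsed.scheme or not parsed.netloc:
        print("Relative or malformed URL:", url)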

8. Lastmod values reflect actual content changes

Sample 10 URLs from your sitemap and compare each lastmod to the actual last meaningful change of the content:

  • Did the content actually change on the date shown?
  • If the date is "now" for every URL, your generator is timestamping everything with the current time and Google will start ignoring your lastmods entirely.
  • If the date is the original publication date and the content has been updated, you are leaving useful signal on the table.

The best practice: lastmod equals the timestamp of the last meaningful content change. Not the timestamp of a CSS tweak, not the current time, not a fixed value like "2020-01-01" because someone forgot to wire it up.

If you cannot reliably track meaningful changes, omit lastmod entirely. No lastmod is better than a misleading lastmod.
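A quick smell test for the "everything is stamped now" and "hard-coded date" failure modes, assuming a local copy of the sitemap (the filename is a placeholder):

from datetime import date
import xml.etree.ElementTree as ET

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.parse("sitemap.xml").getroot()  # placeholder: local copy of your sitemap

lastmods = [el.text.strip() for el in root.findall(".//sm:lastmod", ns) if el.text]
days = {lm[:10] for lm in lastmods}  # keep only the YYYY-MM-DD part

if not lastmods:
    print("No lastmod values present.")
elif len(days) == 1:
    only = days.pop()
    if only == date.today().isoformat():
        print("Every lastmod is today; the generator is probably stamping build time.")
    else:
        print(f"Every lastmod is {only}; likely hard-coded or never wired up.")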

9. URL count is reasonable for the site

Open the sitemap and count the URLs. Compare to your mental model of the site:

  • Very high count vs reality: the generator is including URL parameters, paginated archives, internal search results, or duplicate content. Filter these out.
  • Very low count vs reality: the crawler is missing pages, possibly because the site is JavaScript-rendered and the crawler does not execute JS, or because internal linking is broken.

For a content site, the number should roughly match unique articles plus a small number of static pages plus categories. A 100-post blog should have a sitemap with around 110 to 120 URLs. If it has 5,000, something is wrong.
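For a sitemap too large to count by eye, a short count over a local copy (the filename is a placeholder):

import xml.etree.ElementTree as ET

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.parse("sitemap.xml").getroot()  # placeholder: local copy of your sitemap
print(len(root.findall(".//sm:loc", ns)), "URLs")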

10. The sitemap is referenced in robots.txt

Every site should have this line in its robots.txt:

Sitemap: https://example.com/sitemap.xml

This is how non-Google search engines (Bing, Yandex, DuckDuckGo's crawler) discover your sitemap. Google does not strictly need this once you have submitted via Search Console, but it is harmless and signals consistency to all crawlers.

The Sitemap directive can appear anywhere in robots.txt and applies globally regardless of any User-agent block.

11. URLs are HTTPS (and consistent)

Pick five URLs from the sitemap. Are they all https? Mixing http and https URLs in a sitemap is a clear signal of a misconfigured site. If your site has migrated to HTTPS, the sitemap should contain only HTTPS URLs. If it serves both, pick one as canonical and only include those.

Same check for www vs non-www. Pick one canonical version and use it everywhere.
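Rather than sampling five URLs, you can tally schemes and hosts across the whole file; a single scheme and a single host is what you want to see. A sketch over a local copy (the filename is a placeholder):

from collections import Counter
from urllib.parse import urlparse
import xml.etree.ElementTree as ET

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.parse("sitemap.xml").getroot()  # placeholder: local copy of your sitemap

schemes, hosts = Counter(), Counter()
for loc in root.findall(".//sm:loc", ns):
    parsed = urlparse((loc.text or "").strip())
    schemes[parsed.scheme] += 1
    hosts[parsed.netloc] += 1

print("Schemes:", dict(schemes))  # expect exactly one key: "https"
print("Hosts:  ", dict(hosts))    # expect exactly one host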

12. The sitemap is gzipped if it is large

The 50MB uncompressed limit applies to the file as parsed, but you can ship the file gzipped to reduce bandwidth. Search engines support gzip natively for sitemaps with the .xml.gz extension or with Content-Encoding: gzip.

For sitemaps under a few megabytes, do not bother. The overhead of compression on small files is not worth the complexity. For sitemaps over 5MB, ship them gzipped. For sitemaps approaching 50MB, gzip is essentially required to keep the wire transfer reasonable.
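If you do pre-compress, Python's standard library is enough; the filenames below are placeholders. Alternatively, most servers and CDNs can compress the response on the fly with Content-Encoding: gzip, in which case a separate .xml.gz file is unnecessary.

import gzip
import shutil

# Writes sitemap.xml.gz alongside the uncompressed file.
with open("sitemap.xml", "rb") as src, gzip.open("sitemap.xml.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)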

What to do after submitting

The audit is one half of the job. The other half is what you do after submission:

  • Check Search Console daily for the first week, then weekly
  • Watch the gap between "Submitted" and "Submitted and indexed"
  • Look at the Page indexing report breakdown for any new error categories
  • If errors appear, fix them at the source, then click Validate Fix in Search Console

A clean sitemap submission with a good audit beforehand usually shows "Success" within 24 hours and starts indexing URLs within a week. If you are still seeing issues after a week, it is rarely the sitemap. It is one of the underlying issues this audit was designed to surface, and it is worth running through the diagnostic walkthrough article.

Ready to Generate Your Sitemap?

Use SiteMapr to create a complete XML sitemap for your website in seconds. It's free, fast, and requires no technical knowledge.

Generate Your Sitemap Now