The single most common email I get goes something like this: "I submitted 1,000 URLs to Google. The sitemap shows Success. But when I look at the indexing report, only 200 are indexed. What is wrong?"
Sometimes the answer is "nothing, this is normal." Sometimes it is a five-minute fix. Sometimes the site has a deep content problem that no amount of sitemap tweaking will solve. The trick is diagnosing which of those three situations you are in. This article walks through the diagnostic order I run, top to bottom, and the questions I ask at each step.
Step zero: set realistic expectations
Google never indexes 100% of any site. Not yours, not Wikipedia's, not anyone's. The expected hit rate depends on site size, authority, and content quality:
- A new site with low domain authority: 30 to 60% of submitted URLs indexed within the first three months is normal.
- An established site with good content: 80 to 95% indexed is typical.
- A large e-commerce or programmatic site: 50 to 70% is common because Google deliberately deprioritizes faceted-nav variants and low-uniqueness category pages.
If your numbers are in those ranges, you do not have a problem. You have a website. The diagnostic below is for cases where the gap is dramatically worse than these baselines or where the indexed count is dropping over time.
Step one: confirm the sitemap is doing its job
Before blaming content, blame the pipe. Open Search Console, go to Sitemaps, and verify:
- The sitemap status is Success, not Has errors or Couldn't fetch.
- The Discovered URLs count is close to the actual URL count in your file. A 5% discrepancy is normal (Google drops a few URLs during validation). A 30% discrepancy means something is broken.
- The Last read date is recent (within the last week or two for active sites).
If any of those are off, fix the file before you do anything else. There is a separate article on decoding sitemap status messages if you need the full breakdown.
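If you want the actual URL count in your file for that comparison, a few lines of Python will do it. A minimal sketch, assuming a plain uncompressed urlset file; a sitemap index or a .xml.gz file would need an extra unwrapping step:

```python
# count_sitemap_urls.py -- count <loc> entries in a sitemap file.
# Assumes a plain <urlset> sitemap, not a sitemap index or gzipped file.
import sys
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def count_urls(path: str) -> int:
    root = ET.parse(path).getroot()
    return len(root.findall(f"{NS}url/{NS}loc"))

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "sitemap.xml"
    print(f"{path}: {count_urls(path)} URLs")
```

Compare the printed number against Search Console's Discovered URLs figure; a gap beyond a few percent means the file Google fetched is not the file you think it is.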
Step two: the five-minute checks
These are the cheap, high-yield checks. Run them in order and stop when one fires.
Pick five URLs from your sitemap at random and open each
For each one, check three things:
- Does it return HTTP 200? (Use curl -I or browser dev tools, not just "the page loads.")
- Does the rel canonical in the page source match the URL itself?
- Is there a meta robots noindex tag, or an equivalent X-Robots-Tag HTTP header?
I have lost count of how many indexing problems turned out to be the CMS auto-noindexing pages older than 90 days because someone's plugin had that setting on by default.
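If you would rather batch these three checks than click through pages by hand, something like the following works. A rough sketch: it needs the third-party requests library, and the regexes are deliberately naive (they will miss relative canonicals and unusual markup), so treat any flag as a prompt to open the page source, not as a verdict.

```python
# spot_check.py -- batch the three five-minute checks on sampled URLs.
import re
import requests

CANONICAL_RE = re.compile(
    r'<link[^>]+rel=["\']canonical["\'][^>]+href=["\']([^"\']+)["\']', re.I)
NOINDEX_RE = re.compile(
    r'<meta[^>]+name=["\']robots["\'][^>]+content=["\'][^"\']*noindex', re.I)

def check(url: str) -> None:
    r = requests.get(url, timeout=10)
    problems = []
    if r.url.rstrip("/") != url.rstrip("/"):
        problems.append(f"redirects to {r.url}")
    if r.status_code != 200:
        problems.append(f"status {r.status_code}")
    m = CANONICAL_RE.search(r.text)
    if m and m.group(1).rstrip("/") != url.rstrip("/"):
        problems.append(f"canonical points elsewhere: {m.group(1)}")
    if NOINDEX_RE.search(r.text) or "noindex" in r.headers.get("X-Robots-Tag", ""):
        problems.append("noindex (meta tag or X-Robots-Tag header)")
    print(f"{url}: {'; '.join(problems) or 'OK'}")

# Replace with five random URLs from your own sitemap.
for url in ["https://example.com/", "https://example.com/some-page/"]:
    check(url)
```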
Read your robots.txt
Do not skim it. Read it. Look for:
- Any Disallow rule that matches paths in your sitemap
- A Disallow: / accidentally left from a staging environment
- User-agent: Googlebot rules that disagree with User-agent: *
The most expensive bug I have seen in this category was a rule meant to block faceted nav, Disallow: /products/*?, written instead as Disallow: /products/, which blocked every product page. It took the team three weeks to notice.
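A quick way to catch this whole class of bug is to test your actual sitemap URLs against your live robots.txt rather than eyeballing the rules. A standard-library sketch (example.com is a placeholder; note that Python's robotparser does not implement the wildcard extensions Googlebot honors, so a clean result here is necessary but not sufficient):

```python
# robots_check.py -- test sitemap URLs against the live robots.txt.
# Caveat: urllib.robotparser does simple prefix matching and ignores
# the wildcard (*) and end-anchor ($) rules that Googlebot supports.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")  # your domain
rp.read()

for url in [
    "https://example.com/products/widget-1/",  # paths from your sitemap
    "https://example.com/blog/hello/",
]:
    verdict = "allowed" if rp.can_fetch("Googlebot", url) else "BLOCKED"
    print(f"{verdict}: {url}")
```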
Verify URLs are reachable from Googlebot specifically
Use the URL Inspection tool in Search Console on three or four of your sitemap URLs. Click Test live URL. Look at the rendered HTML and the page resources. If Google cannot fetch CSS or JS, the page cannot be properly evaluated. If Page availability shows anything other than Available, you have a server-level access problem (Cloudflare bot fight mode is the usual culprit; geo-restrictions are second).
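You can get a rough first read from your own machine by sending Googlebot's user-agent string, as sketched below. This only catches naive UA-based blocking: Cloudflare and similar services verify real Googlebot by IP, so they may treat your request differently, and the live test in URL Inspection remains the authoritative check.

```python
# googlebot_ua_check.py -- rough probe for server-level bot blocking.
# This sends Googlebot's UA from your own IP, so it catches UA-based
# rules only, not IP-verified ones.
import requests

GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                "+http://www.google.com/bot.html)")

for url in ["https://example.com/some-page/"]:  # a few sitemap URLs
    r = requests.get(url, headers={"User-Agent": GOOGLEBOT_UA}, timeout=10)
    print(url, r.status_code)
    if r.status_code in (403, 429, 503):
        print("  -> likely bot blocking at the server or CDN layer")
```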
Step three: read the page-indexing report
In Search Console, go to Pages. Look at the breakdown of "Why pages aren't indexed". The categories matter. They each suggest a different fix.
If the dominant category is Crawled – currently not indexed with a large absolute number, the problem is content quality. Google fetched the pages, looked at them, and chose not to include them. No sitemap change fixes this. You need to either improve those pages or remove them from your sitemap so they stop dragging down the property's average.
If it is Discovered – currently not indexed, the problem is crawl budget. Google knows about the pages but has not gotten around to them. For new and low-authority sites this is the default state, and the fix is patience combined with internal linking and external backlinks. Submitting the sitemap repeatedly does nothing.
If it is Submitted URL not selected as canonical or Duplicate, Google chose different canonical than user, the sitemap and your canonical tags are misaligned. This is almost always a sitemap-generation bug: the generator is emitting URLs that do not match what is declared as canonical on the page. Check protocol (HTTP vs HTTPS), trailing slashes, www vs non-www, and tracking parameters.
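A small script can classify the mismatch for you. A sketch, using the third-party requests library, a naive canonical regex, and example.com as a placeholder:

```python
# canonical_alignment.py -- compare a sitemap URL to the page's declared
# canonical and name the kind of mismatch. Relative canonical hrefs and
# unusual markup will confuse the naive regex.
import re
from urllib.parse import urlparse
import requests

CANONICAL_RE = re.compile(
    r'<link[^>]+rel=["\']canonical["\'][^>]+href=["\']([^"\']+)["\']', re.I)

def diagnose(sitemap_url: str) -> str:
    html = requests.get(sitemap_url, timeout=10).text
    m = CANONICAL_RE.search(html)
    if not m:
        return "no canonical tag found"
    canonical = m.group(1)
    if canonical == sitemap_url:
        return "aligned"
    a, b = urlparse(sitemap_url), urlparse(canonical)
    if a.scheme != b.scheme:
        return f"protocol mismatch: {a.scheme} vs {b.scheme}"
    if a.netloc != b.netloc:
        return f"host mismatch (www vs non-www?): {a.netloc} vs {b.netloc}"
    if a.path.rstrip("/") == b.path.rstrip("/"):
        if a.query != b.query:
            return "tracking/query-parameter mismatch"
        return "trailing-slash mismatch"
    return f"entirely different canonical: {canonical}"

print(diagnose("https://example.com/products/widget-1/"))
```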
If it is Excluded by 'noindex' tag or Blocked by robots.txt, take those URLs out of the sitemap. You are sending mixed signals.
Step four: the content quality test
This is the hard one because nobody likes hearing "your content is not good enough." But the test is mechanical:
- Take ten URLs that are stuck in Crawled – currently not indexed.
- For each one, open the page and ask honestly: if I landed here from a Google search, would I be glad I clicked?
- Count how many you would answer yes to.
If it is three or fewer, you have a content problem. The pages are probably thin, templated, near-duplicates of each other, or AI-generated boilerplate. The fix is content work: rewrite, consolidate, or remove.
The harder version of this test: pick the URL that ranks best on Google for the topic your page is about. Compare yours to it. If yours adds nothing the top result does not already cover, Google is making a reasonable call by leaving you out of the index.
Step five: structural checks for large sites
If you are running thousands of URLs and hitting an indexing wall, the structural issues matter more.
Internal linking depth
Pages more than four clicks from your homepage tend to get crawled and indexed less reliably. Run a crawl of your own site (a tool like Screaming Frog, or SiteMapr's depth report) and look at the click-depth distribution. If half your URLs are six or more clicks deep, the navigation is the problem.
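For a quick read without a full crawler, a breadth-first crawl from the homepage approximates click depth. A minimal sketch with naive link extraction, same-host pages only, and a hard page cap; a real crawler handles JS rendering, nofollow, and canonicals, which this does not:

```python
# click_depth.py -- breadth-first crawl from the homepage to measure
# click depth. Use the output for the shape of the distribution,
# not as an exact census.
import re
from collections import Counter, deque
from urllib.parse import urljoin, urldefrag, urlparse
import requests

START = "https://example.com/"  # your homepage
MAX_PAGES = 2000
HREF_RE = re.compile(r'href=["\']([^"\']+)["\']', re.I)

host = urlparse(START).netloc
depth = {START: 0}
queue = deque([START])

while queue and len(depth) < MAX_PAGES:
    url = queue.popleft()
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue
    for href in HREF_RE.findall(html):
        link = urldefrag(urljoin(url, href)).url  # resolve, drop #fragment
        if urlparse(link).netloc == host and link not in depth:
            depth[link] = depth[url] + 1
            queue.append(link)

for d, count in sorted(Counter(depth.values()).items()):
    print(f"depth {d}: {count} pages")
```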
Pagination handling
Paginated archives (/blog/page/2/, /category/foo/page/3/) are treated as duplicates by Google more often than not. They should not be in your sitemap. If they are, remove them.
Faceted navigation
Filtered URLs like /shoes?color=blue&size=10 are crawl-budget poison. They should not be in your sitemap, ever. If they are, your sitemap generator is misconfigured. The URLs to include are the un-filtered category pages.
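Both of the last two rules reduce to a filter in front of your sitemap generator. A sketch; the pagination pattern assumes WordPress-style /page/N/ paths, so adjust it to your own URL scheme:

```python
# sitemap_filter.py -- drop paginated and faceted URLs before they
# reach the sitemap. Assumes /page/N/ pagination paths.
import re
from urllib.parse import urlparse

PAGINATION_RE = re.compile(r"/page/\d+/?$")

def keep(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.query:                       # faceted/filtered variants
        return False
    if PAGINATION_RE.search(parsed.path):  # paginated archives
        return False
    return True

urls = [
    "https://example.com/shoes/",
    "https://example.com/shoes?color=blue&size=10",
    "https://example.com/blog/page/2/",
]
print([u for u in urls if keep(u)])
# -> ['https://example.com/shoes/']
```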
Sitemap age and freshness
If your lastmod dates have not changed in months but your content has, Google starts ignoring the dates. Either keep lastmod accurate or omit it entirely.
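The safe pattern is to derive lastmod from wherever your CMS stores the true last-edit time, never from the sitemap's generation time. A sketch, with updated_at as a hypothetical stand-in for that field:

```python
# lastmod_entry.py -- emit a <url> entry whose <lastmod> comes from a
# real modification timestamp, not from the time of generation.
from datetime import datetime, timezone

def url_entry(loc: str, updated_at: datetime) -> str:
    lastmod = updated_at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return (f"  <url>\n"
            f"    <loc>{loc}</loc>\n"
            f"    <lastmod>{lastmod}</lastmod>\n"
            f"  </url>")

print(url_entry("https://example.com/blog/hello/",
                datetime(2024, 3, 2, 9, 30, tzinfo=timezone.utc)))
```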
Step six: when to walk away
Sometimes the answer is that Google has decided your site does not deserve much indexing right now, and there is no immediate technical fix. Common cases:
- The site is brand new and has zero external backlinks.
- The site has a manual action (check the Manual Actions report; most sites do not have one, but check anyway).
- The domain has a poor history (recently expired and rebuilt, previous owner spammed it, etc.).
- The content is competing in a topic where Google has many existing trusted sources.
For the first case: time and links. For the second: address the manual action. For the third: a change of address request will not fix reputation issues, but a long timeline and consistent quality will. For the fourth: you need to be substantially differentiated, not just present.
Quick reference: the diagnostic in one screen
- Sitemap status is Success and discovered count matches.
- Five random URLs return 200, have correct canonical, no noindex.
- robots.txt is not blocking anything in the sitemap.
- URL Inspection shows pages as Available to Google.
- Page indexing report's largest "not indexed" category points to the right kind of fix.
- Content quality test: at least seven of the ten sampled stuck pages would be useful to a searcher.
- Internal linking puts no critical page more than three clicks from the homepage.
If all seven check out and you are still stuck, you are either in the patience bucket or the "this site has a fundamental authority problem" bucket. Neither is solved by another sitemap submission.