By Bilal · Updated · 7 min read

Splitting a Large Sitemap: When and How to Use a Sitemap Index File

There is exactly one rule for when you need a sitemap index file: you have crossed 50,000 URLs in a single sitemap, or you are about to cross 50MB uncompressed. Past either limit, the sitemap protocol requires you to split. This article covers how to split well: the strategies that lead to cleaner Search Console reporting and easier debugging, versus the lazy splits that work but hide problems.

The numbers

The sitemap.org spec sets two hard limits per individual sitemap file:

  • 50,000 URLs maximum
  • 50 megabytes uncompressed (the limit applies to the uncompressed content, not the size on disk if you ship it gzipped)

Hit either limit and the file is considered invalid. Most search engines reject it outright. Some will silently truncate it to the first 50,000 URLs, which is worse than rejection because you do not know it is happening.

The 50MB ceiling is rarely hit before the URL count ceiling. Most URLs encode to under 200 bytes including their lastmod and other metadata. 50,000 URLs at 200 bytes each is about 10MB. To hit 50MB uncompressed you would need either very long URLs (image sitemaps with full image metadata can do this) or a lot of additional xhtml:link annotations for hreflang.

The sitemap index file

When you split, you do not just publish multiple sitemap URLs and hope search engines find them. You create a sitemap index file that points to all the individual sitemap files. The index is what you submit to Google Search Console; the engines walk the index and fetch each child sitemap.

Format of an index file:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2026-04-08T10:30:00+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
    <lastmod>2026-04-08T10:30:00+00:00</lastmod>
  </sitemap>
</sitemapindex>

The structure is almost identical to a regular sitemap, but uses sitemapindex and sitemap elements instead of urlset and url. You can include lastmod for each child sitemap; this tells search engines whether each child has changed since the last fetch and whether they need to re-process it.

A sitemap index file itself is also capped at 50,000 entries and 50MB. In practice this means you can address up to 2.5 billion URLs through a single index, far more than any real site needs.

How to split: the three strategies

There are three reasonable ways to split, and they have meaningfully different debuggability characteristics.

Strategy 1: split by content type (recommended)

Group URLs by what they are. For an e-commerce site:

  • /sitemap-products.xml
  • /sitemap-categories.xml
  • /sitemap-pages.xml (static pages: about, contact, etc.)
  • /sitemap-blog.xml

For a publication:

  • /sitemap-articles.xml
  • /sitemap-authors.xml
  • /sitemap-tags.xml
  • /sitemap-pages.xml

Why this is best: when you open Search Console and look at the Sitemaps report, you see indexing performance per content type. If your products are at 90% indexed and your categories are at 30%, you immediately know where the problem is. With other splits, this signal is lost.

It also makes operational sense. When a blog post about a new product spikes, you can regenerate just /sitemap-blog.xml without regenerating every other child sitemap.
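If you generate your sitemaps yourself, the grouping step is just a classifier from URL path to content type. Here is a minimal TypeScript sketch; the path prefixes and group names are assumptions to adapt to your own URL structure:

// Bucket URLs by content type before writing child sitemaps.
// Path prefixes and group names here are illustrative assumptions.
type SitemapGroup = "products" | "categories" | "blog" | "pages";

function groupForPath(path: string): SitemapGroup {
  if (path.startsWith("/products/")) return "products";
  if (path.startsWith("/categories/")) return "categories";
  if (path.startsWith("/blog/")) return "blog";
  return "pages"; // static pages: about, contact, etc.
}

function bucketUrls(paths: string[]): Map<SitemapGroup, string[]> {
  const buckets = new Map<SitemapGroup, string[]>();
  for (const path of paths) {
    const group = groupForPath(path);
    const list = buckets.get(group) ?? [];
    list.push(path);
    buckets.set(group, list);
  }
  return buckets;
}

// Each bucket becomes one child sitemap: sitemap-products.xml, sitemap-blog.xml, and so on.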

Strategy 2: split by index/chunk

The default if you let a tool split for you: chunks of 5,000 or 10,000 URLs each, named /sitemap-1.xml, /sitemap-2.xml, and so on. Mechanically valid, technically fine, but loses all semantic information. If you see "60% indexed" across the property, you cannot tell whether one section is dragging the rest down.

I only recommend this approach when content is genuinely uniform, for example a single-purpose directory site where every URL is the same type of page. In that case, semantic grouping does not exist, and chunking is fine.
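The chunking itself is trivial. A minimal sketch, assuming a flat list of URLs and a chunk size of 10,000 (both assumptions, not requirements):

// Naive fixed-size chunking of a flat URL list.
const CHUNK_SIZE = 10_000;

function chunkUrls(urls: string[]): string[][] {
  const chunks: string[][] = [];
  for (let i = 0; i < urls.length; i += CHUNK_SIZE) {
    chunks.push(urls.slice(i, i + CHUNK_SIZE));
  }
  return chunks;
}

// chunkUrls(allUrls)[0] becomes /sitemap-1.xml, [1] becomes /sitemap-2.xml, etc.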

Strategy 3: split by date

For news and publication sites: /sitemap-2024.xml, /sitemap-2025.xml, /sitemap-2026.xml. Or by month for very high-volume publishers.

The argument for: search engines can re-fetch only the recently-changed sitemap (current month or year) instead of re-processing your entire archive. This was more important when search engines were less efficient about lastmod handling; it matters less now but still helps for very large news archives.

The argument against: indexing performance is mixed across time, so you can mask a quality problem in older content with the boost from fresh content. Date-based splits work, but I would still combine them with content-type splits for any site with multiple content types.

Implementing the index

How you generate the index depends on your stack.

Static sites: If your sitemap is generated at build time, write the index as a plain file alongside the others. Most static-site generators have plugins or built-in support; if not, a 20-line build script that reads your output directory and writes the index works fine.
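Here is roughly what that build script can look like in Node/TypeScript. The output directory, base URL, and file-naming pattern are assumptions to adjust to your own build:

// Build-step sketch: scan the output directory for child sitemaps and
// write sitemap.xml as the index. Paths and naming are illustrative.
import { readdirSync, statSync, writeFileSync } from "node:fs";
import { join } from "node:path";

const outDir = "dist";
const baseUrl = "https://example.com";

const children = readdirSync(outDir).filter(
  (name) => name.startsWith("sitemap-") && name.endsWith(".xml")
);

const entries = children
  .map((name) => {
    // Use the file's modification time as the child's lastmod.
    const lastmod = statSync(join(outDir, name)).mtime.toISOString();
    return `  <sitemap>\n    <loc>${baseUrl}/${name}</loc>\n    <lastmod>${lastmod}</lastmod>\n  </sitemap>`;
  })
  .join("\n");

const index =
  `<?xml version="1.0" encoding="UTF-8"?>\n` +
  `<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n` +
  `${entries}\n</sitemapindex>\n`;

writeFileSync(join(outDir, "sitemap.xml"), index);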

Next.js, Nuxt, similar frameworks: The framework's metadata route system can produce both the index and the children. In Next.js App Router, the generateSitemaps function in app/sitemap.ts returns multiple sitemap IDs and produces an index automatically.
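A minimal sketch of the App Router approach, with placeholder URLs standing in for a real data source (check the Next.js documentation for your version's exact file location and output):

// app/sitemap.ts — sketch only; real sites would query a database or CMS here.
import type { MetadataRoute } from "next";

export async function generateSitemaps() {
  // One entry per child sitemap; Next.js calls the default export once per id.
  return [{ id: 0 }, { id: 1 }];
}

export default async function sitemap({
  id,
}: {
  id: number;
}): Promise<MetadataRoute.Sitemap> {
  // Placeholder data; in practice, load the slice of URLs belonging to sitemap `id`.
  const urlsById: Record<number, string[]> = {
    0: ["https://example.com/about", "https://example.com/contact"],
    1: ["https://example.com/blog/launch-post"],
  };
  return (urlsById[id] ?? []).map((url) => ({
    url,
    lastModified: new Date(),
  }));
}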

Manual or CMS-driven: If you maintain the sitemap by hand or through a server-rendered endpoint, you write the index file directly. The structure shown above is all you need; there is no clever logic.

Submitting to Search Console

You submit only the index file URL. Google will fetch the index, then fetch each child sitemap, and report on each one separately. You will see something like:

  • /sitemap.xml — Success — 205,050 URLs discovered
  • /sitemap-products.xml — Success — 200,000 URLs
  • /sitemap-blog.xml — Success — 5,000 URLs
  • /sitemap-pages.xml — Success — 50 URLs

If a child sitemap has errors, the index status stays "Success" but the child shows the error. Watch for this. It is easy to miss when you are looking at the top-level status.

A note on robots.txt

Reference the index file in your robots.txt, not the children:

Sitemap: https://example.com/sitemap.xml

Crawlers that read robots.txt will discover the children automatically by following the index. Listing every child sitemap in robots.txt is harmless but redundant.

Common mistakes

A few patterns I see go wrong on real sites.

Nesting sitemaps: children referencing each other or the index

Some implementations have child sitemaps cross-referencing each other or referencing the index. The protocol does not support this; sitemaps cannot nest. Each URL appears in exactly one sitemap, and the index references each child exactly once.

Including retired children

If you decide to retire /sitemap-old-products.xml, remove it from the index. Leaving it referenced means search engines keep trying to fetch it and report 404 errors against your property.

Mismatched lastmod between index and children

The lastmod in the index entry should reflect when the child sitemap last meaningfully changed. If you regenerate the index every hour but rebuild the actual child sitemap daily, the index lastmod should be the daily timestamp, not the hourly. Misaligned lastmods make Google ignore the values entirely.

Going over 50MB uncompressed while staying under it on disk

Gzip compression hides the real size. The 50MB limit is on the uncompressed content. If you are right at the edge, decompress and weigh it.
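A quick way to check from a build step, sketched with Node's zlib (the file path here is an assumption):

// Measure the uncompressed size of a gzipped sitemap.
import { readFileSync } from "node:fs";
import { gunzipSync } from "node:zlib";

const compressed = readFileSync("dist/sitemap-products.xml.gz");
const uncompressed = gunzipSync(compressed);

console.log(`on disk: ${(compressed.length / 1_000_000).toFixed(1)} MB`);
console.log(`uncompressed: ${(uncompressed.length / 1_000_000).toFixed(1)} MB`); // the 50MB limit applies to this number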

Ready to Generate Your Sitemap?

Use SiteMapr to create a complete XML sitemap for your website in seconds. It's free, fast, and requires no technical knowledge.

Generate Your Sitemap Now