When you have a large e-commerce store with thousands of products (or any large site with thousands of pages), things are never easy when it comes to SEO.
This includes some of the basics, such as XML sitemaps.
Usually, most e-commerce and CMS platforms allow automatic sitemap generation which you can plug straight into Google Search Console. That’s smashing, until you realise that standard sitemaps allow no more than 50,000 URLs per sitemap.
The solution to this is to use a sitemap index. This is a type of sitemap which simply lists more sitemaps, and it’s these sitemaps which contain your normal page URLs.
The basic step-by-step is as follows:
1). Create XML sitemaps with a maximum of 50,000 URLs each.
2). Create a sitemap index pointing to your other sitemaps.
3). Add this sitemap index to Google Search Console.
4). Google will read this sitemap index, find the other sitemaps, and then crawl the page URLs.
Let’s go through each step.
1). Create XML sitemaps with a maximum of 50,000 URLs each
There are two ways you can do this. If your CMS or e-commerce platform has a system for creating sitemaps (most do), then you can take one large sitemap and split it in half.
Without that, you can take a list of your URLs (no more than 50,000 at a time), and then use any online sitemap generator to create two or more separate sitemaps below the 50,000 URL limit.
If you’re splitting one long sitemap in half, it can be as easy as opening it in something like Notepad++, and then using the line numbering down the left-hand side to select and copy the first set of sub-50k URLs.
If you do open your sitemap to edit it in this way, you’ll probably see a short header followed by a long run of <url> entries, each containing a <loc> tag with one page URL.
Once you’ve done this, you can paste the first set into one text file, and the other set into another text file, and then save them as XML sitemaps 1 and 2. Notepad++ also has a handy way of clicking “Save as” and then selecting “eXtensible Markup Language File” from the “Save as type” drop-down.
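However, you must make sure that the header content at the top of your original sitemap file is also at the top of each new sitemap you create. Yours may be different, but it’s essentially everything above the first <url><loc> line, which is typically the XML declaration and the opening <urlset> tag. Each new file must also end with the closing </urlset> tag, so add it to the bottom of any file that lost it in the split.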
Once you have two XML sitemaps, both with no more than 50,000 URLs each, simply upload them to the root directory of your webspace hosting (where the index file of your website is).
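If you’d rather not split the file by hand, the same step can be scripted. The following is a minimal Python sketch, not a definitive tool: it assumes you already have your page URLs as a plain list, and the function name `split_into_sitemaps` and the example.com URLs are made up for illustration. It writes the standard sitemaps.org header and closing tag around each chunk of up to 50,000 URLs.

```python
from xml.sax.saxutils import escape

MAX_URLS = 50_000  # per-sitemap URL limit discussed above

# Header and footer every sitemap file needs (standard sitemaps.org namespace).
HEADER = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
)
FOOTER = "</urlset>\n"


def split_into_sitemaps(urls, max_urls=MAX_URLS):
    """Return a list of sitemap XML strings, each holding at most max_urls URLs."""
    sitemaps = []
    for start in range(0, len(urls), max_urls):
        chunk = urls[start:start + max_urls]
        # escape() turns characters such as & into &amp; so the XML stays valid.
        body = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in chunk)
        sitemaps.append(HEADER + body + FOOTER)
    return sitemaps
```

Each string in the returned list can then be saved as sitemap1.xml, sitemap2.xml and so on, ready to upload.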
2). Create a sitemap index pointing to your other sitemaps
Time to create that sitemap index. To make it easier for you, here’s one you can copy and modify to suit your needs:
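<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap2.xml</loc>
  </sitemap>
</sitemapindex>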
All you need to do is replace the URLs in between the <loc> tags with the URLs pointing to the sitemaps you uploaded to your hosting earlier. Again, Notepad++, or any other basic text editor, is suitable for this. Simply copy and paste it in, replace the URLs, and then save it as a .xml file like you did with the two sitemaps in step one.
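If you have a lot of sitemaps, you could also generate the index rather than edit it by hand. This is a rough Python sketch under a few assumptions: the function name `build_sitemap_index` and the example.com URLs are placeholders of mine, and the output follows the standard sitemaps.org sitemap index layout of one <sitemap><loc> entry per file.

```python
from xml.sax.saxutils import escape


def build_sitemap_index(sitemap_urls):
    """Return sitemap index XML with one <sitemap><loc> entry per sitemap URL."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for url in sitemap_urls:
        lines.append("  <sitemap>")
        lines.append(f"    <loc>{escape(url)}</loc>")
        lines.append("  </sitemap>")
    lines.append("</sitemapindex>")
    return "\n".join(lines) + "\n"


# Placeholder URLs: swap in the locations of the sitemaps you uploaded.
print(build_sitemap_index([
    "https://www.example.com/sitemap1.xml",
    "https://www.example.com/sitemap2.xml",
]))
```

Save the printed output as your sitemap index file (for example sitemapindex.xml) and upload it alongside the sitemaps it lists.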
If you have more than 100,000 products and therefore require three or more sitemaps (big site you’ve got there), just add another <sitemap><loc>…</loc></sitemap> entry before the final </sitemapindex> tag, with the URL of the extra sitemap in between the <loc> tags.
Now you have all your sitemaps online and ready to go, it’s time to point Google in their direction.
3). Add this sitemap index to Google Search Console
Head into Google Search Console (if you don’t have that yet, sign up for it), and head over to Crawl > Sitemaps. You should see an “add/test sitemap” button in the top right.
Click this, and enter the file name of your sitemap index only. The URL will already be pre-filled with your root domain, so it’s just a case of entering the sitemap index file name. If this is “sitemapindex.xml”, for example, pop that in and then click “Test”.
Quite quickly, you should receive test results to view:
Click the button, and Google should list the number of children it has found. If the number of children matches the number of sitemaps in your sitemap index, and there are no errors, you’re good to go.
But wait: it’s also important that you go back to the “add/test sitemap” button, enter the URLs of all your individual sitemaps, and then click to test those as well. Do not submit them. Just test them all to ensure that the number of listed URLs is correct and that there are no errors.
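You can also sanity-check each sitemap on your own machine before or alongside the Search Console test. The sketch below is only a local approximation of that test, not what Google runs: it assumes the standard sitemaps.org namespace, and the function name `check_sitemap` is my own. It counts the <loc> entries and fails loudly if the XML is malformed.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace, mapped to a short prefix for the search path below.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
MAX_URLS = 50_000


def check_sitemap(xml_text):
    """Parse a sitemap and return (url_count, within_limit).

    Raises xml.etree.ElementTree.ParseError if the XML is malformed,
    for example when the closing </urlset> tag went missing in a manual split.
    """
    root = ET.fromstring(xml_text.encode("utf-8"))
    count = len(root.findall("sm:url/sm:loc", NS))
    return count, count <= MAX_URLS
```

Read each sitemap file into a string and pass it in. If the count matches what you expect and no error is raised, the file is at least well formed and under the limit.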
If everything checks out, you can then click the “add/test sitemap” button again, enter in the file name of your sitemap index file (like the first test you did), but this time, click submit.
4). Google will read this sitemap index, find the other sitemaps, and then crawl the page URLs
Job done! Your sitemap index has been submitted to Google, and it will start processing immediately. This may take a few days for very large websites, so you’ll have to be a little patient.
Eventually, you should see the number of pages submitted and the number indexed listed against your sitemaps in Search Console.
How do I update these sitemaps with new pages?
This is where you may run into some issues, as this manual method is better suited to static sites. If there is only the occasional changed page, or the occasional new page, then it’s not too difficult. All you have to do is open up the sitemap in an editor, add the new URL, re-upload the sitemap and then submit it to Google again.
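For the occasional new page, the “add the new URL” step can be scripted too. Here is a hedged Python sketch: the function name `add_url` is hypothetical, it assumes a plain sitemap using only the standard sitemaps.org namespace, and it does not check the 50,000 URL limit for you.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Register the sitemap namespace as the default so output tags stay unprefixed.
ET.register_namespace("", SITEMAP_NS)


def add_url(xml_text, new_url):
    """Return sitemap XML with new_url appended, unless it is already listed."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    existing = {loc.text for loc in root.iter(f"{{{SITEMAP_NS}}}loc")}
    if new_url not in existing:
        url_el = ET.SubElement(root, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url_el, f"{{{SITEMAP_NS}}}loc").text = new_url
    # tostring() omits the XML declaration here, so add it back by hand.
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

You would still need to re-upload the updated file yourself, so this only saves the editing step.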
However, if you have a site with lots of pages regularly being deleted or changed, any manual method is likely to be too cumbersome. In this case, you would be better off using a plugin, module or custom modification on your website/platform to automatically update sitemaps depending on website content changes.
But don’t worry too much. Sitemaps help Google find and crawl pages, but they are not essential to this process providing that there are internal links on your website which allow the search engine spiders to find these pages naturally (i.e. you have publicly visible pages on your website which link to these new pages). Google will find them eventually on its own. The only situation where sitemaps are essential is when there isn’t this network of internal or external website links (for example, if some pages are behind a search wall).
To find out if you’re running into crawl issues which would be fixed by a sitemap, just search on Google for “site:www.yoursite.com”.
This will display every page Google has indexed from your URL. If the number of results is close to the number of pages you have on your site, then Google has managed to crawl everything on its own by following links. However, if this figure is far lower, then search engine bots may be struggling to find many of your pages, in which case a sitemap could help you (either that, or they have crawled all your pages and just don’t consider some of them worthy of indexing).
“But why isn’t Google indexing all of my submitted webpages?”
Eventually, you may notice that the number of pages submitted, and the number of pages listed as indexed, are very different. It could even be the case that only a minority of your submitted pages are indexed in Google.
Unfortunately, it’s entirely up to Google whether they decide to index certain pages, but there are techniques you can use to help increase this number, AND increase the rankings for those pages too. Just head over to my top tips on e-commerce SEO for thousands of products.
If you don’t have an e-commerce site (i.e. you have an information-based site), and you have tens of thousands of pages with only a small number being indexed despite using a sitemap, then I have the answer for you right here: these pages do not have content which is long, unique or valuable enough.
If certain pages are lacking in content, or have too much duplication of other pages, then Google will avoid indexing them no matter how much effort you put into splitting sitemaps and making these pages easier to crawl.