robots.txt & Sitemap: The Essential SEO Guide
Get these wrong and Google might not index your site correctly — or at all
“A misconfigured robots.txt is one of the most common causes of pages mysteriously disappearing from Google.”
robots.txt: the access control file
Located at yoursite.com/robots.txt — Google fetches this first, before crawling anything.
The most dangerous mistake: accidental Disallow: / which blocks Google from crawling your entire site. This gets deployed more often than you'd think — usually when someone copies a development robots.txt to production without changing it.
Check yours right now: visit yoursite.com/robots.txt in your browser. If you see Disallow: / under User-agent: * — you have a critical problem that must be fixed before anything else.
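You can also run this check programmatically. The sketch below uses Python's standard `urllib.robotparser` to test whether a given robots.txt body blocks a generic crawler from the homepage; the sample rules are illustrative, not your actual file:

```python
from urllib import robotparser

def blocks_everything(robots_txt: str) -> bool:
    """Return True if these rules deny a generic crawler (User-agent: *)
    access to the site root, i.e. the whole site is effectively blocked."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("*", "/")

print(blocks_everything("User-agent: *\nDisallow: /"))        # True: site fully blocked
print(blocks_everything("User-agent: *\nDisallow: /admin/"))  # False: only /admin/ blocked
```

To test a live site, fetch `yoursite.com/robots.txt` first and pass the response body into the same function.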
A safe baseline robots.txt:
```
User-agent: *
Disallow: /admin/
Disallow: /api/
Sitemap: https://yoursite.com/sitemap.xml
```
Block private pages (admin, API, staging). Allow everything else.
What NOT to block: A common mistake is blocking /wp-admin/ in a way that also blocks theme assets or plugin files in subdirectories. CSS and JavaScript that Google cannot fetch can break rendering and hurt rankings. Always verify the live file in Google Search Console's robots.txt report after deploying changes.
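On WordPress, the usual fix is a targeted Allow rule alongside the Disallow, so the admin area stays blocked while the AJAX endpoint that themes and plugins rely on remains crawlable (paths shown are the WordPress defaults):

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```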
Find these issues on your site right now
RankyPulse checks canonicals, redirects, meta tags, and 50+ more signals in 30 seconds.
Run your technical audit →
XML sitemaps: your site's table of contents
Your sitemap lists every URL you want Google to index. Without one, Google discovers your pages by following links — which means any page not linked from somewhere might never get indexed.
What your sitemap should include:
- Every public page (homepage, product pages, blog posts, landing pages)
- Canonical URLs only (not paginated versions, not filter variants)
- Last modified dates (helps Google prioritize re-crawling updated content)
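A minimal entry for each included page looks like this (the URL and date are placeholders):

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yoursite.com/blog/seo-guide</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
</urlset>
```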
What to exclude:
- Redirected URLs (301s should not appear in your sitemap)
- Noindex pages (if it's noindex, it shouldn't be in your sitemap; contradictory signals confuse Google)
- Duplicate content variants (/product?color=red and /product?sort=asc)
Sitemap size limits: Each sitemap file can contain a maximum of 50,000 URLs and must stay under 50 MB uncompressed. Large sites use a sitemap index file that references multiple individual sitemaps: one for blog posts, one for products, one for landing pages.
After submitting: Submit your sitemap in Google Search Console under Sitemaps. Check the Page indexing report weekly for the first month to see which pages Google has indexed and which have issues. Pages flagged "Discovered – currently not indexed" usually need more internal links pointing to them.