TECHNICAL SEO
January 20, 2026

robots.txt and sitemaps: the two files Google reads before anything else

Get these wrong and Google might not index your site correctly — or at all

A misconfigured robots.txt is one of the most common causes of pages mysteriously disappearing from Google.

robots.txt: the crawl control file

Located at yoursite.com/robots.txt — Google fetches this first, before crawling anything.

The most dangerous mistake: accidental "Disallow: /" which blocks Google from crawling your entire site. This gets deployed more often than you'd think — usually when someone copies a development robots.txt to production.

Check yours right now: visit yoursite.com/robots.txt in your browser. If you see "Disallow: /" under "User-agent: *" — you have a critical problem.
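If you'd rather script this check, Python's built-in urllib.robotparser can evaluate robots.txt rules the same way a crawler would. A minimal sketch (the domain is a placeholder):

```python
from urllib.robotparser import RobotFileParser

def site_is_blocked(robots_txt: str) -> bool:
    """Return True if these robots.txt rules stop a generic crawler
    from fetching the homepage -- i.e. the whole site is blocked."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("*", "https://yoursite.com/")

# The dangerous development config:
print(site_is_blocked("User-agent: *\nDisallow: /"))        # True
# A config that only blocks a private section:
print(site_is_blocked("User-agent: *\nDisallow: /admin/"))  # False
```

To audit a live site, fetch yoursite.com/robots.txt with any HTTP client and pass the response body to this function.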

A safe baseline robots.txt:

User-agent: *
Disallow: /admin/
Disallow: /api/

Sitemap: https://yoursite.com/sitemap.xml

Disallow crawling of private sections (admin, API, staging) and allow everything else. Keep in mind that robots.txt is not a security control: a disallowed URL can still appear in search results if other sites link to it. Use noindex or authentication for pages that must stay out of search entirely.

XML sitemaps: your site's table of contents

Your sitemap lists every URL you want Google to index. Without one, Google discovers pages only by following links from pages it already knows about — which means any page without an internal link pointing to it may never get indexed.
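Most CMSs and frameworks generate the sitemap for you, but the structure is simple enough to build by hand. As a sketch, here's a minimal generator that takes (canonical URL, last-modified date) pairs — the URLs are placeholders:

```python
from xml.etree import ElementTree as ET

def build_sitemap(pages):
    """Build a sitemap.xml string from (canonical_url, lastmod_date) pairs."""
    urlset = ET.Element("urlset",
                        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode", xml_declaration=True)

xml = build_sitemap([
    ("https://yoursite.com/", "2026-01-15"),
    ("https://yoursite.com/blog/example-post", "2026-01-10"),
])
```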

What your sitemap should include:
- Every public page (homepage, product pages, blog posts, landing pages)
- Canonical URLs only (not paginated versions, not filter variants)
- Last modified dates (helps Google prioritize re-crawling updated content)

What to exclude:
- Redirected URLs
- Noindex pages
- Duplicate content
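Putting those rules together, a minimal sitemap with a single entry looks like this (the URL and date are placeholders):

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yoursite.com/blog/example-post</loc>
    <lastmod>2026-01-10</lastmod>
  </url>
</urlset>
```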

Submit your sitemap in Google Search Console under Sitemaps. Then check the Page indexing report (formerly "Coverage") to see which pages Google has indexed.

See this in action on your site

Free audit. No signup. 30 seconds.

Run free audit →