robots.txt and sitemaps: the two files Google reads before anything else
Get these wrong and Google might not index your site correctly — or at all
“A misconfigured robots.txt is one of the most common causes of pages mysteriously disappearing from Google.”
robots.txt: the access control file
Located at yoursite.com/robots.txt — Google fetches this first, before crawling anything.
The most dangerous mistake: accidental "Disallow: /" which blocks Google from crawling your entire site. This gets deployed more often than you'd think — usually when someone copies a development robots.txt to production.
Check yours right now: visit yoursite.com/robots.txt in your browser. If you see "Disallow: /" under "User-agent: *" — you have a critical problem.
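If you'd rather check programmatically, Python's standard library can parse robots.txt rules and answer access questions. A minimal sketch — the rules and the `yoursite.com` URLs below are illustrative placeholders, not your real file:

```python
# Check whether a set of robots.txt rules blocks Googlebot from specific URLs.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse rules inline; rp.set_url("https://yoursite.com/robots.txt") + rp.read()
# would fetch the live file instead.
rp.parse("""
User-agent: *
Disallow: /admin/
""".splitlines())

print(rp.can_fetch("Googlebot", "https://yoursite.com/admin/login"))  # False: blocked
print(rp.can_fetch("Googlebot", "https://yoursite.com/pricing"))      # True: crawlable
```

If you had pasted a file containing `Disallow: /`, every `can_fetch` call for `User-agent: *` would return `False` — the critical problem described above.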
A safe baseline robots.txt:

```
User-agent: *
Disallow: /admin/
Disallow: /api/

Sitemap: https://yoursite.com/sitemap.xml
```
Block private pages (admin, API, staging). Allow everything else.
XML sitemaps: your site's table of contents
Your sitemap lists every URL you want Google to index. Without one, Google discovers pages only by following links — so any page not reachable through internal links may never be crawled or indexed.
What your sitemap should include:
- Every public page (homepage, product pages, blog posts, landing pages)
- Canonical URLs only (not paginated versions, not filter variants)
- Last modified dates (helps Google prioritize re-crawling updated content)

What to exclude:
- Redirected URLs
- Noindex pages
- Duplicate content
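Putting those rules together, a minimal sitemap looks like this — the URLs and dates are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yoursite.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://yoursite.com/blog/launch-post</loc>
    <lastmod>2024-01-10</lastmod>
  </url>
</urlset>
```

Each `<url>` entry needs only a `<loc>`; `<lastmod>` is optional but worth including when you can keep it accurate.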
Submit your sitemap in Google Search Console under Sitemaps. Then check the "Pages" report (formerly called "Coverage") to see which pages Google has indexed.
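If your platform doesn't generate a sitemap automatically, one can be built with a few lines of Python's standard library. This is a sketch assuming you maintain a list of (URL, lastmod) pairs — the pages shown are hypothetical:

```python
# Generate a minimal sitemap.xml from a list of (url, lastmod) pairs.
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """pages: iterable of (url, lastmod) tuples -> sitemap XML string."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode", xml_declaration=True)

# Placeholder pages for illustration.
xml = build_sitemap([
    ("https://yoursite.com/", "2024-01-15"),
    ("https://yoursite.com/pricing", "2024-01-10"),
])
print(xml)
```

Write the output to `sitemap.xml` at your site root, then submit that URL in Search Console. Remember to apply the exclusion rules above when building the page list: no redirected, noindexed, or duplicate URLs.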