Hi everyone,
I came across this robots.txt file and was wondering whether the Allow is necessary and whether it affects the disallowed pages.
User-agent: *
Sitemap: https://www.this-is-an-example.co.uk/sitemap.xml
Allow: /
User-agent: *
Disallow: /*sortBy*
Disallow: /*v_attributes_*
Disallow: /checkout*
Disallow: /search$
Disallow: /search?*
Disallow: /my-pages*
Disallow: /epiui*
Disallow: /punchout-order*
Disallow: /resolvedynamicdata*
Thoughts?
Hey Lorenzo,
Google's official documentation says:
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt
Actually, Allow is not required, as everything is allowed by default unless it is blocked by a Disallow.
Hey Lorenzo,
For Google, no. Both groups for the same user agent will be combined, and the Disallow rules will override the Allow: / because the longest matching rule wins, and all of your Disallow rules are longer than /.
Historically, before November 2021 there was a genuine bug in Google's parser that affected how groups were closed, so the two groups wouldn't have been combined, but that's long past.
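If it helps to see the longest-match resolution concretely, here's a rough Python sketch of the idea (not the official parser; pattern handling and tie-breaking are simplified, and rule "length" is approximated by the pattern length):

import re

def pattern_to_regex(pattern):
    # Translate a robots.txt path pattern ('*' wildcard, '$' end anchor)
    # into an anchored regular expression.
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.compile("^" + body + ("$" if anchored else ""))

def is_allowed(rules, path):
    # rules: (directive, pattern) pairs from a single combined group.
    # The longest matching pattern wins, ties go to Allow, and the
    # default when nothing matches at all is allowed.
    best_len, allowed = -1, True
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            length = len(pattern)
            if length > best_len or (length == best_len and directive == "allow"):
                best_len, allowed = length, directive == "allow"
    return allowed

rules = [("allow", "/"), ("disallow", "/*sortBy*"), ("disallow", "/search$")]
print(is_allowed(rules, "/search"))                 # False: /search$ beats /
print(is_allowed(rules, "/products?sortBy=price"))  # False: /*sortBy* beats /
print(is_allowed(rules, "/about-us"))               # True: only Allow: / matches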
All that being said, how other services might interpret this isn't as clear-cut, and some may trip up, so personally I'd remove:
User-agent: *
Sitemap: https://www.this-is-an-example.co.uk/sitemap.xml
Allow: /
From the top; the Allow does nothing in effect anyway.
Add the Sitemap entry at the end of the file if you want to keep it, or at the top I suppose. Keep in mind that you can't scope it to a user agent, so you can't say "hey Google, here's your sitemap; everyone else, this one is yours", which is why it doesn't need to sit inside a user-agent group.
I'd argue that a Sitemap line in robots.txt probably isn't something you need anyway; submitting it through the search engines' consoles is enough.
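Putting it together, the cleaned-up file would look something like this (same rules as yours, with the redundant first group dropped and the Sitemap line kept at the end if you decide to keep it at all):

User-agent: *
Disallow: /*sortBy*
Disallow: /*v_attributes_*
Disallow: /checkout*
Disallow: /search$
Disallow: /search?*
Disallow: /my-pages*
Disallow: /epiui*
Disallow: /punchout-order*
Disallow: /resolvedynamicdata*

Sitemap: https://www.this-is-an-example.co.uk/sitemap.xml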
For testing, have a search for online checkers that use the official parser, or even cook up your own; Google open-sourced it here: https://github.com/google/robotstxt
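If you do cook up your own, the part worth getting right is combining the groups for a user agent before applying the longest-match logic. Here's a rough Python sketch of just that step (again simplified, not the official parser):

def rules_for_agent(robots_txt, agent="*"):
    # Collect Allow/Disallow rules from every group whose user-agent
    # lines include the given token, merging repeated groups into one list.
    rules, current_agents, in_group = [], [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_group:  # a new run of user-agent lines starts a new group
                current_agents, in_group = [], False
            current_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_group = True
            if agent.lower() in current_agents:
                rules.append((field, value))
        # Sitemap and unknown fields are ignored here.
    return rules

example = """\
User-agent: *
Sitemap: https://www.this-is-an-example.co.uk/sitemap.xml
Allow: /

User-agent: *
Disallow: /*sortBy*
Disallow: /search$
"""
print(rules_for_agent(example))
# [('allow', '/'), ('disallow', '/*sortBy*'), ('disallow', '/search$')]

The returned pairs can then go through longest-match logic like the sketch above; for anything serious, compare the results against the official C++ library.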