robots.txt将多个User-agent写到一起

发布于 2017-04-27 作者 [重庆SEO]

维基百科是这样解释robots.txt的

robots.txt(统一小写)是一种存放于网站根目录下的ASCII编码的文本文件,它通常告诉网络搜索引擎的漫游器(又称网络蜘蛛),此网站中的哪些内容是不应被搜索引擎的漫游器获取的,哪些是可以被漫游器获取的。因为一些系统中的URL是大小写敏感的,所以robots.txt的文件名应统一为小写。robots.txt应放置于网站的根目录下。如果想单独定义搜索引擎的漫游器访问子目录时的行为,那么可以将自定的设置合并到根目录下的robots.txt,或者使用robots元数据(Metadata,又称元数据)。

robots.txt协议并不是一个规范,而只是约定俗成的,有些搜索引擎会遵守这一规范,而其他则不然。所以并不能保证网站的隐私。

根据大量的实践经验 robots.txt对绝大多数的蜘蛛有效,比如google,bing,yahoo,百度等搜索引擎,如果本身就带有采集等流氓气质的蜘蛛来说,这玩意不管用。

常规的robots.txt的写法都是1个User-agent对应1个或多个规则。

将多个User-agent写到一起的写法不常见,这种不常见的写法目前为止我不光没在国内的那几个大网站看到实际应用的例子,甚至连相关的介绍的也没有。国外好像这么使用的也少见,难道是种错误的写法?为了避免打脸,我用google抓取测试工具robots.txt Tester实测是等效的。

关于这种写法,其实GOOGLE也是有支持的示例 https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt#grouping-of-records:

示例组:

user-agent: a
disallow: /c

user-agent: b
disallow: /d

user-agent: e
user-agent: f
disallow: /g

该示例中指定了三个不同的组,一个针对"a",一个针对"b",还有一个同时针对"e"和"f"。每个组都有各自的组成员记录。注意:您可以选择使用空格(空白行),以提高可读性。

好了,说这么多,无非就是证明这种写法是正确的

这种非常见的robots.txt写法的应用场景:

比如百度当前的robots.txt是长这样的 https://www.baidu.com/robots.txt

简写前

User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?

User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?

User-agent: MSNBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?

User-agent: Baiduspider-image
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?

User-agent: YoudaoBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?

User-agent: Sogou web spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?

User-agent: Sogou inst spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?

User-agent: Sogou spider2
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?

User-agent: Sogou blog
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?

User-agent: Sogou News Spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?

User-agent: Sogou Orion spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?

User-agent: ChinasoSpider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?

User-agent: Sosospider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?


User-agent: yisouspider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?

User-agent: EasouSpider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?

User-agent: *
Disallow: / 

简写后

User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?

User-agent: Googlebot
User-agent: MSNBot
User-agent: Baiduspider-image
User-agent: YoudaoBot
User-agent: Sogou web spider
User-agent: Sogou inst spider
User-agent: Sogou spider2
User-agent: Sogou blog
User-agent: Sogou News Spider
User-agent: Sogou Orion spider
User-agent: ChinasoSpider
User-agent: Sosospider
User-agent: yisouspider
User-agent: EasouSpider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?

User-agent: *
Disallow: /