Blocking Spiders and Scrapers by User-Agent in Apache/Nginx

Published on 2015-10-22 by [重庆SEO]

Updated on 2016-07-17

When the spiders of legitimate search engines crawl our site, that benefits the site; junk crawlers, however, should be blocked. Some of them are run by people scraping our content, while others are SEO and similar tools indexing our site data into their own databases for analysis. They contribute nothing to the site and put an extra load on the server. Even for bots that claim to honor it, robots.txt simply cannot stop these junk spiders in practice. Fortunately, most of them still have recognizable traits, such as a distinctive User-Agent (UA), so a few lines of configuration are enough to block them. If a bot forges its UA or keeps changing it, you can instead run a crontab job that analyzes per-IP request frequency in the access logs and blocks offenders that way.

The following are examples; treat them as templates and adapt them to your own situation.

Apache
#------------------------------------------------------------
# Apache: block crawlers by UA and Referer
# [G] returns 410, [F] returns 403
# Source: seonoco.com
#------------------------------------------------------------
<IfModule mod_rewrite.c>
 RewriteCond %{HTTP_USER_AGENT} (wget|curl|AhrefsBot|DotBot|MJ12bot|HTTrack|Findxbot|BLEXBot|WinHttpRequest|Go\s1\.1\spackage\shttp|MegaIndex|BIDUBrowser|FunWebProducts|MSIE\s5|Add\sCatalog|SeznamBot|KomodiaBot|aiHitBot|MojeekBot|PhantomJS|SiteSucker|LinkpadBot|SEOkicks|OpenLinkProfiler|Xenu|007ac9|sistrix|spbot|SiteExplorer|wotbox|ZumBot|ltx71|memoryBot|WBSearchBot|DomainAppender|Python|Aboundex|-crawler|NerdyBot|ZmEu|xovibot) [NC,OR]
 RewriteCond %{HTTP_USER_AGENT} ^$ [NC,OR] 
 RewriteCond %{HTTP_USER_AGENT} ^-$ [NC,OR]
 RewriteCond %{HTTP_REFERER} \.ru/$ [NC,OR]
 RewriteCond %{HTTP_REFERER} (example.com) [NC]
 RewriteRule .* - [G]
</IfModule>
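Before deploying, it can help to sanity-check the UA pattern outside the server. The sketch below uses `grep -Ei` (case-insensitive, like the `[NC]` flag) against a shortened version of the pattern above; the sample User-Agent strings are made up for illustration.

```shell
# Shortened sketch of the UA pattern from the config above
UA_RE='wget|curl|AhrefsBot|MJ12bot|HTTrack|PhantomJS|ZmEu'

# A known bad bot should match...
printf '%s\n' 'Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)' \
  | grep -Eiq "$UA_RE" && echo 'blocked'

# ...while a normal browser UA should not.
printf '%s\n' 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)' \
  | grep -Eiq "$UA_RE" || echo 'allowed'
```

Once the rules are live, a direct check is also possible with something like `curl -A "AhrefsBot" -I http://your-site/`, which should come back 410 Gone.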

Nginx
#------------------------------------------------------------
# Nginx: block crawlers by UA
# Source: seonoco.com
#------------------------------------------------------------
if ($http_user_agent ~* (wget|curl|AhrefsBot|DotBot|MJ12bot|HTTrack|Findxbot|BLEXBot|WinHttpRequest|Go\s1\.1\spackage\shttp|MegaIndex|BIDUBrowser|FunWebProducts|MSIE\s5|Add\sCatalog|SeznamBot|KomodiaBot|aiHitBot|MojeekBot|PhantomJS|SiteSucker|LinkpadBot|SEOkicks|OpenLinkProfiler|Xenu|007ac9|sistrix|spbot|SiteExplorer|wotbox|ZumBot|ltx71|memoryBot|WBSearchBot|DomainAppender|Python|Aboundex|-crawler|NerdyBot|ZmEu|xovibot|^$)) {
 return 403;
} 
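For bots that forge or rotate their UA, the approach mentioned in the introduction is to analyze per-IP request frequency in the logs from a crontab job. Here is a minimal sketch of that idea; the log path and threshold are assumptions to adjust for your server, and the actual blocking step is left as a comment since it depends on your firewall setup.

```shell
#!/bin/sh
# Count requests per client IP in the access log and report IPs above a threshold.
# LOG path and THRESHOLD are assumptions -- adjust for your server.
LOG=/var/log/nginx/access.log
THRESHOLD=1000

awk '{print $1}' "$LOG" | sort | uniq -c | sort -rn \
  | awk -v t="$THRESHOLD" '$1 > t {print $2}' \
  | while read -r ip; do
      echo "suspicious IP: $ip"
      # e.g. iptables -I INPUT -s "$ip" -j DROP   (or append to a deny list)
    done
```

Run it periodically from crontab, for example `*/10 * * * * /path/to/check-ips.sh` to scan every ten minutes.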

References and further reading

[1] 2013 User Agent Blacklist