
Using robots.txt to Block Junk Spiders from Crawling Your Site

Original article by 重庆seo (Chongqing SEO), 2022-07-08

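Anyone running a website wants spiders to come and crawl its content, since that increases the number of indexed pages and improves exposure. White-hat SEO makes sure spiders can reach the site's main content without obstruction. Black-hat SEO instead tries to "trap" spiders inside a network of sites (站群) so that the junk content on those sites looks "rich" to them.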

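A spider usually starts from the home page and then decides whether to keep crawling based on whether the anchor-text links on the page allow it to follow them. See the article 《什么是nofollow标签对SEO优化作用》 (what the nofollow tag is and how it affects SEO).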
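To make the link-following step concrete, here is a minimal sketch of how a well-behaved crawler might separate links it may follow from links marked `rel="nofollow"`. The class name `FollowableLinkParser`, the helper `split_links`, and the sample HTML are all invented for illustration; real search engine spiders apply many more rules than this.

```python
from html.parser import HTMLParser


class FollowableLinkParser(HTMLParser):
    """Collect <a href> values, split by whether rel contains "nofollow"."""

    def __init__(self):
        super().__init__()
        self.followable = []  # links a polite crawler may follow
        self.skipped = []     # links marked rel="nofollow"

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if not href:
            return
        # rel can hold several space-separated tokens, e.g. "nofollow noopener"
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.skipped.append(href)
        else:
            self.followable.append(href)


def split_links(html):
    """Return (followable, skipped) href lists for an HTML fragment."""
    parser = FollowableLinkParser()
    parser.feed(html)
    return parser.followable, parser.skipped


# Hypothetical page fragment, not taken from any real site.
sample = (
    '<a href="/seo/">SEO basics</a>'
    '<a href="/admin/" rel="nofollow">Admin</a>'
    '<a href="/about/" rel="noopener NoFollow">About</a>'
)
followable, skipped = split_links(sample)
print("follow:", followable)
print("skip:", skipped)
```

Note that `nofollow` is only a hint to crawlers that choose to respect it, which is the same limitation robots.txt has.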

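In theory, a spider cannot reach a page that has no URL path leading to it unless you or someone else manually submits it to a search engine, and a submission like that is a security risk for the site. Without a robots protocol in place, spiders roam unimpeded. Some site content is sensitive, such as the admin backend, database, templates, and member information. If any page links to that content, rogue spiders and crawlers written in Python will quietly crawl it, and the data can be misused.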

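Crawling also consumes server resources, so there is always some chance of the site slowing down. On top of that, many people like to write crawlers in Python to scrape site content, which is a real headache for site owners.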

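Most search engine spiders follow the robots protocol, but quite a few do not (mostly ones from overseas). Baidu Webmaster Tools (百度站长) shows many unexplained IPs visiting the site. robots.txt can only stop the spiders that honor it, but that alone reduces server load and improves site speed. Below is a list of junk spiders I have collected; put it in the robots.txt file in the site's root directory.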

Blocking junk spiders with robots.txt

User-agent: AhrefsBot
Disallow: /
User-agent: DotBot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: Uptimebot
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: MegaIndex.ru
Disallow: /
User-agent: ZoominfoBot
Disallow: /
User-agent: Mail.Ru
Disallow: /
User-agent: SeznamBot
Disallow: /
User-agent: BLEXBot
Disallow: /
User-agent: ExtLinksBot
Disallow: /
User-agent: aiHitBot
Disallow: /
User-agent: Researchscan
Disallow: /
User-agent: DnyzBot
Disallow: /
User-agent: spbot
Disallow: /
User-agent: YandexBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Applebot
Disallow: /
User-agent: CensysInspect
Disallow: /
User-agent: MauiBot
Disallow: /
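Before uploading the file, it can help to check that the rules parse the way you expect. The sketch below uses Python's standard `urllib.robotparser` on a three-entry excerpt of the list above. The helper name `is_allowed` and the `example.com` URL are assumptions made for this example, and the check only tells you what a compliant crawler would do.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the rules above, kept in the same layout (no blank lines).
ROBOTS_EXCERPT = """\
User-agent: AhrefsBot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: MJ12bot
Disallow: /
"""


def is_allowed(robots_text, user_agent, url):
    """Return True if a compliant crawler with this name may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder domain, not the author's site.
page = "https://example.com/seo/779.html"
for bot in ("AhrefsBot", "SemrushBot", "Baiduspider"):
    print(bot, "allowed" if is_allowed(ROBOTS_EXCERPT, bot, page) else "blocked")
```

Pass the crawler's short name (for example `AhrefsBot`) rather than a full user-agent header, because `can_fetch` only compares the part before the first `/`.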

Blocking spider IPs (not recommended)

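This is hard to do. Even Baidu has not fully published its spiders' IP addresses, and they change often. The only practical basis for a decision is whether a given spider visits too frequently and whether it is making the server lag.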
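One way to judge visit frequency is to count requests per client in the access log. The following is a rough sketch that assumes the server writes the common "combined" log format; the function name `busiest_clients` and the sample log lines (which use reserved documentation IP ranges) are invented for illustration.

```python
from collections import Counter


def busiest_clients(log_lines, top_n=3):
    """Count requests per (ip, user_agent) in combined-format log lines."""
    hits = Counter()
    for line in log_lines:
        # Splitting on double quotes gives:
        # [ip/ident/time, request, status/size, referer, " ", user-agent, ""]
        parts = line.split('"')
        fields = parts[0].split()
        if len(parts) < 6 or not fields:
            continue  # skip lines that are not in the expected format
        hits[(fields[0], parts[5])] += 1
    return hits.most_common(top_n)


# Made-up log lines for demonstration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [08/Jul/2022:10:00:01 +0800] "GET /seo/1.html HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0)"',
    '203.0.113.5 - - [08/Jul/2022:10:00:02 +0800] "GET /seo/2.html HTTP/1.1" 200 498 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0)"',
    '198.51.100.7 - - [08/Jul/2022:10:00:03 +0800] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0)"',
    '203.0.113.5 - - [08/Jul/2022:10:00:04 +0800] "GET /seo/3.html HTTP/1.1" 200 530 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0)"',
]

for (ip, agent), count in busiest_clients(SAMPLE_LOG):
    print(count, ip, agent)
```

A high count by itself does not prove a client is a junk spider, so treat the result as a starting point for a manual check rather than an automatic block list.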

Blocking by user-agent keyword

For example, on Apache, add the following to the .htaccess file:

<IfModule mod_rewrite.c>
RewriteEngine On
#Block spider
RewriteCond %{HTTP_USER_AGENT} "SemrushBot|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail.RU|curl|perl|Python|Wget|Xenu|ZmEu" [NC]
RewriteRule !(^robots\.txt$) - [F]
</IfModule>
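The logic of the rule above can be approximated in a few lines of Python, which is handy for checking which user agents the keyword list would catch before deploying it. This is only a sketch of the matching logic, not of how Apache evaluates it internally; the function name `would_block` and the sample user-agent strings are assumptions for the example. The keyword list is copied unchanged from the rule.

```python
import re

# Same alternation as the RewriteCond pattern; re.IGNORECASE mirrors [NC].
BLOCKED_UA = re.compile(
    "SemrushBot|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|"
    "jikespider|Purebot|MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|"
    "Yottaa|YandexBot|Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail.RU|curl|"
    "perl|Python|Wget|Xenu|ZmEu",
    re.IGNORECASE,
)


def would_block(user_agent, request_path):
    """Return True if the rule would answer this request with 403 Forbidden."""
    # In a root .htaccess file the rule sees the path without its leading slash.
    path = request_path[1:] if request_path.startswith("/") else request_path
    if path == "robots.txt":
        return False  # robots.txt stays readable for every client
    return BLOCKED_UA.search(user_agent) is not None


# Sample user-agent strings chosen for illustration.
ahrefs = "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"
baidu = "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"

print(would_block(ahrefs, "/seo/779.html"))
print(would_block(ahrefs, "/robots.txt"))
print(would_block("python-requests/2.28.1", "/"))
print(would_block(baidu, "/"))
```

Because the match is a case-insensitive substring search, broad keywords such as `curl`, `perl`, `Python`, and `Wget` also reject legitimate scripts and monitoring tools that use those clients, so trim the list if you rely on any of them.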

On IIS, add the following to the web.config file:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <system.webServer>
   <rewrite>
    <rules>
     <rule name="Block spider">
      <match url="^robots\.txt$" ignoreCase="false" negate="true" />
      <conditions>
      <add input="{HTTP_USER_AGENT}" pattern="SemrushBot|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail.RU|curl|perl|Python|Wget|Xenu|ZmEu" ignoreCase="true" />
      </conditions>
       <action type="AbortRequest"/>
     </rule>
    </rules>
   </rewrite>
  </system.webServer>
</configuration>


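Article URL: https://www.vi586.com/seo/779.html
Copyright: This is an original article, and copyright belongs to 重庆SEO吖七 (Chongqing SEO). You are welcome to share it. Please support original work and keep the source attribution when reposting.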
