Serafim
Worker
Joined: Mar 25, 2006
Posts: 109
Location: Delaware, USA
Posted: Tue May 09, 2006 12:08 pm
Hey all, looking for a few suggestions from the veteran nukers. I have been bombarded by the Slurp bot and I am becoming very annoyed. I have blocked them at the Sentinel level and many of their IPs are hitting the .htaccess, but I have blocked at least 10 in the past two days, and I think they should abide by the robots.txt. Besides banning them, what can I do to stop this?
I have currently disallowed all Yahoo email accounts, all Yahoo search, and anything Yahoo related on my site.
Guardian2003
Site Admin
Joined: Aug 28, 2003
Posts: 6799
Location: Ha Noi, Viet Nam
Posted: Tue May 09, 2006 12:44 pm
Yahoo has its own bots, which I believe use the user agent 'Slurp'. The problem is that almost anyone can use a Yahoo bot for their own purposes, which is why there are so many of the darn things. They will show either 'Slurp' or 'inktomisearch.com' as the user agent, depending on the IP and what script you are using to monitor them.
There are several things you can do here.
You can slow the crawling process by adding this to your robots.txt file:

Code:
```
User-agent: Slurp
Crawl-delay: 30
```

This will create a delay between accesses; the higher the number, the greater the delay.
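For what it's worth, a Slurp-specific section like that can sit alongside your normal global rules in the same file; a minimal sketch (the Disallow path here is just an example, use your own):

Code:
```
# Rules for all well-behaved bots
User-agent: *
Disallow: /admin/

# Extra throttling for Yahoo's Slurp only
User-agent: Slurp
Crawl-delay: 30
```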
A genuine Slurp bot should obey the directives in robots.txt, but don't forget it may take some time before it re-crawls that file and picks up any changes you have made.
You could ban them completely, if you feel it is getting out of control, with something like this at the top of your .htaccess:

Code:
```
# Requires mod_rewrite; the RewriteEngine line is needed if it is not already on
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Slurp
RewriteRule ^.*$ X.html [L]
```
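If you would rather refuse their requests outright instead of serving a page, mod_rewrite's [F] flag returns a 403 Forbidden; a variant along these lines should do it (untested sketch):

Code:
```
RewriteEngine On
# Any user agent containing 'Slurp' (case-insensitive) is refused with a 403
RewriteCond %{HTTP_USER_AGENT} Slurp [NC]
RewriteRule .* - [F,L]
```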
Serafim
Posted: Tue May 09, 2006 12:49 pm
The number 30, does that equal days, minutes, or what? It's not that it's out of control; Sentinel gets them and bans them, it's just annoying. I can expect at least three per day to be banned. I guess beyond adding those lines of code I will just need to set my email to not be forwarded on a harvest. Thanks for the help.
Guardian2003
Posted: Tue May 09, 2006 3:13 pm
As far as I can tell, the delay value is in seconds.
Three Slurps per day is not actually that many; I have had dozens of them at any given time, before I finally got fed up with them and banned them.
If you Google for 'Crawl-delay' you will probably find Yahoo's help pages, which give more information, including an online form for requesting that they not crawl your site.
Serafim
Posted: Tue May 09, 2006 4:05 pm
OK, I added the delay to my robots.txt. I know three is not many, but that's just an average; before I added Slurp to my harvest list they would hit the site 20 or so times. I figure even three hits per day is a waste of bandwidth. I don't use Yahoo for anything, and I had to email them to request that certain content be removed from their search engine. It's just the principle of the matter: if they cannot play by the rules, I do not want them near my site. If they just visited my index page it would be no issue, but I'm seeing my modules, attempts at admin login, and even posts from these forums in their search engine. It's just creepy, lol. Thanks again.
guidyy
Worker
Joined: Nov 22, 2004
Posts: 208
Location: Italy
Posted: Tue May 09, 2006 11:07 pm
That's a problem with all search engines; they follow links and do not know what is for admin, what is for posting, and so on. The only thing you can do is control them via the robots.txt.
Guardian2003
Posted: Wed May 10, 2006 12:41 am
Serafim, please post the contents of your robots.txt file.
guidyy
Posted: Wed May 10, 2006 5:44 am
This is mine, for a GoogleTapped site:

```
User-agent: *
Disallow: /admin/
Disallow: /images/
Disallow: /includes/
Disallow: /themes/
Disallow: /blocks/
Disallow: /modules/
Disallow: /language/
Disallow: /admin.php
Disallow: /config.php
Disallow: /cgi-bin/
Disallow: /feedback.html
Disallow: /reccomend.html
Disallow: /members.html
Disallow: /messages.html
Disallow: /account.html
Disallow: /submit.html
Disallow: /top.html
Disallow: /stats.html
Disallow: /fsearch-newposts.html
Disallow: /fsearch-egosearch.html
Disallow: /fsearch-unanswered.html
Disallow: /forums-group6.html
Disallow: /forums-group7.html
Disallow: /forums-groupcp.html
Disallow: /forums-search.html
Disallow: /forum-editprofile.html
Disallow: /message-post
Disallow: /ftopic-new
Disallow: /ftopic-reply
Disallow: /ftopic-quote
Disallow: /forum-userprofile
Disallow: /ratelink-
Disallow: /linkop-AddLink.html
Disallow: /messages-inbox.html
Disallow: /messages-post
Disallow: /modules.php
```

The last directive tells spiders they do not have to care about whatever is not tapped. If you don't use GoogleTap you need to drop it.
So far, it's working pretty well.
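For anyone not using it, GoogleTap rewrites PHP-Nuke's dynamic modules.php URLs into static-looking .html names, which is why the tapped file names above can be disallowed directly. The mapping works roughly like this (illustrative only; the real GoogleTap rules are more involved):

```
RewriteEngine On
# Map the friendly URL back to the real PHP-Nuke script
RewriteRule ^forums\.html$ modules.php?name=Forums [L]
```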
Guido
Serafim
Posted: Wed May 10, 2006 2:33 pm
```
User-agent: *
Crawl-delay: 30
Disallow: /modules.php?name=edited****************************
Disallow: /abuse/
Disallow: /admin/
Disallow: /blocks/
Disallow: /cgi-bin/
Disallow: /db/
Disallow: /images/
Disallow: /includes/
Disallow: /language/
Disallow: /modules/
Disallow: /themes/
Disallow: /admin.php
Disallow: /config.php
Disallow: /conf/
Disallow: /chat/
Disallow: /other/
Disallow: /scripts/
Disallow: /ebot/
Disallow: /botsv/
Disallow: /botsi/
```
I received a letter from Yahoo this afternoon which stated that their robots are in compliance with the robots standard. Perhaps it is me who messed up; I assumed that if I placed /modules/ that would cover all portions of the modules, including modules.php. They said that in order to keep the robots from crawling that specific page I would have to list it as /modules.php. I'll paste the letter they sent:
I have investigated the issue but was not able to find any exclusions
for the following URLs in your robots.txt:
www.disciplesofcain.com/modules.php?name=Forums
www.disciplesofcain.com/modules.php?name=Forums
www.disciplesofcain.com/modules.php
It seems that Slurp is acting in compliance with the robots.txt
exclusion standard of 1994. Your current robots.txt reads:
see above
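So the gotcha is that robots.txt rules match by URL-path prefix: Disallow: /modules/ (with the trailing slash) never matches /modules.php. To illustrate (the example paths are mine, not from the letter):

```
# robots.txt matches by path prefix
Disallow: /modules/      # blocks /modules/Forums/ etc., but NOT /modules.php
Disallow: /modules.php   # blocks /modules.php and /modules.php?name=Forums
```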
Serafim
Posted: Sun May 14, 2006 12:07 am
Just an update about those spiders: I added the whole /modules.php to my robots.txt. Funny thing though, after Yahoo contacted me about my loathing of their spiders, all the spiders have stopped. Not one trace of them has hit me in days. Just thought I would share that; thanks for all the help again.
Guardian2003
Posted: Sun May 14, 2006 1:27 am
No problem, glad they are leaving you in peace now ![Wink](modules/Forums/images/smiles/icon_wink.gif)