Search Bots

Posted: Wed Sep 29, 2004 12:57 pm

OK ... this is a stupid question ... but I have to ask.

Suppose I want to stop bots from visiting my site for a week. If I change my meta tag to META NAME="REVISIT-AFTER" CONTENT="7 DAYS" and change my robots.txt to Disallow: *.*, will that stop them?

Site Admin/Owner Joined: Aug 27, 2002 Posts: 17088

My understanding (copied from the Internet somewhere/sometime):

The robots.txt method is the best. You can also stack records in the file to create different rules for different bots, e.g.
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

User-agent: Googlebot-Image
Disallow: /images/

Broken down, the above means that all robots are barred from the site except Google, which can spider the lot, and Google's image indexer, which is allowed access to anything outside of the images directory. Google is very well behaved, and it even supports wildcard patterns as an extension of its own, e.g.

Disallow: /*.pl$
will stop Google from indexing any URL ending in .pl but allow it to index anything else.
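
The stacked rules at the top of this post can be checked locally before uploading them. The following is a minimal sketch using Python's standard-library urllib.robotparser, assuming a trimmed two-record version of the file; example.com and SomeOtherBot are placeholders, and a real crawler's own parser may behave differently from this one.

```python
from urllib.robotparser import RobotFileParser

# Trimmed copy of the stacked rules: block everyone, then let Googlebot in.
RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
"""


def allowed(agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under RULES."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(agent, url)


# example.com and SomeOtherBot are placeholders, not real targets.
print(allowed("Googlebot", "http://example.com/index.html"))
print(allowed("SomeOtherBot", "http://example.com/index.html"))
```

A named record takes priority over the catch-all `*` record, which is why the order of the records in the file does not matter here.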

Also, there is an older method that some bots still use: a meta tag. It is now falling into obscurity, as almost every bot that recognises it recognises robots.txt as well. [meta name="robots" content="noindex, nofollow"] will stop older bots from indexing the page on which it is found, or following any links from that page. Similarly, [meta name="robots" content="index, nofollow"] will allow the bot to index that page, but not follow its links to spider the rest of the site.
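
To make the two meta tag variants concrete, here is a rough sketch of how a crawler might read them, written with Python's standard html.parser. The class and function names are made up for illustration, and real bots may handle edge cases (multiple tags, bot-specific names) differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated and case-insensitive.
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def may_index_and_follow(html: str) -> tuple[bool, bool]:
    """Return (may_index, may_follow) for a page; both default to True."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return ("noindex" not in parser.directives,
            "nofollow" not in parser.directives)


print(may_index_and_follow('<meta name="robots" content="noindex, nofollow">'))
print(may_index_and_follow('<meta name="robots" content="index, nofollow">'))
```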

As to the REVISIT-AFTER meta tag, most of what I have read says that the search engines operate on their own schedule, but I am no expert in this area.

Posted: Thu Sep 30, 2004 9:31 am

I have this in my robots.txt:

User-agent: Mediapartners-Google*
Disallow:
User-agent: *
Disallow: admin.php
Disallow: /admin/
Disallow: /images/
Disallow: /includes/
Disallow: /themes/
Disallow: /blocks/
Disallow: /modules/
Disallow: /language/
Disallow: *.*

The Disallow: *.* line is new; the rest is there ALL the time. The one that seems to be hitting a lot is jeteye.com. I've never even heard of these guys.

But my traffic has slowed a little. It wouldn't bother me but my traffic is almost 2GB a day. But the reports show that it's not the bots that are doing all the traffic, and I can't see any one file being downloaded a lot to figure out where all the bandwidth is being used.

Posted: Thu Sep 30, 2004 10:20 am

Have you reviewed AWStats from cPanel? If you want to disallow everything, then just use

Disallow: /

I don't know whether *.* works or not.
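
One way to test the *.* question locally is Python's standard-library urllib.robotparser, which applies plain prefix matching rather than wildcard patterns. This is only a sketch: example.com and AnyBot are placeholders, and it shows how this one parser reads the line, not how any particular search engine does.

```python
from urllib.robotparser import RobotFileParser


def blocks(disallow_value: str, url: str) -> bool:
    """Return True if a 'User-agent: *' record with this Disallow value blocks url."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_value}"])
    # AnyBot is a made-up crawler name used only for this check.
    return not parser.can_fetch("AnyBot", url)


url = "http://example.com/index.html"
print("Disallow: /   blocks the page:", blocks("/", url))
print("Disallow: *.* blocks the page:", blocks("*.*", url))
```

Under this parser only the `/` form closes off the whole site; `*.*` is compared literally against the start of the path, so it matches nothing.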

Posted: Thu Sep 30, 2004 10:59 am

I have reviewed AWStats, and it's weird. Everything looks normal. I can't see any big usage on anything.

I've changed my setting to see if that works.

Client Joined: Apr 10, 2004 Posts: 649 Location: UK

Could be someone is linking to images on your site.

Posted: Thu Sep 30, 2004 5:06 pm

Nope, that's not it. I watch that very closely.

After talking with Raven, we figured out what it was. I am actually using that much bandwidth. Now that I know what it is, I won't be so worried that someone is stealing bandwidth.

Posted: Fri Oct 01, 2004 2:08 pm

Glad you got it sussed.

Posted: Sat Oct 02, 2004 5:08 pm

Just for clarification, I recently found this, which substantiates my suspicion that wildcards such as *.* are not allowed/honored.

Quote:

Where do I find out how robots.txt files work?

For a complete overview of how robots.txt exclusion files work, visit: http://www.robotstxt.org/wc/norobots.html

The basic concept is simple. By writing a structured text file you can indicate to robots that certain parts of your server are off-limits to some or all robots. It is best explained with an example:

# /robots.txt file for controlling indexing
User-agent: webcrawler
Disallow:

User-agent: googlebot
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs

The first line, starting with '#', specifies a comment.

The first paragraph specifies that the robot called 'webcrawler' has nothing disallowed: it may go anywhere.

The second paragraph indicates that the robot called 'googlebot' has all relative URLs starting with '/' disallowed. Because all relative URLs on a server start with '/', this means the entire site is closed off.

The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs. Note the '*' is a special token, meaning "any other User-agent"; you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines.

Two common errors:

Wildcards are not supported: instead of 'Disallow: /tmp/*' just say 'Disallow: /tmp/'.
You shouldn't put more than one path on a Disallow line.
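
The prefix rule described in the quote can be sketched in a few lines of Python. This is an illustration of the original standard as quoted, not the code of any real crawler, and the function name is made up.

```python
def is_disallowed(path: str, disallow_prefixes: list[str]) -> bool:
    """Original-standard matching: a path is blocked if it starts with any
    non-empty Disallow value. An empty value disallows nothing."""
    return any(prefix and path.startswith(prefix) for prefix in disallow_prefixes)


rules = ["/tmp", "/logs"]
print(is_disallowed("/tmp/cache.html", rules))    # under /tmp
print(is_disallowed("/tmpfiles/a.html", rules))   # "/tmp" is a prefix of this too
print(is_disallowed("/index.html", rules))        # matches no rule
print(is_disallowed("/tmp/x", ["/tmp/*"]))        # the "*" is compared literally
```

Because the match is a plain string prefix, 'Disallow: /tmp' also covers paths such as /tmpfiles/, while 'Disallow: /tmp/' limits the rule to that directory.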

Where do I find out more about robots?

There is a Web robots home page on: http://www.robotstxt.org/wc/robots.html