Author |
Message |
HauntedWebby
Involved
Joined: May 19, 2004
Posts: 363
Location: Ogden, UT
|
Posted:
Wed Sep 29, 2004 12:57 pm |
|
OK ... this is a stupid question ... but I have to ask.
If I want to stop bots from visiting my site for a week, and I change my meta tag to META NAME="REVISIT-AFTER" CONTENT="7 DAYS", and change my robots.txt to disallow: *.*
Will that stop them? |
_________________ --Webby-- |
|
|
Raven
Site Admin/Owner
Joined: Aug 27, 2002
Posts: 17088
|
Posted:
Wed Sep 29, 2004 9:20 pm |
|
My understanding (copied from the Internet somewhere/sometime):
The robots.txt method is the best. You can also stack groups in the file to create different rules for different bots, e.g.
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow:
User-agent: Googlebot-Image
Disallow: /images/
Broken down, the above means that all robots are barred from the site,
except Google, which can spider the lot, and Google's image indexer, which is allowed access to anything outside of the images directory. Google is very well behaved, and it even supports wildcard extensions, e.g.
Disallow: *.pl
will stop Google from indexing any .pl files but allow it to index anything else.
Also, there is an older method that some bots still use, a meta tag version, though it's now falling into obscurity, as almost every bot that recognises it recognises robots.txt as well: [meta name="robots" content="noindex, nofollow"] will stop older bots from indexing the page on which it is found, or following any links from that page. Similarly, [meta name="robots" content="index, nofollow"] will allow the bot to index that page, but not follow its links to spider the rest of the site.
As to the meta tag, most of what I have read is that the search engines operate on their own schedule, but I am no expert in this area ;) |
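For anyone who wants to check how stacked groups like these are read, here is a minimal sketch using Python's standard `urllib.robotparser`. It is an illustration only: the bot name `SomeOtherBot` and the test paths are made up, and real crawlers may choose between overlapping groups differently. The Googlebot-Image group is listed ahead of Googlebot here because this parser appears to use the first group whose name matches the agent.

```python
from urllib.robotparser import RobotFileParser

# The stacked rules from the post, with the more specific Googlebot-Image
# group placed before Googlebot (an ordering chosen for this parser).
STACKED_RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot-Image
Disallow: /images/

User-agent: Googlebot
Disallow:
"""

stacked = RobotFileParser()
stacked.parse(STACKED_RULES.splitlines())

# Hypothetical agents and paths, used only to exercise each group.
checks = [
    ("SomeOtherBot", "/index.html"),
    ("Googlebot", "/index.html"),
    ("Googlebot-Image", "/index.html"),
    ("Googlebot-Image", "/images/logo.png"),
]

for agent, path in checks:
    verdict = "allowed" if stacked.can_fetch(agent, path) else "blocked"
    print(f"{agent:16} {path:18} {verdict}")
```

An empty `Disallow:` line means nothing is disallowed for that group, which is why the Googlebot group opens the whole site to it.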
|
|
|
HauntedWebby
|
Posted:
Thu Sep 30, 2004 9:31 am |
|
I have in my robots.txt:
User-agent: Mediapartners-Google*
Disallow:
User-agent: *
Disallow: admin.php
Disallow: /admin/
Disallow: /images/
Disallow: /includes/
Disallow: /themes/
Disallow: /blocks/
Disallow: /modules/
Disallow: /language/
Disallow: *.*
The disallow *.* is a new thing; the rest is there ALL the time. The one that seems to be hitting a lot is jeteye.com. I've never even heard of these guys.
But my traffic has slowed a little. It wouldn't bother me, but my traffic is almost 2GB a day. The reports show that it's not the bots that are generating all the traffic, and I can't see any one file being downloaded a lot to figure out where all the bandwidth is being used. |
|
|
|
Raven
|
Posted:
Thu Sep 30, 2004 10:20 am |
|
Have you reviewed awstats from cPanel? If you want to disallow everything, then just use
Disallow: /
I don't know whether *.* works or not. |
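Whether `*.*` works can be checked against a parser that follows the original exclusion standard. A minimal sketch, assuming Python's standard `urllib.robotparser` is a fair stand-in for such a bot (`ExampleBot` and the path are hypothetical):

```python
from urllib.robotparser import RobotFileParser


def allowed(rules: str, agent: str, path: str) -> bool:
    """Parse a robots.txt body and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


# Two candidate ways of trying to close the whole site.
WILDCARD_RULES = "User-agent: *\nDisallow: *.*\n"
SLASH_RULES = "User-agent: *\nDisallow: /\n"

print("*.* rule lets the bot in:", allowed(WILDCARD_RULES, "ExampleBot", "/index.php"))
print("/ rule lets the bot in:  ", allowed(SLASH_RULES, "ExampleBot", "/index.php"))
```

Under that assumption the `*.*` rule blocks nothing, because a path such as /index.php does not begin with the literal text of the rule, while `Disallow: /` matches every path.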
|
|
|
HauntedWebby
|
Posted:
Thu Sep 30, 2004 10:59 am |
|
I have reviewed the awstats, and it's weird. Everything looks normal. I can't see any big usage on anything.
I've changed my setting to see if that works. :) |
|
|
|
Muffin
Client
Joined: Apr 10, 2004
Posts: 649
Location: UK
|
Posted:
Thu Sep 30, 2004 4:59 pm |
|
Could be someone is linking to images on your site |
_________________ Classic Mini rules the bends & bends the rules! |
|
|
HauntedWebby
|
Posted:
Thu Sep 30, 2004 5:06 pm |
|
Nope, that's not it. I watch that very closely.
After talking with Raven, we figured out what it was. I am actually using that much bandwidth. Now that I know what it is, I won't be so worried that I have someone stealing bandwidth :) |
|
|
|
Muffin
|
Posted:
Fri Oct 01, 2004 2:08 pm |
|
|
|
Raven
|
Posted:
Sat Oct 02, 2004 5:08 pm |
|
Just for clarification, I recently found this, which substantiates my suspicion that wildcards (*.*) are not allowed/honored ;)
Quote: |
Where do I find out how robots.txt files work?
For a complete overview of how robots.txt exclusion files work, visit: http://www.robotstxt.org/wc/norobots.html
The basic concept is simple. By writing a structured text file you can indicate to robots that certain parts of your server are off-limits to some or all robots. It is best explained with an example:
# /robots.txt file for controlling indexing
User-agent: webcrawler
Disallow:
User-agent: googlebot
Disallow: /
User-agent: *
Disallow: /tmp
Disallow: /logs
The first line, starting with '#', specifies a comment.
The first paragraph specifies that the robot called 'webcrawler' has nothing disallowed: it may go anywhere.
The second paragraph indicates that the robot called 'googlebot' has all relative URLs starting with '/' disallowed. Because all relative URLs on a server start with '/', this means the entire site is closed off.
The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs. Note the '*' is a special token, meaning "any other User-agent"; you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines.
Two common errors:
Wildcards are not supported: instead of 'Disallow: /tmp/*' just say 'Disallow: /tmp/'.
You shouldn't put more than one path on a Disallow line.
Where do I find out more about robots?
There is a Web robots home page on: http://www.robotstxt.org/wc/robots.html |
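The quoted example can be exercised with Python's standard `urllib.robotparser`, as a rough sketch of how a prefix-matching robot would read it. The agent name `ExampleBot` and the test paths are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example file from the quoted FAQ, with blank lines between groups.
FAQ_RULES = """\
# /robots.txt file for controlling indexing
User-agent: webcrawler
Disallow:

User-agent: googlebot
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
"""

faq_parser = RobotFileParser()
faq_parser.parse(FAQ_RULES.splitlines())

# Hypothetical requests, one or more per group in the file.
faq_checks = [
    ("webcrawler", "/tmp/cache.html"),   # nothing disallowed for webcrawler
    ("googlebot", "/index.html"),        # whole site closed to googlebot
    ("ExampleBot", "/index.html"),       # falls under the '*' group
    ("ExampleBot", "/tmp/cache.html"),   # starts with /tmp
    ("ExampleBot", "/tmpfile.html"),     # also starts with /tmp
    ("ExampleBot", "/logs/access.log"),  # starts with /logs
]

for agent, path in faq_checks:
    verdict = "allowed" if faq_parser.can_fetch(agent, path) else "blocked"
    print(f"{agent:11} {path:18} {verdict}")
```

Because matching is by prefix, `Disallow: /tmp` also covers a path such as /tmpfile.html; `Disallow: /tmp/` would limit the rule to the directory.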
|
|
|
|
|