Author |
Message |
New Member

Joined: Mar 07, 2005
Posts: 10
Mon Mar 07, 2005 8:38 pm |
A few days ago I tested downloading my site with HTTrack to see whether Sentinel would stop it, but Sentinel didn't block it.
I have now upgraded to Sentinel 2.0 and nothing has changed. You can still download the whole site with this kind of harvesting software.
Here is my configuration for harvesters...
Does anyone know what the cause may be? |
Spouse Contemplates Divorce

Joined: Jan 02, 2003
Posts: 2496
Mon Mar 07, 2005 9:12 pm |
Since it's based on the user agent, there is really only so much that can be done. If the user agent is spoofed or changed, it will get around the filter list. |

Joined: Jan 29, 2004
Posts: 624
Mon Mar 07, 2005 11:47 pm |
Makes sense. Besides I like httrack lol |
_________________ Computer Science is no more about computers than astronomy is about telescopes.
- E. W. Dijkstra |

Tue Mar 08, 2005 5:34 am |
No, I am sure that a previous version of Sentinel (I don't remember which one) could block harvesting software. I also tested with different programs and they all got blocked.
Is there a way to block it then?
I am tired of users who want to download my site. I get around 3000 HTTrack hits per day. |

Tue Mar 08, 2005 7:29 am |
I guess what I would do is take the exact user agent from your Apache log and compare it to the entries in your block list.
You can test different user agents with Sam Spade's text browser or an online test to find out what is and isn't being blocked. |
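One way to do the comparison described above, sketched in PHP: pull the user agent out of an access-log line and list which block-list entries match it. This assumes Apache's "combined" log format, where the user agent is the last quoted field. The function names, the sample log line and the short block list are all made up for illustration and are not part of Sentinel.

```php
<?php
// Hypothetical helper: returns the User-Agent from one "combined" format
// access-log line, or null if the line does not end in a quoted field.
function extractUserAgent($logLine)
{
    // In the combined format the user agent is the final "..." field.
    if (preg_match('/"([^"]*)"\s*$/', $logLine, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: returns the block-list entries found in the agent
// string (case-insensitive substring match).
function matchingBlockEntries($userAgent, array $blockList)
{
    $hits = array();
    foreach ($blockList as $entry) {
        if ($entry !== '' && stripos($userAgent, $entry) !== false) {
            $hits[] = $entry;
        }
    }
    return $hits;
}

// Sample data, invented for illustration only.
$line = '192.0.2.10 - - [08/Mar/2005:07:29:00 +0000] "GET / HTTP/1.0" 200 512 '
      . '"-" "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"';
$blockList = array('httrack', 'wget', 'webzip');

$agent = extractUserAgent($line);
$hits = matchingBlockEntries($agent, $blockList);
echo $agent, "\n";
echo $hits ? 'Blocked by: ' . implode(', ', $hits) : 'No block-list entry matches', "\n";
```

If the agent string from your own log matches nothing in the list, that is the entry to add, or the sign that the client is sending a changed user agent.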

Tue Mar 08, 2005 12:13 pm |
OK, a bit of research turns up:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Tele|WebZIP|Crawl|Control|Offline|Fetch|Miner|HTTrack|Ninja|Online|Fresh|NetAnts|Reaper|Wget|archiver|GetRight|Copier|DA|Stripper|Pockey|Flash)
RewriteCond %{REQUEST_URI} !^/noleech\.html$
RewriteRule .* /noleech.html [L]
What you need to do is put a rewrite rule in your .htaccess so that any harvester not covered by Sentinel gets stopped before it reaches your site. You can then redirect it to any page you want. Try a Google search on "HTTrack user agent" if you crave more info... |

Tue Mar 08, 2005 1:21 pm |
Luckily for you, bergman, I'm interested in this issue myself lol
Okeydoke a bit more funtime on google reveals
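<?php
// filename: websiteguard.php
// Purpose: To deny access for spambots, spybots and other bad agents.
//          When the user agent is a good one it is allowed; otherwise
//          your PHP page stops executing, which protects your website
//          from bad bots.
// Inputs:  User-Agent string
// Author:  Vivek [ webmaster AT allthewebsites DOT org ]
// Version: 1.0.0

//--- Call the function
WebsiteGuard();

function WebsiteGuard()
{
    // Read the User-Agent header; a missing header counts as an empty string.
    $thisAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $isDenied = false;
    // Case-insensitive match against known bad agents.
    // (The list is cut off after "full web" in the post it was copied from.)
    $badAgents = "/webzip|httrack|wget|FlickBot|downloader|production bot|superbot"
        . "|PersonaPilot|NPBot|WebCopier|vayala|imagefetch|Microsoft URL Control"
        . "|mac finder|emailreaper|emailsiphon|emailwolf|emailmagnet|emailsweeper"
        . "|Indy Library|FrontPage|cherry picker|netzip|Share Program|TurnitinBot"
        . "|full web/i";
    if (preg_match($badAgents, $thisAgent)) {
        $isDenied = true;
    }
    if ($isDenied) {
        // Customize this message :-)
        print("Do not disturb...Zzz...\n");
        exit; // stop the rest of the page from running
    }
}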
How to use this script?
Just "include" the script at the top of every PHP page.*
It allows good bots like googlebot, webcrawler, ia_archiver etc. to crawl your website and list it in search engines.
It prevents human users from downloading your entire website with offline browsers.
It does not use .htaccess, so the code is portable to almost any platform that can run PHP scripts (Windows, Unix, Linux etc.).
It can be used with Apache (Linux, *nix) as well as IIS (Windows).
It is simple and lightweight.
A few limitations and disadvantages:
If the user agent is unknown, empty or not provided, the browser is allowed to view the web page. So protect your email addresses with the JavaScript method.
The list of bad bots is not yet complete. It is only a partial list.
If all the known bad bots were added, it would slightly degrade the performance of your script.
It will not protect HTML (.html, .htm) pages from bad bots. Use .htaccess (Apache) or some other script for those.
Even IPs can be blocked, but that part of the code is not shown, since it is specific to each website.
*Just include it at the top of mainfile.php, AFTER the Sentinel code, as that file is called by every other PHP page.
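The first limitation above (an empty or missing user agent is let through) could be tightened with a small variation on the check. The sketch below is not part of websiteguard.php: `isDeniedAgent` and its `$denyEmpty` switch are hypothetical names, and the three-entry pattern only stands in for the full list.

```php
<?php
// Hypothetical variation on the WebsiteGuard check: optionally refuse
// requests that send no User-Agent at all.
function isDeniedAgent($userAgent, $denyEmpty = true)
{
    $userAgent = trim((string) $userAgent);
    if ($userAgent === '') {
        // No user agent supplied: the caller decides whether that is allowed.
        return $denyEmpty;
    }
    // Shortened stand-in for the full bad-agent list.
    return preg_match('/webzip|httrack|wget/i', $userAgent) === 1;
}

// Usage at the top of a page (commented out so the sketch has no side effects):
// $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
// if (isDeniedAgent($ua)) {
//     header('HTTP/1.0 403 Forbidden');
//     exit("Do not disturb...Zzz...\n");
// }
```

Refusing empty user agents may also turn away some legitimate clients, so treat the switch as a judgment call. A harvester that sends a browser's user agent still gets through either way.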
Hope this helps, it did me  |

Tue Mar 08, 2005 4:21 pm |
The above code works fine. I included websiteguard.php in my mainfile.php, d/led and installed HTTrack and ran it against my own website, and it didn't get anything. Of course I have all my 'critical' files plus all directories in my robots.txt, and HTTrack respects that unless it is turned off in Options... anyway, I got the "Get out of here!" message that I had put in the websiteguard code, so I'm sure it blocks HTTrack as well as other harvesters. Voila bergman, your prob is solved, and mine lol |

Tue Mar 08, 2005 5:16 pm |
I shall put it in my mainfile and test it now... |

Tue Mar 08, 2005 5:26 pm |
Yes, it really works. This is really a must. Thank you southern. |
Site Admin

Joined: Aug 29, 2004
Posts: 9457
Location: Arizona
Tue Mar 08, 2005 5:46 pm |
Sounds like a good addition to a future Sentinel release???? |
Sells PC To Pay For Divorce

Posts: 5661
Tue Mar 08, 2005 6:31 pm |
so what does this mean....?
nice try bob but your "i block harvesters" doesn't work ?
has anyone ever come up with the idea of going to the websites of the authors who create the harvester scripts?
Personally i think that the attacks are fewer now than months ago...
I still get friendly bots and they behave, but 3000 hits by harvesters a day ?
And they can't even grab anything useful... |

Tue Mar 08, 2005 9:06 pm |
You're a cynic, hitwalker lol Sentinel does block a lot of harvesters but there are some it doesn't catch. I've gone to the HTTrack site, interesting, and btw in its EULA HTTrack asks that it NOT be used for harvesting, so if it is, it's a misuse... anyway it is open source under GPL licensing, same as nuke and Filezilla, so it can't be all bad. I dunno where other harvesters come from but if I find out I'll use HTTrack to download their site.
Glad to help bergman  |

Wed Mar 09, 2005 3:58 am |
synic ?...ha...ha...its a dutch thing i
i'm not too concerned about website harvesters.
maybe 2 a month get banned, that's all.
And by visiting the websites of those who create the progs that copy websites i mean...
it would be easier to get to know the prog and find ways to reject it on visit. |

Wed Mar 09, 2005 9:27 am |
Code:anyone ever came up with the idea of going to the website of the authors that create the harvester scripts?
Code:And by visiting the website of those who create the progs that copy websites i mean...
What is it about? |

Wed Mar 09, 2005 9:40 am |

Wed Mar 09, 2005 2:31 pm |
I think it's a good idea to visit harvester sites and I'll do it if I can find where to go... ah a new project lol I don't get too many harvesters either like 2 or 3 a month so I got into this as a kind of knowledge for knowledge sake.  |

Wed Mar 09, 2005 2:37 pm |
well i mean the sites that create the software...
the grabbers...
the site copiers...
if you can understand the technique they use it should be simple enough to write something to reject them...ban them or whatever... |

Wed Mar 09, 2005 5:33 pm |
Southern, Sentinel 2.2 has been updated, if you read the news on This one really blocks harvesters, without webguard.php. |