Ravens PHP Scripts: Forums
 

 

Ravens PHP Scripts And Web Hosting Forum Index -> General/Other Stuff
Susann
Moderator



Joined: Dec 19, 2004
Posts: 3191
Location: Germany:Moderator German NukeSentinel Support

Posted: Wed Sep 21, 2005 4:55 pm

We all know that some search engines have a monopoly, so I found the project behind the bot in the log below great and very interesting.

But I wasn't very pleased when I checked my log files. The crawler hits the same file several times. Weeks later, I think the bot cannot handle the session IDs in the forums.

Code:
67.53.54.213 - - [20/Sep/2005:17:47:15 +0200] "GET /forum-11.html?sid=1b8043b9d6b72afe85a40aa8e8aedcce HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (/bot.php?+)"

67.53.54.213 - - [20/Sep/2005:17:47:20 +0200] "GET /forum-11.html?sid=1b88f66ad6a2a53de5cc20cf96cdf730 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:47:25 +0200] "GET /forum-11.html?sid=1b8a8ec7e7ee5117b23746551a59f242 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:47:30 +0200] "GET /forum-11.html?sid=1b90c226566b115aba82160c06199c48 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:47:35 +0200] "GET /forum-11.html?sid=1bb945a7ab5218f27bda9952a13befcf HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:47:40 +0200] "GET /forum-11.html?sid=1bc73a5c5e05535bf3e6f672ed040cbb HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:47:45 +0200] "GET /forum-11.html?sid=1bd1a7eab553b5fd522231ef679ad112 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:47:50 +0200] "GET /forum-11.html?sid=1be250dffaa8f3fe420cc513705d1553 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:47:55 +0200] "GET /forum-11.html?sid=1c084facd4dda1e9b6139147f85f7985 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"

67.53.54.213 - - [20/Sep/2005:17:48:04 +0200] "GET /forum-11.html?sid=1c44737502f47007f9ed91d176da7f27 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:48:08 +0200] "GET /forum-11.html?sid=1c4ae8a9b8cd3e2eb3fb1a47f418eef9 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:48:13 +0200] "GET /forum-11.html?sid=1c55d390f897683c30b7363d916713eb HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:48:17 +0200] "GET /forum-11.html?sid=1c61bf7c1bafb8229f24b61f8ef4222f HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:48:24 +0200] "GET /forum-11.html?sid=1ca8f8c274562ce712c37fbffcb0eccf HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
67.53.54.213 - - [20/Sep/2005:17:48:28 +0200] "GET /forum-11.html?sid=1cb9dde375e1be3033c0646b7d850617 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"
 
djmaze
Subject Matter Expert



Joined: May 15, 2004
Posts: 727
Location: http://tinyurl.com/5z8dmv

Posted: Thu Sep 22, 2005 1:14 am

Next to no bots handle sessions; that's why it sometimes looks as if you have 500 visitors :)
Secondly, SIDs suck because a search engine relies on a list of URLs, and once URLs containing a SID are listed in the search engine, your pages will be hit more and more abusively.

Get rid of the SID and everything should become normal.
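
To see this effect in a log file, one can count how many distinct session IDs a single client requested. The sketch below is an illustration only: it assumes combined-format access log lines like the ones in the first post, and the names `LOG_RE`, `sessions_per_client`, and `SAMPLE` are made up for this example.

```python
import re
from collections import defaultdict

# Captures the client IP and a 32-character hex sid from an access log line.
LOG_RE = re.compile(r'^(\S+) .*?"GET \S*?[?&]sid=([0-9a-f]{32})')


def sessions_per_client(lines):
    """Map each client IP to the set of distinct session IDs it requested."""
    seen = defaultdict(set)
    for line in lines:
        match = LOG_RE.match(line)
        if match:
            ip, sid = match.groups()
            seen[ip].add(sid)
    return dict(seen)


# Two of the log lines quoted in the first post.
SAMPLE = [
    '67.53.54.213 - - [20/Sep/2005:17:47:20 +0200] "GET /forum-11.html?sid=1b88f66ad6a2a53de5cc20cf96cdf730 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"',
    '67.53.54.213 - - [20/Sep/2005:17:47:25 +0200] "GET /forum-11.html?sid=1b8a8ec7e7ee5117b23746551a59f242 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (http://majestic12.co.uk/bot.php?+)"',
]

if __name__ == "__main__":
    for ip, sids in sessions_per_client(SAMPLE).items():
        print(ip, "requested", len(sids), "different session IDs")
```

One client that never sends the session back ends up with a new SID on every request, which is presumably why a single bot can look like a crowd of visitors.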
 
Susann







Posted: Thu Sep 22, 2005 4:56 am

I know about our SID problem. None of the experts I asked was able to fix it.
But the bot above is an extreme example. I haven't seen other bots in our logs behaving like this.
 
djmaze







Posted: Thu Sep 22, 2005 5:47 am

It's d*** easy to fix: phpBB has a function that uses 'sid='. Remove that part and it works.
 
Susann







Posted: Thu Sep 22, 2005 6:15 am

If you think it's so easy, you can take a look. I'll give you the address of our session.php.
 
djmaze







Posted: Thu Sep 22, 2005 7:59 am

Look inside the function append_sid().
In the end it uses $SID, so now you only have to figure out where $SID gets assigned a value.

At the end of the function session_begin() we see this line:
Code:
$SID = 'sid=' . $session_id;

Change it into:
Code:
$SID = '';

and watch the miracle happen ;)

An expert is no expert if he can't debug and backtrace things that happen.
I just opened a copy of php-nuke, did a file search on 'sid=' and found the files within a second, and I'm no expert either, I'm just a code guru.
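
Why an empty $SID is enough can be shown with a small model. The following is a simplified Python stand-in written for this thread, not the phpBB source: it assumes append_sid() only adds the session fragment when that fragment is non-empty and not already in the URL, and the URL `viewforum.php?f=11` is just an illustrative value.

```python
def append_sid(url, sid_fragment):
    """Simplified model of the behaviour described above (an assumption,
    not phpBB's actual code): append the session fragment to the URL
    unless the fragment is empty or the URL already carries a sid."""
    if sid_fragment and "sid=" not in url:
        # Start the query string with '?' or extend an existing one with '&'.
        separator = "&" if "?" in url else "?"
        return url + separator + sid_fragment
    return url


if __name__ == "__main__":
    # With the original assignment, every generated link carries the session ID.
    print(append_sid("viewforum.php?f=11", "sid=1b8043b9d6b72afe85a40aa8e8aedcce"))
    # With the fragment set to an empty string, links come out unchanged.
    print(append_sid("viewforum.php?f=11", ""))
```

The trade-off is that links no longer carry a session at all, so visitors who refuse cookies would lose their session between page views.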
 
hitwalker
Sells PC To Pay For Divorce



Joined:
Posts: 5661

Posted: Thu Sep 22, 2005 8:26 am

lol...
 
majestic-12
New Member



Joined: Sep 22, 2005
Posts: 10

Posted: Thu Sep 22, 2005 5:55 pm

Hi there,

I am the creator of the bot -- I found this forum just like you found my bot: from the log file :)

I am surprised a session ID was present in the URL, because a few months ago I implemented SID= filtering that removes these SIDs from URLs before deduplication. It's possible that the URLs that were crawled were loaded before that change. I could swear this code works, but I am going to double-check it. This does not guarantee that you won't have a few more of those URLs crawled, but hopefully a new batch of URLs won't have these SIDs.

Anyhow, it's a good idea to get rid of those SIDs, because even big search engines like Google get confused.

regards

alexc
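
The idea of removing SIDs before deduplication can be sketched as follows. This is not the bot's actual code, which is not shown in this thread; it is a minimal illustration using Python's standard `urllib.parse`, and the helper names `strip_sid` and `dedupe` are invented here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_sid(url, param="sid"):
    """Return the URL with the session parameter removed from its query string."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() != param
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def dedupe(urls):
    """Keep one canonical (sid-free) entry per page, in first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        canonical = strip_sid(url)
        if canonical not in seen:
            seen.add(canonical)
            unique.append(canonical)
    return unique


if __name__ == "__main__":
    queue = [
        "/forum-11.html?sid=1b88f66ad6a2a53de5cc20cf96cdf730",
        "/forum-11.html?sid=1b8a8ec7e7ee5117b23746551a59f242",
    ]
    print(dedupe(queue))
```

Without the stripping step, both entries in the queue count as different pages, and the same forum page is fetched once per session ID.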
 
Susann







Posted: Thu Sep 22, 2005 6:17 pm

Hi,

Great project.
I also know the similar German project YACY.
But if you really implemented SID filtering, why doesn't it work? Your bot caused 180 MB of traffic on our website up to September 20. I know I waited a long time, because I thought something would change. I added the bot to my robots.txt today. I hope it doesn't ignore it.
 
majestic-12







Posted: Thu Sep 22, 2005 6:20 pm

The bot does NOT ignore robots.txt, and it supports the Crawl-Delay parameter if you want a bigger than normal (1 sec) delay between requests.

I do have SID filtering implemented; however, I am going to recheck it and find out where the bug is -- it could be a bug in the filtering code, or it could be that the URL was found before SID filtering was implemented.

I know about YACY, but I think the model they are trying to achieve -- a P2P search engine -- is not possible with the current level of consumer-grade hardware. I do not believe it will scale to WWW levels of billions of web pages.
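
On the robots.txt and Crawl-Delay point, here is a hedged sketch of what such rules could look like and how a parser reads them. The user-agent token `MJ12bot` is taken from the log lines above, the delay of 10 seconds is an arbitrary example value, and Python's standard `urllib.robotparser` stands in for whatever parser the bot really uses.

```python
from urllib.robotparser import RobotFileParser

# Option 1 (example values): let the bot crawl, but ask for a longer pause.
SLOW_DOWN = """\
User-agent: MJ12bot
Crawl-delay: 10
"""

# Option 2 (example values): keep this one bot out of the whole site.
BLOCK = """\
User-agent: MJ12bot
Disallow: /
"""


def load_rules(robots_txt):
    """Parse robots.txt text and return the parser holding its rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


slow = load_rules(SLOW_DOWN)
blocked = load_rules(BLOCK)

if __name__ == "__main__":
    print("requested delay:", slow.crawl_delay("MJ12bot"))
    print("may fetch when slowed:", slow.can_fetch("MJ12bot", "/forum-11.html"))
    print("may fetch when blocked:", blocked.can_fetch("MJ12bot", "/forum-11.html"))
```

Both variants only affect crawlers that honour robots.txt, and rules addressed to one user agent leave other bots untouched.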
 
Susann







Posted: Thu Sep 22, 2005 6:32 pm

A few days ago I read about your discussion in the YACY forum. It was very interesting.
Anyway, I wish you the best for your project, and I hope I find a way to get rid of the SIDs.

:)
 
majestic-12







Posted: Thu Sep 22, 2005 6:36 pm

Thanks -- best wishes for whatever you do in cyber life too! :)

I did have a friendly discussion with the YACY people, but they did not agree with me, which is fine -- perhaps I am wrong and P2P is possible, but I choose to focus on something that can definitely happen :)

I will review SID removal before loading the next batch of data.
 
majestic-12







Posted: Mon Sep 26, 2005 8:49 am

I am back!

As promised, I tested my code to see if there was a bug. My code was indeed NOT removing your session ID, but there is a good reason for it -- your URL is actually not correct, because you add a query parameter without starting the query string with ? first!

I.e., you have this URL:

/forum-11.html&sid=1b8043b9d6b72afe85a40aa8e8aedcce

Note the & -- you added the parameter right after the filename (which I am sure is used for an internal rewrite, but URL parsing does not know that).

The correct way to do it would have been this:

/forum-11.html?sid=1b8043b9d6b72afe85a40aa8e8aedcce

Use ? to start the query string, and then add your SID. This is the proper way to build URLs.

I just tested my code with it, and if you had used the proper query delimiter, the session IDs would have been removed :)

Now, what's the way forward here? Technically speaking it's your "fault", as you use non-standards-compliant URLs. However, I will write special parsing for the current load only to correct your URLs. You really need to fix them on your side, though, because they will confuse other search engines -- Google might just look at the URL and not crawl it in the first place, so you are losing out on traffic anyway.
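
The difference between the two URL forms is easy to reproduce with any standard URL parser. The sketch below uses Python's `urllib.parse` purely as an illustration (the bot's own parser is not shown in this thread); `describe`, `broken`, and `fixed` are names chosen for this example.

```python
from urllib.parse import parse_qs, urlsplit


def describe(url):
    """Return the path and the parsed query parameters of a URL."""
    parts = urlsplit(url)
    return parts.path, parse_qs(parts.query)


# The form found in the crawled pages: '&' directly after the file name.
broken = describe("/forum-11.html&sid=1b8043b9d6b72afe85a40aa8e8aedcce")
# The form with a proper query delimiter.
fixed = describe("/forum-11.html?sid=1b8043b9d6b72afe85a40aa8e8aedcce")

if __name__ == "__main__":
    print("broken:", broken)
    print("fixed: ", fixed)
```

In the first case the parser sees no query string at all, so the whole text, session ID included, is treated as the file name and a sid filter has nothing to remove.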
 
hitwalker







Posted: Mon Sep 26, 2005 9:04 am

That would be bad news for Susann, because her number 1 priority is a good rank.
Well, Susann, you have some work to do.
 
Susann







Posted: Mon Sep 26, 2005 1:29 pm

Thanks, majestic-12, for your informative reply.
Well, I'm really not an expert, but I thought our rewritten URLs were correct. I'm using GT Next Gen. I had enough stress with this forum in the past, and I'm sure it's not a typical nuke forum. But it's working and our members like it. I don't touch the forum files anymore, but I'm currently looking for someone who can fix some issues.
Thanks again.

Btw: I know Bull from my other forum. Was he a beta tester?

@ Hitwalker
Don't worry.

Our ranking is good enough. We get between 5 and 15 new members daily. You know which game we support. Check us out. ;)


Last edited by Susann on Mon Sep 26, 2005 1:33 pm; edited 1 time in total 
majestic-12







Posted: Mon Sep 26, 2005 1:32 pm

Your rewrite is fine; it's just the SID bit that's the problem. If I were you I'd disable it completely, because even though my bot understands it (provided the URL is properly formatted), others won't.
 
Susann







Posted: Mon Sep 26, 2005 1:44 pm

Quote:
just the SID ... disable it completely
That isn't so easy.
I know how other webmasters have optimized their phpBB forums to be search-engine friendly. They don't disable the SIDs completely.
 
majestic-12







Posted: Mon Sep 26, 2005 2:40 pm

Susann wrote:
Quote:
just the SID ... disable it completely
That isn't so easy.


You are probably right about this -- but you definitely need to fix your URLs by changing & to ?, because without it a URL parsing routine will think it's a fancy filename rather than a query parameter.
 
Susann







Posted: Mon Sep 26, 2005 3:46 pm

Quote:
but you definitely need to fix your URLs by changing & to ?

Yes, I understand, but how do I do this exactly?

-----------------------------------------------------------------------------
Really, I ask myself daily which problem and fault I should fix first.


Last edited by Susann on Mon Sep 26, 2005 4:02 pm; edited 1 time in total 
majestic-12







Posted: Mon Sep 26, 2005 3:53 pm

Susann wrote:
Yes, I understand, but how do I do this exactly?


Well, you will need to edit the code that appends those SIDs. From what I can see, you must be using some internal rewriting to get nice ".html" URLs, and that code incorrectly adds the SID without first starting the query string with the ? character.

I changed my URL loader to fix this for you, but I doubt other search engines will do the same.
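
What such a loader-side correction might look like is sketched below. This is a guess for illustration, not the change actually made to the bot: the function name `repair_query_delimiter` and its heuristic (no ? in the URL, but an & followed by a name=value pair) are assumptions, and URL fragments are not handled.

```python
def repair_query_delimiter(url):
    """Hypothetical heuristic: if a URL has no '?' but contains
    '&name=value', treat the first '&' as the missing query delimiter."""
    if "?" in url or "&" not in url:
        return url  # already has a query string, or nothing to repair
    head, tail = url.split("&", 1)
    if "=" not in tail:
        return url  # the '&' looks like part of a file name, leave it alone
    return head + "?" + tail


if __name__ == "__main__":
    print(repair_query_delimiter("/forum-11.html&sid=1b8043b9d6b72afe85a40aa8e8aedcce"))
    print(repair_query_delimiter("/forum-11.html?sid=1b8043b9d6b72afe85a40aa8e8aedcce"))
```

A repair like this has to guess at the site's intent, which is why fixing the code that generates the links is the more reliable route.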
 
Susann







Posted: Thu Oct 27, 2005 5:04 pm

I did my homework. There aren't any sessions for bots anymore.
But I'm surprised that your bot still gets a SID. :roll:
I swear I couldn't find any session IDs when I checked my log files, or when I used some helpful tools to check my site. Really strange. I also wonder why we seem to be the only ones on earth with this Majestic bot problem, because a lot of php-nuke sites have the same SID problem. Any other ideas?
Code:
62.20.222.103 - - [26/Oct/2005:02:07:12 +0200] "GET /forum-29.html?sid=91d94aba163e7238014ed962c45d7b8d HTTP/1.1" 200 24821 "-" "MJ12bot/v1.0.4 (http://majestic12.co.uk/bot.php?+)"

62.20.222.103 - - [26/Oct/2005:02:07:14 +0200] "GET /forum-29.html?sid=920f3fde9d8138a242ef4b537fae9a4a HTTP/1.1" 200 24821 "-" "MJ12bot/v1.0.4 (http://majestic12.co.uk/bot.php?+)"
62.20.222.103 - - [26/Oct/2005:02:07:16 +0200] "GET /forum-29.html?sid=97668c6ea394a2ebf3674c2c82bc34f6 HTTP/1.1" 200 24821 "-" "MJ12bot/v1.0.4 (http://majestic12.co.uk/bot.php?+)"
62.20.222.103 - - [26/Oct/2005:02:07:24 +0200] "GET /forum-29.html?sid=97cc19a6cb5bffedac687f2c5525fbe0 HTTP/1.1" 200 24821 "-" "MJ12bot/v1.0.4 (http://majestic12.co.uk/bot.php?+)"
62.20.222.103 - - [26/Oct/2005:02:07:25 +0200] "GET /forum-29.html?sid=9956ef31690f42e7e14078ad52ea3cb7 HTTP/1.1" 200 24821 "-" "MJ12bot/v1.0.4 (http://majestic12.co.uk/bot.php?+)"
62.20.222.103 - - [26/Oct/2005:02:07:27 +0200] "GET
------------cut--------------------


64.242.88.50 - - [26/Oct/2005:12:57:14 +0200] "GET /forum-29.html HTTP/1.1" 200 11200 "-" "Mozilla/4.0 compatible ZyBorg/1.0 Dead Link Checker (wn.dlc@looksmart.net; http://www.WISEnutbot.com)"


66.249.64.66 - - [26/Oct/2005:14:51:16 +0200] "GET /forum-29.html HTTP/1.0" 200 24772 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"


Code:
 

------------
12.207.8.100 - - [24/Oct/2005:00:25:59 +0200] "GET /forums-faq.html?sid=cfbe5819c57d46870048770cca259052 HTTP/1.1" 200 53029 "-" "MJ12bot/v1.0.4 (http://majestic12.co.uk/bot.php?+)"


195.27.215.91 - - [24/Oct/2005:03:12:07 +0200] "GET /forums-faq.html HTTP/1.0" 200 52980 "-" "Seekbot/1.0 (http://www.seekbot.net/bot.html) HTTPFetcher/0.3"


66.249.71.10 - - [24/Oct/2005:09:24:53 +0200] "GET /forums-faq.html HTTP/1.0" 200 52980 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
Code:
 


212.239.212.249 - - [24/Oct/2005:10:35:42 +0200] "GET /forum-19.html?sid=07298bf4713e2e3e08cf383f57b19e43 HTTP/1.1" 200 32890 "-" "MJ12bot/v1.0.4 (http://majestic12.co.uk/bot.php?+)"

68.142.251.26 - - [24/Oct/2005:22:51:20 +0200] "GET /forum-19.html HTTP/1.0" 200 6239 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
 
majestic-12







Posted: Thu Oct 27, 2005 5:17 pm

Susann -- these are likely to be old URLs. The changes you make on your site do not have an immediate effect on data that was crawled earlier. Note that the URLs you referenced exhibit the same error we discussed above -- no query string indicator (?) -- which means they were parsed from older pages, from before you fixed this problem.
 
Susann







Posted: Thu Oct 27, 2005 5:24 pm

I thought so. But what should I do? Would it be a good idea to ban all your IPs for the next 1-3 months, or for how long?
 
majestic-12







Posted: Thu Oct 27, 2005 7:27 pm

Susann wrote:
Would it be a good idea to ban all your IPs for the next 1-3 months, or for how long?


No, it would not be a good idea, because we use a distributed model and the number of IPs is very high, with new ones constantly being added -- this simply won't work. If you want the bot to stop crawling, you can use robots.txt, or I can add your domain to the list of domains that should not be crawled.
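
The log excerpts earlier in the thread already show this spread across addresses. As a rough illustration, the sketch below collects the client IPs that used a given user-agent token; the regular expression assumes the user agent is the last quoted field of each line, and `UA_RE`, `ips_by_bot`, and `SAMPLE` are names invented for this example.

```python
import re

# Captures the client IP and the last quoted field (the user agent).
UA_RE = re.compile(r'^(\S+) .*"([^"]*)"$')


def ips_by_bot(lines, token="MJ12bot"):
    """Return the set of client IPs whose user agent contains the token."""
    ips = set()
    for line in lines:
        match = UA_RE.match(line.strip())
        if match and token in match.group(2):
            ips.add(match.group(1))
    return ips


# Lines copied from the log excerpts posted above.
SAMPLE = [
    '62.20.222.103 - - [26/Oct/2005:02:07:12 +0200] "GET /forum-29.html?sid=91d94aba163e7238014ed962c45d7b8d HTTP/1.1" 200 24821 "-" "MJ12bot/v1.0.4 (http://majestic12.co.uk/bot.php?+)"',
    '12.207.8.100 - - [24/Oct/2005:00:25:59 +0200] "GET /forums-faq.html?sid=cfbe5819c57d46870048770cca259052 HTTP/1.1" 200 53029 "-" "MJ12bot/v1.0.4 (http://majestic12.co.uk/bot.php?+)"',
    '212.239.212.249 - - [24/Oct/2005:10:35:42 +0200] "GET /forum-19.html?sid=07298bf4713e2e3e08cf383f57b19e43 HTTP/1.1" 200 32890 "-" "MJ12bot/v1.0.4 (http://majestic12.co.uk/bot.php?+)"',
    '66.249.71.10 - - [24/Oct/2005:09:24:53 +0200] "GET /forums-faq.html HTTP/1.0" 200 52980 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
]

if __name__ == "__main__":
    print(sorted(ips_by_bot(SAMPLE)))
```

Three requests from the same bot arrive from three different addresses here, so a rule keyed on the user agent (such as robots.txt) covers them all where an IP ban list would not.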
 
Susann







Posted: Fri Oct 28, 2005 8:17 am

Thanks, majestic-12, for your replies.
They gave me the information I needed.
 
All times are GMT - 6 Hours
 