Author |
Message |

Joined: Dec 19, 2004
Posts: 3191
Location: Germany
Moderator German NukeSentinel Support
Wed Sep 21, 2005 4:55 pm |
We all know that some search engines have a monopoly, so I think this is a great and very interesting project:
But I wasn't very pleased when I checked my log files. The crawler hits the same file several times. Weeks later, I think the bot cannot handle the session IDs in the forums.
Code:
- - [20/Sep/2005:17:47:15 +0200] "GET /forum-11.html?sid=1b8043b9d6b72afe85a40aa8e8aedcce HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 (/bot.php?+)"
- - [20/Sep/2005:17:47:20 +0200] "GET /forum-11.html?sid=1b88f66ad6a2a53de5cc20cf96cdf730 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 ("
- - [20/Sep/2005:17:47:25 +0200] "GET /forum-11.html?sid=1b8a8ec7e7ee5117b23746551a59f242 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 ("
- - [20/Sep/2005:17:47:30 +0200] "GET /forum-11.html?sid=1b90c226566b115aba82160c06199c48 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 ("
- - [20/Sep/2005:17:47:35 +0200] "GET /forum-11.html?sid=1bb945a7ab5218f27bda9952a13befcf HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 ("
- - [20/Sep/2005:17:47:40 +0200] "GET /forum-11.html?sid=1bc73a5c5e05535bf3e6f672ed040cbb HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 ("
- - [20/Sep/2005:17:47:45 +0200] "GET /forum-11.html?sid=1bd1a7eab553b5fd522231ef679ad112 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 ("
- - [20/Sep/2005:17:47:50 +0200] "GET /forum-11.html?sid=1be250dffaa8f3fe420cc513705d1553 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 ("
- - [20/Sep/2005:17:47:55 +0200] "GET /forum-11.html?sid=1c084facd4dda1e9b6139147f85f7985 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 ("
- - [20/Sep/2005:17:48:04 +0200] "GET /forum-11.html?sid=1c44737502f47007f9ed91d176da7f27 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 ("
- - [20/Sep/2005:17:48:08 +0200] "GET /forum-11.html?sid=1c4ae8a9b8cd3e2eb3fb1a47f418eef9 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 ("
- - [20/Sep/2005:17:48:13 +0200] "GET /forum-11.html?sid=1c55d390f897683c30b7363d916713eb HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 ("
- - [20/Sep/2005:17:48:17 +0200] "GET /forum-11.html?sid=1c61bf7c1bafb8229f24b61f8ef4222f HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 ("
- - [20/Sep/2005:17:48:24 +0200] "GET /forum-11.html?sid=1ca8f8c274562ce712c37fbffcb0eccf HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 ("
- - [20/Sep/2005:17:48:28 +0200] "GET /forum-11.html?sid=1cb9dde375e1be3033c0646b7d850617 HTTP/1.1" 200 30264 "-" "MJ12bot/v1.0.1 ("
Subject Matter Expert

Joined: May 15, 2004
Posts: 727
Thu Sep 22, 2005 1:14 am |
Almost no bots handle sessions, which is why it sometimes seems that you have 500 visitors.
Secondly, sids are bad because a search engine relies on a list of URLs, and once URLs with a sid are listed in the search engine, your pages will be hit more and more heavily.
Get rid of the sid and everything should become normal. |

Thu Sep 22, 2005 4:56 am |
I know about our sid problem. None of the experts I asked was able to fix it.
But the bot above is an extreme example. I haven't seen other bots in our logs with such behavior. |

Thu Sep 22, 2005 5:47 am |
It's d*** easy to fix. phpBB has a function that uses 'sid='; remove that part and it works. |

Thu Sep 22, 2005 6:15 am |
If you think it's so easy, you can take a look. I will give you the address of our session.php. |

Thu Sep 22, 2005 7:59 am |
Look inside the function append_sid().
At the end it uses $SID, so now you only have to figure out where $SID gets assigned a value.
At the end of the function session_begin() we see the line:
Code:$SID = 'sid=' . $session_id;
change it into
and watch the miracle happen
An expert is no expert if he can't debug and backtrace what happens.
I just opened a copy of php-nuke, did a file search on 'sid=' and found the files within a second, and I'm no expert either, I'm just a code guru. |
Sells PC To Pay For Divorce

Posts: 5661
Thu Sep 22, 2005 8:26 am |
New Member

Joined: Sep 22, 2005
Posts: 10
Thu Sep 22, 2005 5:55 pm |
Hi there,
I am the creator of the bot. I found this forum just like you found my bot: from the log file.
I am surprised a session ID was present in the URL, because a few months ago I implemented SID= filtering that removes these SIDs from URLs before deduplication. It's possible that the URLs that were crawled were loaded before that change. I could swear this code works, but I am going to double-check it. This does not guarantee that you won't have a few of those URLs crawled, but hopefully a new batch of URLs won't have these SIDs.
Anyhow, it's a good idea to get rid of those SIDs, because even big search engines like Google get confused.
alexc |
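A minimal sketch of the kind of SID filtering before deduplication described in the post above. This is not the bot's actual code: the function names `strip_sid` and `deduplicate` and the `example.com` host are assumptions made for illustration, and the parameter name `sid` is taken from the log lines earlier in the thread.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_sid(url: str, param: str = "sid") -> str:
    """Return the URL with the session-ID query parameter removed (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep every query parameter except the session ID.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() != param
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def deduplicate(urls):
    """Collapse URLs that differ only in their session ID, keeping first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        canonical = strip_sid(url)
        if canonical not in seen:
            seen.add(canonical)
            unique.append(canonical)
    return unique


# example.com is a placeholder host; the paths and sids mirror the log above.
crawled = [
    "http://example.com/forum-11.html?sid=1b8043b9d6b72afe85a40aa8e8aedcce",
    "http://example.com/forum-11.html?sid=1b88f66ad6a2a53de5cc20cf96cdf730",
    "http://example.com/forum-11.html",
]
print(deduplicate(crawled))
```

With a filter like this, the three entries collapse into a single URL to fetch, which is why a working filter should stop the repeated hits on the same page.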

Thu Sep 22, 2005 6:17 pm |
Great project.
I also know the similar German project YACY.
But if you really implemented SID filtering, why doesn't it work? Your bot caused 180 MB of traffic on our website up to 20 September. I know I waited a long time, because I thought something would change. I added the bot to my robots.txt today. I hope it doesn't ignore this. |

Thu Sep 22, 2005 6:20 pm |
The bot does NOT ignore robots.txt, and it supports the Crawl-Delay parameter for a bigger than normal (1 sec) delay between requests.
I do have SID filtering implemented; however, I am going to recheck it and find out where the bug is. It could be a bug in the filtering code, or it could be that the URL was found before SID filtering was implemented.
I know about YACY, but I think the model they are trying to achieve, a P2P search engine, is not possible with the current level of consumer-grade hardware. I do not believe it will scale to WWW levels of billions of web pages. |
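For reference, a robots.txt along the following lines would either slow the bot down or shut it out. This is only a sketch: the 10-second delay is an assumed value, and the check uses Python's standard `urllib.robotparser` (its `crawl_delay` method needs Python 3.6 or later), not the bot's own parser, so it shows how a compliant crawler could read the file rather than how MJ12bot actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks MJ12bot to wait 10 seconds between requests.
SLOW_DOWN = """\
User-agent: MJ12bot
Crawl-delay: 10
"""

# Hypothetical robots.txt that keeps MJ12bot off the whole site.
BLOCK = """\
User-agent: MJ12bot
Disallow: /
"""


def load(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


AGENT = "MJ12bot/v1.0.1"
slow = load(SLOW_DOWN)
blocked = load(BLOCK)

print(slow.crawl_delay(AGENT))                      # requested delay in seconds
print(slow.can_fetch(AGENT, "/forum-11.html"))      # still allowed, just slower
print(blocked.can_fetch(AGENT, "/forum-11.html"))   # not allowed at all
```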

Thu Sep 22, 2005 6:32 pm |
A few days ago I read about your discussion in YACY's forum. It was very interesting.
Anyway, I wish you the best for your project, and I hope I find a way to get rid of the sids.

Thu Sep 22, 2005 6:36 pm |
Thanks, and best wishes for whatever you do in cyber life too!
I did have a friendly discussion with the Yacy people, but they did not agree with me, which is fine. Perhaps I am wrong and P2P is possible, but I choose to focus on something that can definitely happen.
I will review SID removal before loading the next batch of data. |

Mon Sep 26, 2005 8:49 am |
I am back!
As promised, I tested my code to see if there was a bug. My code was indeed NOT removing your session ID, but there is a good reason for it: your URL is actually not correct, because you use a query parameter without first using the query identifier ?.
I.e., you have the URL:
Note the &: you added the parameter right after the filename (which I am sure is used as an internal rewrite, but URL parsing does not know that).
A correct way to do it would have been this:
Use ? to start the query string, and then have your SID. This is the proper way to have URLs.
I just tested my code with it, and if you had the proper query delimiter, the session IDs would have been removed.
Now, what's the way forward here? Technically speaking it's your "fault", as you use non-standards-compliant URLs. However, I will write special parsing for the current load only to correct your URLs. But you really need to fix them on your side, as you would confuse other search engines. Google might just check the URL and not crawl it in the first place, so you are losing out in terms of traffic anyway. |
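To illustrate the point about the missing ?, here is how a standard URL parser splits the two forms. The URLs are made up for the example (`example.com` is a placeholder and the sid is copied from the earlier log), and Python's `urllib.parse` stands in for whatever parser a crawler really uses.

```python
from urllib.parse import urlsplit

# Hypothetical URLs: same page, once with "&" and once with "?" before the sid.
malformed = "http://example.com/forum-11.html&sid=1b8043b9d6b72afe85a40aa8e8aedcce"
well_formed = "http://example.com/forum-11.html?sid=1b8043b9d6b72afe85a40aa8e8aedcce"

for url in (malformed, well_formed):
    parts = urlsplit(url)
    # Show which part of the URL the parser treats as path and which as query.
    print("path: ", repr(parts.path))
    print("query:", repr(parts.query))
```

In the malformed case the whole `&sid=...` tail ends up in the path and the query is empty, so a filter that looks for `sid` in the query string never sees it. In the well-formed case the path is just `/forum-11.html` and the sid sits in the query, where it can be removed.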

Mon Sep 26, 2005 9:04 am |
That would be bad news for Susann, because her number 1 priority is a good rank.
Well Susann, you have some work to do. |

Mon Sep 26, 2005 1:29 pm |
Thanks majestic-12 for your informative reply.
Well, I'm really not an expert, but I thought our rewritten URLs were correct. I'm using GT Next Gen. I had enough stress with this forum in the past, and I'm sure it's not a typical nuke forum. But it's working and our members like it. I don't touch the forum's files anymore, but currently I'm looking for someone who can fix some issues.
Thanks again.
Btw: I know Bull from my other forum. Was he a beta tester?
@ Hitwalker
Don't worry.
Our ranking is good enough. We get between 5 and 15 new members daily. You know which game we support. Check us. |
Last edited by Susann on Mon Sep 26, 2005 1:33 pm; edited 1 time in total |

Mon Sep 26, 2005 1:32 pm |
Your rewrite is fine; it's just the SID bit that's the problem. If I were you I'd disable it completely, because even though my bot understands it (provided the URL is properly formatted), others won't. |

Mon Sep 26, 2005 1:44 pm |
Quote: | just the SID disable it completely | That isn't so easy.
I know how other webmasters have made their phpBB forums search-engine friendly. They don't disable the sids completely. |

Mon Sep 26, 2005 2:40 pm |
Susann wrote: | Quote: | just the SID disable it completely | That isn't so easy. |
You are probably right about this, but you definitely need to fix your URLs by changing & to ?, because without it a URL parsing routine will think it's a fancy filename rather than a query parameter. |

Mon Sep 26, 2005 3:46 pm |
Quote: | but you definitely need to fix your URLs by changing & to ? |
Yes, I understand, but how do I do this exactly?
Really, I ask myself daily which problem and fault I should fix first. |
Last edited by Susann on Mon Sep 26, 2005 4:02 pm; edited 1 time in total |

Mon Sep 26, 2005 3:53 pm |
Susann wrote: | Yes, I understand, but how do I do this exactly? |
Well, you will need to edit the code that appends those SIDs. From what I can see, you must be using some internal rewriting to get nice ".html" URLs, and that code incorrectly adds the SID without first starting the query with the ? character.
I changed my URL loader to fix this for you, but I doubt other search engines will do the same. |
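The loader change mentioned above is not shown in the thread. One way such a repair could look, purely as a sketch: if a URL contains no ? at all, treat the first & as the start of the query string. The function name `repair_query_delimiter` and the `example.com` host are hypothetical.

```python
def repair_query_delimiter(url: str) -> str:
    """Turn the first '&' into '?' when the URL has no '?' at all (hypothetical helper)."""
    if "?" in url or "&" not in url:
        # Already well formed, or there is nothing to repair.
        return url
    head, tail = url.split("&", 1)
    return f"{head}?{tail}"


# example.com is a placeholder host; the sid is copied from the log earlier in the thread.
print(repair_query_delimiter(
    "http://example.com/forum-11.html&sid=1b8043b9d6b72afe85a40aa8e8aedcce"
))
```

The same rule applies on the site side: whatever code appends the SID to a rewritten ".html" URL should emit ? when the URL has no query string yet and & only when it already has one.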

Thu Oct 27, 2005 5:04 pm |
I did my homework. There aren't any sessions for bots.
But I'm wondering why your bot still gets a sid.
I swear I couldn't find any session IDs when I checked my log files, or when I used some helpful tools to check my site. Really strange. I'm also wondering why we are the only ones on earth with this Majestic bot problem, because a lot of php-nuke sites have the same sid problem. Any other ideas?
Code:
- - [26/Oct/2005:02:07:12 +0200] "GET /forum-29.html?sid=91d94aba163e7238014ed962c45d7b8d HTTP/1.1" 200 24821 "-" "MJ12bot/v1.0.4 ("
- - [26/Oct/2005:02:07:14 +0200] "GET /forum-29.html?sid=920f3fde9d8138a242ef4b537fae9a4a HTTP/1.1" 200 24821 "-" "MJ12bot/v1.0.4 ("
- - [26/Oct/2005:02:07:16 +0200] "GET /forum-29.html?sid=97668c6ea394a2ebf3674c2c82bc34f6 HTTP/1.1" 200 24821 "-" "MJ12bot/v1.0.4 ("
- - [26/Oct/2005:02:07:24 +0200] "GET /forum-29.html?sid=97cc19a6cb5bffedac687f2c5525fbe0 HTTP/1.1" 200 24821 "-" "MJ12bot/v1.0.4 ("
- - [26/Oct/2005:02:07:25 +0200] "GET /forum-29.html?sid=9956ef31690f42e7e14078ad52ea3cb7 HTTP/1.1" 200 24821 "-" "MJ12bot/v1.0.4 ("
- - [26/Oct/2005:02:07:27 +0200] "GET
------------cut--------------------
- - [26/Oct/2005:12:57:14 +0200] "GET /forum-29.html HTTP/1.1" 200 11200 "-" "Mozilla/4.0 compatible ZyBorg/1.0 Dead Link Checker (;"
- - [26/Oct/2005:14:51:16 +0200] "GET /forum-29.html HTTP/1.0" 200 24772 "-" "Googlebot/2.1 (+"
------------
- - [24/Oct/2005:00:25:59 +0200] "GET /forums-faq.html?sid=cfbe5819c57d46870048770cca259052 HTTP/1.1" 200 53029 "-" "MJ12bot/v1.0.4 ("
- - [24/Oct/2005:03:12:07 +0200] "GET /forums-faq.html HTTP/1.0" 200 52980 "-" "Seekbot/1.0 ( HTTPFetcher/0.3"
- - [24/Oct/2005:09:24:53 +0200] "GET /forums-faq.html HTTP/1.0" 200 52980 "-" "Googlebot/2.1 (+"
Code:
- - [24/Oct/2005:10:35:42 +0200] "GET /forum-19.html?sid=07298bf4713e2e3e08cf383f57b19e43 HTTP/1.1" 200 32890 "-" "MJ12bot/v1.0.4 ("
- - [24/Oct/2005:22:51:20 +0200] "GET /forum-19.html HTTP/1.0" 200 6239 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp;"

Thu Oct 27, 2005 5:17 pm |
Susann, these are likely to be old URLs. The changes you make on the site do not have an immediate effect on old crawled data. Note that the URLs you referenced exhibit the same error we discussed above (no query string indicator ?), which means they were parsed from older pages, before you had this problem fixed. |

Thu Oct 27, 2005 5:24 pm |
I supposed so. But what should I do? Would it be a good idea to ban all your IPs for the next 1-3 months, or for how long? |

Thu Oct 27, 2005 7:27 pm |
Susann wrote: | Would it be a good idea to ban all your IPs for the next 1-3 months, or for how long? |
No, it would not be a good idea, because we use a distributed model and the number of IPs is very high, with new ones constantly being added. This simply won't work. If you want the bot to stop crawling, you can use robots.txt, or I can add your domain to the list of domains that should not be crawled. |

Fri Oct 28, 2005 8:17 am |
Thanks majestic-12 for your replies.
They gave me the information I needed. |