hitwalker
Sells PC To Pay For Divorce
Posts: 5661
Posted: Thu Aug 11, 2005 4:17 pm
Ok, here's the story.
With CoffeeCup SiteMapper I created the Google sitemap.
It works perfectly when you have an HTML site or when you use URL rewriting (like with GoogleTap).
So in this case it would work great with Raven's forums and site (enjoy the program, Raven).
The program can create the Google map locally or it can index your site.
But I have URLs like:
Code:http://www.hitwalker.nl/phpx/html/modules.php?name=cpaneluserguide-addingAnonymousFTPmessage.html
But when I let the program index my site locally, it created links like:
modules/CPanel_User_Guide/addingAnonymousFTPmessage.htm
And that's not what I wanted, because it would address the page directly, and I did not prefer that.
So I did a mass replace of URLs in the index.html and XML file once it was created.
You can see it here:
http://www.hitwalker.nl/phpx/html/modules.php?name=Phpnukedatabase_sitemap
The XML version is here: http://www.hitwalker.nl/sitemap.xml
As you can see: error...
Using the validator http://www.feedvalidator.org
Or directly: http://www.feedvalidator.org/check.cgi?url=http%3A%2F%2Fwww.hitwalker.nl%2Fsitemap.xml
What I don't understand is that the error points to the "=", as shown in the sitemap.xml.
Can anyone give a hand to solve this?
Guardian2003
Site Admin
Joined: Aug 28, 2003
Posts: 6799
Location: Ha Noi, Viet Nam
Posted: Thu Aug 11, 2005 9:57 pm
hitwalker, please PM me a link to a text file of your sitemap and I'll take a look for you.
The link you provided indicates an error on line 56. Also, I see the line:
Code:<loc>http://www.hitwalker.nl/phpx/html/modules.php?name=CPanel_User_Guide&page=addingAnonymousFTPmessage.htm&l...
All references to the entity '&' need to be changed to '&amp;' (without the single quotes) for Google to accept the sitemap and be compliant in that respect.
CoffeeCup should have automatically changed those references if it was going to produce compliant XML, which makes me wonder if it is simply 'indexing' files with no real thought to converting entities to be XML compliant.
I couldn't check that much of your XML file, as the error halted the validator and prevented me from getting the full source as text, but in what I did check I found 24 errors.
You might also want to consider trying the SoftPlus GSiteCrawler software (free), as discussed in another forum thread; I have found it the best so far. If you use GSiteCrawler, though, make sure you have your IP protected in Sentinel if you have DDoS prevention turned on.
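The ampersand fix above can be sketched in a few lines of Python (not part of the thread; the URL is the one from the post, and the function name is just for illustration). `xml.sax.saxutils.escape` turns raw `&`, `<`, and `>` into the entities a sitemap validator expects:

```python
# Minimal sketch: escape raw '&' characters in sitemap <loc> URLs so the
# XML validates. escape() converts & to &amp;, < to &lt; and > to &gt;.
from xml.sax.saxutils import escape

def escape_loc(url: str) -> str:
    """Make a URL legal inside an XML element such as <loc>."""
    return escape(url)

raw = ("http://www.hitwalker.nl/phpx/html/modules.php"
       "?name=CPanel_User_Guide&page=addingAnonymousFTPmessage.htm")
print(escape_loc(raw))
# the '&page=' part comes out as '&amp;page='
```

This is the same substitution a mass search-and-replace performs, but applied per URL so already-escaped entities are not at risk of being double-handled by repeated manual edits.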
hitwalker
Posted: Fri Aug 12, 2005 3:50 am
Guardian2003
Posted: Fri Aug 12, 2005 6:30 am
hitwalker
Posted: Fri Aug 12, 2005 6:33 am

PM? Huh?
I didn't get anything.
Guardian2003
Posted: Fri Aug 12, 2005 6:43 am
hitwalker
Posted: Fri Aug 12, 2005 6:55 am

Got it, thanks; I almost didn't.
Seems to be a problem with the internet reaching parts of the UK.
But it works indeed...
How simple things can mess up an XML file!
softplus
New Member
Joined: Jul 31, 2005
Posts: 8
Location: Switzerland
Posted: Wed Aug 24, 2005 1:42 am
Hi hitwalker,
Did the file come out of CoffeeCup like that? Ouch. I see you replaced the & with &amp;, BUT you still have another (typically American) problem: your last-mod dates are incorrect.
<lastmod>2005-08-09T14:32:38+01,00</lastmod>
is not a correctly coded date/time in ISO 8601.
It seems that CoffeeCup only works when you set your regional settings to US. This is something that very many US-based companies mess up: "let's think about the rest of the world later".
The tag should be:
<lastmod>2005-08-09T14:32:38+01:00</lastmod>
Note that the "," in the timezone offset from UTC should be a ":". To be on the safe side, my GSiteCrawler gives all times in UTC (easier to check). Perhaps you can adjust CoffeeCup to only store dates? Usually you don't really need the exact time of a URL...
Good luck with search and replace... every time you make a sitemap.
If you want, I can send you a GSiteCrawler-generated sitemap file to compare. Or just download it and try it for yourself.
Cheers,
John
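The UTC approach John describes can be sketched as follows (a Python illustration, not anything the thread's tools actually run). Normalizing to UTC avoids emitting a locale-dependent offset, which is exactly where the "," slipped in:

```python
# Sketch: produce a <lastmod> value in valid ISO 8601 format, normalized
# to UTC so no timezone-offset separator can be mangled by locale settings.
from datetime import datetime, timezone

def lastmod_utc(dt: datetime) -> str:
    """Format an aware datetime as ISO 8601 UTC, e.g. 2005-08-09T13:32:38Z."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

stamp = datetime(2005, 8, 9, 14, 32, 38, tzinfo=timezone.utc)
print(f"<lastmod>{lastmod_utc(stamp)}</lastmod>")
# prints <lastmod>2005-08-09T14:32:38Z</lastmod>
```

Using `strftime` with a fixed pattern sidesteps any regional-settings dependency; the trailing `Z` is the ISO 8601 designator for UTC.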
hitwalker
Posted: Wed Aug 24, 2005 3:14 am
Hi, thanks for the reply.
Well, the huge search and replace was my own decision, because CoffeeCup works OK but generated the links in a way I don't like, especially the ones for the PHP-Nuke HOWTO module.
For example, a URL like:
modules.php?name=PHP-Nuke_HOWTO&page=xoops-vs-post-nuke.html
would come out like:
modules/PHP-Nuke_HOWTO/xoops-vs-post-nuke.html
And that's creating links directly to the HTML file.
But I will try yours and let you know...
softplus
Posted: Wed Aug 24, 2005 3:55 am
One thing you need to be aware of: if a crawler finds a link like that (modules/PHP-Nuke_HOWTO/xoops-vs-post-nuke.html), then it's linked somewhere in your website like that. So if you submit a sitemap with nukemanual-xoops-vs-post-nuke.html (which is usually better; that's why you use URL rewriting), then a search engine might find both (the first by crawling, the second from your sitemap).
The problem with that is that Google (I'm not sure about the others) will then have two URLs to the same content: "duplicate content". That CAN and usually WILL cause problems, especially when you submit sitemap files. Google will notice it and either decide on one or the other (and throw the other one away), or it will discard both (not good!), and maybe it will even mark your site as "bad" (though I doubt that would happen over a few duplicate-content URLs).
You should really check your site to make sure all internal links point to the rewritten URL. Otherwise you WILL run into this problem; I've seen many sites that have had to fight with something like that.
I would check the site with a crawler (CoffeeCup should do this OK, or whatever you prefer) and fix all the pages that link to the "incorrect" URL. Make sure that when you crawl the site, it can only find the correct URLs! An alternative, if you can't get them all fixed: add rel="nofollow" to the link. However, if you do that, the search engines won't follow your links, and PR (from Google) will then not be passed on to the subpages (not a good idea, but a temporary solution until you get it fixed properly).
Good luck,
John
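The internal-link audit John recommends can be sketched with a simple scan (my own illustration; the regex pattern and sample HTML are assumptions based on the `modules/...` link style quoted in this thread, and a real crawl would fetch pages rather than take strings):

```python
# Hedged sketch: flag internal links that bypass the rewritten URLs by
# pointing straight at the generated HTML files (the "modules/..." style).
import re

# Matches href attributes that link directly into modules/*.htm(l)
DIRECT_LINK = re.compile(r'href="(modules/[^"]+\.html?)"')

def find_direct_links(html: str) -> list[str]:
    """Return every direct (non-rewritten) internal link found in a page."""
    return DIRECT_LINK.findall(html)

page = '<a href="modules/PHP-Nuke_HOWTO/xoops-vs-post-nuke.html">howto</a>'
print(find_direct_links(page))
# prints ['modules/PHP-Nuke_HOWTO/xoops-vs-post-nuke.html']
```

Running something like this over each saved page gives a worklist of links to rewrite, so the crawler and the sitemap end up agreeing on a single URL per page.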