Google App Engine - SiteMap Creation for a social network
- by spidee
Hi all.
I am creating a social tool, and I want search engines to pick up "public" user profiles, as they do for Twitter and Facebook.
I have read the protocol info at http://www.sitemaps.org and I understand how to build such a file, along with an index if I exceed the limit of 50K URLs per file.
Where I am struggling is the concept of how I make this run.
The sitemap for my general site pages is simple: I can use a tool or a script to create the file, host it, submit it, and I'm done.
What I then need is a script that will create the sitemaps of user profiles. I assume this would be something like:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.socialsite.com/profile/spidee</loc>
<lastmod>2010-05-12</lastmod>
<changefreq>???</changefreq>
<priority>???</priority>
</url>
<url>
<loc>http://www.socialsite.com/profile/webbsterisback</loc>
<lastmod>2010-05-12</lastmod>
<changefreq>???</changefreq>
<priority>???</priority>
</url>
</urlset>
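A minimal sketch of generating a file like the one above with Python's standard library. The function name `build_urlset` and the (loc, lastmod) tuple shape are hypothetical, chosen only for illustration. It leaves out `<changefreq>` and `<priority>`, which the sitemaps.org protocol treats as optional, and writes `<lastmod>` in the zero-padded YYYY-MM-DD form.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XML_HEADER = '<?xml version="1.0" encoding="UTF-8"?>\n'


def build_urlset(entries):
    """Build sitemap XML from (loc, lastmod) tuples (hypothetical shape).

    lastmod is a datetime.date, or None to omit the element.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as & in the text for us.
        ET.SubElement(url, "loc").text = loc
        if lastmod is not None:
            # date.isoformat() gives zero-padded YYYY-MM-DD.
            ET.SubElement(url, "lastmod").text = lastmod.isoformat()
    return XML_HEADER + ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_urlset([
    ("http://www.socialsite.com/profile/spidee", date(2010, 5, 12)),
    ("http://www.socialsite.com/profile/webbsterisback", date(2010, 5, 12)),
])
print(sitemap_xml)
```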
I've added some ??? because I don't know how I should set these values for my profiles, given the following:
When a new profile is created it must be added to a sitemap. If the profile is changed, or if "certain" properties are changed, I don't know whether I update the entry in the map or do something else. (Updating would be a nightmare!)
Some users may change their profile. In terms of relevance to the search engine, the only way a Google or Yahoo search will find a user's profile (for my requirement) is by [user name] and [location]. So once the entry for the profile has been added to the map file, the only reasons to have the search bot re-index the profile would be if the user changed their user name (which they can't), changed their location, or set their profile to be "hidden" from search engines.
I assume my map creation will need to be dynamic. From what I have said above, I imagine that creating a new profile, and possibly editing certain properties, could mark it as needing adding/updating in the sitemap.
Assuming I will have millions of profiles being added and edited, how can I manage this in a sensible manner?
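To make the "certain properties" idea concrete, here is a rough sketch of the check that could run when a profile is saved. The field names `location` and `hidden_from_search`, and the use of plain dicts for profiles, are assumptions for illustration only; the real model would differ.

```python
# Hypothetical field names: only these affect how a search engine finds a profile.
SEARCH_RELEVANT_FIELDS = ("location", "hidden_from_search")


def needs_sitemap_update(old_profile, new_profile):
    """Return True if saving new_profile should flag it for the sitemap task.

    old_profile is None for a newly created profile.
    """
    if old_profile is None:
        # New profiles go into the map unless they opted out of search.
        return not new_profile.get("hidden_from_search", False)
    # Existing profiles are flagged only when a search-relevant field changed.
    return any(
        old_profile.get(field) != new_profile.get(field)
        for field in SEARCH_RELEVANT_FIELDS
    )


before = {"location": "London", "hidden_from_search": False, "bio": "hi"}
after_bio_edit = dict(before, bio="hello")
after_move = dict(before, location="Paris")
print(needs_sitemap_update(before, after_bio_edit))
print(needs_sitemap_update(before, after_move))
```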
I know I need a script that can append URLs as each profile is created.
I know the script will probably be a TASK running at a set frequency. Perhaps the profiles have a property like "indexed", and the TASK sets it to "true" when the profiles are added to the map.
I don't see the best way to store the map. Do I store it in the datastore, i.e.:
model=sitemaps
properties
key_name=sitemap_xml_1 (and for my map sitemap_index_xml)
mapxml=blobstore (the raw XML map or ROR map)
full=boolean (set true when url count is 50K) # might need this as a shard will tell us
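The model above also mentions an index entry (sitemap_index_xml). As a rough sketch, the index could be regenerated from the list of map names whenever a new map is started. The function name `build_sitemap_index` and the URL layout `<base_url>/sitemaps/<name>` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XML_HEADER = '<?xml version="1.0" encoding="UTF-8"?>\n'


def build_sitemap_index(base_url, map_names):
    """Build a sitemap index pointing at each stored map.

    Assumes (hypothetically) that each map is served at
    <base_url>/sitemaps/<map name>.
    """
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for name in map_names:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = "%s/sitemaps/%s" % (base_url, name)
    return XML_HEADER + ET.tostring(index, encoding="unicode")


index_xml = build_sitemap_index(
    "http://www.socialsite.com", ["sitemap_xml_1", "sitemap_xml_2"]
)
print(index_xml)
```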
To make this work, my thoughts are:
memcache the current sitemap structure as "sitemap_xml"
keep a sharded counter of the URL count
when my task executes:
1. build the XML structure for, say, the first 100 URLs marked "indexed == false" (how many could you run at a time?)
2. test if the current memcache sitemap is full (shardcounter + 100 > 50K)
3.a if the map is near full, create a new map entry in the model ("sitemap_xml_2"), update the map index file (also stored in my model, as "sitemap_index"), and start a new shard or reset the existing one
3.b if the map is not full, grab it from memcache
4. append the 100-URL XML structure
5. save / memcache the map
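The steps above can be sketched roughly as follows. This is not App Engine code: `SitemapStore` is a hypothetical in-memory stand-in for the datastore/memcache pair, profiles are plain dicts, and the `public`, `indexed` and `url` keys are assumed names. It only illustrates the batching and roll-over logic.

```python
MAX_URLS_PER_MAP = 50000  # per-file URL limit from sitemaps.org
BATCH_SIZE = 100          # batch size suggested in step 1


class SitemapStore:
    """Hypothetical in-memory stand-in for the stored maps."""

    def __init__(self):
        self.maps = {"sitemap_xml_1": []}  # map name -> list of URLs
        self.current = "sitemap_xml_1"

    def append_batch(self, urls, max_urls=MAX_URLS_PER_MAP):
        # Steps 2 and 3.a: start a new map if this batch would overflow.
        if len(self.maps[self.current]) + len(urls) > max_urls:
            self.current = "sitemap_xml_%d" % (len(self.maps) + 1)
            self.maps[self.current] = []
        # Steps 3.b, 4 and 5: append to the current map and keep it.
        self.maps[self.current].extend(urls)
        return self.current


def run_task(profiles, store, batch_size=BATCH_SIZE, max_urls=MAX_URLS_PER_MAP):
    """One task run: map up to batch_size public, not-yet-indexed profiles."""
    pending = [p for p in profiles if p["public"] and not p["indexed"]]
    pending = pending[:batch_size]
    if not pending:
        return None
    name = store.append_batch([p["url"] for p in pending], max_urls)
    for profile in pending:
        profile["indexed"] = True  # so the next run skips them
    return name


store = SitemapStore()
profiles = [
    {"url": "http://www.socialsite.com/profile/user%d" % i,
     "public": True, "indexed": False}
    for i in range(5)
]
# Tiny limits so the roll-over is visible in a short run.
runs = [run_task(profiles, store, batch_size=2, max_urls=3) for _ in range(4)]
print(runs)
```

Regenerating the XML from the stored URL list on each save avoids string surgery on the closing `</urlset>` tag, at the cost of rebuilding the file each time.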
I can now add a handler using a URL map/route like /sitemaps/*
Get my * as the map name and serve the maps from the blobstore/memcache on the fly.
Now my question is: does this work? Is this the right way, or a good way to start? Will this handle making sure the search bots update when a user changes their profile, possibly by setting the change frequency correctly? Do I need a more advanced system :( or have I re-invented the wheel?
I hope this is all clear and makes some form of sense :-)
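A rough sketch of that lookup, independent of any web framework. The function name `serve_sitemap` is hypothetical, and `cache` and `store` are plain dicts standing in for memcache and the datastore/blobstore. It checks the cache first, falls back to the store, and fills the cache on a miss.

```python
def serve_sitemap(path, cache, store, prefix="/sitemaps/"):
    """Resolve a request path to (status, body) for a stored sitemap.

    cache and store are dicts here; in the real app they would be
    memcache and the datastore/blobstore (an assumption of this sketch).
    """
    if not path.startswith(prefix):
        return 404, ""
    name = path[len(prefix):]  # the * part of /sitemaps/*
    if not name:
        return 404, ""
    xml = cache.get(name)
    if xml is None:
        xml = store.get(name)
        if xml is None:
            return 404, ""
        cache[name] = xml  # populate the cache for the next request
    return 200, xml


cache = {}
store = {"sitemap_xml_1": "<urlset></urlset>"}
print(serve_sitemap("/sitemaps/sitemap_xml_1", cache, store))
print(serve_sitemap("/sitemaps/missing", cache, store))
```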