Get your content off Blogger.com
Posted by Daniel Moth
Published on Fri, 09 Apr 2010 13:58:21 GMT
Because blogger.com is dropping support for FTP users, I've decided to move my blog.
When I think of the content of a blog, four items come to mind: blog posts, comments, binary files that the blog posts link to (e.g. images, ZIP files), and the CSS+structure of the blog.
1. Binaries
The binary files you used in your blog posts are sitting on your own web space, so blogger.com is not involved with them at all. There is nothing for you to do at this stage; I'll come back to these in another post.
2. CSS and structure
In the best case this exists as a separate CSS file on your web space (so no action for now). In the worst case, like mine, your CSS is embedded in the HTML. In the latter case, simply navigate from your dashboard to "Template", then "Edit HTML", and copy the contents of the box. Paste them into a local txt file and we'll come back to that in another post.
3. Blog posts and Comments
The blog posts and comments exist in all the HTML files on your own web space. Parsing HTML files to extract them can be painful, so it is easier to download the XML files from Blogger's servers that contain all your blog posts and comments.
3.1 Single XML file, but incomplete
The obvious thing to do is go into your dashboard "Settings" and, under the "Basic" tab, look at the top next to "Blog Tools". There is a link there to "Export blog" which downloads a single XML file with both comments and posts. The problem is that this file contains only 200 comments; if you have more than that, you will lose the surplus. This XML file also has a lot of noise compared to the better solution described next. (Note that a tool I will refer to in a future post deals with either kind of XML file.)
3.2 Multiple XML files
First you need to find your blog ID. In case you don't know what that is, navigate to "Template" as described in section 2 above. You will find references to the blog ID in the HTML there, but you can also see it as part of the URL in your browser: blogger.com/template-edit.g?blogID=YOUR_NUMERIC_ID. Mine is 7 digits.
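If you would rather not pick the number out of the address bar by eye, here is a minimal Python sketch that extracts it from a copied URL. The function name blog_id_from_url and the sample ID 1234567 are made up for illustration; only the blogID query parameter comes from the URL shown above.

```python
from urllib.parse import urlparse, parse_qs


def blog_id_from_url(url: str) -> str:
    """Return the numeric blogID query parameter from a Blogger dashboard URL."""
    # The post writes the URL without a scheme; add one so that urlparse
    # treats blogger.com as the host rather than as part of the path.
    if "://" not in url:
        url = "https://" + url
    values = parse_qs(urlparse(url).query).get("blogID", [])
    if not values or not values[0].isdigit():
        raise ValueError("no numeric blogID found in " + url)
    return values[0]


# 1234567 is a placeholder, not a real blog ID.
print(blog_id_from_url("blogger.com/template-edit.g?blogID=1234567"))
```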
You can now navigate to these URLs to download the XML for your posts and comments respectively:
blogger.com/feeds/YOUR_NUMERIC_ID/posts/default?max-results=500&start-index=1
blogger.com/feeds/YOUR_NUMERIC_ID/comments/default?max-results=200&start-index=1
Note that you can only get 500 posts at a time and only 200 comments at a time. To get more than that, you have to change the start-index in the URL and download the next batch. For example, to get the XML for the next 500 posts and the next 200 comments respectively, you'd use these URLs:
blogger.com/feeds/YOUR_NUMERIC_ID/posts/default?max-results=500&start-index=501
blogger.com/feeds/YOUR_NUMERIC_ID/comments/default?max-results=200&start-index=201
...and so on and so forth. Keep all the XML files in the same folder on your local machine (with nothing else in there).
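The batching above is mechanical, so it can be scripted. The following is a rough Python sketch, not something the post itself used: the https:// scheme, the function names and the posts-1.xml style file names are my assumptions, and you supply the number of batches yourself based on how many posts and comments you have.

```python
import urllib.request
from pathlib import Path

# Per-request limits quoted in the post.
PAGE_SIZES = {"posts": 500, "comments": 200}


def feed_url(blog_id: str, kind: str, page: int) -> str:
    """Build the URL for batch number `page` (0-based) of posts or comments."""
    size = PAGE_SIZES[kind]
    start = page * size + 1  # start-index is 1-based: 1, 501, 1001, ...
    # The https:// prefix is an assumption; the post lists the URLs without a scheme.
    return (f"https://blogger.com/feeds/{blog_id}/{kind}/default"
            f"?max-results={size}&start-index={start}")


def download_all(blog_id: str, kind: str, folder: str, pages: int) -> None:
    """Save `pages` consecutive batches as <kind>-1.xml, <kind>-2.xml, ... in folder."""
    out = Path(folder)
    out.mkdir(parents=True, exist_ok=True)
    for page in range(pages):
        target = out / f"{kind}-{page + 1}.xml"
        urllib.request.urlretrieve(feed_url(blog_id, kind, page), str(target))


# Print the first two batch URLs of each kind without touching the network.
for kind in ("posts", "comments"):
    for page in range(2):
        print(feed_url("YOUR_NUMERIC_ID", kind, page))
```

Calling download_all("1234567", "comments", "blog-xml", 3), for instance, would attempt to fetch the first 600 comments into a folder named blog-xml; both the ID and the folder name are placeholders.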
4. Validating the XML aka editing older blog posts
The XML files you just downloaded really contain HTML fragments inside for all your blog posts. If you are like me, your blog posts did not conform to XHTML so passing them to an XML parser (which is what we will want to do) will result in the XML parser choking. So the next step is to fix that. This can be no work at all for you, or a huge time sink or just a couple hours of pain (which was my case).
The process I followed was to attempt to load the XML files using XmlDocument.Load and wait for the exception to be thrown from my code. The exception would point to the exact offending line and column which would help me fix the issue. Rather than fix it in the XML itself, I would go back and edit the offending blog post and fix it there - recommended! Then I'd repeat the cycle until the XML could be loaded in the XmlDocument.
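I did this with .NET's XmlDocument.Load, but the same load-and-see-what-breaks loop works with any strict XML parser. As a sketch in Python (a swapped-in equivalent, not the code I ran), the standard library's ElementTree reports the line and column of the first error in each file. The function names are invented for the example.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def first_error(xml_path):
    """Return None if the file parses, else (line, column, message) of the first error."""
    try:
        ET.parse(xml_path)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


def check_folder(folder):
    """Map the name of each .xml file in folder that fails to parse to its first error."""
    problems = {}
    for path in sorted(Path(folder).glob("*.xml")):
        error = first_error(path)
        if error is not None:
            problems[path.name] = error
    return problems


if __name__ == "__main__":
    # Checks the current directory; point this at the folder holding your downloads.
    for name, (line, column, message) in check_folder(".").items():
        print(f"{name}: line {line}, column {column}: {message}")
```

The parser stops at the first problem in a file, so you fix the offending post, download the batch again and rerun until nothing is reported.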
To give you an idea, some of the issues I encountered were: extra or missing quotes in img elements and href attributes, direct usage of chevrons instead of encoding them as &lt;, missing closing tags, mismatched nested pairs of elements, and capitalization of HTML elements. For a full list of things that may go wrong, see this.
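To make those categories concrete, here is a small Python sketch that feeds one broken and one repaired fragment per category to a strict XML parser. The fragments are invented for illustration and are not taken from my posts.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# (label, fragment a browser tolerates but XML rejects, repaired fragment)
EXAMPLES = [
    ("missing quotes", '<img src=pic.png />', '<img src="pic.png" />'),
    ("raw chevron", '<p>a < b</p>', '<p>a &lt; b</p>'),
    ("missing closing tag", '<div><p>text</div>', '<div><p>text</p></div>'),
    ("mismatched nesting", '<b><i>text</b></i>', '<b><i>text</i></b>'),
    ("tag capitalization", '<B>text</b>', '<b>text</b>'),
]

for label, broken, fixed in EXAMPLES:
    print(f"{label}: broken parses={is_well_formed(broken)}, "
          f"fixed parses={is_well_formed(fixed)}")
```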
5. Opportunity for other changes
I also found a few posts that did not have a category assigned, so I fixed those too. I took the further opportunity to create new categories and tag some of my blog posts with them. Note that I did not remove or change the categories of existing posts; I only added new ones.
In another post we'll see how to use the XML files you stored in the local folder…
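Once the XML loads cleanly, finding uncategorized posts can also be scripted. The Python sketch below assumes the downloaded files are Atom feeds in which each entry carries its categories as category elements; check that against your own files before relying on it. The function name and the sample feed are made up.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace, assumed for these feeds


def uncategorized_titles(feed_xml):
    """Return the titles of entries that have no category element at all."""
    root = ET.fromstring(feed_xml)
    titles = []
    for entry in root.iter(ATOM + "entry"):
        if entry.find(ATOM + "category") is None:
            titles.append(entry.findtext(ATOM + "title", default="(untitled)"))
    return titles


# Tiny hand-written feed, for illustration only.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>First post</title><category term="blogging"/></entry>
  <entry><title>Second post</title></entry>
</feed>"""

print(uncategorized_titles(SAMPLE))
```

For a real file, pass its contents as bytes, e.g. uncategorized_titles(Path("posts-1.xml").read_bytes()) after importing Path from pathlib, where posts-1.xml is a placeholder file name.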
© Daniel Moth or respective owner