I am currently developing an importer that brings any existing, arbitrary (static) HTML website into the upcoming release of our CMS.
While downloading the files is already solved, I'm pulling my hair out trying to detect the site structure (pages and subpages) purely from the HTML files, without the user specifying additional hints.
Basically I want to get a tree like:
+ Root page 1
  + Child page 1
  + Child page 2
    + Child child page 1
  + Child page 3
+ Root page 2
  + Child page 4
+ Root page 3
+ ...
That is, I want to detect the menu structure from the links inside the pages. This doesn't have to be 100% accurate, but I want to achieve more than just a flat list.
I thought of comparing multiple pages to find similar areas, identifying these as menu areas, and parsing the links there, but in the end I'm not satisfied with this idea. A rough sketch of what I mean follows.
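Just to make the idea concrete, a minimal Python sketch might look like this; the "site" download directory and the 80% threshold are hypothetical assumptions, not tested values. Links that occur on most pages are treated as menu candidates:

    from html.parser import HTMLParser
    from collections import Counter
    from pathlib import Path

    class LinkExtractor(HTMLParser):
        """Collects the href targets of all <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = set()

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.add(href)

    def extract_links(html_text):
        parser = LinkExtractor()
        parser.feed(html_text)
        return parser.links

    # "site" is a hypothetical directory holding the downloaded HTML files.
    pages = list(Path("site").rglob("*.html"))
    counts = Counter()
    for page in pages:
        counts.update(extract_links(page.read_text(errors="ignore")))

    # Links present on at least 80% of the pages are menu candidates;
    # the threshold is an arbitrary assumption.
    menu_links = {href for href, n in counts.items() if n >= 0.8 * len(pages)}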
My question:
Can you think of an algorithm for detecting such a structure?
Update 1:
What I'm looking for is not a web spider, but an algorithm that creates a logical tree of the relationships between the pages, so I can create pages and subpages inside my CMS when importing them.
Update 2:
Following Robert's suggestion, I'll solve this by starting at the root page and simply parsing links as I go, treating every link inside a page as a child page. I'll probably recurse in a breadth-first rather than depth-first manner to get a more balanced navigation structure.
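A minimal sketch of that approach, reusing extract_links() from the sketch above; the index.html root and the "site" directory are assumptions:

    from collections import deque
    from pathlib import Path
    from posixpath import dirname, join as pjoin, normpath
    from urllib.parse import urldefrag

    def build_tree(root_page, site_dir):
        """Breadth-first walk over local HTML files; the first page that
        links to a page becomes its parent, yielding a balanced tree."""
        tree = {root_page: []}              # parent page -> list of child pages
        seen = {root_page}
        queue = deque([root_page])
        while queue:
            page = queue.popleft()
            html = (site_dir / page).read_text(errors="ignore")
            for href in extract_links(html):
                target = urldefrag(href)[0]        # drop #fragment anchors
                if not target or "://" in target:  # skip external/empty links
                    continue
                if target.startswith("/"):
                    target = target.lstrip("/")    # root-relative: resolve against site_dir
                else:
                    target = normpath(pjoin(dirname(page), target))
                if target in seen or not (site_dir / target).is_file():
                    continue
                seen.add(target)                   # first link to a page wins as parent
                tree[page].append(target)
                tree[target] = []
                queue.append(target)
        return tree

    tree = build_tree("index.html", Path("site"))

Because the traversal is breadth-first, a page reachable both from the root and from a deeply nested page ends up as a child of the shallower one, which is what keeps the resulting navigation structure balanced.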