Idea of an algorithm to detect a website's navigation structure?

Posted by Uwe Keim on Programmers
Published on 2011-11-21T19:56:57Z Indexed on 2011/11/22 2:09 UTC
Filed under:

trees

I am currently developing an importer that brings any existing, arbitrary (static) HTML website into the upcoming release of our CMS.

While downloading the files is solved, I'm pulling my hair out when it comes to detecting the site structure (pages and subpages) purely from the HTML files, without the user specifying additional hints.

Basically I want to get a tree like:

+ Root page 1
     + Child page 1
     + Child page 2
         + Child child page1
     + Child page 3
+ Root page 2
     + Child page 4
+ Root page 3
+ ...

I.e. I want to be able to detect the menu structure from the links inside the pages. This does not have to be 100% accurate, but I want to achieve at least more than just a flat list.

I thought of comparing multiple pages to find areas they have in common, identifying these as menu areas and parsing the links there, but I'm not really satisfied with this idea.
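To make that idea concrete, here is a minimal sketch in Python. It does not compare page regions; as a simpler stand-in it counts in how many pages each link target appears and treats the targets shared by most pages as menu candidates. The names `LinkCollector`, `extract_links`, `menu_candidates` and the `min_share` threshold are assumptions made for illustration, as are the sample pages.

```python
from collections import Counter
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html):
    """Return all link targets of one HTML document, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.hrefs


def menu_candidates(pages, min_share=0.8):
    """Return link targets that occur in at least min_share of all pages.

    pages maps a file name to its HTML source. min_share is an assumed,
    hand-picked threshold, not a value from the original post.
    """
    counts = Counter()
    for html in pages.values():
        # Count each target once per page, however often it is repeated.
        counts.update(set(extract_links(html)))
    needed = min_share * len(pages)
    return sorted(href for href, n in counts.items() if n >= needed)


# Hypothetical sample site: two shared menu links plus page-specific links.
pages = {
    "a.html": '<a href="a.html">A</a><a href="b.html">B</a><a href="x.html">X</a>',
    "b.html": '<a href="a.html">A</a><a href="b.html">B</a><a href="y.html">Y</a>',
    "c.html": '<a href="a.html">A</a><a href="b.html">B</a>',
}
menu = menu_candidates(pages)
```

With this sample, `menu` contains only the two targets that every page links to. On a real site, relative URLs would have to be normalized first, and footer links would show up as candidates too.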

My question:

Can you think of an algorithm for detecting such a structure?

Update 1:

What I'm looking for is not a web spider, but an algorithm to create a logical tree of the relationships between the pages, so that I can create pages and subpages inside my CMS when importing them.

Update 2:

Following Robert's suggestion, I'll solve this by starting at the root page, parsing links as I go, and treating every link inside a page as a child page. I'll probably recurse not in a depth-first manner but rather in a breadth-first manner to get a more balanced navigation structure.
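A minimal sketch of this breadth-first approach in Python, assuming the links of each page have already been extracted into a dictionary. The function names `build_tree` and `print_tree` and the sample file names are made up for illustration. Each page becomes a child of the first page, in breadth-first order, that links to it; later links to an already placed page are ignored.

```python
from collections import deque


def build_tree(root, links):
    """Build a page tree breadth-first from a link graph.

    links maps each page to the list of pages it links to.
    Returns a dict mapping each reachable page to its list of child pages.
    """
    children = {root: []}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target in children:
                # Already placed in the tree via an earlier (shallower) link.
                continue
            children[target] = []
            children[page].append(target)
            queue.append(target)
    return children


def print_tree(children, page, depth=0):
    """Print the tree in the indented '+ page' style used above."""
    print("     " * depth + "+ " + page)
    for child in children[page]:
        print_tree(children, child, depth + 1)


# Hypothetical link graph of a small static site.
links = {
    "index.html": ["products.html", "about.html"],
    "products.html": ["index.html", "widget.html", "about.html"],
    "widget.html": ["index.html", "products.html"],
    "about.html": ["index.html"],
}
tree = build_tree("index.html", links)
print_tree(tree, "index.html")
```

Because the traversal is breadth-first, `about.html` ends up directly under the root even though `products.html` also links to it. A depth-first traversal would visit `products.html` completely first and place `about.html` one level deeper.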
