Web Crawler for Learning Topics on Wikipedia
Posted
by
Chris Okyen
on Programmers
Published on 2012-10-08T16:23:39Z
Indexed on
2012/10/08
21:50 UTC
When I want to learn a vast topic on Wikipedia, I don't know where to start. For instance, say I want to learn about binary stars. I then have to understand the other topics linked on that page, the topics linked on those pages, and so on for a specified number of levels. I want to write a web crawler, like HTTracker or something similar, that will display a hierarchy of the links on a given page and the links on those linked pages. I wish to use as much prewritten code as possible. Here is an example:
Pretend we are bending the rules by grabbing links from only the first sentence of each page
The example archives and "processes" two levels deep
The page is Ternary operation
The First Level
In mathematics a ternary operation is an N-ary operation
The Second Level
Under Mathematics:
Mathematics (from Greek μάθημα máthēma, “knowledge, study, learning”) is the abstract study of topics encompassing quantity, structure, space, change and others; it has no generally accepted definition.
Under N-ary:
In logic, mathematics, and computer science, the arity /ˈærɪti/ of a function or operation is the number of arguments or operands that the function takes
Under Operation:
In its simplest meaning in mathematics and logic, an operation is an action or procedure which produces a new value from one or more input values
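The two-level walk in the example above can be sketched as a breadth-first crawl with a level limit. This is only a minimal sketch in Python: `crawl`, `format_tree`, and `EXAMPLE_LINKS` are hypothetical names, and the in-memory dictionary stands in for real page fetching so the snippet runs offline.

```python
from collections import deque

# Hypothetical stand-in for real page fetching: each title maps to the
# titles linked from its first sentence, as listed in the example above.
EXAMPLE_LINKS = {
    "Ternary operation": ["Mathematics", "N-ary", "Operation"],
    "N-ary": ["Operation"],
}


def crawl(start, get_links, levels):
    """Breadth-first crawl that processes pages up to `levels` levels deep.

    Returns {page: [linked pages]} for every processed page. The start
    page counts as level 1, the pages it links to as level 2, and so on.
    """
    hierarchy = {}
    queue = deque([(start, 1)])
    while queue:
        page, level = queue.popleft()
        if page in hierarchy:
            continue  # already processed via another link
        links = list(get_links(page))
        hierarchy[page] = links
        if level < levels:
            queue.extend((link, level + 1) for link in links)
    return hierarchy


def format_tree(hierarchy, page, depth=0, path=()):
    """Return the hierarchy below `page` as a list of indented lines."""
    lines = ["  " * depth + page]
    if page not in path:  # do not expand a page inside its own subtree
        for child in hierarchy.get(page, []):
            lines += format_tree(hierarchy, child, depth + 1, path + (page,))
    return lines


hierarchy = crawl("Ternary operation", lambda p: EXAMPLE_LINKS.get(p, []), 2)
tree_lines = format_tree(hierarchy, "Ternary operation")
print("\n".join(tree_lines))
```

Passing the link lookup in as a function keeps the crawl logic separate from however pages are actually downloaded and parsed, so a real fetcher could be swapped in later without touching the traversal.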
-------------------------------------------------------------------------
I need some way to determine what order to approach all these wiki pages in to learn the concept (in this case, ternary operations)... Following along with this example, one way to show the reading path would be a flowchart-style printout like so:
This shows that the first sentence of the Mathematics page doesn't link to any of the other pages linked from the first sentence of the Ternary operation page. (Please tell me how I should explain this.) In other words, Mathematics, a child node of the top page's first sentence (Ternary operation), has no child nodes that reference the top page's other children, N-ary and Operation. Thus it is safe to read Mathematics first. Since N-ary has a link to Operation, we should read the Operation page second and finally read the N-ary page last.
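The ordering described here ("read a page only after the pages it links to") is what a topological sort produces. A minimal sketch, assuming Python 3.9+ for the standard-library `graphlib` module; `PREREQS` and `reading_order` are hypothetical names, and the edges are only the ones stated in the example.

```python
from graphlib import TopologicalSorter

# page -> pages that should be read before it (the pages it links to).
# Only the links stated in the example are listed here.
PREREQS = {
    "Ternary operation": {"Mathematics", "N-ary", "Operation"},
    "N-ary": {"Operation"},
}


def reading_order(prereqs):
    """Return pages ordered so that every page comes after its prerequisites.

    Pages that become readable at the same time are sorted alphabetically
    to keep the result deterministic.
    """
    sorter = TopologicalSorter(prereqs)
    sorter.prepare()  # raises graphlib.CycleError if pages link in a circle
    order = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        order.extend(ready)
        sorter.done(*ready)
    return order


order = reading_order(PREREQS)
print(order)
```

Real Wikipedia pages often link to each other in circles, in which case `prepare()` raises `CycleError`; some rule for breaking such cycles (for example, dropping links that point back up the hierarchy) would be needed and is not shown here.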
Again, I wish to use as much prewritten code as possible. I was wondering what language to use and what the simplest way to go about this would be, if there isn't already something out there?
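On the prewritten-code point, one possible route is Python's standard library, whose `html.parser` module can already pull links out of a page. The sketch below is an assumption-laden illustration, not a tested Wikipedia scraper: `FirstParagraphLinks`, `first_paragraph_links`, and `SAMPLE_HTML` are hypothetical, the markup is a hand-written stand-in for a real page, and it collects links from the first paragraph rather than strictly the first sentence.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class FirstParagraphLinks(HTMLParser):
    """Collect article titles linked from the first <p> element."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.done = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "p":
            self.in_paragraph = True
        elif tag == "a" and self.in_paragraph:
            href = dict(attrs).get("href") or ""
            # Keep article links only; skip namespaced ones such as "File:".
            if href.startswith("/wiki/") and ":" not in href:
                path = href[len("/wiki/"):].split("#")[0]
                title = unquote(path).replace("_", " ")
                if title not in self.titles:
                    self.titles.append(title)

    def handle_endtag(self, tag):
        if tag == "p" and self.in_paragraph:
            self.in_paragraph = False
            self.done = True  # ignore everything after the first paragraph


def first_paragraph_links(html):
    parser = FirstParagraphLinks()
    parser.feed(html)
    return parser.titles


# Hand-written stand-in markup, not copied from a real page.
SAMPLE_HTML = (
    '<p>In <a href="/wiki/Mathematics">mathematics</a> a ternary operation '
    'is an <a href="/wiki/N-ary">N-ary</a> '
    '<a href="/wiki/Operation">operation</a> '
    '<a href="/wiki/File:Example.png">(image)</a>.</p>'
    '<p>See also <a href="/wiki/Binary_operation">binary operation</a>.</p>'
)

links = first_paragraph_links(SAMPLE_HTML)
print(links)
```

Real article markup is messier than this sample (empty leading paragraphs, infoboxes, footnote links), so the filtering rules would likely need adjusting against actual pages.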
Thank You!