Webcrawler, feedback?

Hey folks, every once in a while I need to automate data collection tasks from websites. Sometimes I need a bunch of URLs from a directory, sometimes I need an XML sitemap (yes, I know there is plenty of software and there are online services for that).

Anyway, as a follow-up to my previous question, I've written a little webcrawler that can visit websites.

  • A basic crawler class for quickly and easily interacting with a single website.

  • Override "doAction(String URL, String content)" to process the content further (e.g. store it, parse it); see the first sketch after this list.

  • The design allows for multi-threaded crawlers: all class instances share the processed and queued lists of links (see the second sketch after this list).

  • Instead of keeping track of processed and queued links within the object, a JDBC connection could be established to store the links in a database.

  • Currently limited to one website at a time, but this could be extended by adding an externalLinks stack and pushing links to other domains onto it as appropriate.

  • JCrawler is intended for quickly generating XML sitemaps or parsing websites for the information you want. It's lightweight.
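
For the doAction hook, a subclass might look like the following minimal sketch. Only the "doAction(String URL, String content)" signature comes from the description above; the SitemapCrawler name, its body, and the assumed JCrawler constructor are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical subclass of the JCrawler base class; the constructor
// signature is assumed, only the doAction hook is from the post.
public class SitemapCrawler extends JCrawler {

    private final List<String> visited = new ArrayList<String>();

    public SitemapCrawler(String startUrl) {
        super(startUrl); // assumed JCrawler constructor
    }

    @Override
    public void doAction(String URL, String content) {
        // Record every crawled URL; content could be parsed or stored instead.
        visited.add(URL);
    }

    public List<String> getVisited() {
        return visited;
    }
}
```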

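On the multi-threading point: if several crawler instances share the processed and queued link lists, those collections have to be thread-safe. A minimal sketch of such a shared structure (the LinkFrontier name and API are mine, not part of JCrawler):

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentLinkedQueue;

// One instance would be shared by all crawler threads.
public class LinkFrontier {

    private final Queue<String> queued = new ConcurrentLinkedQueue<String>();
    private final Set<String> seen = Collections.synchronizedSet(new HashSet<String>());

    // Enqueue a link only the first time it is seen; Set.add is atomic
    // on a synchronized set, so the check-and-enqueue is race-free.
    public void offer(String url) {
        if (seen.add(url)) {
            queued.add(url);
        }
    }

    // Next link to crawl, or null when the queue is currently empty.
    public String poll() {
        return queued.poll();
    }
}
```
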
Given the limitations above, is this a good/decent way to write a crawler?

http://pastebin.com/VtgC4qVE - Main.java
http://pastebin.com/gF4sLHEW - JCrawler.java
http://pastebin.com/VJ1grArt - HTMLUtils.java

Thanks in advance for your feedback! :)
