Webcrawler, feedback?

Hey folks, every once in a while I need to automate data collection tasks from websites. Sometimes I need a bunch of URLs from a directory, sometimes I need an XML sitemap (yes, I know there is plenty of software and there are online services for that).

Anyway, as a follow-up to my previous question, I've written a little webcrawler that can visit websites.

  • A basic crawler class for quickly and easily interacting with a single website.

  • Override "doAction(String URL, String content)" to process the content further (e.g. store it, parse it); see the first sketch after this list.

  • The design allows for multi-threaded crawlers: all class instances share the processed and queued lists of links (see the second sketch after this list).

  • Instead of keeping track of processed and queued links within the object, a JDBC connection could be established to store the links in a database.

  • Currently limited to one website at a time, but this could be extended by adding an externalLinks stack and pushing links to other domains onto it as appropriate.

  • JCrawler is intended for quickly generating XML sitemaps or parsing websites for the information you want. It's lightweight.
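
For the doAction hook, a subclass might look like the following minimal sketch. Only the "doAction(String URL, String content)" signature comes from the description above; the SitemapCrawler name, its body, and the assumed JCrawler constructor are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical subclass of the JCrawler base class; the constructor
// signature is assumed, only the doAction hook is from the post.
public class SitemapCrawler extends JCrawler {

    private final List<String> visited = new ArrayList<String>();

    public SitemapCrawler(String startUrl) {
        super(startUrl); // assumed JCrawler constructor
    }

    @Override
    public void doAction(String URL, String content) {
        // Record every crawled URL; content could be parsed or stored instead.
        visited.add(URL);
    }

    public List<String> getVisited() {
        return visited;
    }
}
```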

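On the multi-threading point: if several crawler instances share the processed and queued link lists, those collections have to be thread-safe. A minimal sketch of such a shared structure (the LinkFrontier name and API are mine, not part of JCrawler):

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentLinkedQueue;

// One instance would be shared by all crawler threads.
public class LinkFrontier {

    private final Queue<String> queued = new ConcurrentLinkedQueue<String>();
    private final Set<String> seen = Collections.synchronizedSet(new HashSet<String>());

    // Enqueue a link only the first time it is seen; Set.add is atomic
    // on a synchronized set, so the check-and-enqueue is race-free.
    public void offer(String url) {
        if (seen.add(url)) {
            queued.add(url);
        }
    }

    // Next link to crawl, or null when the queue is currently empty.
    public String poll() {
        return queued.poll();
    }
}
```
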
Given the limitations above, is this a good/decent way to write a crawler?

http://pastebin.com/VtgC4qVE - Main.java
http://pastebin.com/gF4sLHEW - JCrawler.java
http://pastebin.com/VJ1grArt - HTMLUtils.java

Thanks in advance for your feedback! :)
