What is a good Java web crawler library?

Posted by DrDee on Stack Overflow See other posts from Stack Overflow or by DrDee
Published on 2010-03-22T19:58:49Z Indexed on 2010/05/16 17:50 UTC
Read the original article Hit count: 334

Filed under:

java

|

web-crawler

|

webcrawling

Hi,

I am about to develop a crawler in Java but don't feel like reinventing the wheel. A quick Google search gives a whole bunch of Java libraries to build a web crawler. Besides that Nutch is of course a very robust package but seems a bit too advanced for my needs. I only need to crawl a handful websites a week containing a couple of 1000 pages each.

Which open source Java library would you recommend considering:

speed
multithreading (or even distributed)
extending it with new functionality
active maintained
and documentation?

© Stack Overflow or respective owner

Related posts about java

Tomcat 6: Access Control Exception?

as seen on Server Fault - Search for 'Server Fault'
I'm trying to setup a tomcat6 server, and I'm trying to match another setup someone else established. However, my deployment (default Ubuntu install) uses a policy.d/ directory structure, and the established server just uses a catalina.policy file. I've tried setting every entry in policy.d to match… >>> More
Problem in creation MDB Queue connection at Jboss StartUp

as seen on Stack Overflow - Search for 'Stack Overflow'
I am not able to create a Queue connection in JBOSS4.2.3GA Version & Java1.5, as I am using MDB as per the below details. I am putting this MDB in a jar file(named utsJar.jar) and copied it in deploy folder of JBOSS, In the test env. this MDB works well but in another env. [ env settings and… >>> More
failing to establish connection between Postgres db and gwt

as seen on Stack Overflow - Search for 'Stack Overflow'
Hi, I am using Postgres and gwt 2.0 for one of my applications. I am facing problem connecting to the database. When I try to connect it gives "ClassNotFoundException". Here is what I get when I try to connect to database: java.lang.ClassNotFoundException: org.postgresql.Driver at java.net… >>> More
failing to establish connection between postgre db and gwt

as seen on Stack Overflow - Search for 'Stack Overflow'
Hi, For i am using postgre and gwt 2.0 for one of my applications. I am facing problem connecting to the database. When i try to connect it gives "ClassNotFoundException". Here is what i get when i try to connect to database: java.lang.ClassNotFoundException: org.postgresql.Driver at java.net… >>> More
Migration and deployement problems JBoss 4.2.2.GA to JBoss 6.0.0.M2

as seen on Stack Overflow - Search for 'Stack Overflow'
Hi, I'm trying to migrate an application running on JBoss 4.2.2.GA to JBoss 6.0.0.M2 I give you some log to explain my problem : boot.log : 2010-03-16 09:59:29,406 ERROR [org.jboss.system.server.profileservice.ProfileServiceBootstrap] (Thread-2) Failed to load profile: Summary of incomplete deployments… >>> More

Related posts about web-crawler

web crawler needed

as seen on Stack Overflow - Search for 'Stack Overflow'
does anybody know where i can get a free web crawler that actually works with minimal coding by me. ive googled it and can only find really old ones that dont work or openwebspider which doesnt seem to work. ideally id like to store just the web addresses and which links that page contains any suggestions… >>> More
Building an automatic web crawler

as seen on Stack Overflow - Search for 'Stack Overflow'
I am building a web application crawler that's meant not only to find all the links or pages in a web application, but also perform all the allowed actions in the app (such as pushing buttons, filling forms, notice changes in the DOM even if they did not trigger a request etc.) Basically, this is… >>> More
Appengine Apps Vs Google bot web crawler

as seen on Stack Overflow - Search for 'Stack Overflow'
i built an appengine web app cricket.hover.in. The web app consists of about 15k url's linked in it, But even after a long time of my launch, no pages are indexed on google. Any base link place on my root site hover.in are being indexed with in minutes. but i placed the same link home page of root… >>> More
Extracting data from internet

as seen on Programmers - Search for 'Programmers'
I would like to extract data from internet like www.mozenda.com does but I want to write my own program to do that. Specific data I'm looking for is various event data. Based on my research, I think custom web crawler is my answer but I Would like to confirm the answer and see if there are any suggestion… >>> More
Web crawler update strategy

as seen on Stack Overflow - Search for 'Stack Overflow'
I want to crawl useful resource (like background picture .. ) from certain websites. It is not a hard job, especially with the help of some wonderful projects like scrapy. The problem here is I not only just want crawl this site ONE TIME. I also want to keep my crawl long running and crawl the updated… >>> More