I have an application that downloads more than 4500 HTML pages from 62 target hosts using HttpClient (4.1.3 or 4.2-beta). It runs on 64-bit Windows 7 with a Core i7 2600K processor and 54 Mb/s of network bandwidth.
At the moment it uses the following configuration:
DefaultHttpClient with PoolingClientConnectionManager;
an IdleConnectionMonitorThread taken from
http://hc.apache.org/httpcomponents-client-ga/tutorial/html/connmgmt.html;
maximum total connections = 80;
default maximum connections per route = 5;
for thread management, a ForkJoinPool with parallelism level = 5 (do I
understand correctly that this is the number of worker threads?).
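On the parallelism question: as far as I can tell from the ForkJoinPool Javadoc, the parallelism level is the target number of worker threads the pool tries to keep active. The pool may add spare threads when workers are stalled joining other tasks, but the Javadoc says no such adjustment is guaranteed for threads blocked in plain I/O. Below is a minimal standard-library sketch of that situation, not my real code: the class name BlockingLeafDemo is hypothetical, and Thread.sleep stands in for a blocking HTTP request.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo: measures how many blocking leaf tasks overlap in a ForkJoinPool.
class BlockingLeafDemo {
    static final AtomicInteger running = new AtomicInteger();
    static final AtomicInteger peak = new AtomicInteger();
    static final AtomicInteger completed = new AtomicInteger();

    // Splits the range [from, to) in half until single items remain.
    static class Split extends RecursiveAction {
        final int from;
        final int to;

        Split(int from, int to) {
            this.from = from;
            this.to = to;
        }

        @Override
        protected void compute() {
            if (to - from < 1) {
                return; // empty range, nothing to do
            }
            if (to - from == 1) {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max); // remember the highest overlap seen
                try {
                    Thread.sleep(50); // assumption: stands in for a blocking HTTP call
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                running.decrementAndGet();
                completed.incrementAndGet();
                return;
            }
            int mid = (from + to) >>> 1;
            invokeAll(new Split(from, mid), new Split(mid, to));
        }
    }

    // Returns { reported parallelism, peak overlapping leaves, completed leaves }.
    static int[] run(int parallelism, int tasks) {
        running.set(0);
        peak.set(0);
        completed.set(0);
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            pool.invoke(new Split(0, tasks));
        } finally {
            pool.shutdown();
        }
        return new int[] { pool.getParallelism(), peak.get(), completed.get() };
    }

    public static void main(String[] args) {
        int[] r = run(5, 20);
        System.out.println("parallelism=" + r[0] + " peak=" + r[1] + " completed=" + r[2]);
    }
}
```

BlockingLeafDemo.run(5, 20) returns the pool's reported parallelism, the peak number of leaves that were blocked at the same time, and the number of completed leaves. I would expect the peak to stay close to the parallelism level, which would fit the small number of connections I see in use.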
With this setup, network usage (as shown in Windows Task Manager) never rises above 2.5%, and downloading the 4500 pages takes 70 minutes. The HttpClient logs contain entries like this:
DEBUG ForkJoinPool-2-worker-1
[org.apache.http.impl.conn.PoolingClientConnectionManager]: Connection
released: [id: 209][route: {}-http://stackoverflow.com][total kept
alive: 6; route allocated: 1 of 5; total allocated: 10 of 80]
The total number of allocated connections never rises above 10-12, even though I have set the maximum to 80.
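To follow these counters over a whole run, they can be pulled out of the DEBUG lines with a regular expression. This is a rough sketch based only on the log format shown above; the class name PoolLogStats and its method names are hypothetical. The leasedEstimate() method rests on an assumption I have not verified against the HttpClient source: that "total allocated" counts both connections currently leased and idle kept-alive ones.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts pool counters from a PoolingClientConnectionManager DEBUG line.
class PoolLogStats {
    private static final Pattern STATS = Pattern.compile(
            "total kept alive: (\\d+); route allocated: (\\d+) of (\\d+); "
            + "total allocated: (\\d+) of (\\d+)");

    final int keptAlive;
    final int routeAllocated;
    final int routeMax;
    final int totalAllocated;
    final int totalMax;

    private PoolLogStats(int keptAlive, int routeAllocated, int routeMax,
                         int totalAllocated, int totalMax) {
        this.keptAlive = keptAlive;
        this.routeAllocated = routeAllocated;
        this.routeMax = routeMax;
        this.totalAllocated = totalAllocated;
        this.totalMax = totalMax;
    }

    // Returns null when the line carries no pool statistics.
    static PoolLogStats parse(String logLine) {
        Matcher m = STATS.matcher(logLine);
        if (!m.find()) {
            return null;
        }
        return new PoolLogStats(
                Integer.parseInt(m.group(1)),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)),
                Integer.parseInt(m.group(4)),
                Integer.parseInt(m.group(5)));
    }

    // Assumption: allocated = leased + idle kept-alive, so leased = allocated - kept alive.
    int leasedEstimate() {
        return totalAllocated - keptAlive;
    }
}
```

Applied to the log entry above (joined into one line), this gives 6 kept alive and 10 of 80 allocated, so under that assumption only about 4 connections were actually in use at that moment.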
If I raise the parallelism level to 20 or 80, network usage stays the same, but a lot of connection time-outs are generated.
I have read the tutorials on hc.apache.org (HttpClient Performance Optimization Guide and HttpClient Threading Guide), but they did not help.
The task's code looks like this:
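import java.util.List;
import java.util.concurrent.RecursiveAction;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.protocol.BasicHttpContext;
import org.apache.http.protocol.HttpContext;
import org.apache.http.util.EntityUtils;

// Entry and logger are defined elsewhere in the application.
public class ContentDownloader extends RecursiveAction {
    private final HttpClient httpClient;
    private final HttpContext context;
    private final List<Entry> entries;

    public ContentDownloader(HttpClient httpClient, List<Entry> entries) {
        this.httpClient = httpClient;
        this.context = new BasicHttpContext(); // one context per task
        this.entries = entries;
    }

    // Downloads a single page and stores its HTML in the entry.
    private void computeDirectly(Entry entry) {
        final HttpGet get = new HttpGet(entry.getLink());
        try {
            HttpResponse response = httpClient.execute(get, context);
            int statusCode = response.getStatusLine().getStatusCode();
            if (statusCode >= 400 && statusCode <= 600) {
                logger.error("Couldn't get content from " + get.getURI() + "\n" + response);
            } else {
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    String htmlContent = EntityUtils.toString(entity).trim();
                    entry.setHtml(htmlContent);
                    // toString() has already read the entity; this call is only a safeguard
                    EntityUtils.consumeQuietly(entity);
                }
            }
        } catch (Exception e) {
            // Log failures (assumes the logger accepts a message and a Throwable)
            logger.error("Request failed for " + get.getURI(), e);
        } finally {
            get.releaseConnection();
        }
    }

    @Override
    protected void compute() {
        if (entries.isEmpty()) {
            return; // nothing to download
        }
        if (entries.size() == 1) {
            computeDirectly(entries.get(0));
            return;
        }
        // Split the list in half and run both halves as subtasks
        int split = entries.size() / 2;
        invokeAll(new ContentDownloader(httpClient, entries.subList(0, split)),
                new ContentDownloader(httpClient, entries.subList(split, entries.size())));
    }
}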
So the question is: what is the best practice for using HttpClient from multiple threads? Are there any rules for setting up the ConnectionManager and HttpClient? How can I use all 80 connections and raise network usage?
If necessary, I will provide more code.