HttpClient multithread performance


I have an application which downloads more than 4500 HTML pages from 62 target hosts using HttpClient (4.1.3 or 4.2-beta). It runs on Windows 7 64-bit. Processor: Core i7 2600K. Network bandwidth: 54 Mb/s.

It currently uses the following parameters:

  • DefaultHttpClient and PoolingClientConnectionManager;
  • It also has the IdleConnectionMonitorThread from
    http://hc.apache.org/httpcomponents-client-ga/tutorial/html/connmgmt.html;
  • Maximum total connections = 80;
  • Default maximum connections per route = 5;
  • For thread management it uses ForkJoinPool with parallelism
    level = 5 (do I understand correctly that this is the number of worker
    threads?). A sketch of this wiring follows below.
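
Roughly, the setup is wired like this (a simplified sketch of the relevant part only; the values mirror the list above, and the class names are those of HttpClient 4.2):

import java.util.concurrent.ForkJoinPool;
import org.apache.http.client.HttpClient;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.conn.PoolingClientConnectionManager;

PoolingClientConnectionManager connectionManager = new PoolingClientConnectionManager();
connectionManager.setMaxTotal(80);           // maximum total connections
connectionManager.setDefaultMaxPerRoute(5);  // default maximum connections per route

HttpClient httpClient = new DefaultHttpClient(connectionManager);

// parallelism level = 5
ForkJoinPool pool = new ForkJoinPool(5);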

In this case my network usage (in the Windows task manager) does not rise above 2.5%, and downloading the 4500 pages takes 70 minutes. In the HttpClient logs I see entries like this:

DEBUG ForkJoinPool-2-worker-1 [org.apache.http.impl.conn.PoolingClientConnectionManager]: Connection released: [id: 209][route: {}->http://stackoverflow.com][total kept alive: 6; route allocated: 1 of 5; total allocated: 10 of 80]

The number of allocated connections does not rise above 10-12, even though I've set the limit to 80. If I raise the parallelism level to 20 or 80, network usage remains the same, but a lot of connection timeouts are generated.

I've read the tutorials on hc.apache.org (the HttpClient Performance Optimization Guide and the HttpClient Threading Guide), but they do not help.

The task's code looks like this:

public class ContentDownloader extends RecursiveAction {
    private final HttpClient httpClient;
    private final HttpContext context;
    private List<Entry> entries;

    public ContentDownloader(HttpClient httpClient, List<Entry> entries) {
        this.httpClient = httpClient;
        context = new BasicHttpContext();
        this.entries = entries;
    }

    // Downloads a single entry and stores the resulting HTML in it.
    private void computeDirectly(Entry entry) {
        final HttpGet get = new HttpGet(entry.getLink());
        try {
            HttpResponse response = httpClient.execute(get, context);
            int statusCode = response.getStatusLine().getStatusCode();

            if ((statusCode >= 400) && (statusCode <= 600)) {
                logger.error("Couldn't get content from " + get.getURI().toString() + "\n" + response.toString());
            } else {
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    String htmlContent = EntityUtils.toString(entity).trim();
                    entry.setHtml(htmlContent);
                    EntityUtils.consumeQuietly(entity);
                }
            }
        } catch (Exception e) {
            // exceptions are silently swallowed here
        } finally {
            get.releaseConnection();
        }
    }

    @Override
    protected void compute() {
        // Recursively split the list of entries in half until a single
        // entry is left, then download that entry directly.
        if (entries.size() <= 1) {
            computeDirectly(entries.get(0));
            return;
        }
        int split = entries.size() / 2;
        invokeAll(new ContentDownloader(httpClient, entries.subList(0, split)),
                new ContentDownloader(httpClient, entries.subList(split, entries.size())));
    }
}
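
The task is submitted to the pool roughly like this (again a simplified sketch; pool is the ForkJoinPool with parallelism 5 from above, and allEntries stands for the full list of pages to download):

// Kick off the recursive download over the whole list of entries and
// block until all sub-tasks have completed.
ForkJoinPool pool = new ForkJoinPool(5);
pool.invoke(new ContentDownloader(httpClient, allEntries));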

And the question is: what is the best practice for using HttpClient from multiple threads? Are there rules for setting up the ConnectionManager and HttpClient? How can I make use of all 80 connections and raise network usage?

If necessary, I will provide more code.

