Problem when getting pageContent of an unavailable URL in Java

Posted by tiendv on Stack Overflow
Published on 2010-05-11T01:46:29Z Indexed on 2010/05/11 2:04 UTC

I have some code that gets the page content from a URL:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class GetPageFromURLAction extends Thread {

    public String stringPageContent;
    public String targerURL;

    public String getPageContent(String targetURL) throws IOException {
        // A StringBuilder avoids re-copying the whole page on every appended line
        StringBuilder sb = new StringBuilder();
        URL url = new URL(targetURL);
        URLConnection openConnection = url.openConnection();
        String temp;
        // try-with-resources closes the reader even if readLine() throws
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(openConnection.getInputStream()))) {
            while ((temp = in.readLine()) != null) {
                sb.append(temp).append("\n");
            }
        }
        // String nohtml = sb.toString().replaceAll("\\<.*?>","");
        return sb.toString();
    }

    public String getStringPageContent() {
        return stringPageContent;
    }

    public void setStringPageContent(String stringPageContent) {
        this.stringPageContent = stringPageContent;
    }

    public String getTargerURL() {
        return targerURL;
    }

    public void setTargerURL(String targerURL) {
        this.targerURL = targerURL;
    }

    @Override
    public void run() {
        try {
            this.stringPageContent=this.getPageContent(targerURL);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

Sometimes the request fails with an HTTP 403 or 405 error, and the result string is null. I have tried checking whether I have permission to connect to the URL with:

    URLConnection openConnection = urlString.openConnection();
    openConnection.getPermission();

but it usually returns null. Does that mean that I don't have permission to access the link?
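As far as I understand it, getPermission() describes the local permission the JVM would need to open the connection, whereas 403 (Forbidden) and 405 (Method Not Allowed) are answers chosen by the remote server. One way to see what the server actually returned is to cast the connection to HttpURLConnection and read the status code. This is only a minimal sketch, assuming an http or https URL; the class name StatusCheck, the timeout values, and the User-Agent value are illustrative and not part of the code above:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class StatusCheck {

    // True for 2xx codes; 403 and 405 fall outside this range.
    static boolean isSuccess(int statusCode) {
        return statusCode >= 200 && statusCode < 300;
    }

    // Sends a GET request and returns the HTTP status code of the response.
    static int fetchStatus(String targetURL) throws IOException {
        // The cast is only valid for http:// and https:// URLs
        HttpURLConnection conn = (HttpURLConnection) new URL(targetURL).openConnection();
        // Assumption: a server may treat the default Java User-Agent differently
        // from a browser; this header value is purely illustrative
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        conn.setConnectTimeout(5000); // milliseconds, illustrative
        conn.setReadTimeout(5000);    // milliseconds, illustrative
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling fetchStatus(targerURL) before getPageContent(targerURL) and skipping the read when isSuccess(...) is false would avoid the IOException that getInputStream() throws for error responses.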

I have tried stripping off the query portion of the URL with:

String nohtml = sb.toString().replaceAll("\\<.*?>","");

where sb is a StringBuilder, but it doesn't seem to strip off the whole query substring.
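For comparison, the pattern above matches angle-bracketed markup such as <b>, so it removes HTML tags from the page text rather than the query part of a URL. A minimal sketch of both operations, assuming "query portion" means everything from the ? onward; QueryStripper and its method names are hypothetical:

```java
import java.net.URI;
import java.net.URISyntaxException;

class QueryStripper {

    // Rebuilds the URL from scheme, authority and path only,
    // which drops the query string and the fragment.
    static String stripQuery(String url) {
        URI u = URI.create(url);
        try {
            return new URI(u.getScheme(), u.getAuthority(), u.getPath(), null, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // What the regex from the question does: removes <...> tags from text.
    static String stripTags(String html) {
        return html.replaceAll("\\<.*?>", "");
    }
}
```

stripQuery would be applied to the URL string before opening the connection, while stripTags would be applied to the downloaded page content.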

As a separate question, I'd like to use threads here because I must retrieve many URLs; how can I create a multi-threaded client to improve the speed?
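One possible shape for such a client is a fixed-size thread pool from java.util.concurrent instead of one Thread subclass per URL. The sketch below is only an illustration under stated assumptions: ParallelFetcher, the PageFetcher callback interface, and the pool size are hypothetical names and values, and a URL whose download fails is simply mapped to null:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelFetcher {

    // Hypothetical callback type: any method that turns a URL into page text.
    interface PageFetcher {
        String fetch(String url) throws Exception;
    }

    // Fetches every URL on a fixed-size pool; a failed URL maps to null.
    static Map<String, String> fetchAll(List<String> urls, PageFetcher fetcher, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit all tasks first so the downloads can overlap
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.fetch(url);
                futures.add(pool.submit(task));
            }
            // Collect the results in input order
            Map<String, String> pages = new LinkedHashMap<>();
            for (int i = 0; i < urls.size(); i++) {
                try {
                    pages.put(urls.get(i), futures.get(i).get());
                } catch (ExecutionException e) {
                    // The fetch itself threw, e.g. an IOException for a 403
                    pages.put(urls.get(i), null);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    pages.put(urls.get(i), null);
                }
            }
            return pages;
        } finally {
            pool.shutdown();
        }
    }
}
```

With the class from the question, a call might look like ParallelFetcher.fetchAll(urls, new GetPageFromURLAction()::getPageContent, 8), where 8 is an arbitrary pool size to tune against the network and the target servers.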

© Stack Overflow or respective owner
