Problem when getting pageContent of an unavailable URL in Java
- by tiendv
I have some code that gets the page content from a URL:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
public class GetPageFromURLAction extends Thread {
    public String stringPageContent;
    public String targerURL;

    // Downloads the page at targetURL and returns its content as one string.
    public String getPageContent(String targetURL) throws IOException {
        StringBuilder sb = new StringBuilder();
        URL urlString = new URL(targetURL);
        URLConnection openConnection = urlString.openConnection();
        String temp;
        BufferedReader in = new BufferedReader(new InputStreamReader(openConnection.getInputStream()));
        while ((temp = in.readLine()) != null) {
            sb.append(temp).append("\n");
        }
        in.close();
        // String nohtml = sb.toString().replaceAll("\\<.*?>","");
        return sb.toString();
    }

    public String getStringPageContent() {
        return stringPageContent;
    }

    public void setStringPageContent(String stringPageContent) {
        this.stringPageContent = stringPageContent;
    }

    public String getTargerURL() {
        return targerURL;
    }

    public void setTargerURL(String targerURL) {
        this.targerURL = targerURL;
    }

    @Override
    public void run() {
        try {
            this.stringPageContent = this.getPageContent(targerURL);
        } catch (IOException e) {
            // On failure the exception is only logged; stringPageContent stays null.
            e.printStackTrace();
        }
    }
}
Sometimes I get an HTTP 403 or 405 error, and the result string is null.
I have tried checking permission to connect to the URL with:
URLConnection openConnection = urlString.openConnection();
openConnection.getPermission()
but it usually returns null. Does that mean I don't have permission to access the link?
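For context, a hedged sketch of how the HTTP status could be inspected directly. As far as the Javadoc describes it, getPermission() reports the local security permission needed to open the connection, not whether the server will authorize the request, so it probably cannot predict a 403 or 405. The class name StatusAwareFetcher, the helper isSuccess, the User-Agent value, the 10-second timeouts, and the UTF-8 charset are all assumptions made for illustration; the cast also assumes an http or https URL.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads a page only when the server answers with a 2xx status.
final class StatusAwareFetcher {

    // True for HTTP status codes in the 200-299 range.
    static boolean isSuccess(int statusCode) {
        return statusCode >= 200 && statusCode < 300;
    }

    static String fetch(String targetURL) throws IOException {
        // Assumption: targetURL uses http or https, otherwise this cast fails.
        HttpURLConnection conn = (HttpURLConnection) new URL(targetURL).openConnection();
        conn.setConnectTimeout(10_000); // assumed timeout values, in milliseconds
        conn.setReadTimeout(10_000);
        // Assumption: some servers reject the default Java agent string, so send an explicit one.
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (compatible; ExampleFetcher/1.0)");
        try {
            int status = conn.getResponseCode();
            if (!isSuccess(status)) {
                // Report the status instead of silently ending up with a null result.
                throw new IOException("HTTP " + status + " " + conn.getResponseMessage()
                        + " for " + targetURL);
            }
            StringBuilder sb = new StringBuilder();
            // Assumption: the page is UTF-8 encoded.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    sb.append(line).append('\n');
                }
            }
            return sb.toString();
        } finally {
            conn.disconnect();
        }
    }
}
```

With something like this, a caller could tell a 403 from a 405 by reading the exception message, rather than only seeing a null string after run() swallows the IOException.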
I have tried stripping off the query portion of the URL with:
String nohtml = sb.toString().replaceAll("\\<.*?>","");
where sb is a StringBuilder, but it doesn't seem to strip off the whole query substring.
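One possible reason: the pattern "\\<.*?>" matches HTML tags in the page text, not the "?key=value" part of a URL, so it would not be expected to touch the query string. A minimal sketch that separates the two operations follows. The class name UrlCleanup and both method names are hypothetical, and stripQuery assumes a hierarchical URL with an authority (such as http or https).

```java
import java.net.URI;

// Hypothetical helper illustrating two different "stripping" operations.
final class UrlCleanup {

    // Drops the query ("?...") and fragment ("#...") from a URL string.
    // Assumption: the URL is hierarchical and has an authority, e.g. http://host/path.
    static String stripQuery(String url) {
        URI uri = URI.create(url); // throws IllegalArgumentException on malformed input
        return uri.getScheme() + "://" + uri.getRawAuthority() + uri.getRawPath();
    }

    // Removes anything that looks like an HTML tag from page content.
    // This is what the regex in the question does; it is a rough filter, not an HTML parser.
    static String stripTags(String html) {
        return html.replaceAll("<.*?>", "");
    }
}
```

For example, stripQuery("http://example.com/a/b?x=1&y=2#frag") yields "http://example.com/a/b", while stripTags("<p>Hello <b>world</b></p>") yields "Hello world".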
On an unrelated note, I'd like to use threads here because I must retrieve many URLs.
How can I create a multi-threaded client to improve the speed?
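One common approach, sketched here under assumptions, is a fixed-size thread pool from java.util.concurrent instead of one Thread subclass per URL. The class name ParallelFetcher and the fetcher parameter are hypothetical; the fetcher stands in for whatever downloads a single page (for instance a wrapper around getPageContent), which keeps the sketch independent of the network. A failed download is recorded as null here, which is only one of several reasonable policies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: runs one download task per URL on a bounded thread pool.
final class ParallelFetcher {

    // Returns a map from each URL to its content, in the order the URLs were given.
    // A URL whose task failed maps to null.
    static Map<String, String> fetchAll(List<String> urls, int threads,
                                        Function<String, String> fetcher) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Map<String, String> results = new LinkedHashMap<>();
        try {
            // Submit every task first so the downloads overlap.
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                futures.add(pool.submit(task));
            }
            // Then collect the results in submission order.
            for (int i = 0; i < urls.size(); i++) {
                try {
                    results.put(urls.get(i), futures.get(i).get());
                } catch (ExecutionException e) {
                    results.put(urls.get(i), null); // the task threw; record the failure
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and stop waiting
                    break;
                }
            }
        } finally {
            pool.shutdown(); // let worker threads exit once queued tasks finish
        }
        return results;
    }
}
```

A call might look like fetchAll(urls, 8, url -> download(url)), where download is a hypothetical method that wraps the single-page fetch and converts IOException into an unchecked exception. The pool size of 8 is arbitrary; since the work is mostly waiting on the network, the best value probably depends on the target servers and should be measured.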