Problem getting the page content of a URL in Java
- by tiendv
Hi all!
I have some code that gets the page content from a URL.
Here is the code:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class GetPageFromURLAction extends Thread {
    public String stringPageContent;
    public String targerURL;

    public String getPageContent(String targetURL) throws IOException {
        String returnString = "";
        URL urlString = new URL(targetURL);
        URLConnection openConnection = urlString.openConnection();
        String temp;
        BufferedReader in = new BufferedReader(new InputStreamReader(openConnection.getInputStream()));
        while ((temp = in.readLine()) != null) {
            returnString += temp + "\n";
        }
        in.close();
        // String nohtml = sb.toString().replaceAll("\\<.*?>","");
        return returnString;
    }

    public String getStringPageContent() {
        return stringPageContent;
    }

    public void setStringPageContent(String stringPageContent) {
        this.stringPageContent = stringPageContent;
    }

    public String getTargerURL() {
        return targerURL;
    }

    public void setTargerURL(String targerURL) {
        this.targerURL = targerURL;
    }

    @Override
    public void run() {
        try {
            this.stringPageContent = this.getPageContent(targerURL);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The problems are:
1. Sometimes I receive an HTTP error such as 403 or 405, and the result string is null.
To fix this, I tried checking the permission to connect to the URL, but it usually returns null:
URLConnection openConnection = urlString.openConnection();
openConnection.getPermission();
Does this mean that I don't have permission to access the link?
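A note on this question: as far as the Java API documentation describes it, getPermission() reports the permission the local JVM would need to open the connection, and it may return null when none is required, so it probably says nothing about whether the server will answer with 403 or 405. Below is a minimal sketch of reading the HTTP status code directly. It assumes the URL is http or https (otherwise the cast to HttpURLConnection fails). The class name StatusAwareFetcher, the helper describeStatus, the User-Agent value and the 10-second timeouts are hypothetical choices for illustration, not part of the original code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of the original post.
public class StatusAwareFetcher {

    /** Maps an HTTP status code to a short, human-readable category. */
    public static String describeStatus(int code) {
        if (code >= 200 && code < 300) return "success";
        if (code == HttpURLConnection.HTTP_FORBIDDEN) return "forbidden (403)";
        if (code == HttpURLConnection.HTTP_BAD_METHOD) return "method not allowed (405)";
        if (code >= 400 && code < 500) return "client error";
        if (code >= 500 && code < 600) return "server error";
        return "other";
    }

    /** Returns the page body on a 2xx response; logs the status and returns null otherwise. */
    public static String fetch(String targetURL) throws IOException {
        // Assumption: targetURL uses http or https, so this cast is valid.
        HttpURLConnection conn = (HttpURLConnection) new URL(targetURL).openConnection();
        conn.setRequestMethod("GET");
        // Assumption: some servers may reject the default Java User-Agent; this value is made up.
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (compatible; ExampleFetcher/1.0)");
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        try {
            int code = conn.getResponseCode();
            if (code < 200 || code >= 300) {
                System.err.println(targetURL + " -> " + code + " " + describeStatus(code));
                return null;
            }
            StringBuilder sb = new StringBuilder();
            // Assumption: the page is UTF-8; a real crawler would read the charset from Content-Type.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    sb.append(line).append('\n');
                }
            }
            return sb.toString();
        } finally {
            conn.disconnect();
        }
    }
}
```

With something like this, a 403 or 405 shows up as a logged status code instead of an IOException from getInputStream(), which makes it easier to tell which URLs the server is refusing.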
2. To get the result string without HTML tags, I do this:
String nohtml = sb.toString().replaceAll("\\<.*?>","");
Here sb is a StringBuilder, but this does not remove all the HTML tags from the returned string. Why?
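One likely reason the single replaceAll misses tags: in Java regular expressions `.` does not match a newline by default, so a tag that spans several lines is skipped, and the text inside script and style elements is not a tag at all, so it stays in the output. Below is a minimal regex-based sketch that handles those cases. The class name HtmlTagStripper and the method stripTags are hypothetical, and regexes are only an approximation of HTML parsing; a dedicated HTML parser such as jsoup is usually more reliable on real pages.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; a rough approximation, not a full HTML parser.
public class HtmlTagStripper {

    // (?i) case-insensitive, (?s) lets '.' match newlines; removes <script>/<style> with their content.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");

    // Removes HTML comments, which may span several lines.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");

    // [^>] also matches newlines, so multi-line tags are covered.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Returns the input with scripts, styles, comments and tags removed and whitespace collapsed. */
    public static String stripTags(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        return text.replaceAll("\\s+", " ").trim();
    }
}
```

Note that this sketch does not decode entities such as `&amp;`, so those remain in the result as written in the page.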
3. I use a thread here because I have to fetch the pages of a lot of URLs,
so how can I create multiple threads to improve the speed of the program?
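One common way to fetch many URLs concurrently is a fixed-size thread pool from java.util.concurrent rather than one Thread subclass per URL. Below is a minimal sketch under stated assumptions: the class name ParallelPageFetcher and the method fetchAll are hypothetical, and the actual download is passed in as a function so that any fetching code (for example a wrapper around getPageContent) can be plugged in. A URL whose fetch throws is mapped to null.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper class; not part of the original post.
public class ParallelPageFetcher {

    /**
     * Runs fetcher on every URL using a pool of the given size and returns
     * url -> content in input order. A failed or interrupted fetch maps to null.
     */
    public static Map<String, String> fetchAll(List<String> urls,
                                               Function<String, String> fetcher,
                                               int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit all tasks first so they run concurrently.
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                futures.add(pool.submit(task));
            }
            // Then collect the results in the original order.
            Map<String, String> results = new LinkedHashMap<>();
            for (int i = 0; i < urls.size(); i++) {
                try {
                    results.put(urls.get(i), futures.get(i).get());
                } catch (ExecutionException e) {
                    results.put(urls.get(i), null); // the fetcher threw for this URL
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    results.put(urls.get(i), null);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the work is mostly waiting on the network, a pool somewhat larger than the number of CPU cores may help, but the best size depends on the target servers and is worth measuring.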
Thanks