Search Results

Search found 1 results on 1 pages for 'kooldave98'.

Page 1/1 | 1 

  • Retrieve KEYWORDS from META tag in a HTML WebPage using JAVA.

    - by kooldave98
    Hello all, I want to retrieve all the content words from a HTML WebPage and all the keywords contained in the META TAG of the same HTML webpage using Java. For example, consider this html source code: <html> <head> <meta name = "keywords" content = "deception, intricacy, treachery"> </head> <body> My very short html document. <br> It has just 2 'lines'. </body> </html> The CONTENT WORDS here are: my, very, short, html, document, it, has, just, lines Note: The punctuation and the number '2' are ruled out. The KEYWORDS here are: deception, intricacy, treachery I have created a class for this purpose called WebDoc, this is as far as I have been able to get. import java.io.BufferedReader; import java.io.IOException; import java.io.InputStreamReader; import java.net.URL; import java.util.Set; import java.util.TreeSet; public class WebDoc { protected URL _url; protected Set<String> _contentWords; protected Set<String> _keyWords public WebDoc(URL paramURL) { _url = paramURL; } public Set<String> getContents() throws IOException { //URL url = new URL(url); Set<String> contentWords = new TreeSet<String>(); BufferedReader in = new BufferedReader(new InputStreamReader(_url.openStream())); String inputLine; while ((inputLine = in.readLine()) != null) { // Process each line. contentWords.add(RemoveTag(inputLine)); //System.out.println(RemoveTag(inputLine)); } in.close(); System.out.println(contentWords); _contentWords = contentWords; return contentWords; } public String RemoveTag(String html) { html = html.replaceAll("\\<.*?>",""); html = html.replaceAll("&",""); return html; } public Set<String> getKeywords() { //NO IDEA ! return null; } public URL getURL() { return _url; } @Override public String toString() { return null; } }

    Read the article

1