Retrieve keywords from the META tag of an HTML web page using Java

Posted by kooldave98 on Stack Overflow
Published on 2011-02-23T22:33:02Z Indexed on 2011/02/23 23:24 UTC

Hello all, I want to retrieve all the content words from an HTML web page, and all the keywords contained in the META tag of the same page, using Java.
For example, consider this HTML source code:

<html>
<head>
<meta name = "keywords" content = "deception, intricacy, treachery">
</head>
<body>
My very short html document. 
<br>
It has just 2 'lines'.
</body>
</html>

The CONTENT WORDS here are: my, very, short, html, document, it, has, just, lines

Note: The punctuation and the number '2' are ruled out.

The KEYWORDS here are: deception, intricacy, treachery
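
The content-word rule described above (strip the tags, treat punctuation and digits as separators, lower-case the rest) could be sketched roughly as follows. This is only a minimal illustration built on regular expressions, not a real HTML parser. The class name ContentWordsSketch and the method contentWords are hypothetical, and the sketch assumes that text inside head, script and style elements should not count as content.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of the WebDoc class.
final class ContentWordsSketch {

    static Set<String> contentWords(String html) {
        // Assumption: head, script and style elements hold no content words, so drop them whole.
        String text = html.replaceAll("(?is)<(head|script|style)\\b.*?</\\1\\s*>", " ");
        // Replace the remaining tags and character entities with spaces.
        text = text.replaceAll("<[^>]*>", " ");
        text = text.replaceAll("&[#A-Za-z0-9]+;", " ");

        Set<String> words = new TreeSet<String>();
        // Any run of non-letters (whitespace, punctuation, digits) separates two words.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String html = "<html>\n<head>\n"
                + "<meta name = \"keywords\" content = \"deception, intricacy, treachery\">\n"
                + "</head>\n<body>\nMy very short html document. \n<br>\n"
                + "It has just 2 'lines'.\n</body>\n</html>";
        // prints [document, has, html, it, just, lines, my, short, very]
        System.out.println(contentWords(html));
    }
}
```

The TreeSet returns the words in alphabetical order, so the order differs from the list above but the set of words is the same. Only the letters a to z are treated as word characters here, so accented words would be split; that is a simplification of this sketch.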

I have created a class called WebDoc for this purpose; this is as far as I have been able to get.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.Set;
import java.util.TreeSet;

public class WebDoc {

    protected URL _url;
    protected Set<String> _contentWords;
    protected Set<String> _keyWords;

    public WebDoc(URL paramURL) {
        _url = paramURL;
    }

    public Set<String> getContents() throws IOException {
        Set<String> contentWords = new TreeSet<String>();
        BufferedReader in = new BufferedReader(new InputStreamReader(_url.openStream()));
        try {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                // Strip the tags, then split what is left into lower-case words.
                // Anything that is not a letter (punctuation, digits) separates words.
                String text = RemoveTag(inputLine).toLowerCase(java.util.Locale.ROOT);
                for (String word : text.split("[^a-z]+")) {
                    if (!word.isEmpty()) {
                        contentWords.add(word);
                    }
                }
            }
        } finally {
            // Close the stream even if reading fails.
            in.close();
        }
        System.out.println(contentWords);
        _contentWords = contentWords;
        return contentWords;
    }

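    public String RemoveTag(String html) {
        // Replace each tag with a space so the words on either side do not run together.
        html = html.replaceAll("<[^>]*>", " ");
        // Replace character entities such as &nbsp; or &#39; with a space as well.
        html = html.replaceAll("&[#A-Za-z0-9]+;", " ");
        // Drop any stray ampersands that remain.
        html = html.replace("&", " ");
        return html;
    }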



    public Set<String> getKeywords() {
        //NO IDEA !
        return null;
    }

    public URL getURL() {
        return _url;
    }

    @Override
    public String toString() {
        return null;
    }
}
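
One possible starting point for the missing getKeywords() logic is sketched below, under some loudly stated assumptions: the page has already been read into a single String, the META tag sits on one line or at least contains no '>' inside its attribute values, and both attribute values are quoted. The class MetaKeywordsSketch and the method extractKeywords are hypothetical names and not part of WebDoc. A dedicated HTML parser library such as jsoup would likely be more robust than these regular expressions.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the WebDoc class.
final class MetaKeywordsSketch {

    // A whole <meta ...> tag (assumes no '>' inside attribute values).
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // name = "keywords", with optional spaces around '=' and either quote style.
    private static final Pattern NAME_KEYWORDS =
            Pattern.compile("\\bname\\s*=\\s*[\"']keywords[\"']", Pattern.CASE_INSENSITIVE);
    // content = "..." or content = '...'; the value lands in group 2 or group 3.
    private static final Pattern CONTENT =
            Pattern.compile("\\bcontent\\s*=\\s*(\"([^\"]*)\"|'([^']*)')", Pattern.CASE_INSENSITIVE);

    static Set<String> extractKeywords(String html) {
        Set<String> keywords = new TreeSet<String>();
        Matcher tag = META_TAG.matcher(html);
        while (tag.find()) {
            String meta = tag.group();
            if (!NAME_KEYWORDS.matcher(meta).find()) {
                continue; // some other meta tag, e.g. description
            }
            Matcher content = CONTENT.matcher(meta);
            if (!content.find()) {
                continue; // keywords tag without a quoted content attribute
            }
            String value = content.group(2) != null ? content.group(2) : content.group(3);
            // The keywords are comma-separated; trim the surrounding whitespace.
            for (String keyword : value.split(",")) {
                String trimmed = keyword.trim();
                if (!trimmed.isEmpty()) {
                    keywords.add(trimmed);
                }
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        String html = "<html>\n<head>\n"
                + "<meta name = \"keywords\" content = \"deception, intricacy, treachery\">\n"
                + "</head>\n<body>\nMy very short html document.\n</body>\n</html>";
        // prints [deception, intricacy, treachery]
        System.out.println(extractKeywords(html));
    }
}
```

To use something like this from WebDoc, the lines read in getContents() would have to be joined into one String first, since that method currently processes the page line by line.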
