Retrieve keywords from the META tag of an HTML web page using Java

Posted by kooldave98 on Stack Overflow
Published on 2011-02-23T22:33:02Z

Hello all, I want to retrieve all the content words from an HTML web page, and all the keywords contained in the META tag of the same page, using Java.
For example, consider this HTML source code:

<html>
<head>
<meta name = "keywords" content = "deception, intricacy, treachery">
</head>
<body>
My very short html document. 
<br>
It has just 2 'lines'.
</body>
</html>

The CONTENT WORDS here are: my, very, short, html, document, it, has, just, lines

Note: The punctuation and the number '2' are ruled out.

The KEYWORDS here are: deception, intricacy, treachery
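To make the expected content-word output concrete, here is a minimal sketch of one way to get from the example page to that word list. The class name ContentWordsSketch and the helper contentWords are hypothetical and not part of my WebDoc class. It assumes the whole page is already in one String, that only ASCII letters count as word characters, and that a regular expression is good enough for simple markup like the example (it is not a real HTML parser).

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class ContentWordsSketch {

    // Hypothetical helper: strips markup, then keeps lowercase alphabetic tokens only.
    static Set<String> contentWords(String html) {
        Set<String> words = new TreeSet<String>();
        // Drop the <head> section so nothing from the META tags leaks into the body text.
        String body = html.replaceAll("(?is)<head\\b.*?</head\\s*>", " ");
        // Replace every remaining tag with a space so neighbouring words stay separated.
        String text = body.replaceAll("(?s)<[^>]*>", " ");
        // Split on anything that is not a letter; this rules out punctuation and digits.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String html = "<html>\n<head>\n"
                + "<meta name = \"keywords\" content = \"deception, intricacy, treachery\">\n"
                + "</head>\n<body>\nMy very short html document. \n<br>\n"
                + "It has just 2 'lines'.\n</body>\n</html>";
        // prints [document, has, html, it, just, lines, my, short, very]
        System.out.println(contentWords(html));
    }
}
```

The TreeSet sorts the words alphabetically and removes duplicates, so the order differs from the order in the page.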

I have created a class called WebDoc for this purpose. This is as far as I have been able to get:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.Set;
import java.util.TreeSet;

public class WebDoc {

    protected URL _url;
    protected Set<String> _contentWords;
    protected Set<String> _keyWords;

    public WebDoc(URL paramURL) {
        _url = paramURL;
    }

    public Set<String> getContents() throws IOException {
        //URL url = new URL(url);
        Set<String> contentWords = new TreeSet<String>();
        BufferedReader in = new BufferedReader(new InputStreamReader(_url.openStream()));
        String inputLine;

        while ((inputLine = in.readLine()) != null) {
            // Strip the tags from each line and store what is left.
            // Note: this adds whole lines, not yet individual words.
            contentWords.add(RemoveTag(inputLine));
            //System.out.println(RemoveTag(inputLine));
        }
        in.close();
        System.out.println(contentWords);
        _contentWords = contentWords;
        return contentWords;
    }    

    public String RemoveTag(String html) {
        html = html.replaceAll("\\<.*?>","");
        html = html.replaceAll("&","");
        return html;
    }



    public Set<String> getKeywords() {
        // TODO: no idea how to do this; this is the part I am stuck on.
        return null;
    }

    public URL getURL() {
        return _url;
    }

    @Override
    public String toString() {
        return null;
    }
}
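
For the getKeywords part, one possible direction is sketched below. It is only a sketch under assumptions: the class name MetaKeywordsSketch and the helper keywords are hypothetical, the page is assumed to be available as a single String (not read line by line as in getContents), keywords are lowercased to match the content words, and the regular expressions assume the attribute values are quoted and contain no '>' character. A real HTML parser would likely be more robust on messy pages.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaKeywordsSketch {

    // Matches a whole <meta ...> tag; group 1 holds its attribute text.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b([^>]*)>", Pattern.CASE_INSENSITIVE);
    // Matches name = "keywords", allowing spaces around '=' and either quote style.
    private static final Pattern NAME_KEYWORDS =
            Pattern.compile("\\bname\\s*=\\s*[\"']keywords[\"']", Pattern.CASE_INSENSITIVE);
    // Captures the quoted value of the content attribute in group 2.
    private static final Pattern CONTENT =
            Pattern.compile("\\bcontent\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: collects the comma-separated keywords of every keywords META tag.
    static Set<String> keywords(String html) {
        Set<String> result = new TreeSet<String>();
        Matcher meta = META_TAG.matcher(html);
        while (meta.find()) {
            String attributes = meta.group(1);
            if (!NAME_KEYWORDS.matcher(attributes).find()) {
                continue; // some other META tag, e.g. description
            }
            Matcher content = CONTENT.matcher(attributes);
            if (content.find()) {
                for (String keyword : content.group(2).split(",")) {
                    String trimmed = keyword.trim().toLowerCase(Locale.ROOT);
                    if (!trimmed.isEmpty()) {
                        result.add(trimmed);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html>\n<head>\n"
                + "<meta name = \"keywords\" content = \"deception, intricacy, treachery\">\n"
                + "</head>\n<body>\nMy very short html document.\n</body>\n</html>";
        // prints [deception, intricacy, treachery]
        System.out.println(keywords(html));
    }
}
```

Inside WebDoc, getKeywords could append the lines read from _url.openStream() into one String and pass it to a helper like this, storing the result in _keyWords. Checking the name and content attributes separately means the tag is still recognised when the two attributes appear in the opposite order.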

Tags: java, html, collections, source-code, meta-tags
