Using a "white list" for extracting terms for Text Mining, Part 2


In my last post, we set the groundwork for extracting specific tokens from a white list using a CTXRULE index. In this post, we will populate a table with the extracted tokens and produce a case table suitable for clustering with Oracle Data Mining.
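
As a quick recap of Part 1, the extraction relies on a rules table, my_thesaurus_rules, whose query_string column is indexed with a CTXRULE index. A minimal sketch of that setup (the column sizes and index name here are illustrative; see the earlier post for how the rules are populated from the thesaurus):

-- Sketch of the Part 1 setup: one row per white-list term, with a
-- rule (query_string) that fires when the term appears in a document.
create table my_thesaurus_rules (
  main_term    VARCHAR2(100),    -- the token we want to extract
  query_string VARCHAR2(4000));  -- the matching rule for that token

create index my_thesaurus_rules_idx
  on my_thesaurus_rules(query_string)
  indextype is ctxsys.ctxrule;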

Our corpus of documents will be stored in a database table that is defined as

create table documents(id NUMBER, text VARCHAR2(4000));

However, any suitable Oracle Text-accepted data type can be used for the text.
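
For concreteness, imagine loading a couple of toy documents (the contents here are purely hypothetical):

-- Illustrative sample rows, not real data:
insert into documents values (1, 'Oracle Data Mining can cluster documents by their extracted terms.');
insert into documents values (2, 'A CTXRULE index matches white-list terms against document text.');
commit;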

We then create a table to contain the extracted tokens. The id column contains the unique identifier (or case id) of the document, and the token column contains the extracted token. Note that a given document may match many tokens, so there will be one row per token for a given document.

create table extracted_tokens (id NUMBER, token VARCHAR2(4000));

The next step is to iterate over the documents, extract the matching tokens using the index, and insert them into our token table. We use the MATCHES function to test each document's text against the query_string rules in my_thesaurus_rules.

DECLARE
    cursor c2 is
      select id, text
      from documents;
BEGIN
    for r_c2 in c2 loop
       insert into extracted_tokens
         select r_c2.id id, main_term token
         from my_thesaurus_rules
         where matches(query_string, r_c2.text) > 0;
    end loop;
END;
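
As an aside, depending on your Oracle release, the row-by-row loop can often be collapsed into a single set-based statement; a sketch, assuming your version supports MATCHES in a join against the document table:

-- Set-based alternative to the cursor loop (one pass over both tables):
insert into extracted_tokens
  select d.id, r.main_term
  from documents d, my_thesaurus_rules r
  where matches(r.query_string, d.text) > 0;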


Now that we have the tokens, we can compute the term frequency-inverse document frequency (TF-IDF) for each token of each document: the token's frequency within the document, multiplied by the natural log of the number of documents divided by the number of documents containing the token.

create table extracted_tokens_tfidf as
  with num_docs as (select count(distinct id) doc_cnt
                    from extracted_tokens),
       tf       as (select a.id, a.token,
                           a.token_cnt/b.num_tokens token_freq
                    from 
                      (select id, token, count(*) token_cnt
                       from extracted_tokens
                       group by id, token) a,
                      (select id, count(*) num_tokens
                       from extracted_tokens
                       group by id) b
                    where a.id=b.id),
       doc_freq as (select token, count(*) overall_token_cnt
                    from extracted_tokens
                    group by token)
  select tf.id, tf.token,
         token_freq * ln(doc_cnt/df.overall_token_cnt) tf_idf
  from num_docs,
       tf,
       doc_freq df
  where df.token=tf.token;


In the WITH clause, the num_docs query simply counts the number of documents in the corpus. The tf query computes the term (token) frequency by counting the number of times each token appears in a document and dividing that by the number of tokens in that document. The doc_freq query counts the number of documents in which each token appears (recall there is one row per token per document). In the SELECT clause, we combine these to compute the tf_idf.

Next, we create the nested table required to produce one record per case, where a case corresponds to an individual document. Here, we COLLECT the (token, tf_idf) pairs for each document into the nested column extracted_tokens_tfidf_1, using the DM_NESTED_NUMERICAL type that Oracle Data Mining expects for nested numerical attributes.

CREATE TABLE extracted_tokens_tfidf_nt
  NESTED TABLE extracted_tokens_tfidf_1
  STORE AS extracted_tokens_tfidf_tab AS
select id,
       cast(collect(DM_NESTED_NUMERICAL(token, tf_idf))
            as DM_NESTED_NUMERICALS) extracted_tokens_tfidf_1
from extracted_tokens_tfidf
group by id;
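
To eyeball the result, the nested column can be unnested with the TABLE operator; a quick optional check (attribute_name and value are the fields of DM_NESTED_NUMERICAL):

-- Unnest a few rows to verify the shape of the case table:
select n.id, t.attribute_name token, t.value tf_idf
from extracted_tokens_tfidf_nt n,
     table(n.extracted_tokens_tfidf_1) t
where rownum <= 10;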

 
To build the clustering model, we create a settings table and insert the desired settings. Most notable are the number of clusters (20); cosine distance, which works better for text; turning off automatic data preparation, since the values are already prepared for mining; the number of iterations (20), to get a better model; and a split criterion of size, which yields clusters roughly balanced in the number of cases assigned.

CREATE TABLE km_settings (setting_name  VARCHAR2(30),
                          setting_value VARCHAR2(30));

BEGIN
  INSERT INTO km_settings (setting_name, setting_value)
    VALUES (dbms_data_mining.clus_num_clusters, 20);
  INSERT INTO km_settings (setting_name, setting_value)
    VALUES (dbms_data_mining.kmns_distance, dbms_data_mining.kmns_cosine);
  INSERT INTO km_settings (setting_name, setting_value)
    VALUES (dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_off);
  INSERT INTO km_settings (setting_name, setting_value)
    VALUES (dbms_data_mining.kmns_iterations, 20);
  INSERT INTO km_settings (setting_name, setting_value)
    VALUES (dbms_data_mining.kmns_split_criterion, dbms_data_mining.kmns_size);
  COMMIT;
END;

With this in place, we can now build the clustering model.

BEGIN
    DBMS_DATA_MINING.CREATE_MODEL(
      model_name          => 'TEXT_CLUSTERING_MODEL',
      mining_function     => dbms_data_mining.clustering,
      data_table_name     => 'extracted_tokens_tfidf_nt',
      case_id_column_name => 'id',
      settings_table_name => 'km_settings');
END;
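
Once the model is built, the SQL scoring functions can assign each document to a cluster; a sketch against the same case table:

-- Score each case: cluster assignment plus the model's confidence.
select id,
       cluster_id(TEXT_CLUSTERING_MODEL using *) cluster_id,
       cluster_probability(TEXT_CLUSTERING_MODEL using *) probability
from extracted_tokens_tfidf_nt;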


To generate cluster names from this model, check out my earlier post on that topic.
