Using a "white list" for extracting terms for Text Mining, Part 2
Posted
by [email protected]
on Oracle Blogs
See other posts from Oracle Blogs
or by [email protected]
Published on Wed, 31 Mar 2010 12:13:40 +0000
Indexed on
2010/03/31
16:23 UTC
Read the original article
Hit count: 765
Our corpus of documents will be stored in a database table that is defined as
create table documents(id NUMBER, text VARCHAR2(4000));
However, any suitable Oracle Text-accepted data type can be used for the text.
We then create a table to contain the extracted tokens. The id column contains the unique identifier (or case id) of the document. The token column contains the extracted token. Note that a given document many have many tokens, so there will be one row per token for a given document.
create table extracted_tokens (id NUMBER, token VARCHAR2(4000));
The next step is to iterate over the documents and extract the matching tokens using the index and insert them into our token table. We use the MATCHES function for matching the query_string from my_thesaurus_rules with the text.
DECLARE
cursor c2 is
select id, text
from documents;
BEGIN
for r_c2 in c2 loop
insert into extracted_tokens
select r_c2.id id, main_term token
from my_thesaurus_rules
where matches(query_string,
r_c2.text)>0;
end loop;
END;
Now that we have the tokens, we can compute the term frequency - inverse document frequency (TF-IDF) for each token of each document.
create table extracted_tokens_tfidf as
with num_docs as (select count(distinct id) doc_cnt
from extracted_tokens),
tf as (select a.id, a.token,
a.token_cnt/b.num_tokens token_freq
from
(select id, token, count(*) token_cnt
from extracted_tokens
group by id, token) a,
(select id, count(*) num_tokens
from extracted_tokens
group by id) b
where a.id=b.id),
doc_freq as (select token, count(*) overall_token_cnt
from extracted_tokens
group by token)
select tf.id, tf.token,
token_freq * ln(doc_cnt/df.overall_token_cnt) tf_idf
from num_docs,
tf,
doc_freq df
where df.token=tf.token;
From the WITH clause, the num_docs query simply counts the number of documents in the corpus. The tf query computes the term (token) frequency by computing the number of times each token appears in a document and divides that by the number of tokens found in the document. The doc_req query counts the number of times the token appears overall in the corpus. In the SELECT clause, we compute the tf_idf.
Next, we create the nested table required to produce one record per case, where a case corresponds to an individual document. Here, we COLLECT all the tokens for a given document into the nested column extracted_tokens_tfidf_1.
CREATE TABLE extracted_tokens_tfidf_nt
NESTED TABLE extracted_tokens_tfidf_1
STORE AS extracted_tokens_tfidf_tab AS
select id,
cast(collect(DM_NESTED_NUMERICAL(token,tf_idf)) as DM_NESTED_NUMERICALS) extracted_tokens_tfidf_1
from extracted_tokens_tfidf
group by id;
To build the clustering model, we create a settings table and then insert the various settings. Most notable are the number of clusters (20), using cosine distance which is better for text, turning off auto data preparation since the values are ready for mining, the number of iterations (20) to get a better model, and the split criterion of size for clusters that are roughly balanced in number of cases assigned.
CREATE TABLE km_settings (setting_name VARCHAR2(30), setting_value VARCHAR2(30));
BEGIN
INSERT INTO km_settings (setting_name, setting_value) VALUES
VALUES (dbms_data_mining.clus_num_clusters, 20);
INSERT INTO km_settings (setting_name, setting_value)
VALUES (dbms_data_mining.kmns_distance, dbms_data_mining.kmns_cosine);
INSERT INTO km_settings (setting_name, setting_value) VALUES
VALUES (dbms_data_mining.prep_auto,dbms_data_mining.prep_auto_off);
INSERT INTO km_settings (setting_name, setting_value) VALUES
VALUES (dbms_data_mining.kmns_iterations,20);
INSERT INTO km_settings (setting_name, setting_value) VALUES
VALUES (dbms_data_mining.kmns_split_criterion,dbms_data_mining.kmns_size);
COMMIT;
END;
With this in place, we can now build the clustering model.
BEGIN
DBMS_DATA_MINING.CREATE_MODEL(
model_name => 'TEXT_CLUSTERING_MODEL',
mining_function => dbms_data_mining.clustering,
data_table_name => 'extracted_tokens_tfidf_nt',
case_id_column_name => 'id',
settings_table_name => 'km_settings');
END;
To generate cluster names from this model, check out my earlier post on that topic.
© Oracle Blogs or respective owner