stoplist - Developer IT

Using a "white list" for extracting terms for Text Mining

- by [email protected]

In Part 1 of my post on "Generating cluster names from a document clustering model" (part 1, part 2, part 3), I showed how to build a clustering model from text documents using Oracle Data Miner, which automates preparing data for text mining. In this process we specified a custom stoplist and lexer and relied on Oracle Text to identify important terms. However, there is an alternative approach, the white list, which uses a thesaurus object with the Oracle Text CTXRULE index to allow you to specify the important terms. INTRODUCTIONA stoplist is used to exclude, i.e., black list, specific words in your documents from being indexed. For example, words like a, if, and, or, and but normally add no value when text mining. Other words can also be excluded if they do not help to differentiate documents, e.g., the word Oracle is ubiquitous in the Oracle product literature. One problem with stoplists is determining which words to specify. This usually requires inspecting the terms that are extracted, manually identifying which ones you don't want, and then re-indexing the documents to determine if you missed any. Since a corpus of documents could contain thousands of words, this could be a tedious exercise. Moreover, since every word is considered as an individual token, a term excluded in one context may be needed to help identify a term in another context. For example, in our Oracle product literature example, the words "Oracle Data Mining" taken individually are not particular helpful. The term "Oracle" may be found in nearly all documents, as with the term "Data." The term "Mining" is more unique, but could also refer to the Mining industry. If we exclude "Oracle" and "Data" by specifying them in the stoplist, we lose valuable information. But it we include them, they may introduce too much noise. Still, when you have a broad vocabulary or don't have a list of specific terms of interest, you rely on the text engine to identify important terms, often by computing the term frequency - inverse document frequency metric. (This is effectively a weight associated with each term indicating its relative importance in a document within a collection of documents. We'll revisit this later.) The results using this technique is often quite valuable. As noted above, an alternative to the subtractive nature of the stoplist is to specify a white list, or a list of terms--perhaps multi-word--that we want to extract and use for data mining. The obvious downside to this approach is the need to specify the set of terms of interest. However, this may not be as daunting a task as it seems. For example, in a given domain (Oracle product literature), there is often a recognized glossary, or a list of keywords and phrases (Oracle product names, industry names, product categories, etc.). Being able to identify multi-word terms, e.g., "Oracle Data Mining" or "Customer Relationship Management" as a single token can greatly increase the quality of the data mining results. The remainder of this post and subsequent posts will focus on how to produce a dataset that contains white list terms, suitable for mining. CREATING A WHITE LIST We'll leverage the thesaurus capability of Oracle Text. Using a thesaurus, we create a set of rules that are in effect our mapping from single and multi-word terms to the tokens used to represent those terms. For example, "Oracle Data Mining" becomes "ORACLEDATAMINING." First, we'll create and populate a mapping table called my_term_token_map. All text has been converted to upper case and values in the TERM column are intended to be mapped to the token in the TOKEN column. TERM TOKEN DATA MINING DATAMINING ORACLE DATA MINING ORACLEDATAMINING 11G ORACLE11G JAVA JAVA CRM CRM CUSTOMER RELATIONSHIP MANAGEMENT CRM ... Next, we'll create a thesaurus object my_thesaurus and a rules table my_thesaurus_rules: CTX_THES.CREATE_THESAURUS('my_thesaurus', FALSE); CREATE TABLE my_thesaurus_rules (main_term VARCHAR2(100), query_string VARCHAR2(400)); We next populate the thesaurus object and rules table using the term token map. A cursor is defined over my_term_token_map. As we iterate over the rows, we insert a synonym relationship 'SYN' into the thesaurus. We also insert into the table my_thesaurus_rules the main term, and the corresponding query string, which specifies synonyms for the token in the thesaurus. DECLARE cursor c2 is select token, term from my_term_token_map; BEGIN for r_c2 in c2 loop CTX_THES.CREATE_RELATION('my_thesaurus',r_c2.token,'SYN',r_c2.term); EXECUTE IMMEDIATE 'insert into my_thesaurus_rules values (:1,''SYN(' || r_c2.token || ', my_thesaurus)'')' using r_c2.token; end loop; END; We are effectively inserting the token to return and the corresponding query that will look up synonyms in our thesaurus into the my_thesaurus_rules table, for example: 'ORACLEDATAMINING' SYN ('ORACLEDATAMINING', my_thesaurus)At this point, we create a CTXRULE index on the my_thesaurus_rules table: create index my_thesaurus_rules_idx on my_thesaurus_rules(query_string) indextype is ctxsys.ctxrule; In my next post, this index will be used to extract the tokens that match each of the rules specified. We'll then compute the tf-idf weights for each of the terms and create a nested table suitable for mining.

Read the article

Using a "white list" for extracting terms for Text Mining, Part 2

- by [email protected]

In my last post, we set the groundwork for extracting specific tokens from a white list using a CTXRULE index. In this post, we will populate a table with the extracted tokens and produce a case table suitable for clustering with Oracle Data Mining. Our corpus of documents will be stored in a database table that is defined as create table documents(id NUMBER, text VARCHAR2(4000)); However, any suitable Oracle Text-accepted data type can be used for the text. We then create a table to contain the extracted tokens. The id column contains the unique identifier (or case id) of the document. The token column contains the extracted token. Note that a given document many have many tokens, so there will be one row per token for a given document. create table extracted_tokens (id NUMBER, token VARCHAR2(4000)); The next step is to iterate over the documents and extract the matching tokens using the index and insert them into our token table. We use the MATCHES function for matching the query_string from my_thesaurus_rules with the text. DECLARE cursor c2 is select id, text from documents; BEGIN for r_c2 in c2 loop insert into extracted_tokens select r_c2.id id, main_term token from my_thesaurus_rules where matches(query_string, r_c2.text)>0; end loop; END; Now that we have the tokens, we can compute the term frequency - inverse document frequency (TF-IDF) for each token of each document. create table extracted_tokens_tfidf as with num_docs as (select count(distinct id) doc_cnt from extracted_tokens), tf as (select a.id, a.token, a.token_cnt/b.num_tokens token_freq from (select id, token, count(*) token_cnt from extracted_tokens group by id, token) a, (select id, count(*) num_tokens from extracted_tokens group by id) b where a.id=b.id), doc_freq as (select token, count(*) overall_token_cnt from extracted_tokens group by token) select tf.id, tf.token, token_freq * ln(doc_cnt/df.overall_token_cnt) tf_idf from num_docs, tf, doc_freq df where df.token=tf.token; From the WITH clause, the num_docs query simply counts the number of documents in the corpus. The tf query computes the term (token) frequency by computing the number of times each token appears in a document and divides that by the number of tokens found in the document. The doc_req query counts the number of times the token appears overall in the corpus. In the SELECT clause, we compute the tf_idf. Next, we create the nested table required to produce one record per case, where a case corresponds to an individual document. Here, we COLLECT all the tokens for a given document into the nested column extracted_tokens_tfidf_1. CREATE TABLE extracted_tokens_tfidf_nt NESTED TABLE extracted_tokens_tfidf_1 STORE AS extracted_tokens_tfidf_tab AS select id, cast(collect(DM_NESTED_NUMERICAL(token,tf_idf)) as DM_NESTED_NUMERICALS) extracted_tokens_tfidf_1 from extracted_tokens_tfidf group by id; To build the clustering model, we create a settings table and then insert the various settings. Most notable are the number of clusters (20), using cosine distance which is better for text, turning off auto data preparation since the values are ready for mining, the number of iterations (20) to get a better model, and the split criterion of size for clusters that are roughly balanced in number of cases assigned. CREATE TABLE km_settings (setting_name VARCHAR2(30), setting_value VARCHAR2(30)); BEGIN INSERT INTO km_settings (setting_name, setting_value) VALUES VALUES (dbms_data_mining.clus_num_clusters, 20); INSERT INTO km_settings (setting_name, setting_value) VALUES (dbms_data_mining.kmns_distance, dbms_data_mining.kmns_cosine); INSERT INTO km_settings (setting_name, setting_value) VALUES VALUES (dbms_data_mining.prep_auto,dbms_data_mining.prep_auto_off); INSERT INTO km_settings (setting_name, setting_value) VALUES VALUES (dbms_data_mining.kmns_iterations,20); INSERT INTO km_settings (setting_name, setting_value) VALUES VALUES (dbms_data_mining.kmns_split_criterion,dbms_data_mining.kmns_size); COMMIT; END; With this in place, we can now build the clustering model. BEGIN DBMS_DATA_MINING.CREATE_MODEL( model_name => 'TEXT_CLUSTERING_MODEL', mining_function => dbms_data_mining.clustering, data_table_name => 'extracted_tokens_tfidf_nt', case_id_column_name => 'id', settings_table_name => 'km_settings'); END;To generate cluster names from this model, check out my earlier post on that topic.

Read the article

How to ignore noiseXXX.txt files for a specific column in SQL Server 2005?

- by John MacIntyre

I have a product table where the description column is fulltext indexed. The problem is, users frequently search a single word, which happens to be in the noiseXXX.txt files. We'd like to keep the noise word functionality enabled, but is there anyway to turn it off just for this one column? I think you can do this in 2008 with the SET STOPLIST=OFF, but I can't seem to find similar functionality in SQL Server 2005.

Read the article

How to get the top keys from a hash by value

- by Kirs Kringle

I have a hash that I sorted by values greatest to least. How would I go about getting the top 5? There was a post on here that talked about getting only one value. What is the easiest way to get a key with the highest value from a hash in Perl? I understand that so would lets say getting those values add them to an array and delete the element in the hash and then do the process again? Seems like there should be an easier way to do this then that though. My hash is called %words. use strict; use warnings; use Tk; #Learn to install here: http://factscruncher.blogspot.com/2012/01/easy-way-to-install-tk- on-strawberry.html #Reading in the text file my $file0 = Tk::MainWindow->new->Tk::getOpenFile; open( my $filehandle0, '<', $file0 ) || die "Could not open $file0\n"; my @words; while ( my $line = <$filehandle0> ) { chomp $line; my @word = split( /\s+/, lc($line)); push( @words, @word ); } for (@words) { s/[\,|\.|\!|\?|\:|\;|\"]//g; } #Counting words that repeat; put in hash my %words_count; $words_count{$_}++ for @words; #Reading in the stopwords file my $file1 = "stoplist.txt"; open( my $filehandle1, '<', $file1 ) or die "Could not open $file1\n"; my @stopwords; while ( my $line = <$filehandle1> ) { chomp $line; my @linearray = split( " ", $line ); push( @stopwords, @linearray ); } for my $w ( my @stopwords ) { s/\b\Q$w\E\B//ig; } #Comparing the array to Hash and deleteing stopwords my %words = %words_count; for my $stopwords ( @stopwords ) { delete $words{ $stopwords }; } #Sorting Hash Table my @keys = sort { $words{$b} <=> $words{$a} or "\L$a" cmp "\L$b" } keys %words; #Starting Statistical Work my $value_count = 0; my $key_count = 0; #Printing Hash Table $key_count = keys %words; foreach my $key (@keys) { $value_count = $words{$key} + $value_count; printf "%-20s %6d\n", $key, $words{$key}; } my $value_average = $value_count / $key_count; #my @topwords; #foreach my $key (@keys){ #if($words{$key} > $value_average){ # @topwords = keys %words; # } #} print "\n", "The number of values: ", $value_count, "\n"; print "The number of elements: ", $key_count, "\n"; print "The Average: ", $value_average, "\n\n";

Developer IT