n-grams from text in PostgreSQL
Posted by harshsinghal on Stack Overflow
Published on 2010-06-15T12:59:55Z
I am looking to create n-grams from a text column in PostgreSQL. I currently split the data (sentences) in the text column on whitespace into an array:
select regexp_split_to_array(sentenceData, E'\\s+') from tableName;
Once I have this array, how do I go about:
- Creating a loop to find the n-grams and writing each one to a row in another table?
Using unnest I can get every element of every array on its own row, and maybe I could then work out a way to build n-grams from that single column, but I'd lose the sentence boundaries, which I wish to preserve.
Sample SQL code for PostgreSQL to reproduce the above scenario:
create table tableName(sentenceData text);
INSERT INTO tableName(sentenceData) VALUES('This is a long sentence');
INSERT INTO tableName(sentenceData) VALUES('I am currently doing grammar, hitting this monster book btw!');
INSERT INTO tableName(sentenceData) VALUES('Just tonnes of grammar, problem is I bought it in TAIWAN, and so there aint any englihs, just chinese and japanese');
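-- Split each sentence into an array of words (one array per row).
select regexp_split_to_array(sentenceData, E'\\s+') from tableName;
-- Expand every array into one word per row (sentence boundaries are lost).
select unnest(regexp_split_to_array(sentenceData, E'\\s+')) from tableName;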
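One way the n-gram step might look without an explicit loop, while keeping sentence boundaries, is to slice each word array at every start position. This is only a sketch under assumptions: it needs PostgreSQL 9.3 or later for LATERAL, it builds trigrams (n = 3), and the table ngrams and the columns sentence_id, pos and ngram are hypothetical names that are not part of the sample above. Because the sample table has no key, sentences are numbered with row_number(), and that numbering is arbitrary.

```sql
-- Sketch only, assuming PostgreSQL 9.3+ and the sample table
-- tableName(sentenceData) from this post.
-- ngrams, sentence_id, pos and ngram are hypothetical names.
CREATE TABLE ngrams (
    sentence_id bigint,   -- which sentence the n-gram came from
    pos         integer,  -- index of the n-gram's first word in that sentence
    ngram       text
);

INSERT INTO ngrams (sentence_id, pos, ngram)
SELECT s.sentence_id,
       i.pos,
       -- Slice n = 3 consecutive words and join them with single spaces.
       array_to_string(s.words[i.pos : i.pos + 2], ' ')
FROM (
    -- The sample table has no key, so number the rows to tell sentences apart.
    SELECT row_number() OVER () AS sentence_id,
           regexp_split_to_array(sentenceData, E'\\s+') AS words
    FROM tableName
) AS s
CROSS JOIN LATERAL
    -- One start position per n-gram; no rows for sentences shorter than n.
    generate_series(1, array_length(s.words, 1) - 2) AS i(pos);
```

For 'This is a long sentence' this should produce three rows: 'This is a', 'is a long' and 'a long sentence'. Since each slice is taken from a single sentence's array, no n-gram can span two sentences. For a different n, replace both occurrences of 2 with n - 1.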