Building dictionary of words from large text

Posted by LiorH on Stack Overflow
Published on 2010-04-06T19:43:25Z Indexed on 2010/04/06 19:53 UTC

Filed under: nlp | natural-language-process | lucene

I have a text file containing posts in English and Italian. I would like to read the posts into a data matrix in which each row represents a post and each column a word. Each cell holds the number of times that word appears in that post. The dictionary should consist either of all the words in the whole file or of a non-exhaustive English/Italian dictionary.
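To make the target structure concrete, here is a minimal Python sketch of such a post-by-word count matrix. It is only an illustration under stated assumptions, not a recommendation of a specific tool: it assumes one post per line (the file layout is not specified above), lowercases everything, and treats any run of letters as a word. The names `tokenize` and `build_matrix` are hypothetical.

```python
import re
from collections import Counter

# Assumption: a "word" is any run of Unicode letters, so accented Italian
# letters are kept. Apostrophes split words ("l'acqua" -> "l", "acqua").
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(post):
    """Lowercase a post and split it into words."""
    return TOKEN_RE.findall(post.lower())


def build_matrix(posts, dictionary=None):
    """Return (vocab, matrix): matrix[i][j] counts vocab[j] in posts[i].

    If `dictionary` is None, the vocabulary is every word seen in `posts`;
    otherwise only the words in `dictionary` become columns.
    """
    # One sparse row (word -> count) per post.
    counts = [Counter(tokenize(post)) for post in posts]
    if dictionary is None:
        vocab = sorted(set().union(*counts))
    else:
        vocab = sorted({word.lower() for word in dictionary})
    # Dense rows; a Counter returns 0 for words the post does not contain.
    matrix = [[row[word] for word in vocab] for row in counts]
    return vocab, matrix


if __name__ == "__main__":
    # Assumption: in a real run the posts would be read from the file,
    # e.g. one post per line.
    posts = [
        "The cat sat on the mat",
        "Il gatto è sul tavolo, il gatto dorme",
    ]
    vocab, matrix = build_matrix(posts)
    print(vocab)
    for row in matrix:
        print(row)
```

For a large file the dense list-of-lists will mostly hold zeros, so keeping only the per-post `Counter` objects (or using a sparse matrix type) may be the more practical representation.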

I know this is a common and essential preprocessing step in NLP.

Does anyone know of a tool or project that can perform this task?

Someone mentioned Apache Lucene. Do you know whether a Lucene index can be serialized to a data structure similar to what I need?

© Stack Overflow or respective owner
