Hashing words to numbers with respect to definition

Posted by thornate on Stack Overflow
Published on 2010-03-22T01:48:01Z Indexed on 2010/03/22 1:51 UTC

As part of a larger project, I need to read in text and represent each word as a number. For example, if the program reads in "Every good boy deserves fruit", then I would get a table that converts 'every' to '1742', 'good' to '977513', etc.

Now, obviously I can just use a hashing algorithm to get these numbers. However, it would be more useful if words with similar meanings had numerical values close to each other, so that 'good' becomes '6827' and 'great' becomes '6835', etc.
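To make the baseline concrete, here is a minimal sketch of the plain hashing approach, assuming whitespace-separated input and case-insensitive matching. The helper names `to_lower` and `build_hash_table` are hypothetical, chosen only for illustration. The values come from `std::hash`, so they are implementation-defined and say nothing about meaning, which is exactly the limitation described above.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical helper: lower-case a word so "Every" and "every" share one entry.
std::string to_lower(std::string word) {
    for (char& c : word) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return word;
}

// Build a word -> number table from whitespace-separated text.
// The numbers come from std::hash: they are implementation-defined and
// unrelated to meaning, so 'good' and 'great' will not land near each other.
std::unordered_map<std::string, std::size_t> build_hash_table(const std::string& text) {
    std::unordered_map<std::string, std::size_t> table;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        const std::string key = to_lower(word);
        table[key] = std::hash<std::string>{}(key);
    }
    return table;
}
```

Calling `build_hash_table("Every good boy deserves fruit")` yields a five-entry table keyed by the lower-cased words.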

As another option, instead of a single integer representing each word, it would be even better to have a vector made up of multiple numbers, e.g. (lexical_category, tense, classification, specific_word), where lexical_category is noun/verb/adjective/etc., tense is future/past/present, classification identifies one of a wide set of general topics, and specific_word is much the same as the number described in the previous paragraph.
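As a rough sketch of what I mean by that vector, the layout could look something like this in C++. The enum values, field widths, and the `pack` function are all assumptions made up for illustration; only the four field names come from the description above.

```cpp
#include <cstdint>

// Assumed category and tense sets; a real scheme would likely need more values.
enum class LexicalCategory : std::uint8_t { Noun, Verb, Adjective, Adverb, Other };
enum class Tense : std::uint8_t { None, Past, Present, Future };

struct WordVector {
    LexicalCategory lexical_category;
    Tense tense;
    std::uint16_t classification;  // hypothetical id of a general topic
    std::uint32_t specific_word;   // id within the topic; close ids = similar meaning
};

// Hypothetical packing into one 64-bit key. Fields are placed from most to
// least significant, so words sharing category, tense, and topic differ only
// in the low 32 bits and stay numerically close.
std::uint64_t pack(const WordVector& v) {
    return (static_cast<std::uint64_t>(v.lexical_category) << 56)
         | (static_cast<std::uint64_t>(v.tense) << 48)
         | (static_cast<std::uint64_t>(v.classification) << 32)
         | static_cast<std::uint64_t>(v.specific_word);
}
```

With this layout, 'good' and 'great' stored as adjectives in the same topic with specific_word values 6827 and 6835 would pack to keys that differ by 8, while a verb would fall in a different range entirely.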

Does any such algorithm exist? If not, can you give me any tips on how to get started on developing one myself? I code in C++.

© Stack Overflow or respective owner
