Tokenizing Twitter Posts in Lucene

Posted by Amaç Herdagdelen on Stack Overflow
Published on 2010-03-31T17:26:09Z Indexed on 2010/04/01 6:23 UTC

Hello,

My question in a nutshell: Does anyone know of a TwitterAnalyzer or TwitterTokenizer for Lucene?

More detailed version:

I want to index a number of tweets in Lucene and keep terms like @user or #hashtag intact. StandardTokenizer does not work because it discards the punctuation, although it does other useful things such as keeping domain names and email addresses together and recognizing acronyms. How can I get an analyzer that does everything StandardTokenizer does but leaves terms like @user and #hashtag untouched?

My current solution is to preprocess the tweet text before feeding it into the analyzer, replacing the # and @ characters with alphanumeric strings. For example:

// tweetText holds the raw tweet (variable name assumed for illustration)
String newText = tweetText.replaceAll("#", "hashtag");
newText = newText.replaceAll("@", "addresstag");

Unfortunately, this method breaks legitimate email addresses, but I can live with that. Does this approach make sense?
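
One possible way around the email problem, as a rough sketch only: replace a marker only when it starts a term, that is, when it is not preceded by a word character. The class name TweetPreprocessor and the method protectTwitterTerms are made up for illustration, and the rule "an @ preceded by a letter, digit, underscore or dot belongs to an email address" is an assumption that will not cover every case. It uses only java.util.regex, not any Lucene API.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: rewrites leading '@' and '#'
// into alphanumeric prefixes so a punctuation-dropping tokenizer keeps them.
final class TweetPreprocessor {

    // '@' that is NOT preceded by a word character or '.', and IS followed
    // by a word character. Assumption: an '@' inside "bob@example.com"
    // always has a word character or dot right before it.
    private static final Pattern MENTION = Pattern.compile("(?<![\\w.])@(?=\\w)");

    // '#' at the start of a term, followed by a word character.
    private static final Pattern HASHTAG = Pattern.compile("(?<!\\w)#(?=\\w)");

    static String protectTwitterTerms(String tweetText) {
        String text = MENTION.matcher(tweetText).replaceAll("addresstag");
        return HASHTAG.matcher(text).replaceAll("hashtag");
    }

    public static void main(String[] args) {
        // The mention and hashtag are rewritten; the email address is left alone.
        System.out.println(protectTwitterTerms(
                "Hi @user, see #lucene or mail bob@example.com"));
    }
}
```

The same prefixes would have to be applied to query text as well, otherwise a search for @user would not match the indexed addresstaguser term.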

Thanks in advance!

Amaç

© Stack Overflow or respective owner