Tokenizing Twitter Posts in Lucene
Posted
by Amaç Herdagdelen
on Stack Overflow
Published on 2010-03-31T17:26:09Z
Hello,
My question in a nutshell: Does anyone know of a TwitterAnalyzer or TwitterTokenizer for Lucene?
More detailed version:
I want to index a number of tweets in Lucene and keep terms like @user or #hashtag intact. StandardTokenizer does not work for this because it discards the punctuation, although it does other useful things such as keeping domain names and email addresses together and recognizing acronyms. How can I get an analyzer that does everything StandardTokenizer does but leaves terms like @user and #hashtag untouched?
My current solution is to preprocess the tweet text before feeding it into the analyzer, replacing the # and @ characters with alphanumeric strings. For example,
// tweetText holds the raw tweet (the variable name is illustrative)
String newText = tweetText.replaceAll("#", "hashtag");
newText = newText.replaceAll("@", "addresstag");
Unfortunately, this method breaks legitimate email addresses, but I can live with that. Does this approach make sense?
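One possible refinement of the same preprocessing idea, as a rough sketch only: replace # and @ only when they start a token (at the beginning of the text or right after whitespace), so that an @ inside an email address is left alone. The class name TweetPreprocessor and the method preprocess are made up for illustration, and the code uses only java.util.regex, not any Lucene API. It assumes that a tag marker is always preceded by whitespace, which will miss cases like "(@user)".

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: rewrites tag markers before analysis.
public class TweetPreprocessor {
    // Match '#' or '@' only when it is NOT preceded by a non-whitespace
    // character (start of text or after whitespace) and a word character follows.
    private static final Pattern HASHTAG = Pattern.compile("(?<!\\S)#(?=\\w)");
    private static final Pattern MENTION = Pattern.compile("(?<!\\S)@(?=\\w)");

    public static String preprocess(String tweet) {
        // Same placeholder strings as in the replaceAll approach above.
        String out = HASHTAG.matcher(tweet).replaceAll("hashtag");
        return MENTION.matcher(out).replaceAll("addresstag");
    }

    public static void main(String[] args) {
        // The '@' in joe@example.com follows a letter, so it is not touched.
        System.out.println(preprocess("@user mail me at joe@example.com #lucene"));
        // prints: addresstaguser mail me at joe@example.com hashtaglucene
    }
}
```

The placeholders still merge with the following word (for example, #lucene becomes hashtaglucene), so the same substitution would have to be applied to query text as well.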
Thanks in advance!
Amaç