How to index and search .doc files

Posted by Jared on Stack Overflow
Published on 2009-07-18T22:28:45Z Indexed on 2012/09/20 3:38 UTC

I have an application that needs to accept uploaded .doc files. These documents should then be indexed, and the whole collection should be searchable. This will run on Windows Server, without Word installed, using IIS and SQL Server, but I'd rather not be tied to SQL Server's full-text indexing.

I was thinking of using Lucene.Net for the indexing part and was wondering what the best way to get the text out of the .doc files would be. I could probably extract the text by reading in the whole stream and then using a regex to pull out runs of ordinary printable characters, but that seems heavy-handed and error-prone.
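To make the regex idea concrete, here is a minimal sketch of what I mean, written in Python rather than .NET purely for brevity. The function name `extract_text_runs` and the four-character minimum are my own assumptions, and it only looks for printable ASCII stored either as single bytes or as UTF-16LE; it knows nothing about the actual .doc structure.

```python
import re

# Runs of 4+ printable ASCII bytes (text stored one byte per character).
ASCII_RUN = re.compile(rb"[\x20-\x7e]{4,}")
# Runs of 4+ printable ASCII characters stored as UTF-16LE,
# i.e. each character byte is followed by a NUL byte.
UTF16_RUN = re.compile(rb"(?:[\x20-\x7e]\x00){4,}")


def extract_text_runs(data: bytes) -> list[str]:
    """Return printable text runs found in raw bytes, in file order.

    Hypothetical helper for illustration only: it scans blindly and
    does not parse the document format at all.
    """
    runs = []
    for match in UTF16_RUN.finditer(data):
        runs.append((match.start(), match.group().decode("utf-16-le")))
    for match in ASCII_RUN.finditer(data):
        runs.append((match.start(), match.group().decode("ascii")))
    runs.sort()  # order by byte offset so the text reads in file order
    return [text for _, text in runs]


if __name__ == "__main__":
    # Made-up bytes standing in for a real file: binary noise around
    # one 8-bit run and one UTF-16LE run.
    sample = (
        b"\xd0\xcf\x11\xe0\x00\x01"
        + b"Hello world"
        + b"\x00\x02\x03"
        + "Dear customer".encode("utf-16-le")
        + b"\x01\x02"
    )
    print(extract_text_runs(sample))
```

This also shows why the approach worries me: anything in the file that happens to look like printable characters gets picked up, whether it is document text or not, and I would expect mail-merge field codes to come through mixed in with the real content.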

I saw an article on using IFilters that sounds promising, but I thought I'd put this out there since it's not something I'm familiar with.

P.S. If it matters, these .doc files will have mail-merge fields in them, and for now there's no alternative to the .doc format.

© Stack Overflow or respective owner
