.Net search engine architecture and technology choice
- by shrivb
I am in the process of designing a search engine for an asp.net site. The site currently uses Microsoft Indexing Server to index and search content which range from simple text files to MS documents to PDFs. MIS is also used to crawl File servers. MIS in tandem with Index Server Companion crawls for content from external sites. I intend to replace MIS with the indexer/crawler I am trying to build.
Since my platform is completely on the Microsoft stack, I cant afford to have a Java application server. Thus, Solr, and effectively, SolrNet is ruled out.
With this being the context, I have couple of questions.
1.Technology choice
I had done my initial investigation and looked at Lucene.Net. There seemed to be 2 issues in using Lucene.Net. First being, it cant crawl external content. There doesn't seem to be a direct port of Nutch in .Net. Second, since it is just an indexer, it cant parse various document types. The parsing is left to the developer.
So, what would be best technology choice on the .Net platform to achieve indexing & crawling? Are there any .Net open source libraries available for document parsing?
2.Architectural pattern
Is there any general architectural pattern or best practice that needs to be followed in designing such a search engine?
Thanks in advance.