Need a tool to search large structured text documents for words, phrases and related phrases

Posted by pitosalas on Stack Overflow
Published on 2010-06-05T13:52:35Z Indexed on 2010/06/05 14:12 UTC

I have to keep up with structured documents such as requests for proposals, government program reports, threat models and the like. They are written in what I would call techno-legalese: highly structured, with section numbering and 3, 4 or 5 levels of nesting. All of them are in English.

I need a more efficient way to locate the paragraphs or nuggets that matter to me. What I’d like is a kind of local document index or repository that lets me keep some standing queries and easily locate the sections of documents that relate to those queries. Here’s an example:

  • I’d like to load in 10 large PDF files, each of, say, 100 pages. Each PDF contains English text, formatted very nicely into paragraphs and sections.

  • I’d like to specify that I am interested in “blogging platforms”, “weaknesses in Ruby” and “localization and internationalization”.

  • Ideally, I would then look at a list showing each section of text that includes or seems related to the words and phrases I specified, along with the name of the document and other relevant information.
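To make the workflow above concrete, here is a minimal Python sketch of standing queries run against numbered sections. It is an illustration under several assumptions, not an existing tool: the PDFs are assumed to have been converted to plain text already by some separate step, headings are assumed to look like "3.2.1 Title" at the start of a line, and the names `Section`, `split_sections`, `search` and `STOPWORDS` are all hypothetical. It only matches the words of each query, so it does not find "related" phrases that use different wording.

```python
import re
from dataclasses import dataclass

# Assumption: a heading is a line such as "3.2.1 Title". A body line that
# happens to start with a number would be mistaken for a heading.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")

# Assumption: a tiny hand-picked stopword list, ignored when scoring.
STOPWORDS = {"a", "an", "and", "in", "of", "or", "the", "to"}


@dataclass
class Section:
    document: str  # name of the source document
    number: str    # section number, e.g. "3.2.1"
    title: str
    text: str


def split_sections(document, text):
    """Split plain text into sections at numbered heading lines."""
    sections = []
    current = Section(document, "0", "(preamble)", "")
    for line in text.splitlines():
        line = line.strip()
        match = HEADING.match(line)
        if match:
            sections.append(current)
            current = Section(document, match.group(1), match.group(2), "")
        elif line:
            current.text = (current.text + " " + line).strip()
    sections.append(current)
    # Drop the preamble when nothing precedes the first heading.
    return [s for s in sections if s.title != "(preamble)" or s.text]


def words(text):
    """Lowercase a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(sections, queries, min_score=0.5):
    """Return (score, query, section) hits, highest score first."""
    hits = []
    for section in sections:
        found = words(section.title + " " + section.text)
        found_set = set(found)
        padded = " " + " ".join(found) + " "
        for query in queries:
            terms = [t for t in words(query) if t not in STOPWORDS]
            if not terms:
                continue
            # Fraction of the query's significant words present in the section.
            score = sum(t in found_set for t in terms) / len(terms)
            # Bonus of 1.0 when the whole phrase appears word for word.
            if " " + " ".join(words(query)) + " " in padded:
                score += 1.0
            if score >= min_score:
                hits.append((score, query, section))
    hits.sort(key=lambda hit: -hit[0])
    return hits


# Usage with a made-up document; real input would come from the PDFs.
sample = """1 Overview
This report reviews blogging platforms used by agencies.
1.1 Risks
Known weaknesses in Ruby deployments are listed here.
2 Budget
Funding is unchanged."""

standing_queries = [
    "blogging platforms",
    "weaknesses in Ruby",
    "localization and internationalization",
]
for score, query, section in search(split_sections("report.pdf", sample), standing_queries):
    print(section.document, section.number, section.title, "|", query, "|", score)
```

Keeping the section number with each hit is what lets the result list point back into a deeply nested document. A real full-text engine would replace the scoring here with stemming and proper ranking.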

I am sure something like this exists. I would call it something like document indexing, document comprehension or structured searching.

© Stack Overflow or respective owner
