Reading huge free-text docs from one file for Lucene indexing

Posted by Jun on Stack Overflow
Published on 2012-05-31T04:36:51Z Indexed on 2012/05/31 4:40 UTC

I have a large number of free-text news docs in one big file. Each news doc is structured like this:

(Header line) Category, Doc1, Date (day, month, year)

(body text)

...

...

...

(Header line) Category, Doc2, Date (day, month, year)

(body text)

...

...

...
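
One way to recognise a header line might be a regular expression anchored to the whole line. The sketch below rests on a guess about the format: it assumes a header looks like `Sports, Doc1, 31 May 2012`, that is, a category, an id starting with `Doc`, and a day/month/year date, separated by commas. `DocHeader`, `HeaderParser` and the pattern itself are hypothetical and would need adjusting to the real data, especially if OCR also garbles the header lines.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical container for the three header fields.
public sealed class DocHeader
{
    public string Category;
    public string DocId;
    public string Date;
}

public static class HeaderParser
{
    // Assumed header shape: "<category>, Doc<number>, <day> <month> <year>".
    // Anchored with ^...$ so a body line that merely mentions "Doc1" does not match.
    private static readonly Regex HeaderPattern = new Regex(
        @"^\s*(?<category>[^,]+?)\s*,\s*(?<id>Doc\d+)\s*,\s*(?<date>\d{1,2}\s+\w+\s+\d{4})\s*$",
        RegexOptions.Compiled);

    // Returns the parsed header, or null when the line is ordinary body text.
    public static DocHeader TryParse(string line)
    {
        if (line == null) return null;
        Match m = HeaderPattern.Match(line);
        if (!m.Success) return null;
        return new DocHeader
        {
            Category = m.Groups["category"].Value,
            DocId = m.Groups["id"].Value,
            Date = m.Groups["date"].Value
        };
    }
}
```

A body line that happens to have the same shape would still be taken for a header, so the pattern is only as reliable as the header format is distinctive.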

Extracting each doc from the big file first takes too much time and is inefficient. So I decided to read the file line by line and feed the information to Lucene at the same time. I am writing C# code to index each doc into Lucene, like this:

StreamReader sr = new StreamReader(file);
string line;
while ((line = sr.ReadLine()) != null)
{
    // How can I tell a doc header line from a body text line here,
    // and collect the metadata and all the text lines of each doc for Lucene to index?
    //
    // Also, the text comes from OCR, which does not give correct line breaks,
    // and captions are mixed in with the content text.
    //
    // Repeat this until the end of the file.
}
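
A minimal sketch of how the loop could group lines into docs while streaming: it keeps only one doc's lines in memory and hands the doc over when the next header line (or the end of the file) is reached. `RawDoc`, `DocSplitter` and the `isHeader` callback are hypothetical names; the header test is passed in because the exact header format is not known here. Body lines are joined with single spaces on the assumption that, for a typical Lucene analyzer, a line break and a space both just separate tokens, so wrong OCR line breaks should mostly not matter for the indexed terms.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical holder for one news doc: its header line plus the joined body text.
public sealed class RawDoc
{
    public string Header;
    public string Body;
}

public static class DocSplitter
{
    // Reads line by line and yields one RawDoc per header line found.
    // isHeader is a caller-supplied test; the real header format is an assumption.
    // Lines that appear before the first header are ignored.
    public static IEnumerable<RawDoc> Split(TextReader reader, Func<string, bool> isHeader)
    {
        string header = null;
        var body = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (isHeader(line))
            {
                // A new header closes the previous doc, if there was one.
                if (header != null)
                    yield return new RawDoc { Header = header, Body = body.ToString() };
                header = line.Trim();
                body.Clear();
            }
            else if (header != null && line.Trim().Length > 0)
            {
                // Join body lines with a single space; blank lines are skipped.
                if (body.Length > 0) body.Append(' ');
                body.Append(line.Trim());
            }
        }
        // The last doc has no following header, so emit it at end of input.
        if (header != null)
            yield return new RawDoc { Header = header, Body = body.ToString() };
    }
}
```

Since `StreamReader` is a `TextReader`, the reader from the loop above could be passed in directly, and each yielded `RawDoc` could be turned into a Lucene document inside a `foreach`, so the whole file is never held in memory. Separating captions from content text is not addressed by this sketch.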

with thanks

© Stack Overflow or respective owner
