Read huge free text docs in one file for lucene indexing
Posted by Jun on Stack Overflow
Published on 2012-05-31T04:36:51Z
Indexed on 2012/05/31 4:40 UTC
I have a large number of free-text news docs in one big file. Each news doc is structured like this:
(Header line) Category, Doc1, Date (day, month, year)
(body text)
...
...
...
(Header line) Category, Doc2, Date (day, month, year)
(body text)
...
...
...
Extracting each doc from the big file first takes too much time and is not efficient. Therefore, I decided to read the file line by line and feed the information to Lucene at the same time. I wrote C# code to index each doc in Lucene like this:
// Requires: using System.IO;
using (StreamReader sr = new StreamReader(file))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        // How can I tell whether this line is a doc header line or a body text line,
        // and collect the metadata and all the text lines of a doc for Lucene to index?
        // Also, the text comes from OCR, which does not give correct line breaks,
        // and captions are mixed in with the content text.
        // Repeat the process until the end of the file.
    }
}
with thanks
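One possible way to approach the header question is sketched below. It is only a sketch under a loud assumption: that every header line has exactly the shape "Category, Doc<number>, <date>", which the question suggests but does not confirm. The `NewsDoc` and `NewsFileSplitter` names and the regular expression are hypothetical, introduced only for illustration. The sketch treats a line as a header when it matches the pattern, collects every following line as body text until the next header, and yields one doc at a time, so the whole file never has to be held in memory.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical container for one news doc; not part of Lucene.
public sealed class NewsDoc
{
    public string Category;
    public string Id;
    public string Date;
    public string Body;
}

public static class NewsFileSplitter
{
    // ASSUMPTION: a header looks like "Category, Doc<number>, <date>".
    // Adjust this pattern to the real header layout of the file.
    private static readonly Regex Header = new Regex(
        @"^\s*(?<category>[^,]+?)\s*,\s*(?<id>Doc\d+)\s*,\s*(?<date>.+?)\s*$",
        RegexOptions.Compiled);

    // Streams the docs one by one while reading the input line by line.
    public static IEnumerable<NewsDoc> ReadDocs(TextReader reader)
    {
        NewsDoc current = null;
        var body = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            Match m = Header.Match(line);
            if (m.Success)
            {
                // A new header closes the previous doc, if there is one.
                if (current != null)
                {
                    current.Body = body.ToString().Trim();
                    yield return current;
                }
                current = new NewsDoc
                {
                    Category = m.Groups["category"].Value,
                    Id = m.Groups["id"].Value,
                    Date = m.Groups["date"].Value
                };
                body.Length = 0;
            }
            else if (current != null)
            {
                // OCR line breaks are unreliable, so body lines are joined with spaces.
                body.Append(line.Trim()).Append(' ');
            }
            // Lines that appear before the first header are ignored.
        }
        // The last doc has no following header, so it is emitted at end of input.
        if (current != null)
        {
            current.Body = body.ToString().Trim();
            yield return current;
        }
    }
}
```

Each `NewsDoc` produced this way could then be turned into a Lucene `Document` and passed to `IndexWriter.AddDocument`, with the metadata and the body stored as separate fields. The sketch does not try to separate captions from content text, and OCR noise inside a header line would make the pattern miss it, so the pattern probably needs loosening for real data.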
© Stack Overflow or respective owner