I have a large number of free-text news docs in one big file. Each news doc is structured like this:
(Header line) Category, Doc1, Date (day, month, year)
(body text)
...
...
...
(Header line) Category, Doc2, Date (day, month, year)
(body text)
...
...
...
Extracting each doc from the big file first takes too much time and is not efficient. Therefore, I decided to read the file line by line and feed the information to Lucene at the same time.
I wrote C# code to index each doc into Lucene like this:
using System.IO;

using (StreamReader sr = new StreamReader(file))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        // How can I tell whether this line is a doc header line or a body text line,
        // and collect the metadata and all the text lines of a doc for Lucene to index?
        // Also, the text was produced by OCR, which does not give correct line breaks,
        // and captions are mixed in with the content text.
        // The process repeats until the end of the file.
    }
}
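To make the goal concrete, here is a rough sketch of the kind of loop I am after. It is only a sketch under assumptions: the header pattern (something like "Sports, Doc12, 5 3 2009"), the NewsDoc class, and the NewsSplitter.Read helper are all made up for illustration, and the real header layout may need a different pattern. It does not deal with the caption problem at all, and any lines before the first header are ignored.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical container for one parsed news doc (not a Lucene type).
public sealed class NewsDoc
{
    public string Category = "";
    public string Id = "";
    public string Date = "";
    public string Body = "";
}

public static class NewsSplitter
{
    // ASSUMPTION: a header looks like "Sports, Doc12, 5 3 2009".
    // Adjust this pattern to the real header layout.
    private static readonly Regex Header = new Regex(
        @"^\s*(?<cat>[^,]+),\s*(?<id>Doc\d+),\s*(?<date>\d{1,2}\s+\d{1,2}\s+\d{4})\s*$",
        RegexOptions.Compiled);

    // Streams docs one at a time, so the big file is never split up on disk.
    public static IEnumerable<NewsDoc> Read(TextReader reader)
    {
        NewsDoc current = null;
        var body = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            Match m = Header.Match(line);
            if (m.Success)
            {
                // A new header closes the previous doc, if there is one.
                if (current != null)
                {
                    current.Body = body.ToString().Trim();
                    yield return current;
                }
                current = new NewsDoc
                {
                    Category = m.Groups["cat"].Value.Trim(),
                    Id = m.Groups["id"].Value,
                    Date = m.Groups["date"].Value
                };
                body.Clear();
            }
            else if (current != null)
            {
                // OCR line breaks are unreliable, so join body lines with spaces.
                body.Append(line.Trim()).Append(' ');
            }
        }
        // Flush the last doc at the end of the file.
        if (current != null)
        {
            current.Body = body.ToString().Trim();
            yield return current;
        }
    }
}
```

With something like this, a foreach over NewsSplitter.Read(new StreamReader(file)) would give one doc at a time, and each one could be turned into a Lucene document with the category, id and date as metadata fields and the body as the indexed text.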
with thanks