Need suggestions on how to extract data from a .docx/.doc file and store it in MSSQL

Posted by DarkPP on Stack Overflow
Published on 2011-06-24T07:34:32Z Indexed on 2011/06/24 8:22 UTC

I'm supposed to develop an automated application for my project. It will load a past-year examination/exercise paper (a Word file), detect the sections, extract the questions and images in each section, and then store the questions and images in the database. (A preview of the question paper is at the bottom of this post.)

So I need some suggestions on how to extract data from a Word file and then insert it into a database. I currently have two methods in mind, but I have no idea how to implement them when the file contains text boxes with background images. Each question has to be linked to its image.

Method One (using MS Office interop)

  • Load the Word file -> Extract the images, save them into a folder -> Extract the text, save it as .txt -> Read the text from the .txt, then store it in the db

Question: How do I detect the sections and questions? How do I link each image to its question?
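As a rough sketch of one way the detection could work on the extracted text: the block below assumes section headings look like "Section A" and questions start with a number followed by "." or ")". Those patterns are guesses rather than something taken from the actual paper, and the `Question` and `PaperParser` names are made up for illustration. It does not address linking images to questions.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container for one parsed question.
public class Question
{
    public string Section;   // heading the question was found under, or null
    public int Number;
    public string Text;
}

public static class PaperParser
{
    // Assumed layouts (not taken from the real paper): "Section A" and "1." / "1)".
    private static readonly Regex SectionLine =
        new Regex(@"^\s*Section\s+([A-Z])\b", RegexOptions.IgnoreCase);
    private static readonly Regex QuestionLine =
        new Regex(@"^\s*(\d+)[.)]\s+(.*)$");

    public static List<Question> Parse(IEnumerable<string> lines)
    {
        var questions = new List<Question>();
        string section = null;
        Question current = null;
        foreach (string line in lines)
        {
            Match s = SectionLine.Match(line);
            if (s.Success)
            {
                section = s.Groups[1].Value.ToUpperInvariant();
                current = null;   // a new section ends the previous question
                continue;
            }
            Match q = QuestionLine.Match(line);
            if (q.Success)
            {
                current = new Question
                {
                    Section = section,
                    Number = int.Parse(q.Groups[1].Value),
                    Text = q.Groups[2].Value.Trim()
                };
                questions.Add(current);
            }
            else if (current != null && line.Trim().Length > 0)
            {
                // Continuation line: append it to the question in progress.
                current.Text += " " + line.Trim();
            }
        }
        return questions;
    }
}
```

It could be fed the text file produced by the interop code, for example `PaperParser.Parse(File.ReadLines(@"C:\temp\temp.txt"))`. Anything that doesn't match the two patterns is treated as a continuation of the previous question, so the patterns would need adjusting to the real paper layout.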

Extract text from Word file (working):

    // Requires a reference to Microsoft.Office.Interop.Word and:
    // using Word = Microsoft.Office.Interop.Word;
    private object missing = Type.Missing;
    private object sFilename = @"C:\temp\questionpaper.docx";
    private object sFilename2 = @"C:\temp\temp.txt";
    private object readOnly = true;
    private object fileFormat = Word.WdSaveFormat.wdFormatText;

    private void button1_Click(object sender, EventArgs e)
    {
        Word.Application wWordApp = new Word.Application();
        try
        {
            wWordApp.DisplayAlerts = Word.WdAlertLevel.wdAlertsNone;
            Word.Document dFile = wWordApp.Documents.Open(ref sFilename,
                                    ref missing, ref readOnly, ref missing, ref missing,
                                    ref missing, ref missing, ref missing, ref missing,
                                    ref missing, ref missing, ref missing, ref missing,
                                    ref missing, ref missing, ref missing);

            // Save a plain-text copy of the document, then close it.
            dFile.SaveAs(ref sFilename2, ref fileFormat, ref missing, ref missing,
                ref missing, ref missing, ref missing, ref missing, ref missing,
                ref missing, ref missing, ref missing, ref missing, ref missing,
                ref missing, ref missing);
            dFile.Close(ref missing, ref missing, ref missing);
        }
        finally
        {
            // Quit Word so no WINWORD.EXE process is left running.
            object saveChanges = false;
            wWordApp.Quit(ref saveChanges, ref missing, ref missing);
        }
    }

Extract images from Word file (doesn't work on images inside a text box):

    // Requires references to Microsoft.Office.Interop.Word and Microsoft.VisualBasic, plus:
    // using System.Drawing;
    // using Microsoft.VisualBasic.Devices;
    // using Word = Microsoft.Office.Interop.Word;
    private Word.Application wWordApp;
    private object missing = Type.Missing;
    private object filename = @"C:\temp\questionpaper.docx";
    private object readOnly = true;

    // Copies one inline shape to the clipboard and saves it as a PNG.
    private void CopyFromClipboardInlineShape(int imageIndex)
    {
        Word.InlineShape inlineShape = wWordApp.ActiveDocument.InlineShapes[imageIndex];
        inlineShape.Select();
        wWordApp.Selection.Copy();
        Computer computer = new Computer();
        System.Windows.Forms.IDataObject data = computer.Clipboard.GetDataObject();
        if (data != null && data.GetDataPresent(System.Windows.Forms.DataFormats.Bitmap))
        {
            using (Image image = (Image)data.GetData(System.Windows.Forms.DataFormats.Bitmap, true))
            {
                image.Save(@"C:\temp\DoCremoveImage" + imageIndex + ".png", System.Drawing.Imaging.ImageFormat.Png);
            }
        }
    }

    private void button1_Click(object sender, EventArgs e)
    {
        wWordApp = new Word.Application();
        try
        {
            wWordApp.Documents.Open(ref filename,
                                    ref missing, ref readOnly, ref missing, ref missing,
                                    ref missing, ref missing, ref missing, ref missing,
                                    ref missing, ref missing, ref missing, ref missing,
                                    ref missing, ref missing, ref missing);

            // The InlineShapes collection is 1-based.
            for (int i = 1; i <= wWordApp.ActiveDocument.InlineShapes.Count; i++)
            {
                CopyFromClipboardInlineShape(i);
            }
        }
        finally
        {
            // Quit Word without saving, even if opening or copying failed.
            object save = false;
            wWordApp.Quit(ref save, ref missing, ref missing);
            wWordApp = null;
        }
    }

Method Two

  • Unzip the Word file (.docx) -> Copy the media (image) folder and store it somewhere -> Parse the XML file -> Store the text in db
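The unzip-and-parse steps of Method Two could be sketched roughly as below. This is a minimal sketch, assuming .NET 4.5 or later (for `System.IO.Compression.ZipArchive`) and a .docx file only, since the older .doc format is not a zip package. The `DocxReader` class and its method names are made up for illustration, and the database step is left out. One caveat: text inside text boxes is nested within an outer paragraph in document.xml, so it may show up more than once in the result.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: reads a .docx directly as a zip package.
public static class DocxReader
{
    // WordprocessingML main namespace used by word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of each non-empty paragraph (w:p) in document order.
    public static List<string> ReadParagraphs(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found");
            using (Stream xml = entry.Open())
            {
                return XDocument.Load(xml)
                    .Descendants(W + "p")
                    .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
                    .Where(text => text.Trim().Length > 0)
                    .ToList();
            }
        }
    }

    // Copies every file under word/media/ into outputDir; returns the saved paths.
    public static List<string> ExtractMedia(Stream docx, string outputDir)
    {
        Directory.CreateDirectory(outputDir);
        var saved = new List<string>();
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, true))
        {
            foreach (ZipArchiveEntry entry in zip.Entries)
            {
                bool isMedia = entry.FullName.StartsWith("word/media/", StringComparison.OrdinalIgnoreCase);
                if (!isMedia || entry.Name.Length == 0)
                    continue;
                // Use only the file name so entries cannot write outside outputDir.
                string target = Path.Combine(outputDir, entry.Name);
                using (Stream source = entry.Open())
                using (FileStream dest = File.Create(target))
                    source.CopyTo(dest);
                saved.Add(target);
            }
        }
        return saved;
    }
}
```

Usage would be along the lines of opening the file with `File.OpenRead(@"C:\temp\questionpaper.docx")` and passing the stream to both methods. Matching each picture to its position in the text would additionally need the relationship ids in word/_rels/document.xml.rels, which this sketch does not read.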

Any suggestions/help would be greatly appreciated :D

Preview of the Word file (backup link: http://i.stack.imgur.com/YF1Ap.png)

© Stack Overflow or respective owner
