Parsing two-dimensional text

Posted by alexbw on Stack Overflow
Published on 2010-05-02T16:52:32Z

I need to parse text files where relevant information is often spread across multiple lines in a nonlinear way. An example:

1234
 1         IN THE SUPERIOR COURT OF THE STATE OF SOME STATE           
 2              IN AND FOR THE COUNTY OF SOME COUNTY                
 3                      UNLIMITED JURISDICTION                        
 4                            --o0o--                                 
 5                                                                    
 6   JOHN SMITH and JILL SMITH,         )                             
                                        )                             
 7                  Plaintiffs,         )                             
                                        )                             
 8        vs.                           )     No. 12345
                                        )                             
 9   ACME CO, et al.,                   )                             
                                        )                             
10                  Defendants.         )                             
     ___________________________________)                             

I need to pull out the plaintiff and defendant identities.

These transcripts come in a wide variety of formats, so I can't always count on those nice parentheses being there, or on the plaintiff and defendant information being neatly boxed off. For example:

 1        SUPREME COURT OF THE STATE OF SOME OTHER STATE
                      COUNTY OF COUNTYVILLE
 2                  First Judicial District
                     Important Litigation
 3  --------------------------------------------------X
    THIS DOCUMENT APPLIES TO:
 4
    JOHN SMITH,
 5                            Plaintiff,          Index No.
                                                  2000-123
 6
                                            DEPOSITION
 7                  - against -             UNDER ORAL
                                            EXAMINATION
 8                                              OF
                                            JOHN SMITH,
 9                                           Volume I

10  ACME CO,
    et al,
11                            Defendants.

12  --------------------------------------------------X

The two constants are:

  1. "Plaintiff" will occur after the name of the plaintiff(s), but not necessarily on the same line.
  2. Plaintiffs' and defendants' names will be in upper case.
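
To make those two constants concrete, here is a rough Python sketch of one heuristic they suggest: find the keyword, then walk upward collecting upper-case text that starts to the left of it. It is only a sketch under several assumptions that go beyond the two constants: the left-margin numbers are transcript line numbers, columns are separated by two or more spaces, names start to the left of the keyword, "and" and "et al" are the only lower-case words allowed inside a name, and "Defendant" behaves like "Plaintiff". The names `find_parties`, `is_name_like` and `max_lookback` are made up for illustration.

```python
import re

KEYWORD = re.compile(r"\b(plaintiff|defendant)s?\b", re.IGNORECASE)
SEGMENT = re.compile(r"\S+(?: \S+)*")      # words separated by single spaces
LINE_NO = re.compile(r"^\s*\d+(?=\s|$)")   # transcript line number in the margin
CONNECTIVES = {"and", "et", "al"}          # lower-case words tolerated in a name


def is_name_like(segment):
    """True if every word is upper case or a tolerated connective."""
    words = re.findall(r"[A-Za-z]+", segment)
    return bool(words) and all(
        w.isupper() or w.lower() in CONNECTIVES for w in words
    )


def find_parties(text, max_lookback=6):
    """Map 'plaintiff' / 'defendant' to the upper-case text found above
    (or to the left of) the first occurrence of each keyword."""
    # Blank out margin line numbers but keep every column position intact.
    lines = [LINE_NO.sub(lambda m: " " * len(m.group()), ln)
             for ln in text.splitlines()]
    parties = {}
    for row, line in enumerate(lines):
        match = KEYWORD.search(line)
        if not match:
            continue
        role = match.group(1).lower()
        if role in parties:
            continue  # keep only the first hit, which should be the caption
        col = match.start()
        collected = []
        # Walk upward, starting with the part of the keyword line to its left.
        for i in range(row, max(row - max_lookback, -1), -1):
            visible = lines[i][:col] if i == row else lines[i]
            names = [s.group() for s in SEGMENT.finditer(visible)
                     if s.start() < col and is_name_like(s.group())]
            if names:
                collected.insert(0, " ".join(names))
            elif collected:
                break  # first line without a name ends the block
        if collected:
            parties[role] = " ".join(collected).rstrip(" ,.;:")
    return parties
```

Calling `find_parties(transcript_text)` returns a dict such as `{"plaintiff": ..., "defendant": ...}`. Comparing column positions rather than reading line by line is what keeps a right-hand column (the "DEPOSITION UNDER ORAL EXAMINATION" block in the second sample) from being mistaken for a party name. A caption laid out differently, for instance with the names to the right of the keyword, would need a different rule.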

Any ideas?
