Extracting dates from HTML meta data in FAST-ESP

Posted by Neil on Stack Overflow
Published on 2010-04-21T19:55:48Z Indexed on 2010/05/09 8:48 UTC

During document processing, I want to extract all dates from the HTML meta data and identify the latest one, which will be used to populate a date field (dtgeneric1).

<meta name="OriginalPublicationDate" content="2010/04/21 12:06:36" />
<meta name="LastModificationDate" content="2010/04/22 14:10:16" />
+ other non-date meta data
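
If the raw HTML is available to a stage, one option might be to collect the name/content pairs directly rather than rely on the names being known in advance. The following is only a sketch, assuming a Python 3 interpreter with the standard html.parser module (the interpreter embedded in ESP may be older); MetaCollector and collect_meta are hypothetical names, not part of FAST ESP.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects name/content pairs from <meta> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        # Self-closing tags such as <meta ... /> are routed here as well.
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name")
        content = attrs.get("content")
        if name and content is not None:
            # Lowercase the name to mirror the meta_* attribute naming.
            self.meta[name.lower()] = content


def collect_meta(html):
    """Return a dict mapping lowercased meta names to their content."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return parser.meta
```

This keeps every meta tag, date or not; deciding which values are dates is a separate step.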

Inspection using spy stages shows that our pipeline already adds meta_* attributes, but the meta data names differ across documents from different sources.

#### ATTRIBUTE meta_originalpublicationdate <class 'docproc.DocumentAttributes.TextChunks'>: 2010/04/21 12:06:36
#### ATTRIBUTE meta_lastmodificationdate <class 'docproc.DocumentAttributes.TextChunks'>: 2010/04/22 14:10:16
+ other non-date meta attributes

Ideally, we would like to pass all the meta_* attributes to a Python stage and use it to work out which are dates and which is the latest, but there seems to be no way of specifying "all meta attributes" as input.
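
To make the intended logic concrete, here is a rough sketch of the date-selection step on its own. It assumes the attributes have somehow been gathered into a plain dict of name to string value, which is exactly the part we do not know how to do inside a stage. DATE_FORMATS, parse_date and latest_meta_date are hypothetical names, and the format list is a guess based on the sample values above.

```python
from datetime import datetime

# Assumed formats; extend this tuple for the conventions of each source.
DATE_FORMATS = (
    "%Y/%m/%d %H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
    "%Y/%m/%d",
    "%Y-%m-%d",
)


def parse_date(text):
    """Return a datetime if text matches a known format, else None."""
    text = text.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


def latest_meta_date(attributes, prefix="meta_"):
    """Return the latest date among attributes whose name starts with prefix."""
    dates = []
    for name, value in attributes.items():
        if not name.startswith(prefix):
            continue
        parsed = parse_date(str(value))
        if parsed is not None:
            dates.append(parsed)
    # None signals that no meta attribute held a recognisable date.
    return max(dates) if dates else None
```

With the two sample attributes above, this would pick the LastModificationDate value, which could then be formatted as required and written to dtgeneric1.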

Has anyone done something similar, and can you offer any advice on the best way to do this?

Thanks

Neil
