Extracting dates from html meta data in FAST-ESP
- by Neil
During document processing I want to extract all dates from html meta data and then identify the latest date which will be used to populate a date field (dtgeneric1).
<meta name="OriginalPublicationDate" content="2010/04/21 12:06:36" />
<meta name="LastModificationDate" content="2010/04/22 14:10:16" />
+ other non-date meta data
Inspection using spy stages shows that our pipeline already adds meta_* attributes but the meta data names will be different across documents from different sources.
#### ATTRIBUTE meta_originalpublicationdate <class 'docproc.DocumentAttributes.TextChunks'>: 2010/04/21 12:06:36
#### ATTRIBUTE meta_lastmodificationdate <class 'docproc.DocumentAttributes.TextChunks'>: 2010/04/22 14:10:16
+ other non-date meta attributes
Ideally we would like to pass all the meta_* attributes to a Python stage and use that to work out which are dates and which is the largest but there seems to be no way of specifying "all meta attributes" as input.
Has anyone done something similar and can offer any advice on the best way to do this.
Thanks
Neil