Extracting dates from HTML meta data in FAST-ESP
Posted by Neil on Stack Overflow
Published on 2010-04-21T19:55:48Z
Indexed on 2010/05/09 8:48 UTC
fast-esp
During document processing I want to extract all dates from the HTML meta data and then identify the latest one, which will be used to populate a date field (dtgeneric1). For example:
<meta name="OriginalPublicationDate" content="2010/04/21 12:06:36" />
<meta name="LastModificationDate" content="2010/04/22 14:10:16" />
+ other non-date meta data
Inspection using spy stages shows that our pipeline already adds meta_* attributes, but the meta data names differ across documents from different sources:
#### ATTRIBUTE meta_originalpublicationdate <class 'docproc.DocumentAttributes.TextChunks'>: 2010/04/21 12:06:36
#### ATTRIBUTE meta_lastmodificationdate <class 'docproc.DocumentAttributes.TextChunks'>: 2010/04/22 14:10:16
+ other non-date meta attributes
Ideally we would like to pass all the meta_* attributes to a Python stage and use it to work out which are dates and which is the latest, but there seems to be no way of specifying "all meta attributes" as input.
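For what it's worth, the date-picking part on its own looks straightforward. Here is a minimal sketch, assuming the attributes were somehow available to the stage as a plain dict of name to string value. The function names, the dict, and the extra date formats are my own assumptions for illustration (only the first format appears in the samples above), and nothing here uses the docproc API, so it does not solve the "all meta attributes" input problem:

```python
from datetime import datetime

# Assumed date layouts; only the first one appears in the sample meta tags.
DATE_FORMATS = ("%Y/%m/%d %H:%M:%S", "%Y-%m-%d %H:%M:%S", "%Y/%m/%d")


def parse_date(value):
    """Return a datetime if value matches a known format, else None."""
    text = str(value).strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


def latest_meta_date(attributes, prefix="meta_"):
    """Return the most recent date among attributes whose name starts with prefix."""
    dates = []
    for name, value in attributes.items():
        if not name.startswith(prefix):
            continue  # skip attributes that did not come from meta tags
        parsed = parse_date(value)
        if parsed is not None:  # non-date meta values are silently ignored
            dates.append(parsed)
    return max(dates) if dates else None


# Hypothetical stand-in for the document's attributes.
sample = {
    "meta_originalpublicationdate": "2010/04/21 12:06:36",
    "meta_lastmodificationdate": "2010/04/22 14:10:16",
    "meta_author": "someone",        # non-date meta value
    "title": "2011/01/01 00:00:00",  # ignored: no meta_ prefix
}
latest = latest_meta_date(sample)
print(latest.isoformat())
```

The result would still need converting to whatever format dtgeneric1 expects, which I have not confirmed.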
Has anyone done something similar, and can you offer any advice on the best way to do this?
Thanks
Neil
© Stack Overflow or respective owner