Which metadata should I save when downloading web pages?
Posted by Vojtech R. on Stack Overflow
Published on 2010-04-12T17:07:36Z
Indexed on 2010/04/12 17:13 UTC
Hi,
I'm going to download a few thousand web pages (for later use in language processing), and I'm trying to decide which metadata to save. I have looked into this, but I don't want to overlook anything important. So far I have:
<title>
<link>
<publish_date>
<date_downloaded>
<source> // to this page
<keyword> // for Solr indexing
<text> // cleaned body of page
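To make the list concrete, here is a rough sketch of how one page could be stored as a record and written out as a line of JSON. The names `PageRecord` and `to_json` are made up for illustration, the sample values are placeholders, and treating `source` as the page the link was found on is only an assumption about what that field holds.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class PageRecord:
    """One downloaded page; field names mirror the tags listed above."""
    title: str
    link: str                 # URL the page was fetched from
    source: str               # assumption: where the link was found
    text: str                 # cleaned body of the page
    publish_date: Optional[str] = None   # ISO 8601 string; may be unknown
    # Filled in automatically with the current UTC time if not given.
    date_downloaded: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    keywords: List[str] = field(default_factory=list)  # e.g. for Solr

    def to_json(self) -> str:
        """Serialize the record as a single JSON line."""
        return json.dumps(asdict(self), ensure_ascii=False)


# Placeholder values, not real data.
record = PageRecord(
    title="Example page",
    link="http://example.com/article",
    source="http://example.com/",
    text="Cleaned body text goes here.",
    publish_date="2010-04-12",
    keywords=["example"],
)
line = record.to_json()
```

Writing one such JSON line per page keeps the dump easy to append to and to re-read later, but any other container (XML, a database row) would carry the same fields.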
Is there anything important that I might miss later?
© Stack Overflow or respective owner