Which metadata should I save when downloading web pages?

Posted by Vojtech R. on Stack Overflow
Published on 2010-04-12T17:07:36Z Indexed on 2010/04/12 17:13 UTC

Filed under: solr | web-crawler | download

Hi,

I'm going to download a few thousand web pages for future language processing, and I'm deciding which metadata to save with each one. I've looked into this, but I don't want to overlook anything important. So far I have:

<title>
<link>
<publish_date>
<date_downloaded>
<source>  // to this page
<keyword> // for Solr indexing
<text>    // cleaned body of page

Is there anything important that I might miss later?
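
For illustration, the fields listed above could be stored as one record per page. This is only a minimal sketch, not a recommended schema: the class name `PageRecord`, the use of ISO 8601 strings for dates, and the reading of `<source>` as the page that linked to this one are all assumptions, and `<keyword>` is modelled as a list so that it could map onto a multi-valued Solr field.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
from typing import List, Optional


def _utc_now_iso() -> str:
    """Return the current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class PageRecord:
    """One downloaded page; field names follow the list in the question."""
    title: str
    link: str                      # URL of the downloaded page
    source: str                    # assumed: the page that linked to this one
    text: str                      # cleaned body of the page
    publish_date: Optional[str] = None            # may be unknown for some pages
    date_downloaded: str = field(default_factory=_utc_now_iso)
    keywords: List[str] = field(default_factory=list)  # for Solr indexing

    def to_json(self) -> str:
        """Serialize the record to a JSON string, keeping non-ASCII text as-is."""
        return json.dumps(asdict(self), ensure_ascii=False)


# Hypothetical example values, not real data.
record = PageRecord(
    title="Example page",
    link="http://example.com/article",
    source="http://example.com/",
    text="Cleaned body text.",
    publish_date="2010-04-12T17:07:36Z",
    keywords=["solr"],
)
print(record.to_json())
```

Storing one JSON object per page keeps the metadata next to the cleaned text and leaves room to add fields later without changing the records already saved.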

© Stack Overflow or respective owner
