Do not filter outlinks in Nutch?
Posted by sigpwned on Stack Overflow
Published on 2013-10-28T03:51:46Z
Indexed on 2013/10/28 3:53 UTC
nutch
I'm trying to perform a deep crawl of a small list of sites. To accomplish this, I added the domains of the sites I want to scrape to conf/domain-urlfilter.txt, which worked nicely. However, I found that the filter was applied not only to the URLs crawled at each step, but also to the outlinks captured from each crawled page.
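For illustration, a domain filter file of this kind might look like the sketch below. The domains are placeholders, not the actual sites from the question, and the one-entry-per-line layout with `#` comments is assumed from the sample file that ships with Nutch.

```text
# conf/domain-urlfilter.txt (hypothetical contents)
# One domain, domain suffix, or hostname per line.
example.com
example.org
```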
Is there a way to avoid filtering captured outlinks while still filtering crawled URLs?
© Stack Overflow or respective owner