Search Results

Search found 446 results on 18 pages for 'crawl'.

Page 10/18 | < Previous Page | 6 7 8 9 10 11 12 13 14 15 16 17 | Next Page >

How to make a jar file run on startup & and when you log out?

- by RanZilber

I have no idea where to start looking. I've been reading about daemons and didn't understand the concept. More details : I've been writing a crawler which never stops and crawlers over RSS in the internet. The crawler has been written in java - therefore its a jar right now. I'm an administrator on a machine that has Ubuntu 11.04 . There is some…

Read the article
Ubuntu slows down even after cpu-intensive process is ended

- by Matt2

After a Skype video call, or the use of virtualbox, Ubuntu slows down to a crawl, even after the process is ended. Running htop reveals that processes that used little CPU before are now all using about 30% cpu (namely Compiz, Firefox, Python, and Skype, but I'm sure there are others), to the point where all my cores are at 99%. All I can do from…

Read the article
Restricting A Directory Through .htaccess

- by Whitechapel

I'm trying to put all of my FTP accounts into a folder on /public_html/ftp and password protect it so search bots can't crawl their private files. I'm also trying to redirect all site traffic from the non-www to www. I keep getting 500 errors when accessing the site, and I need to point it to www.vivalanation.com/ftp to www.vivalanation.com/ftp/,…

Read the article
Site returning 404 header to google, not sure why

- by Damon

A Drupal site that works fine for regular users returns a 404 not found error when I try to use the W3C validator on it; it is also not being indexed by google at all (which is the main issue but I suspect there is a connection). It is a https:// site with .htaccess rule to redirect any http:// request to the https://. I had had it running in…

Read the article
If C-Panel Indexing Manager sets a folder to "No Indexing" can it be crawled by a webcrawler?

- by Graham

People are able to view directories / folders on my site right now. So, they could go to mysite.com/images and see the full index. To prevent this, C-Panel offers an option to set a directory / folder to "No Indexing" under the "Index Manager." Will this option allow webcrawlers to crawl / index the images? Or, is there a simpler…

Read the article
Hide from google while developing

- by user210757

I will be building a (wordpress) web site. While I am developing, other team members will be pushing content. I'd like to have it hidden from google while under development. It will be hosted on godaddy. I have thought of not pointing the domain name to it until live and using "preview dns", or buying a static IP during…

Read the article
SEO Influenc search result per device class (mobile/desktop)

- by user32224

We're currently building a new responsive website and while working on the site map figured that we don't want to show certain sections on mobile devices. This can be easily done by hiding the navigation parts using css/media queries. However, trouble is that the hidden sites would still show up in search engines' search…

Read the article
Exclude pages from search results based on device class (mobile/desktop)

- by user32224

We're currently building a new responsive website. While working on the site map, we figured that we don't want to show certain sections on mobile devices. This can be easily done by hiding the navigation parts using CSS/media queries. However, the trouble is that the hidden sites would still show up in search engine…

Read the article
How to write a crawler?

- by Jason

Hi All, I have had thoughts of trying to write a simple crawler that might crawl and produce a list of its findings for our NPO's websites and content. Does anybody have any thoughts on how to do this? Where do you point the crawler to get started? How does it send back its findings and still keep crawling? How does…

Read the article
need help in site classification

- by goh

hi guys, I have to crawl the contents of several blogs. The problem is that I need to classify whether the blogs the authors are from a specific school and is talking about the school's stuff. May i know what's the best approach in doing the crawling or how should i go about the classification?

Read the article
Oracle Secure Enterprise Search(SES) Intranet crawling problem.

- by vipin k.

I am using oracle Oracle Secure Enterprise Search(SES), and using the crawler to crawl the Intranet site. but i am getting the error. EQG-30008: http://site-name/: Not found I have added the Log on password and user name and also added the proxy settings. Any body who worked on SES crawling,please look in.

Read the article
C# Parsing html for general use?

- by Wardy

What is the best way to take a string of html and turn it in to something useful? Essentially if i take a url and go get the html from that url in .net i get a response but this would come in the form of either a file or stream or string. What if i want an actual document or something I can crawl like an xmldocument…

Read the article
Backlink-reporting website crawler?

- by Stewart

What tools are there out there to crawl a website and report, for each page, a list of pages within the website that link to it?

Read the article
Multiple SiteMap: entries in robots.txt?

- by user306942

I have been searching around using Google but I can't find an answer to this question. A robots.txt file can contain the following line: Sitemap: http://www.mysite.com/sitemapindex.xml but is it possible to specify MULTIPLE sitemap index files in the robots.txt and have the search engines recognize that and crawl…

Read the article
Website content crawling

- by klork

We have a Business Listings directory hosted on IIS 6 Windows 2003. Our competitors crawl and steal our content and customers. We have tried IP blocking using honeypot URLs and log parsing without much success. Is anyone aware of a network device or a proxy server that I can run in front of my web server to…

Read the article
Nutch crawling with seeds urls are in range

- by user365345

Some site have url pattern as www..com/id=1 to www..com/id=1000. How can I crawl the site using nutch. Is there any why to provide seed for fetching in range??

Read the article
file_get_contents VS CURL, what has better performance?

- by ahmed

I am using PHP to build a web crawler, to crawl millions of URLs, what is better for me in terms of performance?file_get_contents or CURL? Thanks

Read the article
how to prevent all crawlers except good ones (google, bing, yahoo) access website content?

- by tranhuyhung

I just want to let Google, Bing, Yahoo crawl my website to build indexes. But I do not want my opposite website use crawling service to steal my website content. What should I do?

Read the article
Configure HTTP Post data input to Nutch before crawling a site

- by user365345

I have to crawl a site which list item based on user input through http post submission. How to configure post http submission details in Nutch. I got help on how to do HttpPostAuthentication, but I got no help on "how to do post data submit other than username and password".

Read the article
is it possible to extract all PDFs from a site

- by deming

given a URL like www.mysampleurl.com is it possible to crawl through the site and extract links for all PDFs that might exist? I've gotten the impression that Python is good for this kind of stuff. but is this feasible to do? how would one go about implementing something like this? also, assume that the…

Read the article
Mod_rewrite - How to tell Google to dynamically delete pages from their index after 7 days

- by Sattvic

Search engines like to crawl and index webpages or URLs, but what if your webpages/URLs have expired content and you do not want them to be indexed after so many days? Can you put an expiration in the URL and have mod_rewrite 301 redirect pages after a given expiration date? Or maybe a cron job to add a…

Read the article
A good open source web crawler for indexing Specific website for specific contents?

- by Peeyush

Hello Please suggest me a good open source web crawler written in C++,JAVA or PHP. i just need to crawl/index some specific websites for specific contents(images,text,videos). i know that their are already a lot of question & answers about this topic on this website but i am a little confused…

Read the article
Investment advice data dump analysis

- by portoalet

For my year-end pet project, I'd like to analyze investment advices and their correlation to the stock market performance. The problem is, where do I get the dump of investment advice data (free) ? something like stackoverflow.com data dump will be nice. Or maybe it's easier to do distributed…

Read the article
How to do 404 link testing through selenium rc for complete website?

- by user1726460

How can i verify a complete website's link(mostly links that are redirecting to 404 page) by using selenium Rc. Previously i tried to do this thong by using xenu and web link validator but in there results most of the links are showing 500 internal serevr error.and for the pages they are showing…

Read the article
how can i find unused css in ajax app?

- by Haroldo

I've been searching and i can't find any ff addons or javascript for finding unused css in ajax apps. dust-me selectors can do a site-crawl, but i'm looking for something that examines loaded-in content... I'd like something where i can press 'record' and then make a load of clicks which…

Read the article

< Previous Page | 6 7 8 9 10 11 12 13 14 15 16 17 | Next Page >