Search Results

Search found 261 results on 11 pages for 'crawler'.

Page 3/11

Ways of Gathering Event Information From the Internet

- by Ciwan

What are the best ways of gathering information on events (of any type) from the internet, keeping in mind that different websites present information in different ways? I was thinking of 'smart' web crawlers, but that could turn out to be extremely challenging, simply because of the hugely varied ways that different sites present their information. Then I thought of sifting through the official Twitter feeds of organisations, people with knowledge of events, etc., looking for the event hashtag, grabbing the tweet and dissecting it to extract the relevant information about the event. The information I am interested in gathering is: date and time of the event, the address where the event is being held, and any celebrities (or other famous people) attending. I'm asking here in the hope that experienced folk will open my eyes to things I've missed, which I am sure I have.

Is there any software to clip a website, make changes to the code and republish it?

- by user1445919

I work on the front end of an application, and we provide the interface between the customers and several backend services. We have been using Kapow software to clip the HTML/JSP code we receive from the backend, make the necessary changes and publish it on the main website. I wanted to know if there is any alternative to this software that meets our requirements. Also, are any of those open source?

Does the google crawler really guess URL patterns and index pages that were never linked against?

- by Dominik

I'm experiencing problems with indexed pages which were (probably) never linked to. Here's the setup:

1. Data-Server: application with a RESTful interface which provides the data
2. Website A: provides the data of (1) at http://website-a.example.com/?id=RESOURCE_ID
3. Website B: provides the data of (1) at http://website-b.example.com/?id=OTHER_RESOURCE_ID

So all of the non-private data is stored on (1), and the websites (2) and (3) can fetch and display this data, which is a representation of the data with additional cross-linking between those. In fact, the URL /?id=1 of website-a points to the same resource as /?id=1 of website-b. However, the resource id:1 is useless at website-b. Unfortunately, the Google index for website-b now contains several links to resources belonging to website-a and vice versa.

I "heard" that the Google crawler tries to determine the URL pattern (which makes sense for deciding which page should go into the index and which not) and furthermore guesses other URLs by trying different values (like "I know that id 1 exists, let's try 2, 3, 4, ..."). Is there any evidence that the Google crawler really behaves that way (which I doubt)? My guess is that the Google crawler submitted an HTML form and somehow got links to those unwanted resources. I found some similar questions about this, including "Google webmaster central: indexing and posting false pages" [link removed]; however, none of those pages gives any evidence.

How to create a web crawler/spider/robot?

- by Chris

Is there a way to make a web robot like websiteoutlook.com does? I need something that searches the internet for URLs only...I don't need links, descriptions, etc. What is the best way to do this without getting too technical? I guess it could even be a cronjob that runs a PHP script grabbing URLs from Google, or is there a better way? A simple example or a link to more information would be much appreciated.

Creating a spam list with a web crawler in python

- by user313623

Hey guys, I'm not trying to do anything malicious here, I just need to do some homework. I'm a fairly new programmer, I'm using Python 3.0, and I'm having difficulty using recursion for problem-solving. I've been stuck on this question for quite a while. Here's the assignment:

Write a recursive method spam(url, n) that takes the URL of a web page and a non-negative integer n as input, collects all the email addresses contained in the web page and adds them to a global dictionary variable spam_dict, and then recursively calls itself on every http hyperlink contained in the web page. You will use a dictionary so only one copy of every email address is saved; your dictionary will store (key, value) pairs (email, email). The recursive call should use the parameter n-1 instead of n. If n = 0, you should collect the email addresses but no recursive calls should be made. The parameter n is used to limit the recursion to at most depth n. You will need to use the solutions of the two above problems; your method spam() will call the methods links2() and emails() and possibly other functions as well.

Notes:

1. Running spam() directly will produce no output on the screen; to find your spam_dict, you will need to read the value of spam_dict, and you will also need to reset it to the empty dictionary before every run of spam.
2. Recall how global variables are used.

Usage:

    >>> spam_dict = {}
    >>> spam('http://reed.cs.depaul.edu/lperkovic/csc242/test1.html',0)
    >>> spam_dict.keys()
    dict_keys([])
    >>> spam_dict = {}
    >>> spam('http://reed.cs.depaul.edu/lperkovic/csc242/test1.html',1)
    >>> spam_dict.keys()
    dict_keys(['[email protected]', '[email protected]'])

So far, I've written a function that traverses web pages and puts all the links in a nice little list, and what I wanted to do was call that function. And why would I use recursion on a dictionary? And how? I don't understand how n ties into all of this.

    from urllib.request import urlopen
    from urllib.parse import urljoin

    def links2(url):
        # Download the page and feed it to MyHTMLParser
        # (my own HTMLParser subclass, not shown here).
        content = str(urlopen(url).read())
        myparser = MyHTMLParser()
        myparser.feed(content)
        lst = myparser.get()
        # Resolve every collected link into an absolute URL.
        mergelst = []
        for link in lst:
            mergelst.append(urljoin(lst[0],link))
        print(mergelst)

Any input (except why spam is bad) would be greatly appreciated. Also, I realize that the above function could probably look better; if you have a way to do it, I'm all ears. However, all I need is for the program to produce the proper output.
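The role of n in the assignment can be seen in a minimal sketch. Everything here is an assumption for illustration: PAGES is a made-up in-memory "web" so the code runs offline, and emails() and links2() are regex stand-ins for the asker's own functions, not the course's solutions.

```python
import re

# Hypothetical in-memory pages so the sketch runs without network access.
PAGES = {
    "http://site.test/a": 'mail a@site.test <a href="http://site.test/b">b</a>',
    "http://site.test/b": 'mail b@site.test <a href="http://site.test/a">a</a>',
}

spam_dict = {}

def emails(url):
    """Stand-in for the assignment's emails(): addresses found on one page."""
    return re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", PAGES.get(url, ""))

def links2(url):
    """Stand-in for links2(): absolute http links found on one page."""
    return re.findall(r'href="(http[^"]+)"', PAGES.get(url, ""))

def spam(url, n):
    # Always collect the addresses of the current page.
    for address in emails(url):
        spam_dict[address] = address
    # Base case: depth budget used up, so do not follow any links.
    if n == 0:
        return
    # Recursive case: visit every linked page with one less level of depth.
    for link in links2(url):
        spam(link, n - 1)
```

The recursion is over pages, not over the dictionary: the dictionary is only the shared place where results accumulate. Because n shrinks by one on every call, the two pages linking to each other cannot cause an endless loop.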

What's a good Web Crawler tool

- by Glenn Slaven

I need to index a whole lot of webpages, what good webcrawler utilities are there? I'm preferably after something that .NET can talk to, but that's not a showstopper. What I really need is something that I can give a site url to & it will follow every link and store the content for indexing.

Backlink-reporting website crawler?

- by Stewart

What tools are there out there to crawl a website and report, for each page, a list of pages within the website that link to it?

How to get the IP address of the Google search/crawler bot to add to our IP address whitelist

- by Jayapal Chandran

Hi, Google Webmaster Tools says "network unreachable". When I contacted my hosting provider, they said that they have installed a firewall which could block frequent incoming IP addresses, and they don't know Google's IP addresses to unblock. So they asked me to find the Google search/crawler bot's IP address so that they can add it to their whitelist. How do I find the IP address of the Google search bot or crawler bot? My site stopped appearing in Google search. My hits have dropped very low. What should I do? Any kind of reply would be helpful.
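Rather than a fixed IP list, a crawler's address is commonly checked with a reverse DNS lookup followed by a confirming forward lookup. The sketch below assumes the accepted host suffixes are googlebot.com and google.com; is_googlebot and its injectable resolver parameters are hypothetical names chosen so the logic can be tried without network access.

```python
import socket

def is_googlebot(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Return True if ip reverse-resolves to a Google host that resolves back to ip."""
    try:
        host = reverse(ip)[0]  # (hostname, aliases, addresses)
    except OSError:
        return False
    # Assumed suffixes; adjust if other crawler domains need to be accepted.
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # The forward lookup must list the original address, otherwise
        # the reverse record could simply have been forged.
        return ip in forward(host)[2]
    except OSError:
        return False
```

A hosting provider could run a check like this against the addresses its firewall is about to block, instead of maintaining a static whitelist.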

SharePoint Search Problem: The crawler could not communicate with the server.

- by Clara Oscura

This one was not easy to solve ...

Error: The crawler could not communicate with the server. Check that the server is available and that the firewall access is configured.

Context: Some pages were not crawled (giving the above error) and, what is worse, none of the sub-content of that site was crawled either! (The pages were the homepages of the site.)

Solution: The pages that could not be crawled due to this error contained a custom web part. This web part used default credentials for a given action. During crawling, the SP_Search account tried to perform this action but did not have the appropriate rights. This gave an error that stopped the crawling for the whole site.

This blog helped me: http://patricklamber.blogspot.com/2010/04/why-might-moss-crawler-not-working.html

Are there any decent open-source search engine solutions?

- by Nazariy

A few weeks ago my friend asked me how hard it is to launch your own search engine service with a list of websites that are supposed to be crawled from time to time. The first thing that came to my mind was Google Custom Search; however, the pricing policy is quite tricky and would drain your budget if you reach 500K queries per year. Another solution I found here was SearchBlox, which can be compared to the Google Mini service. It's quite a good solution if you are planning to cover search over a small number of websites, but for larger projects it is not very handy. I also found a few other search platforms like Lucene, Hadoop and Xapian, which seem to be quite powerful solutions for reaching Google search quality, and Nutch as a web crawler. Like most open-source projects, they share the same problem: a lack of comprehensive usage guidance and examples, and an expectation that you are an expert in the subject. I'm wondering if any of you are using these solutions, which of them you would recommend, and what I should be aware of.

How much HDD space would I need to cache the web while respecting robots.txt?

- by Koning Baard XIV

I want to experiment with creating a web crawler. I'll start by indexing a few medium-sized websites like Stack Overflow or Smashing Magazine. If it works, I'd like to start crawling the entire web. I'll respect robots.txt. I'll save all HTML, PDF, Word, Excel, PowerPoint, Keynote, etc. documents (not exes, dmgs etc., just documents) in a MySQL DB. Next to that, I'll have a second table containing all results and descriptions, and a table with words and on what page to find those words (aka an index). How much HDD space do you think I need to save all the pages? Is it as low as 1 TB or is it about 10 TB, 20? Maybe 30? 1000? Thanks

Detecting 'stealth' web-crawlers

- by Jacco

What options are there to detect web crawlers that do not want to be detected? (I know that listing detection techniques will allow the smart stealth-crawler programmer to make a better spider, but I do not think that we will ever be able to block smart stealth crawlers anyway, only the ones that make mistakes.)

I'm not talking about the nice crawlers such as Googlebot and Yahoo! Slurp. I consider a bot nice if it:

- identifies itself as a bot in the user agent string
- reads robots.txt (and obeys it)

I'm talking about the bad crawlers, hiding behind common user agents, using my bandwidth and never giving me anything in return.

There are some trapdoors that can be constructed. Updated list (thanks Chris, gs):

- Adding a directory only listed (marked as disallow) in the robots.txt
- Adding invisible links (possibly marked as rel="nofollow"?):
  - style="display: none;" on the link or parent container
  - placed underneath another element with a higher z-index
- Detect who doesn't understand CaPiTaLiSaTioN
- Detect who tries to post replies but always fails the Captcha
- Detect GET requests to POST-only resources
- Detect the interval between requests
- Detect the order of pages requested
- Detect who (consistently) requests https resources over http
- Detect who does not request image files (this, in combination with a list of user agents of known image-capable browsers, works surprisingly well)

Some traps would be triggered by both 'good' and 'bad' bots. You could combine those with a whitelist:

- It triggers a trap
- It requests robots.txt?
- It does not trigger another trap because it obeyed robots.txt

One other important thing here is: please consider blind people using screen readers. Give people a way to contact you, or to solve a (non-image) Captcha to continue browsing.

What methods are there to automatically detect web crawlers trying to mask themselves as normal human visitors?

Update: The question is not "How do I catch every crawler?" The question is "How can I maximize the chance of detecting a crawler?"

Some spiders are really good, and actually parse and understand HTML, XHTML, CSS, JavaScript, VBScript, etc. I have no illusions: I won't be able to beat them. You would, however, be surprised how stupid some crawlers are. The best example of stupidity (in my opinion) is casting all URLs to lower case before requesting them. And then there is a whole bunch of crawlers that are just 'not good enough' to avoid the various trapdoors.
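Two of the trapdoors listed above (the robots.txt-only directory and the lower-cased URL) can be combined into a per-client score. This is a rough sketch only: the paths, the one-point-per-trap scoring and the threshold of two are all arbitrary assumptions, and TrapScorer is a hypothetical name.

```python
from collections import defaultdict

# Hypothetical honeypot directory: disallowed in robots.txt and never linked visibly.
TRAP_PATHS = {"/private-trap/"}
# Hypothetical mixed-case URL that really exists on the site.
KNOWN_PATHS = {"/Articles/Intro.html"}
# Lower-cased variants that no well-behaved client should ever request.
LOWERED = {p.lower() for p in KNOWN_PATHS} - KNOWN_PATHS

class TrapScorer:
    def __init__(self):
        self.scores = defaultdict(int)

    def observe(self, ip, path):
        """Record one request and return the client's current trap score."""
        if path in TRAP_PATHS:
            self.scores[ip] += 1   # walked into the disallowed directory
        elif path in LOWERED:
            self.scores[ip] += 1   # mangled the capitalisation of a real URL
        return self.scores[ip]

    def suspects(self, threshold=2):
        """Clients that have tripped at least `threshold` traps."""
        return {ip for ip, score in self.scores.items() if score >= threshold}
```

Requiring more than one tripped trap before flagging a client leaves room for the occasional human or well-behaved bot that stumbles into a single trap.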

Webcrawler, feedback?

- by Jan Kuboschek

Hey folks, every once in a while I need to automate data collection tasks from websites. Sometimes I need a bunch of URLs from a directory, sometimes I need an XML sitemap (yes, I know there is lots of software for that, and online services). Anyway, as a follow-up to my previous question I've written a little webcrawler that can visit websites.

- Basic crawler class to easily and quickly interact with one website.
- Override "doAction(String URL, String content)" to process the content further (e.g. store it, parse it).
- Concept allows for multi-threading of crawlers. All class instances share processed and queued lists of links.
- Instead of keeping track of processed links and queued links within the object, a JDBC connection could be established to store links in a database.
- Currently limited to one website at a time; however, it could be expanded upon by adding an externalLinks stack and adding to it as appropriate.

JCrawler is intended to be used to quickly generate XML sitemaps or parse websites for your desired information. It's lightweight. Is this a good/decent way to write the crawler, given the limitations above?

http://pastebin.com/VtgC4qVE - Main.java
http://pastebin.com/gF4sLHEW - JCrawler.java
http://pastebin.com/VJ1grArt - HTMLUtils.java

Thanks for your feedback in advance! :)
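The design described here (a queued list, a processed list, and an overridable action per page, restricted to one website) can be sketched in a few lines. This is not the JCrawler code from the pastebin links: it is a single-threaded Python approximation, and Crawler, do_action and the injected fetch and extract_links callables are hypothetical names.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

class Crawler:
    def __init__(self, start_url, fetch, extract_links):
        self.host = urlparse(start_url).netloc  # stay on this one website
        self.queued = deque([start_url])
        self.processed = set()
        self.fetch = fetch                      # callable: url -> page content
        self.extract_links = extract_links      # callable: content -> list of hrefs

    def do_action(self, url, content):
        """Override to store or parse the page; does nothing by default."""

    def run(self):
        while self.queued:
            url = self.queued.popleft()
            if url in self.processed:
                continue
            self.processed.add(url)
            content = self.fetch(url)
            self.do_action(url, content)
            for link in self.extract_links(content):
                absolute = urljoin(url, link)
                # External links are dropped, mirroring the one-site limitation.
                if urlparse(absolute).netloc == self.host and absolute not in self.processed:
                    self.queued.append(absolute)
```

Passing fetch and extract_links in from outside keeps the queue logic separate from networking and HTML parsing, which also makes it easy to swap the in-memory collections for a database later.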

Mining Groups of people from Wikipedia

- by AlgoMan

I am trying to get the list of people from http://en.wikipedia.org/wiki/Category:People_by_occupation. I have to go through all the sections and get the people from each section. How should I go about it? Should I use a crawler to get the pages and search through them using BeautifulSoup? Or is there another way to get the same data from Wikipedia?
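One alternative to crawling the rendered pages is the MediaWiki API's categorymembers list. The sketch below only builds the request URL and splits a JSON response into pages and subcategories; category_url and split_members are hypothetical helper names, and the actual HTTP request and the walk down the subcategory tree are left out.

```python
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def category_url(category, cmcontinue=None):
    """Build a categorymembers query URL for one category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "500",
        "format": "json",
    }
    if cmcontinue:
        params["cmcontinue"] = cmcontinue  # token for the next batch
    return API + "?" + urlencode(params)

def split_members(response_text):
    """Return (article titles, subcategory titles, continuation token or None)."""
    data = json.loads(response_text)
    members = data["query"]["categorymembers"]
    pages = [m["title"] for m in members if m["ns"] == 0]     # main namespace
    subcats = [m["title"] for m in members if m["ns"] == 14]  # Category namespace
    next_token = data.get("continue", {}).get("cmcontinue")
    return pages, subcats, next_token
```

Each subcategory title returned can be fed back into category_url, so the whole tree under "People by occupation" can be walked without parsing any HTML.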

How to rebuild Safari Web Clip functionality in PHP

- by Mayko

Hi there, is there a way to rebuild Mac OS X Snow Leopard's Dashboard widget 'Web Clip' on a PHP website? Something like a crawler or scraper. I thought about using file_get_contents to get the page content into the page, but how do I select a section of the external page? And does this work with session/login content as well? I'm happy for any kind of suggestions! Cheers

How to extract the headline and content from a crawled web page / article?

- by gAMBOOKa

I need some guidelines on how to detect the headline and content of crawled pages. I've been seeing some very weird front-end codework since I started working on this crawler.

Finding all IP ranges belonging to a specific ISP

- by Jim Jim

I'm having an issue with a certain individual who keeps scraping my site in an aggressive manner; wasting bandwidth and CPU resources. I've already implemented a system which tails my web server access logs, adds each new IP to a database, keeps track of the number of requests made from that IP, and then, if the same IP goes over a certain threshold of requests within a certain time period, it's blocked via iptables. It may sound elaborate, but as far as I know, there exists no pre-made solution designed to limit a certain IP to a certain amount of bandwidth/requests. This works fine for most crawlers, but an extremely persistent individual is getting a new IP from his/her ISP pool each time they're blocked. I would like to block the ISP entirely, but don't know how to go about it. Doing a whois on a few sample IPs, I can see that they all share the same "netname", "mnt-by", and "origin/AS". Is there a way I can query the ARIN/RIPE database for all subnets using the same mnt-by/AS/netname? If not, how else could I go about getting every IP belonging to this ISP? Thanks.
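The threshold logic described above (count requests per IP, block once a limit is exceeded within a time window) can be sketched as a sliding window. This is not the asker's implementation: RequestLimiter is a hypothetical name, timestamps are passed in explicitly, and the iptables call is reduced to a boolean result.

```python
from collections import defaultdict, deque

class RequestLimiter:
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def should_block(self, ip, now):
        """Record a request at time `now`; True if ip exceeded the limit."""
        q = self.hits[ip]
        q.append(now)
        # Forget requests that have fallen out of the time window.
        while q and now - q[0] >= self.window:
            q.popleft()
        return len(q) > self.max_requests
```

In a log-tailing setup, each parsed access-log line would supply the ip and timestamp, and a True result would trigger the firewall rule.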

Googlebot repeatedly looks for files that aren't on my server

- by John at CashCommons

I'm hosting a site for a volunteer organization. I've moved the site to WordPress, but it wasn't always that way. I suspect at one point it was hacked badly. My Apache error log file has grown to 122 kB in just the past 18 hours. The large majority of the errors logged are of this form, repeated hundreds of times today alone in my log files:

    [Mon Nov 12 18:29:27 2012] [error] [client xx.xxx.xx.xxx] File does not exist: /home/*******/public_html/*******.org/calendar.php
    [Mon Nov 12 18:29:27 2012] [error] [client xx.xxx.xx.xxx] File does not exist: /home/*******/public_html/*******.org/404.shtml

(I verified that xx.xxx.xx.xxx was a Google server.) I suspect there was a security hole somewhere before, likely in calendar.php, that was exploited. The files don't exist anymore, but there may be many backlinks that still reference them, which is why Googlebot is so interested in crawling them. How do I fix this gracefully? I still would like Google to index the site. I just want to tell it somehow not to look for these files anymore.

Blocking 'good' bots in nginx with multiple conditions for certain off-limits URL's where humans can go

- by Glenn Plas

After 2 days of searching/trying/failing I decided to post this here. I haven't found any example of someone doing the same, and nothing I tried seems to work. I'm trying to send a 403 to bots not respecting the robots.txt file (even after downloading it several times). Specifically Googlebot. It will support the following robots.txt definition:

    User-agent: *
    Disallow: /*/*/page/

The intent is to allow Google to browse whatever they can find on the site but return a 403 for the following type of request. Googlebot seems to keep nesting these links eternally, adding paging block after block:

    my_domain.com:80 - 66.x.67.x - - [25/Apr/2012:11:13:54 +0200] "GET /2011/06/page/3/?/page/2//page/3//page/2//page/3//page/2//page/2//page/4//page/4//page/1/&wpmp_switcher=desktop HTTP/1.1" 403 135 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

It's a WordPress site, by the way. I don't want those pages to show up. After the robots.txt info got through, they stopped for a while, only to begin crawling again later. It just never stops .... I do want real people to see this. As you can see, Google gets a 403, but when I try this myself in a browser I get a 404 back. I want browsers to pass.

    root@my_domain:# nginx -V
    nginx version: nginx/1.2.0

I tried different approaches, using a map and plain old no-no if's, and they both act the same.

(under the http section)

    map $http_user_agent $is_bot {
        default 0;
        ~crawl|Googlebot|Slurp|spider|bingbot|tracker|click|parser|spider 1;
    }

(under the server section)

    location ~ /(\d+)/(\d+)/page/ {
        if ($is_bot) {
            return 403; # Please respect the robots.txt file !
        }
    }

I recently had to polish up my Apache skills for a client, where I did about the same thing like this:

    # Block real Engines , not respecting robots.txt but allowing correct calls to pass
    # Google
    RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ Googlebot/2\.[01];\ \+http://www\.google\.com/bot\.html\)$ [NC,OR]
    # Bing
    RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ bingbot/2\.[01];\ \+http://www\.bing\.com/bingbot\.htm\)$ [NC,OR]
    # msnbot
    RewriteCond %{HTTP_USER_AGENT} ^msnbot-media/1\.[01]\ \(\+http://search\.msn\.com/msnbot\.htm\)$ [NC,OR]
    # Slurp
    RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ Yahoo!\ Slurp;\ http://help\.yahoo\.com/help/us/ysearch/slurp\)$ [NC]
    # block all page searches, the rest may pass
    RewriteCond %{REQUEST_URI} ^(/[0-9]{4}/[0-9]{2}/page/) [OR]
    # or with the wpmp_switcher=mobile parameter set
    RewriteCond %{QUERY_STRING} wpmp_switcher=mobile
    # ISSUE 403 / SERVE ERRORDOCUMENT
    RewriteRule .* - [F,L]
    # End if match

This does a bit more than I asked nginx to do, but it's about the same principle; I'm having a hard time figuring this out for nginx. So my question is: why would nginx serve my browser a 404? Why isn't it passing? The regex isn't matching for my UA:

    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.30 Safari/536.5"

There are tons of examples of blocking based on UA alone, and that's easy. It also looks like the matching location is final, i.e. it's not 'falling through' for regular users; I'm pretty certain that this has some correlation with the 404 I get in the browser.

As a cherry on top, I also want Google to disregard the parameter wpmp_switcher=mobile. wpmp_switcher=desktop is fine, but I just don't want the same content being crawled multiple times. I ended up adding wpmp_switcher=mobile via the Google Webmaster Tools pages (requiring me to sign up ....). That also stopped it for a while, but today they are back spidering the mobile sections.

So in short, I need to find a way for nginx to enforce the robots.txt definitions. Can someone shell out a few minutes of their lives and push me in the right direction please? I really appreciate ANY response that makes me think harder ;-)
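The user-agent half of the question can be checked in isolation by replaying the map and location patterns outside nginx. This is only a sketch: Python's re module stands in for nginx's PCRE matching, decide is a hypothetical helper, and it says nothing about what nginx does with a request once the location has matched.

```python
import re

# Same alternation as the map block; a plain "~" in an nginx map is
# case-sensitive, so no IGNORECASE flag is used here.
BOT_UA = re.compile(r"crawl|Googlebot|Slurp|spider|bingbot|tracker|click|parser")
# Same pattern as the regex location.
PAGED = re.compile(r"/(\d+)/(\d+)/page/")

def decide(user_agent, path):
    """Return 403 when both the location and the bot map match, else None."""
    if PAGED.search(path) and BOT_UA.search(user_agent):
        return 403
    return None
```

With the two user agents quoted in the question, the Googlebot string is rejected on a paged URL and the Chrome string is not, which is consistent with the asker's suspicion that the browser's 404 comes from how the matched location is served and not from the map.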

Google Indexed an Unlinked Page

- by Yar

Google indexed a page on a site of mine that was not linked from any other page, ever. No one has ever put a link to it, and the directory contents were not browsable. How could this happen? I thought crawlers have no way to include a page that is not linked.

How can I scrape specific data from a website

- by Stoney

I'm trying to scrape data from a website for research. The urls are nicely organized in an example.com/x format, with x as an ascending number and all of the pages are structured in the same way. I just need to grab certain headings and a few numbers which are always in the same locations. I'll then need to get this data into structured form for analysis in Excel. I have used wget before to download pages, but I can't figure out how to grab specific lines of text. Excel has a feature to grab data from the web (Data-From Web) but from what I can see it only allows me to download tables. Unfortunately, the data I need is not in tables.
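Because the pages share one structure, a small parser plus a CSV writer may be enough to get the data into Excel. The sketch below assumes the wanted values sit in h1 and h2 tags, which is purely a placeholder; FieldParser, scrape and the injected fetch callable are hypothetical names, and only the standard library is used.

```python
import csv
from html.parser import HTMLParser

class FieldParser(HTMLParser):
    """Collects the text of chosen tags; the tag names are assumptions."""
    def __init__(self, wanted=("h1", "h2")):
        super().__init__()
        self.wanted = wanted
        self.current = None
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.current = tag
            self.fields.append("")

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current:
            self.fields[-1] += data

def scrape(fetch, first, last, out):
    """Write one CSV row per page: the page number followed by its fields."""
    writer = csv.writer(out)
    for x in range(first, last + 1):
        parser = FieldParser()
        parser.feed(fetch(f"http://example.com/{x}"))
        writer.writerow([x] + [field.strip() for field in parser.fields])
```

For real pages, fetch could wrap urllib.request.urlopen and decode the response, and out could be a file opened with newline="" so that Excel reads the rows cleanly.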

Can I use a Google Appliance/Mini to crawl and index sites I don't own?

- by SkippyFire

Maybe this is a stupid question, but... I am working with this company and they said they needed to get "permission" to crawl other people's sites. They have a Google Search Appliance and some Google Minis and want to point them at other sites to aggregate content. The end result will be something like a targeted search engine. (All the indexed sites relate to a specific topic.) The only things they will be doing are:

- Indexing content from the other sites/domains
- Providing search functionality on their own site that searches the indexed content (like Google, displaying summaries and not the full content)
- The search results will provide links back to the original content

Their intent is not malicious in nature, and is to provide a single site/resource for people to reference on their given topic. Is there anything illegal or fishy about this process?

Google "not selecting" many of the links on my site

- by Loki

A few days ago I noticed that Google wasn't indexing any of the pages on my site anymore. When I checked the Index Status page, I saw that the "not selected" line is going through the roof! The information on Google's help pages is very limited: https://support.google.com/webmasters/bin/answer.py?hl=nl&answer=139066 I run a free downloads site which is automatically updated. I don't have duplicate content (except that sometimes the description of a piece of software is similar to that of an older version) and my URLs are all forwarded to the www. variant from within WordPress. So the canonical part that Google mentions in their help file isn't the problem. Any ideas what could be causing this and how to solve it?

Firewall - Preventing Content Theft & Rogue Crawlers

- by drodecker

Our websites are being crawled by content thieves on a regular basis. We obviously want to let through the nice bots and legitimate user activity, but block questionable activity. We have tried IP blocking at our firewall, but managing the block lists becomes unwieldy. We have also used IIS handlers; however, that complicates our web applications. Is anyone familiar with network appliances, firewalls or application services (say, for IIS) that can reduce or eliminate the content scrapers?

How should I interpret site analytics with 11 pageviews in a 3-second visit?

- by Juank

I'm using Google Analytics and recently I've noticed some weird trends. I have a lot of visits that last mere seconds but record several page views... more than a normal human can see in that amount of time. A specific case: the only visitor from Ireland I've had so far recorded 11 pageviews in a 3-second visit. Are these crawlers? Shouldn't Google Analytics filter those out?
