google crawlers - Page 101

Why deny access to website for msnbot/bingbot?

- by Quandary

I've seen quite a lot of tutorials that recommend you to ban user agents containing the strings libwww-perl and msnbot. I understand why one would ban libwww-perl, it's mainly if not only used for hacking and spamming. But why are there so many sites recommending to ban msnbot/bingbot? Since it's a search engine, even if only with a marginal market share, I would except one would want this bot to crawl one's sites. What is it that msnbot does that makes people ban it?

Read the article

What bots are really worth letting onto a site?

- by blunders

Having written a number of bots, and seen the massive amounts of random bots that happen to crawl a site, I am wondering if the goal of the site allowing bots is for the potential for the bot to send real traffic back to the site if there is any reason to allow bots that are not known to be sending real traffic back, and how to spot these "good" bots; based on how they ID themselves, IPs they come from, behaviors, etc.

Read the article

How to reload google dfp in ajax content? [migrated]

- by cj333

google dfp support ads in ajax content, but if I parse all the code in main page. it always show the same ads even turn page reload the ajax content. I read some article from http://productforums.google.com/forum/#!msg/dfp/7MxNjJk46DQ/4SAhMkh2RU4J. But my code not work. Any work code for suggestion? Thanks. Main page code: <script type='text/javascript'> $(document).ready(function(){ $('#next').live('click',function(){ var num = $(this).html(); $.ajax({ url: "album-slider.php", dataType: "html", type: 'POST', data: 'photo=' + num, success: function(data){ $("#slider").center(); googletag.cmd.push(function() { googletag.defineSlot('/1*******/ads-728-90', [728, 90], 'div-gpt-ad-1**********-'+ num).addService(googletag.pubads()); googletag.pubads().enableSingleRequest(); googletag.enableServices(); }); } }); }); }); </script> album-slider.php  <div id='div-gpt-ad-1**********-<?php echo $_GET['photo']; ?>' style='width:728px; height:90px;'> <script type='text/javascript'> googletag.cmd.push(function() { googletag.display('div-gpt-ad-1**********-<?php echo $_GET['photo']; ?>); }); </script>

Read the article

Sync Google Contacts with QuickBooks

- by dataintegration

The RSSBus ADO.NET Providers offer an easy way to integrate with different data sources. In this article, we include a fully functional application that can be used to synchronize contacts between Google and QuickBooks. Like our QuickBooks ADO.NET Provider, the included application supports both the desktop versions of QuickBooks and QuickBooks Online Edition. Getting the Contacts Step 1: Google accounts include a number of contacts. To obtain a list of a user's Google Contacts, issue a query to the Contacts table. For example: SELECT * FROM Contacts. Step 2: QuickBooks stores contact information in multiple tables. Depending on your use case, you may want to synchronize your Google Contacts with QuickBooks Customers, Employees, Vendors, or a combination of the three. To get data from a specific table, issue a SELECT query to that table. For example: SELECT * FROM Customers Step 3: Retrieving all results from QuickBooks may take some time, depending on the size of your company file. To narrow your results, you may want to use a filter by including a WHERE clause in your query. For example: SELECT * FROM Customers WHERE (Name LIKE '%James%') AND IncludeJobs = 'FALSE' Synchronizing the Contacts Synchronizing the contacts is a simple process. Once the contacts from Google and the customers from QuickBooks are available, they can be compared and synchronized based on user preference. The sample application does this based on user input, but it is easy to create one that does the synchronization automatically. The INSERT, UPDATE, and DELETE statements available in both data providers makes it easy to create, update, or delete contacts in either data source as needed. Pre-Built Demo Application The executable for the demo application can be downloaded here. Note that this demo is built using BETA builds of the ADO.NET Provider for Google V2 and ADO.NET Provider for QuickBooks V3, and will expire in 2013. Source Code You can download the full source of the demo application here. You will need the Google ADO.NET Data Provider V2 and the QuickBooks ADO.NET Data Provider V3, which can be obtained here.

Read the article

Regexp that matches user-agents of end-user browsers but NOT crawlers with >90 % accuracy

- by knorv

I'm trying to construct a regexp that will evaluate to true for User-Agent:s of "browsers navigated by humans", but false for bots. Needless to say the matching will not be exact, but if it gets things right in say 90 % of cases that is more than good enough. My approach so far is to target the User-Agent string of the the five major desktop browsers (MSIE, Firefox, Chrome, Safari, Opera). Specifically I want the regexp NOT to match if the user-agent is a bot (Googlebot, msnbot, etc.). Currently I'm using the following regexp which appears to achieve the desired precision: ^(Mozilla.*(Gecko|KHTML|MSIE|Presto|Trident)|Opera).*$ I've observed small number of false negatives which are mostly mobile browsers. The exceptions all match: (BlackBerry|HTC|LG|MOT|Nokia|NOKIAN|PLAYSTATION|PSP|SAMSUNG|SonyEricsson) My question is: Given the desired accuracy level, how would you improve the regexp? Can you think of any major false positives or false negatives to the given regexp? Please note that the question is specifically about regexp-based User-Agent matching. There are a bunch of other approaches to solving this problem, but those are out of the scope of this question.

Read the article

Google Analytics Event Tracking and Variable visibility.

- by Jeow

Hi guys, I have added to my html page the standard latest snippet to get google analytics to work: ... ... var _gaq = _gaq || []; _gaq.push(['_setAccount', 'UA-15080849-1']); _gaq.push(['_trackPageview']); (function() { var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; ga.src = 'http://www.google-analytics.com/ga.js'; (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(ga); })(); Now looking at the official 'event tracking guide' google says: add a snippet such as: pageTracker._trackEvent('Videos', 'Play', 'Gone With the Wind'); my question is: where is pageTracker coming from ? is it a global object in ga.js ? but if it is, why google did not tell me that they run a risk on breaking some script... I must be missing something any help really appreciated.

Read the article

How do I mashup Google Maps with geolocated photos from one or more social networks?

- by PureCognition

I'm working on a proof of concept for a project, and I need to pin random photos to a Google Map. These photos can come from another social network, but need to be non-porn. I've done some research so far, Google's Image Search API is deprecated. So, one has to use the Custom Search API. A lot of the images aren't photos, and I'm not sure how well it handles geolocation yet. Twitter seems a little more well suited, except for the fact that people can post pictures of pretty much anything. I was also going to look into the API's for other networks such as Flickr, Picasa, Pinterest and Instagram. I know there are some aggregate services out there that might have done some of this mash-up work for me as well. If there is anyone out there that has a handle on social APIs and where I should look for this type of solution, I would really appreciate the help. Also, in cases where server-side implementation matters, I'm a .NET developer by experience.

Read the article

Grapeshot crawler ignoring robots.txt

- by QF_Developer

Has anyone come across a crawler called Grapeshot? They are hammering the same page repeatedly on our website. I believe they are looking for ad related keywords, based on previous content ad campaigns. The odd thing is we never ran any such campaigns on the page they are so interested in. We do have only a few pages running AdSense, is this what has attracted Grapeshot? I've added the following declaration to my robots.txt, but they don't seem to be honouring it? User-agent: grapeshot Disallow: / Any ideas on how to block this nuisance crawler? I'm starting to think the best way is by setting up IP rules in IIS?

Read the article

why the difference in google search result using script for search and using a browser for search

- by Jayapal Chandran

I wrote a code to find the position in google search result for a search keyword. I also did the same with the browser. Both the results are different. Let me explain in detail here. I have a website and i wanted to know on which page number my domain appears for a search string. Like when i search for 'code snippets' i wanted to find in google search on which page number a certain domain appears. I wrote a php code to search page by page starting from page 1 to page n. I did the same task using a browser. The script returned page 4 and when browsed i can see the domain appearing in second page. here is the search string i use in my code. /search?hl=en&output=search&sclient=psy-ab&q=code+snippets&start=0&btnG= and for each request i change the start=0 to start=1, start=2, etc... and in the response i will check whether my domain appears in it. any idea for this different in search results?

Read the article

Understanding the maximum hit-rate supported by a web-server

- by SNag

I would like to crawl a publicly available site (and one that's legal to crawl) for a personal project. From a brief trial of the crawler, I gathered that my program hits the server with a new HTTPRequest 8 times in a second. At this rate, as per my estimate, to obtain the full set of data I need about 60 full days of crawling. While the site is legal to crawl, I understand it can still be unethical to crawl at a rate that causes inconvenience to the regular traffic on the site. What I'd like to understand here is -- how high is 8 hits per second to the server I'm crawling? Could I possibly do 4 times that (by running 4 instances of my crawler in parallel) to bring the total effort down to just 15 days instead of 60? How do you find the maximum hit-rate a web-server supports? What would be the theoretical (and ethical) upper-limit for the crawl-rate so as to not adversely affect the server's routine traffic?

Read the article

One site being on a subdirectory of another. Does google count this against you?

- by Mick

I have created two similar websites (relating to monetary systems). So far, one appears to be loved by Google and the other hated. I'm struggling to work out why. This is a mystery to me because both sites were created by me with the same design philosophy, both in pure html. Both are packed to the rafters with references to, and information about, their respective subjects. One issue I'm worried may be the cause is to do with the location of the sites. I got a web hosting package from hostmonster.com for the successful one, but less liked one is just an "add-on" which sits on a subdirectory of the successful one. I wonder if Google somehow detects this and treats it as a less significant website? EDIT: Just to clarify, even though one site is an add-on that sits on a subdirectory of the other, the URL is arranged to look like it is a root. I.e. the unpopular site can be accessed directly with a simple www.myunpopularsite.com name, without specifying any subdirectory. EDIT: Just in case its important... say the popular site is called pop.com and the unpopular one unpop.com. In the webspace I've purchased, there is a directory called public_html. This is where I put the index.htm and all the other files of my popular site. When I purchased the add-on unpop.com. I made a subdirectory of public_html called unpop. It is within this "public_html\unpop\" that I place the index.htm and all the other files of my unpopular site. Typing www.unpop.com into the address bar of a browser links directly to the contents of "public_html\unpop\" and the user is not aware that this site is sitting on a subdirectory of another site. BUT if you type "www.pop.com/unpop" into the address bar of a browser you DO see the unpopular site.

Read the article

Is this Anti-Scraping technique viable with Crawl-Delay?

- by skibulk

I want to prevent web scrapers from abusing 1,000,000 on my website. I'd like to do this by returning a "503 Service Unavailable" error code for users that access an abnormal number of pages per minute. I don't want search engine spiders to ever receive the error. My inclination is to set a robots.txt crawl-delay which will ensure spiders access a number of pages per minute under my 503 threshold. Is this an appropriate solution? Do all major search engines support the directive? Could it negatively affect SEO? Are there any other solutions or recommendations?

Read the article

Ideas to tackle unwanted bad press/review on Google's SERP?

- by Rob

After Googling our company name to our horror we've found someone on Yelp.co.uk has reviewed our company. On the SERP your eye is immediately drawn to the 2 star review some complete stranger has written, which to be honest is pure slander! The most infuriating thing is the person who reviewed our company has never even been a client/customer. It's a bit like me reviewing a restaurant having never eaten or even been in there! We've sent her a private message on Yelp to remove the review and also sent a complaint to Yelp themselves but have yet to get a reply. We've resisted going mad at the reviewer and also requested that she re-review us having just relaunched our new website (it still riles us that she's not even a client though!). We've had genuine customers/clients review us on Yelp yet this 2 star review remains on Google's SERP. Roughly how long would it take to for our new reviews to over take this review? Does anyone have any suggestions as to how we can push the review off the 1st page of Google's SERP or any creative ways in which we can tackle this issue?

Read the article

Creating sitemap for google bot - how to mark dynamic content / dynamic subpages?

- by ojek

I have a website that is internet forum. This forum has many categories, and single category page contains alot of subpages with listed threads. This internet forum is brand new, and about a week ago I filled it with few hundred thousands threads. I then looked at google webmasters page to see any changes in indexing, but the index went up from 300 to about 1200, so that means it did not index my added threads (although it added something). This is what my sitemap.xml contains which I uploaded on their website (of course there is a lot more of the code, this is just a snipped for a single category, in my real sitemap file I have all the categories listed as below): <url> <loc>http://mysite.com/Forums/Physics</loc> <changefreq>hourly</changefreq> </url> Now, I would expect google bot to go into http://mysite.com/Forums/Physics, and move through all the subpages with thread links, and then get inside of each thread and index it's content. How can I do this? Also if this will be unclear, I will add a real link to my website.

Read the article

Trying to update a google visualization using jquery

- by Mark in A2

I'm relatively inexperienced, so please bear with me. I'm developing a simple dashboard using the Google visualization API. I'm developing in vb.net. I have the Annotated Timeline, the Intensity Map, and a set of tables on my apsx. What I am trying to do is update the Intensity Map and tables based on the date range the user selects using the Annotated Timeline tool. I was hoping to update only these visualizations without doing a full page load. Apparently, a great way to do this is to separate the visualizations into self-contained aspx pages and use jQuery to "load" them into a div. I say apparently, as this is not working. When I try to update an aspx containing a Google visualization using jQuery, I get the message "Loading data from www.google.com..." in the browser and it just runs continuously and never returns. I ran this by an experienced developer and he was stumped, but thought must be a conflict between the google API and jQuery. Any tips, advice, alternative solutions are greatly appreciated! Thanks, Mark in Ann Arbor

Read the article

How to make Google recognize language for a multilingual website?

- by Julien Fouilhé

Few weeks ago, I implemented translation functionality for the website of my company. The website is now available in french and english and I did look on the internet the best way to do if we want to do not lose any ranking and to have our pages on Google. Here is what I did: I did set a response header: Content-Language:en and Content-Language:fr My URLs are formatted as: http://www.website.com/en/... and http://www.website.com/fr/... My html tag is set with a lang attribute: <html lang="en"> and <html lang="fr"> There is a <link rel="alternate" hreflang="en" href="EnglishPageUrl"> on french pages and a <link rel="alternate" hreflang="en" href="frenchPageUrl"> on english pages. But Google keeps referring to some english pages when I'm doing a search on french engine, knowing that the website was first only available in english. Is that normal? Do I have to wait still, it has been now almost one month, I thought it would be okay...? Thank you.

Read the article

Web Crawler for Learnign Topics on Wikipedia

- by Chris Okyen

When I want to learn a vast topic on wikipedia, I don't know where to start. For instance say I want to learn about Binary Stars, I then have to know other things linked on that pages and linked pages on all the linked pages and so on for the specified number of levels. I want to write a web crawler like HTTracker or something similiar, that will display a heiarchy of the links on a certain page and the links on those linked pages.I wish to use as much prewritten code as possible. Here is an example: Pretending we are bending the rules by grabing links from only the first sentence of each pages The example archives and "processes" two levels deep The page is Ternary operation The First Level In mathematics a ternary operation is an N-ary operation The Second Level Under Mathmatics: Mathematics (from Greek µ???µa máthema, “knowledge, study, learning”) is the abstract study of topics encompassing quantity, structure, space, change and others; it has no generally accepted definition. Under N-ary In logic,mathematics, and computer science, the arity i/'ær?ti/ of a function or operation is the number of arguments or operands that the function takes Under Operation In its simplest meaning in mathematics and logic, an operation is an action or procedure which produces a new value from one or more input values ------------------------------------------------------------------------- I need some way to determine what oder to approach all these wiki pages to learn the concept ( in this case ternary operations )... Following along with this exmpakle, one way to show the path to read would a printout flowout like so: This shows that the first sentence of the Mathematics page doesn't link to the first sentence of pages linked on ternary page two levels deep. (Please tell me how I should explain this ) --- In otherwords, the child node of the top pages first sentence, ternary_operation, does not have any child nodes that reference the children of the top pages other children nodes- N-ary and operation. Thus it is safe to read this first. Since N-ary has a link to operations we shoudl read the operation page second and finally read the N-ary page last. Again, I wish to use as much prewritten code as possible, and was wondering what language to use and what would be the simpliest way to go about doing this if there isn't already somethign out there? Thank You!

Read the article

get_by_id method on Model classes in Google App Engine Datastore

- by tarn

I'm unable to workout how you can get objects from the Google App Engine Datastore using get_by_id. Here is the model from google.appengine.ext import db class Address(db.Model): description = db.StringProperty(multiline=True) latitude = db.FloatProperty() longitdue = db.FloatProperty() date = db.DateTimeProperty(auto_now_add=True) I can create them, put them, and retrieve them with gql. address = Address() address.description = self.request.get('name') address.latitude = float(self.request.get('latitude')) address.longitude = float(self.request.get('longitude')) address.put() A saved address has values for >> address.key() aglndWVzdGJvb2tyDQsSB0FkZHJlc3MYDQw >> address.key().id() 14 I can find them using the key from google.appengine.ext import db address = db.get('aglndWVzdGJvb2tyDQsSB0FkZHJlc3MYDQw') But can't find them by id >> from google.appengine.ext import db >> address = db.Model.get_by_id(14) The address is None, when I try >> Address.get_by_id(14) AttributeError: type object 'Address' has no attribute 'get_by_id' How can I find by id? EDIT: It turns out I'm an idiot and was trying find an Address Model in a function called Address. Thanks for your answers, I've marked Brandon as the correct answer as he got in first and demonstrated it should all work.

Read the article

Is there an elegant way to track multiple domains under separate accounts with google analytics?

- by J_M_J

I have a situation where a content management system uses the same template for multiple websites with different domain names and I can't make a separate template for each. However, each website needs to be tracked with Google analytics. Would this be appropriate to track each domain like this by putting in some conditional code? And would this be robust enough not to break? Is there a more elegant way to do this? <script type="text/javascript"> var _gaq = _gaq || []; switch (location.hostname){ case 'www.aaa.com': _gaq.push(['_setAccount', 'UA-xxxxxxx-1']); break; case 'www.bbb.com': _gaq.push(['_setAccount', 'UA-xxxxxxx-2']); break; case 'www.ccc.com': _gaq.push(['_setAccount', 'UA-xxxxxxx-3']); break; } _gaq.push(['_trackPageview']); (function() { var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(ga); })(); </script> Just to be clear, each website is a separate domain name and must be tracked separately, NOT different domains with same pages on one analytics profile.

Read the article

Logic behind crawling an webpages like that of Screaming Frog? [on hold]

- by sree

I would like to know what is the parameters to be considered while developing a crawler like that of Screaming Frog. Am looking forward for information on do's and dont's of webpage crawling. What are the problems the crawler may infuse on the webpages like loadtime (maybe?) or anything that effects webpage during crawling. What are the rules the crawler needs to follow etc. Basically anything info that makes the crawler look good and accurate. Just point me in a right direction to achieve it.. Hope my requirement is clear this time.. :)

Read the article

Disqus thread migration. Gotchas?

- by sramsay

I've been migrating a site to a new domain. The site itself is pretty straightforward (it uses Jekyll), and everything has gone fine -- except migration of Disqus threads. I've had partial success -- some of the threads have migrated successfully, but not all. I've tried the domain migration wizard (which caught a few), the URL mapper (which caught a few), and the 301 redirect crawler (which caught a few). But the remaining threads just won't move, no matter which method I use. So, I suppose I suppose I'm asking if there are any "gotchas" I should know about with this. When you execute any of these migration tools, it says it will "take awhile." Does that mean hours? Days? I can't tell if it's working, and there's no logging or error reporting that I can see.

Read the article

One site being on a subdirectory of another. Does google count this againt you?

- by Mick

I have created two similar websites (relating to monetary systems). So far, one appears to be loved by Google and the other hated. I'm struggling to work out why. This is a mystery to me because both sites were created by me with the same design philosophy, both in pure html. Both are packed to the rafters with references to, and information about, their respective subjects. One issue I'm worried may be the cause is to do with the location of the sites. I got a web hosting package from hostmonster.com for the successful one, but less liked one is just an "add-on" which sits on a subdirectory of the successful one. I wonder if Google somehow detects this and treats it as a less significant website? EDIT: Just to clarify, even though one site is an add-on that sits on a subdirectory of the other, the URL is arranged to look like it is a root. I.e. the unpopular site can be accessed directly with a simple www.myunpopularsite.com name, without specifying any subdirectory.

Read the article

How to enable logging for Google Chrome in Ubuntu 12.04?

- by skytreader

I'm trying to capture the logs for a certain bug I'm having with Google Chrome. However, I can't find/enable logs for GC. According to this Chromium project page, I just need to add the flags --enable-logging --v=1 and a chrome_debug.log file will appear in my user data directory. However, after running GC (and closing through the 'X' title bar button) there is no chrome_debug.log file in the specified directory. I even tried running as root as it may have something to do with write permissions but GC refuses to start as root. Another thing, GC also prints messages when invoked from command line. I tried capturing this and redirecting them to a file via $ google-chrome > today.log but the messages are still printed in the command line and the file I specify gets created but remains empty. Note that I can't just copy-paste the messages printed on terminal after my bug occurs as the bug freezes up my whole system that, when it occurs, my only option is to turn off my computer straight via the power button. I've seen a few similar bugs already posted but I find that they don't exactly describe my situation so I'd really like to get some logs for this. So how do I enable logging or, at least, get those terminal messages in a file?

Read the article

If C-Panel Indexing Manager sets a folder to "No Indexing" can it be crawled by a webcrawler?

- by Graham

People are able to view directories / folders on my site right now. So, they could go to mysite.com/images and see the full index. To prevent this, C-Panel offers an option to set a directory / folder to "No Indexing" under the "Index Manager." Will this option allow webcrawlers to crawl / index the images? Or, is there a simpler alternative to block access to all folders directly while still having it SEO friendly? My old server restricted direct access to folders by default. But, the new one does not. Any ideas on this? Thanks!

Read the article

Google Analytics API data for goals (funnels) doesn't match - how do they reconcile?

- by bkgraham

I have a Google Analytics account with a well-functioning funnel made up of 4 goals. I can query the API and get the data out, but it does not match the funnel report in Analytics. Without getting into specific values, I can give you an example with faked data. Here's how the funnel might look: Shopping Cart 100 > 100 > 20 80 (80%) Address Page 5 > 85 > 25 60 (71%) Payment Page 2 > 62 > 10 52 (84%) Checkout 1 > 53 (49.07% funnel conversion rate) Okay, so you would expect the API to output data something like this: goal1Starts goal1Completions goal1Abandons 100 80 20 goal2Starts goal2Completions goal2Abandons 85 60 25 goal3Starts goal3Completions goal3Abandons 62 52 10 goal4Starts goal4Completions goal4Abandons 53 53 0 Instead, it's different. Firstly, the abandons are associated with the following goal (so goal1 always has 0 abandons and goal4 always has 0 abandons. Okay, I can work with that. What's confusing is that the numbers are always a little different. The goal1Completions always match the report, as do the goal4Completions, but everything else is off by a small amount. Sometimes it's only 2 visits, other times it's off by 50. For the report above here's the kind of results I would tend to get: goal1Starts goal1Completions goal1Abandons 100 100 0 goal2Starts goal2Completions goal2Abandons 105 84 21 goal3Starts goal3Completions goal3Abandons 90 65 25 goal4Starts goal4Completions goal4Abandons 58 53 5 Here's what I know: Goal(n)Completions + Goal(n)Abandons = Goal(n)Starts Goal(n)Starts = Goal(n-1)Completions Goal(n)Starts - Goal(n-1)Completions != reported number entering at that level That third one is particularly disappointing. So, here's my question: What data do I need to pull from the API in order to recreate the counts in the Funnel report in Google Analytics? I don't need the pages exited to entering from - just the counts at every level.

Search Results

Search found 21370 results on 855 pages for 'google crawlers'.

Page 101/855 | < Previous Page | 97 98 99 100 101 102 103 104 105 106 107 108 | Next Page >

- by Quandary

- by blunders

- by cj333

- by dataintegration

- by knorv

- by Jeow

- by PureCognition

- by QF_Developer

- by Jayapal Chandran

- by SNag

- by Mick

- by skibulk

- by Rob

- by ojek

- by Mark in A2

- by Julien Fouilhé

- by Chris Okyen

- by tarn

- by J_M_J

- by sree

- by sramsay

- by Mick

- by skytreader

- by Graham

- by bkgraham

< Previous Page | 97 98 99 100 101 102 103 104 105 106 107 108 | Next Page >