crawling - Page 6 - Developer IT

Should I prevent search engines indexing tag/category pages?

- by Macha

On my site, I currently have no special rules for search engines. It is a blog, statically generated using a Python program. When I search for some of my articles on Google, a tag or category page is usually included in the results. Sometimes it even ranks ahead of the article itself. Since these pages won't always have the article on them, these aren't the results I want people to click on. So I'm thinking of setting noindex on these pages. Is there any possible downside to doing so? Is this possible via robots.txt, or do I have to add it to all the relevant templates? All I can find for robots.txt are ways to stop search engines from crawling those pages, which isn't what I want: while I don't want them indexed, crawling them is still the only surefire way to find all my blog posts.
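
Since robots.txt governs crawling rather than indexing, one common approach is a robots meta tag with `noindex, follow` in the listing templates. The following is a minimal sketch in Python (the question says the generator is a Python program); the page kinds and the function names `robots_meta` and `render_head` are hypothetical and not taken from the asker's generator.

```python
from html import escape

# Hypothetical page kinds that a static blog generator might emit.
LISTING_KINDS = {"tag", "category"}


def robots_meta(page_kind: str) -> str:
    """Return a robots meta tag for listing pages, or "" for other pages."""
    if page_kind in LISTING_KINDS:
        # "noindex, follow": keep the page out of the index,
        # but still let crawlers follow its links to the posts.
        return '<meta name="robots" content="noindex, follow">'
    return ""


def render_head(title: str, page_kind: str) -> str:
    """Build a minimal <head> element for a generated page."""
    parts = [f"<title>{escape(title)}</title>", robots_meta(page_kind)]
    return "<head>" + "".join(part for part in parts if part) + "</head>"


if __name__ == "__main__":
    print(render_head("Posts tagged python", "tag"))
    print(render_head("My first post", "article"))
```

With this layout, only the tag and category templates need to call `robots_meta`; article pages are left untouched.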

How to identify the client is a search robot?

- by Yau Leung

I have built my entire site using AJAX (in fact, it's GWT). I have also implemented the AJAX crawling scheme proposed by Google. However, after implementing it, I found that neither Yahoo, Bing, nor Baidu supports that scheme! I'm wondering if there is a way to identify whether the web client is a search robot. If it is, it will be shown the HTML snapshot I created. Ideally I would identify robots at the Apache level, so that I can just use mod_rewrite, but it's still OK if I can do it in PHP or GWT.
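
A rough illustration of user-agent matching, written in Python only for illustration (the question asks about Apache, PHP, or GWT, and the same pattern could be ported there). The token list is an assumption based on commonly published crawler user-agent strings, and `is_search_bot` is a hypothetical helper name. User-agent strings can be spoofed, so this is a heuristic, not proof of identity.

```python
import re
from typing import Optional

# Assumed user-agent tokens for the crawlers named in the question
# (Google, Bing, Yahoo, Baidu); check each engine's documentation.
BOT_PATTERN = re.compile(r"googlebot|bingbot|slurp|baiduspider", re.IGNORECASE)


def is_search_bot(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent header looks like a known search robot."""
    if not user_agent:
        return False
    return BOT_PATTERN.search(user_agent) is not None


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    print(is_search_bot(ua))
```

A request handler could call `is_search_bot` on the incoming header and serve the HTML snapshot when it returns True.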

How could I manage Google Adsense to approve my Web App? It keeps denying it

- by Javierfdr

Google AdSense keeps refusing to show ads in my app because of an "insufficient content" issue. I manage a web application that lets users set YouTube videos as alarm clocks. It includes an in-site YouTube search to retrieve videos from user queries, and it lists the user's alarms. The site has good traffic (500 users per day), is currently promoted by Google in the Google Chrome Web Store, and the AJAX requests are crawlable, following Google's guidelines (https://developers.google.com/webmasters/ajax-crawling/). Although I understand there is not much content beyond what users generate, I really don't know what else I should include in the site. Perhaps adding contact and about pages, and maybe another section, would increase the navigation. Google argues I need a "fully launched and functioning site, allowing users to navigate throughout your site with a menu, sitemap, or appropriate links". They also ask for "full sentences or paragraphs". Isn't there a Google AdSense solution for web applications? Do all web apps have to include useless navigable subpages?

Getting a lot of '/_' errors from webmaster tools

- by Vermino

I'm running a WordPress site and I thought I had gotten all the kinks out of it. For some reason, Webmaster Tools is crawling my website and showing a lot of 404 errors for URLs under /_, which look like additional pages that I never created. I just can't figure out what is creating these links for Google's crawlers, which then hit a 404. My robots.txt is here. My sitemap (created by the Yoast plugin) is here. I have the Yoast and Jetpack plugins installed. What could be causing these links to appear?

Duplicate page content and the Google index

- by Kit Sunde

I have static pages with dynamically expanding content that Google is indexing. I also have deep links to virtually duplicate pages, each of which pre-expands the relevant section of content. It seems like Google is ignoring all my specialized pages and not putting them in the index, even after I went through Webmaster Tools, had them crawled, and submitted them to the index manually. I also use the Google API to integrate search on the site, and the deep-linked pages don't show up there either. Is there a good solution for this?

Where would you start if you were trying to solve this PDF classification problem?

- by burtonic

We are crawling and downloading lots of companies' PDFs and trying to pick out the ones that are Annual Reports. Such reports can be downloaded from most companies' investor-relations pages. The PDFs are scanned and the database is populated with, among other things, the:

- Title
- Contents (full text)
- Page count
- Word count
- Orientation
- First line

Using this data we are checking for the obvious phrases such as:

- Annual report
- Financial statement
- Quarterly report
- Interim report

Then recording the frequency of these phrases and others. So far we have around 350,000 PDFs to scan and a training set of 4,000 documents that have been manually classified as either a report or not. We are experimenting with a number of different approaches including Bayesian classifiers and weighting the different factors available. We are building the classifier in Ruby. My question is: if you were thinking about this problem, where would you start?
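
As one possible starting point for the Bayesian approach mentioned above, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing. It is written in Python purely for illustration (the question states the real classifier is being built in Ruby); the class name, the labels, and the tiny training texts are all hypothetical, and it uses only word tokens rather than the other factors listed (page count, orientation, and so on).

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial naive Bayes over whitespace tokens, with add-one smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()

    def train(self, text, label):
        tokens = text.lower().split()
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    clf = NaiveBayes()
    # Made-up training snippets standing in for manually classified PDFs.
    clf.train("annual report 2010 financial statement", "report")
    clf.train("annual report to shareholders", "report")
    clf.train("product brochure and datasheet", "other")
    clf.train("press release new product", "other")
    print(clf.predict("annual report and financial statement"))
```

Measuring such a baseline against a held-out part of the 4,000 labeled documents would show whether the extra factors add anything.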

Why isn't Google updating my site title in search results? [closed]

- by SharkTheDark

Possible Duplicate: Google doesn't seem to update the description or title of my homepage

I had my domain for a few days before I uploaded the site to it, and during that time it had one title. When I uploaded content it should have gotten a new title, but because I misunderstood WordPress, the site was blocking crawlers in robots.txt and had noindex and nofollow set. I removed that about 7 days ago, and I can see in the reports that Googlebot is crawling my site, but my site title isn't updating; it still shows the old domain title from before the site was there. My robots.txt now contains:

User-agent: *
Allow: /

I have a clear title tag on every page. How long does it take to update? Do I need to check something else?
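
One thing that can be checked locally is whether the robots.txt rules really allow crawling. The sketch below uses Python's standard `urllib.robotparser` module; the helper name `allows_crawl` and the example URL are assumptions for illustration. It only confirms that the rules permit fetching, and says nothing about when a search engine will refresh a title.

```python
from urllib.robotparser import RobotFileParser


def allows_crawl(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nAllow: /"
    # http://www.example.com/ is a placeholder for the real site.
    print(allows_crawl(rules, "http://www.example.com/"))
```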

Another website is mirroring and ranks above my site in search results

- by Marlboro Goodluck

There is a site of ill repute known as thedirty which has completely mirrored my site and now has links appearing on Google at the #1 spot using my content. I checked my log files and noticed that this site has been crawling mine for some time, and also has 10,000 links from their site to mine. I have blocked user access referred from this site and have already reported them to Google as web spam. I also disavowed the domain. How are they getting top links in Google (even overtaking mine) with such nefarious tactics? What are the steps to completely eliminate an issue such as this?

Another website is mirroring my site

- by Marlboro Goodluck

Question for you all. There is a site of ill repute known as thedirty which has completely mirrored my site and now has links appearing on Google at the #1 spot using my content. I checked my log file and noticed that this site has been crawling mine for some time, and also has 10k links from their site to mine. I have blocked user access referred from this site and have already reported them to Google as web spam. I also disavowed the domain. How are they getting top links in Google (even overtaking mine) with such nefarious tactics? What are the steps to completely eliminate an issue such as this?

Google indexing pages with #! although we don't have any

- by Benjamin Gruenbaum

Our company has developed a Single Page Application using AngularJS and its routing. Google indexed our site decently with JavaScript, but it did not index some pages very well, so we have developed an HTML-only version. We have followed the Ajax Crawling Specification posted here and have a <meta name='fragment' content='!'> tag and canonical URLs. We expect http://www.example.com/foo/bar to be fetched from http://www.example.com/?_escaped_fragment_=/foo/bar. However, we have found that since we rolled out the AJAX specification, all pages are indexed twice: once with the JavaScript version as http://www.example.com/foo/bar and once with the new version as http://www.example.com/#!/foo/bar. This is harmful to us since it's duplicate content and also misrepresents our site. I have tried looking for similar questions here and in the Google product forum but could not come up with anything.
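
For reference, the hash-bang half of the AJAX crawling scheme maps a `#!` URL to an `_escaped_fragment_` query parameter. The sketch below shows that mapping in Python; `to_escaped_fragment_url` is a hypothetical helper name, and the use of `urllib.parse.quote` is a simplification that escapes more characters than the scheme strictly requires.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_escaped_fragment_url(pretty_url: str) -> str:
    """Map a #! URL to its _escaped_fragment_ form; other URLs are returned unchanged."""
    parts = urlsplit(pretty_url)
    if not parts.fragment.startswith("!"):
        return pretty_url
    # Drop the leading "!" and percent-encode the rest, keeping "/" readable.
    state = quote(parts.fragment[1:], safe="/")
    token = "_escaped_fragment_=" + state
    query = parts.query + "&" + token if parts.query else token
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


if __name__ == "__main__":
    print(to_escaped_fragment_url("http://www.example.com/#!/foo/bar"))
```

Running it on the hash-bang URL from the question yields the `?_escaped_fragment_=/foo/bar` form quoted there.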

XPath injection detection tool

- by preeti

Hi, I am working on XPath injection attacks and am looking to build a tool that detects XPath injection in a website. Can web crawling and scanning be used for this? What could the detection logic be? Are there any open source tools that detect it, so that I can study the logic they use and develop my own in Java? Thank you.

WebCrawling Dynamic Links

- by Jojo

Hi everyone, does anybody have any idea how to crawl websites that have dynamic pages/queries? I mean that if I click a certain link, it has different values every time I reload it in a web browser. My web crawler cannot download the contents of these pages. Please advise.

How to write a crawler?

- by Jason

Hi all, I have been thinking about writing a simple crawler that would crawl our NPO's websites and content and produce a list of its findings. Does anybody have any thoughts on how to do this? Where do you point the crawler to get started? How does it send back its findings and still keep crawling? How does it know what it finds, and so on? Thanks! -Jason
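
To make the questions above concrete, here is a minimal breadth-first crawler sketch in Python. It starts from a seed URL, keeps a queue of pages still to visit, and returns the list of pages it found. The `fetch` callable is an assumption: it stands in for whatever HTTP client is used and should return the page's HTML, or `None` on failure. The `npo.example` URLs are placeholders.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collect the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl limited to the seed's host; returns the URLs fetched."""
    host = urlsplit(seed).netloc
    queue, seen, found = deque([seed]), {seed}, []
    while queue and len(found) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable or missing page: skip it and keep crawling
        found.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # resolve and drop fragments
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return found


if __name__ == "__main__":
    # In-memory stand-in for a real site, so the sketch runs without a network.
    pages = {
        "http://npo.example/": '<a href="/about">About</a>',
        "http://npo.example/about": '<a href="/">Home</a>',
    }
    print(crawl("http://npo.example/", pages.get))
```

The `seen` set is what lets the crawler record findings and keep going without revisiting pages; a real crawler would also honor robots.txt and rate limits.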

How to retrieve Directories size including all sub-directories?

- by vikingosegundo

I have stored images from the net like this:

Documents/imagecache/domain1/original/path/inURI/foo.png
Documents/imagecache/domain2/original/path/inURI/bar.png
Documents/imagecache/...
Documents/imagecache/...

Now I'd like to check the size of imagecache including all its subdirectories. Is there a convenient way of doing it, preferably without crawling through all the data manually?
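
The asker's platform is not stated, so the following is only a Python illustration of the general technique: walk the tree and sum the file sizes. It still visits every file, but the traversal is delegated to the standard library's `os.walk`. The function name `directory_size` is hypothetical.

```python
import os


def directory_size(root: str) -> int:
    """Return the total size in bytes of all regular files under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # skip symlinks to avoid double counting
                total += os.path.getsize(path)
    return total


if __name__ == "__main__":
    # "Documents/imagecache" mirrors the layout in the question.
    print(directory_size("Documents/imagecache"))
```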

Is there a way to crawl all facebook fan pages?

- by user220755

Is there a way to crawl all Facebook fan pages and collect some information, for example saving their names or how many fans they have? Or at least, do you have a hint about how this could be done?

What is the best way to freshen a Nutch index?

- by Miles

I haven't looked at Nutch for a year or so and it looks like it has changed significantly. The documentation on re-crawling isn't clear. What is the best way to update an existing Nutch index?

Google search box

- by user343282

I am working on a Google box, something like this: http://mytwentyfive.com/blog/wp-content/uploads/byme/Google%20Search%20Appliances.jpg I am pointing the crawler at a folder that contains HTML files. Previously the crawler crawled the files and indexed them, but now it finds the pattern or the folder without following any HTML files within the folder. I have tried everything I know and can't think of anything else. Can someone help? Thanks.

How to prevent all crawlers except good ones (Google, Bing, Yahoo) from accessing website content?

- by tranhuyhung

I only want to let Google, Bing, and Yahoo crawl my website to build their indexes. I do not want competing websites to use crawling services to steal my website's content. What should I do?
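
Because user-agent strings are easy to fake, one commonly described check is a reverse DNS lookup on the client IP followed by a forward lookup to confirm it. The sketch below shows the idea in Python. The hostname suffixes are assumptions that should be verified against each search engine's documentation, and `is_verified_crawler` with its injectable `reverse` and `forward` resolvers is a hypothetical design chosen so the logic can be exercised without network access.

```python
import socket

# Assumed hostname suffixes for Google, Bing, and Yahoo crawlers;
# confirm them against each engine's own documentation before relying on them.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com", ".crawl.yahoo.net")


def is_verified_crawler(ip, reverse=None, forward=None):
    """Return True if `ip` reverse-resolves to a trusted host that resolves back to `ip`."""
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or socket.gethostbyname
    try:
        host = reverse(ip)
        if not host.endswith(TRUSTED_SUFFIXES):
            return False
        # Forward-confirm: the claimed hostname must map back to the same IP.
        return forward(host) == ip
    except OSError:
        return False  # lookup failed: treat the client as unverified


if __name__ == "__main__":
    # 192.0.2.1 is a documentation-only address, so this prints False.
    print(is_verified_crawler("192.0.2.1"))
```

Requests that fail this check but claim a search-engine user agent could then be throttled or blocked; note that this does nothing against scrapers that use ordinary browser user agents.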

Investment advice data dump analysis

- by portoalet

For my year-end pet project, I'd like to analyze investment advice and its correlation with stock market performance. The problem is: where do I get a (free) dump of investment advice data? Something like the stackoverflow.com data dump would be nice. Or maybe it's easier to do distributed crawling and crawl public finance webpages for investment advice? By investment advice I mean buy/sell advice for stocks/forex, issued by an institution or investment advisor.

List of default managed properties in SharePoint search

- by stranger001

Hi, I would like to know which managed properties are available by default with a default SharePoint installation. I would also like to know the name of the default crawled property mapped to the managed property "ModifiedBy". Thanks.

Retrieving coordinates in this page

- by hao

Hey guys, I'm trying to do some data mining and analyze data based on locations. For this site, http://www.dianping.com/shop/1898365, I am trying to figure out the latitude and longitude by crawling, but I can't seem to figure out where this information is stored. Can someone give me some pointers?

Which Linux distribution is best suited for Nutch-Hadoop?

- by vipin k.

Hi experts, we are trying to figure out which Linux distribution would be best suited for Nutch-Hadoop integration. We are planning to use clusters for crawling large amounts of content through Nutch. Let me know if you need more clarification on this question. Thank you.

Is there any Gmail API for Java?

- by chenxgre

I am trying to crawl a Gmail inbox and download the messages into a database. Is there a Gmail Java API that is well suited to this job?

Sharepoint Search crawl not working

- by Satish

Search crawling is erroring out on my MOSS 2007 installation. For all the web apps, I get the following error in the crawl logs:

http://mysites.devserver URL could not be resolved. The host may be unavailable, or the proxy settings are not configured correctly on the index server.

The Application event log also has the following corresponding error:

The start address http://mysites.devserver cannot be crawled. Context: Application 'SSPMain', Catalog 'Portal_Content' Details: The URL of the item could not be resolved. The repository might be unavailable, or the crawler proxy settings are not configured. To configure the crawler proxy settings, use the Proxy and Timeout page in search administration. (0x80041221)

I'm using Windows Server 2008. I tried accessing the site using the above URL and it is available. I applied the registry setting for the loopback issue described at http://support.microsoft.com/kb/896861, but still no luck. Any ideas?

Search Results

Search found 241 results on 10 pages for 'crawling'.
