Search Results

Search found 446 results on 18 pages for 'crawl'.


  • Prevent bots from crawling certain areas of a site.

    - by Skoder
    Hey, I don't know much about SEO or how web spiders work, so forgive my ignorance here. I'm creating a site (using ASP.NET MVC) which has areas that display information retrieved from the database. The data is unique to the user, so there's no real server-side output caching going on. However, since the data can contain things the user may not wish to have displayed in search engine results, I'd like to prevent any spiders from accessing the search results page. Are there any special actions I should take to ensure that the search result directory isn't crawled? Also, would a spider even crawl a page that's dynamically generated, and would any actions preventing certain directories from being searched mess up my search engine rankings? edit: I should add, I'm reading up on the robots.txt protocol, but it relies on co-operation from the web crawler. However, I'd also like to block any data-mining users who will ignore the robots.txt file. I appreciate any help!
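
    A minimal robots.txt sketch for the scenario above (the /search/ path is a hypothetical placeholder for the actual results directory):

        User-agent: *
        Disallow: /search/

    Since this only deters co-operating crawlers, the stronger guarantee is to serve that page only to an authenticated session, so scrapers that ignore robots.txt get nothing useful back.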

    Read the article

  • PHP application variable... maybe?

    - by James
    I went to a PHP job interview and was asked to implement a piece of code to detect visitors that are bots crawling through the website to steal content. So I implemented a few lines of code to detect if the site is being refreshed/visited too quickly/often, by using a session variable to store the last-visit timestamp. I was told that session variables can be manipulated via cookies etc., so I am wondering if there is an application variable that I can use to store the timestamp information against visitor IPs, e.g. $_SERVER['REMOTE_ADDR']? I know that I can write the data to a file, but that's not very good for a high-traffic website. Regards, James
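
    The question is about PHP (where a shared store such as APC or memcached would play this role), but the technique is language-agnostic; here is a minimal in-memory sketch, written in Java purely for illustration, of keeping a last-visit timestamp per client IP (the class name and threshold are assumptions):

        import java.util.concurrent.ConcurrentHashMap;

        // Flags clients that request pages faster than MIN_INTERVAL_MS apart.
        public class IpRateLimiter {
            private static final long MIN_INTERVAL_MS = 500; // illustrative threshold
            private final ConcurrentHashMap<String, Long> lastVisit =
                    new ConcurrentHashMap<String, Long>();

            // Returns true if this request arrived suspiciously soon after the previous one.
            public boolean isTooFast(String ip) {
                long now = System.currentTimeMillis();
                Long previous = lastVisit.put(ip, now); // record this visit, fetch the prior one
                return previous != null && (now - previous) < MIN_INTERVAL_MS;
            }
        }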

    Read the article

  • Is it ever a bad idea to publish a sitemap for a blog?

    - by mipadi
    I have a blog, and I have been considering publishing a sitemap for it, which would include the index page, archives page, and an entry for each individual blog post. Is this ever a bad idea? Is it a good (or useful) idea? I'm particularly interested in the <changefreq> element: I edit posts from time to time, and while that's not a common occurrence, I don't want to set a particularly infrequent change frequency that prevents search engines like Google from indexing the edits. (The sitemaps protocol says that search engines may still crawl the pages more frequently, but has no further details on the matter.)
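
    For reference, a single entry in a sitemap using the sitemaps.org 0.9 schema looks like this (the URL and date are placeholders); <changefreq> is only a hint to crawlers, and keeping <lastmod> accurate is generally the more reliable signal for edited posts:

        <?xml version="1.0" encoding="UTF-8"?>
        <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
          <url>
            <loc>http://example.com/blog/some-post/</loc>
            <lastmod>2010-05-14</lastmod>
            <changefreq>monthly</changefreq>
          </url>
        </urlset>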

    Read the article

  • Stop applet execution on load, pause/resume using javascript?

    - by Zane
    I'm making something of a Java applet gallery for my website (Processing applets, if you're interested), and I'd like to keep the applets from running when the site first loads. Then, when the appropriate button is clicked, a piece of JavaScript would tell the applet to continue execution until another button is pressed to stop it. I know that I can use appletName.start() and appletName.stop(), but it doesn't seem to work on load, at least not well. I'm using element.getElementsById( "applet" ) to get the applets to call the start and stop methods on. It slows Firefox to a crawl for some reason.

    Read the article

  • Google Sitemap and Robots.txt Issue

    - by Sarfaraz Soomro
    Hi, we have a sitemap at our site, http://www.gamezebo.com/sitemap.xml. Some of the URLs in the sitemap are being reported in Webmaster Central as being blocked by our robots.txt (see gamezebo.com/robots.txt), although these URLs are not disallowed in robots.txt. There are other such URLs as well; for example, gamezebo.com/gamelinks is present in our sitemap, but it's being reported as "URL restricted by robots.txt". Also, I have this parse result in Webmaster Central that says, "Line 21: Crawl-delay: 10 Rule ignored by Googlebot". What does it mean? I appreciate your help, thanks.

    Read the article

  • Scrapy Could not find spider Error

    - by Nacari
    I have been trying to get a simple spider to run with Scrapy, but I keep getting the error "Could not find spider for domain: stackexchange.com" when I run scrapy-ctl.py crawl stackexchange.com. The spider is as follows:

        # __future__ imports must come before any other statements
        from __future__ import absolute_import
        from scrapy.spider import BaseSpider

        class StackExchangeSpider(BaseSpider):
            domain_name = "stackexchange.com"
            start_urls = ["http://www.stackexchange.com/"]

            def parse(self, response):
                # save each page under its second-to-last URL path segment
                filename = response.url.split("/")[-2]
                open(filename, 'wb').write(response.body)

        SPIDER = StackExchangeSpider()

    Another person posted almost exactly the same problem months ago but never said how they fixed it: http://stackoverflow.com/questions/1806990/scrapy-spider-is-not-working I have been following the tutorial at http://doc.scrapy.org/intro/tutorial.html exactly, and cannot figure out why it is not working.

    Read the article

  • SoundPlayer causing Memory Leaks?

    - by Nick Udell
    I'm writing a basic writing app in C# and I wanted the program to make typewriter sounds as you typed. I've hooked the KeyPress event on my RichTextBox to a function that uses a SoundPlayer to play a short wav file every time a key is pressed; however, I've noticed that after a while my computer slows to a crawl, and checking my processes, audiodg.exe was using 5 gigabytes of RAM. The code I'm using is as follows: I initialise the SoundPlayer as a global variable on program start with SoundPlayer sp = new SoundPlayer("typewriter.wav"); then on the KeyPress event I simply call sp.Play(). Does anybody know what's causing the heavy memory usage? The file is less than a second long, so it shouldn't be clogging things up too much.

    Read the article

  • What is a good Java web crawler library?

    - by DrDee
    Hi, I am about to develop a crawler in Java but don't feel like reinventing the wheel. A quick Google search turns up a whole bunch of Java libraries for building a web crawler. Nutch is of course a very robust package, but it seems a bit too advanced for my needs: I only need to crawl a handful of websites a week, each containing a couple of thousand pages. Which open source Java library would you recommend, considering: speed; multithreading (or even distributed operation); ease of extension with new functionality; active maintenance and documentation?
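
    For sites of that size, even a small hand-rolled crawler is feasible; below is a rough single-threaded sketch using the jsoup HTML parser (the seed URL, page cap, same-site restriction, and delay are illustrative assumptions, and a polite crawler would also honour robots.txt):

        import java.util.ArrayDeque;
        import java.util.HashSet;
        import java.util.Set;
        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Document;
        import org.jsoup.nodes.Element;

        // Breadth-first crawl: fetch a page, queue its same-site links, repeat.
        public class MiniCrawler {
            public static void main(String[] args) throws Exception {
                String seed = "http://example.com/";   // illustrative seed
                int maxPages = 1000;                   // illustrative cap
                Set<String> seen = new HashSet<String>();
                ArrayDeque<String> frontier = new ArrayDeque<String>();
                frontier.add(seed);
                seen.add(seed);
                while (!frontier.isEmpty() && seen.size() <= maxPages) {
                    String url = frontier.poll();
                    Document doc = Jsoup.connect(url).get();  // fetch and parse
                    System.out.println(url + " -> " + doc.title());
                    for (Element link : doc.select("a[href]")) {
                        String next = link.absUrl("href");    // resolve relative links
                        if (next.startsWith(seed) && seen.add(next)) {
                            frontier.add(next);
                        }
                    }
                    Thread.sleep(1000);                       // crude politeness delay
                }
            }
        }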

    Read the article

  • Data Mining open source tools

    - by Andriyev
    Hi, I'm due to take up a project which is into data mining. Before I jump in, I wanted to probe around for different data mining tools (preferably open source) which allow web-based reporting. In my scenario all the data would be provided to me, so I'm not supposed to crawl for it. In a nutshell, I am looking for a tool which does data analysis and web-based reporting, and provides some kind of a dashboard and mining features. I have worked with Microsoft Analysis Services and BOXI, and of late I have been looking at Pentaho, which seems to be a good option. Please share your experiences with any such tool which you know of. cheers

    Read the article

  • WordPress > Activating plugin makes site go blank in one theme, not in another. Generated source identical

    - by Scott B
    Strangest thing. When I activate this specific plugin, the public side of the site goes blank (nothing but a white screen with a blank view source). However, when I test the site with the WordPress default theme, the plugin does not conflict and the site works fine. The interesting thing is that I've compared the generated source (using FF's web developer tools) with and without the plugin activated, and in each case they are identical. This led me to believe that perhaps the plugin was altering .htaccess; however, that file is the same whether or not the plugin is active. How can I find out what is causing the problem with this plugin? The plugin is called "Crawl Rate Tracker".

    Read the article

  • Database storage for high sample rate data in web app

    - by Jim
    I've got multiple sensors feeding data to my web app. Each channel is 5 samples per second, and the data gets uploaded bundled together in 1-minute JSON messages (containing 300 samples). The data will be graphed using flot at multiple zoom levels, from 1 day down to 1 minute. I'm using Amazon SimpleDB, and I'm currently storing the data in the 1-minute chunks that I receive it in. This works well for high zoom levels, but for full days there will simply be too many rows to retrieve. The idea I've currently got is that every hour I can crawl through the data, aggregate the last hour down to 300 samples, and store them in an 'hour' domain (table, if you like). Does this sound like a reasonable solution? How have others implemented the same sort of systems?
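
    For what it's worth, a sketch of the hourly roll-up arithmetic: at 5 samples per second an hour holds 18,000 raw samples, so averaging each consecutive block of 60 yields the 300 aggregated samples described above (Java is used here purely for illustration; the names are assumptions):

        // Downsample raw readings by averaging fixed-size blocks:
        // 18,000 samples per hour / 60 per block = 300 aggregated samples.
        public class Downsampler {
            public static double[] average(double[] raw, int blockSize) {
                double[] out = new double[raw.length / blockSize];
                for (int i = 0; i < out.length; i++) {
                    double sum = 0;
                    for (int j = 0; j < blockSize; j++) {
                        sum += raw[i * blockSize + j];
                    }
                    out[i] = sum / blockSize;
                }
                return out;
            }
        }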

    Read the article

  • Sharepoint 2010 - AAM - SPSite(SPContext.Current.Site.ID) RootWeb.Url is from wrong zone

    - by user2026343
    I have a SharePoint 2010 web application with 2 different zones: the default zone with Windows login (for the search crawl), and an internet zone with claims (FBA) for users to log in. I have custom web parts that use:

        using (SPSite mySite = new SPSite(SPContext.Current.Site.ID))
        using (SPWeb web = mySite.RootWeb)
        {
            string url = web.Url;
        }

    I use this url to include in emails etc... The problem is: when a user connects via FBA (the extended zone) and goes to the web part, the url string in my code returns the url of the default zone (Windows auth), which the user should not be touching. I have different host headers for these zones; any help would be very appreciated. Update: fixed it by passing the current zone to the SPSite constructor:

        using (SPSite newsite = new SPSite(SPContext.Current.Site.ID, SPContext.Current.Site.Zone))
        using (SPWeb web = newsite.RootWeb)
        {
            // do your implementation here
        }

    Read the article

  • How to estimate memory needed by XPathDocument for a specific XML file

    - by bill seacham
    Is there any way to estimate the memory requirement for creating an XPathDocument instance based on the file size of the XML? XPathDocument xdoc = new XPathDocument(xmlfile); Is there any way to programmatically stop the process of creating the XPathDocument if memory drops to a very low level? Since it loads the entire XML into memory, it would be nice to know ahead of time if the XML is too big. What I have found is that when I create a new XPathDocument from a big XML file, an OutOfMemoryException is never thrown, but the process slows to a crawl, only 5 MB of memory remains available, and Task Manager reports it is not responding. This happened with a 266 MB XML file when there was 584 MB of RAM. I was able to load a 150 MB file with no problems in 18. After loading the XML, I want to do XPath queries using an XPathNavigator and an XPathNodeIterator. I am using .NET 2.0 on XP SP3.

    Read the article

  • Monitoring Reasoning Progress using the Pellet Reasoner

    - by Nico
    I am currently constructing an OWL ontology which, until very recently, classified rapidly using the Pellet reasoner. However, since the introduction of several new classes, the reasoning performance has slowed to a crawl. Although the reasoner completes and the ontology does not contain any unsatisfiable concepts, the time the reasoning takes is unacceptable. I am currently trying to track down the offending class or classes that may have led to the slowdown. Here's my question: is it possible to log the reasoning progress of Pellet? I.e., is it possible to produce some output that will document how long Pellet has spent on certain reasoning tasks, or trace how long reasoning over any given class or axiom takes? If so, does anyone have some Java code they could post up? Thanks in advance for your answers!
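
    One coarse but dependable approach is to time classification from the outside, then re-run while adding the recently introduced classes back one at a time to bisect the slowdown. A rough sketch against the OWL API using Pellet's OWLAPI binding follows; the package and factory names match Pellet 2.x examples and may differ in other versions:

        import java.io.File;
        import org.semanticweb.owlapi.apibinding.OWLManager;
        import org.semanticweb.owlapi.model.OWLOntology;
        import org.semanticweb.owlapi.model.OWLOntologyManager;
        import org.semanticweb.owlapi.reasoner.InferenceType;
        import org.semanticweb.owlapi.reasoner.OWLReasoner;
        import com.clarkparsia.pellet.owlapiv3.PelletReasonerFactory;

        // Times Pellet's classification step for one ontology file.
        public class ReasoningTimer {
            public static void main(String[] args) throws Exception {
                OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
                OWLOntology ontology =
                        manager.loadOntologyFromOntologyDocument(new File("ontology.owl"));
                OWLReasoner reasoner = PelletReasonerFactory.getInstance().createReasoner(ontology);
                long start = System.nanoTime();
                reasoner.precomputeInferences(InferenceType.CLASS_HIERARCHY); // classification
                long elapsedMs = (System.nanoTime() - start) / 1000000;
                System.out.println("Classification took " + elapsedMs + " ms");
            }
        }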

    Read the article

  • How to reset Scrapy parameters? (always running under same parameters)

    - by Jean Ventura
    I've been running my Scrapy project with a couple of accounts (the project scrapes a specific site that requires login credentials), but no matter the parameters I set, it always runs with the same ones (same credentials). I'm running under virtualenv. Is there a variable or setting I'm missing? Edit: it seems that this problem is Twisted-related. Even when I run:

        scrapy crawl -a user='user' -a password='pass' -o items.json -t json SpiderName

    I still get an error saying:

        ERROR: twisted.internet.error.ReactorNotRestartable

    and all the information I get is from the last 'successful' run of the spider.

    Read the article

  • Silverlight 4 application on localhost runs extremely slow

    - by rams
    A Silverlight 4 app running in IE8 and hosted on the VS2010 internal webserver. The website takes at least a minute to download the XAP, and the code runs slowly on the client (IE8). I am running the app in debug mode and have turned IntelliTrace off. Symbol loading is also turned off. However, if I kill the VS webserver and clean the solution, the app runs fast. Three debugging sessions later, the app slows to a crawl. I have also tried turning off McAfee live scanning, but to no avail. I looked in the event log for any clue but found none. What could be the cause of the slowness? TIA rams

    Read the article

  • Django equivalent to paster for backend processes

    - by intractelicious
    I use Pylons at my job, but I'm new to Django. I'm making an RSS filtering application, and so I'd like to have two backend processes that run on a schedule: one to crawl RSS feeds for each user, and another to determine the relevance of individual posts relative to users' past preferences. In Pylons, I'd just write paster commands to update the db with that data. Is there an equivalent in Django? E.g., is there a way to run the equivalent of python manage.py shell in a non-interactive mode?

    Read the article

  • How to end a thread in Java?

    - by beagleguy
    hi all, I have 2 pools of threads:

        ioThreads = (ThreadPoolExecutor) Executors.newCachedThreadPool();
        cpuThreads = (ThreadPoolExecutor) Executors.newFixedThreadPool(numCpus);

    I have a simple web crawler: I want to create an ioThread task, pass it a URL, and have it fetch the URL and hand the contents over to a cpuThread task to be processed, while the ioThread fetches another URL, etc... At some point the IO threads will not have any new pages to crawl, and I want to update my database that this session is complete. How can I best tell when the threads are all done processing, so the program can be ended?
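
    One common pattern, sketched below under the assumption that no task submits further work once the URL frontier is empty, is to shut the pools down in pipeline order and block until each one drains:

        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.TimeUnit;

        // Once the IO pool has drained, nothing can hand new work to the CPU
        // pool, so it is safe to drain that one next.
        public class CrawlShutdown {
            public static void main(String[] args) throws InterruptedException {
                ExecutorService ioThreads = Executors.newCachedThreadPool();
                ExecutorService cpuThreads =
                        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

                // ... submit fetch tasks to ioThreads, which hand pages to cpuThreads ...

                ioThreads.shutdown();                           // stop accepting new fetches
                ioThreads.awaitTermination(1, TimeUnit.HOURS);  // wait for fetches to finish
                cpuThreads.shutdown();                          // no more parse work can arrive
                cpuThreads.awaitTermination(1, TimeUnit.HOURS); // wait for parsing to finish
                System.out.println("Session complete");         // e.g. update the database here
            }
        }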

    Read the article

  • What are the best security measures to take for making certain directories private?

    - by Sattvic
    I have a directory on my server that I do not want search engines to crawl, and I have already set this rule in robots.txt. I do want people who have logged in to be able to access this directory without having to enter a password or anything. I am thinking that a cookie is the best thing to put on users' computers after they log in, and if they have the cookie, they can access the directory. Is this possible, or is there a better way? I want people without this cookie to have no access to this directory: access for members only. Any suggestions on the best design for this?
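
    A plain cookie can be copied or forged, so the usual design is to have the login handler set a cookie whose value is signed server-side, and to verify the signature on each request. A minimal verification sketch in Java (the cookie format, key handling, and class name are assumptions):

        import java.nio.charset.StandardCharsets;
        import java.security.MessageDigest;
        import java.util.Base64;
        import javax.crypto.Mac;
        import javax.crypto.spec.SecretKeySpec;

        // Verifies a cookie of the form "<userId>|<base64 HMAC of userId>".
        // Only the server knows the key, so visitors cannot forge the signature.
        public class CookieVerifier {
            private final byte[] key; // load from secure configuration in practice

            public CookieVerifier(byte[] key) { this.key = key; }

            public boolean isValid(String cookieValue) throws Exception {
                String[] parts = cookieValue.split("\\|", 2);
                if (parts.length != 2) return false;
                Mac mac = Mac.getInstance("HmacSHA256");
                mac.init(new SecretKeySpec(key, "HmacSHA256"));
                byte[] expected = mac.doFinal(parts[0].getBytes(StandardCharsets.UTF_8));
                byte[] given = Base64.getDecoder().decode(parts[1]);
                return MessageDigest.isEqual(expected, given); // constant-time compare
            }
        }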

    Read the article

  • 3 fixed Columns (header and footer) using DIVs, NO Absolute DIVs, IE friendly, ALL columns stretch equally

    - by Phillip Schein
    Left to right: Col1 is 560px wide with 10px padding, the middle column is 250px wide with 5px padding, and Col3 (the sidebar) is 200px wide with 3px padding. The background color of every column should stretch to equal height, no matter the text length in any column. No JavaScript (jQuery workarounds) to make it work; it needs to be pure semantic markup with CSS. Each column should have a nested column of color where the content will go. Column 1 should be SEO-prominent, which means the highest nested column in the source for Google and other search engines to crawl. I have used "The Holy Grail" layout and articles at "A List Apart", and these solutions are so convoluted that they push the main columns left and then push the nested columns back right with padding. This is crazy! I try to adjust these examples, but they're not editable by just adjusting a width in the CSS or the padding, etc. Can you please help me?

    Read the article

  • Scrapy + Eclipse PyDev : how to setup the debugger?

    - by AsTeR
    I've successfully set up Eclipse with my Scrapy project. I did it by creating a new Run/Debug configuration:
    - whose main module links to Scrapy (/usr/local/bin/scrapy for me; I've found suggestions to use cmdline.py, but that failed on my computer: OS X Lion with Scrapy installed through easy_install);
    - defining the arguments to send ("crawl ny" in my case), as I would if I used the Scrapy command line;
    - setting the correct working directory (${workspace_loc:My Project/src} in my case).
    Eclipse can successfully launch my project, but I have no debugger. I'm missing my breakpoints and variable inspection; does anyone know how to set up the debugger with this environment?

    Read the article

  • How to find the most recent associations created between two objects with Rails?

    - by Kevin
    Hi, I have a user model, a movie model, and this association in user.rb (I use has_many :through because there are other associations between these two models):

        has_many :has_seens
        has_many :movies_seen, :through => :has_seens, :source => :movie

    I'm trying to get an array of the ten most recent has_seens associations created. For now, the only solution I have found is to crawl through all the users, building an array with every user.has_seens found, then sorting the array by has_seen.created_at and keeping only the last 10 items… Seems like a heavy operation. Is there a better way? Kevin

    Read the article

  • Storing HTML in MySQL using Java

    - by mpcabd
    Hello there again. So, I'm working on a project where I need to store webpages inside a database. I'm using crawler4j to crawl, and Proxool along with the MySQL Java Connector to connect to my database. When I tested the application I got: com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column 'HTMLData'. The HTMLData column was TEXT. When I changed the HTMLData column to LONGTEXT the error went away, but I'm afraid it might come back in the future. Any idea on how to do this properly so I don't have to worry about that error (or any similar error) in the future? Thanks :)
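
    For reference, MySQL's TEXT column holds up to 65,535 bytes, MEDIUMTEXT about 16 MB, and LONGTEXT about 4 GB, so LONGTEXT comfortably fits any single page. A minimal sketch of a defensive insert via JDBC (the table and column names are assumptions):

        import java.nio.charset.StandardCharsets;
        import java.sql.Connection;
        import java.sql.PreparedStatement;

        // Inserts crawled HTML, rejecting values that could not fit a LONGTEXT column.
        public class PageStore {
            private static final long MAX_BYTES = 4294967295L; // LONGTEXT limit

            public static void save(Connection conn, String url, String html) throws Exception {
                long size = html.getBytes(StandardCharsets.UTF_8).length;
                if (size > MAX_BYTES) {
                    throw new IllegalArgumentException("Page too large: " + size + " bytes");
                }
                PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO pages (url, html_data) VALUES (?, ?)");
                try {
                    ps.setString(1, url);
                    ps.setString(2, html);
                    ps.executeUpdate();
                } finally {
                    ps.close();
                }
            }
        }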

    Read the article

  • Create an SEO and web accessibility analyzer

    - by rebellion
    I'm thinking of making a little web tool for analyzing the search engine optimization and web accessibility of a whole website. First of all, this is just a private tool for now. Crawling a whole website takes up a lot of resources and time. I've found that wget is the best option for downloading the markup for a whole site. I plan on using PHP/MySQL (maybe even CodeIgniter), but I'm not quite sure if that's the right way to do it. There's always someone who recommends Python, Ruby, or Perl; I only know PHP and a little bit of Rails. I've also found a great HTML DOM parser class in PHP on SourceForge. But the thing is, I need some feedback on what I should and should not do: everything from how I should handle the crawl process to what I should be checking for in regards to SEO and WCAG. So, what comes to your mind when you hear this?
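
    For the wget step mentioned above, a mirror invocation along these lines is typical (all flags are standard wget options; the URL is a placeholder):

        wget --mirror --page-requisites --convert-links --wait=1 http://example.com/

    --mirror recurses with timestamping, --page-requisites pulls the CSS and images needed to render each page, --convert-links rewrites links for local browsing, and --wait adds a politeness delay.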

    Read the article

  • How to identify a website's content language

    - by Ajay
    I am developing a website in ASP.NET that crawls other websites' content. I am able to get the content correctly, but how can I identify which language is used, based on that content? I used the following code:

        HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(TextBox1.Text);
        request.UserAgent = "A .NET Web Crawler";
        WebResponse response = request.GetResponse();
        Stream stream = response.GetResponseStream();
        StreamReader reader = new StreamReader(stream);
        string htmlText = reader.ReadToEnd();

    Read the article
