Search Results

Search found 12829 results on 514 pages for 'crawl errors'.

Page 145/514 | < Previous Page | 141 142 143 144 145 146 147 148 149 150 151 152  | Next Page >

  • Stop applet execution on load, pause/resume using javascript?

    - by Zane
    I'm making something of a Java applet gallery for my website (Processing applets, if you're interested) and I'd like to keep the applets from running when the site first loads. Then, when the appropriate button is clicked, a piece of JavaScript would tell the applet to continue execution until another button is pressed to stop it. I know that I can use appletName.start() and appletName.stop(), but it doesn't seem to work on load, at least not well. I'm using element.getElementsById( "applet" ) to get the applets to call the start and stop methods on. It slows Firefox to a crawl for some reason.

    Read the article

  • BizTalk: History of one project architecture

    - by Leonid Ganeline
    "In the beginning God made heaven and earth. Then he started to integrate." At the very start was the requirement: integrate two working systems. Small digging up: It was one system. It was good but IT guys want to change it to the new one, much better, chipper, more flexible, and more progressive in technologies, more suitable for the future, for the faster world and hungry competitors. One thing. One small, little thing. We cannot turn off the old system (call it A, because it was the first), turn on the new one (call it B, because it is second but not the last one). The A has a hundreds users all across a country, they must study B. A still has a lot nice custom features, home-made features that cannot disappear. These features have to be moved to the B and it is a long process, months and months of redevelopment. So, the decision was simple. Let’s move not jump, let’s both systems working side-by-side several months. In this time we could teach the users and move all custom A’s special functionality to B. That automatically means both systems should work side-by-side all these months and use the same data. Data in A and B must be in sync. That’s how the integration projects get birth. Moreover, the specific of the user tasks requires the both systems must be in sync in real-time. Nightly synchronization is not working, absolutely.   First draft The first draft seems simple. Both systems keep data in SQL databases. When data changes, the Create, Update, Delete operations performed on the data, and the sync process could be started. The obvious decision is to use triggers on tables. When we are talking about data, we are talking about several entities. For example, Orders and Items [in Orders]. We decided to use the BizTalk Server to synchronize systems. Why it was chosen is another story. Second draft   Let’s take an example how it works in more details. 1.       User creates a new entity in the A system. This fires an insert trigger on the entity table. Trigger has to pass the message “Entity created”. This message includes all attributes of the new entity, but I focused on the Id of this entity in the A system. Notation for this message is id.A. System A sends id.A to the BizTalk Server. 2.       BizTalk transforms id.A to the format of the system B. This is easiest part and I will not focus on this kind of transformations in the following text. The message on the picture is still id.A but it is in slightly different format, that’s why it is changing in color. BizTalk sends id.A to the system B. 3.       The system B creates the entity on its side. But it uses different id-s for entities, these id-s are id.B. System B saves id.A+id.B. System B sends the message id.A+id.B back to the BizTalk. 4.       BizTalk sends the message id.A+id.B to the system A. 5.       System A saves id.A+id.B. Why both id-s should be saved on both systems? It was one of the next requirements. Users of both systems have to know the systems are in sync or not in sync. Users working with the entity on the system A can see the id.B and use it to switch to the system B and work there with the copy of the same entity. The decision was to store the pairs of entity id-s on both sides. If there is only one id, the entities are not in sync yet (for the Create operation). Third draft Next problem was the reliability of the synchronization. The synchronizing process can be interrupted on each step, when message goes through the wires. 
It can be communication problem, timeout, temporary shutdown one of the systems, the second system cannot be synchronized by some internal reason. There were several potential problems that prevented from enclosing the whole synchronization process in one transaction. Decision was to restart the whole sync process if it was not finished (in case of the error). For this purpose was created an additional service. Let’s call it the Resync service. We still keep the id pairs in both systems, but only for the fast access not for the synchronization process. For the synchronizing these id-s now are kept in one main place, in the Resync service database. The Resync service keeps record as: ·       Id.A ·       Id.B ·       Entity.Type ·       Operation (Create, Update, Delete) ·       IsSyncStarted (true/false) ·       IsSyncFinished (true/false0 The example now looks like: 1.       System A creates id.A. id.A is saved on the A. Id.A is sent to the BizTalk. 2.       BizTalk sends id.A to the Resync and to the B. id.A is saved on the Resync. 3.       System B creates id.B. id.A+id.B are saved on the B. id.A+id.B are sent to the BizTalk. 4.       BizTalk sends id.A+id.B to the Resync and to the A. id.A+id.B are saved on the Resync. 5.       id.A+id.B are saved on the B. Resync changes the IsSyncStarted and IsSyncFinished flags accordingly. The Resync service implements three main methods: ·       Save (id.A, Entity.Type, Operation) ·       Save (id.A, id.B, Entity.Type, Operation) ·       Resync () Two Save() are used to save id-s to the service storage. See in the above example, in 2 and 4 steps. What about the Resync()? It is the method that finishes the interrupted synchronization processes. If Save() is started by the trigger event, the Resync() is working as an independent process. It periodically scans the Resync storage to find out “unfinished” records. Then it restarts the synchronization processes. It tries to synchronize them several times then gives up.     One more thing, both systems A and B must tolerate duplicates of one synchronizing process. Say on the step 3 the system B was not able to send id.A+id.B back. The Resync service must restart the synchronization process that will send the id.A to B second time. In this case system B must just send back again also created id.A+id.B pair without errors. That means “tolerate duplicates”. Fourth draft Next draft was created only because of the aesthetics. As it always happens, aesthetics gave significant performance gain to the whole system. First was the stupid question. Why do we need this additional service with special database? Can we just master the BizTalk to do something like this Resync() does? So the Resync orchestration is doing the same thing as the Resync service. It is started by the Id.A and finished by the id.A+id.B message. The first works as a Start message, the second works as a Finish message.     Here is a diagram the whole process without errors. It is pretty straightforward. The Resync orchestration is waiting for the Finish message specific period of time then resubmits the Id.A message. It resubmits the Id.A message specific number of times then gives up and gets suspended. It can be resubmitted then it starts the whole process again: waiting [, resubmitting [, get suspended]], finishing. Tuning up The Resync orchestration resubmits the id.A message with special “Resubmitted” flag. The subscription filter on the Resync orchestration includes predicate as (Resubmit_Flag != “Resubmitted”). 
That means only the first Sync orchestration starts the Resync orchestration. Other Sync orchestration instantiated by the resubmitting can finish this Resync orchestration but cannot start another instance of the Resync   Here is a diagram where system B was inaccessible for some period of time. The Resync orchestration resubmitted the id.A two times. Then system B got the response the id.A+id.B and this finished the Resync service execution. What is interesting about this, there were submitted several identical id.A messages and only one id.A+id.B message. Because of this, the system B and the Resync must tolerate the duplicate messages. We also told about this requirement for the system B. Now the same requirement is for the Resunc. Let’s assume the system B was very slow in the first response and the Resync service had time to resubmit two id.A messages. System B responded not, as it was in previous case, with one id.A+id.B but with two id.A+id.B messages. First of them finished the Resync execution for the id.A. What about the second id.A+id.B? Where it goes? So, we have to add one more internal requirement. The whole solution must tolerate many identical id.A+id.B messages. It is easy task with the BizTalk. I added the “SinkExtraMessages” subscriber (orchestration with one receive shape), that just get these messages and do nothing. Real design Real architecture is much more complex and interesting. In reality each system can submit several id.A almost simultaneously and completely unordered. There are not only the “Create entity” operation but the Update and Delete operations. And these operations relate each other. Say the Update operation after Delete means not the same as Update after Create. In reality there are entities related each other. Say the Order and Order Items. Change on one of it could start the series of the operations on another. Moreover, the system internals are the “black boxes” and we cannot predict the exact content and order of the operation series. It worth to say, I had to spend a time to manage the zombie message problems. The zombies are still here, but this is not a problem now. And this is another story. What is interesting in the last design? One orchestration works to help another to be more reliable. Why two orchestration design is more reliable, isn’t it something strange? The Synch orchestration takes all the message exchange between systems, here is the area where most of the errors could happen. The Resync orchestration sends and receives messages only within the BizTalk server. Is there another design? Sure. All Resync functionality could be implemented inside the Sync orchestration. Hey guys, some other ideas?
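
    Purely as an illustration of the retry logic described above (this is not BizTalk code; the real solution lives in orchestrations), here is a minimal Python sketch of what the Resync component does with its (id.A, id.B, flags) records. The record fields follow the list above; the in-memory storage, the MAX_ATTEMPTS value and the resubmit callback are assumptions made only for the sketch:

        # In-memory stand-in for the Resync storage: id_a -> record.
        records = {}
        MAX_ATTEMPTS = 3   # assumed; "tries several times, then gives up"

        def save_start(id_a, entity_type, operation):
            # Save(id.A, Entity.Type, Operation): called when the sync starts (step 2).
            records[id_a] = {"entity_type": entity_type, "operation": operation,
                             "id_b": None, "attempts": 0,
                             "is_sync_started": True, "is_sync_finished": False}

        def save_finish(id_a, id_b):
            # Save(id.A, id.B, ...): called when id.A+id.B comes back (step 4).
            records[id_a]["id_b"] = id_b
            records[id_a]["is_sync_finished"] = True

        def resync(resubmit_id_a):
            # Resync(): scan for unfinished records and resubmit id.A a limited
            # number of times, then give up (the "suspended" state). Duplicate
            # id.A messages downstream are expected and must be tolerated.
            for id_a, rec in records.items():
                if not rec["is_sync_finished"] and rec["attempts"] < MAX_ATTEMPTS:
                    rec["attempts"] += 1
                    resubmit_id_a(id_a)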

    Read the article

  • Google Sitemap and Robots.txt Issue

    - by Sarfaraz Soomro
    Hi, We have a sitemap at our site, http://www.gamezebo.com/sitemap.xml. Some of the urls in the sitemap are being reported in Webmaster Central as being blocked by our robots.txt (see gamezebo.com/robots.txt), although these urls are not Disallowed in robots.txt. There are other such urls as well; for example, gamezebo.com/gamelinks is present in our sitemap, but it's being reported as "URL restricted by robots.txt". Also I have this parse result in Webmaster Central that says, "Line 21: Crawl-delay: 10 Rule ignored by Googlebot". What does it mean? I appreciate your help, Thanks.

    Read the article

  • Scrapy Could not find spider Error

    - by Nacari
    I have been trying to get a simple spider to run with Scrapy, but keep getting the error Could not find spider for domain:stackexchange.com when I run the code with the command scrapy-ctl.py crawl stackexchange.com. The spider is as follows:
        from scrapy.spider import BaseSpider
        from __future__ import absolute_import

        class StackExchangeSpider(BaseSpider):
            domain_name = "stackexchange.com"
            start_urls = [
                "http://www.stackexchange.com/",
            ]

            def parse(self, response):
                filename = response.url.split("/")[-2]
                open(filename, 'wb').write(response.body)

        SPIDER = StackExchangeSpider()
    Another person posted almost exactly the same problem months ago but did not say how they fixed it: http://stackoverflow.com/questions/1806990/scrapy-spider-is-not-working I have been following the tutorial exactly at http://doc.scrapy.org/intro/tutorial.html, and cannot figure out why it is not working.
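
    One detail worth flagging in the code as pasted (an observation, not necessarily the asker's actual fix): a from __future__ import placed after another import is a SyntaxError in Python, so the module never imports and Scrapy cannot register the spider, which would produce exactly this "could not find spider" message. Below is a minimal sketch with the import order corrected, keeping the same old scrapy-ctl.py era API (domain_name plus a module-level SPIDER instance) that the question itself uses:

        # Sketch only: same spider as in the question, with the __future__
        # import moved to the top of the file where Python requires it.
        from __future__ import absolute_import

        from scrapy.spider import BaseSpider


        class StackExchangeSpider(BaseSpider):
            domain_name = "stackexchange.com"
            start_urls = ["http://www.stackexchange.com/"]

            def parse(self, response):
                # Save the response body to a file named after a URL segment.
                filename = response.url.split("/")[-2]
                open(filename, "wb").write(response.body)


        SPIDER = StackExchangeSpider()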

    Read the article

  • SoundPlayer causing Memory Leaks?

    - by Nick Udell
    I'm writing a basic writing app in C# and I wanted to have the program make typewriter sounds as you type. I've hooked the KeyPress event on my RichTextBox to a function that uses a SoundPlayer to play a short wav file every time a key is pressed. However, I've noticed that after a while my computer slows to a crawl, and checking my processes, audiodlg.exe was using 5 GIGABYTES of RAM. The code I'm using is as follows: I initialise the SoundPlayer as a global variable on program start with SoundPlayer sp = new SoundPlayer("typewriter.wav") Then on the KeyPress event I simply call sp.Play(); Does anybody know what's causing the heavy memory usage? The file is less than a second long, so it shouldn't be clogging things up too much.

    Read the article

  • What is a good Java web crawler library?

    - by DrDee
    Hi, I am about to develop a crawler in Java but don't feel like reinventing the wheel. A quick Google search gives a whole bunch of Java libraries for building a web crawler. Besides that, Nutch is of course a very robust package but seems a bit too advanced for my needs. I only need to crawl a handful of websites a week, each containing a couple of thousand pages. Which open source Java library would you recommend, considering: speed, multithreading (or even distributed crawling), ease of extending it with new functionality, active maintenance, and documentation?

    Read the article

  • Data Mining open source tools

    - by Andriyev
    Hi, I'm due to take up a project which is into data mining. Before I jump in, I wanted to probe around for different data mining tools (preferably open source) which allow web-based reporting. In my scenario all the data would be provided to me, so I'm not supposed to crawl for it. In a nutshell, I am looking for a tool which does data analysis and web-based reporting, and provides some kind of dashboard and mining features. I have worked on Microsoft Analysis Services and BOXI, and of late I have been looking at Pentaho, which seems to be a good option. Please share your experiences with any such tool which you know of. cheers

    Read the article

  • WordPress > Activating plugin makes site go blank in one theme, not in another. Generated source ide

    - by Scott B
    Strangest thing. When I activate this specific plugin, the public side of the site goes blank (nothing but a white screen with a blank view source). However, when I test the site with the WordPress default theme, the plugin does not conflict and the site works fine. The interesting thing is that I've compared the generated source (using FF's webmaster tools) with and without the plugin activated, and in each case they are identical. This led me to believe that perhaps the plugin was altering .htaccess; however, that file is the same whether or not the plugin is active. How can I find out what is causing the problem with this plugin? The plugin is called "Crawl Rate Tracker".

    Read the article

  • Database storage for high sample rate data in web app

    - by Jim
    I've got multiple sensors feeding data to my web app. Each channel is 5 samples per second, and the data gets uploaded bundled together in 1 minute json messages (containing 300 samples). The data will be graphed using flot at multiple zoom levels from 1 day to 1 minute. I'm using Amazon SimpleDB and I'm currently storing the data in the 1 minute chunks that I receive it in. This works well for high zoom levels, but for full days there will simply be too many rows to retrieve. The idea I've currently got is that every hour I can crawl through the data, collect together 300 samples for the last hour, and store them in an hour Domain (table, if you like). Does this sound like a reasonable solution? How have others implemented the same sort of systems?
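
    For what it's worth, here is a minimal sketch of the hourly roll-up idea described above, in plain Python (illustrative only; it assumes the 1-minute chunks have already been fetched as lists of 300 numeric samples and says nothing about the SimpleDB API). Averaging each 60-sample window turns an hour's 18,000 raw samples (60 chunks of 300) into 300 per-hour samples, matching the per-minute resolution at the day view:

        def rollup_hour(minute_chunks):
            # Downsample one hour of data (60 chunks of 300 samples each) to
            # 300 samples by averaging every 60 consecutive raw samples.
            samples = [s for chunk in minute_chunks for s in chunk]   # 18000 values
            window = 60
            return [sum(samples[i:i + window]) / len(samples[i:i + window])
                    for i in range(0, len(samples), window)]

        # Example: 60 one-minute chunks of 300 samples each -> 300 averages.
        hour = [[float(i % 7) for i in range(300)] for _ in range(60)]
        print(len(rollup_hour(hour)))   # 300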

    Read the article

  • Sharepoint 2010 - AAM - SPSite(SPContext.Current.Site.ID) RootWeb.Url is from wrong zone

    - by user2026343
    I have a SharePoint 2010 web application with 2 different zones: the default zone with Windows login (for the search crawl), and an internet zone with claims (FBA) for users to log in. I have custom webparts that use
        using (SPSite mySite = new SPSite(SPContext.Current.Site.ID))
        using (SPWeb web = mySite.RootWeb)
        {
            string url = web.Url;
    I use this url to include in emails etc... The problem is: when a user connects via FBA (the extended zone) and goes to the webpart, the url string in my code returns the url of the default zone (Windows auth), which the user should not be touching. I have different host headers for these zones; any help would be very appreciated. Update: fixed it with
        using (SPSite newsite = new SPSite(SPContext.Current.Site.ID, SPContext.Current.Site.Zone))
        using (SPWeb web = newsite.RootWeb)
        {
            //do your implementation here
        }

    Read the article

  • Monitoring Reasoning Progress using the Pellet Reasoner

    - by Nico
    I am currently constructing an OWL ontology which, until very recently, classified rapidly using the Pellet reasoner. However, since the introduction of several new classes, the reasoning performance has slowed to a crawl. Although the reasoner completes and the ontology does not contain any unsatisfiable concepts etc., the time the reasoning takes is unacceptable. I am currently trying to track down the offending class or classes that may have led to the slowdown. Here's my question: is it possible to log the reasoning progress of Pellet? I.e. is it possible to produce some output that documents how long Pellet has spent on certain reasoning tasks, or traces how long reasoning over any given class and axiom takes? If so, does anyone have some Java code they could post up? Thanks in advance for your answers!

    Read the article

  • How to estimate memory need by XPathDocument for a specific xml file

    - by bill seacham
    Is there any way to estimate the memory requirement for creating an XpathDocument instance based on the file size of the xml? XpathDocument xdoc = new XpathDocument(xmlfile); Is there any way to programmatically stop the process of creating the XpathDocument if memory drops to a very low level? Since it loads the entire xml into memory, it would be nice to know ahead of time if the xml is too big. What I have found is that when I create a new XpathDocument with a big xml file, an OutOfMemory exception is never fired, but the process slows to a crawl, only 5 MB of memory remains available, and Task Manager reports it is not responding. This happened with a 266 MB xml file when there was 584 MB of RAM. I was able to load a 150 MB file with no problems in 18. After loading the xml, I want to do xpath queries using an XpathNavigator and an XpathNodeIterator. I am using .net 2.0, xp sp3.

    Read the article

  • How to reset Scrapy parameters? (always running under same parameters)

    - by Jean Ventura
    I've been running my Scrapy project with a couple of accounts (the project scrapes a specific site that requires login credentials), but no matter what parameters I set, it always runs with the same ones (same credentials). I'm running under virtualenv. Is there a variable or setting I'm missing? Edit: It seems that this problem is Twisted related. Even when I run: scrapy crawl -a user='user' -a password='pass' -o items.json -t json SpiderName I still get an error saying: ERROR: twisted.internet.error.ReactorNotRestartable And all the information I get is from the last 'successful' run of the spider.
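
    A hedged aside (an assumption about the setup, not a diagnosis): arguments passed with -a arrive as keyword arguments to the spider's __init__, so one sanity check is to read them there rather than from anything cached at module import time; and ReactorNotRestartable typically points to the Twisted reactor being started a second time in the same process (for example, a wrapper script that invokes the crawl repeatedly). A minimal sketch of the argument-handling side, with illustrative names:

        from scrapy.spider import BaseSpider

        class LoginSpider(BaseSpider):
            name = "SpiderName"   # the name used with "scrapy crawl"

            def __init__(self, user=None, password=None, *args, **kwargs):
                super(LoginSpider, self).__init__(*args, **kwargs)
                # Filled from: scrapy crawl SpiderName -a user=... -a password=...
                self.user = user
                self.password = password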

    Read the article

  • Silverlight 4 application on localhost runs extremely slow

    - by rams
    Silverlight 4 app running in IE8 and hosted on the VS2010 internal webserver. The website takes at least a minute to download the xap, and the code runs slowly on the client (IE8). I am running the app in debug mode and have turned IntelliTrace off. Symbol loading is also turned off. However, if I kill the VS webserver and clean the solution, the app runs fast. Three debugging sessions later, the app slows to a crawl. I have also tried turning off McAfee live scanning, but no use. Looked in the event log for any clue but found none. What could be the cause of the slowness? TIA rams

    Read the article

  • Django equivalent to paster for backend processes

    - by intractelicious
    I use Pylons in my job, but I'm new to Django. I'm making an RSS filtering application, and so I'd like to have two backend processes that run on a schedule: one to crawl RSS feeds for each user, and another to determine the relevance of individual posts relative to users' past preferences. In Pylons, I'd just write paster commands to update the db with that data. Is there an equivalent in Django? E.g. is there a way to run the equivalent of python manage.py shell in a non-interactive mode?
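
    The usual Django counterpart to a paster command is a custom management command. A minimal sketch (the app, file path and command name here are illustrative): a file at yourapp/management/commands/crawl_feeds.py, with empty __init__.py files in the management/ and commands/ directories, which then runs non-interactively as python manage.py crawl_feeds (for example from cron):

        from django.core.management.base import BaseCommand

        class Command(BaseCommand):
            help = "Crawl each user's RSS feeds and update the database."

            def handle(self, *args, **options):
                # Scheduled, non-interactive work goes here; this runs with the
                # full Django environment (settings, ORM) already configured.
                self.stdout.write("crawling feeds...")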

    Read the article

  • What are the best security measures to take for making certain directories private?

    - by Sattvic
    I have a directory on my server that I do not want search engines to crawl, and I have already set this rule in robots.txt. I do want people that have logged in to be able to access this directory without having to enter a password or anything. I am thinking that a cookie is the best thing to put on users' computers after they log in, and if they have the cookie, they can access the directory. Is this possible, or is there a better way? I want people without this cookie to not have access to this directory: access for members only. Any suggestions on the best design for this?
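
    One common pattern, offered as a sketch rather than a complete design: the cookie's mere presence should never be the check; instead the login handler sets a value the server can verify, for example an HMAC-signed token, and every request into the protected directory is verified server-side before the file is served. A minimal illustration using Python's hmac module, with illustrative names:

        import hmac, hashlib

        SECRET_KEY = b"change-me"   # stays on the server, never in the cookie

        def make_token(username):
            # Set this as the cookie value after a successful login.
            sig = hmac.new(SECRET_KEY, username.encode(), hashlib.sha256).hexdigest()
            return "%s:%s" % (username, sig)

        def is_valid(token):
            # Run this check before serving anything from the private directory.
            try:
                username, sig = token.rsplit(":", 1)
            except (AttributeError, ValueError):
                return False
            expected = hmac.new(SECRET_KEY, username.encode(), hashlib.sha256).hexdigest()
            return hmac.compare_digest(sig, expected)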

    Read the article

  • How to end a thread in java?

    - by beagleguy
    hi all, I have 2 pools of threads: ioThreads = (ThreadPoolExecutor)Executors.newCachedThreadPool(); cpuThreads = (ThreadPoolExecutor)Executors.newFixedThreadPool(numCpus); I have a simple web crawler where I want to create an ioThread and pass it a url; it will then fetch the url and pass the contents over to a cpuThread to be processed, and the ioThread will then fetch another url, etc... At some point the IO threads will not have any new pages to crawl, and I want to update my database that this session is complete. How can I best tell when the threads are all done processing and the program can be ended?

    Read the article

  • 3 fixed Columns (header and footer) using DIVs, NO Absolute DIVs, IE friendly, ALL columns stretch e

    - by Phillip Schein
    Left to right: Col1 is 560px wide with 10px padding, the middle column is 250px wide with 5px padding, and Col3 (sidebar) is 200px wide with 3px padding. The background color, no matter the text length in any column, should stretch vertically equally. No javascript (jQuery workarounds) to make it work. It needs to be pure semantic markup with CSS. Each column should have a nested column of color where content will go. Column 1 should be SEO prominent, which means the highest nested column for Google and other search engines to crawl. I have used 'The Holy Grail' layout and articles at 'A List Apart', and these solutions are so convoluted that they push the main columns left and then the nested columns push them back right with padding. This is crazy! I try to adjust these examples, but they're not editable by just adjusting a width in the CSS or the padding, etc. Can you please help me?

    Read the article

  • Scrapy + Eclipse PyDev : how to setup the debugger?

    - by AsTeR
    I've successfully set up Eclipse with my Scrapy project. I did it by creating a new Run/Debug configuration: its main module points to the Scrapy script (/usr/local/bin/scrapy for me; I've found suggestions to use cmdline.py, but that failed on my computer: OSX Lion with Scrapy installed through easy_install); the arguments are set to "crawl ny" in my case, as I would pass them on the Scrapy command line; and the working directory is set correctly (${workspace_loc:My Project/src} in my case). Eclipse can successfully launch my project, but I have no debugger. I'm missing my breakpoints and variable inspection; does anyone know how to set up the debugger with this environment?
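
    A hedged suggestion (an assumption about the setup, not a tested recipe for this exact environment): breakpoints are usually easier to hit when PyDev debugs a small Python entry point of your own rather than the installed /usr/local/bin/scrapy script, because the crawl then runs inside PyDev's own debugger process. Something like the following, saved in the project and used as the main module of the debug configuration; the spider name "ny" is taken from the question:

        # run_spider.py - minimal PyDev-debuggable entry point (sketch).
        from scrapy import cmdline

        if __name__ == "__main__":
            # Equivalent to typing "scrapy crawl ny" on the command line.
            cmdline.execute("scrapy crawl ny".split())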

    Read the article

  • How to find the most recent associations created between two objects with Rails?

    - by Kevin
    Hi, I have a user model, a movie model, and this association in user.rb (I use has_many :through because there are other associations between these two models): has_many :has_seens has_many :movies_seen, :through => :has_seens, :source => :movie I'm trying to get an array of the ten most recent has_seens associations created. For now, the only solution I have found is to crawl through all the users, creating an array with every user.has_seens found, then sort the array by has_seen.created_at and only keep the last 10 items… Seems like a heavy operation. Is there a better way? Kevin

    Read the article

  • Storing HTML in MySQL using Java

    - by mpcabd
    Hello there again, So, I'm working on a project now where I should store webpages inside a database. I'm using crawler4j to crawl and Proxool along with the MySQL Java Connector to connect to my database. When I tested the application I got: com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column 'HTMLData'. The HTMLData column was TEXT. When I changed the HTMLData column to LONGTEXT the error was gone, but I'm afraid it might come back in the future. Any idea on how to do this properly so I don't have to worry about that error (or any other similar error) in the future? Thanks :)

    Read the article

  • Create an seo and web accessibility analyzer

    - by rebellion
    I'm thinking of making a little web tool for analyzing the search engine optimization and web accessibility of a whole website. First of all, this is just a private tool for now. Crawling a whole website takes up a lot of resources and time. I've found out that wget is the best option for downloading the markup for a whole site. I plan on using PHP/MySQL (maybe even CodeIgniter), but I'm not quite sure if that's the right way to do it. There's always someone who recommends Python, Ruby or Perl. I only know PHP and a little bit of Rails. I've also found a great HTML DOM parser class in PHP on SourceForge. But, the thing is, I need some feedback on what I should and should not do. Everything from how I should run the crawl process to what I should be checking for in regards to SEO and WCAG. So, what comes to your mind when you hear this?
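
    To make the "what to check" part concrete, here is a tiny sketch of the kind of per-page checks such a tool runs over markup that wget has already downloaded: does the page have a <title>, and does every <img> carry an alt attribute (one of the basic WCAG checks). Python and the file name are used purely for illustration; the same logic ports directly to a PHP DOM parser class.

        from html.parser import HTMLParser

        class PageCheck(HTMLParser):
            # Collects two simple signals while parsing one downloaded page.
            def __init__(self):
                super().__init__()
                self.has_title = False
                self.imgs_missing_alt = 0

            def handle_starttag(self, tag, attrs):
                if tag == "title":
                    self.has_title = True
                elif tag == "img" and "alt" not in dict(attrs):
                    self.imgs_missing_alt += 1

        checker = PageCheck()
        with open("page.html", encoding="utf-8", errors="ignore") as f:
            checker.feed(f.read())
        print(checker.has_title, checker.imgs_missing_alt)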

    Read the article

  • How to Identify the website's content language

    - by Ajay
    I am developing a website in ASP.NET to crawl other websites' content. I am able to get the content correctly, but how can I identify which language is used, based on that content? I used the following code: HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(TextBox1.Text); request.UserAgent = "A .NET Web Crawler"; WebResponse response = request.GetResponse(); Stream stream = response.GetResponseStream(); StreamReader reader = new StreamReader(stream); string htmlText = reader.ReadToEnd();

    Read the article

  • Is it so bad to have heaps of elements in your DOM?

    - by alex
    I am making a non-interactive real estate display for a shop window. I have kicked jCarousel into doing what I want: add panels via AJAX, and towards the end of the current set, go and AJAX some new panels and insert them. This works fine, but it appears that calling jQuery's remove() on the prior elements causes an ugly bump. I'm not sure if calling hide() will free up any resources, as the element will still exist (and the element will be off screen anyway). I've seen this, and tried carousel.reset() from within a callback. It just clears out all the elements. This will be running in Google Chrome on Windows XP, and will solely be displaying on LCD televisions. I am wondering, if I can't find a reasonable solution to remove the extra DOM elements, will it bring my application to a crawl, or will Chrome do some clever garbage collecting? Or, how would you solve this problem? Thanks

    Read the article

  • Controlling a browser from Python

    - by Noio
    I am looking for a way to control a browser from Python, i.e. fill out form fields and submit them, possibly call JS functions. I've looked around a bit, but as far as I could see PyWebKitGtk only lets you show the browser as a GUI element, not interface with it. Is there a way to do this easily? I wrote my program logic in Python, and I would hate to port it to JS. Besides that, even if I'd use pure JS "bookmarklets", those wouldn't be able to read/write to my local filesystem, would they? P.S. to quell your suspicions, I'm not trying to automatically fill out forum account creation forms or something similarly spammious, though the task is technically similar. I need to crawl/scrape sites for my research project.
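
    A hedged pointer rather than a definitive answer: one common way to drive a real browser from Python is Selenium's WebDriver bindings (an assumption about what fits here; for plain form posting without a browser, an HTTP-level library also works, but it cannot call in-page JS functions). A minimal sketch, with the URL and field name purely illustrative:

        from selenium import webdriver
        from selenium.webdriver.common.by import By

        driver = webdriver.Firefox()                 # drives a real Firefox instance
        driver.get("http://example.com/form")        # illustrative URL

        field = driver.find_element(By.NAME, "q")    # illustrative field name
        field.send_keys("some value")
        field.submit()                               # submits the enclosing form

        # Call in-page JavaScript and read back its return value.
        title = driver.execute_script("return document.title;")
        print(title)

        driver.quit()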

    Read the article
