crawl - Page 17 - Developer IT

How to protect/monitor your site from crawling by malicious user

- by deathy

Situation: Site with content protected by username/password (not all controlled since they can be trial/test users) a normal search engine can't get at it because of username/password restrictions a malicious user can still login and pass the session cookie to a "wget -r" or something else. The question would be what is the best solution to monitor such activity and respond to it (considering the site policy is no-crawling/scraping allowed) I can think of some options: Set up some traffic monitoring solution to limit the number of requests for a given user/IP. Related to the first point: Automatically block some user-agents (Evil :)) Set up a hidden link that when accessed logs out the user and disables his account. (Presumably this would not be accessed by a normal user since he wouldn't see it to click it, but a bot will crawl all links.) For point 1. do you know of a good already-implemented solution? Any experiences with it? One problem would be that some false positives might show up for very active but human users. For point 3: do you think this is really evil? Or do you see any possible problems with it? Also accepting other suggestions.

Read the article

svn import, dont modify revision OR modify the list of files in a transaction

- by Vaughan Durno

Hi Ive gained so much knowledge/insight from this site in the past few years, now im actually hoping to get some enlightenment. The scenario is as follows: You have the general structure of the repo (trunk,branches,tags) but added to the layout you have another directory called 'db_revs'. Now in the pre-commit, you take a dump of a specific database (the specifics are irrelevant) into a temporary file, say /tmp/REV.sql (REV being the HEAD revision number of the repo, or the transaction). K all is well and you can just import that temp file into the repo at /db_revs/REV.sql Now obviously that import, even tho its happening during a commit, increments the revision of the repo. So when u do a commit at some point to say 'test.php' in the trunk and it completes at say revision 159, then the pre-commit runs as it should and the DB dump gets imported but then u r sitting with a tree in the repo-browser where 'trunk' is at revision 159, and 'db_revs', which has the imported dump, is at 158 (Ive made it so that the filename matches the revision ie: 159.sql but that file is then at revision 158). NB If you're doing an import in a pre-commit, you need to add some logic to not perform the import, say by checking first for the existence of the temp file, otherwise it will cause, um, a stack overflow and your PC will quickly crawl to a stand still So I wanted to know if it was possible to make an import to not commit its changes. I realise I might be barking up the wrong tree to begin with so I have another idea of doing this so that brings me to the 2nd part of my question, would it be possible to modify the list of files that the transaction is about to commit to the repo. I know this can be done to a WC but that wont help as a WC is a checked out copy of say the trunk so im not sure how u would add a file to the 'db_revs' folder which is above trunk? Any help is greatly appreciated Cheers Vaughan

Read the article

Locating memory leak in Apache httpd process, PHP/Doctrine-based application

- by Sam

I have a PHP application using these components: Apache 2.2.3-31 on Centos 5.4 PHP 5.2.10 Xdebug 2.0.5 with Remote Debugging enabled APC 3.0.19 Doctrine ORM for PHP 1.2.1 using Query Caching and Results Caching via APC MySQL 5.0.77 using Query Caching I've noticed that when I start up Apache, I eventually end up 10 child processes. As time goes on, each process will grow in memory until each one approaches 10% of available memory, which begins to slow the server to a crawl since together they grow to take up 100% of memory. Here is a snapshot of my top output: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1471 apache 16 0 626m 201m 18m S 0.0 10.2 1:11.02 httpd 1470 apache 16 0 622m 198m 18m S 0.0 10.1 1:14.49 httpd 1469 apache 16 0 619m 197m 18m S 0.0 10.0 1:11.98 httpd 1462 apache 18 0 622m 197m 18m S 0.0 10.0 1:11.27 httpd 1460 apache 15 0 622m 195m 18m S 0.0 10.0 1:12.73 httpd 1459 apache 16 0 618m 191m 18m S 0.0 9.7 1:13.00 httpd 1461 apache 18 0 616m 190m 18m S 0.0 9.7 1:14.09 httpd 1468 apache 18 0 613m 190m 18m S 0.0 9.7 1:12.67 httpd 7919 apache 18 0 116m 75m 15m S 0.0 3.8 0:19.86 httpd 9486 apache 16 0 97.7m 56m 14m S 0.0 2.9 0:13.51 httpd I have no long-running scripts (they all terminate eventually, the longest being maybe 2 minutes long), and I am working under the assumption that once each script terminates, the memory it uses gets deallocated. (Maybe someone can correct me on that). My hunch is that it could be APC, since it stores data between requests, but at the same time, it seems weird that it would store data inside the httpd process. How can I track down which part of my app is causing the memory leak? What tools can I use to see how the memory usage is growing inside the httpd process and what is contributing to it?

Read the article

Background worker not working right

- by vbNewbie

I have created a background worker to go and run a pretty long task that includes creating more threads which will read from a file of urls and crawl each. I tried following it through debugging and found that the background process ends prematurely for no apparent reason. Is there something wrong in the logic of my code that is causing this. I will try and paste as much as possible to make sense. While Not myreader.EndOfData Try currentRow = myreader.ReadFields() Dim currentField As String For Each currentField In currentRow itemCount = itemCount + 1 searchItem = currentField generateSearchFromFile(currentField) processQuerySearch() Next Catch ex As Microsoft.VisualBasic.FileIO.MalformedLineException Console.WriteLine(ex.Message.ToString) End Try End While This first bit of code is the loop to input from file and this is what the background worker does. The next bit of code is where the background worker creates threads to work all the 'landingPages'. After about 10 threads are created the background worker exits this sub and skips the file input loop and exits the program. Try For Each landingPage As String In landingPages pgbar.Timer1.Stop() If VisitedPages.Contains(landingPage) Then Continue For Else Dim thread = New Thread(AddressOf processQuery) count = count + 1 thread.Name = "Worm" & count thread.Start(landingPage) If numThread >= 10 Then For Each thread In ThreadList thread.Join() Next numThread = 0 Continue For Else numThread = numThread + 1 SyncLock ThreadList ThreadList.Add(thread) End SyncLock End If End If Next

Read the article

Join + IEqualityComparer<T> and HashCode

- by Jesus Rodriguez

Im writing my own LINQ reference but Im getting troubles with some of the more complicated operators implementations. There is a Join implementation that takes a IEqualityComparer Im getting just crazy. Im trying to understand it first before I write (obviously) Image this two lists: List<string> initials = new List<string> {"A", "B", "C", "D", "E"}; List<string> words = new List<string> {"Ant", "Crawl", "Pig", "Boat", "Elephant", "Arc"}; Nothing weird here. I want to join both lists by the Initial, something like: Initial=A Word=Ant Initial=A Word=Arc Initial=B Word=Boat ... I need a comparator, I wrote this: public class InitialComparator : IEqualityComparer<string> { public bool Equals(string x, string y) { return x.StartsWith(y); } public int GetHashCode(string obj) { return obj[0].GetHashCode(); } } The Join itself: var blah = initials.Join(words, initial => initial, word => word, (initial, word) => new {Initial = initial, Word = word}, new InitialComparator()); It's the first time Im using HashCodes, after a good session of debugging I see that every word go to the comparator and look at its HashCode, if another word has the same HashCode it calls equals. Since I want to compare just the initial I though that I just need the first letter Hash (Am I wrong?) The thing is that this is not working correctly. Its says that "Ant" and "Arc" are equals, Ok, its comparing every word in the same list or not, But it adds only the last word it finds, in this case Arc, ignoring Ant and Ant is equals to "A" too... If I put "Ant" and "Ant" it add both. In short, What is the way of doing something like that? I know that Im doing something wrong. Thank you.

Read the article

multithreading issue

- by vbNewbie

I have written a multithreaded crawler and the process is simply creating threads and having them access a list of urls to crawl. They then access the urls and parse the html content. All this seems to work fine. Now when I need to write to tables in a database is when I experience issues. I have 2 declared arraylists that will contain the content each thread parse. The first arraylist is simply the rss feed links and the other arraylist contains the different posts. I then use a for each loop to iterate one while sequentially incrementing the other and writing to the database. My problem is that each time a new thread accesses one of the lists the content is changed and this affects the iteration. I tried using nested loops but it did not work before and this works fine using a single thread.I hope this makes sense. Here is my code: SyncLock dlock For Each l As String In links finallinks.Add(l) Next End SyncLock SyncLock dlock For Each p As String In posts finalposts.Add(p) Next End SyncLock ... Dim i As Integer = 0 SyncLock dlock For Each rsslink As String In finallinks postlink = finalposts.Item(i) i = i + 1 finallinks and finalposts are the two arraylists. I did not include the rest of the code which shows the threads working but this is the essential part where my error occurs which is basically here postlink = finalposts.Item(i) i = i + 1 ERROR: index was out of range. Must be non-negative and less than the size of the collection Is there an alternative?

Read the article

Proper Usage of SqlConnection in .NET

- by Jojo

Hi guys, I just want an opinion on the proper usage or a proper design with regards to using SqlConnection object. Which of the 2 below is the best use: A data provider class whose methods (each of them) contain SqlConnection object (and disposed when done). Like: IList<Employee> GetAllEmployees() { using (SqlConnection connection = new SqlConnection(this.connectionString)) { // Code goes here... } } Employee GetEmployee(int id) { using (SqlConnection connection = new SqlConnection(this.connectionString)) { // Code goes here... } } or SqlConnection connection; // initialized in constructor IList<Employee> GetAllEmployees() { this.TryOpenConnection(); // tries to open member SqlConnection instance // Code goes here... this.CloseConnection(); // return } Employee GetEmployee(int id) { this.TryOpenConnection(); // tries to open member SqlConnection instance // Code goes here... this.CloseConnection(); // return } Or is there a better approach than this? I have a focused web crawler type of application and this application will crawl 50 or more websites simultaneously (multithreaded) with each website contained in a crawler object and each crawler object has an instance of a data provider class (above). Please advise. Thanks.

Read the article

How to build a web crawler to find a specific advert, which is in an iframe loaded by Javascript

- by ZoFreX

I'm trying to find all instances of an advert on a website. The advert is in an iframe which is loaded by javascript (it doesn't appear at all if javascript is turned off). Detecting the advert itself is extremely simple, both the name of the flash file and the target of the href always contain a certain string. What would be the best "starting point" for achieving this? At the moment I'm considering an Adobe AIR app, which could crawl the site and examine the DOM to find the ad, and would run javascript and load the content of the iframe. The other option I can think of is using Firefox as the platform (using maybe GreaseMonkey or Selenium? I don't really know how to leverage Firefox like this). Does anyone know of anything suitable to build this, or have any suggestions on using Firefox to do it? Extra details: Being CPU intensive isn't really an issue, nor is anything depending on a browser being open. This doesn't need to run on a headless server, it will be running on a powerful desktop box. OS is also not an issue. It would be advantageous if the crawler loaded each page multiple times, as the advert is in rotation. While the crawler does need to execute the javascript and load the content of the iframe, it does not need to be able to display flash files.

Read the article

Adding multiple rss feeds to a script in SCALA InfoChannel Designer 5

- by godleuf

Okay, since it is impossible to talk to anyone on the phone or get support through Scala's "forum", I am going to take a shot and see if anyone out there is feeling my pain. I have a client that uses Scala's InfoChannel Designer and Content Manager. I have had to learn this software from scratch and I have to say it hasn't been easy. I think I am at a point where the overall design is set, but I need to implement a couple of things before I can make this happen. RSS feeds are my issue at this point. Multiple RSS feeds to be specific. I need a feed coming in for 3 areas of content: Wiki News (or equivalent), local weather and a stock ticker. I have learned how to setup a "crawl" using a script example available from Scala's file center and copying and pasting into my design. But from what I have learned first hand and through reading through other forums, you can not have a feed from 3 different sources or urls happening simultaneously. Doesn't seem like it would be an issue, but apparently it is. This small step has held up this project for far too long and I need to get it figured out. This doesn't even touch on my issue of feeding in streaming video as a background but I have gone over this in another question but with no luck thus far. If there is ANYONE out there who is in anything similar using this software, your feedback and/or suggestions would be greatly appreciated. Thanks you for allowing me to vent!

Read the article

Can I automatically throw descriptive exceptions with parameter values and class feild information?

- by Robert H.

I honestly don't throw exceptions often. I catch them even less, ironically. I currently work in shop where we let them bubble up to avicode. For whatever reason, however, avicode isn't configured to capture some of the critical bits I need when these exceptions come bouncing back to my attention. Specifically, I'd like to see the parameter values and the class’s field data at the time of the exception. I’d guess with the large suite of .Net services that I could create a static method to crawl up the stack, gather these bits and store them in a string that I could stick in my exception message. I really don't are how long such a method would take to execute as performance is no longer a concern when I hit one of these scenarios. If it's possible, I'm sure someone has done it. If that's the case, I'm having a hard time finding it. I think any search containing "exception" brings back too many resutls. Anyway, can this be done? If so, some examples or links would be great. Thanks in advance for your time, Robert

Read the article

Where is the Open Source alternative to WPF?

- by Evan Plaice

If we've learned anything from HTML/CSS it's that, declarative languages (like XML) work best to describe User Interfaces because: It's easy to build code preprocessors that can template the code effectively. The code is in a well defined well structured (ideally) format so it's easy to parse. The technology to effectively parse or crawl an XML based source file already exists. The UIs scripted code becomes much simpler and easier to understand. It simple enough that designers are able to design the interface themselves. Programmers suck at creating UIs so it should be made easy enough for designers. I recently took a look at the meat of a WPF application (ie. the XAML) and it looks surprisingly familiar to the declarative language style used in HTML. It's blindingly apparent to me that the current state of desktop UI development is largely fractionalized, otherwise there wouldn't be so much duplicated effort in the domain of user interfaces (IE. GTK, XUL, Qt, Winforms, WPF, etc). There are 45 GUI platforms for Python alone It's painfully obvious to me that there should be a general purpose, open source, standardized, platform independent, markup language for designing desktop GUIs. Much like what the W3C made HTML/CSS into. WPF, or more specifically XAML seems like a pretty likely step in the right direction. Why hasn't anyone in the Open Source community (AFAIK) even scratched the surface of this issue. Now that the 'browser wars' are over should we look forward to a future of 'desktop gui wars?' Note: This topic is relatively subjective in the attempt to be 'future-thinking.' I think that desktop GUI development in its current state sucks ((really)hard) and, even though WPF is still in it's infancy, it presents a likely solution to the problem. Has no one in the OS community looked into developing something similar because they don't see the value, or because it's not worth the effort?

Read the article

Good strategy for copying a "sliding window" of data from a table?

- by chiborg

I have a MySQL table from a third-party application that has millions of rows and only one index - the timestamp of each entry. Now I want to do some heavy self-joins and queries on the data using fields other than the timestamp. Doing the query on the original table would bring the database to a crawl, adding indexes to the table is not an option. Additionally, I only need entries that are newer than one week. My current strategy for doing the queries efficiently is to use a separate table (aux_table) that has the necessary indexes. My questions are: Is there another way to do the queries? and if not, How do I update the data in the indexed table efficiently? So far I have found two approaches for updating aux_table: Truncate aux_table and insert the desired data from the original table. Not very efficient because all the indexes must be re-crated. Check for the biggest timestamp in aux_table and insert all entries with a greater or equal timestamp from the original table. Occasionally drop older entries. Only copying entries with greater timestamp leads to dropped entries (because of entries with same timestamp that were inserted into the original table after the last update).

Read the article

[C#] Three System.Drawing methods manifest slow drawing or flickery: Solutions? or Other Options?

- by Luke Mcneice

Hi all, I am doing a little graphing via the System.Drawing and im having a few problems. I'm holding data in a Queue and i'm drawing(graphing) out that data onto three picture boxes this method fills the picture box then scrolls the graph across. so not to draw on top of the previous drawings (and graduly looking messier) i found 2 solutions to draw the graph. Call plot.Clear(BACKGOUNDCOLOR) before the draw loop [block commented] although this causes a flicker to appear from the time it takes to do the actual drawing loop. call plot.DrawLine(channelPen[5], j, 140, j, 0); just before each drawline [commented] although this causes the drawing to start ok then slow down very quickly to a crawl as if a wait command had been placed before the draw command. Here is the Code for reference: /*plotx.Clear(BACKGOUNDCOLOR) ploty.Clear(BACKGOUNDCOLOR) plotz.Clear(BACKGOUNDCOLOR)*/ for (int j = 1; j < 599; j++) { if (j > RealTimeBuffer.Count - 1) break; QueueEntity past = RealTimeBuffer.ElementAt(j - 1); QueueEntity current = RealTimeBuffer.ElementAt(j); if (j == 1) { //plotx.DrawLine(channelPen[5], 0, 140, 0, 0); //ploty.DrawLine(channelPen[5], 0, 140, 0, 0); //plotz.DrawLine(channelPen[5], 0, 140, 0, 0); } //plotx.DrawLine(channelPen[5], j, 140, j, 0); plotx.DrawLine(channelPen[0], j - 1, (((past.accdata.X - 0x7FFF) / 256) + 64), j, (((current.accdata.X - 0x7FFF) / 256) + 64)); //ploty.DrawLine(channelPen[5], j, 140, j, 0); ploty.DrawLine(channelPen[1], j - 1, (((past.accdata.Y - 0x7FFF) / 256) + 64), j, (((current.accdata.Y - 0x7FFF) / 256) + 64)); //plotz.DrawLine(markerPen, j, 140, j, 0); plotz.DrawLine(channelPen[2], j - 1, (((past.accdata.Z - 0x7FFF) / 256) + 94), j, (((current.accdata.Z - 0x7FFF) / 256) + 94)); } Is there any tricks to avoid these overheads? If not would there be any other/better solutions?

Read the article

SqlCE DB occasionally freezes on one handheld, not another

- by Michael

I have two types of custom handhelds which are similar, but slightly different, each running the same WinForm application and a WinCE database: Type 1: WinCE 4.2, 400 mhz, 93244 kb Type 2: WinCE 5.0, 520 mhz, 84208 kb Type 1 will happily proceed through a large batch db operation (initiated) by the app, by Type 2 will consistently begin c-r-a-w-l-i-n-g (for several to many cycles) at around the 200 cycle mark. As several points it will begin running normally and then crawl again. The app does several db op's (inserts, updates and selects, no deletes). To simplify my situation, I've built a small test app which essentially does this: command_s.CommandText = "select dvr from vr where vid = 2211250"; command_u.CommandText = "update pvr set LocationID=81 where Status='OK' and vri = 27861"; while(going) { command_s.ExecuteScalar(); command_u.ExecuteNonQuery(); } and set it off running on the two units side by side. Sure enough, the slower (400 mhz) unit is outpacing the faster (520 mhz) unit (it's about 5000 cycles ahead right now) and I can see noticable pauses on the 520 mhz unit. What is causing this?

Read the article

[C#] Two System.Drawing methods manifest slow drawing or flickery: Solutions? or Other Options?

- by Luke Mcneice

Hi all, I am doing a little graphing via the System.Drawing and im having a few problems. I'm holding data in a Queue and i'm drawing(graphing) out that data onto three picture boxes this method fills the picture box then scrolls the graph across. so not to draw on top of the previous drawings (and graduly looking messier) i found 2 solutions to draw the graph. Call plot.Clear(BACKGOUNDCOLOR) before the draw loop [block commented] although this causes a flicker to appear from the time it takes to do the actual drawing loop. call plot.DrawLine(channelPen[5], j, 140, j, 0); just before each drawline [commented] although this causes the drawing to start ok then slow down very quickly to a crawl as if a wait command had been placed before the draw command. Here is the Code for reference: /*plotx.Clear(BACKGOUNDCOLOR) ploty.Clear(BACKGOUNDCOLOR) plotz.Clear(BACKGOUNDCOLOR)*/ for (int j = 1; j < 599; j++) { if (j > RealTimeBuffer.Count - 1) break; QueueEntity past = RealTimeBuffer.ElementAt(j - 1); QueueEntity current = RealTimeBuffer.ElementAt(j); if (j == 1) { //plotx.DrawLine(channelPen[5], 0, 140, 0, 0); //ploty.DrawLine(channelPen[5], 0, 140, 0, 0); //plotz.DrawLine(channelPen[5], 0, 140, 0, 0); } //plotx.DrawLine(channelPen[5], j, 140, j, 0); plotx.DrawLine(channelPen[0], j - 1, (((past.accdata.X - 0x7FFF) / 256) + 64), j, (((current.accdata.X - 0x7FFF) / 256) + 64)); //ploty.DrawLine(channelPen[5], j, 140, j, 0); ploty.DrawLine(channelPen[1], j - 1, (((past.accdata.Y - 0x7FFF) / 256) + 64), j, (((current.accdata.Y - 0x7FFF) / 256) + 64)); //plotz.DrawLine(markerPen, j, 140, j, 0); plotz.DrawLine(channelPen[2], j - 1, (((past.accdata.Z - 0x7FFF) / 256) + 94), j, (((current.accdata.Z - 0x7FFF) / 256) + 94)); } Is there any tricks to avoid these overheads? If not would there be any other/better solutions?

Read the article

Using SMO to call Database.ExecuteNonQuery() concurrently?

- by JimDaniel

I have been banging my head against the wall trying to figure out how I can run update scripts concurrently against multiple databases in a single SQL Server instance using SMO. Our environments have an ever-increasing number of databases which need updating, and iterating through one at a time is becoming a problem (too slow). From what I understand SMO does not support concurrent operations, and my tests have bore that out. There seems to be shared memory at the Server object level, for things like DataReader context, keeps throwing exceptions such as "reader is already open." I apologize for not having the exact exceptions I am getting. I will try to get them and update this post. I am no expert on SMO and just feeling my way through to be honest. Not really sure I am approaching it the right way, but it's something that has to be done, or our productivity will slow to a crawl. So how would you guys do something like this? Am I using the wrong technology with SMO? All I am wanting to do is execute sql scripts against databases in a single sql server instance in parallel. Thanks for any help you can give, Daniel

Read the article

Plotting a graph with GD

- by Nayena

Here it goes. I have been thinking about this for a long time, and havent really been able to put up a proper way to do it yet. I havent implemented anything yet, as im still designing the thing. The idea is that i crawl a website for internal links, i got this settled, its easy, but after the crawling, i end up with an array with lots of links, and how many times those particular link appears on the site that i crawled (and how they're connected). With this huge array, i want to draw a graph somehow. Assuming i can handle the data correctly, the real question here is how i can draw this in a image by the use of the GD library. I figured if theres less than 12 elements, i can align them up on a unit circle spacing them up as a circle and then connecting them accordingly, so anything up to 12 elements shouldn't be a problem, but if theres more than 12, it could be awesome getting them lined up like this Or well, thats just a rough drawing, but i guess its just to prove a point. So i'm here looking for guidance or tips towards getting the math down to getting the stuff lined up in a good way. I have previously made bar-graphs, so i have little experience doing math with GD. If possible, id prefer not using some plotter-library - in the end, it gives me a better understanding on how things are supposed to be.

Read the article

Wordpress paths issue

- by Martin

I have set a crawler up in wordpress which grabs stocks data and writes to file which when a user enters a symbol/ticker the variable is read and if it matches the data of a previous crawl for that particular companies data will echo the text file on page, if no data is found the crawler then sets off grabs it and writes to file to save for the next time that symbol is used. The problem im having is that everything works groovey apart from one thing, when the content is written to file it saves it in the WP root and not inside a subfolder of the theme, basicaly this means that root becomes untidy very quickly and also should the theme be used on another site then its not practical as some important info is missing. I have tried bloginfo and absolute both return the same failure. This is the code i am using to write to file, like i say it works apart from writing the file into root. <?php $CompDetails = "http://www.devserverurl.com/mattv1/wp-content/themes/stocks/tools/modules/Stock_Quote/company_details/$Symbol.txt"; if (file_exists($CompDetails)) {} else { include ('crawler_file.php'); $html = file_get_html("http://targeturl.com/research/stocks/snapshot/snapshot.asp?ticker=$Symbol:US"); $es = $html->find('div[class="detailsDataContainerLt"]'); $tickerdetails = ("$es[0]"); $FileHandle2 = fopen($CompDetails, 'w') or die("can't open file"); fwrite($FileHandle2, $tickerdetails); fclose($FileHandle2); } ?> edit below, have also tried this and the same happens as above <?php if (file_exists($_SERVER['DOCUMENT_ROOT'] . "/wp-content/themes/stocks/tools/modules/Stock_Quote/company_details/$Symbol.txt")) {} else { include ('crawler_file.php'); $html = file_get_html("http://targeturl.com/research/stocks/snapshot/snapshot.asp?ticker=$Symbol:US"); $es = $html->find('div[class="detailsDataContainerLt"]'); $tickerdetails = ("$es[0]"); $FileHandle2 = fopen($_SERVER['DOCUMENT_ROOT'] . "/wp-content/themes/stocks/tools/modules/Stock_Quote/company_details/$Symbol.txt", 'w') or die("can't open file"); fwrite($FileHandle2, $tickerdetails); fclose($FileHandle2); } ?>

Read the article

recursive wget with hotlinked requisites

- by dongle

I often use wget to mirror very large websites. Sites that contain hotlinked content (be it images, video, css, js) pose a problem, as I seem unable to specify that I would like wget to grab page requisites that are on other hosts, without having the crawl also follow hyperlinks to other hosts. For example, let's look at this page https://dl.dropbox.com/u/11471672/wget-all-the-things.html Let's pretend that this is a large site that I would like to completely mirror, including all page requisites – including those that are hotlinked. wget -e robots=off -r -l inf -pk ^^ gets everything but the hotlinked image wget -e robots=off -r -l inf -pk -H ^^ gets everything, including hotlinked image, but goes wildly out of control, proceeding to download the entire web wget -e robots=off -r -l inf -pk -H --ignore-tags=a ^^ gets the first page, including both hotlinked and local image, does not follow the hyperlink to the site outside of scope, but obviously also does not follow the hyperlink to the next page of the site. I know that there are various other tools and methods of accomplishing this (HTTrack and Heritrix allow for the user to make a distinction between hotlinked content on other hosts vs hyperlinks to other hosts) but I'd like to see if this is possible with wget. Ideally this would not be done in post-processing, as I would like the external content, requests, and headers to be included in the WARC file I'm outputting.

Read the article

VMware Player loses internet connectivity

- by Martha

Periodically, the internet simply stops working in my virtual machine, and the only way I can get it working again is to restart the host computer. Since I use the virtual machine specifically for testing web pages, this is, shall we say, a bother. Details: I have Windows XP Pro running in VMware Player (v. 3.0.0 build-203739) on a Windows 7 host. It's set to NAT (shared IP address) because the firewall won't allow a bridged connection. Every couple of days or so, first the internet slows down to a crawl, then eventually it stops working altogether. Both VMWare and the virtual OS report that they are connected, everything looks just peachy, I can reach the internet from the host, but on the VM, all web pages time out and/or report that the server could not be found. (Browser-independent; tried with IE, FF, Chrome, Safari, and Opera.) When this happens, the only way I've found to restore the internet connectivity is to restart the host machine. Restarting the VM doesn't help, nor does refreshing network connections on either the host or the guest. (Although I'm not entirely sure I've found the proper way to refresh a network connection in Windows 7...) I have not noticed any predictability about when the problem occurs, i.e. it's not immediately after I do anything special. It seems to occur mostly after putting the host to sleep once or twice, but it has happened even if the host has been in continuous use. It also seems independent of when I start using the VM - sometimes, I wake up the VM and the internet is really slow in it, then eventually stops working altogether; other times, I wake up the VM, use it perfectly happily for a while, then suddenly the internet is gone. Does anyone know why this is occurring? Failing that, is there a workaround that's less drastic than restarting the host? (Windows 7 startup times are blazingly fast compared to previous versions of Windows, but it's still a hassle to close all my programs and reopen them again.) Edit: while badges overall are nice, the Tumbleweed badge isn't helping me to solve my problem. Hasn't anyone encountered anything even remotely similar?

Read the article

Slow Windows Explorer on Windows 7

- by MadBoy

I have Laptop with i7 (4 cores), 8GB ram and SSD OCZ Vertex 3 MaxIOPS which in testing that I did just now does 400mb/s+ read/write. However the responsiveness of Windows Explorer is far from being perfect. Opening up Computer, Documents, going into folders is very slow (1-5seconds). I don't have any viruses or spyware and I have tried changing properties to optimize view for General Items. I tried disabling Search Indexer but it made search in Outlook 2010 crawl and didn't bring any other effect. Even double clicking on file takes some time to open things up (like clicking a Word document). I don't have any drives mapped, my computer is not joined to domain. I have multiple VPN connections that I connect to but they all have disabled default gateways. I tried using CC Cleaner or some Windows 7 Tweaks app to disable some things. I am power user using Visual Studio, Tortoise SVN and other developer/administration apps. Any non obvious ideas? Edit: So I've been trying to pinpoint where the issue comes from and it seems that straight after reboot Windows Explorer opens very fast, when I load 3-4 programs (Royal TS, Visual Studio, Outlook) it's noticeably slower and the more programs I have it gets worse. After I start closing programs it starts working better and if I leave 2 open it's fast again. I tried doing some research with DiskMon and other programs from sysinternals but couldn't find anything suspicious. Below are stats during normal usage with a lots of programs open: - Ram usage with a lot of programs open and no swap file (i disabled it for testing): 6.95GB - CPU usage: 15%, none of the cores takes more then 50% (I have VS 2010 open x 4) HD Tune Pro: OCZ-VERTEX3 MI Benchmark Test capacity: full Read transfer rate Transfer Rate Minimum : 363.9 MB/s Transfer Rate Maximum : 505.5 MB/s Transfer Rate Average : Access Time : Burst Rate : CPU Usage : HD Tune Pro: OCZ-VERTEX3 MI File Benchmark Drive C: Transfer rate test File Size: 500 MB Sequential read 484102 KB/s Sequential write 444714 KB/s Random read 7779 IOPS Random write 16888 IOPS Random read (queue depth = 32) 73007 IOPS Random write (queue depth = 32) 69790 IOPS HD Tune Pro: OCZ-VERTEX3 MI Random Access Test capacity: full Read test Transfer size operations / sec avg. access time max. access time avg. speed 512 bytes 3260 IOPS 0.306 ms 2.106 ms 1.592 MB/s 4 KB 4161 IOPS 0.240 ms 2.006 ms 16.256 MB/s 64 KB 2382 IOPS 0.419 ms 2.367 ms 148.934 MB/s 1 MB 449 IOPS 2.225 ms 4.197 ms 449.407 MB/s Random 809 IOPS 1.235 ms 6.551 ms 410.527 MB/s HD Tune Pro: OCZ-VERTEX3 MI Extra Tests Test capacity: full Random seek 3975 IOPS 0.252 ms 1.941 MB/s Random seek 4 KB 4245 IOPS 0.236 ms 16.583 MB/s Butterfly seek 4086 IOPS 0.245 ms 1.995 MB/s Random seek / size 64 KB 3812 IOPS 0.262 ms 58.606 MB/s Random seek / size 8 MB 120 IOPS 8.348 ms 485.737 MB/s Sequential outer 4524 IOPS 0.221 ms 282.721 MB/s Sequential middle 4429 IOPS 0.226 ms 276.818 MB/s Sequential inner 5504 IOPS 0.182 ms 344.000 MB/s Burst rate 4472 IOPS 0.224 ms 279.475 MB/s

Read the article

Troubleshooting major performance issue: Is culprit Intel RST, Hard drive, or something else?

- by Sean Killeen

The Setup I have the following components that come into play in this situation: ASUS P8Z68 V/PRO motherboard a RAID1 configuration (1x 1TB drive, 1 x 2TB drive -- I explain below), accelerated with an SSD using Intel's RST software, and 1 TB drive standing by as a spare. Core i7 2600k 32 GB RAM Windows 8.1 This box was designed to be beast, and until just recently, was very good at being just that. What's Happening The system has slowed to a crawl whenever it touches the disk. Things appear to work at normal speed when dealing with memory. For example, typing this is fine, but saving it to disk from notepad gave me a 5-7 second pause when clicking save. The disks appear to be at 100% all the time (e.g. the light on the disk access on the PC is solidly on -- not even any flashing) In ProcExp it appears that the disk is barely being utilized at all: Intel RST reports that everything is fine: Other Details Prior to this happening, RST had reported that my drives were failing (one went bad, one was throwing SMART events). This made sense; they were at the tail end of their warranty and the PC is on almost all the time. I RMA'd the drives via Seagate. In the meantime, I'd purchased a 2TB drive because I didn't realize that the 1TB drives were under warranty. I figured I'd replace the other 1 TB drive with another 2 TB when it died but then discovered the warranty. AFAIK, I haven't done any major updates since 8.1 and it worked fine after those. Question(s) How can I troubleshoot this? What is the best way to try to figure out why disks are being maxed out despite the OS reporting barely any disk usage and that everything is OK? Given the failures, etc. that I describe above, is it possible that the problem could be the I/O on the motherboard itself? If so, how would I even be able to diagnose it? I'm betting the drives that Seagate gave me are refurbished (didn't think to look; that's dumb). Is it possible that the same model drive, refurbished, could somehow cause this? In terms of how RAID1 works, is it possible that one drive is "falling behind" somehow, and that the RAID1 is constantly trying to fix the mirroring? If so, this seems like Intel RST would report on it, but I wanted to consider it as an option.

Read the article

Nginx giving a lot of 502 errors

- by Loki

Since a while I have installed nginx and everything seemed to be working fine, recently I found out that about 20% of the time users are getting 502-errors. This is also noticable when Google tries to crawl my site in Webmaster Tools (from 10000 posts, approx. 2000 502 errors) At first I was thinking to disable nginx, but I'd really like to keep using it. I'm running it on a server with 2GB RAM and 4 Reserved CPU Cores. WHM/cPanel installed and Mod_Ruid2 enabled + DSO as a PHP Handler with APC caching installed. Is there anything I can change in the config, that can fix this? I have installed Nginx Admin in WHM and here is what's in the configuration editor screen: user nobody; worker_processes 4; error_log /var/log/nginx/error.log info; worker_rlimit_nofile 20480; events { worker_connections 10240; # increase for busier servers use epoll; # you should use epoll here for Linux kernels 2.6.x } http { server_name_in_redirect off; server_names_hash_max_size 10240; server_names_hash_bucket_size 1024; include mime.types; default_type application/octet-stream; server_tokens off; disable_symlinks if_not_owner; sendfile on; tcp_nopush on; tcp_nodelay on; keepalive_timeout 5; gzip on; gzip_vary on; gzip_disable "MSIE [1-6]\."; gzip_proxied any; gzip_http_version 1.1; gzip_min_length 1000; gzip_comp_level 6; gzip_buffers 16 8k; text/plain text/xml text/css application/x-javascript application/xml image/png image/x-icon image/gif image/jpeg application/xml+rss text/javascript application/atom+xml; ignore_invalid_headers on; client_header_timeout 3m; client_body_timeout 3m; send_timeout 3m; reset_timedout_connection on; connection_pool_size 256; client_header_buffer_size 256k; large_client_header_buffers 4 256k; client_max_body_size 200M; client_body_buffer_size 128k; request_pool_size 32k; output_buffers 4 32k; postpone_output 1460; proxy_temp_path /tmp/nginx_proxy/; client_body_in_file_only on; log_format bytes_log "$msec $bytes_sent ."; include "/etc/nginx/vhosts/*"; } I hope someone can help me out. Thanks in advance!

Read the article

Should I choose KVM/XEN over OpenVZ or use them together?

- by Krystian

I've got a dual xeon e5504 server, with [for now] only 8GB of ram. Storage is'n impressive either: 3x 146GB sas in raid5 + 500GB sata drives. Currently it works as a development server, but it's over speced for our needs and since our development methods changed through last 2 years we decided it will work as a production system for some of our applications + we would like to have a separate system for testing/research. Our apps are mainly web apps deployed on tomcats [plural as some of the apps require older versions] and connected to Postgres. I would like to have a production system, where only httpd+tomcat+db are setup and nothing else runs there. Sterile system. Apart from that, I would like a test system, where I can play with different JVM settings, deploy my test apps, play with tomcat/httpd settings and restart them without interfering with the production system. Apart from that, I would like to be able to play with different linux flavors, with newer kernels to test how they work etc. I know, this is not possible with OpenVZ and I would have to choose KVM for that. I am thinking about merging the two, and setting up a KVM to be able to work with different systems [linux only to be frank] + use openVZ to setup separate machines for my development needs. I would simply go with that, but reading here and there about the performance impact full virtualization has over containers and looking at the specs of my server makes me think twice about it. I don't want to loose too much performance, especially because of the nature of my apps [few JVMs running at the same time]. It will be my first time with virtualization, apart from using desktop virtualbox/vmserver. Although I am a fast learner I don't want to mess with the main system so much that it will break the production apps or make them crawl. Although they are more or less internal apps and they don't produce much load, they need to be stable. I've read, that KVM host is a normal linux installation and it allows to run normal processes on it. If that is so, does it allow to run openVZ as well? I mean... can I have KVM and OpenVZ running on the same system/kernel? Or do I have to setup another system to run OpenVZ containers? How much performance impact can this have for me? Will my hardware suffice? oh and one more thing... unfortunately I'm quite limited with the funds... I'm looking for a free solution only :/

Read the article

Why is my concurrency capacity so low for my web app on a LAMP EC2 instance?

- by AMF

I come from a web developer background and have been humming along building my PHP app, using the CakePHP framework. The problem arose when I began the ab (Apache Bench) testing on the Amazon EC2 instance in which the app resides. I'm getting pretty horrendous average page load times, even though I'm running a c1.medium instance (2 cores, 2GB RAM), and I think I'm doing everything right. I would run: ab -n 200 -c 20 http://localhost/heavy-but-view-cached-page.php Here are the results: Concurrency Level: 20 Time taken for tests: 48.197 seconds Complete requests: 200 Failed requests: 0 Write errors: 0 Total transferred: 392111200 bytes HTML transferred: 392047600 bytes Requests per second: 4.15 [#/sec] (mean) Time per request: 4819.723 [ms] (mean) Time per request: 240.986 [ms] (mean, across all concurrent requests) Transfer rate: 7944.88 [Kbytes/sec] received While the ab test is running, I run VMStat, which shows that Swap stays at 0, CPU is constantly at 80-100% (although I'm not sure I can trust this on a VM), RAM utilization ramps up to about 1.6G (leaving 400M free). Load goes up to about 8 and site slows to a crawl. Here's what I think I'm doing right on the code side: In Chrome browser uncached pages typically load in 800-1000ms, and cached pages load in 300-500ms. Not stunning, but not terrible either. Thanks to view caching, there might be at most one DB query per page-load to write session data. So we can rule out a DB bottleneck. I have APC on. I am using Memcached to serve the view cache and other site caches. xhprof code profiler shows that cached pages take up 10MB-40MB in memory and 100ms - 1000ms in wall time. Pages that would be the worst offenders would look something like this in xhprof: Total Incl. Wall Time (microsec): 330,143 microsecs Total Incl. CPU (microsecs): 320,019 microsecs Total Incl. MemUse (bytes): 36,786,192 bytes Total Incl. PeakMemUse (bytes): 46,667,008 bytes Number of Function Calls: 5,195 My Apache config: KeepAlive On MaxKeepAliveRequests 100 KeepAliveTimeout 3 <IfModule mpm_prefork_module> StartServers 5 MinSpareServers 5 MaxSpareServers 10 MaxClients 120 MaxRequestsPerChild 1000 </IfModule> Is there something wrong with the server? Some gotcha with the EC2? Or is it my code? Some obvious setting I should look into? Too many DNS lookups? What am I missing? I really want to get to 1,000 concurrency capacity, but at this rate, it ain't gonna happen.

Search Results

Search found 446 results on 18 pages for 'crawl'.

Page 17/18 | < Previous Page | 13 14 15 16 17 18 | Next Page >

- by deathy

- by Vaughan Durno

- by Sam

- by vbNewbie

- by Jesus Rodriguez

- by vbNewbie

- by Jojo

- by ZoFreX

- by godleuf

- by Robert H.

- by Evan Plaice

- by chiborg

- by Luke Mcneice

- by Michael

- by Luke Mcneice

- by JimDaniel

- by Nayena

- by Martin

- by dongle

- by Martha

- by MadBoy

- by Sean Killeen

- by Loki

- by Krystian

- by AMF

< Previous Page | 13 14 15 16 17 18 | Next Page >