Search Results

Search found 415 results on 17 pages for 'bottleneck'.

Page 15/17 | < Previous Page | 11 12 13 14 15 16 17 | Next Page >

IIS/ASP.NET performance incident - Perfmon Current Annonymous Users going through roof but Requests/sec low

- by Laurence

Setup: ASP.NET 4.0 website on IIS 6.0 on Win 2003 64 bit, 8xCPUs, 16GB memory, separate SQL 2005 DB server. Had a serious slowdown today with any otherwise fairly well performing ASP.NET site. For a period of a couple of hours all page requests were taking a very long time to be served - e.g. 30-60s compared to usual 2s. The w3wp.exe's CPU and memory usage on the webserver was not much higher than normal. The application pool was not in the middle of recycling (and it hadn't recycled for several hours). Bottlenecks in the database were ruled out - no blocks occurring and query results were being returned quickly. I couldn't make any sense of it and set up the following Perfmon counters: Current Anonymous Users (for site in question) Get requests/sec (ditto) Requests/sec for the ASP.NET application running the site Get requests/sec was averaging 100-150. Requests/sec for ASP.NET was averaging 5-10. However Current Anonymous Users was around 200. And then as I was watching, the Current Anonymous Users began to climb steeply going up to about 500 within a few minutes. All this time Get requests/sec & Requests/sec for ASP.NET was if anything going down. I did a whole load of things (in a panic!) to try to get the site working, like shutting it down, recycling the app pool, and adding another worker process to the pool. I also extended the expiration time for content (in IIS under HTTP Headers) in an attempt to lower the number of requests for static files (there are a lot of images on the site). The site is now back to normal, and the counters are fairly steady and reading (added Current Connections counter): Current Anonymous Users : average 30 Get requests/sec : average 100 Requests/sec for ASP.NET : 5 Current Connections : average 300 I have also observed an inverse relationship between Get requests/sec & Current Anonymous Users. Usually both are fairly steady but there will be short periods when Get requests/sec will go down dramatically and Current Anonymous Users will go up in a perfect mirror image. Then they will flip back to their usual levels. So, my questions are: Thinking of the original performance issue - if w3wp.exe CPU, memory usage were normal and there was no DB bottleneck, what could explain page requests taking 20 times longer to be served than usual? What other counters should I be looking at if this happens again? What explains the inverse relationship between Get requests/sec & Current Anonymous Users? What could explain Current Anonymous Users going from 200 to 500 within a few minutes? Many thanks for any insight into this.

Read the article
udp expected behaviour not responding to test result

- by ernst

I have a local network topology that is structured as follows: three hosts and a switch in the middle. I am using a switch that supports 10,100,1000 Mbit/s full/half duplex connection. I have configured the hosts with a static ip 172.16.0.1-2-3/25. This is the output of ifconfig eth0 Link encap: Ethernet HWaddr ***** inet addr:172.16.0.3 Bcast:172.16.0.127 Mask:255.255.255.128 UP BROADCAST MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) Interrupt:16 The output on H1 and H2 is perfectly matchable They are mutually reachable since i have tested the network with ping. I have forced the ethernet interface to work at 10M with ethtool -s eth0 speed 10 duplex full autoneg on this is the output of ethtool eth0 supported ports: [ TP ] Supported link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Half 1000baseT/Full S upported pause frame use: No Supports auto-negotiation: Yes Advertised link modes: 10baseT/Full Advertised pause frame use: Symmetric A dvertised auto-negotiation: Yes Speed: 10Mb/s Duplex: Full Port: Twisted Pair PHYAD: 1 Transceiver: internal Auto-negotiation: on MDI-X: Unknown Supports Wake-on: g Wake-on: d Current message level: 0x000000ff (255) drv probe link timer ifdown ifup rx_err tx_err Link detected: yes – I am doing an experimental test using nttcp to calculate the GOODPUT in the case that H1 and H2 at the same time send data to H3. Since the three links have the same forced capability and the amount of arrving data speed is 10 from H1+10 from H2--20M to H3 it would be expected a bottleneck effect and, due to the non reliable nature of udp, a packet loss. But this doesn't appen since the output of nttcp application shows the same number of byte sended and received. this is the output of nttcp on h3 nttcp -T -r -u 172.16.0.2 & nttcp -T -r -u 172.16.0.1 [1] 4071 Bytes Real s CPU s Real-MBit/s CPU-MBit/s Calls Real-C/s CPU-C/s l 8388608 13.74 0.05 4.8848 1398.0140 2049 149.14 42684.8 Bytes Real s CPU s Real-MBit/s CPU-MBit/s Calls Real-C/s CPU-C/s l 8388608 14.02 0.05 4.7872 1398.0140 2049 146.17 42684.8 1 8388608 13.56 0.06 4.9500 1118.4065 2051 151.28 34181.1 1 8388608 13.89 0.06 4.8310 1198.3084 2051 147.65 36623.0 – How is this possible? Am i missing something? Any help will be gratefully apprecciated, Best regards

Read the article
Choosing parts for a high-spec custom PC - feedback required [closed]

- by James

I'm looking to build a high-spec PC costing under ~£800 (bearing in mind I can get the CPU half price). This is my first time doing this so I have plenty of questions! I have been doing lots of research and this is what I have come up with: http://pcpartpicker.com/uk/p/j4lE Usage: I will be using it for Adobe CS6, rendering in 3DS Max, particle simulations in Realflow and for playing games like GTA IV (and V when it comes out), Crysis 1/2, Saints Row The Third, Deus Ex HR, etc. Questions: Can you see any obvious problem areas with the current setup? Will it be sufficient for the above usage? I won't be doing any overclocking initially. Is it worth buying the H60 liquid cooler, or will the fan that comes with the CPU be sufficient? Is water cooling generally quieter? Is the chosen motherboard good for the current components? And is it future-proof? I read that the HDD is often the bottleneck when it comes to gaming. I presume this is true to other high-end applications? If so, is my selection good? I keep changing my mind about the GPU; first the 560, now the 660. Can anyone shed some light on how to choose? I read mixed opinions about matching the GPU to the CPU. Will the 560 or the 660 be sufficient for my required usage? Atm I'm basing my choice on the PassMark benchmarks and how much they cost. The specs on the GeForce website state that the 560 and the 660 both require 450W. Is this a good figure to base the wattage of my PSU on? If so, how do you decide? Do I really need 750W? The latest GTX 690 requires 650W. Is it a good idea to buy a 750W PSU now to future-proof myself?

Read the article
Tips for maximizing Nginx requests/sec?

- by linkedlinked

I'm building an analytics package, and project requirements state that I need to support 1 billion hits per day. Yep, "billion". In other words, no less than 12,000 hits per second sustained, and preferably some room to burst. I know I'll need multiple servers for this, but I'm trying to get maximum performance out of each node before "throwing more hardware at it". Right now, I have the hits-tracking portion completed, and well optimized. I pretty much just save the requests straight into Redis (for later processing with Hadoop). The application is Python/Django with a gunicorn for the gateway. My 2GB Ubuntu 10.04 Rackspace server (not a production machine) can serve about 1200 static files per second (benchmarked using Apache AB against a single static asset). To compare, if I swap out the static file link with my tracking link, I still get about 600 requests per second -- I think this means my tracker is well optimized, because it's only a factor of 2 slower than serving static assets. However, when I benchmark with millions of hits, I notice a few things -- No disk usage -- this is expected, because I've turned off all Nginx logs, and my custom code doesn't do anything but save the request details into Redis. Non-constant memory usage -- Presumably due to Redis' memory managing, my memory usage will gradually climb up and then drop back down, but it's never once been my bottleneck. System load hovers around 2-4, the system is still responsive during even my heaviest benchmarks, and I can still manually view http://mysite.com/tracking/pixel with little visible delay while my (other) server performs 600 requests per second. If I run a short test, say 50,000 hits (takes about 2m), I get a steady, reliable 600 requests per second. If I run a longer test (tried up to 3.5m so far), my r/s degrades to about 250. My questions -- a. Does it look like I'm maxing out this server yet? Is 1,200/s static files nginx performance comparable to what others have experienced? b. Are there common nginx tunings for such high-volume applications? I have worker threads set to 64, and gunicorn worker threads set to 8, but tweaking these values doesn't seem to help or harm me much. c. Are there any linux-level settings that could be limiting my incoming connections? d. What could cause my performance to degrade to 250r/s on long-running tests? Again, the memory is not maxing out during these tests, and HDD use is nil. Thanks in advance, all :)

Read the article
Why is my concurrency capacity so low for my web app on a LAMP EC2 instance?

- by AMF

I come from a web developer background and have been humming along building my PHP app, using the CakePHP framework. The problem arose when I began the ab (Apache Bench) testing on the Amazon EC2 instance in which the app resides. I'm getting pretty horrendous average page load times, even though I'm running a c1.medium instance (2 cores, 2GB RAM), and I think I'm doing everything right. I would run: ab -n 200 -c 20 http://localhost/heavy-but-view-cached-page.php Here are the results: Concurrency Level: 20 Time taken for tests: 48.197 seconds Complete requests: 200 Failed requests: 0 Write errors: 0 Total transferred: 392111200 bytes HTML transferred: 392047600 bytes Requests per second: 4.15 [#/sec] (mean) Time per request: 4819.723 [ms] (mean) Time per request: 240.986 [ms] (mean, across all concurrent requests) Transfer rate: 7944.88 [Kbytes/sec] received While the ab test is running, I run VMStat, which shows that Swap stays at 0, CPU is constantly at 80-100% (although I'm not sure I can trust this on a VM), RAM utilization ramps up to about 1.6G (leaving 400M free). Load goes up to about 8 and site slows to a crawl. Here's what I think I'm doing right on the code side: In Chrome browser uncached pages typically load in 800-1000ms, and cached pages load in 300-500ms. Not stunning, but not terrible either. Thanks to view caching, there might be at most one DB query per page-load to write session data. So we can rule out a DB bottleneck. I have APC on. I am using Memcached to serve the view cache and other site caches. xhprof code profiler shows that cached pages take up 10MB-40MB in memory and 100ms - 1000ms in wall time. Pages that would be the worst offenders would look something like this in xhprof: Total Incl. Wall Time (microsec): 330,143 microsecs Total Incl. CPU (microsecs): 320,019 microsecs Total Incl. MemUse (bytes): 36,786,192 bytes Total Incl. PeakMemUse (bytes): 46,667,008 bytes Number of Function Calls: 5,195 My Apache config: KeepAlive On MaxKeepAliveRequests 100 KeepAliveTimeout 3 <IfModule mpm_prefork_module> StartServers 5 MinSpareServers 5 MaxSpareServers 10 MaxClients 120 MaxRequestsPerChild 1000 </IfModule> Is there something wrong with the server? Some gotcha with the EC2? Or is it my code? Some obvious setting I should look into? Too many DNS lookups? What am I missing? I really want to get to 1,000 concurrency capacity, but at this rate, it ain't gonna happen.

Read the article
Parallelism in .NET – Part 8, PLINQ’s ForAll Method

- by Reed

Parallel LINQ extends LINQ to Objects, and is typically very similar. However, as I previously discussed, there are some differences. Although the standard way to handle simple Data Parellelism is via Parallel.ForEach, it’s possible to do the same thing via PLINQ. PLINQ adds a new method unavailable in standard LINQ which provides new functionality… LINQ is designed to provide a much simpler way of handling querying, including filtering, ordering, grouping, and many other benefits. Reading the description in LINQ to Objects on MSDN, it becomes clear that the thinking behind LINQ deals with retrieval of data. LINQ works by adding a functional programming style on top of .NET, allowing us to express filters in terms of predicate functions, for example. PLINQ is, generally, very similar. Typically, when using PLINQ, we write declarative statements to filter a dataset or perform an aggregation. However, PLINQ adds one new method, which provides a very different purpose: ForAll. The ForAll method is defined on ParallelEnumerable, and will work upon any ParallelQuery<T>. Unlike the sequence operators in LINQ and PLINQ, ForAll is intended to cause side effects. It does not filter a collection, but rather invokes an action on each element of the collection. At first glance, this seems like a bad idea. For example, Eric Lippert clearly explained two philosophical objections to providing an IEnumerable<T>.ForEach extension method, one of which still applies when parallelized. The sole purpose of this method is to cause side effects, and as such, I agree that the ForAll method “violates the functional programming principles that all the other sequence operators are based upon”, in exactly the same manner an IEnumerable<T>.ForEach extension method would violate these principles. Eric Lippert’s second reason for disliking a ForEach extension method does not necessarily apply to ForAll – replacing ForAll with a call to Parallel.ForEach has the same closure semantics, so there is no loss there. Although ForAll may have philosophical issues, there is a pragmatic reason to include this method. Without ForAll, we would take a fairly serious performance hit in many situations. Often, we need to perform some filtering or grouping, then perform an action using the results of our filter. Using a standard foreach statement to perform our action would avoid this philosophical issue: // Filter our collection var filteredItems = collection.AsParallel().Where( i => i.SomePredicate() ); // Now perform an action foreach (var item in filteredItems) { // These will now run serially item.DoSomething(); } .csharpcode, .csharpcode pre { font-size: small; color: black; font-family: consolas, "Courier New", courier, monospace; background-color: #ffffff; /*white-space: pre;*/ } .csharpcode pre { margin: 0em; } .csharpcode .rem { color: #008000; } .csharpcode .kwrd { color: #0000ff; } .csharpcode .str { color: #006080; } .csharpcode .op { color: #0000c0; } .csharpcode .preproc { color: #cc6633; } .csharpcode .asp { background-color: #ffff00; } .csharpcode .html { color: #800000; } .csharpcode .attr { color: #ff0000; } .csharpcode .alt { background-color: #f4f4f4; width: 100%; margin: 0em; } .csharpcode .lnum { color: #606060; } This would cause a loss in performance, since we lose any parallelism in place, and cause all of our actions to be run serially. We could easily use a Parallel.ForEach instead, which adds parallelism to the actions: // Filter our collection var filteredItems = collection.AsParallel().Where( i => i.SomePredicate() ); // Now perform an action once the filter completes Parallel.ForEach(filteredItems, item => { // These will now run in parallel item.DoSomething(); }); This is a noticeable improvement, since both our filtering and our actions run parallelized. However, there is still a large bottleneck in place here. The problem lies with my comment “perform an action once the filter completes”. Here, we’re parallelizing the filter, then collecting all of the results, blocking until the filter completes. Once the filtering of every element is completed, we then repartition the results of the filter, reschedule into multiple threads, and perform the action on each element. By moving this into two separate statements, we potentially double our parallelization overhead, since we’re forcing the work to be partitioned and scheduled twice as many times. This is where the pragmatism comes into play. By violating our functional principles, we gain the ability to avoid the overhead and cost of rescheduling the work: // Perform an action on the results of our filter collection .AsParallel() .Where( i => i.SomePredicate() ) .ForAll( i => i.DoSomething() ); The ability to avoid the scheduling overhead is a compelling reason to use ForAll. This really goes back to one of the key points I discussed in data parallelism: Partition your problem in a way to place the most work possible into each task. Here, this means leaving the statement attached to the expression, even though it causes side effects and is not standard usage for LINQ. This leads to my one guideline for using ForAll: The ForAll extension method should only be used to process the results of a parallel query, as returned by a PLINQ expression. Any other usage scenario should use Parallel.ForEach, instead.

Read the article
SQL SERVER – PAGELATCH_DT, PAGELATCH_EX, PAGELATCH_KP, PAGELATCH_SH, PAGELATCH_UP – Wait Type – Day 12 of 28

- by pinaldave

This is another common wait type. However, I still frequently see people getting confused with PAGEIOLATCH_X and PAGELATCH_X wait types. Actually, there is a big difference between the two. PAGEIOLATCH is related to IO issues, while PAGELATCH is not related to IO issues but is oftentimes linked to a buffer issue. Before we delve deeper in this interesting topic, first let us understand what Latch is. Latches are internal SQL Server locks which can be described as very lightweight and short-term synchronization objects. Latches are not primarily to protect pages being read from disk into memory. It’s a synchronization object for any in-memory access to any portion of a log or data file.[Updated based on comment of Paul Randal] The difference between locks and latches is that locks seal all the involved resources throughout the duration of the transactions (and other processes will have no access to the object), whereas latches locks the resources during the time when the data is changed. This way, a latch is able to maintain the integrity of the data between storage engine and data cache. A latch is a short-living lock that is put on resources on buffer cache and in the physical disk when data is moved in either directions. As soon as the data is moved, the latch is released. Now, let us understand the wait stat type related to latches. From Book On-Line: PAGELATCH_DT Occurs when a task is waiting on a latch for a buffer that is not in an I/O request. The latch request is in Destroy mode. PAGELATCH_EX Occurs when a task is waiting on a latch for a buffer that is not in an I/O request. The latch request is in Exclusive mode. PAGELATCH_KP Occurs when a task is waiting on a latch for a buffer that is not in an I/O request. The latch request is in Keep mode. PAGELATCH_SH Occurs when a task is waiting on a latch for a buffer that is not in an I/O request. The latch request is in Shared mode. PAGELATCH_UP Occurs when a task is waiting on a latch for a buffer that is not in an I/O request. The latch request is in Update mode. PAGELATCH_X Explanation: When there is a contention of access of the in-memory pages, this wait type shows up. It is quite possible that some of the pages in the memory are of very high demand. For the SQL Server to access them and put a latch on the pages, it will have to wait. This wait type is usually created at the same time. Additionally, it is commonly visible when the TempDB has higher contention as well. If there are indexes that are heavily used, contention can be created as well, leading to this wait type. Reducing PAGELATCH_X wait: The following counters are useful to understand the status of the PAGELATCH: Average Latch Wait Time (ms): The wait time for latch requests that have to wait. Latch Waits/sec: This is the number of latch requests that could not be granted immediately. Total Latch Wait Time (ms): This is the total latch wait time for latch requests in the last second. If there is TempDB contention, I suggest that you read the blog post of Robert Davis right away. He has written an excellent blog post regarding how to find out TempDB contention. The same blog post explains the terms in the allocation of GAM, SGAM and PFS. If there was a TempDB contention, Paul Randal explains the optimal settings for the TempDB in his misconceptions series. Trace Flag 1118 can be useful but use it very carefully. I totally understand that this blog post is not as clear as my other blog posts. I suggest if this wait stats is on one of your higher wait type. Do leave a comment or send me an email and I will get back to you with my solution for your situation. May the looking at all other wait stats and types together become effective as this wait type can help suggest proper bottleneck in your system. Read all the post in the Wait Types and Queue series. Note: The information presented here is from my experience and there is no way that I claim it to be accurate. I suggest reading Book OnLine for further clarification. All the discussions of Wait Stats in this blog are generic and vary from system to system. It is recommended that you test this on a development server before implementing it to a production server. Reference: Pinal Dave (http://blog.SQLAuthority.com) Filed under: Pinal Dave, PostADay, SQL, SQL Authority, SQL Query, SQL Scripts, SQL Server, SQL Tips and Tricks, SQL Wait Stats, SQL Wait Types, T SQL, Technology

Read the article
SQL Authority News – Play by Play with Pinal Dave – A Birthday Gift

- by Pinal Dave

Today is my birthday. Personal Note When I was young, I was always looking forward to my birthday as on this day, I used to get gifts from everybody. Now when I am getting old on each of my birthday, I have almost same feeling but the direction is different. Now on each of my birthday, I feel like giving gifts to everybody. I have received lots of support, love and respect from everybody; and now I must return it back.Well, on this birthday, I have very unique gifts for everybody – my latest course on SQL Server. How I Tune Performance I often get questions where I am asked how do I work on a normal day. I am often asked that how do I work when I have performance tuning project is assigned to me. Lots of people have expressed their desire that they want me to explain and demonstrate my own method of solving performance problem when I am facing real world problem. It is a pretty difficult task as in the real world, nothing goes as planned and usually planned demonstrations have no place there. The real world, demands real solutions and in a timely fashion. If a consultant goes to industry and does not demonstrate his/her capabilities in very first few minutes, it does not matter how much fame he/she is, the door is shown to them eventually. It is true and in my early career, I have faced it quite commonly. I have learned the trick to be honest from the start and request absolutely transparent communication from the organization where I am to consult. Play by Play Play by Play is a very unique setup. It is not planned and it is a step by step course. It is like a reality show – a very real encounter to the problem and real problem solving approach. I had a great time doing this course. Geoffrey Grosenbach (VP of Pluralsight) sits down with me to see what a SQL Server Admin does in the real world. This Play-by-Play focuses on SQL Server performance tuning and I go over optimizing queries and fine-tuning the server. The table of content of this course is very simple. Introduction In the introduction I explained my basic strategies when I am approached by a customer for performance tuning. Basic Information Gathering In this module I explain how I do gather various information for performance tuning project. It is very crucial to demonstrate to customers for consultant his capability of solving problem. I attempt to resolve a small problem which gives a big positive impact on performance, consultant have to gather proper information from the start. I demonstrate in this module, how one can collect all the important performance tuning metrics. Removing Performance Bottleneck In this module, I build upon the previous module’s statistics collected. I analysis various performance tuning measures and immediately start implementing various tweaks on the performance, which will start improving the performance of my server. This is a very effective method and it gives immediate return of efforts. Index Optimization Indexes are considered as a silver bullet for performance tuning. However, it is not true always there are plenty of examples where indexes even performs worst after implemented. The key is to understand a few of the basic properties of the index and implement the right things at the right time. In this module, I describe in detail how to do index optimizations and what are right and wrong with Index. If you are a DBA or developer, and if your application is running slow – this is must attend module for you. I have some really interesting stories to tell as well. Optimize Query with Rewrite Every problem has more than one solution, in this module we will see another very famous, but hard to master skills for performance tuning – Query Rewrite. There are few do’s and don’ts for any query rewrites. I take a very simple example and demonstrate how query rewrite can improve the performance of the query at many folds. I also share some real world funny stories in this module. This course is hosted at Pluralsight. You will need a valid login for Pluralsight to watch Play by Play: Pinal Dave course. You can also sign up for FREE Trial of Pluralsight to watch this course. As today is my birthday – I will give 10 people (randomly) who will express their desire to learn this course, a free code. Please leave your comment and I will send you free code to watch this course for free. Reference: Pinal Dave (http://blog.sqlauthority.com)Filed under: PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, SQL Training, SQLAuthority News, T SQL, Video

Read the article
Trash Destination Adapter

The Trash Destination and this article came from early experiences of using SSIS and community feedback at the time. When developing a package it is very useful to have a destination adapter that does nothing but consume rows with no setup requirement. You often want run a package part way through development, or just add a path so you can set a Data Viewer. There are stock tasks that can be used, but with the Trash Destination all columns are treated as selected automatically (usage type of read-only), so the pipeline knows they are required. It is also obvious that this is for development or diagnostic purposes, and is clearly not a part of the functional design of the package. It is also ideal for just playing around and exploring concepts in SSIS, and is often used in conjunction with the Data Generator Source. Using these two components it is easy to setup a test of an expression in the Derived Column Transformation for example. The Data Generator Source provides some dummy data, and the Trash Destination allows you to anchor the output path and set a Data Viewer to examine the results. It can also be used when performance tuning packages. It is a consistent and known quantity that has no external influences, so it is ideal as a destination when breaking the data flow into sections to isolate a bottleneck. The adapter is really simple to use and requires no setup. Simply drop it onto the pipeline designer and use it to terminate your data flow path. Installation The component is provided as an MSI file which you can download and run to install it. This simply places the files on disk in the correct locations and also installs the assemblies in the Global Assembly Cache as per Microsoft’s recommendations. You may need to restart the SQL Server Integration Services service, as this caches information about what components are installed, as well as restarting any open instances of Business Intelligence Development Studio (BIDS) / Visual Studio that you may be using to build your SSIS packages. Finally, for 2005/2008, you will have to add the transformation to the Visual Studio toolbox manually. Right-click the toolbox, and select Choose Items.... Select the SSIS Data Flow Items tab, and then check the Trash Destination transformation in the Choose Toolbox Items window. This process has been described in detail in the related FAQ entry for How do I install a task or transform component? We recommend you follow best practice and apply the current Microsoft SQL Server Service pack to your SQL Server servers and workstations. Downloads The Trash Destination is available for SQL Server 2005, SQL Server 2008 (includes R2) and SQL Server 2012. Please choose the version to match your SQL Server version, or you can install multiple versions and use them side by side if you have more than one version of SQL Server installed. Trash Destination for SQL Server 2005 Trash Destination for SQL Server 2008 Trash Destination for SQL Server 2012 Version History SQL Server 2012 Version 3.0.0.34 - SQL Server 2012 release. Includes upgrade support for both 2005 and 2008 packages to 2012. (5 Jun 2012) SQL Server 2008 Version 2.0.0.33 - SQL Server 2008 release. Includes support for upgrade of 2005 packages. RTM compatible, previously February 2008 CTP. (4 Mar 2008) Version 2.0.0.31 - SQL Server 2008 November 2007 CTP. (14 Feb 2008) SQL Server 2005 Version 1.0.2.18 - SQL Server 2005 RTM Refresh. SP1 Compatibility Testing. (12 Jun 2006) Version 1.0.1.1 - SQL Server 2005 IDW 15 June CTP. Minor enhancements over v1.0.1.0. (11 Jun 2005) Version 1.0.1.0 - SQL Server 2005 IDW 14 April CTP. First Public Release. (30 May 2005) Troubleshooting Make sure you have downloaded the version that matches your version of SQL Server. We offer separate downloads for SQL Server 2005, SQL Server 2008 and SQL Server 2012. If you an error when you try and use the component along the lines of The component could not be added to the Data Flow task. Please verify that this component is properly installed. ... The data flow object "Konesans ..." is not installed correctly on this computer, this usually indicates that the internal cache of SSIS components needs to be updated. This is held by the SSIS service, so you need restart the the SQL Server Integration Services service. You can do this from the Services applet in Control Panel or Administrative Tools in Windows. You can also restart the computer if you prefer. You may also need to restart any current instances of Business Intelligence Development Studio (BIDS) / Visual Studio that you may be using to build your SSIS packages. The full error message is shown below for reference: TITLE: Microsoft Visual Studio ------------------------------ The component could not be added to the Data Flow task. Please verify that this component is properly installed. ------------------------------ ADDITIONAL INFORMATION: The data flow object "Konesans.Dts.Pipeline.TrashDestination.Trash, Konesans.Dts.Pipeline.TrashDestination, Version=1.0.1.0, Culture=neutral, PublicKeyToken=b8351fe7752642cc" is not installed correctly on this computer. (Microsoft.DataTransformationServices.Design) For 2005/2008, once installation is complete you need to manually add the task to the toolbox before you will see it and to be able add it to packages - How do I install a task or transform component? This is not necessary for SQL Server 2012 as the new SSIS toolbox automatically detects components. If you are still having issues then contact us, but please provide as much detail as possible about error, as well as which version of the the task you are using and details of the SSIS tools installed.

Read the article
Why lock-free data structures just aren't lock-free enough

- by Alex.Davies

Today's post will explore why the current ways to communicate between threads don't scale, and show you a possible way to build scalable parallel programming on top of shared memory. The problem with shared memory Soon, we will have dozens, hundreds and then millions of cores in our computers. It's inevitable, because individual cores just can't get much faster. At some point, that's going to mean that we have to rethink our architecture entirely, as millions of cores can't all access a shared memory space efficiently. But millions of cores are still a long way off, and in the meantime we'll see machines with dozens of cores, struggling with shared memory. Alex's tip: The best way for an application to make use of that increasing parallel power is to use a concurrency model like actors, that deals with synchronisation issues for you. Then, the maintainer of the actors framework can find the most efficient way to coordinate access to shared memory to allow your actors to pass messages to each other efficiently. At the moment, NAct uses the .NET thread pool and a few locks to marshal messages. It works well on dual and quad core machines, but it won't scale to more cores. Every time we use a lock, our core performs an atomic memory operation (eg. CAS) on a cell of memory representing the lock, so it's sure that no other core can possibly have that lock. This is very fast when the lock isn't contended, but we need to notify all the other cores, in case they held the cell of memory in a cache. As the number of cores increases, the total cost of a lock increases linearly. A lot of work has been done on "lock-free" data structures, which avoid locks by using atomic memory operations directly. These give fairly dramatic performance improvements, particularly on systems with a few (2 to 4) cores. The .NET 4 concurrent collections in System.Collections.Concurrent are mostly lock-free. However, lock-free data structures still don't scale indefinitely, because any use of an atomic memory operation still involves every core in the system. A sync-free data structure Some concurrent data structures are possible to write in a completely synchronization-free way, without using any atomic memory operations. One useful example is a single producer, single consumer (SPSC) queue. It's easy to write a sync-free fixed size SPSC queue using a circular buffer*. Slightly trickier is a queue that grows as needed. You can use a linked list to represent the queue, but if you leave the nodes to be garbage collected once you're done with them, the GC will need to involve all the cores in collecting the finished nodes. Instead, I've implemented a proof of concept inspired by this intel article which reuses the nodes by putting them in a second queue to send back to the producer. * In all these cases, you need to use memory barriers correctly, but these are local to a core, so don't have the same scalability problems as atomic memory operations. Performance tests I tried benchmarking my SPSC queue against the .NET ConcurrentQueue, and against a standard Queue protected by locks. In some ways, this isn't a fair comparison, because both of these support multiple producers and multiple consumers, but I'll come to that later. I started on my dual-core laptop, running a simple test that had one thread producing 64 bit integers, and another consuming them, to measure the pure overhead of the queue. So, nothing very interesting here. Both concurrent collections perform better than the lock-based one as expected, but there's not a lot to choose between the ConcurrentQueue and my SPSC queue. I was a little disappointed, but then, the .NET Framework team spent a lot longer optimising it than I did. So I dug out a more powerful machine that Red Gate's DBA tools team had been using for testing. It is a 6 core Intel i7 machine with hyperthreading, adding up to 12 logical cores. Now the results get more interesting. As I increased the number of producer-consumer pairs to 6 (to saturate all 12 logical cores), the locking approach was slow, and got even slower, as you'd expect. What I didn't expect to be so clear was the drop-off in performance of the lock-free ConcurrentQueue. I could see the machine only using about 20% of available CPU cycles when it should have been saturated. My interpretation is that as all the cores used atomic memory operations to safely access the queue, they ended up spending most of the time notifying each other about cache lines that need invalidating. The sync-free approach scaled perfectly, despite still working via shared memory, which after all, should still be a bottleneck. I can't quite believe that the results are so clear, so if you can think of any other effects that might cause them, please comment! Obviously, this benchmark isn't realistic because we're only measuring the overhead of the queue. Any real workload, even on a machine with 12 cores, would dwarf the overhead, and there'd be no point worrying about this effect. But would that be true on a machine with 100 cores? Still to be solved. The trouble is, you can't build many concurrent algorithms using only an SPSC queue to communicate. In particular, I can't see a way to build something as general purpose as actors on top of just SPSC queues. Fundamentally, an actor needs to be able to receive messages from multiple other actors, which seems to need an MPSC queue. I've been thinking about ways to build a sync-free MPSC queue out of multiple SPSC queues and some kind of sign-up mechanism. Hopefully I'll have something to tell you about soon, but leave a comment if you have any ideas.

Read the article
Changing the sequencing strategy for File/Ftp Adapter

- by [email protected]

The File/Ftp Adapter allows the user to configure the outbound write to use a sequence number. For example, if I choose address-data_%SEQ%.txt as the FileNamingConvention, then all my files would be generated as address-data_1.txt, address-data_2.txt,...and so on. But, where does this sequence number come from? The answer lies in the "control directory" for the particular adapter project(or scenario). In general, for every project that use the File or Ftp Adapter, a unique directory is created for book keeping purposes. And since this control directory is required to be unique, the adapter uses a digest to make sure that no two control directories are the same. For example, for my FlatStructure sample, the control information for my project would go under FMW_HOME/user_projects/domains/soainfra/fileftp/controlFiles/[DIGEST]/outbound where the value of DIGEST would differ from one project to another. If you look under this directory, you will see a file control_ob.properties and this is where the sequence number is maintained. Please note that the sequence number is maintained in binary form and you hence you might need a hex editor to view its content. You will also see another zero byte file, SEQ_nnn, but, ignore that for now. We'll get to it some other time. For now, please remember that this extra file is maintained as a backup. One of the challenges faced by the adapter runtime is to guard all writes to the control files so no two threads inadverently try to update them at the same time. And, it does so with the help of a "Mutex". For now, please remember that the mutex comes in different flavors: In-memory DB-based Coherence-based User-defined Again, we will talk about these mutexes some other time. Please note that there might be scenarios, particularly under heavy load, where the mutex might become a bottleneck. The adapter, however, allows you to change the configuration so that the adapter sequence value comes from a database sequence or a stored procedure and in such situation, the mutex is acually by-passed and thereby resulting in better throughputs. In later releases, the behavior of the adapter would be defaulted to use a db-sequence. The simplest way to achieve this is by switching your JNDI for the outbound JCA file to use "eis/HAFileAdapter" as shown But, what does this do? Internally, the adapter runtime creates a sequence on the oracle database. For example, if you do a "select * from user_sequences" in your soa-infra schema, you will see a new sequence being created with name as SEQ_<GUID>__ where the GUID will differ from one project to another. However, if you want to use your own sequence, then it would require you to add a new property to your JCA file called SequenceName as shown below. Please note that you will need to create this sequence on your soainfra schema beforehand. But, what if we use DB2 or MSSQL Server as the dehydration support? DB2 supports sequences natively but MSSQL Server does not. So, the adapter runtime uses a natively generated sequence for DB2, but, for MSSQL server, the adapter relies on a stored procedure that ships with the product. If you wish to achieve the same result for SOA Suite running DB2 as the dehydration store, simply change your connection factory JNDI name in the JCA file to eis/HAFileAdapterDB2 and for MSSQL, please use eis/HAFileAdapterMSSQL. And, if you wish to use a stored procedure other than the one that ships with the product, you will need to rely on binding properties to override the adapter behavior. Particularly, you will need to instruct the adapter that you wish to use a stored procedure as shown: Please note that if you're using the File/Ftp Adapter in Append mode, then the adapter runtime degrades the mutex to use pessimistic locks as we don't want writers from different nodes to append to the same file at the same time.

Read the article
Agile Development Requires Agile Support

- by Matt Watson

Agile developmentAgile development has become the standard methodology for application development. The days of long term planning with giant Gantt waterfall charts and detailed requirements is fading away. For years the product planning process frustrated product owners and businesses because no matter the plan, nothing ever went to plan. Agile development throws the detailed planning out the window and instead focuses on giving developers some basic requirements and pointing them in the right direction. Constant collaboration via quick iterations with the end users, product owners, and the development team helps ensure the project is done correctly. The various agile development methodologies have helped greatly with creating products faster, but not without causing new problems. Complicated application deployments now occur weekly or monthly. Most of the products are web-based and deployed as a software service model. System performance and availability of these apps becomes mission critical. This is all much different from the old process of mailing new releases of client-server apps on CD once per quarter or year.The steady stream of new products and product enhancements puts a lot of pressure on IT operations to keep up with the software deployments and adding infrastructure capacity. The problem is most operations teams still move slowly thanks to change orders, documentation, procedures, testing and other processes. Operations can slow the process down and push back on the development team in some organizations. The DevOps movement is trying to solve some of these problems by integrating the development and operations teams more together. Rapid change introduces new problemsThe rapid product change ultimately creates some application problems along the way. Higher rates of change increase the likelihood of new application defects. Delivering applications as a software service also means that scalability of applications is critical. Development teams struggle to keep up with application defects and scalability concerns in their applications. Fixing application problems is a never ending job for agile development teams. Fixing problems before your customers do and fixing them quickly is critical. Most companies really struggle with this due to the divide between the development and operations groups. Fixing application problems typically requires querying databases, looking at log files, reviewing config files, reviewing error logs and other similar tasks. It becomes difficult to work on new features when your lead developers are working on defects from the last product version. Developers need more visibilityThe problem is most developers are not given access to see server and application information in the production environments. The operations team doesn’t trust giving all the developers the keys to the kingdom to log in to production and poke around the servers. The challenge is either give them no access, or potentially too much access. Those with access can still waste time figuring out the location of the application and how to connect to it over VPN. In addition, reproducing problems in test environments takes too much time and isn't always possible. System administrators spend a lot of time helping developers track down server information. Most companies give key developers access to all of the production resources so they can help resolve application defects. The problem is only those key people have access and they become a bottleneck. They end up spending 25-50% of their time on a daily basis trying to solve application issues because they are the only ones with access. These key employees’ time is best spent on strategic new projects, not addressing application defects. This job should fall to entry level developers, provided they have access to all the information they need to troubleshoot the problems.The solution to agile application support is giving all the developers limited access to the production environment and all the server information they need to see. Some companies create their own solutions internally to collect log files, centralize errors or other things to address the problem. Some developers even have access to server monitoring or other tools. But they key is giving them access to everything they need so they can see the full picture and giving access to the whole team. Giving access to everyone scales up the application support team and creates collaboration around providing improved application support.Stackify enables agile application supportStackify has created a solution that can give all developers a secure and read only view of the entire production server environment without console or remote desktop access.They provide a web application that provides real time visibility to the important information that developers need to see. An application centric view enables them to see all of their apps across multiple datacenters and environments. They don’t need to know where the application is deployed, just the name of the application to find it and dig in to see more. All your developers can see server health, application health, log files, config files, windows event viewer, deployment history, application notes, and much more. They can receive email and text alerts when problems arise and even safely query your production databases.Stackify enables companies that do agile development to scale up their application support team by getting more team members involved. The lead developers can spend more time on new projects. Application issues can be fixed quicker than ever. Operations can spend less time helping developers collect server information. Agile application support starts with Stackify. Visit Stackify.com to learn more.

Read the article
Part 1 - Load Testing In The Cloud

- by Tarun Arora

Azure is fascinating, but even more fascinating is the marriage of Azure and TFS! Introduction Recently a client I worked for had 2 major business critical applications being delivered, with very little time budgeted for Performance testing, we immediately hit a bottleneck when the performance testing phase started, the in house infrastructure team could not support the hardware requirements in the short notice. It was suggested that the performance testing be performed on one of the QA environments which was a fraction of the production environment. This didn’t seem right, the team decided to turn to the cloud. The team took advantage of the elasticity offered by Azure, starting with a single test agent which was provisioned and ready for use with in 30 minutes the team scaled up to 17 test agents to perform a very comprehensive performance testing cycle. Issues were identified and resolved but the highlight was that the cost of running the ‘test rig’ proved to be less than if hosted on premise by the infrastructure team. Thank you for taking the time out to read this blog post, in the series of posts, I’ll try and cover the start to end of everything you need to know to use Azure to build your Test Rig in the cloud. But Why Azure? I have my own Data Centre… If the environment is provisioned in your own datacentre, - No matter what level of service agreement you may have with your infrastructure team there will be down time when the environment is patched - How fast can you scale up or down the environments (keeping the enterprise processes in mind) Administration, Cost, Flexibility and Scalability are the areas you would want to think around when taking the decision between your own Data Centre and Azure! How is Microsoft's Public Cloud Offering different from Amazon’s Public Cloud Offering? Microsoft's offering of the Cloud is a hybrid of Platform as a Service (PaaS) and Infrastructure as a Service (IaaS) which distinguishes Microsoft's offering from other providers such as Amazon (Amazon only offers IaaS). PaaS – Platform as a Service IaaS – Infrastructure as a Service Fills the needs of those who want to build and run custom applications as services. Similar to traditional hosting, where a business will use the hosted environment as a logical extension of the on-premises datacentre. A service provider offers a pre-configured, virtualized application server environment to which applications can be deployed by the development staff. Since the service providers manage the hardware (patching, upgrades and so forth), as well as application server uptime, the involvement of IT pros is minimized. On-demand scalability combined with hardware and application server management relieves developers from infrastructure concerns and allows them to focus on building applications. The servers (physical and virtual) are rented on an as-needed basis, and the IT professionals who manage the infrastructure have full control of the software configuration. This kind of flexibility increases the complexity of the IT environment, as customer IT professionals need to maintain the servers as though they are on-premises. The maintenance activities may include patching and upgrades of the OS and the application server, load balancing, failover clustering of database servers, backup and restoration, and any other activities that mitigate the risks of hardware and software failures. The biggest advantage with PaaS is that you do not have to worry about maintaining the environment, you can focus all your time in solving the business problems with your solution rather than worrying about maintaining the environment. If you decide to use a VM Role on Azure, you are asking for IaaS, more on this later. A nice blog post here on the difference between Saas, PaaS and IaaS. Now that we are convinced why we should be turning to the cloud and why in specific Azure, let’s discuss about the Test Rig. The Load Test Rig – Topology Now the moment of truth, Of course a big part of getting value from cloud computing is identifying the most adequate workloads to take to the cloud, so I’ve decided to try to make a Load Testing rig where the Agents are running on Windows Azure. I’ll talk you through the above Topology, - User: User kick starts the load test run from the developer workstation on premise. This passes the request to the Test Controller. - Test Controller: The Test Controller is on premise connected to the same domain as the developer workstation. As soon as the Test Controller receives the request it makes use of the Windows Azure Connect service to orchestrate the test responsibilities to all the Test Agents. The Windows Azure Connect endpoint software must be active on all Azure instances and on the Controller machine as well. This allows IP connectivity between them and, given that the firewall is properly configured, allows the Controller to send work loads to the agents. In parallel, the Controller will collect the performance data from the agents, using the traditional WMI mechanisms. - Test Agents: The Test Agents are on the Windows Azure Public Cloud, as soon as the test controller issues instructions to the test agents, the test agents start executing the load tests. The HTTP requests are issued against the web server on premise, the results are captured by the test agents. And finally the results are passed over to the controller. - Servers: The Web Server and DB Server are hosted on premise in the datacentre, this is usually the case with business critical applications, you probably want to manage them your self. Recap and What’s next? So, in the introduction in the series of blog posts on Load Testing in the cloud I highlighted why creating a test rig in the cloud is a good idea, what advantages does Windows Azure offer and the Test Rig topology that I will be using. I would also like to mention that i stumbled upon this [Video] on Azure in a nutshell, great watch if you are new to Windows Azure. In the next post I intend to start setting up the Load Test Environment and discuss pricing with respect to test agent machine types that will be used in the test rig. Hope you enjoyed this post, If you have any recommendations on things that I should consider or any questions or feedback, feel free to add to this blog post. Remember to subscribe to http://feeds.feedburner.com/TarunArora. See you in Part II. Share this post : CodeProject

Read the article
CPU Usage in Very Large Coherence Clusters

- by jpurdy

When sizing Coherence installations, one of the complicating factors is that these installations (by their very nature) tend to be application-specific, with some being large, memory-intensive caches, with others acting as I/O-intensive transaction-processing platforms, and still others performing CPU-intensive calculations across the data grid. Regardless of the primary resource requirements, Coherence sizing calculations are inherently empirical, in that there are so many permutations that a simple spreadsheet approach to sizing is rarely optimal (though it can provide a good starting estimate). So we typically recommend measuring actual resource usage (primarily CPU cycles, network bandwidth and memory) at a given load, and then extrapolating from those measurements. Of course there may be multiple types of load, and these may have varying degrees of correlation -- for example, an increased request rate may drive up the number of objects "pinned" in memory at any point, but the increase may be less than linear if those objects are naturally shared by concurrent requests. But for most reasonably-designed applications, a linear resource model will be reasonably accurate for most levels of scale. However, at extreme scale, sizing becomes a bit more complicated as certain cluster management operations -- while very infrequent -- become increasingly critical. This is because certain operations do not naturally tend to scale out. In a small cluster, sizing is primarily driven by the request rate, required cache size, or other application-driven metrics. In larger clusters (e.g. those with hundreds of cluster members), certain infrastructure tasks become intensive, in particular those related to members joining and leaving the cluster, such as introducing new cluster members to the rest of the cluster, or publishing the location of partitions during rebalancing. These tasks have a strong tendency to require all updates to be routed via a single member for the sake of cluster stability and data integrity. Fortunately that member is dynamically assigned in Coherence, so it is not a single point of failure, but it may still become a single point of bottleneck (until the cluster finishes its reconfiguration, at which point this member will have a similar load to the rest of the members). The most common cause of scaling issues in large clusters is disabling multicast (by configuring well-known addresses, aka WKA). This obviously impacts network usage, but it also has a large impact on CPU usage, primarily since the senior member must directly communicate certain messages with every other cluster member, and this communication requires significant CPU time. In particular, the need to notify the rest of the cluster about membership changes and corresponding partition reassignments adds stress to the senior member. Given that portions of the network stack may tend to be single-threaded (both in Coherence and the underlying OS), this may be even more problematic on servers with poor single-threaded performance. As a result of this, some extremely large clusters may be configured with a smaller number of partitions than ideal. This results in the size of each partition being increased. When a cache server fails, the other servers will use their fractional backups to recover the state of that server (and take over responsibility for their backed-up portion of that state). The finest granularity of this recovery is a single partition, and the single service thread can not accept new requests during this recovery. Ordinarily, recovery is practically instantaneous (it is roughly equivalent to the time required to iterate over a set of backup backing map entries and move them to the primary backing map in the same JVM). But certain factors can increase this duration drastically (to several seconds): large partitions, sufficiently slow single-threaded CPU performance, many or expensive indexes to rebuild, etc. The solution of course is to mitigate each of those factors but in many cases this may be challenging. Larger clusters also lead to the temptation to place more load on the available hardware resources, spreading CPU resources thin. As an example, while we've long been aware of how garbage collection can cause significant pauses, it usually isn't viewed as a major consumer of CPU (in terms of overall system throughput). Typically, the use of a concurrent collector allows greater responsiveness by minimizing pause times, at the cost of reducing system throughput. However, at a recent engagement, we were forced to turn off the concurrent collector and use a traditional parallel "stop the world" collector to reduce CPU usage to an acceptable level. In summary, there are some less obvious factors that may result in excessive CPU consumption in a larger cluster, so it is even more critical to test at full scale, even though allocating sufficient hardware may often be much more difficult for these large clusters.

Read the article
A Patent for Workload Management Based on Service Level Objectives

- by jsavit

I'm very pleased to announce that after a tiny :-) wait of about 5 years, my patent application for a workload manager was finally approved. Background Many operating systems have a resource manager which lets you control machine resources. For example, Solaris provides controls for CPU with several options: shares for proportional CPU allocation. If you have twice as many shares as me, and we are competing for CPU, you'll get about twice as many CPU cycles), dedicated CPU allocation in which a number of CPUs are exclusively dedicated to an application's use. You can say that a zone or project "owns" 8 CPUs on a 32 CPU machine, for example. And, capped CPU in which you specify the upper bound, or cap, of how much CPU an application gets. For example, you can throttle an application to 0.125 of a CPU. (This isn't meant to be an exhaustive list of Solaris RM controls.) Workload management Useful as that is (and tragic that some other operating systems have little resource management and isolation, and frighten people into running only 1 app per OS instance - and wastefully size every server for the peak workload it might experience) that's not really workload management. With resource management one controls the resources, and hope that's enough to meet application service objectives. In fact, we hold resource distribution constant, see if that was good enough, and adjust resource distribution if that didn't meet service level objectives. Here's an example of what happens today: Let's try 30% dedicated CPU. Not enough? Let's try 80% Oh, that's too much, and we're achieving much better response time than the objective, but other workloads are starving. Let's back that off and try again. It's not the process I object to - it's that we to often do this manually. Worse, we sometimes identify and adjust the wrong resource and fiddle with that to no useful result. Back in my days as a customer managing large systems, one of my users would call me up to beg for a "CPU boost": Me: "it won't make any difference - there's plenty of spare CPU to be had, and your application is completely I/O bound." User: "Please do it anyway." Me: "oh, all right, but it won't do you any good." (I did, because he was a friend, but it didn't help.) Prior art There are some operating environments that take a stab about workload management (rather than resource management) but I find them lacking. I know of one that uses synthetic "service units" composed of the sum of CPU, I/O and memory allocations multiplied by weighting factors. A workload is set to make a target rate of service units consumed per second. But this seems to be missing a key point: what is the relationship between artificial 'service units' and actually meeting a throughput or response time objective? What if I get plenty of one of the components (so am getting enough service units), but not enough of the resource whose needed to remove the bottleneck? Actual workload management That's not really the answer either. What is needed is to specify a workload's service levels in terms of externally visible metrics that are meaningful to a business, such as response times or transactions per second, and have the workload manager figure out which resources are not being adequately provided, and then adjust it as needed. If an application is not meeting its service level objectives and the reason is that it's not getting enough CPU cycles, adjust its CPU resource accordingly. If the reason is that the application isn't getting enough RAM to keep its working set in memory, then adjust its RAM assignment appropriately so it stops swapping. Simple idea, but that's a task we keep dumping on system administrators. In other words - don't hold the number of CPU shares constant and watch the achievement of service level vary. Instead, hold the service level constant, and dynamically adjust the number of CPU shares (or amount of other resources like RAM or I/O bandwidth) in order to meet the objective. Instrumenting non-instrumented applications There's one little problem here: how do I measure application performance in a way relating to a service level. I don't want to do it based on internal resources like number of CPU seconds it received per minute - We need to make resource decisions based on externally visible and meaningful measures of performance, not synthetic items or internal resource counters. If I have a way of marking the beginning and end of a transaction, I can then measure whether or not the application is meeting an objective based on it. If I can observe the delay factors for an application, I can see which resource shortages are slowing an application enough to keep it from meeting its objectives. I can then adjust resource allocations to relieve those shortages. Fortunately, Solaris provides facilities for both marking application progress and determining what factors cause application latency. The Solaris DTrace facility let's me introspect on application behavior: in particular I can see events like "receive a web hit" and "respond to that web hit" so I can get transaction rate and response time. DTrace (and tools like prstat) let me see where latency is being added to an application, so I know which resource to adjust. Summary After a delay of a mere few years, I am the proud creator of a patent (advice to anyone interested in going through the process: don't hold your breath!). The fundamental idea is fairly simple: instead of holding resource constant and suffering variable levels of success meeting service level objectives, properly characterise the service level objective in meaningful terms, instrument the application to see if it's meeting the objective, and then have a workload manager change resource allocations to remove delays preventing service level attainment. I've done it by hand for a long time - I think that's what a computer should do for me.

Read the article
OBIA on Teradata - Part 1 Loader and Monitoring

- by Mohan Ramanuja

Normal 0 false false false EN-US X-NONE X-NONE /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin:0in; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:11.0pt; font-family:"Calibri","sans-serif"; mso-ascii-font-family:Calibri; mso-ascii-theme-font:minor-latin; mso-fareast-font-family:"Times New Roman"; mso-fareast-theme-font:minor-fareast; mso-hansi-font-family:Calibri; mso-hansi-theme-font:minor-latin; mso-bidi-font-family:"Times New Roman"; mso-bidi-theme-font:minor-bidi;} The out-of-the-box (OOB) OBIA Informatica mappings come with TPump loader. TPUMP FASTLOAD TPump does not lock the table. FastLoad applies exclusive lock on the table. The table that TPump is loading can have data. The table that FastLoad is loading needs to be empty. Normal 0 false false false EN-US X-NONE X-NONE /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin:0in; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:11.0pt; font-family:"Calibri","sans-serif"; mso-ascii-font-family:Calibri; mso-ascii-theme-font:minor-latin; mso-fareast-font-family:"Times New Roman"; mso-fareast-theme-font:minor-fareast; mso-hansi-font-family:Calibri; mso-hansi-theme-font:minor-latin; mso-bidi-font-family:"Times New Roman"; mso-bidi-theme-font:minor-bidi;} TPump is not efficient with lookups. FastLoad is more efficient in the absence of lookups. Normal 0 false false false EN-US X-NONE X-NONE /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin:0in; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:11.0pt; font-family:"Calibri","sans-serif"; mso-ascii-font-family:Calibri; mso-ascii-theme-font:minor-latin; mso-fareast-font-family:"Times New Roman"; mso-fareast-theme-font:minor-fareast; mso-hansi-font-family:Calibri; mso-hansi-theme-font:minor-latin; mso-bidi-font-family:"Times New Roman"; mso-bidi-theme-font:minor-bidi;} The out-of the box Informatica mappings come with TPump loader. There is chance for bottleneck in writer thread The out-of the box tables in Teradata supplied with OBAW features all Dimension and Fact tables using ROW_WID as the key for primary index. Also, all staging tables use integration_id as the key for primary index. This reduces skewing of data across Teradata AMPs.You can use an SQL statement similar to the following to determine if data for a given table is distributed evenly across all AMP vprocs. The SQL statement displays the AMP with the most used through the AMP with the least-used space, investigating data distribution in the Message table in database RST.SELECT vproc,CurrentPermFROM DBC.TableSizeWHERE Databasename = ‘PRJ_CRM_STGC’AND Tablename = ‘w_party_per_d’ORDER BY 2 descIf you suspect distribution problems (skewing) among AMPS, the following is a sample of what you might enter for a three-column PI:SELECT HASHAMP (HASHBUCKET (HASHROW (col_x, col_y, col_z))), count (*)FROM hash15GROUP BY 1ORDER BY 2 desc; ETL Error Monitoring Error Table – These are tables that start with ET. Location and name can be specified in Informatica session as well as the loader connection.Loader Log – Loader log is available in the Informatica server under the session log folder. These give feedback on the loader parameters such as Packing Factor to use. These however need to be monitored in the production environment. The recommendations made in one environment may not be used in another environment.Log Table – These are tables that start with TL. These are sparse on information.Bad File – This is the Informatica file generated in case there is data quality issues

Read the article
Right-Time Retail Part 2

- by David Dorf

This is part two of the three-part series. Normal 0 false false false EN-US X-NONE X-NONE /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin-top:0in; mso-para-margin-right:0in; mso-para-margin-bottom:10.0pt; mso-para-margin-left:0in; line-height:115%; mso-pagination:widow-orphan; font-size:11.0pt; font-family:"Calibri","sans-serif"; mso-ascii-font-family:Calibri; mso-ascii-theme-font:minor-latin; mso-fareast-font-family:"Times New Roman"; mso-fareast-theme-font:minor-fareast; mso-hansi-font-family:Calibri; mso-hansi-theme-font:minor-latin; mso-bidi-font-family:"Times New Roman"; mso-bidi-theme-font:minor-bidi;} Right-Time Integration Of course these real-time enabling technologies are only as good as the systems that utilize them, and it only takes one bottleneck to slow everyone else down. What good is an immediate stock-out notification if the supply chain can’t react until tomorrow? Since being formed in 2006, Oracle Retail has been not only adding more integrations between systems, but also modernizing integrations for appropriate speed. Notice I tossed in the word “appropriate.” Not everything needs to be real-time – again, we’re talking about Right-Time Retail. The speed of data capture, analysis, and execution must be synchronized or you’re wasting effort. Unfortunately, there isn’t an enterprise-wide dial that you can crank-up for your estate. You’ll need to improve things piecemeal, with people and processes as limiting factors while choosing the appropriate types of integrations. There are three integration styles we see in the retail industry. First is batch. I know, the word “batch” just sounds slow, but this pattern is less about velocity and more about volume. When there are large amounts of data to be moved, you’ll want to use batch processes. Our technology of choice here is Oracle Data Integrator (ODI), which provides a fast version of Extract-Transform-Load (ETL). Instead of the three-step process, the load and transform steps are combined to save time. ODI is a key technology for moving data into Retail Analytics where we can apply science. Performing analytics on each sale as it occurs doesn’t make any sense, so we batch up a statistically significant amount and submit all at once. The second style is fire-and-forget. For some types of data, we want the data to arrive ASAP but immediacy is not necessary. Speed is less important than guaranteed delivery, so we use message-oriented middleware available in both Weblogic and the Oracle database. For example, Point-of-Service transactions are queued for delivery to Central Office at corporate. If the network is offline, those transactions remain in the queue and will be delivered when the network returns. Transactions cannot be lost and they must be delivered in order. (Ever tried processing a return before the sale?) To enhance the standard queues, we offer the Retail Integration Bus (RIB) to help the management and monitoring of fire-and-forget messaging in the enterprise. The third style is request-response and is most commonly implemented as Web services. This is a synchronous message where the sender waits for a response. In this situation, the volume of data is small, guaranteed delivery is not necessary, but speed is very important. Examples include the website checking inventory, a price lookup, or processing a credit card authorization. The Oracle Service Bus (OSB) typically handles the routing of such messages, and we’ve enhanced its abilities with the Retail Service Backbone (RSB). To better understand these integration patterns and where they apply within the retail enterprise, we’re providing the Retail Reference Library (RRL) at no charge to Oracle Retail customers. The library is composed of a large number of industry business processes, including those necessary to support Commerce Anywhere, as well as detailed architectural diagrams. These diagrams allow implementers to understand the systems involved in integrations and the specific data payloads. Furthermore, with our upcoming release we’ll be providing a new tool called the Retail Integration Console (RIC) that allows IT to monitor and manage integrations from a single point. Using RIC, retailers can quickly discern where integration activity is occurring, volume statistics, average response times, and errors. The dashboards provide the ability to dive down into the architecture documentation to gather information all the way down to the specific payload. Retailers that want real-time integrations will also need real-time monitoring of those integrations to ensure service-level agreements are maintained. Part 3 looks at marketing.

Read the article
Implementing a Custom Coherence PartitionAssignmentStrategy

- by jpurdy

A recent A-Team engagement required the development of a custom PartitionAssignmentStrategy (PAS). By way of background, a PAS is an implementation of a Java interface that controls how a Coherence partitioned cache service assigns partitions (primary and backup copies) across the available set of storage-enabled members. While seemingly straightforward, this is actually a very difficult problem to solve. Traditionally, Coherence used a distributed algorithm spread across the cache servers (and as of Coherence 3.7, this is still the default implementation). With the introduction of the PAS interface, the model of operation was changed so that the logic would run solely in the cache service senior member. Obviously, this makes the development of a custom PAS vastly less complex, and in practice does not introduce a significant single point of failure/bottleneck. Note that Coherence ships with a default PAS implementation but it is not used by default. Further, custom PAS implementations are uncommon (this engagement was the first custom implementation that we know of). The particular implementation mentioned above also faced challenges related to managing multiple backup copies but that won't be discussed here. There were a few challenges that arose during design and implementation: Naive algorithms had an unreasonable upper bound of computational cost. There was significant complexity associated with configurations where the member count varied significantly between physical machines. Most of the complexity of a PAS is related to rebalancing, not initial assignment (which is usually fairly simple). A custom PAS may need to solve several problems simultaneously, such as: Ensuring that each member has a similar number of primary and backup partitions (e.g. each member has the same number of primary and backup partitions) Ensuring that each member carries similar responsibility (e.g. the most heavily loaded member has no more than one partition more than the least loaded). Ensuring that each partition is on the same member as a corresponding local resource (e.g. for applications that use partitioning across message queues, to ensure that each partition is collocated with its corresponding message queue). Ensuring that a given member holds no more than a given number of partitions (e.g. no member has more than 10 partitions) Ensuring that backups are placed far enough away from the primaries (e.g. on a different physical machine or a different blade enclosure) Achieving the above goals while ensuring that partition movement is minimized. These objectives can be even more complicated when the topology of the cluster is irregular. For example, if multiple cluster members may exist on each physical machine, then clearly the possibility exists that at certain points (e.g. following a member failure), the number of members on each machine may vary, in certain cases significantly so. Consider the case where there are three physical machines, with 3, 3 and 9 members each (respectively). This introduces complexity since the backups for the 9 members on the the largest machine must be spread across the other 6 members (to ensure placement on different physical machines), preventing an even distribution. For any given problem like this, there are usually reasonable compromises available, but the key point is that objectives may conflict under extreme (but not at all unlikely) circumstances. The most obvious general purpose partition assignment algorithm (possibly the only general purpose one) is to define a scoring function for a given mapping of partitions to members, and then apply that function to each possible permutation, selecting the most optimal permutation. This would result in N! (factorial) evaluations of the scoring function. This is clearly impractical for all but the smallest values of N (e.g. a partition count in the single digits). It's difficult to prove that more efficient general purpose algorithms don't exist, but the key take away from this is that algorithms will tend to either have exorbitant worst case performance or may fail to find optimal solutions (or both) -- it is very important to be able to show that worst case performance is acceptable. This quickly leads to the conclusion that the problem must be further constrained, perhaps by limiting functionality or by using domain-specific optimizations. Unfortunately, it can be very difficult to design these more focused algorithms. In the specific case mentioned, we constrained the solution space to very small clusters (in terms of machine count) with small partition counts and supported exactly two backup copies, and accepted the fact that partition movement could potentially be significant (preferring to solve that issue through brute force). We then used the out-of-the-box PAS implementation as a fallback, delegating to it for configurations that were not supported by our algorithm. Our experience was that the PAS interface is quite usable, but there are intrinsic challenges to designing PAS implementations that should be very carefully evaluated before committing to that approach.

Read the article
Using XA Transactions in Coherence-based Applications

- by jpurdy

While the costs of XA transactions are well known (e.g. increased data contention, higher latency, significant disk I/O for logging, availability challenges, etc.), in many cases they are the most attractive option for coordinating logical transactions across multiple resources. There are a few common approaches when integrating Coherence into applications via the use of an application server's transaction manager: Use of Coherence as a read-only cache, applying transactions to the underlying database (or any system of record) instead of the cache. Use of TransactionMap interface via the included resource adapter. Use of the new ACID transaction framework, introduced in Coherence 3.6. Each of these may have significant drawbacks for certain workloads. Using Coherence as a read-only cache is the simplest option. In this approach, the application is responsible for managing both the database and the cache (either within the business logic or via application server hooks). This approach also tends to provide limited benefit for many workloads, particularly those workloads that either have queries (given the complexity of maintaining a fully cached data set in Coherence) or are not read-heavy (where the cost of managing the cache may outweigh the benefits of reading from it). All updates are made synchronously to the database, leaving it as both a source of latency as well as a potential bottleneck. This approach also prevents addressing "hot data" problems (when certain objects are updated by many concurrent transactions) since most database servers offer no facilities for explicitly controlling concurrent updates. Finally, this option tends to be a better fit for key-based access (rather than filter-based access such as queries) since this makes it easier to aggressively invalidate cache entries without worrying about when they will be reloaded. The advantage of this approach is that it allows strong data consistency as long as optimistic concurrency control is used to ensure that database updates are applied correctly regardless of whether the cache contains stale (or even dirty) data. Another benefit of this approach is that it avoids the limitations of Coherence's write-through caching implementation. TransactionMap is generally used when Coherence acts as system of record. TransactionMap is not generally compatible with write-through caching, so it will usually be either used to manage a standalone cache or when the cache is backed by a database via write-behind caching. TransactionMap has some restrictions that may limit its utility, the most significant being: The lock-based concurrency model is relatively inefficient and may introduce significant latency and contention. As an example, in a typical configuration, a transaction that updates 20 cache entries will require roughly 40ms just for lock management (assuming all locks are granted immediately, and excluding validation and writing which will require a similar amount of time). This may be partially mitigated by denormalizing (e.g. combining a parent object and its set of child objects into a single cache entry), at the cost of increasing false contention (e.g. transactions will conflict even when updating different child objects). If the client (application server JVM) fails during the commit phase, locks will be released immediately, and the transaction may be partially committed. In practice, this is usually not as bad as it may sound since the commit phase is usually very short (all locks having been previously acquired). Note that this vulnerability does not exist when a single NamedCache is used and all updates are confined to a single partition (generally implying the use of partition affinity). The unconventional TransactionMap API is cumbersome but manageable. Only a few methods are transactional, primarily get(), put() and remove(). The ACID transactions framework (accessed via the Connection class) provides atomicity guarantees by implementing the NamedCache interface, maintaining its own cache data and transaction logs inside a set of private partitioned caches. This feature may be used as either a local transactional resource or as logging XA resource. However, a lack of database integration precludes the use of this functionality for most applications. A side effect of this is that this feature has not seen significant adoption, meaning that any use of this is subject to the usual headaches associated with being an early adopter (greater chance of bugs and greater risk of hitting an unoptimized code path). As a result, for the moment, we generally recommend against using this feature. In summary, it is possible to use Coherence in XA-oriented applications, and several customers are doing this successfully, but it is not a core usage model for the product, so care should be taken before committing to this path. For most applications, the most robust solution is normally to use Coherence as a read-only cache of the underlying data resources, even if this prevents taking advantage of certain product features.

Read the article
WPF Animation FPS vs. CPU usage - Am I expecting too much?

- by Cory Charlton

Working on a screen saver for my wife, http://cchearts.codeplex.com/, and while I've been able to improve FPS on lower end machines (switch from Path to StreamGeometry, use DrawingVisual instead of UserControl, etc) the CPU usage still seems very high. Here's some numbers I ran from a few 5 minute sampling periods: ~60FPS 35% average CPU on Core 2 Duo T7500 @ 2.2GHz, 3GB ram, NVIDIA Quadro NVS 140M (128MB), Vista [My dev laptop] ~40FPS 50% average CPU on Pentium D @ 3.4GHz, 1.5GB ram, Standard VGA Graphics Adapter (unknown), 2003 Server [A crappy desktop] I can understand the lower frame rate and higher CPU usage on the crappy desktop but it still seems pretty high and 35% on my dev laptop seems high as well. I'd really like to analyze the application to get more details but I'm having issues there as well so I'm wondering if I'm doing something wrong (never profiled WPF before). WPF Performance Suite: Process Launch Error Unable to attach to process: CCHearts.exe Do you want to kill it? This error message occurs when I click cancel after attempting launch. If I don't click cancel it sits there idle, I guess waiting to attach. Performance Explorer: Could not launch C:\Projects2\CC.Hearts\CC.Hearts\bin\Debug (USEVISUAL)\CCHearts.exe. Previous attempt to profile the application finished unsuccessfully. Please restart the application. Output Window from Performance: Profiling started. Profiling process ID 5360 (CCHearts). Process ID 5360 has exited. Data written to C:\Projects2\CC.Hearts\CCHearts100608.vsp. Profiling finished. PRF0025: No data was collected. Profiling complete. So I'm stuck wanting to improve performance but have no concrete way to determine where the bottleneck is. Have been relatively successful throwing darts at this point but I'm beyond that now :) PS: Screensaver is hosted at CodePlex if you want to look at the source and missed the link above. Edit: My RenderOptions darts... // NOTE: Grasping at straws here ;-) RenderOptions.SetBitmapScalingMode(newHeart, BitmapScalingMode.LowQuality); RenderOptions.SetCachingHint(newHeart, CachingHint.Cache); RenderOptions.SetEdgeMode(newHeart, EdgeMode.Aliased); I threw those a little while back and didn't see much difference (not sure if the bitmap scaling even comes into play). Really wish I could get profiling working to know where I should try to optimize. For now I assume there is some overhead in creating a new HeartVisual and the DrawingVisual contained inside. Maybe if I reset and reused the hearts (tossed them in a queue once they completed or something) I'd see an improvement. Shrug Throwing darts while blindfolder is always fun.

Read the article
Why is drawing to OnPaint graphics faster than image graphics?

- by Tesserex

I'm looking for a way to speed up the drawing of my game engine, which is currently the significant bottleneck, and is causing slowdowns. I'm on the verge of converting it over to XNA, but I just noticed something. Say I have a small image that I've loaded. Image img = Image.FromFile("mypict.png"); We have a picturebox on the screen we want to draw on. So we have a handler. pictureBox1.Paint += new PaintEventHandler(pictureBox1_Paint); I want our loaded image to be tiled on the picturebox (this is for a game, after all). Why on earth is this code: void pictureBox1_Paint(object sender, PaintEventArgs e) { for (int y = 0; y < 16; y++) for (int x = 0; x < 16; x++) e.Graphics.DrawImage(image, x * 16, y * 16, 16, 16); } over 25 TIMES FASTER than this code: Image buff = new Bitmap(256, 256, PixelFormat.Format32bppPArgb); // actually a form member void pictureBox1_Paint(object sender, PaintEventArgs e) { using (Graphics g = Graphics.FromImage(buff)) { for (int y = 0; y < 16; y++) for (int x = 0; x < 16; x++) g.DrawImage(image, x * 16, y * 16, 16, 16); } e.Graphics.DrawImage(buff, 0, 0, 256, 256); } To eliminate the obvious, I've tried commenting out the last e.Graphics.DrawImage (which means I don't see anything, but it gets rid a call that isn't in the first example). I've also left in the using block (needlessly) in the first example, but it's still just as blazingly fast. I've set properties of g to match e.Graphics - things like InterpolationMode, CompositingQuality, etc, but nothing I do bridges this incredible gap in performance. I can't find any difference between the two Graphics objects. What gives? My test with a System.Diagnostics.Stopwatch says that the first code snippet runs at about 7100 fps, while the second runs at a measly 280 fps. My reference image is VS2010ImageLibrary\Objects\png_format\WinVista\SecurityLock.png, which is 48x48 px, and which I modified to be 72 dpi instead of 96, but those made no difference either.

Read the article
What are the reasons why the CPU usage doesn’t go 100% with C# and APM?

- by Martin

I have an application which is CPU intensive. When the data is processed on a single thread, the CPU usage goes to 100% for many minutes. So the performance of the application appears to be bound by the CPU. I have multithreaded the logic of the application, which result in an increase of the overall performance. However, the CPU usage hardly goes above 30%-50%. I would expect the CPU (and the many cores) to go to 100% since I process many set of data at the same time. Below is a simplified example of the logic I use to start the threads. When I run this example, the CPU goes to 100% (on an 8/16 cores machine). However, my application which uses the same pattern doesn’t. public class DataExecutionContext { public int Counter { get; set; } // Arrays of data } static void Main(string[] args) { // Load data from the database into the context var contexts = new List<DataExecutionContext>(100); for (int i = 0; i < 100; i++) { contexts.Add(new DataExecutionContext()); } // Data loaded. Start to process. var latch = new CountdownEvent(contexts.Count); var processData = new Action<DataExecutionContext>(c => { // The thread doesn't access data from a DB, file, // network, etc. It reads and write data in RAM only // (in its context). for (int i = 0; i < 100000000; i++) c.Counter++; }); foreach (var context in contexts) { processData.BeginInvoke(context, new AsyncCallback(ar => { latch.Signal(); }), null); } latch.Wait(); } I have reduced the number of locks to the strict minimum (only the latch is locking). The best way I found was to create a context in which a thread can read/write in memory. Contexts are not shared among other threads. The threads can’t access the database, files or network. In other words, I profiled my application and I didn’t find any bottleneck. Why the CPU usage of my application doesn’t go about 50%? Is it the pattern I use? Should I create my own thread instead of using the .Net thread pool? Is there any gotchas? Is there any tool that you could recommend me to find my issue? Thanks!

Read the article
How to maximize http.sys file upload performance

- by anelson

I'm building a tool that transfers very large streaming data sets (possibly on the order of terabytes in a single stream; routinely in the tens of gigabytes) from one server to another. The client portion of the tool will read blocks from the source disk, and send them over the network. The server side will read these blocks off the network and write them to a file on the server disk. Right now I'm trying to decide which transport to use. Options are raw TCP, and HTTP. I really, REALLY want to be able to use HTTP. The HttpListener (or WCF if I want to go that route) make it easy to plug in to the HTTP Server API (http.sys), and I can get things like authentication and SSL for free. The problem right now is performance. I wrote a simple test harness that sends 128K blocks of NULL bytes using the BeginWrite/EndWrite async I/O idiom, with async BeginRead/EndRead on the server side. I've modified this test harness so I can do this with either HTTP PUT operations via HttpWebRequest/HttpListener, or plain old socket writes using TcpClient/TcpListener. To rule out issues with network cards or network pathways, both the client and server are on one machine and communicate over localhost. On my 12-core Windows 2008 R2 test server, the TCP version of this test harness can push bytes at 450MB/s, with minimal CPU usage. On the same box, the HTTP version of the test harness runs between 130MB/s and 200MB/s depending upon how I tweak it. In both cases CPU usage is low, and the vast majority of what CPU usage there is is kernel time, so I'm pretty sure my usage of C# and the .NET runtime is not the bottleneck. The box has two 6-core Xeon X5650 processors, 24GB of single-ranked DDR3 RAM, and is used exclusively by me for my own performance testing. I already know about HTTP client tweaks like ServicePointManager.MaxServicePointIdleTime, ServicePointManager.DefaultConnectionLimit, ServicePointManager.Expect100Continue, and HttpWebRequest.AllowWriteStreamBuffering. Does anyone have any ideas for how I can get HTTP.sys performance beyond 200MB/s? Has anyone seen it perform this well on any environment?

Read the article
Implement Semi-Round-Robin file which can be expanded and saved on demand

- by ircmaxell

Ok, that title is going to be a little bit confusing. Let me try to explain it a little bit better. I am building a logging program. The program will have 3 main states: Write to a round-robin buffer file, keeping only the last 10 minutes of data. Write to a buffer file, ignoring the time (record all data). Rename entire buffer file, and start a new one with the past 10 minutes of data (and change state to 1). Now, the use case is this. I have been experiencing some network bottlenecks from time to time in our network. So I want to build a system to record TCP traffic when it detects the bottleneck (detection via Nagios). However by the time it detects the bottlenecking, most of the useful data has already been transmitted. So, what I'd like is to have a deamon that runs something like dumpcap all the time. In normal mode, it'll only keep the past 10 minutes of data (Since there's no point in keeping a boat load of data if it's not needed). But when Nagios alerts, I will send a signal in the deamon to store everything. Then, when Naigos recovers it will send another signal to stop storing and flush the buffer to a save file. Now, the problem is that I can't see how to cleanly store a rotating 10 minutes of data. I could store a new file every 10 minutes and delete the old ones if in mode 1. But that seems a bit dirty to me (especially when it comes to figuring out when the alert happened in the file). Ideally, the file that was saved should be such that the alert is always at the 10:00 mark in the file. While that is possible with new files every 10 minutes, it seems like a bit dirty to "repair" the files to that point. Any ideas? Should I just do a rotating file system and combine them into 1 at the end (doing quite a bit of post-processing)? Is there a way to implement the semi-round-robin file cleanly so that there is no need for any post-processing? Thanks Oh, and the language doesn't matter as much at this stage (I'm leaning towards Python, but have no objection to any other language. It's less of an issue than the overall design)...

Read the article
How can I limit the cache used by copying so there is still memory available for other cache?

- by Peter

Basic situation: I am copying some NTFS disks in openSuSE. Each one is 2TB. When I do this, the system runs slow. My guesses: I believe it is likely due to caching. Linux decides to discard useful cache (eg. kde4 bloat, virtual machine disks, LibreOffice binaries, Thunderbird binaries, etc.) and instead fill all available memory (24 GB total) with stuff from the copying disks, which will be read only once, then written and never used again. So then any time I use these apps (or kde4), the disk needs to be read again, and reading the bloat off the disk again makes things freeze/hiccup. Due to the cache being gone and the fact that these bloated applications need lots of cache, this makes the system horribly slow. Since it is USB,the disk and disk controller are not the bottleneck, so using ionice does not make it faster. I believe it is the cache rather than just the motherboard going too slow, because if I stop everything copying, it still runs choppy for a while until it recaches everything. And if I restart the copying, it takes a minute before it is choppy again. But also, I can limit it to around 40 MB/s, and it runs faster again (not because it has the right things cached, but because the motherboard busses have lots of extra bandwidth for the system disks). I can fully accept a performance loss from my motherboard's IO capability being completely consumed (which is 100% used, meaning 0% wasted power which makes me happy), but I can't accept that this caching mechanism performs so terribly in this specific use case. # free total used free shared buffers cached Mem: 24731556 24531876 199680 0 8834056 12998916 -/+ buffers/cache: 2698904 22032652 Swap: 4194300 24764 4169536 I also tried the same thing on Ubuntu, which causes a total system hang instead. ;) And to clarify, I am not asking how to leave memory free for the "system", but for "cache". I know that cache memory is automatically given back to the system when needed, but my problem is that it is not reserved for caching of specific things. Question: Is there some way to tell these copy operations to limit memory usage so some important things remain cached, and therefore any slowdowns are a result of normal disk usage and not rereading the same commonly used files? For example, is there a setting of max memory per process/user/file system allowed to be used as cache/buffers?

Read the article

Search Results

Search found 415 results on 17 pages for 'bottleneck'.

Page 15/17 | < Previous Page | 11 12 13 14 15 16 17 | Next Page >

- by Laurence

- by ernst

- by James

- by linkedlinked

- by AMF

- by Reed

- by pinaldave

- by Pinal Dave

- by Alex.Davies

- by [email protected]

- by Matt Watson

- by Tarun Arora

- by jpurdy

- by jsavit

- by Mohan Ramanuja

- by David Dorf

- by jpurdy

- by jpurdy

- by Cory Charlton

- by Tesserex

- by Martin

- by anelson

- by ircmaxell

- by Peter

< Previous Page | 11 12 13 14 15 16 17 | Next Page >