throughput - Page 4 - Developer IT

Using thread inter-communication to increase my server app's IO throughput; not sure how

- by Howard Guo

My server application creates a new thread for each incoming connection. Incoming requests are serialized in a BlockingQueue. There is one worker thread taking items from the queue, produce a response and send the response through socket. I have noticed a throughput issue: Currently, worker thread is responsible of sending the response message through socket, thus severely wasting processing power and throughput. I am considering: rather than sending the response itself, why not telling network IO threads to send the response? However, when I think about thread inter-communication, I cannot yet figure out how to approach it: Worker thread will produce a response, but how will it inform the response message to IO thread? Is there a standard/best practice? Thank you.

Read the article

How to test Laptop NIC's throughput using a router and PC - without be bounded?

- by 0x90

My setup includes: Cisco router An i-7 PC running windows A laptop with high speed wifi nic, which I want to check its throughput. I would like to run an FTP server on the PC. hook the router over cables to the PC. I would like to have the PC create its own subnet accessible via the cisco router that would be hooked directly to the PC's nic. From the laptop I want to connect via wifi to the PC's wireless router and connect to the ftp server on the PC. is it possible? how do i connect the router to the PC nic and make it broadcast a subnet via wifi for my laptop to connect to? how do i configure an FTP server to operate only on this subnet?

Read the article

Need to increase nginx throughput to an upstream unix socket -- linux kernel tuning?

- by Ben Lee

I am running an nginx server that acts as a proxy to an upstream unix socket, like this: upstream app_server { server unix:/tmp/app.sock fail_timeout=0; } server { listen ###.###.###.###; server_name whatever.server; root /web/root; try_files $uri @app; location @app { proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; proxy_set_header X-Forwarded-Proto $scheme; proxy_set_header Host $http_host; proxy_redirect off; proxy_pass http://app_server; } } Some app server processes, in turn, pull requests off /tmp/app.sock as they become available. The particular app server in use here is Unicorn, but I don't think that's relevant to this question. The issue is, it just seems that past a certain amount of load, nginx can't get requests through the socket at a fast enough rate. It doesn't matter how many app server processes I set up, it doesn't even matter what the app is (tried it with a dummy app with just a single endpoint that returned an empty page with status 404). The bottleneck seems to be the socket, not the app. I'm getting a flood of these messages in the nginx error log: connect() to unix:/tmp/app.sock failed (11: Resource temporarily unavailable) while connecting to upstream Many requests result in status code 502, and those that don't take a long time to complete. The nginx write queue stat hovers around 1000. Anyway, I feel like I'm missing something obvious here, because this particular configuration of nginx and app server is pretty common, especially with Unicorn (it's the recommended method in fact). Are there any linux kernel options that needs to be set, or something in nginx? Any ideas about how to increase the throughput to the upstream socket? Something that I'm clearly doing wrong? Additional information on the environment: $ uname -a Linux app1 3.2.0-24-generic #39-Ubuntu SMP Mon May 21 16:52:17 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux $ ruby -v ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux] $ unicorn -v unicorn v4.3.1 $ nginx -V nginx version: nginx/1.2.1 built by gcc 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) TLS SNI support enabled Current kernel tweaks: net.core.rmem_default = 65536 net.core.wmem_default = 65536 net.core.rmem_max = 16777216 net.core.wmem_max = 16777216 net.ipv4.tcp_rmem = 4096 87380 16777216 net.ipv4.tcp_wmem = 4096 65536 16777216 net.ipv4.tcp_mem = 16777216 16777216 16777216 net.ipv4.tcp_window_scaling = 1 net.ipv4.route.flush = 1 net.ipv4.tcp_no_metrics_save = 1 net.ipv4.tcp_moderate_rcvbuf = 1 net.core.somaxconn = 8192 net.netfilter.nf_conntrack_max = 131072

Read the article

Are there any protocols/standards on top of TCP optimized for high throughput and low latency?

- by Nosrama

Are there any protocols/standards that work over TCP that are optimized for high throughput and low latency? The only one I can think of is FAST. At the moment I have devised just a simple text-based protocol delimited by special characters. I'd like to adopt a protocol which is designed for fast transfer and supports perhaps compression and minification of the data that travels over the TCP socket.

Read the article

How can I get the current network interface throughput statistics on Linux/UNIX?

- by DavidM

Tools such as MRTG provide network throughput / bandwidth graphs for the current network utilisation on specific interfaces, such as eth0. How can I return that information at the command line on Linux/UNIX? Preferably this would be without installing anything other than what is available on the system as standard.

Read the article

Ruby: would using Fibers increase my DB insert throughput?

- by Zombies

Currently I am using Ruby 1.9.1 and the 'ruby-mysql' gem, which unlike the 'mysql' gem is written in ruby only. This is pretty slow actually, as it seems to insert at a rate of almost 1 per second (SLOOOOOWWWWWW). And I have a lot of inserts to make too, its pretty much what this script does ultamitely. I am using just 1 connection (since I am using just one thread). I am hoping to speed things up by creating a fiber that will create a new DB connection insert 1-3 records close the DB connection I would imagine launching 20-50 of these would greatly increase DB throughput. Am I correct to go along this route? I feel that this is the best option, as opposed to refactoring all of my DB code :(

Read the article

How to properly cast a global memory array using the uint4 vector in CUDA to increase memory throughput?

- by charis

There are generally two techniques to increase the memory throughput of the global memory on a CUDA kernel; memory accesses coalescence and accessing words of at least 4 bytes. With the first technique accesses to the same memory segment by threads of the same half-warp are coalesced to fewer transactions while be accessing words of at least 4 bytes this memory segment is effectively increased from 32 bytes to 128. To access 16-byte instead of 1-byte words when there are unsigned chars stored in the global memory, the uint4 vector is commonly used by casting the memory array to uint4: uint4 *text4 = ( uint4 * ) d_text; var = text4[i]; In order to extract the 16 chars from var, i am currently using bitwise operations. For example: s_array[j * 16 + 0] = var.x & 0x000000FF; s_array[j * 16 + 1] = (var.x >> 8) & 0x000000FF; s_array[j * 16 + 2] = (var.x >> 16) & 0x000000FF; s_array[j * 16 + 3] = (var.x >> 24) & 0x000000FF; My question is, is it possible to recast var (or for that matter *text4) to unsigned char in order to avoid the additional overhead of the bitwise operations?

Read the article

Tiny linux box with 2xGbLAN, WLAN and 10MB/s AES throughput?

- by Nakedible

I'd like to find a small linux box with the following specifications: Small (mini-ITX size is OK) Fanless Runs Debian At least two gigabit network interfaces WLAN that supports "host ap" with hostapd + mac80211 in AP mode Can encrypt AES at least 10 megabytes per second Total cost $300 or less Solutions from multiple parts also accepted - I can buy an external network card etc. and build the box myself if the components are available. If you don't know about the "host ap" thing, just suggest your solution, I'll find out if I can get that resolved. If I can't get all that, I can possibly skip the "runs Debian" part, and I can definitely skip the hostapd part if the box can be a wireless access point with multiple ESSIDs out of the box. Something like Asus RT-N16 is close - doesn't run Debian easily, and probably doesn't encrypt AES fast enough. Something like Zotac ZBOX HD-ID11 is also close - no idea which WLAN card it has and it lacks second gigabit interface, but otherwise nice.

Read the article

What can impact the throughput rate at tcp or Os level?

- by Jimm

I am facing a problem, where running the same application on different servers, yields unexpected performance results. For example, running the application on a particular faster server (faster cpu, more memory), with no load, yields slower performance than running on a less powerful server on the same network. I am suspecting that either OS or TCP is causing the slowness on the faster server. I cannot use IPerf , unless i modify it, because the "performance" in my application is defined as Component A sends a message to Component B. Component B sends an ACK to component A and ONLY then Component A would send the next message. So it is different from what IPerf does, which to my knowledge, simply tries to push as many messages as possible. Is there a tool that can look at OS and TCP configuration and suggest the cause of slowness?

Read the article

Oracle TimesTen In-Memory Database Performance on SPARC T4-2

- by Brian

The Oracle TimesTen In-Memory Database is optimized to run on Oracle's SPARC T4 processor platforms running Oracle Solaris 11 providing unsurpassed scalability, performance, upgradability, protection of investment and return on investment. The following demonstrate the value of combining Oracle TimesTen In-Memory Database with SPARC T4 servers and Oracle Solaris 11: On a Mobile Call Processing test, the 2-socket SPARC T4-2 server outperforms: Oracle's SPARC Enterprise M4000 server (4 x 2.66 GHz SPARC64 VII+) by 34%. Oracle's SPARC T3-4 (4 x 1.65 GHz SPARC T3) by 2.7x, or 5.4x per processor. Utilizing the TimesTen Performance Throughput Benchmark (TPTBM), the SPARC T4-2 server protects investments with: 2.1x the overall performance of a 4-socket SPARC Enterprise M4000 server in read-only mode and 1.5x the performance in update-only testing. This is 4.2x more performance per processor than the SPARC64 VII+ 2.66 GHz based system. 10x more performance per processor than the SPARC T2+ 1.4 GHz server. 1.6x better performance per processor than the SPARC T3 1.65 GHz based server. In replication testing, the two socket SPARC T4-2 server is over 3x faster than the performance of a four socket SPARC Enterprise T5440 server in both asynchronous replication environment and the highly available 2-Safe replication. This testing emphasizes parallel replication between systems. Performance Landscape Mobile Call Processing Test Performance System Processor Sockets/Cores/Threads Tps SPARC T4-2 SPARC T4, 2.85 GHz 2 16 128 218,400 M4000 SPARC64 VII+, 2.66 GHz 4 16 32 162,900 SPARC T3-4 SPARC T3, 1.65 GHz 4 64 512 80,400 TimesTen Performance Throughput Benchmark (TPTBM) Read-Only System Processor Sockets/Cores/Threads Tps SPARC T3-4 SPARC T3, 1.65 GHz 4 64 512 7.9M SPARC T4-2 SPARC T4, 2.85 GHz 2 16 128 6.5M M4000 SPARC64 VII+, 2.66 GHz 4 16 32 3.1M T5440 SPARC T2+, 1.4 GHz 4 32 256 3.1M TimesTen Performance Throughput Benchmark (TPTBM) Update-Only System Processor Sockets/Cores/Threads Tps SPARC T4-2 SPARC T4, 2.85 GHz 2 16 128 547,800 M4000 SPARC64 VII+, 2.66 GHz 4 16 32 363,800 SPARC T3-4 SPARC T3, 1.65 GHz 4 64 512 240,500 TimesTen Replication Tests System Processor Sockets/Cores/Threads Asynchronous 2-Safe SPARC T4-2 SPARC T4, 2.85 GHz 2 16 128 38,024 13,701 SPARC T5440 SPARC T2+, 1.4 GHz 4 32 256 11,621 4,615 Configuration Summary Hardware Configurations: SPARC T4-2 server 2 x SPARC T4 processors, 2.85 GHz 256 GB memory 1 x 8 Gbs FC Qlogic HBA 1 x 6 Gbs SAS HBA 4 x 300 GB internal disks Sun Storage F5100 Flash Array (40 x 24 GB flash modules) 1 x Sun Fire X4275 server configured as COMSTAR head SPARC T3-4 server 4 x SPARC T3 processors, 1.6 GHz 512 GB memory 1 x 8 Gbs FC Qlogic HBA 8 x 146 GB internal disks 1 x Sun Fire X4275 server configured as COMSTAR head SPARC Enterprise M4000 server 4 x SPARC64 VII+ processors, 2.66 GHz 128 GB memory 1 x 8 Gbs FC Qlogic HBA 1 x 6 Gbs SAS HBA 2 x 146 GB internal disks Sun Storage F5100 Flash Array (40 x 24 GB flash modules) 1 x Sun Fire X4275 server configured as COMSTAR head Software Configuration: Oracle Solaris 11 11/11 Oracle TimesTen 11.2.2.4 Benchmark Descriptions TimesTen Performance Throughput BenchMark (TPTBM) is shipped with TimesTen and measures the total throughput of the system. The workload can test read-only, update-only, delete and insert operations as required. Mobile Call Processing is a customer-based workload for processing calls made by mobile phone subscribers. The workload has a mixture of read-only, update, and insert-only transactions. The peak throughput performance is measured from multiple concurrent processes executing the transactions until a peak performance is reached via saturation of the available resources. Parallel Replication tests using both asynchronous and 2-Safe replication methods. For asynchronous replication, transactions are processed in batches to maximize the throughput capabilities of the replication server and network. In 2-Safe replication, also known as no data-loss or high availability, transactions are replicated between servers immediately emphasizing low latency. For both environments, performance is measured in the number of parallel replication servers and the maximum transactions-per-second for all concurrent processes. See Also SPARC T4-2 Server oracle.com OTN Oracle TimesTen In-Memory Database oracle.com OTN Oracle Solaris oracle.com OTN Oracle Database 11g Release 2 Enterprise Edition oracle.com OTN Disclosure Statement Copyright 2012, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 1 October 2012.

Read the article

Latency Matters

- by Frederic P

A lot of interest in low latencies has been expressed within the financial services segment, most especially in the stock trading applications where every millisecond directly influences the profitability of the trader. These days, much of the trading is executed by software applications which are trained to respond to each other almost instantaneously. In fact, you could say that we are in an arms race where traders are using any and all options to cut down on the delay in executing transactions, even by moving physically closer to the trading venue. The Solaris OS network stack has traditionally been engineered for high throughput, at the expense of higher latencies. Knowledge of tuning parameters to redress the imbalance is critical for applications that are latency sensitive. We are presenting in this blog how to configure further a default Oracle Solaris 10 installation to reduce network latency. There are many parameters in fact that can be altered, but the most effective ones are intr_blank_time and intr_blank_packets. These parameters affect on-board network throughput and latency on Solaris systems. If interrupt blanking is disabled, packets are processed by the driver as soon as they arrive, resulting in higher network throughput and lower latency, but with higher CPU utilization. With interrupt blanking disabled, processor utilization can be as high as 80–90% in some high-load web server environments. If interrupt blanking is enabled, packets are processed when the interrupt is issued. Enabling interrupt blanking can result in reduced processor utilization and network throughput, but higher network latency. Both parameters should be set at the same time. You can set these parameters by using the ndd command as follows: # ndd -set /dev/eri intr_blank_time 0 # ndd -set /dev/eri intr_blank_packets 0 You can add them to the /etc/system file as follows: set eri:intr_blank_time 0 set eri:intr_blank_packets 0 The value of the interrupt blanking parameter is a trade-off between network throughput and processor utilization. If higher processor utilization is acceptable for achieving higher network throughput, then disable interrupt blanking. If lower processor utilization is preferred and higher network latency is the penalty, then enable interrupt blanking. Our experience at ISV Engineering is that under controlled experiments the above settings result in reduction of network latency by at least 50%; on a two-socket 3GHz Sun Fire X4170 M2 running Solaris 10 Update 9, the above settings improved ping-pong latency from 60µs to 25-30µs with the on-board NIC.

Read the article

Is there a better way to throttle a high throughput job?

- by ChaosPandion

I created a simple class that shows what I am trying to do without any noise. Feel free to bash away at my code. That's why I posted it here. public class Throttled : IDisposable { private readonly Action work; private readonly Func<bool> stop; private readonly ManualResetEvent continueProcessing; private readonly Timer throttleTimer; private readonly int throttlePeriod; private readonly int throttleLimit; private int totalProcessed; public Throttled(Action work, Func<bool> stop, int throttlePeriod, int throttleLimit) { this.work = work; this.stop = stop; this.throttlePeriod = throttlePeriod; this.throttleLimit = throttleLimit; continueProcessing = new ManualResetEvent(true); throttleTimer = new Timer(ThrottleUpdate, null, throttlePeriod, throttlePeriod); } public void Dispose() { throttleTimer.Dispose(); ((IDisposable)continueProcessing).Dispose(); } public void Execute() { while (!stop()) { if (Interlocked.Increment(ref totalProcessed) > throttleLimit) { lock (continueProcessing) { continueProcessing.Reset(); } if (!continueProcessing.WaitOne(throttlePeriod)) { throw new TimeoutException(); } } work(); } } private void ThrottleUpdate(object state) { Interlocked.Exchange(ref totalProcessed, 0); lock (continueProcessing) { continueProcessing.Set(); } } }

Read the article

file read performance degrades as number of files increases

- by bfallik-bamboom

We're observing poor file read IO results that we'd like to better understand. We can use fio to write 100 files with a sustained aggregate throughput of ~700MB/s. When we switch the test to read instead of write, the aggregate throughput is only ~55MB/s. The drop seems related to the number of files since the throughput for read and write are comparable for a single file then diverge proportionally as we increase the number of files. The test server has 24 CPU cores, 48GB of memory, and is running CentOS 6.0. The disk hardware is a RAID 6 array with 12 disks and a Dell H800 controller. This device is partitioned with ext4 using the default settings. Increasing the readahead (using blockdev) improves the read throughput significantly but it still doesn't match write speed. For instance, increasing the readahead from 128KB to 1M improved the read throughput to ~145MB/s. Is this a known performance issue in our OS/disk/filesystem configuration? If so, how can we tell? If not, what tools or tests can we use to further isolate the issue? Thanks.

Read the article

WebLogic Server Performance and Tuning: Part I - Tuning JVM

- by Gokhan Gungor

Each WebLogic Server instance runs in its own dedicated Java Virtual Machine (JVM) which is their runtime environment. Every Admin Server in any domain executes within a JVM. The same also applies for Managed Servers. WebLogic Server can be used for a wide variety of applications and services which uses the same runtime environment and resources. Oracle WebLogic ships with 2 different JVM, HotSpot and JRocket but you can choose which JVM you want to use. JVM is designed to optimize itself however it also provides some startup options to make small changes. There are default values for its memory and garbage collection. In real world, you will not want to stick with the default values provided by the JVM rather want to customize these values based on your applications which can produce large gains in performance by making small changes with the JVM parameters. We can tell the garbage collector how to delete garbage and we can also tell JVM how much space to allocate for each generation (of java Objects) or for heap. Remember during the garbage collection no other process is executed within the JVM or runtime, which is called STOP THE WORLD which can affect the overall throughput. Each JVM has its own memory segment called Heap Memory which is the storage for java Objects. These objects can be grouped based on their age like young generation (recently created objects) or old generation (surviving objects that have lived to some extent), etc. A java object is considered garbage when it can no longer be reached from anywhere in the running program. Each generation has its own memory segment within the heap. When this segment gets full, garbage collector deletes all the objects that are marked as garbage to create space. When the old generation space gets full, the JVM performs a major collection to remove the unused objects and reclaim their space. A major garbage collect takes a significant amount of time and can affect system performance. When we create a managed server either on the same machine or on remote machine it gets its initial startup parameters from $DOMAIN_HOME/bin/setDomainEnv.sh/cmd file. By default two parameters are set: Xms: The initial heapsize Xmx: The max heapsize Try to set equal initial and max heapsize. The startup time can be a little longer but for long running applications it will provide a better performance. When we set -Xms512m -Xmx1024m, the physical heap size will be 512m. This means that there are pages of memory (in the state of the 512m) that the JVM does not explicitly control. It will be controlled by OS which could be reserve for the other tasks. In this case, it is an advantage if the JVM claims the entire memory at once and try not to spend time to extend when more memory is needed. Also you can use -XX:MaxPermSize (Maximum size of the permanent generation) option for Sun JVM. You should adjust the size accordingly if your application dynamically load and unload a lot of classes in order to optimize the performance. You can set the JVM options/heap size from the following places: Through the Admin console, in the Server start tab In the startManagedWeblogic script for the managed servers $DOMAIN_HOME/bin/startManagedWebLogic.sh/cmd JAVA_OPTIONS="-Xms1024m -Xmx1024m" ${JAVA_OPTIONS} In the setDomainEnv script for the managed servers and admin server (domain wide) USER_MEM_ARGS="-Xms1024m -Xmx1024m" When there is free memory available in the heap but it is too fragmented and not contiguously located to store the object or when there is actually insufficient memory we can get java.lang.OutOfMemoryError. We should create Thread Dump and analyze if that is possible in case of such error. The second option we can use to produce higher throughput is to garbage collection. We can roughly divide GC algorithms into 2 categories: parallel and concurrent. Parallel GC stops the execution of all the application and performs the full GC, this generally provides better throughput but also high latency using all the CPU resources during GC. Concurrent GC on the other hand, produces low latency but also low throughput since it performs GC while application executes. The JRockit JVM provides some useful command-line parameters that to control of its GC scheme like -XgcPrio command-line parameter which takes the following options; XgcPrio:pausetime (To minimize latency, parallel GC) XgcPrio:throughput (To minimize throughput, concurrent GC ) XgcPrio:deterministic (To guarantee maximum pause time, for real time systems) Sun JVM has similar parameters (like -XX:UseParallelGC or -XX:+UseConcMarkSweepGC) to control its GC scheme. We can add -verbosegc -XX:+PrintGCDetails to monitor indications of a problem with garbage collection. Try configuring JVM’s of all managed servers to execute in -server mode to ensure that it is optimized for a server-side production environment.

Read the article

Tuning Default WorkManager - Advantages and Disadvantages

- by Murali Veligeti

Before discussing on Tuning Default WorkManager, lets have a brief introduction on What is Default WorkManger Before Weblogic Server 9.0 release, we had the concept of Execute Queues. WebLogic Server (before WLS 9.0), processing was performed in multiple execute queues. Different classes of work were executed in different queues, based on priority and ordering requirements, and to avoid deadlocks. In addition to the default execute queue, weblogic.kernel.default, there were pre-configured queues dedicated to internal administrative traffic, such as weblogic.admin.HTTP and weblogic.admin.RMI.Users could control thread usage by altering the number of threads in the default queue, or configure custom execute queues to ensure that particular applications had access to a fixed number of execute threads, regardless of overall system load. From WLS 9.0 release onwards WebLogic Server uses is a single thread pool (single thread pool which is called Default WorkManager), in which all types of work are executed. WebLogic Server prioritizes work based on rules you define, and run-time metrics, including the actual time it takes to execute a request and the rate at which requests are entering and leaving the pool.The common thread pool changes its size automatically to maximize throughput. The queue monitors throughput over time and based on history, determines whether to adjust the thread count. For example, if historical throughput statistics indicate that a higher thread count increased throughput, WebLogic increases the thread count. Similarly, if statistics indicate that fewer threads did not reduce throughput, WebLogic decreases the thread count. This new strategy makes it easier for administrators to allocate processing resources and manage performance, avoiding the effort and complexity involved in configuring, monitoring, and tuning custom executes queues. The Default WorkManager is used to handle thread management and perform self-tuning.This Work Manager is used by an application when no other Work Managers are specified in the application’s deployment descriptors. In many situations, the default Work Manager may be sufficient for most application requirements. WebLogic Server’s thread-handling algorithms assign each application its own fair share by default. Applications are given equal priority for threads and are prevented from monopolizing them. The default work-manager, as its name tells, is the work-manager defined by default.Thus, all applications deployed on WLS will use it. But sometimes, when your application is already in production, it's obvious you can't take your EAR / WAR, update the deployment descriptor(s) and redeploy it.The default work-manager belongs to a thread-pool, as initial thread-pool comes with only five threads, that's not much. If your application has to face a large number of hits, you may want to start with more than that.Well, that's quite easy. You have two option to do so.1) Modify the config.xmlJust add the following line(s) in your server definition : <server> <name>AdminServer</name> <self-tuning-thread-pool-size-min>100</self-tuning-thread-pool-size-min> <self-tuning-thread-pool-size-max>200</self-tuning-thread-pool-size-max> [...] </server> 2) Adding some JVM parameters Add the following system property in setDomainEnv.sh/setDomainEnv.cmd or startWebLogic.sh/startWebLogic.cmd : -Dweblogic.threadpool.MinPoolSize=100 -Dweblogic.threadpool.MaxPoolSize=100 Reboot WLS and see the option has been taken into account . Disadvantage: So far its fine. But here there is an disadvantage in tuning Default WorkManager. Internally Weblogic Server has many work managers configured for different types of work. if we run out of threads in the self-tuning pool(because of system property -Dweblogic.threadpool.MaxPoolSize) due to being undersized, then important work that WLS might need to do could be starved. So, while limiting the self-tuning would limit the default WorkManager and internally it also limits all other internal WorkManagers which WLS uses.So the best alternative is to override the default WorkManager that means creating a WorkManager for the Application and assign the WorkManager for the application instead of tuning the Default WorkManager.

Read the article

SQL IO and SAN troubles

- by James

We are running two servers with identical software setup but different hardware. The first one is a VM on VMWare on a normal tower server with dual core xeons, 16 GB RAM and a 7200 RPM drive. The second one is a VM on XenServer on a powerful brand new rack server, with 4 core xeons and shared storage. We are running Dynamics AX 2012 and SQL Server 2008 R2. When I insert 15 000 records into a table on the slow tower server (as a test), it does so in 13 seconds. On the fast server it takes 33 seconds. I re-ran these tests several times with the same results. I have a feeling it is some sort of IO bottleneck, so I ran SQLIO on both. Here are the results for the slow tower server: C:\Program Files (x86)\SQLIO>test.bat C:\Program Files (x86)\SQLIO>sqlio -kW -t8 -s120 -o8 -frandom -b8 -BH -LS C:\Tes tFile.dat sqlio v1.5.SG using system counter for latency timings, 14318180 counts per second 8 threads writing for 120 secs to file C:\TestFile.dat using 8KB random IOs enabling multiple I/Os per thread with 8 outstanding buffering set to use hardware disk cache (but not file cache) using current size: 5120 MB for file: C:\TestFile.dat initialization done CUMULATIVE DATA: throughput metrics: IOs/sec: 226.97 MBs/sec: 1.77 latency metrics: Min_Latency(ms): 0 Avg_Latency(ms): 281 Max_Latency(ms): 467 histogram: ms: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+ %: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 99 C:\Program Files (x86)\SQLIO>sqlio -kR -t8 -s120 -o8 -frandom -b8 -BH -LS C:\Tes tFile.dat sqlio v1.5.SG using system counter for latency timings, 14318180 counts per second 8 threads reading for 120 secs from file C:\TestFile.dat using 8KB random IOs enabling multiple I/Os per thread with 8 outstanding buffering set to use hardware disk cache (but not file cache) using current size: 5120 MB for file: C:\TestFile.dat initialization done CUMULATIVE DATA: throughput metrics: IOs/sec: 91.34 MBs/sec: 0.71 latency metrics: Min_Latency(ms): 14 Avg_Latency(ms): 699 Max_Latency(ms): 1124 histogram: ms: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+ %: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 100 C:\Program Files (x86)\SQLIO>sqlio -kW -t8 -s120 -o8 -fsequential -b64 -BH -LS C :\TestFile.dat sqlio v1.5.SG using system counter for latency timings, 14318180 counts per second 8 threads writing for 120 secs to file C:\TestFile.dat using 64KB sequential IOs enabling multiple I/Os per thread with 8 outstanding buffering set to use hardware disk cache (but not file cache) using current size: 5120 MB for file: C:\TestFile.dat initialization done CUMULATIVE DATA: throughput metrics: IOs/sec: 1094.50 MBs/sec: 68.40 latency metrics: Min_Latency(ms): 0 Avg_Latency(ms): 58 Max_Latency(ms): 467 histogram: ms: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+ %: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 100 C:\Program Files (x86)\SQLIO>sqlio -kR -t8 -s120 -o8 -fsequential -b64 -BH -LS C :\TestFile.dat sqlio v1.5.SG using system counter for latency timings, 14318180 counts per second 8 threads reading for 120 secs from file C:\TestFile.dat using 64KB sequential IOs enabling multiple I/Os per thread with 8 outstanding buffering set to use hardware disk cache (but not file cache) using current size: 5120 MB for file: C:\TestFile.dat initialization done CUMULATIVE DATA: throughput metrics: IOs/sec: 1155.31 MBs/sec: 72.20 latency metrics: Min_Latency(ms): 17 Avg_Latency(ms): 55 Max_Latency(ms): 205 histogram: ms: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+ %: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 100 Here are the results of the fast rack server: C:\Program Files (x86)\SQLIO>test.bat C:\Program Files (x86)\SQLIO>sqlio -kW -t8 -s120 -o8 -frandom -b8 -BH -LS E:\Tes tFile.dat sqlio v1.5.SG using system counter for latency timings, 62500000 counts per second 8 threads writing for 120 secs to file E:\TestFile.dat using 8KB random IOs enabling multiple I/Os per thread with 8 outstanding buffering set to use hardware disk cache (but not file cache) open_file: CreateFile (E:\TestFile.dat for write): The system cannot find the pa th specified. exiting C:\Program Files (x86)\SQLIO>sqlio -kR -t8 -s120 -o8 -frandom -b8 -BH -LS E:\Tes tFile.dat sqlio v1.5.SG using system counter for latency timings, 62500000 counts per second 8 threads reading for 120 secs from file E:\TestFile.dat using 8KB random IOs enabling multiple I/Os per thread with 8 outstanding buffering set to use hardware disk cache (but not file cache) open_file: CreateFile (E:\TestFile.dat for read): The system cannot find the pat h specified. exiting C:\Program Files (x86)\SQLIO>sqlio -kW -t8 -s120 -o8 -fsequential -b64 -BH -LS E :\TestFile.dat sqlio v1.5.SG using system counter for latency timings, 62500000 counts per second 8 threads writing for 120 secs to file E:\TestFile.dat using 64KB sequential IOs enabling multiple I/Os per thread with 8 outstanding buffering set to use hardware disk cache (but not file cache) open_file: CreateFile (E:\TestFile.dat for write): The system cannot find the pa th specified. exiting C:\Program Files (x86)\SQLIO>sqlio -kR -t8 -s120 -o8 -fsequential -b64 -BH -LS E :\TestFile.dat sqlio v1.5.SG using system counter for latency timings, 62500000 counts per second 8 threads reading for 120 secs from file E:\TestFile.dat using 64KB sequential IOs enabling multiple I/Os per thread with 8 outstanding buffering set to use hardware disk cache (but not file cache) open_file: CreateFile (E:\TestFile.dat for read): The system cannot find the pat h specified. exiting C:\Program Files (x86)\SQLIO>test.bat C:\Program Files (x86)\SQLIO>sqlio -kW -t8 -s120 -o8 -frandom -b8 -BH -LS c:\Tes tFile.dat sqlio v1.5.SG using system counter for latency timings, 62500000 counts per second 8 threads writing for 120 secs to file c:\TestFile.dat using 8KB random IOs enabling multiple I/Os per thread with 8 outstanding buffering set to use hardware disk cache (but not file cache) using current size: 5120 MB for file: c:\TestFile.dat initialization done CUMULATIVE DATA: throughput metrics: IOs/sec: 2575.77 MBs/sec: 20.12 latency metrics: Min_Latency(ms): 1 Avg_Latency(ms): 24 Max_Latency(ms): 655 histogram: ms: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+ %: 0 0 0 5 8 9 9 9 8 5 3 1 1 1 1 0 0 0 0 0 0 0 0 0 37 C:\Program Files (x86)\SQLIO>sqlio -kR -t8 -s120 -o8 -frandom -b8 -BH -LS c:\Tes tFile.dat sqlio v1.5.SG using system counter for latency timings, 62500000 counts per second 8 threads reading for 120 secs from file c:\TestFile.dat using 8KB random IOs enabling multiple I/Os per thread with 8 outstanding buffering set to use hardware disk cache (but not file cache) using current size: 5120 MB for file: c:\TestFile.dat initialization done CUMULATIVE DATA: throughput metrics: IOs/sec: 1141.39 MBs/sec: 8.91 latency metrics: Min_Latency(ms): 1 Avg_Latency(ms): 55 Max_Latency(ms): 652 histogram: ms: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+ %: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 91 C:\Program Files (x86)\SQLIO>sqlio -kW -t8 -s120 -o8 -fsequential -b64 -BH -LS c :\TestFile.dat sqlio v1.5.SG using system counter for latency timings, 62500000 counts per second 8 threads writing for 120 secs to file c:\TestFile.dat using 64KB sequential IOs enabling multiple I/Os per thread with 8 outstanding buffering set to use hardware disk cache (but not file cache) using current size: 5120 MB for file: c:\TestFile.dat initialization done CUMULATIVE DATA: throughput metrics: IOs/sec: 341.37 MBs/sec: 21.33 latency metrics: Min_Latency(ms): 5 Avg_Latency(ms): 186 Max_Latency(ms): 120037 histogram: ms: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+ %: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 100 C:\Program Files (x86)\SQLIO>sqlio -kR -t8 -s120 -o8 -fsequential -b64 -BH -LS c :\TestFile.dat sqlio v1.5.SG using system counter for latency timings, 62500000 counts per second 8 threads reading for 120 secs from file c:\TestFile.dat using 64KB sequential IOs enabling multiple I/Os per thread with 8 outstanding buffering set to use hardware disk cache (but not file cache) using current size: 5120 MB for file: c:\TestFile.dat initialization done CUMULATIVE DATA: throughput metrics: IOs/sec: 1024.07 MBs/sec: 64.00 latency metrics: Min_Latency(ms): 5 Avg_Latency(ms): 61 Max_Latency(ms): 81632 histogram: ms: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+ %: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 100 Three of the four tests are, to my mind, within reasonable parameters for the rack server. However, the 64 write test is incredibly slow on the rack server. (68 mb/sec on the slow tower vs 21 mb/s on the rack). The read speed for 64k also seems slow. Is this enough to say there is some sort of bottleneck with the shared storage? I need to know if I can take this evidence and say we need to launch an investigation into this. Any help is appreciated.

Read the article

OBIEE 11.1.1 - How to configure HTTP compression / caching on Oracle BI Mobile app

- by Ahmed Awan

Applies to: OBIEE 11.1.1.5 Supported Physical Devices and OS: The Oracle BI Mobile application with HTTP compression / caching configurations is tested on following devices: iPhone 4S, 4, 3GS. iPad 2 and 1. Note these devices must be running the latest version of the iOS version, i.e. iOS 4.2.1 / iOS 5 is also supported. Configuring Pre-requisites: Prior to configuration, the Oracle Web tier software must be installed on server, as described in product documentation i.e. Enterprise Deployment Guide for Oracle Business Intelligence in Section 3.2, "Installing Oracle HTTP Server." The steps for configuring the compression and caching on Oracle HTTP Server are described in this PA blog at http://blogs.oracle.com/pa/entry/obiee_11g_user_interface_ui and in support Doc ID 1312299.1. Configuration Steps in Oracle BI Mobile application: 1. Download the BI Mobile app from the Apple iTunes App Store. The link is http://itunes.apple.com/us/app/oracle-business-intelligence/id434559909?mt=8 . 2. Add Server for example http://pew801.us.oracle.com:7777/analytics/ , here is how your “Server Setting” screen should look like on your OBI Mobile app: Performance Gain Test (using Oracle® HTTP Server with OBIEE) The test with/without HTTP compression / caching was conducted on iPhone 4S / iPad 2 to measure the throughput (i.e. total bytes received) for Oracle® Business Intelligence Enterprise Edition. Below table shows the throughput comparison before and after using HTTP compression / caching for SampleApp using “QuickStart” dashboard accessing reports i.e. Overview, Details, Published Reporting and Scorecard. Testing shows that total bytes received were reduced from 2.3 MB to 723 KB. a. Test Results > Without HTTP Compression / Caching setting - Total Throughput (in Bytes) captured below: Total Bytes Statistics: b. Test Results > With HTTP Compression / Caching settings - Total Throughput (in Bytes) captured below: Total Bytes Statistics:

Read the article

How to diagnose storage system scaling problems?

- by Unknown

We are currently testing the maximum sequential read throughput of a storage system (48 disks total behind two HP P2000 arrays) connected to HP DL580 G7 running RHEL 5 with 128 GB of memory. Initial testing has been mainly done by running DD-commands like this: dd if=/dev/mapper/mpath1 of=/dev/null bs=1M count=3000 In parallel for each disk. However, we have been unable to scale the results from one array (maximum throughput of 1.3 GB/s) to two (almost the same throughput). Each array is connected to a dedicated host bust adapter, so they should not be the bottleneck. The disks are currently in JBOD configuration, so each disk can be addressed directly. I have two questions: Is running multiple DD commands in parallel really a good way to test maximum read throughput? We have noticed very high SWAPIN-% numbers in iotop, which I find hard to explain because the target is /dev/null How shoud we proceed in trying to find the reason for the scaling problem? Do you thing the server itself is the bottleneck here, or could there be some linux parameters that we have overlooked?

Read the article

Gigabit network limited to 25MB/s by CPU. How to make it faster?

- by netvope

I have a Acer Aspire R1600-U910H with a nForce gigabit network adapter. The maximum TCP throughput of it is about 25MB/s, and apparently it is limited by the single core Intel Atom 230; when the maximum throughput is reached, the CPU usage is about 50%-60%, which corresponds to full utilization considering this is a Hyper-threading enabled CPU. The same problem occurs on both Windows XP and on Ubuntu 8.04. On Windows, I have installed the latest nForce chipset driver, disabled power saving features, and enabled checksum offload. On Linux, the default driver has checksum offload enabled. There is no Linux driver available on Nvidia's website. ethtool -k eth0 shows that checksum offload is enabled: Offload parameters for eth0: rx-checksumming: on tx-checksumming: on scatter-gather: on tcp segmentation offload: on udp fragmentation offload: off generic segmentation offload: off The following is the output of powertop when the network is idle: Wakeups-from-idle per second : 61.9 interval: 10.0s no ACPI power usage estimate available Top causes for wakeups: 90.9% (101.3) <interrupt> : eth0 4.5% ( 5.0) iftop : schedule_timeout (process_timeout) 1.8% ( 2.0) <kernel core> : clocksource_register (clocksource_watchdog) 0.9% ( 1.0) dhcdbd : schedule_timeout (process_timeout) 0.5% ( 0.6) <kernel core> : neigh_table_init_no_netlink (neigh_periodic_timer) And when the maximum throughput of about 25MB/s is reached: Wakeups-from-idle per second : 11175.5 interval: 10.0s no ACPI power usage estimate available Top causes for wakeups: 99.9% (22097.4) <interrupt> : eth0 0.0% ( 5.0) iftop : schedule_timeout (process_timeout) 0.0% ( 2.0) <kernel core> : clocksource_register (clocksource_watchdog) 0.0% ( 1.0) dhcdbd : schedule_timeout (process_timeout) 0.0% ( 0.6) <kernel core> : neigh_table_init_no_netlink (neigh_periodic_timer) Notice the 20000 interrupts per second. Could this be the cause for the high CPU usage and low throughput? If so, how can I improve the situation? The other computers in the network can usually transfer at 50+MB/s without problems. And a minor question: How can I find out what is the driver in use for eth0?

Read the article

What kind of server configuration is best for a chatting app? [closed]

- by mohabitar

I'm just now starting to go deeper into the world of cloud hosting and databases, and am getting overwhelmed by how deep this information goes. It's all a little too much to consume in a short amount of time. I get a lot of pricing information, but I'm unable to determine what that means to me. I'm making what you might compare to an email app. Users can send messages to one another. I just don't understand, out of the several options, what would be ideal for an app like this, where users would be constantly sending and receiving text data. With Amazon DynamoDB, I have to specify a pre-defined throughput with number of reads and writes per second. Sure I can just type 50, but I'm not exactly sure what 50 writes per second represents. I'm trying to determine what would be the most cost efficient solution, and I want to know what a throughput of 50 reads/writes/second compares to. Is that a high number? What is a good throughput number for a message sending app with say 50,000 daily users? I'm just providing specific numbers so I can understand what these throughput numbers represent. 100 transactions/second to me seems like a small number since I'm not familiar with this stuff, so I'm just looking to bring everything in context. What would 100 read/write/second be useful for? Are there any average example values available? And I'm not sure what each service is good for. For a message sending app, is there any reason I'd want to choose say Amazon DynamoDB over Google App Engine? Any insight would be greatly appreciated.

Read the article

Sizing Switches for Storage and Production

- by Untalented

Couple questions. Should you always completely separate the storage network switches from production switches or are VLANs fine to segment this traffic? Is there a golden rule here? How do you properly size a switch for your environment based on the specifications the manufacturer provide (Throughput, Forwarding Throughput, Stacking Throughput, Max Mac)? If you have two switch options and one has a maximum Mac address of 8,000 vs. another with 16,0000. What does this really mean to me? How do make sure one vs. another is sized properly for me? Besides VLAN and Jumbo Frame support, is there any other "Must" haves for a virtual environments production or storage networks? There is a wealth of knowledge on sizing SANs and such, but this seems equally important and it's quite challenging to find as much information. -- Just to add some tidbits of information for the environment. This setup above is referring to the data centers which supports two different locations which have about 100 users between the two in total. The storage traffic will be iSCSI and will be 3 ESXi Hosts and one SAN housing about 2.7TB of data. Since there is currently no storage network in place (no SAN), I'm having a hard time regarding #2 to really determine what backplane throughput and switch specifications will be sufficient.

Read the article

Retrieve Performance Data from SOA Infrastructure Database

- by fip

My earlier blog posting shows how to enable, retrieve and interpret BPEL engine performance statistics to aid performance troubleshooting. The strength of BPEL engine statistics at EM is its break down per request. But there are some limitations with the BPEL performance statistics mentioned in that blog posting: The statistics were stored in memory instead of being persisted. To avoid memory overflow, the data are stored to a buffer with limited size. When the statistic entries exceed the limitation, old data will be flushed out to give ways to new statistics. Therefore it can only keep the last X number of entries of data. The statistics 5 hour ago may not be there anymore. The BPEL engine performance statistics only includes latencies. It does not provide throughputs. Fortunately, Oracle SOA Suite runs with the SOA Infrastructure database and a lot of performance data are naturally persisted there. It is at a more coarse grain than the in-memory BPEL Statistics, but it does have its own strengths as it is persisted. Here I would like offer examples of some basic SQL queries you can run against the infrastructure database of Oracle SOA Suite 11G to acquire the performance statistics for a given period of time. You can run it immediately after you modify the date range to match your actual system. 1. Asynchronous/one-way messages incoming rates The following query will show number of messages sent to one-way/async BPEL processes during a given time period, organized by process names and states select composite_name composite, state, count(*) Count from dlv_message where receive_date >= to_timestamp('2012-10-24 21:00:00','YYYY-MM-DD HH24:MI:SS') and receive_date <= to_timestamp('2012-10-24 21:59:59','YYYY-MM-DD HH24:MI:SS') group by composite_name, state order by Count; 2. Throughput of BPEL process instances The following query shows the number of synchronous and asynchronous process instances created during a given time period. It list instances of all states, including the unfinished and faulted ones. The results will include all composites cross all SOA partitions select state, count(*) Count, composite_name composite, component_name,componenttype from cube_instance where creation_date >= to_timestamp('2012-10-24 21:00:00','YYYY-MM-DD HH24:MI:SS') and creation_date <= to_timestamp('2012-10-24 21:59:59','YYYY-MM-DD HH24:MI:SS') group by composite_name, component_name, componenttype order by count(*) desc; 3. Throughput and latencies of BPEL process instances This query is augmented on the previous one, providing more comprehensive information. It gives not only throughput but also the maximum, minimum and average elapse time BPEL process instances. select composite_name Composite, component_name Process, componenttype, state, count(*) Count, trunc(Max(extract(day from (modify_date-creation_date))*24*60*60 + extract(hour from (modify_date-creation_date))*60*60 + extract(minute from (modify_date-creation_date))*60 + extract(second from (modify_date-creation_date))),4) MaxTime, trunc(Min(extract(day from (modify_date-creation_date))*24*60*60 + extract(hour from (modify_date-creation_date))*60*60 + extract(minute from (modify_date-creation_date))*60 + extract(second from (modify_date-creation_date))),4) MinTime, trunc(AVG(extract(day from (modify_date-creation_date))*24*60*60 + extract(hour from (modify_date-creation_date))*60*60 + extract(minute from (modify_date-creation_date))*60 + extract(second from (modify_date-creation_date))),4) AvgTime from cube_instance where creation_date >= to_timestamp('2012-10-24 21:00:00','YYYY-MM-DD HH24:MI:SS') and creation_date <= to_timestamp('2012-10-24 21:59:59','YYYY-MM-DD HH24:MI:SS') group by composite_name, component_name, componenttype, state order by count(*) desc; 4. Combine all together Now let's combine all of these 3 queries together, and parameterize the start and end time stamps to make the script a bit more robust. The following script will prompt for the start and end time before querying against the database: accept startTime prompt 'Enter start time (YYYY-MM-DD HH24:MI:SS)' accept endTime prompt 'Enter end time (YYYY-MM-DD HH24:MI:SS)' Prompt "==== Rejected Messages ===="; REM 2012-10-24 21:00:00 REM 2012-10-24 21:59:59 select count(*), composite_dn from rejected_message where created_time >= to_timestamp('&&StartTime','YYYY-MM-DD HH24:MI:SS') and created_time <= to_timestamp('&&EndTime','YYYY-MM-DD HH24:MI:SS') group by composite_dn; Prompt " "; Prompt "==== Throughput of one-way/asynchronous messages ===="; select state, count(*) Count, composite_name composite from dlv_message where receive_date >= to_timestamp('&StartTime','YYYY-MM-DD HH24:MI:SS') and receive_date <= to_timestamp('&EndTime','YYYY-MM-DD HH24:MI:SS') group by composite_name, state order by Count; Prompt " "; Prompt "==== Throughput and latency of BPEL process instances ====" select state, count(*) Count, trunc(Max(extract(day from (modify_date-creation_date))*24*60*60 + extract(hour from (modify_date-creation_date))*60*60 + extract(minute from (modify_date-creation_date))*60 + extract(second from (modify_date-creation_date))),4) MaxTime, trunc(Min(extract(day from (modify_date-creation_date))*24*60*60 + extract(hour from (modify_date-creation_date))*60*60 + extract(minute from (modify_date-creation_date))*60 + extract(second from (modify_date-creation_date))),4) MinTime, trunc(AVG(extract(day from (modify_date-creation_date))*24*60*60 + extract(hour from (modify_date-creation_date))*60*60 + extract(minute from (modify_date-creation_date))*60 + extract(second from (modify_date-creation_date))),4) AvgTime, composite_name Composite, component_name Process, componenttype from cube_instance where creation_date >= to_timestamp('&StartTime','YYYY-MM-DD HH24:MI:SS') and creation_date <= to_timestamp('&EndTime','YYYY-MM-DD HH24:MI:SS') group by composite_name, component_name, componenttype, state order by count(*) desc;

Read the article

Talend Enterprise Data Integration overperforms on Oracle SPARC T4

- by Amir Javanshir

The SPARC T microprocessor, released in 2005 by Sun Microsystems, and now continued at Oracle, has a good track record in parallel execution and multi-threaded performance. However it was less suited for pure single-threaded workloads. The new SPARC T4 processor is now filling that gap by offering a 5x better single-thread performance over previous generations. Following our long-term relationship with Talend, a fast growing ISV positioned by Gartner in the “Visionaries” quadrant of the “Magic Quadrant for Data Integration Tools”, we decided to test some of their integration components with the T4 chip, more precisely on a T4-1 system, in order to verify first hand if this new processor stands up to its promises. Several tests were performed, mainly focused on: Single-thread performance of the new SPARC T4 processor compared to an older SPARC T2+ processor Overall throughput of the SPARC T4-1 server using multiple threads The tests consisted in reading large amounts of data --ten's of gigabytes--, processing and writing them back to a file or an Oracle 11gR2 database table. They are CPU, memory and IO bound tests. Given the main focus of this project --CPU performance--, bottlenecks were removed as much as possible on the memory and IO sub-systems. When possible, the data to process was put into the ZFS filesystem cache, for instance. Also, two external storage devices were directly attached to the servers under test, each one divided in two ZFS pools for read and write operations. Multi-thread: Testing throughput on the Oracle T4-1 The tests were performed with different number of simultaneous threads (1, 2, 4, 8, 12, 16, 32, 48 and 64) and using different storage devices: Flash, Fibre Channel storage, two stripped internal disks and one single internal disk. All storage devices used ZFS as filesystem and volume management. Each thread read a dedicated 1GB-large file containing 12.5M lines with the following structure: customerID;FirstName;LastName;StreetAddress;City;State;Zip;Cust_Status;Since_DT;Status_DT 1;Ronald;Reagan;South Highway;Santa Fe;Montana;98756;A;04-06-2006;09-08-2008 2;Theodore;Roosevelt;Timberlane Drive;Columbus;Louisiana;75677;A;10-05-2009;27-05-2008 3;Andrew;Madison;S Rustle St;Santa Fe;Arkansas;75677;A;29-04-2005;09-02-2008 4;Dwight;Adams;South Roosevelt Drive;Baton Rouge;Vermont;75677;A;15-02-2004;26-01-2007 […] The following graphs present the results of our tests: Unsurprisingly up to 16 threads, all files fit in the ZFS cache a.k.a L2ARC : once the cache is hot there is no performance difference depending on the underlying storage. From 16 threads upwards however, it is clear that IO becomes a bottleneck, having a good IO subsystem is thus key. Single-disk performance collapses whereas the Sun F5100 and ST6180 arrays allow the T4-1 to scale quite seamlessly. From 32 to 64 threads, the performance is almost constant with just a slow decline. For the database load tests, only the best IO configuration --using external storage devices-- were used, hosting the Oracle table spaces and redo log files. Using the Sun Storage F5100 array allows the T4-1 server to scale up to 48 parallel JVM processes before saturating the CPU. The final result is a staggering 646K lines per second insertion in an Oracle table using 48 parallel threads. Single-thread: Testing the single thread performance Seven different tests were performed on both servers. Given the fact that only one thread, thus one file was read, no IO bottleneck was involved, all data being served from the ZFS cache. Read File ? Filter ? Write File: Read file, filter data, write the filtered data in a new file. The filter is set on the “Status” column: only lines with status set to “A” are selected. This limits each output file to about 500 MB. Read File ? Load Database Table: Read file, insert into a single Oracle table. Average: Read file, compute the average of a numeric column, write the result in a new file. Division & Square Root: Read file, perform a division and square root on a numeric column, write the result data in a new file. Oracle DB Dump: Dump the content of an Oracle table (12.5M rows) into a CSV file. Transform: Read file, transform, write the result data in a new file. The transformations applied are: set the address column to upper case and add an extra column at the end, which is the concatenation of two columns. Sort: Read file, sort a numeric and alpha numeric column, write the result data in a new file. The following table and graph present the final results of the tests: Throughput unit is thousand lines per second processed (K lines/second). Improvement is the % of improvement between the T5140 and T4-1. Test T4-1 (Time s.) T5140 (Time s.) Improvement T4-1 (Throughput) T5140 (Throughput) Read/Filter/Write 125 806 645% 100 16 Read/Load Database 195 1111 570% 64 11 Average 96 557 580% 130 22 Division & Square Root 161 1054 655% 78 12 Oracle DB Dump 164 945 576% 76 13 Transform 159 1124 707% 79 11 Sort 251 1336 532% 50 9 The improvement of single-thread performance is quite dramatic: depending on the tests, the T4 is between 5.4 to 7 times faster than the T2+. It seems clear that the SPARC T4 processor has gone a long way filling the gap in single-thread performance, without sacrifying the multi-threaded capability as it still shows a very impressive scaling on heavy-duty multi-threaded jobs. Finally, as always at Oracle ISV Engineering, we are happy to help our ISV partners test their own applications on our platforms, so don't hesitate to contact us and let's see what the SPARC T4-based systems can do for your application! "As describe in this benchmark, Talend Enterprise Data Integration has overperformed on T4. I was generally happy to see that the T4 gave scaling opportunities for many scenarios like complex aggregations. Row by row insertion in Oracle DB is faster with more than 650,000 rows per seconds without using any bulk Oracle capabilities !" Cedric Carbone, Talend CTO.

Read the article

Event on SQL Server 2008 Disk IO and the new Complex Event Processing (StreamInsight) feature in R2

- by tonyrogerson

Allan Mitchell and myself are doing a double act, Allan is becoming one of the leading guys in the UK on StreamInsight and will give an introduction to this new exciting technology; on top of that I'll being talking about SQL Server Disk IO - well, "Disk" might not be relevant anymore because I'll be talking about SSD and IOFusion - basically I'll be talking about the underpinnings - making sure you understand and get it right, how to monitor etc... If you've any specific problems or questions just ping me an email [email protected]. To register for the event see: http://sqlserverfaq.com/events/217/SQL-Server-and-Disk-IO-File-GroupsFiles-SSDs-FusionIO-InRAM-DBs-Fragmentation-Tony-Rogerson-Complex-Event-Processing-Allan-Mitchell.aspx 18:15 SQL Server and Disk IOTony Rogerson, SQL Server MVPTony's Blog; Tony on TwitterIn this session Tony will talk about RAID levels, how SQL server writes to and reads from disk, the effect SSD has and will talk about other options for throughput enhancement like Fusion IO. He will look at the effect fragmentation has and how to minimise the impact, he will look at the File structure of a database and talk about what benefits multiple files and file groups bring. We will also touch on Database Mirroring and the effect that has on throughput, how to get a feeling for the throughput you should expect.19:15 Break19:45 Complex Event Processing (CEP)Allan Mitchell, SQL Server MVPhttp://sqlis.com/sqlisStreamInsight is Microsoft’s first foray into the world of Complex Event Processing (CEP) and Event Stream Processing (ESP). In this session I want to show an introduction to this technology. I will show how and why it is useful. I will get us used to some new terminology but best of all I will show just how easy it is to start building your first CEP/ESP application.

Read the article

RAID10 without write-back cache = horrible write performance?

- by Harry Mexican

I have just provisioned a dedicated server on singlehop. I'm running it through some tests to know what to expect performance-wise. On the I/O side (with 4 1TB disks in RAID 10) I get: write-cache disabled 200 MB/s read throughput 30 MB/s write throughput I thought that was really low compared to my desktop HD which gets 150-150 or so. So I had a chat with them and they suggested enabling the write cache. New results: write-cache enabled 280 MB/s read 260 MB/s write which is great and all but means I'd have to add a BBU for an additional monthly cost. Is it normal for the write throughput to be 1/4 of a regular drive on RAID10, if you don't have write cache? It almost feels like its intentionally bad to force you to pony up for the BBU. I'd be happy with normal non-raid performance of 150/150.

Search Results

Search found 453 results on 19 pages for 'throughput'.

Page 4/19 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >

- by Howard Guo

- by 0x90

- by Ben Lee

- by Nosrama

- by DavidM

- by Zombies

- by charis

- by Nakedible

- by Jimm

- by Brian

- by Frederic P

- by ChaosPandion

- by bfallik-bamboom

- by Gokhan Gungor

- by Murali Veligeti

- by James

- by Ahmed Awan

- by Unknown

- by netvope

- by mohabitar

- by Untalented

- by fip

- by Amir Javanshir

- by tonyrogerson

- by Harry Mexican

< Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >