Search Results

Search found 1282 results on 52 pages for 'overhead'.

Page 16/52 | < Previous Page | 12 13 14 15 16 17 18 19 20 21 22 23 | Next Page >

Cisco 678 Will Not Work using PPPoE - Possibly Because I Configured it Incorrectly..?

- by Brian Stinar

I am attempting to configure a Cisco 678 because I am totally sick on my Actiontec. However, I am running into some problems. It seems as though the Cisco is able to train the line, but I am unable to ping out. I am all right at programming, but still learning a lot when it comes to being a system administrator. I apologize in advance if I did something ridiculous, or am attempting to configure this device to do something it was not designed to do. It is almost like I am not correctly configuring the device to grab it's IP using PPPoA (like my Actiontec.) The output from "show running" (below) makes me think this too. Below are the commands I ran in order to configure this: # en # set nvram erase # write # reboot # en # set nat enable # set dhcp server enable # set PPP wan0-0 ipcp 0.0.0.0 # set ppp wan0-0 dns 0.0.0.0 # set PPP wan0-0 login xxxxx // My actual login # set PPP wan0-0 password yyyyy // My actual password # set PPP restart enabled # set int wan0-0 close # set int wan0-0 vpi 0 # set int wan0-0 vci 32 # set int wan0-0 open # write # reboot Here is the output from a few commands I thought could provide some useful information: cbos#ping 74.125.224.113 Sending 1 8 byte ping(s) to 74.125.224.113 every 2 second(s) Request timed out cbos#show version Cisco Broadband Operating System CBOS (tm) 678 Software (C678-I-M), Version v2.4.9 - Release Software Copyright (c) 1986-2001 by cisco Systems, Inc. Compiled Nov 17 2004 15:26:29 DMT FULL firmware version G96 NVRAM image at 0x1030f000 cbos#show errors - Current Error Messages - ## Ticks Module Level Message 0 000:00:00:00 PPP Info IPCP Open Event on wan0-0 1 000:00:00:14 ATM Info Wan0 Up 2 000:00:00:14 PPP Info PPP Up Event on wan0-0 3 000:00:01:54 PPP Info PPP Down Event on wan0-0 Total Number of Error Messages: 4 cbos#show interface wan0 wan0 ADSL Physical Port Line Trained Actual Configuration: Overhead Framing: 3 Trellis Coding: Enabled Standard Compliance: T1.413 Downstream Data Rate: 1184 Kbps Upstream Data Rate: 928 Kbps Interleave S Downstream: 4 Interleave D Downstream: 16 Interleave R Downstream: 16 Interleave S Upstream: 4 Interleave D Upstream: 8 Interleave R Upstream: 16 Modem Microcode: G96 DSP version: 0 Operating State: Showtime/Data Mode Configured: Echo Cancellation: Disabled Overhead Framing: 3 Coding Gain: Auto TX Power Attenuation: 0dB Trellis Coding: Enabled Bit Swapping: Disabled Standard Compliance: T1.413 Remote Standard Compliance: T1.413 Tx Start Bin: 0x6 Tx End Bin: 0x1f Data Interface: Utopia L1 Status: Local SNR Margin: 19.0dB Local Coding Gain: 7.5dB Local Transmit Power: 12.5dB Local Attenuation: 46.0dB Remote Attenuation: 31.0dB Local Counters: Interleaved RS Corrected Bytes: 0 Interleaved Symbols with CRC Errors: 2 No Cell Delineation Interleaved: 0 Out of Cell Delineation Interleaved: 0 Header Error Check Counter Interleaved: 0 Count of Severely Errored Frames: 0 Count of Loss of Signal Frames: 0 Remote Counters: Interleaved RS Corrected Bytes: 0 Interleaved Symbols with CRC Errors: 1 No Cell Delineation Interleaved: 0 Header Error Check Counter Interleaved: 0 Count of Severely Errored Frames: 0 Count of Loss of Signal Frames: 0 cbos#show int wan0-0 WAN0-0 ATM Logical Port PVC (VPI 0, VCI 32) is configured. ScalaRate set to Auto AAL 5 UBR Traffic IP Port Enabled cbos#show running Warning: traffic may pause while NVRAM is being accessed [[ CBOS = Section Start ]] NSOS MD5 Enable Password = XXXX NSOS MD5 Root Password = XXXX NSOS MD5 Commander Password = XXXX [[ PPP Device Driver = Section Start ]] PPP Port User Name = 00, "XXXX" PPP Port User Password = 00, XXXX PPP Port Option = 00, IPCP,IP Address,3,Auto,Negotiation Not Required,Negotiable ,IP,0.0.0.0 PPP Port Option = 00, IPCP,Primary DNS Server,129,Auto,Negotiation Not Required, Negotiable,IP,0.0.0.0 PPP Port Option = 00, IPCP,Secondary DNS Server,131,Auto,Negotiation Not Require d,Negotiable,IP,0.0.0.0 [[ ATM WAN Device Driver = Section Start ]] ATM WAN Virtual Connection Parms = 00, 0, 32, 0 [[ DHCP = Section Start ]] DHCP Server = enabled [[ IP Routing = Section Start ]] IP NAT = enabled [[ WEB = Section Start ]] WEB = enabled cbos# wtf...? Thank you all very much for taking the time to read this, and the help.

Read the article
Introducing Oracle VM Server for SPARC

- by Honglin Su

As you are watching Oracle's Virtualization Strategy Webcast and exploring the great virtualization offerings of Oracle VM product line, I'd like to introduce Oracle VM Server for SPARC -- highly efficient, enterprise-class virtualization solution for Sun SPARC Enterprise Systems with Chip Multithreading (CMT) technology. Oracle VM Server for SPARC, previously called Sun Logical Domains, leverages the built-in SPARC hypervisor to subdivide supported platforms' resources (CPUs, memory, network, and storage) by creating partitions called logical (or virtual) domains. Each logical domain can run an independent operating system. Oracle VM Server for SPARC provides the flexibility to deploy multiple Oracle Solaris operating systems simultaneously on a single platform. Oracle VM Server also allows you to create up to 128 virtual servers on one system to take advantage of the massive thread scale offered by the CMT architecture. Oracle VM Server for SPARC integrates both the industry-leading CMT capability of the UltraSPARC T1, T2 and T2 Plus processors and the Oracle Solaris operating system. This combination helps to increase flexibility, isolate workload processing, and improve the potential for maximum server utilization. Oracle VM Server for SPARC delivers the following: Leading Price/Performance - The low-overhead architecture provides scalable performance under increasing workloads without additional license cost. This enables you to meet the most aggressive price/performance requirement Advanced RAS - Each logical domain is an entirely independent virtual machine with its own OS. It supports virtual disk mutipathing and failover as well as faster network failover with link-based IP multipathing (IPMP) support. Moreover, it's fully integrated with Solaris FMA (Fault Management Architecture), which enables predictive self healing. CPU Dynamic Resource Management (DRM) - Enable your resource management policy and domain workload to trigger the automatic addition and removal of CPUs. This ability helps you to better align with your IT and business priorities. Enhanced Domain Migrations - Perform domain migrations interactively and non-interactively to bring more flexibility to the management of your virtualized environment. Improve active domain migration performance by compressing memory transfers and taking advantage of cryptographic acceleration hardware. These methods provide faster migration for load balancing, power saving, and planned maintenance. Dynamic Crypto Control - Dynamically add and remove cryptographic units (aka MAU) to and from active domains. Also, migrate active domains that have cryptographic units. Physical-to-virtual (P2V) Conversion - Quickly convert an existing SPARC server running the Oracle Solaris 8, 9 or 10 OS into a virtualized Oracle Solaris 10 image. Use this image to facilitate OS migration into the virtualized environment. Virtual I/O Dynamic Reconfiguration (DR) - Add and remove virtual I/O services and devices without needing to reboot the system. CPU Power Management - Implement power saving by disabling each core on a Sun UltraSPARC T2 or T2 Plus processor that has all of its CPU threads idle. Advanced Network Configuration - Configure the following network features to obtain more flexible network configurations, higher performance, and scalability: Jumbo frames, VLANs, virtual switches for link aggregations, and network interface unit (NIU) hybrid I/O. Official Certification Based On Real-World Testing - Use Oracle VM Server for SPARC with the most sophisticated enterprise workloads under real-world conditions, including Oracle Real Application Clusters (RAC). Affordable, Full-Stack Enterprise Class Support - Obtain worldwide support from Oracle for the entire virtualization environment and workloads together. The support covers hardware, firmware, OS, virtualization, and the software stack. SPARC Server Virtualization Oracle offers a full portfolio of virtualization solutions to address your needs. SPARC is the leading platform to have the hard partitioning capability that provides the physical isolation needed to run independent operating systems. Many customers have already used Oracle Solaris Containers for application isolation. Oracle VM Server for SPARC provides another important feature with OS isolation. This gives you the flexibility to deploy multiple operating systems simultaneously on a single Sun SPARC T-Series server with finer granularity for computing resources. For SPARC CMT processors, the natural level of granularity is an execution thread, not a time-sliced microsecond of execution resources. Each CPU thread can be treated as an independent virtual processor. The scheduler is naturally built into the CPU for lower overhead and higher performance. Your organizations can couple Oracle Solaris Containers and Oracle VM Server for SPARC with the breakthrough space and energy savings afforded by Sun SPARC Enterprise systems with CMT technology to deliver a more agile, responsive, and low-cost environment. Management with Oracle Enterprise Manager Ops Center The Oracle Enterprise Manager Ops Center Virtualization Management Pack provides full lifecycle management of virtual guests, including Oracle VM Server for SPARC and Oracle Solaris Containers. It helps you streamline operations and reduce downtime. Together, the Virtualization Management Pack and the Ops Center Provisioning and Patch Automation Pack provide an end-to-end management solution for physical and virtual systems through a single web-based console. This solution automates the lifecycle management of physical and virtual systems and is the most effective systems management solution for Oracle's Sun infrastructure. Ease of Deployment with Configuration Assistant The Oracle VM Server for SPARC Configuration Assistant can help you easily create logical domains. After gathering the configuration data, the Configuration Assistant determines the best way to create a deployment to suit your requirements. The Configuration Assistant is available as both a graphical user interface (GUI) and terminal-based tool. Oracle Solaris Cluster HA Support The Oracle Solaris Cluster HA for Oracle VM Server for SPARC data service provides a mechanism for orderly startup and shutdown, fault monitoring and automatic failover of the Oracle VM Server guest domain service. In addition, applications that run on a logical domain, as well as its resources and dependencies can be controlled and managed independently. These are managed as if they were running in a classical Solaris Cluster hardware node. Supported Systems Oracle VM Server for SPARC is supported on all Sun SPARC Enterprise Systems with CMT technology. UltraSPARC T2 Plus Systems · Sun SPARC Enterprise T5140 Server · Sun SPARC Enterprise T5240 Server · Sun SPARC Enterprise T5440 Server · Sun Netra T5440 Server · Sun Blade T6340 Server Module · Sun Netra T6340 Server Module UltraSPARC T2 Systems · Sun SPARC Enterprise T5120 Server · Sun SPARC Enterprise T5220 Server · Sun Netra T5220 Server · Sun Blade T6320 Server Module · Sun Netra CP3260 ATCA Blade Server Note that UltraSPARC T1 systems are supported on earlier versions of the software.Sun SPARC Enterprise Systems with CMT technology come with the right to use (RTU) of Oracle VM Server, and the software is pre-installed. If you have the systems under warranty or with support, you can download the software and system firmware as well as their updates. Oracle Premier Support for Systems provides fully-integrated support for your server hardware, firmware, OS, and virtualization software. Visit oracle.com/support for information about Oracle's support offerings for Sun systems. For more information about Oracle's virtualization offerings, visit oracle.com/virtualization.

Read the article
WebSocket and Java EE 7 - Getting Ready for JSR 356 (TOTD #181)

- by arungupta

WebSocket is developed as part of HTML 5 specification and provides a bi-directional, full-duplex communication channel over a single TCP socket. It provides dramatic improvement over the traditional approaches of Polling, Long-Polling, and Streaming for two-way communication. There is no latency from establishing new TCP connections for each HTTP message. There is a WebSocket API and the WebSocket Protocol. The Protocol defines "handshake" and "framing". The handshake defines how a normal HTTP connection can be upgraded to a WebSocket connection. The framing defines wire format of the message. The design philosophy is to keep the framing minimum to avoid the overhead. Both text and binary data can be sent using the API. WebSocket may look like a competing technology to Server-Sent Events (SSE), but they are not. Here are the key differences: WebSocket can send and receive data from a client. A typical example of WebSocket is a two-player game or a chat application. Server-Sent Events can only push data data to the client. A typical example of SSE is stock ticker or news feed. With SSE, XMLHttpRequest can be used to send data to the server. For server-only updates, WebSockets has an extra overhead and programming can be unecessarily complex. SSE provides a simple and easy-to-use model that is much better suited. SSEs are sent over traditional HTTP and so no modification is required on the server-side. WebSocket require servers that understand the protocol. SSE have several features that are missing from WebSocket such as automatic reconnection, event IDs, and the ability to send arbitrary events. The client automatically tries to reconnect if the connection is closed. The default wait before trying to reconnect is 3 seconds and can be configured by including "retry: XXXX\n" header where XXXX is the milliseconds to wait before trying to reconnect. Event stream can include a unique event identifier. This allows the server to determine which events need to be fired to each client in case the connection is dropped in between. The data can span multiple lines and can be of any text format as long as EventSource message handler can process it. WebSockets provide true real-time updates, SSE can be configured to provide close to real-time by setting appropriate timeouts. OK, so all excited about WebSocket ? Want to convert your POJOs into WebSockets endpoint ? websocket-sdk and GlassFish 4.0 is here to help! The complete source code shown in this project can be downloaded here. On the server-side, the WebSocket SDK converts a POJO into a WebSocket endpoint using simple annotations. Here is how a WebSocket endpoint will look like: @WebSocket(path="/echo")public class EchoBean { @WebSocketMessage public String echo(String message) { return message + " (from your server)"; }} In this code "@WebSocket" is a class-level annotation that declares a POJO to accept WebSocket messages. The path at which the messages are accepted is specified in this annotation. "@WebSocketMessage" indicates the Java method that is invoked when the endpoint receives a message. This method implementation echoes the received message concatenated with an additional string. The client-side HTML page looks like <div style="text-align: center;"> <form action=""> <input onclick="send_echo()" value="Press me" type="button"> <input id="textID" name="message" value="Hello WebSocket!" type="text"><br> </form></div><div id="output"></div> WebSocket allows a full-duplex communication. So the client, a browser in this case, can send a message to a server, a WebSocket endpoint in this case. And the server can send a message to the client at the same time. This is unlike HTTP which follows a "request" followed by a "response". In this code, the "send_echo" method in the JavaScript is invoked on the button click. There is also a <div> placeholder to display the response from the WebSocket endpoint. The JavaScript looks like: <script language="javascript" type="text/javascript"> var wsUri = "ws://localhost:8080/websockets/echo"; var websocket = new WebSocket(wsUri); websocket.onopen = function(evt) { onOpen(evt) }; websocket.onmessage = function(evt) { onMessage(evt) }; websocket.onerror = function(evt) { onError(evt) }; function init() { output = document.getElementById("output"); } function send_echo() { websocket.send(textID.value); writeToScreen("SENT: " + textID.value); } function onOpen(evt) { writeToScreen("CONNECTED"); } function onMessage(evt) { writeToScreen("RECEIVED: " + evt.data); } function onError(evt) { writeToScreen('<span style="color: red;">ERROR:</span> ' + evt.data); } function writeToScreen(message) { var pre = document.createElement("p"); pre.style.wordWrap = "break-word"; pre.innerHTML = message; output.appendChild(pre); } window.addEventListener("load", init, false);</script> In this code The URI to connect to on the server side is of the format ws://<HOST>:<PORT>/websockets/<PATH> "ws" is a new URI scheme introduced by the WebSocket protocol. <PATH> is the path on the endpoint where the WebSocket messages are accepted. In our case, it is ws://localhost:8080/websockets/echo WEBSOCKET_SDK-1 will ensure that context root is included in the URI as well. WebSocket is created as a global object so that the connection is created only once. This object establishes a connection with the given host, port and the path at which the endpoint is listening. The WebSocket API defines several callbacks that can be registered on specific events. The "onopen", "onmessage", and "onerror" callbacks are registered in this case. The callbacks print a message on the browser indicating which one is called and additionally also prints the data sent/received. On the button click, the WebSocket object is used to transmit text data to the endpoint. Binary data can be sent as one blob or using buffering. The HTTP request headers sent for the WebSocket call are: GET ws://localhost:8080/websockets/echo HTTP/1.1Origin: http://localhost:8080Connection: UpgradeSec-WebSocket-Extensions: x-webkit-deflate-frameHost: localhost:8080Sec-WebSocket-Key: mDbnYkAUi0b5Rnal9/cMvQ==Upgrade: websocketSec-WebSocket-Version: 13 And the response headers received are Connection:UpgradeSec-WebSocket-Accept:q4nmgFl/lEtU2ocyKZ64dtQvx10=Upgrade:websocket(Challenge Response):00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00 The headers are shown in Chrome as shown below: The complete source code shown in this project can be downloaded here. The builds from websocket-sdk are integrated in GlassFish 4.0 builds. Would you like to live on the bleeding edge ? Then follow the instructions below to check out the workspace and install the latest SDK: Check out the source code svn checkout https://svn.java.net/svn/websocket-sdk~source-code-repository Build and install the trunk in your local repository as: mvn install Copy "./bundles/websocket-osgi/target/websocket-osgi-0.3-SNAPSHOT.jar" to "glassfish3/glassfish/modules/websocket-osgi.jar" in your GlassFish 4 latest promoted build. Notice, you need to overwrite the JAR file. Anybody interested in building a cool application using WebSocket and get it running on GlassFish ? :-) This work will also feed into JSR 356 - Java API for WebSocket. On a lighter side, there seems to be less agreement on the name. Here are some of the options that are prevalent: WebSocket (W3C API, the URL is www.w3.org/TR/websockets though) Web Socket (HTML5 Demos - html5demos.com/web-socket) Websocket (Jenkins Plugin - wiki.jenkins-ci.org/display/JENKINS/Websocket%2BPlugin) WebSockets (Used by Mozilla - developer.mozilla.org/en/WebSockets, but use WebSocket as well) Web sockets (HTML5 Working Group - www.whatwg.org/specs/web-apps/current-work/multipage/network.html) Web Sockets (Chrome Blog - blog.chromium.org/2009/12/web-sockets-now-available-in-google.html) I prefer "WebSocket" as that seems to be most common usage and used by the W3C API as well. What do you use ?

Read the article
Solaris X86 AESNI OpenSSL Engine

- by danx

Solaris X86 AESNI OpenSSL Engine Cryptography is a major component of secure e-commerce. Since cryptography is compute intensive and adds a significant load to applications, such as SSL web servers (https), crypto performance is an important factor. Providing accelerated crypto hardware greatly helps these applications and will help lead to a wider adoption of cryptography, and lower cost, in e-commerce and other applications. The Intel Westmere microprocessor has six new instructions to acclerate AES encryption. They are called "AESNI" for "AES New Instructions". These are unprivileged instructions, so no "root", other elevated access, or context switch is required to execute these instructions. These instructions are used in a new built-in OpenSSL 1.0 engine available in Solaris 11, the aesni engine. Previous Work Previously, AESNI instructions were introduced into the Solaris x86 kernel and libraries. That is, the "aes" kernel module (used by IPsec and other kernel modules) and the Solaris pkcs11 library (for user applications). These are available in Solaris 10 10/09 (update 8) and above, and Solaris 11. The work here is to add the aesni engine to OpenSSL. X86 AESNI Instructions Intel's Xeon 5600 is one of the processors that support AESNI. This processor is used in the Sun Fire X4170 M2 As mentioned above, six new instructions acclerate AES encryption in processor silicon. The new instructions are: aesenc performs one round of AES encryption. One encryption round is composed of these steps: substitute bytes, shift rows, mix columns, and xor the round key. aesenclast performs the final encryption round, which is the same as above, except omitting the mix columns (which is only needed for the next encryption round). aesdec performs one round of AES decryption aesdeclast performs the final AES decryption round aeskeygenassist Helps expand the user-provided key into a "key schedule" of keys, one per round aesimc performs an "inverse mixed columns" operation to convert the encryption key schedule into a decryption key schedule pclmulqdq Not a AESNI instruction, but performs "carryless multiply" operations to acclerate AES GCM mode. Since the AESNI instructions are implemented in hardware, they take a constant number of cycles and are not vulnerable to side-channel timing attacks that attempt to discern some bits of data from the time taken to encrypt or decrypt the data. Solaris x86 and OpenSSL Software Optimizations Having X86 AESNI hardware crypto instructions is all well and good, but how do we access it? The software is available with Solaris 11 and is used automatically if you are running Solaris x86 on a AESNI-capable processor. AESNI is used internally in the kernel through kernel crypto modules and is available in user space through the PKCS#11 library. For OpenSSL on Solaris 11, AESNI crypto is available directly with a new built-in OpenSSL 1.0 engine, called the "aesni engine." This is in lieu of the extra overhead of going through the Solaris OpenSSL pkcs11 engine, which accesses Solaris crypto and digest operations. Instead, AESNI assembly is included directly in the new aesni engine. Instead of including the aesni engine in a separate library in /lib/openssl/engines/, the aesni engine is "built-in", meaning it is included directly in OpenSSL's libcrypto.so.1.0.0 library. This reduces overhead and the need to manually specify the aesni engine. Since the engine is built-in (that is, in libcrypto.so.1.0.0), the openssl -engine command line flag or API call is not needed to access the engine—the aesni engine is used automatically on AESNI hardware. Ciphers and Digests supported by OpenSSL aesni engine The Openssl aesni engine auto-detects if it's running on AESNI hardware and uses AESNI encryption instructions for these ciphers: AES-128-CBC, AES-192-CBC, AES-256-CBC, AES-128-CFB128, AES-192-CFB128, AES-256-CFB128, AES-128-CTR, AES-192-CTR, AES-256-CTR, AES-128-ECB, AES-192-ECB, AES-256-ECB, AES-128-OFB, AES-192-OFB, and AES-256-OFB. Implementation of the OpenSSL aesni engine The AESNI assembly language routines are not a part of the regular Openssl 1.0.0 release. AESNI is a part of the "HEAD" ("development" or "unstable") branch of OpenSSL, for future release. But AESNI is also available as a separate patch provided by Intel to the OpenSSL project for OpenSSL 1.0.0. A minimal amount of "glue" code in the aesni engine works between the OpenSSL libcrypto.so.1.0.0 library and the assembly functions. The aesni engine code is separate from the base OpenSSL code and requires patching only a few source files to use it. That means OpenSSL can be more easily updated to future versions without losing the performance from the built-in aesni engine. OpenSSL aesni engine Performance Here's some graphs of aesni engine performance I measured by running openssl speed -evp $algorithm where $algorithm is aes-128-cbc, aes-192-cbc, and aes-256-cbc. These are using the 64-bit version of openssl on the same AESNI hardware, a Sun Fire X4170 M2 with a Intel Xeon E5620 @2.40GHz, running Solaris 11 FCS. "Before" is openssl without the aesni engine and "after" is openssl with the aesni engine. The numbers are MBytes/second. OpenSSL aesni engine performance on Sun Fire X4170 M2 (Xeon E5620 @2.40GHz) (Higher is better; "before"=OpenSSL on AESNI without AESNI engine software, "after"=OpenSSL AESNI engine) As you can see the speedup is dramatic for all 3 key lengths and for data sizes from 16 bytes to 8 Kbytes—AESNI is about 7.5-8x faster over hand-coded amd64 assembly (without aesni instructions). Verifying the OpenSSL aesni engine is present The easiest way to determine if you are running the aesni engine is to type "openssl engine" on the command line. No configuration, API, or command line options are needed to use the OpenSSL aesni engine. If you are running on Intel AESNI hardware with Solaris 11 FCS, you'll see this output indicating you are using the aesni engine: intel-westmere $ openssl engine (aesni) Intel AES-NI engine (no-aesni) (dynamic) Dynamic engine loading support (pkcs11) PKCS #11 engine support If you are running on Intel without AESNI hardware you'll see this output indicating the hardware can't support the aesni engine: intel-nehalem $ openssl engine (aesni) Intel AES-NI engine (no-aesni) (dynamic) Dynamic engine loading support (pkcs11) PKCS #11 engine support For Solaris on SPARC or older Solaris OpenSSL software, you won't see any aesni engine line at all. Third-party OpenSSL software (built yourself or from outside Oracle) will not have the aesni engine either. Solaris 11 FCS comes with OpenSSL version 1.0.0e. The output of typing "openssl version" should be "OpenSSL 1.0.0e 6 Sep 2011". 64- and 32-bit OpenSSL OpenSSL comes in both 32- and 64-bit binaries. 64-bit executable is now the default, at /usr/bin/openssl, and OpenSSL 64-bit libraries at /lib/amd64/libcrypto.so.1.0.0 and libssl.so.1.0.0 The 32-bit executable is at /usr/bin/i86/openssl and the libraries are at /lib/libcrytpo.so.1.0.0 and libssl.so.1.0.0. Availability The OpenSSL AESNI engine is available in Solaris 11 x86 for both the 64- and 32-bit versions of OpenSSL. It is not available with Solaris 10. You must have a processor that supports AESNI instructions, otherwise OpenSSL will fallback to the older, slower AES implementation without AESNI. Processors that support AESNI include most Westmere and Sandy Bridge class processor architectures. Some low-end processors (such as for mobile/laptop platforms) do not support AESNI. The easiest way to determine if the processor supports AESNI is with the isainfo -v command—look for "amd64" and "aes" in the output: $ isainfo -v 64-bit amd64 applications pclmulqdq aes sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov amd_sysc cx8 tsc fpu Conclusion The Solaris 11 OpenSSL aesni engine provides easy access to powerful Intel AESNI hardware cryptography, in addition to Solaris userland PKCS#11 libraries and Solaris crypto kernel modules.

Read the article
Windows 8.1 Will Start Encrypting Hard Drives By Default: Everything You Need to Know

- by Chris Hoffman

Windows 8.1 will automatically encrypt the storage on modern Windows PCs. This will help protect your files in case someone steals your laptop and tries to get at them, but it has important ramifications for data recovery. Previously, “BitLocker” was available on Professional and Enterprise editions of Windows, while “Device Encryption” was available on Windows RT and Windows Phone. Device encryption is included with all editions of Windows 8.1 — and it’s on by default. When Your Hard Drive Will Be Encrypted Windows 8.1 includes “Pervasive Device Encryption.” This works a bit differently from the standard BitLocker feature that has been included in Professional, Enterprise, and Ultimate editions of Windows for the past few versions. Before Windows 8.1 automatically enables Device Encryption, the following must be true: The Windows device “must support connected standby and meet the Windows Hardware Certification Kit (HCK) requirements for TPM and SecureBoot on ConnectedStandby systems.” (Source) Older Windows PCs won’t support this feature, while new Windows 8.1 devices you pick up will have this feature enabled by default. When Windows 8.1 installs cleanly and the computer is prepared, device encryption is “initialized” on the system drive and other internal drives. Windows uses a clear key at this point, which is removed later when the recovery key is successfully backed up. The PC’s user must log in with a Microsoft account with administrator privileges or join the PC to a domain. If a Microsoft account is used, a recovery key will be backed up to Microsoft’s servers and encryption will be enabled. If a domain account is used, a recovery key will be backed up to Active Directory Domain Services and encryption will be enabled. If you have an older Windows computer that you’ve upgraded to Windows 8.1, it may not support Device Encryption. If you log in with a local user account, Device Encryption won’t be enabled. If you upgrade your Windows 8 device to Windows 8.1, you’ll need to enable device encryption, as it’s off by default when upgrading. Recovering An Encrypted Hard Drive Device encryption means that a thief can’t just pick up your laptop, insert a Linux live CD or Windows installer disc, and boot the alternate operating system to view your files without knowing your Windows password. It means that no one can just pull the hard drive from your device, connect the hard drive to another computer, and view the files. We’ve previously explained that your Windows password doesn’t actually secure your files. With Windows 8.1, average Windows users will finally be protected with encryption by default. However, there’s a problem — if you forget your password and are unable to log in, you’d also be unable to recover your files. This is likely why encryption is only enabled when a user logs in with a Microsoft account (or connects to a domain). Microsoft holds a recovery key, so you can gain access to your files by going through a recovery process. As long as you’re able to authenticate using your Microsoft account credentials — for example, by receiving an SMS message on the cell phone number connected to your Microsoft account — you’ll be able to recover your encrypted data. With Windows 8.1, it’s more important than ever to configure your Microsoft account’s security settings and recovery methods so you’ll be able to recover your files if you ever get locked out of your Microsoft account. Microsoft does hold the recovery key and would be capable of providing it to law enforcement if it was requested, which is certainly a legitimate concern in the age of PRISM. However, this encryption still provides protection from thieves picking up your hard drive and digging through your personal or business files. If you’re worried about a government or a determined thief who’s capable of gaining access to your Microsoft account, you’ll want to encrypt your hard drive with software that doesn’t upload a copy of your recovery key to the Internet, such as TrueCrypt. How to Disable Device Encryption There should be no real reason to disable device encryption. If nothing else, it’s a useful feature that will hopefully protect sensitive data in the real world where people — and even businesses — don’t enable encryption on their own. As encryption is only enabled on devices with the appropriate hardware and will be enabled by default, Microsoft has hopefully ensured that users won’t see noticeable slow-downs in performance. Encryption adds some overhead, but the overhead can hopefully be handled by dedicated hardware. If you’d like to enable a different encryption solution or just disable encryption entirely, you can control this yourself. To do so, open the PC settings app — swipe in from the right edge of the screen or press Windows Key + C, click the Settings icon, and select Change PC settings. Navigate to PC and devices -> PC info. At the bottom of the PC info pane, you’ll see a Device Encryption section. Select Turn Off if you want to disable device encryption, or select Turn On if you want to enable it — users upgrading from Windows 8 will have to enable it manually in this way. Note that Device Encryption can’t be disabled on Windows RT devices, such as Microsoft’s Surface RT and Surface 2. If you don’t see the Device Encryption section in this window, you’re likely using an older device that doesn’t meet the requirements and thus doesn’t support Device Encryption. For example, our Windows 8.1 virtual machine doesn’t offer Device Encryption configuration options. This is the new normal for Windows PCs, tablets, and devices in general. Where files on typical PCs were once ripe for easy access by thieves, Windows PCs are now encrypted by default and recovery keys are sent to Microsoft’s servers for safe keeping. This last part may be a bit creepy, but it’s easy to imagine average users forgetting their passwords — they’d be very upset if they lost all their files because they had to reset their passwords. It’s also an improvement over Windows PCs being completely unprotected by default.

Read the article
Solaris X86 AESNI OpenSSL Engine

- by danx

Solaris X86 AESNI OpenSSL Engine Cryptography is a major component of secure e-commerce. Since cryptography is compute intensive and adds a significant load to applications, such as SSL web servers (https), crypto performance is an important factor. Providing accelerated crypto hardware greatly helps these applications and will help lead to a wider adoption of cryptography, and lower cost, in e-commerce and other applications. The Intel Westmere microprocessor has six new instructions to acclerate AES encryption. They are called "AESNI" for "AES New Instructions". These are unprivileged instructions, so no "root", other elevated access, or context switch is required to execute these instructions. These instructions are used in a new built-in OpenSSL 1.0 engine available in Solaris 11, the aesni engine. Previous Work Previously, AESNI instructions were introduced into the Solaris x86 kernel and libraries. That is, the "aes" kernel module (used by IPsec and other kernel modules) and the Solaris pkcs11 library (for user applications). These are available in Solaris 10 10/09 (update 8) and above, and Solaris 11. The work here is to add the aesni engine to OpenSSL. X86 AESNI Instructions Intel's Xeon 5600 is one of the processors that support AESNI. This processor is used in the Sun Fire X4170 M2 As mentioned above, six new instructions acclerate AES encryption in processor silicon. The new instructions are: aesenc performs one round of AES encryption. One encryption round is composed of these steps: substitute bytes, shift rows, mix columns, and xor the round key. aesenclast performs the final encryption round, which is the same as above, except omitting the mix columns (which is only needed for the next encryption round). aesdec performs one round of AES decryption aesdeclast performs the final AES decryption round aeskeygenassist Helps expand the user-provided key into a "key schedule" of keys, one per round aesimc performs an "inverse mixed columns" operation to convert the encryption key schedule into a decryption key schedule pclmulqdq Not a AESNI instruction, but performs "carryless multiply" operations to acclerate AES GCM mode. Since the AESNI instructions are implemented in hardware, they take a constant number of cycles and are not vulnerable to side-channel timing attacks that attempt to discern some bits of data from the time taken to encrypt or decrypt the data. Solaris x86 and OpenSSL Software Optimizations Having X86 AESNI hardware crypto instructions is all well and good, but how do we access it? The software is available with Solaris 11 and is used automatically if you are running Solaris x86 on a AESNI-capable processor. AESNI is used internally in the kernel through kernel crypto modules and is available in user space through the PKCS#11 library. For OpenSSL on Solaris 11, AESNI crypto is available directly with a new built-in OpenSSL 1.0 engine, called the "aesni engine." This is in lieu of the extra overhead of going through the Solaris OpenSSL pkcs11 engine, which accesses Solaris crypto and digest operations. Instead, AESNI assembly is included directly in the new aesni engine. Instead of including the aesni engine in a separate library in /lib/openssl/engines/, the aesni engine is "built-in", meaning it is included directly in OpenSSL's libcrypto.so.1.0.0 library. This reduces overhead and the need to manually specify the aesni engine. Since the engine is built-in (that is, in libcrypto.so.1.0.0), the openssl -engine command line flag or API call is not needed to access the engine—the aesni engine is used automatically on AESNI hardware. Ciphers and Digests supported by OpenSSL aesni engine The Openssl aesni engine auto-detects if it's running on AESNI hardware and uses AESNI encryption instructions for these ciphers: AES-128-CBC, AES-192-CBC, AES-256-CBC, AES-128-CFB128, AES-192-CFB128, AES-256-CFB128, AES-128-CTR, AES-192-CTR, AES-256-CTR, AES-128-ECB, AES-192-ECB, AES-256-ECB, AES-128-OFB, AES-192-OFB, and AES-256-OFB. Implementation of the OpenSSL aesni engine The AESNI assembly language routines are not a part of the regular Openssl 1.0.0 release. AESNI is a part of the "HEAD" ("development" or "unstable") branch of OpenSSL, for future release. But AESNI is also available as a separate patch provided by Intel to the OpenSSL project for OpenSSL 1.0.0. A minimal amount of "glue" code in the aesni engine works between the OpenSSL libcrypto.so.1.0.0 library and the assembly functions. The aesni engine code is separate from the base OpenSSL code and requires patching only a few source files to use it. That means OpenSSL can be more easily updated to future versions without losing the performance from the built-in aesni engine. OpenSSL aesni engine Performance Here's some graphs of aesni engine performance I measured by running openssl speed -evp $algorithm where $algorithm is aes-128-cbc, aes-192-cbc, and aes-256-cbc. These are using the 64-bit version of openssl on the same AESNI hardware, a Sun Fire X4170 M2 with a Intel Xeon E5620 @2.40GHz, running Solaris 11 FCS. "Before" is openssl without the aesni engine and "after" is openssl with the aesni engine. The numbers are MBytes/second. OpenSSL aesni engine performance on Sun Fire X4170 M2 (Xeon E5620 @2.40GHz) (Higher is better; "before"=OpenSSL on AESNI without AESNI engine software, "after"=OpenSSL AESNI engine) As you can see the speedup is dramatic for all 3 key lengths and for data sizes from 16 bytes to 8 Kbytes—AESNI is about 7.5-8x faster over hand-coded amd64 assembly (without aesni instructions). Verifying the OpenSSL aesni engine is present The easiest way to determine if you are running the aesni engine is to type "openssl engine" on the command line. No configuration, API, or command line options are needed to use the OpenSSL aesni engine. If you are running on Intel AESNI hardware with Solaris 11 FCS, you'll see this output indicating you are using the aesni engine: intel-westmere $ openssl engine (aesni) Intel AES-NI engine (no-aesni) (dynamic) Dynamic engine loading support (pkcs11) PKCS #11 engine support If you are running on Intel without AESNI hardware you'll see this output indicating the hardware can't support the aesni engine: intel-nehalem $ openssl engine (aesni) Intel AES-NI engine (no-aesni) (dynamic) Dynamic engine loading support (pkcs11) PKCS #11 engine support For Solaris on SPARC or older Solaris OpenSSL software, you won't see any aesni engine line at all. Third-party OpenSSL software (built yourself or from outside Oracle) will not have the aesni engine either. Solaris 11 FCS comes with OpenSSL version 1.0.0e. The output of typing "openssl version" should be "OpenSSL 1.0.0e 6 Sep 2011". 64- and 32-bit OpenSSL OpenSSL comes in both 32- and 64-bit binaries. 64-bit executable is now the default, at /usr/bin/openssl, and OpenSSL 64-bit libraries at /lib/amd64/libcrypto.so.1.0.0 and libssl.so.1.0.0 The 32-bit executable is at /usr/bin/i86/openssl and the libraries are at /lib/libcrytpo.so.1.0.0 and libssl.so.1.0.0. Availability The OpenSSL AESNI engine is available in Solaris 11 x86 for both the 64- and 32-bit versions of OpenSSL. It is not available with Solaris 10. You must have a processor that supports AESNI instructions, otherwise OpenSSL will fallback to the older, slower AES implementation without AESNI. Processors that support AESNI include most Westmere and Sandy Bridge class processor architectures. Some low-end processors (such as for mobile/laptop platforms) do not support AESNI. The easiest way to determine if the processor supports AESNI is with the isainfo -v command—look for "amd64" and "aes" in the output: $ isainfo -v 64-bit amd64 applications pclmulqdq aes sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov amd_sysc cx8 tsc fpu Conclusion The Solaris 11 OpenSSL aesni engine provides easy access to powerful Intel AESNI hardware cryptography, in addition to Solaris userland PKCS#11 libraries and Solaris crypto kernel modules.

Read the article
Replication Services as ETL extraction tool

- by jorg

In my last blog post I explained the principles of Replication Services and the possibilities it offers in a BI environment. One of the possibilities I described was the use of snapshot replication as an ETL extraction tool: “Snapshot Replication can also be useful in BI environments, if you don’t need a near real-time copy of the database, you can choose to use this form of replication. Next to an alternative for Transactional Replication it can be used to stage data so it can be transformed and moved into the data warehousing environment afterwards. In many solutions I have seen developers create multiple SSIS packages that simply copies data from one or more source systems to a staging database that figures as source for the ETL process. The creation of these packages takes a lot of (boring) time, while Replication Services can do the same in minutes. It is possible to filter out columns and/or records and it can even apply schema changes automatically so I think it offers enough features here. I don’t know how the performance will be and if it really works as good for this purpose as I expect, but I want to try this out soon!” Well I have tried it out and I must say it worked well. I was able to let replication services do work in a fraction of the time it would cost me to do the same in SSIS. What I did was the following: Configure snapshot replication for some Adventure Works tables, this was quite simple and straightforward. Create an SSIS package that executes the snapshot replication on demand and waits for its completion. This is something that you can’t do with out of the box functionality. While configuring the snapshot replication two SQL Agent Jobs are created, one for the creation of the snapshot and one for the distribution of the snapshot. Unfortunately these jobs are asynchronous which means that if you execute them they immediately report back if the job started successfully or not, they do not wait for completion and report its result afterwards. So I had to create an SSIS package that executes the jobs and waits for their completion before the rest of the ETL process continues. Fortunately I was able to create the SSIS package with the desired functionality. I have made a step-by-step guide that will help you configure the snapshot replication and I have uploaded the SSIS package you need to execute it. Configure snapshot replication The first step is to create a publication on the database you want to replicate. Connect to SQL Server Management Studio and right-click Replication, choose for New.. Publication… The New Publication Wizard appears, click Next Choose your “source” database and click Next Choose Snapshot publication and click Next You can now select tables and other objects that you want to publish Expand Tables and select the tables that are needed in your ETL process In the next screen you can add filters on the selected tables which can be very useful. Think about selecting only the last x days of data for example. Its possible to filter out rows and/or columns. In this example I did not apply any filters. Schedule the Snapshot Agent to run at a desired time, by doing this a SQL Agent Job is created which we need to execute from a SSIS package later on. Next you need to set the Security Settings for the Snapshot Agent. Click on the Security Settings button. In this example I ran the Agent under the SQL Server Agent service account. This is not recommended as a security best practice. Fortunately there is an excellent article on TechNet which tells you exactly how to set up the security for replication services. Read it here and make sure you follow the guidelines! On the next screen choose to create the publication at the end of the wizard Give the publication a name (SnapshotTest) and complete the wizard The publication is created and the articles (tables in this case) are added Now the publication is created successfully its time to create a new subscription for this publication. Expand the Replication folder in SSMS and right click Local Subscriptions, choose New Subscriptions The New Subscription Wizard appears Select the publisher on which you just created your publication and select the database and publication (SnapshotTest) You can now choose where the Distribution Agent should run. If it runs at the distributor (push subscriptions) it causes extra processing overhead. If you use a separate server for your ETL process and databases choose to run each agent at its subscriber (pull subscriptions) to reduce the processing overhead at the distributor. Of course we need a database for the subscription and fortunately the Wizard can create it for you. Choose for New database Give the database the desired name, set the desired options and click OK You can now add multiple SQL Server Subscribers which is not necessary in this case but can be very useful. You now need to set the security settings for the Distribution Agent. Click on the …. button Again, in this example I ran the Agent under the SQL Server Agent service account. Read the security best practices here Click Next Make sure you create a synchronization job schedule again. This job is also necessary in the SSIS package later on. Initialize the subscription at first synchronization Select the first box to create the subscription when finishing this wizard Complete the wizard by clicking Finish The subscription will be created In SSMS you see a new database is created, the subscriber. There are no tables or other objects in the database available yet because the replication jobs did not ran yet. Now expand the SQL Server Agent, go to Jobs and search for the job that creates the snapshot: Rename this job to “CreateSnapshot” Now search for the job that distributes the snapshot: Rename this job to “DistributeSnapshot” Create an SSIS package that executes the snapshot replication We now need an SSIS package that will take care of the execution of both jobs. The CreateSnapshot job needs to execute and finish before the DistributeSnapshot job runs. After the DistributeSnapshot job has started the package needs to wait until its finished before the package execution finishes. The Execute SQL Server Agent Job Task is designed to execute SQL Agent Jobs from SSIS. Unfortunately this SSIS task only executes the job and reports back if the job started succesfully or not, it does not report if the job actually completed with success or failure. This is because these jobs are asynchronous. The SSIS package I’ve created does the following: It runs the CreateSnapshot job It checks every 5 seconds if the job is completed with a for loop When the CreateSnapshot job is completed it starts the DistributeSnapshot job And again it waits until the snapshot is delivered before the package will finish successfully Quite simple and the package is ready to use as standalone extract mechanism. After executing the package the replicated tables are added to the subscriber database and are filled with data: Download the SSIS package here (SSIS 2008) Conclusion In this example I only replicated 5 tables, I could create a SSIS package that does the same in approximately the same amount of time. But if I replicated all the 70+ AdventureWorks tables I would save a lot of time and boring work! With replication services you also benefit from the feature that schema changes are applied automatically which means your entire extract phase wont break. Because a snapshot is created using the bcp utility (bulk copy) it’s also quite fast, so the performance will be quite good. Disadvantages of using snapshot replication as extraction tool is the limitation on source systems. You can only choose SQL Server or Oracle databases to act as a publisher. So if you plan to build an extract phase for your ETL process that will invoke a lot of tables think about replication services, it would save you a lot of time and thanks to the Extract SSIS package I’ve created you can perfectly fit it in your usual SSIS ETL process.

Read the article
Of C# Iterators and Performance

- by James Michael Hare

Some of you reading this will be wondering, "what is an iterator" and think I'm locked in the world of C++. Nope, I'm talking C# iterators. No, not enumerators, iterators. So, for those of you who do not know what iterators are in C#, I will explain it in summary, and for those of you who know what iterators are but are curious of the performance impacts, I will explore that as well. Iterators have been around for a bit now, and there are still a bunch of people who don't know what they are or what they do. I don't know how many times at work I've had a code review on my code and have someone ask me, "what's that yield word do?" Basically, this post came to me as I was writing some extension methods to extend IEnumerable<T> -- I'll post some of the fun ones in a later post. Since I was filtering the resulting list down, I was using the standard C# iterator concept; but that got me wondering: what are the performance implications of using an iterator versus returning a new enumeration? So, to begin, let's look at a couple of methods. This is a new (albeit contrived) method called Every(...). The goal of this method is to access and enumeration and return every nth item in the enumeration (including the first). So Every(2) would return items 0, 2, 4, 6, etc. Now, if you wanted to write this in the traditional way, you may come up with something like this: public static IEnumerable<T> Every<T>(this IEnumerable<T> list, int interval) { List<T> newList = new List<T>(); int count = 0; foreach (var i in list) { if ((count++ % interval) == 0) { newList.Add(i); } } return newList; } So basically this method takes any IEnumerable<T> and returns a new IEnumerable<T> that contains every nth item. Pretty straight forward. The problem? Well, Every<T>(...) will construct a list containing every nth item whether or not you care. What happens if you were searching this result for a certain item and find that item after five tries? You would have generated the rest of the list for nothing. Enter iterators. This C# construct uses the yield keyword to effectively defer evaluation of the next item until it is asked for. This can be very handy if the evaluation itself is expensive or if there's a fair chance you'll never want to fully evaluate a list. We see this all the time in Linq, where many expressions are chained together to do complex processing on a list. This would be very expensive if each of these expressions evaluated their entire possible result set on call. Let's look at the same example function, this time using an iterator: public static IEnumerable<T> Every<T>(this IEnumerable<T> list, int interval) { int count = 0; foreach (var i in list) { if ((count++ % interval) == 0) { yield return i; } } } Notice it does not create a new return value explicitly, the only evidence of a return is the "yield return" statement. What this means is that when an item is requested from the enumeration, it will enter this method and evaluate until it either hits a yield return (in which case that item is returned) or until it exits the method or hits a yield break (in which case the iteration ends. Behind the scenes, this is all done with a class that the CLR creates behind the scenes that keeps track of the state of the iteration, so that every time the next item is asked for, it finds that item and then updates the current position so it knows where to start at next time. It doesn't seem like a big deal, does it? But keep in mind the key point here: it only returns items as they are requested. Thus if there's a good chance you will only process a portion of the return list and/or if the evaluation of each item is expensive, an iterator may be of benefit. This is especially true if you intend your methods to be chainable similar to the way Linq methods can be chained. For example, perhaps you have a List<int> and you want to take every tenth one until you find one greater than 10. We could write that as: List<int> someList = new List<int>(); // fill list here someList.Every(10).TakeWhile(i => i <= 10); Now is the difference more apparent? If we use the first form of Every that makes a copy of the list. It's going to copy the entire list whether we will need those items or not, that can be costly! With the iterator version, however, it will only take items from the list until it finds one that is > 10, at which point no further items in the list are evaluated. So, sounds neat eh? But what's the cost is what you're probably wondering. So I ran some tests using the two forms of Every above on lists varying from 5 to 500,000 integers and tried various things. Now, iteration isn't free. If you are more likely than not to iterate the entire collection every time, iterator has some very slight overhead: Copy vs Iterator on 100% of Collection (10,000 iterations) Collection Size Num Iterated Type Total ms 5 5 Copy 5 5 5 Iterator 5 50 50 Copy 28 50 50 Iterator 27 500 500 Copy 227 500 500 Iterator 247 5000 5000 Copy 2266 5000 5000 Iterator 2444 50,000 50,000 Copy 24,443 50,000 50,000 Iterator 24,719 500,000 500,000 Copy 250,024 500,000 500,000 Iterator 251,521 Notice that when iterating over the entire produced list, the times for the iterator are a little better for smaller lists, then getting just a slight bit worse for larger lists. In reality, given the number of items and iterations, the result is near negligible, but just to show that iterators come at a price. However, it should also be noted that the form of Every that returns a copy will have a left-over collection to garbage collect. However, if we only partially evaluate less and less through the list, the savings start to show and make it well worth the overhead. Let's look at what happens if you stop looking after 80% of the list: Copy vs Iterator on 80% of Collection (10,000 iterations) Collection Size Num Iterated Type Total ms 5 4 Copy 5 5 4 Iterator 5 50 40 Copy 27 50 40 Iterator 23 500 400 Copy 215 500 400 Iterator 200 5000 4000 Copy 2099 5000 4000 Iterator 1962 50,000 40,000 Copy 22,385 50,000 40,000 Iterator 19,599 500,000 400,000 Copy 236,427 500,000 400,000 Iterator 196,010 Notice that the iterator form is now operating quite a bit faster. But the savings really add up if you stop on average at 50% (which most searches would typically do): Copy vs Iterator on 50% of Collection (10,000 iterations) Collection Size Num Iterated Type Total ms 5 2 Copy 5 5 2 Iterator 4 50 25 Copy 25 50 25 Iterator 16 500 250 Copy 188 500 250 Iterator 126 5000 2500 Copy 1854 5000 2500 Iterator 1226 50,000 25,000 Copy 19,839 50,000 25,000 Iterator 12,233 500,000 250,000 Copy 208,667 500,000 250,000 Iterator 122,336 Now we see that if we only expect to go on average 50% into the results, we tend to shave off around 40% of the time. And this is only for one level deep. If we are using this in a chain of query expressions it only adds to the savings. So my recommendation? If you have a resonable expectation that someone may only want to partially consume your enumerable result, I would always tend to favor an iterator. The cost if they iterate the whole thing does not add much at all -- and if they consume only partially, you reap some really good performance gains. Next time I'll discuss some of my favorite extensions I've created to make development life a little easier and maintainability a little better.

Read the article
SortedDictionary and SortedList

- by Simon Cooper

Apart from Dictionary<TKey, TValue>, there's two other dictionaries in the BCL - SortedDictionary<TKey, TValue> and SortedList<TKey, TValue>. On the face of it, these two classes do the same thing - provide an IDictionary<TKey, TValue> interface where the iterator returns the items sorted by the key. So what's the difference between them, and when should you use one rather than the other? (as in my previous post, I'll assume you have some basic algorithm & datastructure knowledge) SortedDictionary We'll first cover SortedDictionary. This is implemented as a special sort of binary tree called a red-black tree. Essentially, it's a binary tree that uses various constraints on how the nodes of the tree can be arranged to ensure the tree is always roughly balanced (for more gory algorithmical details, see the wikipedia link above). What I'm concerned about in this post is how the .NET SortedDictionary is actually implemented. In .NET 4, behind the scenes, the actual implementation of the tree is delegated to a SortedSet<KeyValuePair<TKey, TValue>>. One example tree might look like this: Each node in the above tree is stored as a separate SortedSet<T>.Node object (remember, in a SortedDictionary, T is instantiated to KeyValuePair<TKey, TValue>): class Node { public bool IsRed; public T Item; public SortedSet<T>.Node Left; public SortedSet<T>.Node Right; } The SortedSet only stores a reference to the root node; all the data in the tree is accessed by traversing the Left and Right node references until you reach the node you're looking for. Each individual node can be physically stored anywhere in memory; what's important is the relationship between the nodes. This is also why there is no constructor to SortedDictionary or SortedSet that takes an integer representing the capacity; there are no internal arrays that need to be created and resized. This may seen trivial, but it's an important distinction between SortedDictionary and SortedList that I'll cover later on. And that's pretty much it; it's a standard red-black tree. Plenty of webpages and datastructure books cover the algorithms behind the tree itself far better than I could. What's interesting is the comparions between SortedDictionary and SortedList, which I'll cover at the end. As a side point, SortedDictionary has existed in the BCL ever since .NET 2. That means that, all through .NET 2, 3, and 3.5, there has been a bona-fide sorted set class in the BCL (called TreeSet). However, it was internal, so it couldn't be used outside System.dll. Only in .NET 4 was this class exposed as SortedSet. SortedList Whereas SortedDictionary didn't use any backing arrays, SortedList does. It is implemented just as the name suggests; two arrays, one containing the keys, and one the values (I've just used random letters for the values): The items in the keys array are always guarenteed to be stored in sorted order, and the value corresponding to each key is stored in the same index as the key in the values array. In this example, the value for key item 5 is 'z', and for key item 8 is 'm'. Whenever an item is inserted or removed from the SortedList, a binary search is run on the keys array to find the correct index, then all the items in the arrays are shifted to accomodate the new or removed item. For example, if the key 3 was removed, a binary search would be run to find the array index the item was at, then everything above that index would be moved down by one: and then if the key/value pair {7, 'f'} was added, a binary search would be run on the keys to find the index to insert the new item, and everything above that index would be moved up to accomodate the new item: If another item was then added, both arrays would be resized (to a length of 10) before the new item was added to the arrays. As you can see, any insertions or removals in the middle of the list require a proportion of the array contents to be moved; an O(n) operation. However, if the insertion or removal is at the end of the array (ie the largest key), then it's only O(log n); the cost of the binary search to determine it does actually need to be added to the end (excluding the occasional O(n) cost of resizing the arrays to fit more items). As a side effect of using backing arrays, SortedList offers IList Keys and Values views that simply use the backing keys or values arrays, as well as various methods utilising the array index of stored items, which SortedDictionary does not (and cannot) offer. The Comparison So, when should you use one and not the other? Well, here's the important differences: Memory usage SortedDictionary and SortedList have got very different memory profiles. SortedDictionary... has a memory overhead of one object instance, a bool, and two references per item. On 64-bit systems, this adds up to ~40 bytes, not including the stored item and the reference to it from the Node object. stores the items in separate objects that can be spread all over the heap. This helps to keep memory fragmentation low, as the individual node objects can be allocated wherever there's a spare 60 bytes. In contrast, SortedList... has no additional overhead per item (only the reference to it in the array entries), however the backing arrays can be significantly larger than you need; every time the arrays are resized they double in size. That means that if you add 513 items to a SortedList, the backing arrays will each have a length of 1024. To conteract this, the TrimExcess method resizes the arrays back down to the actual size needed, or you can simply assign list.Capacity = list.Count. stores its items in a continuous block in memory. If the list stores thousands of items, this can cause significant problems with Large Object Heap memory fragmentation as the array resizes, which SortedDictionary doesn't have. Performance Operations on a SortedDictionary always have O(log n) performance, regardless of where in the collection you're adding or removing items. In contrast, SortedList has O(n) performance when you're altering the middle of the collection. If you're adding or removing from the end (ie the largest item), then performance is O(log n), same as SortedDictionary (in practice, it will likely be slightly faster, due to the array items all being in the same area in memory, also called locality of reference). So, when should you use one and not the other? As always with these sort of things, there are no hard-and-fast rules. But generally, if you: need to access items using their index within the collection are populating the dictionary all at once from sorted data aren't adding or removing keys once it's populated then use a SortedList. But if you: don't know how many items are going to be in the dictionary are populating the dictionary from random, unsorted data are adding & removing items randomly then use a SortedDictionary. The default (again, there's no definite rules on these sort of things!) should be to use SortedDictionary, unless there's a good reason to use SortedList, due to the bad performance of SortedList when altering the middle of the collection.

Read the article
Premature-Optimization and Performance Anxiety

- by James Michael Hare

While writing my post analyzing the new .NET 4 ConcurrentDictionary class (here), I fell into one of the classic blunders that I myself always love to warn about. After analyzing the differences of time between a Dictionary with locking versus the new ConcurrentDictionary class, I noted that the ConcurrentDictionary was faster with read-heavy multi-threaded operations. Then, I made the classic blunder of thinking that because the original Dictionary with locking was faster for those write-heavy uses, it was the best choice for those types of tasks. In short, I fell into the premature-optimization anti-pattern. Basically, the premature-optimization anti-pattern is when a developer is coding very early for a perceived (whether rightly-or-wrongly) performance gain and sacrificing good design and maintainability in the process. At best, the performance gains are usually negligible and at worst, can either negatively impact performance, or can degrade maintainability so much that time to market suffers or the code becomes very fragile due to the complexity. Keep in mind the distinction above. I'm not talking about valid performance decisions. There are decisions one should make when designing and writing an application that are valid performance decisions. Examples of this are knowing the best data structures for a given situation (Dictionary versus List, for example) and choosing performance algorithms (linear search vs. binary search). But these in my mind are macro optimizations. The error is not in deciding to use a better data structure or algorithm, the anti-pattern as stated above is when you attempt to over-optimize early on in such a way that it sacrifices maintainability. In my case, I was actually considering trading the safety and maintainability gains of the ConcurrentDictionary (no locking required) for a slight performance gain by using the Dictionary with locking. This would have been a mistake as I would be trading maintainability (ConcurrentDictionary requires no locking which helps readability) and safety (ConcurrentDictionary is safe for iteration even while being modified and you don't risk the developer locking incorrectly) -- and I fell for it even when I knew to watch out for it. I think in my case, and it may be true for others as well, a large part of it was due to the time I was trained as a developer. I began college in in the 90s when C and C++ was king and hardware speed and memory were still relatively priceless commodities and not to be squandered. In those days, using a long instead of a short could waste precious resources, and as such, we were taught to try to minimize space and favor performance. This is why in many cases such early code-bases were very hard to maintain. I don't know how many times I heard back then to avoid too many function calls because of the overhead -- and in fact just last year I heard a new hire in the company where I work declare that she didn't want to refactor a long method because of function call overhead. Now back then, that may have been a valid concern, but with today's modern hardware even if you're calling a trivial method in an extremely tight loop (which chances are the JIT compiler would optimize anyway) the results of removing method calls to speed up performance are negligible for the great majority of applications. Now, obviously, there are those coding applications where speed is absolutely king (for example drivers, computer games, operating systems) where such sacrifices may be made. But I would strongly advice against such optimization because of it's cost. Many folks that are performing an optimization think it's always a win-win. That they're simply adding speed to the application, what could possibly be wrong with that? What they don't realize is the cost of their choice. For every piece of straight-forward code that you obfuscate with performance enhancements, you risk the introduction of bugs in the long term technical debt of the application. It will become so fragile over time that maintenance will become a nightmare. I've seen such applications in places I have worked. There are times I've seen applications where the designer was so obsessed with performance that they even designed their own memory management system for their application to try to squeeze out every ounce of performance. Unfortunately, the application stability often suffers as a result and it is very difficult for anyone other than the original designer to maintain. I've even seen this recently where I heard a C++ developer bemoaning that in VS2010 the iterators are about twice as slow as they used to be because Microsoft added range checking (probably as part of the 0x standard implementation). To me this was almost a joke. Twice as slow sounds bad, but it almost never as bad as you think -- especially if you're gaining safety. The only time twice is really that much slower is when once was too slow to begin with. Think about it. 2 minutes is slow as a response time because 1 minute is slow. But if an iterator takes 1 microsecond to move one position and a new, safer iterator takes 2 microseconds, this is trivial! The only way you'd ever really notice this would be in iterating a collection just for the sake of iterating (i.e. no other operations). To my mind, the added safety makes the extra time worth it. Always favor safety and maintainability when you can. I know it can be a hard habit to break, especially if you started out your career early or in a language such as C where they are very performance conscious. But in reality, these type of micro-optimizations only end up hurting you in the long run. Remember the two laws of optimization. I'm not sure where I first heard these, but they are so true: For beginners: Do not optimize. For experts: Do not optimize yet. This is so true. If you're a beginner, resist the urge to optimize at all costs. And if you are an expert, delay that decision. As long as you have chosen the right data structures and algorithms for your task, your performance will probably be more than sufficient. Chances are it will be network, database, or disk hits that will be your slow-down, not your code. As they say, 98% of your code's bottleneck is in 2% of your code so premature-optimization may add maintenance and safety debt that won't have any measurable impact. Instead, code for maintainability and safety, and then, and only then, when you find a true bottleneck, then you should go back and optimize further.

Read the article
The SPARC SuperCluster

- by Karoly Vegh

Oracle has been providing a lead in the Engineered Systems business for quite a while now, in accordance with the motto "Hardware and Software Engineered to Work Together." Indeed it is hard to find a better definition of these systems. Allow me to summarize the idea. It is: Build a compute platform optimized to run your technologies Develop application aware, intelligently caching storage components Take an impressively fast network technology interconnecting it with the compute nodes Tune the application to scale with the nodes to yet unseen performance Reduce the amount of data moving via compression Provide this all in a pre-integrated single product with a single-pane management interface All these ideas have been around in IT for quite some time now. The real Oracle advantage is adding the last one to put these all together. Oracle has built quite a portfolio of Engineered Systems, to run its technologies - and run those like they never ran before. In this post I'll focus on one of them that serves as a consolidation demigod, a multi-purpose engineered system. As you probably have guessed, I am talking about the SPARC SuperCluster. It has many great features inherited from its predecessors, and it adds several new ones. Allow me to pick out and elaborate about some of the most interesting ones from a technological point of view. I. It is the SPARC SuperCluster T4-4. That is, as compute nodes, it includes SPARC T4-4 servers that we learned to appreciate and respect for their features: The SPARC T4 CPUs: Each CPU has 8 cores, each core runs 8 threads. The SPARC T4-4 servers have 4 sockets. That is, a single compute node can in parallel, simultaneously execute 256 threads. Now, a full-rack SPARC SuperCluster has 4 of these servers on board. Remember the keyword demigod. While retaining the forerunner SPARC T3's exceptional throughput, the SPARC T4 CPUs raise the bar with single performance too - a humble 5x better one than their ancestors. actually, the SPARC T4 CPU cores run in both single-threaded and multi-threaded mode, and switch between these two on-the-fly, fulfilling not only single-threaded OR multi-threaded applications' needs, but even mixed requirements (like in database workloads!). Data security, anyone? Every SPARC T4 CPU core has a built-in encryption engine, that is, encryption algorithms cast into silicon. A PCI controller right on the chip for customers who need I/O performance. Built-in, no-cost Virtualization: Oracle VM for SPARC (the former LDoms or Logical Domains) is not a server-emulation virtualization technology but rather a serverpartitioning one, the hypervisor runs in the server firmware, and all the VMs' HW resources (I/O, CPU, memory) are accessed natively, without performance overhead. This enables customers to run a number of Solaris 10 and Solaris 11 VMs separated, independent of each other within a physical server II. For Database performance, it includes Exadata Storage Cells - one of the main reasons why the Exadata Database Machine performs at diabolic speed. What makes them important? They provide DB backend storage for your Oracle Databases to run on the SPARC SuperCluster, that is what they are built and tuned for DB performance. These storage cells are SQL-aware. That is, if a SPARC T4 database compute node executes a query, it doesn't simply request tons of raw datablocks from the storage, filters the received data, and throws away most of it where the statement doesn't apply, but provides the SQL query to the storage node too. The storage cell software speaks SQL, that is, it is able to prefilter and through that transfer only the relevant data. With this, the traffic between database nodes and storage cells is reduced immensely. Less I/O is a good thing - as they say, all the CPUs of the world do one thing just as fast as any other - and that is waiting for I/O. They don't only pre-filter, but also provide data preprocessing features - e.g. if a DB-node requests an aggregate of data, they can calculate it, and handover only the results, not the whole set. Again, less data to transfer. They support the magical HCC, (Hybrid Columnar Compression). That is, data can be stored in a precompressed form on the storage. Less data to transfer. Of course one can't simply rely on disks for performance, there is Flash Storage included there for caching. III. The low latency, high-speed backbone network: InfiniBand, that interconnects all the members with: Real High Speed: 40 Gbit/s. Full Duplex, of course. Oh, and a really low latency. RDMA. Remote Direct Memory Access. This technology allows the DB nodes to do exactly that. Remotely, directly placing SQL commands into the Memory of the storage cells. Dodging all the network-stack bottlenecks, avoiding overhead, placing requests directly into the process queue. You can also run IP over InfiniBand if you please - that's the way the compute nodes can communicate with each other. IV. Including a general-purpose storage too: the ZFSSA, which is a unified storage, providing NAS and SAN access too, with the following features: NFS over RDMA over InfiniBand. Nothing is faster network-filesystem-wise. All the ZFS features onboard, hybrid storage pools, compression, deduplication, snapshot, replication, NFS and CIFS shares Storageheads in a HA-Cluster configuration providing availability of the data DTrace Live Analytics in a web-based Administration UI Being a general purpose application data storage for your non-database applications running on the SPARC SuperCluster over whichever protocol they prefer, easily replicating, snapshotting, cloning data for them. There's a lot of great technology included in Oracle's SPARC SuperCluster, we have talked its interior through. As for external scalability: you can start with a half- of full- rack SPARC SuperCluster, and scale out to several racks - that is, stacking not separate full-rack SPARC SuperClusters, but extending always one large instance of the size of several full-racks. Yes, over InfiniBand network. Add racks as you grow. What technologies shall run on it? SPARC SuperCluster is a general purpose scaleout consolidation/cloud environment. You can run Oracle Databases with RAC scaling, or Oracle Weblogic (end enjoy the SPARC T4's advantages to run Java). Remember, Oracle technologies have been integrated with the Oracle Engineered Systems - this is the Oracle on Oracle advantage. But you can run other software environments such as SAP if you please too. Run any application that runs on Oracle Solaris 10 or Solaris 11. Separate them in Virtual Machines, or even Oracle Solaris Zones, monitor and manage those from a central UI. Here the key takeaways once again: The SPARC SuperCluster: Is a pre-integrated Engineered System Contains SPARC T4-4 servers with built-in virtualization, cryptography, dynamic threading Contains the Exadata storage cells that intelligently offload the burden of the DB-nodes Contains a highly available ZFS Storage Appliance, that provides SAN/NAS storage in a unified way Combines all these elements over a high-speed, low-latency backbone network implemented with InfiniBand Can grow from a single half-rack to several full-rack size Supports the consolidation of hundreds of applications To summarize: All these technologies are great by themselves, but the real value is like in every other Oracle Engineered System: Integration. All these technologies are tuned to perform together. Together they are way more than the sum of all - and a careful and actually very time consuming integration process is necessary to orchestrate all these for performance. The SPARC SuperCluster's goal is to enable infrastructure operations and offer a pre-integrated solution that can be architected and delivered in hours instead of months of evaluations and tests. The tedious and most importantly time and resource consuming part of the work - testing and evaluating - has been done. Now go, provide services. -- charlie

Read the article
Agile Like Jazz

- by Jeff Certain

(I’ve been sitting on this for a week or so now, thinking that it needed to be tightened up a bit to make it less rambling. Since that’s clearly not going to happen, reader beware!) I had the privilege of spending around 90 minutes last night sitting and listening to Sonny Rollins play a concert at the Disney Center in LA. If you don’t know who Sonny Rollins is, I don’t know how to explain the experience; if you know who he is, I don’t need to. Suffice it to say that he has been recording professionally for over 50 years, and helped create an entire genre of music. A true master by any definition. One of the most intriguing aspects of a concert like this, however, is watching the master step aside and let the rest of the musicians play. Not just play their parts, but really play… letting them take over the spotlight, to strut their stuff, to soak up enthusiastic applause from the crowd. Maybe a lot of it has to do with the fact that Sonny Rollins has been doing this for more than a half-century. Maybe it has something to do with a kind of patience you learn when you’re on the far side of 80 – and the man can still blow a mean sax for 90 minutes without stopping! Maybe it has to do with the fact that he was out there for the love of the music and the love of the show, not because he had anything to prove to anyone and, I like to think, not for the money. Perhaps it had more to do with the fact that, when you’re at that level of mastery, the other musicians are going to be good. Really good. Whatever the reasons, there was a incredible freedom on that stage – the ability to improvise, for each musician to showcase their own specialization and skills, and them come back to the common theme, back to being on the same page, as it were. All this took place in the same venue that is home to the L.A. Phil. Somehow, I can’t ever see the same kind of free-wheeling improvisation happening in that context. And, since I’m a geek, I started thinking about agility. Rollins has put together a quintet that reflects his own particular style and past. No upright bass or piano for Rollins – drums, bongos, electric guitar and bass guitar along with his sax. It’s not about the mix of instruments. Other trios, quartets, and sextets use different mixes of instruments. New Orleans jazz tends towards trombones instead of sax; some prefer cornet or trumpet. But no matter what the choice of instruments, size matters. Team sizes are something I’ve been thinking about for a while. We’re on a quest to rethink how our teams are organized. They just feel too big, too unwieldy. In fact, they really don’t feel like teams at all. Most of the time, they feel more like collections or people who happen to report to the same manager. I attribute this to a couple factors. One is over-specialization; we have a tendency to have people work in silos. Although the teams are product-focused, within them our developers are both generalists and specialists. On the one hand, we expect them to be able to build an entire vertical slice of the application; on the other hand, each developer tends to be responsible for the vertical slice. As a result, developers often work on their own piece of the puzzle, in isolation. This sort of feels like working on a jigsaw in a group – each person taking a set of colors and piecing them together to reveal a portion of the overall picture. But what inevitably happens when you go to meld all those pieces together? Inevitably, you have some sections that are too big to move easily. These sections end up falling apart under their own weight as you try to move them. Not only that, but there are other challenges – figuring out where that section fits, and how to tie it into the rest of the puzzle. Often, this is when you find a few pieces need to be added – these pieces are “glue,” if you will. The other issue that arises is due to the overhead of maintaining communications in a team. My mother, who worked in IT for around 30 years, once told me that 20% per team member is a good rule of thumb for maintaining communication. While this is a rule of thumb, it seems to imply that any team over about 6 people is going to become less agile simple because of the communications burden. Teams of ten or twelve seem like they fall into the philharmonic organizational model. Complicated pieces of music requiring dozens of players to all be on the same page requires a much different model than the jazz quintet. There’s much less room for improvisation, originality or freedom. (There are probably orchestral musicians who will take exception to this characterization; I’m calling it like I see it from the cheap seats.) And, there’s one guy up front who is running the show, whose job is to keep all of those dozens of players on the same page, to facilitate communications. Somehow, the orchestral model doesn’t feel much like a self-organizing team, either. The first violin may be the best violinist in the orchestra, but they don’t get to perform free-wheeling solos. I’ve never heard of an orchestra getting together for a jam session. But I have heard of teams that organize their work based on the developers available, rather than organizing the developers based on the work required. I have heard of teams where desired functionality is deferred – or worse yet, schedules are missed – because one critical person doesn’t have any bandwidth available. I’ve heard of teams where people simply don’t have the big picture, because there is too much communication overhead for everyone to be aware of everything that is happening on a project. I once heard Paul Rayner say something to the effect of “you have a process that is perfectly designed to give you exactly the results you have.” Given a choice, I want a process that’s much more like jazz than orchestral music. I want a process that doesn’t burden me with lots of forms and checkboxes and stuff. Give me the simplest, most lightweight process that will work – and a smaller team of the best developers I can find. This seems like the kind of process that will get the kind of result I want to be part of.

Read the article
External File Upload Optimizations for Windows Azure

- by rgillen

[Cross posted from here: http://rob.gillenfamily.net/post/External-File-Upload-Optimizations-for-Windows-Azure.aspx] I’m wrapping up a bit of the work we’ve been doing on data movement optimizations for cloud computing and the latest set of data yielded some interesting points I thought I’d share. The work done here is not really rocket science but may, in some ways, be slightly counter-intuitive and therefore seemed worthy of posting. Summary: for those who don’t like to read detailed posts or don’t have time, the synopsis is that if you are uploading data to Azure, block your data (even down to 1MB) and upload in parallel. Set your block size based on your source file size, but if you must choose a fixed value, use 1MB. Following the above will result in significant performance gains… upwards of 10x-24x and a reduction in overall file transfer time of upwards of 90% (eg, uploading a 1GB file averaged 46.37 minutes prior to optimizations and averaged 1.86 minutes afterwards). Detail: For those of you who want more detail, or think that the claims at the end of the preceding paragraph are over-reaching, what follows is information and code supporting these claims. As the title would indicate, these tests were run from our research facility pointing to the Azure cloud (specifically US North Central as it is physically closest to us) and do not represent intra-cloud results… we have performed intra-cloud tests and the overall results are similar in notion but the data rates are significantly different as well as the tipping points for the various block sizes… this will be detailed separately). We started by building a very simple console application that would loop through a directory and upload each file to Azure storage. This application used the shipping storage client library from the 1.1 version of the azure tools. The only real variation from the client library is that we added code to collect and record the duration (in ms) and size (in bytes) for each file transferred. The code is available here. We then created a directory that had a collection of files for the following sizes: 2KB, 32KB, 64KB, 128KB, 512KB, 1MB, 5MB, 10MB, 25MB, 50MB, 100MB, 250MB, 500MB, 750MB, and 1GB (50 files for each size listed). These files contained randomly-generated binary data and do not benefit from compression (a separate discussion topic). Our file generation tool is available here. The baseline was established by running the application described above against the directory containing all of the data files. This application uploads the files in a random order so as to avoid transferring all of the files of a given size sequentially and thereby spreading the affects of periodic Internet delays across the collection of results. We then ran some scripts to split the resulting data and generate some reports. The raw data collected for our non-optimized tests is available via the links in the Related Resources section at the bottom of this post. For each file size, we calculated the average upload time (and standard deviation) and the average transfer rate (and standard deviation). As you likely are aware, transferring data across the Internet is susceptible to many transient delays which can cause anomalies in the resulting data. It is for this reason that we randomized the order of source file processing as well as executed the tests 50x for each file size. We expect that these steps will yield a sufficiently balanced set of results. Once the baseline was collected and analyzed, we updated the test harness application with some methods to split the source file into user-defined block sizes and then to upload those blocks in parallel (using the PutBlock() method of Azure storage). The parallelization was handled by simply relying on the Parallel Extensions to .NET to provide a Parallel.For loop (see linked source for specific implementation details in Program.cs, line 173 and following… less than 100 lines total). Once all of the blocks were uploaded, we called PutBlockList() to assemble/commit the file in Azure storage. For each block transferred, the MD5 was calculated and sent ensuring that the bits that arrived matched was was intended. The timer for the blocked/parallelized transfer method wraps the entire process (source file splitting, block transfer, MD5 validation, file committal). A diagram of the process is as follows: We then tested the affects of blocking & parallelizing the transfers by running the updated application against the same source set and did a parameter sweep on the block size including 256KB, 512KB, 1MB, 2MB, and 4MB (our assumption was that anything lower than 256KB wasn’t worth the trouble and 4MB is the maximum size of a block supported by Azure). The raw data for the parallel tests is available via the links in the Related Resources section at the bottom of this post. This data was processed and then compared against the single-threaded / non-optimized transfer numbers and the results were encouraging. The Excel version of the results is available here. Two semi-obvious points need to be made prior to reviewing the data. The first is that if the block size is larger than the source file size you will end up with a “negative optimization” due to the overhead of attempting to block and parallelize. The second is that as the files get smaller, the clock-time cost of blocking and parallelizing (overhead) is more apparent and can tend towards negative optimizations. For this reason (and is supported in the raw data provided in the linked worksheet) the charts and dialog below ignore source file sizes less than 1MB. (click chart for full size image) The chart above illustrates some interesting points about the results: When the block size is smaller than the source file, performance increases but as the block size approaches and then passes the source file size, you see decreasing benefit to the point of negative gains (see the values for the 1MB file size) For some of the moderately-sized source files, small blocks (256KB) are best As the size of the source file gets larger (see values for 50MB and up), the smallest block size is not the most efficient (presumably due, at least in part, to the increased number of blocks, increased number of individual transfer requests, and reassembly/committal costs). Once you pass the 250MB source file size, the difference in rate for 1MB to 4MB blocks is more-or-less constant The 1MB block size gives the best average improvement (~16x) but the optimal approach would be to vary the block size based on the size of the source file. (click chart for full size image) The above is another view of the same data as the prior chart just with the axis changed (x-axis represents file size and plotted data shows improvement by block size). It again highlights the fact that the 1MB block size is probably the best overall size but highlights the benefits of some of the other block sizes at different source file sizes. This last chart shows the change in total duration of the file uploads based on different block sizes for the source file sizes. Nothing really new here other than this view of the data highlights the negative affects of poorly choosing a block size for smaller files. Summary What we have found so far is that blocking your file uploads and uploading them in parallel results in significant performance improvements. Further, utilizing extension methods and the Task Parallel Library (.NET 4.0) make short work of altering the shipping client library to provide this functionality while minimizing the amount of change to existing applications that might be using the client library for other interactions. Related Resources Source code for upload test application Source code for random file generator ODatas feed of raw data from non-optimized transfer tests Experiment Metadata Experiment Datasets 2KB Uploads 32KB Uploads 64KB Uploads 128KB Uploads 256KB Uploads 512KB Uploads 1MB Uploads 5MB Uploads 10MB Uploads 25MB Uploads 50MB Uploads 100MB Uploads 250MB Uploads 500MB Uploads 750MB Uploads 1GB Uploads Raw Data OData feeds of raw data from blocked/parallelized transfer tests Experiment Metadata Experiment Datasets Raw Data 256KB Blocks 512KB Blocks 1MB Blocks 2MB Blocks 4MB Blocks Excel worksheet showing summarizations and comparisons

Read the article
Master-slave vs. peer-to-peer archictecture: benefits and problems

- by Ashok_Ora

Normal 0 false false false EN-US X-NONE X-NONE Almost two decades ago, I was a member of a database development team that introduced adaptive locking. Locking, the most popular concurrency control technique in database systems, is pessimistic. Locking ensures that two or more conflicting operations on the same data item don’t “trample” on each other’s toes, resulting in data corruption. In a nutshell, here’s the issue we were trying to address. In everyday life, traffic lights serve the same purpose. They ensure that traffic flows smoothly and when everyone follows the rules, there are no accidents at intersections. As I mentioned earlier, the problem with typical locking protocols is that they are pessimistic. Regardless of whether there is another conflicting operation in the system or not, you have to hold a lock! Acquiring and releasing locks can be quite expensive, depending on how many objects the transaction touches. Every transaction has to pay this penalty. To use the earlier traffic light analogy, if you have ever waited at a red light in the middle of nowhere with no one on the road, wondering why you need to wait when there’s clearly no danger of a collision, you know what I mean. The adaptive locking scheme that we invented was able to minimize the number of locks that a transaction held, by detecting whether there were one or more transactions that needed conflicting eyou could get by without holding any lock at all. In many “well-behaved” workloads, there are few conflicts, so this optimization is a huge win. If, on the other hand, there are many concurrent, conflicting requests, the algorithm gracefully degrades to the “normal” behavior with minimal cost. We were able to reduce the number of lock requests per TPC-B transaction from 178 requests down to 2! Wow! This is a dramatic improvement in concurrency as well as transaction latency. The lesson from this exercise was that if you can identify the common scenario and optimize for that case so that only the uncommon scenarios are more expensive, you can make dramatic improvements in performance without sacrificing correctness. So how does this relate to the architecture and design of some of the modern NoSQL systems? NoSQL systems can be broadly classified as master-slave sharded, or peer-to-peer sharded systems. NoSQL systems with a peer-to-peer architecture have an interesting way of handling changes. Whenever an item is changed, the client (or an intermediary) propagates the changes synchronously or asynchronously to multiple copies (for availability) of the data. Since the change can be propagated asynchronously, during some interval in time, it will be the case that some copies have received the update, and others haven’t. What happens if someone tries to read the item during this interval? The client in a peer-to-peer system will fetch the same item from multiple copies and compare them to each other. If they’re all the same, then every copy that was queried has the same (and up-to-date) value of the data item, so all’s good. If not, then the system provides a mechanism to reconcile the discrepancy and to update stale copies. So what’s the problem with this? There are two major issues: First, IT’S HORRIBLY PESSIMISTIC because, in the common case, it is unlikely that the same data item will be updated and read from different locations at around the same time! For every read operation, you have to read from multiple copies. That’s a pretty expensive, especially if the data are stored in multiple geographically separate locations and network latencies are high. Second, if the copies are not all the same, the application has to reconcile the differences and propagate the correct value to the out-dated copies. This means that the application program has to handle discrepancies in the different versions of the data item and resolve the issue (which can further add to cost and operation latency). Resolving discrepancies is only one part of the problem. What if the same data item was updated independently on two different nodes (copies)? In that case, due to the asynchronous nature of change propagation, you might land up with different versions of the data item in different copies. In this case, the application program also has to resolve conflicts and then propagate the correct value to the copies that are out-dated or have incorrect versions. This can get really complicated. My hunch is that there are many peer-to-peer-based applications that don’t handle this correctly, and worse, don’t even know it. Imagine have 100s of millions of records in your database – how can you tell whether a particular data item is incorrect or out of date? And what price are you willing to pay for ensuring that the data can be trusted? Multiple network messages per read request? Discrepancy and conflict resolution logic in the application, and potentially, additional messages? All this overhead, when all you were trying to do was to read a data item. Wouldn’t it be simpler to avoid this problem in the first place? Master-slave architectures like the Oracle NoSQL Database handles this very elegantly. A change to a data item is always sent to the master copy. Consequently, the master copy always has the most current and authoritative version of the data item. The master is also responsible for propagating the change to the other copies (for availability and read scalability). Client drivers are aware of master copies and replicas, and client drivers are also aware of the “currency” of a replica. In other words, each NoSQL Database client knows how stale a replica is. This vastly simplifies the job of the application developer. If the application needs the most current version of the data item, the client driver will automatically route the request to the master copy. If the application is willing to tolerate some staleness of data (e.g. a version that is no more than 1 second out of date), the client can easily determine which replica (or set of replicas) can satisfy the request, and route the request to the most efficient copy. This results in a dramatic simplification in application logic and also minimizes network requests (the driver will only send the request to exactl the right replica, not many). So, back to my original point. A well designed and well architected system minimizes or eliminates unnecessary overhead and avoids pessimistic algorithms wherever possible in order to deliver a highly efficient and high performance system. If you’ve every programmed an Oracle NoSQL Database application, you’ll know the difference! /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin-top:0in; mso-para-margin-right:0in; mso-para-margin-bottom:10.0pt; mso-para-margin-left:0in; line-height:115%; mso-pagination:widow-orphan; font-size:11.0pt; font-family:"Calibri","sans-serif"; mso-ascii-font-family:Calibri; mso-ascii-theme-font:minor-latin; mso-fareast-font-family:"Times New Roman"; mso-fareast-theme-font:minor-fareast; mso-hansi-font-family:Calibri; mso-hansi-theme-font:minor-latin;}

Read the article
Data Source Connection Pool Sizing

- by Steve Felts

Normal 0 false false false EN-US X-NONE X-NONE MicrosoftInternetExplorer4 /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin:0in; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:10.0pt; font-family:"Times New Roman","serif";} One of the most time-consuming procedures of a database application is establishing a connection. The connection pooling of the data source can be used to minimize this overhead. That argues for using the data source instead of accessing the database driver directly. Configuring the size of the pool in the data source is somewhere between an art and science – this article will try to move it closer to science. From the beginning, WLS data source has had an initial capacity and a maximum capacity configuration values. When the system starts up and when it shrinks, initial capacity is used. The pool can grow to maximum capacity. Customers found that they might want to set the initial capacity to 0 (more on that later) but didn’t want the pool to shrink to 0. In WLS 10.3.6, we added minimum capacity to specify the lower limit to which a pool will shrink. If minimum capacity is not set, it defaults to the initial capacity for upward compatibility. We also did some work on the shrinking in release 10.3.4 to reduce thrashing; the algorithm that used to shrink to the maximum of the currently used connections or the initial capacity (basically the unused connections were all released) was changed to shrink by half of the unused connections. The simple approach to sizing the pool is to set the initial/minimum capacity to the maximum capacity. Doing this creates all connections at startup, avoiding creating connections on demand and the pool is stable. However, there are a number of reasons not to take this simple approach. When WLS is booted, the deployment of the data source includes synchronously creating the connections. The more connections that are configured in initial capacity, the longer the boot time for WLS (there have been several projects for parallel boot in WLS but none that are available). Related to creating a lot of connections at boot time is the problem of logon storms (the database gets too much work at one time). WLS has a solution for that by setting the login delay seconds on the pool but that also increases the boot time. There are a number of cases where it is desirable to set the initial capacity to 0. By doing that, the overhead of creating connections is deferred out of the boot and the database doesn’t need to be available. An application may not want WLS to automatically connect to the database until it is actually needed, such as for some code/warm failover configurations. There are a number of cases where minimum capacity should be less than maximum capacity. Connections are generally expensive to keep around. They cause state to be kept on both the client and the server, and the state on the backend may be heavy (for example, a process). Depending on the vendor, connection usage may cost money. If work load is not constant, then database connections can be freed up by shrinking the pool when connections are not in use. When using Active GridLink, connections can be created as needed according to runtime load balancing (RLB) percentages instead of by connection load balancing (CLB) during data source deployment. Shrinking is an effective technique for clearing the pool when connections are not in use. In addition to the obvious reason that there times where the workload is lighter, there are some configurations where the database and/or firewall conspire to make long-unused or too-old connections no longer viable. There are also some data source features where the connection has state and cannot be used again unless the state matches the request. Examples of this are identity based pooling where the connection has a particular owner and XA affinity where the connection is associated with a particular RAC node. At this point, WLS does not re-purpose (discard/replace) connections and shrinking is a way to get rid of the unused existing connection and get a new one with the correct state when needed. So far, the discussion has focused on the relationship of initial, minimum, and maximum capacity. Computing the maximum size requires some knowledge about the application and the current number of simultaneously active users, web sessions, batch programs, or whatever access patterns are common. The applications should be written to only reserve and close connections as needed but multiple statements, if needed, should be done in one reservation (don’t get/close more often than necessary). This means that the size of the pool is likely to be significantly smaller then the number of users. If possible, you can pick a size and see how it performs under simulated or real load. There is a high-water mark statistic (ActiveConnectionsHighCount) that tracks the maximum connections concurrently used. In general, you want the size to be big enough so that you never run out of connections but no bigger. It will need to deal with spikes in usage, which is where shrinking after the spike is important. Of course, the database capacity also has a big influence on the decision since it’s important not to overload the database machine. Planning also needs to happen if you are running in a Multi-Data Source or Active GridLink configuration and expect that the remaining nodes will take over the connections when one of the nodes in the cluster goes down. For XA affinity, additional headroom is also recommended. In summary, setting initial and maximum capacity to be the same may be simple but there are many other factors that may be important in making the decision about sizing.

Read the article
How to move linux executables to a ramdisk?

- by alfa64

i've made a ramdisk this way: mkdir -p /media/ramdisk mount -t tmpfs -o size=512M tmpfs /media/ramdisk/ The reason for this is because i run a lot of node.js scripts and their execution time is very small, but i suspect that the time overhead is because it reloads the node.js executable from disk and destroys it on each subsecuent run. So i think this might be the solution to gain a bit, if not much, performance. How can i move a program like node to the ramdisk and run it from there? The idea is to have a startup script that creates the ramdisk and puts the node files inside of it. Note that i'm currently using fedora 16 for what's it's worth. Thanks in advance.

Read the article
10/100 Network performing at 1.5 to 2.0 megabyte per second - is that below normal?

- by burnt1ce

This comes out to about 12 to 16 megabit/seond. I've read in forums that people are getting much higher speeds (ie: "40-60 Mb/s" http://forums.cnet.com/5208-7589_102-0.html?threadID=265967). I'm getting my benchmark by having a unmanged 5 port switch connected to a WRT54GS router connect. I'm sending a file from a computer connect to the WRT54GS to another computer that's connect to the unmanaged 5 port switch. Is the linkage of the switch causing this massive overhead? i doubt it. What could explain the slow down? electrical interference?

Read the article
LVM2 vs MDADM performance

- by archer

I've used MDADM + LVM2 on many boxes for quite a while. MDADM was serving for both RAID0 and RAID1 arrays, while LVM2 where used for logical volumes on top of MDADM. Recently I've found that LVM2 could be used w/o MDADM (thus minus one layer, as the result - less overhead) for both mirroring and stripping. However, some guys claims that READ PERFORMANCE on LVM2 for mirrored array is not that fast as for LVM2 (linear) on top of MDADM (RAID1) as LVM2 does not read from 2+ devices at a time, but use 2nd and higher devices in case of 1st device failure. MDADM reads from 2 devices at a time (even in mirrored mode). Who could confirm that?

Read the article
How can I get VirtualBox Guest Additions installed in an Ubuntu 9.10 server?

- by sutch

I have a freshly installed Ubuntu 9.10 server installed within a VirtualBox VM instance. From the VirtualBox menu bar, I selected Devices: Install Guest Additions... Then performed the following commands: > sudo apt-get install -y build-essential linux-headers-$(uname -r) > sudo mount /dev/cdrom /mnt/ > sudo /mnt/VBoxLinuxAdditions-amd64.run After some successful looking results, the following error is displayed: Installing the Window System drivers ...fail! (Could not find the X.Org or XFree86 Window System.) After restarting, I was looking forward to some UI integration with my host desktop (resize window, not needing to press right-Ctrl to escape the client window, and having copy and paste functionality. Is it possible to install the Guest Additions without the X Window overhead (I plan to only use for shell commands)? If additional packages are required, which ones?

Read the article
Why is (free_space + used_space) != total_size in df? [migrated]

- by Timothy Jones

I have a ~2TB ext4 USB external disk which is about half full: $ df Filesystem 1K-blocks Used Available Use% Mounted on /dev/sdc 1922860848 927384456 897800668 51% /media/big I'm wondering why the total size (1922860848) isn't the same as Used+Available (1825185124)? From this answer I see that 5% of the disk might be reserved for root, but that would still only take the total used to 1921328166, which is still off. Is it related to some other filesystem overhead? In case it's relevant, lsof -n | grep deleted shows no deleted files on this disk, and there are no other filesystems mounted inside this one.

Read the article
Mono on Linux: Apache or Nginx

- by Furism

Hi, I'm developing an ASP.NET application that will be run under Linux/Mono for various reasons (mostly to stay away from IIS, quite frankly). Of course the first web server I had in mind was Apache. But Apache, for all its advantages, adds a lot of overhead. Also, the application I'm building needs to be highly scalable and performance is one of the main concern. Apache has, obviously, a very good reputation and its record speaks for itself, but I don't need things like Reverse Proxy or Load Balancing because dedicated network devices would be used for that. So those modules from Apache will never be used. So basically my question is: since Nginx seems to fit exactly needs, is there any caveat I should be aware of? For instance, is Nginx renowned to be particularity safe? When security flaws are detected, how fast are they patched? Any insight on the pros and cons of using either of those servers in conjunction with Mono is welcome.

Read the article
afp/smb transfers caps at 2 megabytes/sec, wireless N

- by CQM

I wanted to transfer files between two mac computers. The network is wireless-N and both computers have wireless-N modules in them. The problem is that when I transfer files between them, via file sharing (afp) the network speed caps at 2 megabytes/sec. Just downloading files from the internet I can get faster speeds, so this isn't a constriction of my wifi bandwidth, it appears to be a constriction of the protocol being used. My wifi-n is set to 130mbits, so I should see real world transfer speeds around 12-16 megabytes/sec I did this command on both computers sudo sysctl -w net.inet.tcp.delayed_ack=0 which is supposed to lower tcp overhead, but this did not affect it. How can I get the speed I am expecting?

Read the article
NFS failover WITHOUT DRBD?

- by user439407

So I am trying to set up a redundant NFS share in a cloud environment(all links internal, half gig links), and I am looking into using heartbeat for failover, but all the guides seem to be about combining DRBD and heartbeat to create a robust environment. If need be I can do that, but since my content is almost completely static, I would like to avoid the extra overhead and complexity of DRBD if possible, but still be able to fail over if one of the NFS servers fails. Is it possible to use heartbeat with NFS to achieve high-availability without using DRBD to copy the blocks? I am not married to NFSv4, so if NFSv3 over UDP is necessary, that won't be a problem(only a very small number of clients will be connecting to the share) Any comments are appreciated.

Read the article
OpenX : mySql VS PostgreSQL

- by Andreas Grech

I have been using OpenX's AdServer with mySql as a backend, and since OpenX allows one to choose between mysql and postgres, I was wondering if anyone ever used Postgres and wished to talk about their experience with it here. Also, is there anyone who maybe tried using OpenX with both and can offer a comparison between the databases ? I am asking this question because I was having some trouble recently with mySql because some tables crashed occasionally and proved problematic because of the overhead they generated; and as a consequence of this, Statistics where not being generated.

Read the article
How to disable Apache http compression (mod_deflate) when SSL stream is compressed

- by Mohammad Ali

I found that Goggle Chrome supports ssl compression and Firefox should support it soon. I'm trying to configure Apache to to disable http compression if the ssl compression is used to prevent CPU overhead with the configuration option: SetEnvIf SSL_COMPRESS_METHOD DEFLATE no-gzip While the custom log (using %{SSL_COMPRESS_METHOD}x) shows that the ssl layer compression method is DEFLATE, the above option did not work and the http response content is still being compressed by Apache. I had to use the option: BrowserMatchNoCase ".Chrome." no-gzip' I prefer if there are more general method in case other browsers supports ssl compression or some has a version of chrome that does not have ssl compression.

Read the article

Search Results

Search found 1282 results on 52 pages for 'overhead'.

Page 16/52 | < Previous Page | 12 13 14 15 16 17 18 19 20 21 22 23 | Next Page >

- by Brian Stinar

- by Honglin Su

- by arungupta

- by danx

- by Chris Hoffman

- by danx

- by jorg

- by James Michael Hare

- by Simon Cooper

- by James Michael Hare

- by Karoly Vegh

- by Jeff Certain

- by rgillen

- by Ashok_Ora

- by Steve Felts

- by alfa64

- by burnt1ce

- by archer

- by sutch

- by Timothy Jones

- by Furism

- by CQM

- by user439407

- by Andreas Grech

- by Mohammad Ali

< Previous Page | 12 13 14 15 16 17 18 19 20 21 22 23 | Next Page >