processors - Page 14 - Developer IT

What is mangling tinyurls?

- by djn

Hello all. I've noticed that some form processors make a mess out of posted TinyURLs (converting the thing to a broken 'tinyurl": "http:\/?\/?tinyurl.com\/?whatever", "ok": tr') while leaving alone other plain URLs. I've seen it happen in WordPress, and I've seen it here on SO (eg.: http://stackoverflow.com/questions/2508690/whats-the-most-efficient-way-to-setup-a-multi-lingual-website - second comment to first answer). Has anybody looked into what component or function is doing this? Is there a way to prevent it?

Read the article

MPI datatype for 2 d array

- by dks

I need to pass an array of integer arrays (basically a 2 d array )to all the processors from root.I am using MPI in C programs. How to declare MPI datatype for 2 d array.and how to send the message (should i use broadcast or scatter)

Read the article

ARM Simulator on Windows

- by Betamoo

I am studying ARM Processors from a textbook... I thought it will be more useful if I could apply what I learn on an ARM simulator... writing code then watching results and different execution stages would be more fun... I have searched for it, but all I could find was either a freeware on linux or a demo on windows Is there a simulator that allow me to see execution steps and different changes for ARM processor (any version!) that runs on windows?? Thanks

Read the article

HTML / CSS autonumber headings?

- by Technical Bard

Is there a way (ideally easy) to make headings and sections autonumber in HTML/CSS? Perhaps a JS library? Or is this something that is hard to do in HTML? I'm looking at an application for a corporate wiki but we want to be able to use heading numbers like we always have in word processors.

Read the article

Which is better? Qt Creator or Visual Studio IDE

- by user249490

I am currently using Qt Creator 1.3 for my Qt applications. I know it uses jom for make step which is better when we have multi core processors. But besides that what are all the advantages of using both the IDEs? Dis advantages as well? I am using CL compiler though for compiling my applications. Is there any other specific advantages and disadvantages of these IDEs?

Read the article

SQL Server 2008 Optimization

- by hgulyan

I've learned today, if you append to your query OPTION (MAXDOP 0) your query will run on multiple processors and if it's huge query, query will perform faster. I know general guidelines on query optimizations (using indexes, selecting only needed fields etc.), my question is about SQL Server optimization. Maybe changing some options in configurations or anything else. What guidelines are there for SQL Server Optimization? Thank you.

Read the article

Can processor cores thrash each other's caches?

- by Jørgen Fogh

If more than one core on a processor is accessing the same memory address, will they thrash each other's caches or will some snooping protocol allow each to keep the data in L1-cache? I am interested in a general answer as well as answers for specific processors. How many layers of cache are invalidated? Will accessing another address within the same cache-line invalidate the entire line? What can you do to alleviate these problems?

Read the article

Windows Azure: Parallelization of the code

- by veda

I have some matrix multiplication operation. I want to parallelize the execution of those operations through multiple processors.. This can be done on high performance computing cluster using MPI (Message Passing Interface). Like wise, can I do some parallelization in the cloud using multiple worker roles. Is there any means for doing that.

Read the article

Looking for early paper about compiling object-oriented code

- by Robert Kosara

I remember reading a paper a long time ago that talked about object-oriented programming. I believe that this was from the early 1980s or perhaps even before then. This was at the time when object-oriented programming was still done through pre-processors, and one thing that stuck with me is this: it argued that you could write code in either procedural or object-oriented fashion, and after preprocessing/compiling, you would end up with the exact same machine code. Does anybody know which paper I'm talking about?

Read the article

How does retiming work in systolic arrays?

- by Andriyev

How does retiming work in systolic arrays (used in signal processors)? I read that there is some notion of negative delay which is used, but how can a delay be negative and if that is just an abstraction then how does it help?

Read the article

parallel sorting methods

- by davit-datuashvili

in book algorithm in c++ by robert sedgewick there is such kind of problem how many parallel steps would be required to sort n records that are distributed on some k disks(let say k=1000 or any value ) and using some m processors the same m can be 100 or arbitrary number i have questions what we should do in such case? what are methods to solve such kind of problems? and what is answer in this case?

Read the article

Concurrent processes do not utilize all available CPU

- by metdos

I run some processes on an EC2 cc2.8xlarge instance which has 32 virtual processors. For some type of processes, when I run 16 processes on parallel, all of them use 100% of CPU cycles. But for other type of processes, they are not using 100% CPU and they finish considerably slower than a single thread. There is no time spend on IO and all data is served from memory. Do you have any idea about the reason of this problem?

Read the article

Can Mac OS X Leopard & hence iPhone SDK be installed on the any of the new Intel i3/i5/i7 processor

- by black_sabbath

I know that it can be installed on Core 2 Duo & Dual core processors. I am using it right now.But,I am planning to buy new machine & wanna make sure that if I buy i3/i5,should be able to install Mac OS X & able to do iPhone development. Thanks.

Read the article

Can Mac OS X Leoperd/Snow Leoperd & hence iPhone SDK be installed on the any of the new Intel i3/i5/

- by black_sabbath

I know that it can be installed on Core 2 Duo & Dual core processors. I am using it right now.But,I am planning to buy new machine & wanna make sure that if I buy i3/i5,should be able to install Mac OS X & able to do iPhone development. Thanks.

Read the article

Introducing Oracle VM Server for SPARC

- by Honglin Su

As you are watching Oracle's Virtualization Strategy Webcast and exploring the great virtualization offerings of Oracle VM product line, I'd like to introduce Oracle VM Server for SPARC -- highly efficient, enterprise-class virtualization solution for Sun SPARC Enterprise Systems with Chip Multithreading (CMT) technology. Oracle VM Server for SPARC, previously called Sun Logical Domains, leverages the built-in SPARC hypervisor to subdivide supported platforms' resources (CPUs, memory, network, and storage) by creating partitions called logical (or virtual) domains. Each logical domain can run an independent operating system. Oracle VM Server for SPARC provides the flexibility to deploy multiple Oracle Solaris operating systems simultaneously on a single platform. Oracle VM Server also allows you to create up to 128 virtual servers on one system to take advantage of the massive thread scale offered by the CMT architecture. Oracle VM Server for SPARC integrates both the industry-leading CMT capability of the UltraSPARC T1, T2 and T2 Plus processors and the Oracle Solaris operating system. This combination helps to increase flexibility, isolate workload processing, and improve the potential for maximum server utilization. Oracle VM Server for SPARC delivers the following: Leading Price/Performance - The low-overhead architecture provides scalable performance under increasing workloads without additional license cost. This enables you to meet the most aggressive price/performance requirement Advanced RAS - Each logical domain is an entirely independent virtual machine with its own OS. It supports virtual disk mutipathing and failover as well as faster network failover with link-based IP multipathing (IPMP) support. Moreover, it's fully integrated with Solaris FMA (Fault Management Architecture), which enables predictive self healing. CPU Dynamic Resource Management (DRM) - Enable your resource management policy and domain workload to trigger the automatic addition and removal of CPUs. This ability helps you to better align with your IT and business priorities. Enhanced Domain Migrations - Perform domain migrations interactively and non-interactively to bring more flexibility to the management of your virtualized environment. Improve active domain migration performance by compressing memory transfers and taking advantage of cryptographic acceleration hardware. These methods provide faster migration for load balancing, power saving, and planned maintenance. Dynamic Crypto Control - Dynamically add and remove cryptographic units (aka MAU) to and from active domains. Also, migrate active domains that have cryptographic units. Physical-to-virtual (P2V) Conversion - Quickly convert an existing SPARC server running the Oracle Solaris 8, 9 or 10 OS into a virtualized Oracle Solaris 10 image. Use this image to facilitate OS migration into the virtualized environment. Virtual I/O Dynamic Reconfiguration (DR) - Add and remove virtual I/O services and devices without needing to reboot the system. CPU Power Management - Implement power saving by disabling each core on a Sun UltraSPARC T2 or T2 Plus processor that has all of its CPU threads idle. Advanced Network Configuration - Configure the following network features to obtain more flexible network configurations, higher performance, and scalability: Jumbo frames, VLANs, virtual switches for link aggregations, and network interface unit (NIU) hybrid I/O. Official Certification Based On Real-World Testing - Use Oracle VM Server for SPARC with the most sophisticated enterprise workloads under real-world conditions, including Oracle Real Application Clusters (RAC). Affordable, Full-Stack Enterprise Class Support - Obtain worldwide support from Oracle for the entire virtualization environment and workloads together. The support covers hardware, firmware, OS, virtualization, and the software stack. SPARC Server Virtualization Oracle offers a full portfolio of virtualization solutions to address your needs. SPARC is the leading platform to have the hard partitioning capability that provides the physical isolation needed to run independent operating systems. Many customers have already used Oracle Solaris Containers for application isolation. Oracle VM Server for SPARC provides another important feature with OS isolation. This gives you the flexibility to deploy multiple operating systems simultaneously on a single Sun SPARC T-Series server with finer granularity for computing resources. For SPARC CMT processors, the natural level of granularity is an execution thread, not a time-sliced microsecond of execution resources. Each CPU thread can be treated as an independent virtual processor. The scheduler is naturally built into the CPU for lower overhead and higher performance. Your organizations can couple Oracle Solaris Containers and Oracle VM Server for SPARC with the breakthrough space and energy savings afforded by Sun SPARC Enterprise systems with CMT technology to deliver a more agile, responsive, and low-cost environment. Management with Oracle Enterprise Manager Ops Center The Oracle Enterprise Manager Ops Center Virtualization Management Pack provides full lifecycle management of virtual guests, including Oracle VM Server for SPARC and Oracle Solaris Containers. It helps you streamline operations and reduce downtime. Together, the Virtualization Management Pack and the Ops Center Provisioning and Patch Automation Pack provide an end-to-end management solution for physical and virtual systems through a single web-based console. This solution automates the lifecycle management of physical and virtual systems and is the most effective systems management solution for Oracle's Sun infrastructure. Ease of Deployment with Configuration Assistant The Oracle VM Server for SPARC Configuration Assistant can help you easily create logical domains. After gathering the configuration data, the Configuration Assistant determines the best way to create a deployment to suit your requirements. The Configuration Assistant is available as both a graphical user interface (GUI) and terminal-based tool. Oracle Solaris Cluster HA Support The Oracle Solaris Cluster HA for Oracle VM Server for SPARC data service provides a mechanism for orderly startup and shutdown, fault monitoring and automatic failover of the Oracle VM Server guest domain service. In addition, applications that run on a logical domain, as well as its resources and dependencies can be controlled and managed independently. These are managed as if they were running in a classical Solaris Cluster hardware node. Supported Systems Oracle VM Server for SPARC is supported on all Sun SPARC Enterprise Systems with CMT technology. UltraSPARC T2 Plus Systems · Sun SPARC Enterprise T5140 Server · Sun SPARC Enterprise T5240 Server · Sun SPARC Enterprise T5440 Server · Sun Netra T5440 Server · Sun Blade T6340 Server Module · Sun Netra T6340 Server Module UltraSPARC T2 Systems · Sun SPARC Enterprise T5120 Server · Sun SPARC Enterprise T5220 Server · Sun Netra T5220 Server · Sun Blade T6320 Server Module · Sun Netra CP3260 ATCA Blade Server Note that UltraSPARC T1 systems are supported on earlier versions of the software.Sun SPARC Enterprise Systems with CMT technology come with the right to use (RTU) of Oracle VM Server, and the software is pre-installed. If you have the systems under warranty or with support, you can download the software and system firmware as well as their updates. Oracle Premier Support for Systems provides fully-integrated support for your server hardware, firmware, OS, and virtualization software. Visit oracle.com/support for information about Oracle's support offerings for Sun systems. For more information about Oracle's virtualization offerings, visit oracle.com/virtualization.

Read the article

Interesting articles and blogs on SPARC T4

- by mv

Interesting articles and blogs on SPARC T4 processor I have consolidated all the interesting information I could get on SPARC T4 processor and its hardware cryptographic capabilities. Hope its useful. 1. Advantages of SPARC T4 processor Most important points in this T4 announcement are : "The SPARC T4 processor was designed from the ground up for high speed security and has a cryptographic stream processing unit (SPU) integrated directly into each processor core. These accelerators support 16 industry standard security ciphers and enable high speed encryption at rates 3 to 5 times that of competing processors. By integrating encryption capabilities directly inside the instruction pipeline, the SPARC T4 processor eliminates the performance and cost barriers typically associated with secure computing and makes it possible to deliver high security levels without impacting the user experience." Data Sheet has more details on these : "New on-chip Encryption Instruction Accelerators with direct non-privileged support for 16 industry-standard cryptographic algorithms plus random number generation in each of the eight cores: AES, Camellia, CRC32c, DES, 3DES, DH, DSA, ECC, Kasumi, MD5, RSA, SHA-1, SHA-224, SHA-256, SHA-384, SHA-512" I ran "isainfo -v" command on Solaris 11 Sparc T4-1 system. It shows the new instructions as expected : $ isainfo -v 64-bit sparcv9 applications crc32c cbcond pause mont mpmul sha512 sha256 sha1 md5 camellia kasumi des aes ima hpc vis3 fmaf asi_blk_init vis2 vis popc 32-bit sparc applications crc32c cbcond pause mont mpmul sha512 sha256 sha1 md5 camellia kasumi des aes ima hpc vis3 fmaf asi_blk_init vis2 vis popc v8plus div32 mul32 2. Dan Anderson's Blog have some interesting points about how these can be used : "New T4 crypto instructions include: aes_kexpand0, aes_kexpand1, aes_kexpand2, aes_eround01, aes_eround23, aes_eround01_l, aes_eround_23_l, aes_dround01, aes_dround23, aes_dround01_l, aes_dround_23_l. Having SPARC T4 hardware crypto instructions is all well and good, but how do we access it ? The software is available with Solaris 11 and is used automatically if you are running Solaris a SPARC T4. It is used internally in the kernel through kernel crypto modules. It is available in user space through the PKCS#11 library." 3. Dans' Blog on Where's the Crypto Libraries? Although this was written in 2009 but still is very useful "Here's a brief tour of the major crypto libraries shown in the digraph: The libpkcs11 library contains the PKCS#11 API (C_\*() functions, such as C_Initialize()). That in turn calls library pkcs11_softtoken or pkcs11_kernel, for userland or kernel crypto providers. The latter is used mostly for hardware-assisted cryptography (such as n2cp for Niagara2 SPARC processors), as that is performed more efficiently in kernel space with the "kCF" module (Kernel Crypto Framework). Additionally, for Solaris 10, strong crypto algorithms were split off in separate libraries, pkcs11_softtoken_extra libcryptoutil contains low-level utility functions to help implement cryptography. libsoftcrypto (OpenSolaris and Solaris Nevada only) implements several symmetric-key crypto algorithms in software, such as AES, RC4, and DES3, and the bignum library (used for RSA). libmd implements MD5, SHA, and SHA2 message digest algorithms" 4. Difference in T3 and T4 Diagram in this blog is good and self explanatory. Jeff's blog also highlights the differences "The T4 servers have improved crypto acceleration, described at https://blogs.oracle.com/DanX/entry/sparc_t4_openssl_engine. It is "just built in" so administrators no longer have to assign crypto accelerator units to domains - it "just happens". Every physical or virtual CPU on a SPARC-T4 has full access to hardware based crypto acceleration at all times. .... For completeness sake, it's worth noting that the T4 adds more crypto algorithms, and accelerates Camelia, CRC32c, and more SHA-x." 5. About performance counters In this blog, performance counters are explained : "Note that unlike T3 and before, T4 crypto doesn't require kernel modules like ncp or n2cp, there is no visibility of crypto hardware with kstats or cryptoadm. T4 does provide hardware counters for crypto operations. You can see these using cpustat: cpustat -c pic0=Instr_FGU_crypto 5 You can check the general crypto support of the hardware and OS with the command "isainfo -v". Since T4 crypto's implementation now allows direct userland access, there are no "crypto units" visible to cryptoadm. " For more details refer Martin's blog as well. 6. How to turn off SPARC T4 or Intel AES-NI crypto acceleration I found this interesting blog from Darren about how to turn off SPARC T4 or Intel AES-NI crypto acceleration. "One of the new Solaris 11 features of the linker/loader is the ability to have a single ELF object that has multiple different implementations of the same functions that are selected at runtime based on the capabilities of the machine. The alternate to this is having the application coded to call getisax(2) system call and make the choice itself. We use this functionality of the linker/loader when we build the userland libraries for the Solaris Cryptographic Framework (specifically libmd.so and libsoftcrypto.so) The Solaris linker/loader allows control of a lot of its functionality via environment variables, we can use that to control the version of the cryptographic functions we run. To do this we simply export the LD_HWCAP environment variable with values that tell ld.so.1 to not select the HWCAP section matching certain features even if isainfo says they are present. This will work for consumers of the Solaris Cryptographic Framework that use the Solaris PKCS#11 libraries or use libmd.so interfaces directly. For SPARC T4 : export LD_HWCAP="-aes -des -md5 -sha256 -sha512 -mont -mpul" .. For Intel systems with AES-NI support: export LD_HWCAP="-aes"" Note that LD_HWCAP is explained in http://docs.oracle.com/cd/E23823_01/html/816-5165/ld.so.1-1.html "LD_HWCAP, LD_HWCAP_32, and LD_HWCAP_64 - Identifies an alternative hardware capabilities value... A “-” prefix results in the capabilities that follow being removed from the alternative capabilities." 7. Whitepaper on SPARC T4 Servers—Optimized for End-to-End Data Center Computing This Whitepaper on SPARC T4 Servers—Optimized for End-to-End Data Center Computing explains more details. It has DTrace scripts which may come in handy : "To ensure the hardware-assisted cryptographic acceleration is configured to use and working with the security scenarios, it is recommended to use the following Solaris DTrace script. #!/usr/sbin/dtrace -s pid$1:libsoftcrypto:yf*:entry, pid$target:libsoftcrypto:rsa*:entry, pid$1:libmd:yf*:entry { @[probefunc] = count(); } tick-1sec { printa(@ops); trunc(@ops); }" Note that I have slightly modified the D Script to have RSA "libsoftcrypto:rsa*:entry" as well as per recommendations from Chi-Chang Lin. 8. References http://www.oracle.com/us/corporate/features/sparc-t4-announcement-494846.html http://www.oracle.com/us/products/servers-storage/servers/sparc-enterprise/t-series/sparc-t4-1-ds-487858.pdf https://blogs.oracle.com/DanX/entry/sparc_t4_openssl_engine https://blogs.oracle.com/DanX/entry/where_s_the_crypto_libraries https://blogs.oracle.com/darren/entry/howto_turn_off_sparc_t4 http://docs.oracle.com/cd/E23823_01/html/816-5165/ld.so.1-1.html https://blogs.oracle.com/hardware/entry/unleash_the_power_of_cryptography https://blogs.oracle.com/cmt/entry/t4_crypto_cheat_sheet https://blogs.oracle.com/martinm/entry/t4_performance_counters_explained https://blogs.oracle.com/jsavit/entry/no_mau_required_on_a http://www.oracle.com/us/products/servers-storage/servers/sparc-enterprise/t-series/sparc-t4-business-wp-524472.pdf

Read the article

Thread placement policies on NUMA systems - update

- by Dave

In a prior blog entry I noted that Solaris used a "maximum dispersal" placement policy to assign nascent threads to their initial processors. The general idea is that threads should be placed as far away from each other as possible in the resource topology in order to reduce resource contention between concurrently running threads. This policy assumes that resource contention -- pipelines, memory channel contention, destructive interference in the shared caches, etc -- will likely outweigh (a) any potential communication benefits we might achieve by packing our threads more densely onto a subset of the NUMA nodes, and (b) benefits of NUMA affinity between memory allocated by one thread and accessed by other threads. We want our threads spread widely over the system and not packed together. Conceptually, when placing a new thread, the kernel picks the least loaded node NUMA node (the node with lowest aggregate load average), and then the least loaded core on that node, etc. Furthermore, the kernel places threads onto resources -- sockets, cores, pipelines, etc -- without regard to the thread's process membership. That is, initial placement is process-agnostic. Keep reading, though. This description is incorrect. On Solaris 10 on a SPARC T5440 with 4 x T2+ NUMA nodes, if the system is otherwise unloaded and we launch a process that creates 20 compute-bound concurrent threads, then typically we'll see a perfect balance with 5 threads on each node. We see similar behavior on an 8-node x86 x4800 system, where each node has 8 cores and each core is 2-way hyperthreaded. So far so good; this behavior seems in agreement with the policy I described in the 1st paragraph. I recently tried the same experiment on a 4-node T4-4 running Solaris 11. Both the T5440 and T4-4 are 4-node systems that expose 256 logical thread contexts. To my surprise, all 20 threads were placed onto just one NUMA node while the other 3 nodes remained completely idle. I checked the usual suspects such as processor sets inadvertently left around by colleagues, processors left offline, and power management policies, but the system was configured normally. I then launched multiple concurrent instances of the process, and, interestingly, all the threads from the 1st process landed on one node, all the threads from the 2nd process landed on another node, and so on. This happened even if I interleaved thread creating between the processes, so I was relatively sure the effect didn't related to thread creation time, but rather that placement was a function of process membership. I this point I consulted the Solaris sources and talked with folks in the Solaris group. The new Solaris 11 behavior is intentional. The kernel is no longer using a simple maximum dispersal policy, and thread placement is process membership-aware. Now, even if other nodes are completely unloaded, the kernel will still try to pack new threads onto the home lgroup (socket) of the primordial thread until the load average of that node reaches 50%, after which it will pick the next least loaded node as the process's new favorite node for placement. On the T4-4 we have 64 logical thread contexts (strands) per socket (lgroup), so if we launch 48 concurrent threads we will find 32 placed on one node and 16 on some other node. If we launch 64 threads we'll find 32 and 32. That means we can end up with our threads clustered on a small subset of the nodes in a way that's quite different that what we've seen on Solaris 10. So we have a policy that allows process-aware packing but reverts to spreading threads onto other nodes if a node becomes too saturated. It turns out this policy was enabled in Solaris 10, but certain bugs suppressed the mixed packing/spreading behavior. There are configuration variables in /etc/system that allow us to dial the affinity between nascent threads and their primordial thread up and down: see lgrp_expand_proc_thresh, specifically. In the OpenSolaris source code the key routine is mpo_update_tunables(). This method reads the /etc/system variables and sets up some global variables that will subsequently be used by the dispatcher, which calls lgrp_choose() in lgrp.c to place nascent threads. Lgrp_expand_proc_thresh controls how loaded an lgroup must be before we'll consider homing a process's threads to another lgroup. Tune this value lower to have it spread your process's threads out more. To recap, the 'new' policy is as follows. Threads from the same process are packed onto a subset of the strands of a socket (50% for T-series). Once that socket reaches the 50% threshold the kernel then picks another preferred socket for that process. Threads from unrelated processes are spread across sockets. More precisely, different processes may have different preferred sockets (lgroups). Beware that I've simplified and elided details for the purposes of explication. The truth is in the code. Remarks: It's worth noting that initial thread placement is just that. If there's a gross imbalance between the load on different nodes then the kernel will migrate threads to achieve a better and more even distribution over the set of available nodes. Once a thread runs and gains some affinity for a node, however, it becomes "stickier" under the assumption that the thread has residual cache residency on that node, and that memory allocated by that thread resides on that node given the default "first-touch" page-level NUMA allocation policy. Exactly how the various policies interact and which have precedence under what circumstances could the topic of a future blog entry. The scheduler is work-conserving. The x4800 mentioned above is an interesting system. Each of the 8 sockets houses an Intel 7500-series processor. Each processor has 3 coherent QPI links and the system is arranged as a glueless 8-socket twisted ladder "mobius" topology. Nodes are either 1 or 2 hops distant over the QPI links. As an aside the mapping of logical CPUIDs to physical resources is rather interesting on Solaris/x4800. On SPARC/Solaris the CPUID layout is strictly geographic, with the highest order bits identifying the socket, the next lower bits identifying the core within that socket, following by the pipeline (if present) and finally the logical thread context ("strand") on the core. But on Solaris on the x4800 the CPUID layout is as follows. [6:6] identifies the hyperthread on a core; bits [5:3] identify the socket, or package in Intel terminology; bits [2:0] identify the core within a socket. Such low-level details should be of interest only if you're binding threads -- a bad idea, the kernel typically handles placement best -- or if you're writing NUMA-aware code that's aware of the ambient placement and makes decisions accordingly. Solaris introduced the so-called critical-threads mechanism, which is expressed by putting a thread into the FX scheduling class at priority 60. The critical-threads mechanism applies to placement on cores, not on sockets, however. That is, it's an intra-socket policy, not an inter-socket policy. Solaris 11 introduces the Power Aware Dispatcher (PAD) which packs threads instead of spreading them out in an attempt to be able to keep sockets or cores at lower power levels. Maximum dispersal may be good for performance but is anathema to power management. PAD is off by default, but power management polices constitute yet another confounding factor with respect to scheduling and dispatching. If your threads communicate heavily -- one thread reads cache lines last written by some other thread -- then the new dense packing policy may improve performance by reducing traffic on the coherent interconnect. On the other hand if your threads in your process communicate rarely, then it's possible the new packing policy might result on contention on shared computing resources. Unfortunately there's no simple litmus test that says whether packing or spreading is optimal in a given situation. The answer varies by system load, application, number of threads, and platform hardware characteristics. Currently we don't have the necessary tools and sensoria to decide at runtime, so we're reduced to an empirical approach where we run trials and try to decide on a placement policy. The situation is quite frustrating. Relatedly, it's often hard to determine just the right level of concurrency to optimize throughput. (Understanding constructive vs destructive interference in the shared caches would be a good start. We could augment the lines with a small tag field indicating which strand last installed or accessed a line. Given that, we could augment the CPU with performance counters for misses where a thread evicts a line it installed vs misses where a thread displaces a line installed by some other thread.)

Read the article

NUMA-aware placement of communication variables

- by Dave

For classic NUMA-aware programming I'm typically most concerned about simple cold, capacity and compulsory misses and whether we can satisfy the miss by locally connected memory or whether we have to pull the line from its home node over the coherent interconnect -- we'd like to minimize channel contention and conserve interconnect bandwidth. That is, for this style of programming we're quite aware of where memory is homed relative to the threads that will be accessing it. Ideally, a page is collocated on the node with the thread that's expected to most frequently access the page, as simple misses on the page can be satisfied without resorting to transferring the line over the interconnect. The default "first touch" NUMA page placement policy tends to work reasonable well in this regard. When a virtual page is first accessed, the operating system will attempt to provision and map that virtual page to a physical page allocated from the node where the accessing thread is running. It's worth noting that the node-level memory interleaving granularity is usually a multiple of the page size, so we can say that a given page P resides on some node N. That is, the memory underlying a page resides on just one node. But when thinking about accesses to heavily-written communication variables we normally consider what caches the lines underlying such variables might be resident in, and in what states. We want to minimize coherence misses and cache probe activity and interconnect traffic in general. I don't usually give much thought to the location of the home NUMA node underlying such highly shared variables. On a SPARC T5440, for instance, which consists of 4 T2+ processors connected by a central coherence hub, the home node and placement of heavily accessed communication variables has very little impact on performance. The variables are frequently accessed so likely in M-state in some cache, and the location of the home node is of little consequence because a requester can use cache-to-cache transfers to get the line. Or at least that's what I thought. Recently, though, I was exploring a simple shared memory point-to-point communication model where a client writes a request into a request mailbox and then busy-waits on a response variable. It's a simple example of delegation based on message passing. The server polls the request mailbox, and having fetched a new request value, performs some operation and then writes a reply value into the response variable. As noted above, on a T5440 performance is insensitive to the placement of the communication variables -- the request and response mailbox words. But on a Sun/Oracle X4800 I noticed that was not the case and that NUMA placement of the communication variables was actually quite important. For background an X4800 system consists of 8 Intel X7560 Xeons . Each package (socket) has 8 cores with 2 contexts per core, so the system is 8x8x2. Each package is also a NUMA node and has locally attached memory. Every package has 3 point-to-point QPI links for cache coherence, and the system is configured with a twisted ladder "mobius" topology. The cache coherence fabric is glueless -- there's not central arbiter or coherence hub. The maximum distance between any two nodes is just 2 hops over the QPI links. For any given node, 3 other nodes are 1 hop distant and the remaining 4 nodes are 2 hops distant. Using a single request (client) thread and a single response (server) thread, a benchmark harness explored all permutations of NUMA placement for the two threads and the two communication variables, measuring the average round-trip-time and throughput rate between the client and server. In this benchmark the server simply acts as a simple transponder, writing the request value plus 1 back into the reply field, so there's no particular computation phase and we're only measuring communication overheads. In addition to varying the placement of communication variables over pairs of nodes, we also explored variations where both variables were placed on one page (and thus on one node) -- either on the same cache line or different cache lines -- while varying the node where the variables reside along with the placement of the threads. The key observation was that if the client and server threads were on different nodes, then the best placement of variables was to have the request variable (written by the client and read by the server) reside on the same node as the client thread, and to place the response variable (written by the server and read by the client) on the same node as the server. That is, if you have a variable that's to be written by one thread and read by another, it should be homed with the writer thread. For our simple client-server model that means using split request and response communication variables with unidirectional message flow on a given page. This can yield up to twice the throughput of less favorable placement strategies. Our X4800 uses the QPI 1.0 protocol with source-based snooping. Briefly, when node A needs to probe a cache line it fires off snoop requests to all the nodes in the system. Those recipients then forward their response not to the original requester, but to the home node H of the cache line. H waits for and collects the responses, adjudicates and resolves conflicts and ensures memory-model ordering, and then sends a definitive reply back to the original requester A. If some node B needed to transfer the line to A, it will do so by cache-to-cache transfer and let H know about the disposition of the cache line. A needs to wait for the authoritative response from H. So if a thread on node A wants to write a value to be read by a thread on node B, the latency is dependent on the distances between A, B, and H. We observe the best performance when the written-to variable is co-homed with the writer A. That is, we want H and A to be the same node, as the writer doesn't need the home to respond over the QPI link, as the writer and the home reside on the very same node. With architecturally informed placement of communication variables we eliminate at least one QPI hop from the critical path. Newer Intel processors use the QPI 1.1 coherence protocol with home-based snooping. As noted above, under source-snooping a requester broadcasts snoop requests to all nodes. Those nodes send their response to the home node of the location, which provides memory ordering, reconciles conflicts, etc., and then posts a definitive reply to the requester. In home-based snooping the snoop probe goes directly to the home node and are not broadcast. The home node can consult snoop filters -- if present -- and send out requests to retrieve the line if necessary. The 3rd party owner of the line, if any, can respond either to the home or the original requester (or even to both) according to the protocol policies. There are myriad variations that have been implemented, and unfortunately vendor terminology doesn't always agree between vendors or with the academic taxonomy papers. The key is that home-snooping enables the use of a snoop filter to reduce interconnect traffic. And while home-snooping might have a longer critical path (latency) than source-based snooping, it also may require fewer messages and less overall bandwidth. It'll be interesting to reprise these experiments on a platform with home-based snooping. While collecting data I also noticed that there are placement concerns even in the seemingly trivial case when both threads and both variables reside on a single node. Internally, the cores on each X7560 package are connected by an internal ring. (Actually there are multiple contra-rotating rings). And the last-level on-chip cache (LLC) is partitioned in banks or slices, which with each slice being associated with a core on the ring topology. A hardware hash function associates each physical address with a specific home bank. Thus we face distance and topology concerns even for intra-package communications, although the latencies are not nearly the magnitude we see inter-package. I've not seen such communication distance artifacts on the T2+, where the cache banks are connected to the cores via a high-speed crossbar instead of a ring -- communication latencies seem more regular.

Read the article

Lenovo V570 CPU fan running constantly, CPU core 1 running over 90%!

- by Rabbit2190

I have seen that a lot of people are having this same issue. I am running a Lenovo V570 i5 4 core, 6 gigs of ram, and am running 11.10 Onieric Ocelot. On my system monitor graph it shows CPU at 20%, when I open the monitor it shows core #1 at around 90%, the other cores fluctuate at or below 5-12% if even. Now this seems like a really terrible balance of power between the cores, especially with so much stress on one core only, when these things are designed to work with 4 cores and not at such high temps. My current readings say 64 degrees Celsius, this does not seem normal for any cpu, and I am seriously considering, working on my windows7 partition until I see a real solution to this issue or upgrading to 12.04 right away when it comes out... I have seen countless things saying it has something to do with the Kernel, the kernel on mine is the same as when I upgraded, I really do not like messing with it, as when I had 11.04, I did tinker with it due to the freeze issues I was having, and that just made worse issues. I like this version 11.10 and would like to keep it for a while, but without the fear that my core is going to fry! So any help would be much appreciated! I did try changing a couple things in ACPI, and restarting this did not help, and here I am. I tried one thing prior to that that was listed under a different computer brand, but it would not do a make on the file. I really need help with this, I rely on this computer for a lot of things, and love this OS! Please help so I do not need to resort to my Microsoft partition! PLEASE! Here is the fwts cpufrequ- output: rabbit@rabbit-Lenovo-V570:~$ sudo fwts cpufreq - 00001 fwts Results generated by fwts: Version V0.23.25 (Thu Oct 6 15 00002 fwts :12:31 BST 2011). 00003 fwts 00004 fwts Some of this work - Copyright (c) 1999 - 2010, Intel Corp. 00005 fwts All rights reserved. 00006 fwts Some of this work - Copyright (c) 2010 - 2011, Canonical. 00007 fwts 00008 fwts This test run on 02/04/12 at 17:23:22 on host Linux 00009 fwts rabbit-Lenovo-V570 3.0.0-17-generic-pae #30-Ubuntu SMP Thu 00010 fwts Mar 8 17:53:35 UTC 2012 i686. 00011 fwts 00012 fwts Running tests: cpufreq. 00014 cpufreq CPU frequency scaling tests (takes ~1-2 mins). 00015 cpufreq --------------------------------------------------------- 00016 cpufreq Test 1 of 1: CPU P-State Checks. 00017 cpufreq For each processor in the system, this test steps through 00018 cpufreq the various frequency states (P-states) that the BIOS 00019 cpufreq advertises for the processor. For each processor/frequency 00020 cpufreq combination, a quick performance value is measured. The 00021 cpufreq test then validates that: 00022 cpufreq 1) Each processor has the same number of frequency states 00023 cpufreq 2) Higher advertised frequencies have a higher performance 00024 cpufreq 3) No duplicate frequency values are reported by the BIOS 00025 cpufreq 4) Is BIOS wrongly doing Sw_All P-state coordination across cores 00026 cpufreq 5) Is BIOS wrongly doing Sw_Any P-state coordination across cores 00027 cpufreq Frequency | Speed 00028 cpufreq -----------+--------- 00029 cpufreq 2.45 Ghz | 100.0 % 00030 cpufreq 2.45 Ghz | 83.7 % 00031 cpufreq 2.05 Ghz | 69.2 % 00032 cpufreq 1.85 Ghz | 62.5 % 00033 cpufreq 1.65 Ghz | 55.2 % 00034 cpufreq 1400 Mhz | 48.6 % 00035 cpufreq 1200 Mhz | 41.8 % 00036 cpufreq 1000 Mhz | 34.5 % 00037 cpufreq 800 Mhz | 27.6 % 00038 cpufreq 9 CPU frequency steps supported 00039 cpufreq Frequency | Speed 00040 cpufreq -----------+--------- 00041 cpufreq 2.45 Ghz | 97.7 % 00042 cpufreq 2.45 Ghz | 83.7 % 00043 cpufreq 2.05 Ghz | 69.6 % 00044 cpufreq 1.85 Ghz | 63.3 % 00045 cpufreq 1.65 Ghz | 55.7 % 00046 cpufreq 1400 Mhz | 48.7 % 00047 cpufreq 1200 Mhz | 41.7 % 00048 cpufreq 1000 Mhz | 34.5 % 00049 cpufreq 800 Mhz | 27.5 % 00050 cpufreq Frequency | Speed 00051 cpufreq -----------+--------- 00052 cpufreq 2.45 Ghz | 97.7 % 00053 cpufreq 2.45 Ghz | 84.4 % 00054 cpufreq 2.05 Ghz | 69.6 % 00055 cpufreq 1.85 Ghz | 62.6 % 00056 cpufreq 1.65 Ghz | 55.9 % 00057 cpufreq 1400 Mhz | 48.7 % 00058 cpufreq 1200 Mhz | 41.7 % 00059 cpufreq 1000 Mhz | 34.7 % 00060 cpufreq 800 Mhz | 27.8 % 00061 cpufreq Frequency | Speed 00062 cpufreq -----------+--------- 00063 cpufreq 2.45 Ghz | 100.0 % 00064 cpufreq 2.45 Ghz | 82.6 % 00065 cpufreq 2.05 Ghz | 67.8 % 00066 cpufreq 1.85 Ghz | 61.4 % 00067 cpufreq 1.65 Ghz | 54.9 % 00068 cpufreq 1400 Mhz | 48.3 % 00069 cpufreq 1200 Mhz | 41.1 % 00070 cpufreq 1000 Mhz | 34.3 % 00071 cpufreq 800 Mhz | 27.4 % 00072 cpufreq Frequency | Speed 00073 cpufreq -----------+--------- 00074 cpufreq 2.45 Ghz | 96.2 % 00075 cpufreq 2.45 Ghz | 82.5 % 00076 cpufreq 2.05 Ghz | 69.3 % 00077 cpufreq 1.85 Ghz | 62.7 % 00078 cpufreq 1.65 Ghz | 55.0 % 00079 cpufreq 1400 Mhz | 47.4 % 00080 cpufreq 1200 Mhz | 41.1 % 00081 cpufreq 1000 Mhz | 34.0 % 00082 cpufreq 800 Mhz | 27.2 % 00083 cpufreq Frequency | Speed 00084 cpufreq -----------+--------- 00085 cpufreq 2.45 Ghz | 96.5 % 00086 cpufreq 2.45 Ghz | 83.6 % 00087 cpufreq 2.05 Ghz | 68.1 % 00088 cpufreq 1.85 Ghz | 61.7 % 00089 cpufreq 1.65 Ghz | 54.9 % 00090 cpufreq 1400 Mhz | 48.0 % 00091 cpufreq 1200 Mhz | 41.1 % 00092 cpufreq 1000 Mhz | 34.2 % 00093 cpufreq 800 Mhz | 27.8 % 00094 cpufreq Frequency | Speed 00095 cpufreq -----------+--------- 00096 cpufreq 2.45 Ghz | 96.4 % 00097 cpufreq 2.45 Ghz | 82.6 % 00098 cpufreq 2.05 Ghz | 68.8 % 00099 cpufreq 1.85 Ghz | 60.5 % 00100 cpufreq 1.65 Ghz | 52.4 % 00101 cpufreq 1400 Mhz | 48.8 % 00102 cpufreq 1200 Mhz | 41.1 % 00103 cpufreq 1000 Mhz | 34.2 % 00104 cpufreq 800 Mhz | 26.4 % 00105 cpufreq Frequency | Speed 00106 cpufreq -----------+--------- 00107 cpufreq 2.45 Ghz | 95.3 % 00108 cpufreq 2.45 Ghz | 82.5 % 00109 cpufreq 2.05 Ghz | 65.5 % 00110 cpufreq 1.85 Ghz | 62.8 % 00111 cpufreq 1.65 Ghz | 54.8 % 00112 cpufreq 1400 Mhz | 48.0 % 00113 cpufreq 1200 Mhz | 41.2 % 00114 cpufreq 1000 Mhz | 34.2 % 00115 cpufreq 800 Mhz | 27.3 % 00116 cpufreq Frequency | Speed 00117 cpufreq -----------+--------- 00118 cpufreq 2.45 Ghz | 96.3 % 00119 cpufreq 2.45 Ghz | 83.4 % 00120 cpufreq 2.05 Ghz | 68.3 % 00121 cpufreq 1.85 Ghz | 61.9 % 00122 cpufreq 1.65 Ghz | 54.9 % 00123 cpufreq 1400 Mhz | 48.0 % 00124 cpufreq 1200 Mhz | 41.1 % 00125 cpufreq 1000 Mhz | 34.2 % 00126 cpufreq 800 Mhz | 27.3 % 00127 cpufreq Frequency | Speed 00128 cpufreq -----------+--------- 00129 cpufreq 2.45 Ghz | 100.0 % 00130 cpufreq 2.45 Ghz | 77.9 % 00131 cpufreq 2.05 Ghz | 64.6 % 00132 cpufreq 1.85 Ghz | 54.0 % 00133 cpufreq 1.65 Ghz | 51.7 % 00134 cpufreq 1400 Mhz | 45.2 % 00135 cpufreq 1200 Mhz | 39.0 % 00136 cpufreq 1000 Mhz | 33.1 % 00137 cpufreq 800 Mhz | 25.5 % 00138 cpufreq Frequency | Speed 00139 cpufreq -----------+--------- 00140 cpufreq 2.45 Ghz | 93.4 % 00141 cpufreq 2.45 Ghz | 75.7 % 00142 cpufreq 2.05 Ghz | 64.5 % 00143 cpufreq 1.85 Ghz | 59.1 % 00144 cpufreq 1.65 Ghz | 51.4 % 00145 cpufreq 1400 Mhz | 45.9 % 00146 cpufreq 1200 Mhz | 39.3 % 00147 cpufreq 1000 Mhz | 32.7 % 00148 cpufreq 800 Mhz | 25.8 % 00149 cpufreq Frequency | Speed 00150 cpufreq -----------+--------- 00151 cpufreq 2.45 Ghz | 92.1 % 00152 cpufreq 2.45 Ghz | 78.1 % 00153 cpufreq 2.05 Ghz | 65.7 % 00154 cpufreq 1.85 Ghz | 58.6 % 00155 cpufreq 1.65 Ghz | 52.5 % 00156 cpufreq 1400 Mhz | 45.7 % 00157 cpufreq 1200 Mhz | 39.3 % 00158 cpufreq 1000 Mhz | 32.7 % 00159 cpufreq 800 Mhz | 24.3 % 00160 cpufreq Frequency | Speed 00161 cpufreq -----------+--------- 00162 cpufreq 2.45 Ghz | 88.9 % 00163 cpufreq 2.45 Ghz | 79.8 % 00164 cpufreq 2.05 Ghz | 58.4 % 00165 cpufreq 1.85 Ghz | 52.6 % 00166 cpufreq 1.65 Ghz | 46.9 % 00167 cpufreq 1400 Mhz | 41.0 % 00168 cpufreq 1200 Mhz | 35.1 % 00169 cpufreq 1000 Mhz | 29.1 % 00170 cpufreq 800 Mhz | 22.9 % 00171 cpufreq Frequency | Speed 00172 cpufreq -----------+--------- 00173 cpufreq 2.45 Ghz | 92.8 % 00174 cpufreq 2.45 Ghz | 80.1 % 00175 cpufreq 2.05 Ghz | 66.2 % 00176 cpufreq 1.85 Ghz | 59.5 % 00177 cpufreq 1.65 Ghz | 52.9 % 00178 cpufreq 1400 Mhz | 46.2 % 00179 cpufreq 1200 Mhz | 39.5 % 00180 cpufreq 1000 Mhz | 32.9 % 00181 cpufreq 800 Mhz | 26.3 % 00182 cpufreq Frequency | Speed 00183 cpufreq -----------+--------- 00184 cpufreq 2.45 Ghz | 92.9 % 00185 cpufreq 2.45 Ghz | 79.5 % 00186 cpufreq 2.05 Ghz | 66.2 % 00187 cpufreq 1.85 Ghz | 59.6 % 00188 cpufreq 1.65 Ghz | 52.9 % 00189 cpufreq 1400 Mhz | 46.7 % 00190 cpufreq 1200 Mhz | 39.6 % 00191 cpufreq 1000 Mhz | 32.9 % 00192 cpufreq 800 Mhz | 26.3 % 00193 cpufreq FAILED [MEDIUM] CPUFreqCPUsSetToSW_ANY: Test 1, Processors 00194 cpufreq are set to SW_ANY. 00195 cpufreq FAILED [MEDIUM] CPUFreqSW_ANY: Test 1, Firmware not 00196 cpufreq implementing hardware coordination cleanly. Firmware using 00197 cpufreq SW_ANY instead?. 00198 cpufreq 00199 cpufreq ========================================================= 00200 cpufreq 0 passed, 2 failed, 0 warnings, 0 aborted, 0 skipped, 0 00201 cpufreq info only. 00202 cpufreq ========================================================= 00204 summary 00205 summary 0 passed, 2 failed, 0 warnings, 0 aborted, 0 skipped, 0 00206 summary info only. 00207 summary 00208 summary Test Failure Summary 00209 summary ==================== 00210 summary 00211 summary Critical failures: NONE 00212 summary 00213 summary High failures: NONE 00214 summary 00215 summary Medium failures: 2 00216 summary cpufreq test, at 1 log line: 193 00217 summary "Processors are set to SW_ANY." 00218 summary cpufreq test, at 1 log line: 195 00219 summary "Firmware not implementing hardware coordination cleanly. Firmware using SW_ANY instead?." 00220 summary 00221 summary Low failures: NONE 00222 summary 00223 summary Other failures: NONE 00224 summary 00225 summary Test |Pass |Fail |Abort|Warn |Skip |Info | 00226 summary ---------------+-----+-----+-----+-----+-----+-----+ 00227 summary cpufreq | | 2| | | | | 00228 summary ---------------+-----+-----+-----+-----+-----+-----+ 00229 summary Total: | 0| 2| 0| 0| 0| 0| 00230 summary ---------------+-----+-----+-----+-----+-----+-----+ rabbit@rabbit-Lenovo-V570:~$

Read the article

Parallelism in .NET – Part 2, Simple Imperative Data Parallelism

- by Reed

In my discussion of Decomposition of the problem space, I mentioned that Data Decomposition is often the simplest abstraction to use when trying to parallelize a routine. If a problem can be decomposed based off the data, we will often want to use what MSDN refers to as Data Parallelism as our strategy for implementing our routine. The Task Parallel Library in .NET 4 makes implementing Data Parallelism, for most cases, very simple. Data Parallelism is the main technique we use to parallelize a routine which can be decomposed based off data. Data Parallelism refers to taking a single collection of data, and having a single operation be performed concurrently on elements in the collection. One side note here: Data Parallelism is also sometimes referred to as the Loop Parallelism Pattern or Loop-level Parallelism. In general, for this series, I will try to use the terminology used in the MSDN Documentation for the Task Parallel Library. This should make it easier to investigate these topics in more detail. Once we’ve determined we have a problem that, potentially, can be decomposed based on data, implementation using Data Parallelism in the TPL is quite simple. Let’s take our example from the Data Decomposition discussion – a simple contrast stretching filter. Here, we have a collection of data (pixels), and we need to run a simple operation on each element of the pixel. Once we know the minimum and maximum values, we most likely would have some simple code like the following: for (int row=0; row < pixelData.GetUpperBound(0); ++row) { for (int col=0; col < pixelData.GetUpperBound(1); ++col) { pixelData[row, col] = AdjustContrast(pixelData[row, col], minPixel, maxPixel); } } .csharpcode, .csharpcode pre { font-size: small; color: black; font-family: consolas, "Courier New", courier, monospace; background-color: #ffffff; /*white-space: pre;*/ } .csharpcode pre { margin: 0em; } .csharpcode .rem { color: #008000; } .csharpcode .kwrd { color: #0000ff; } .csharpcode .str { color: #006080; } .csharpcode .op { color: #0000c0; } .csharpcode .preproc { color: #cc6633; } .csharpcode .asp { background-color: #ffff00; } .csharpcode .html { color: #800000; } .csharpcode .attr { color: #ff0000; } .csharpcode .alt { background-color: #f4f4f4; width: 100%; margin: 0em; } .csharpcode .lnum { color: #606060; } This simple routine loops through a two dimensional array of pixelData, and calls the AdjustContrast routine on each pixel. As I mentioned, when you’re decomposing a problem space, most iteration statements are potentially candidates for data decomposition. Here, we’re using two for loops – one looping through rows in the image, and a second nested loop iterating through the columns. We then perform one, independent operation on each element based on those loop positions. This is a prime candidate – we have no shared data, no dependencies on anything but the pixel which we want to change. Since we’re using a for loop, we can easily parallelize this using the Parallel.For method in the TPL: Parallel.For(0, pixelData.GetUpperBound(0), row => { for (int col=0; col < pixelData.GetUpperBound(1); ++col) { pixelData[row, col] = AdjustContrast(pixelData[row, col], minPixel, maxPixel); } }); Here, by simply changing our first for loop to a call to Parallel.For, we can parallelize this portion of our routine. Parallel.For works, as do many methods in the TPL, by creating a delegate and using it as an argument to a method. In this case, our for loop iteration block becomes a delegate creating via a lambda expression. This lets you write code that, superficially, looks similar to the familiar for loop, but functions quite differently at runtime. We could easily do this to our second for loop as well, but that may not be a good idea. There is a balance to be struck when writing parallel code. We want to have enough work items to keep all of our processors busy, but the more we partition our data, the more overhead we introduce. In this case, we have an image of data – most likely hundreds of pixels in both dimensions. By just parallelizing our first loop, each row of pixels can be run as a single task. With hundreds of rows of data, we are providing fine enough granularity to keep all of our processors busy. If we parallelize both loops, we’re potentially creating millions of independent tasks. This introduces extra overhead with no extra gain, and will actually reduce our overall performance. This leads to my first guideline when writing parallel code: Partition your problem into enough tasks to keep each processor busy throughout the operation, but not more than necessary to keep each processor busy. Also note that I parallelized the outer loop. I could have just as easily partitioned the inner loop. However, partitioning the inner loop would have led to many more discrete work items, each with a smaller amount of work (operate on one pixel instead of one row of pixels). My second guideline when writing parallel code reflects this: Partition your problem in a way to place the most work possible into each task. This typically means, in practice, that you will want to parallelize the routine at the “highest” point possible in the routine, typically the outermost loop. If you’re looking at parallelizing methods which call other methods, you’ll want to try to partition your work high up in the stack – as you get into lower level methods, the performance impact of parallelizing your routines may not overcome the overhead introduced. Parallel.For works great for situations where we know the number of elements we’re going to process in advance. If we’re iterating through an IList<T> or an array, this is a typical approach. However, there are other iteration statements common in C#. In many situations, we’ll use foreach instead of a for loop. This can be more understandable and easier to read, but also has the advantage of working with collections which only implement IEnumerable<T>, where we do not know the number of elements involved in advance. As an example, lets take the following situation. Say we have a collection of Customers, and we want to iterate through each customer, check some information about the customer, and if a certain case is met, send an email to the customer and update our instance to reflect this change. Normally, this might look something like: foreach(var customer in customers) { // Run some process that takes some time... DateTime lastContact = theStore.GetLastContact(customer); TimeSpan timeSinceContact = DateTime.Now - lastContact; // If it's been more than two weeks, send an email, and update... if (timeSinceContact.Days > 14) { theStore.EmailCustomer(customer); customer.LastEmailContact = DateTime.Now; } } Here, we’re doing a fair amount of work for each customer in our collection, but we don’t know how many customers exist. If we assume that theStore.GetLastContact(customer) and theStore.EmailCustomer(customer) are both side-effect free, thread safe operations, we could parallelize this using Parallel.ForEach: Parallel.ForEach(customers, customer => { // Run some process that takes some time... DateTime lastContact = theStore.GetLastContact(customer); TimeSpan timeSinceContact = DateTime.Now - lastContact; // If it's been more than two weeks, send an email, and update... if (timeSinceContact.Days > 14) { theStore.EmailCustomer(customer); customer.LastEmailContact = DateTime.Now; } }); Just like Parallel.For, we rework our loop into a method call accepting a delegate created via a lambda expression. This keeps our new code very similar to our original iteration statement, however, this will now execute in parallel. The same guidelines apply with Parallel.ForEach as with Parallel.For. The other iteration statements, do and while, do not have direct equivalents in the Task Parallel Library. These, however, are very easy to implement using Parallel.ForEach and the yield keyword. Most applications can benefit from implementing some form of Data Parallelism. Iterating through collections and performing “work” is a very common pattern in nearly every application. When the problem can be decomposed by data, we often can parallelize the workload by merely changing foreach statements to Parallel.ForEach method calls, and for loops to Parallel.For method calls. Any time your program operates on a collection, and does a set of work on each item in the collection where that work is not dependent on other information, you very likely have an opportunity to parallelize your routine.

Read the article

Renci.SSHNet and HP ILO 4

- by Andrew J. Brehm

I am using Renci.SSHNet to connect to HP iLO processors. Generally this works fine and I can connect and run several commands and disconnect. However, I noticed that a few new servers that use iLO 4 simply don't react to any but the first command sent. When I login using Putty everything works fine, but when using an SSH connection with Renci only the first command sent is recognised whereas the second and further commands do not cause any reaction whatsoever by the iLO processor, not even an error message. Any ideas why that might be?

Read the article

Parallelism in .NET – Introduction

- by Reed

Parallel programming is something that every professional developer should understand, but is rarely discussed or taught in detail in a formal manner. Software users are no longer content with applications that lock up the user interface regularly, or take large amounts of time to process data unnecessarily. Modern development requires the use of parallelism. There is no longer any excuses for us as developers. Learning to write parallel software is challenging. It requires more than reading that one chapter on parallelism in our programming language book of choice… Today’s systems are no longer getting faster with each generation; in many cases, newer computers are actually slower than previous generation systems. Modern hardware is shifting towards conservation of power, with processing scalability coming from having multiple computer cores, not faster and faster CPUs. Our CPU frequencies no longer double on a regular basis, but Moore’s Law is still holding strong. Now, however, instead of scaling transistors in order to make processors faster, hardware manufacturers are scaling the transistors in order to add more discrete hardware processing threads to the system. This changes how we should think about software. In order to take advantage of modern systems, we need to redesign and rewrite our algorithms to work in parallel. As with any design domain, it helps tremendously to have a common language, as well as a common set of patterns and tools. For .NET developers, this is an exciting time for parallel programming. Version 4 of the .NET Framework is adding the Task Parallel Library. This has been back-ported to .NET 3.5sp1 as part of the Reactive Extensions for .NET, and is available for use today in both .NET 3.5 and .NET 4.0 beta. In order to fully utilize the Task Parallel Library and parallelism, both in .NET 4 and previous versions, we need to understand the proper terminology. For this series, I will provide an introduction to some of the basic concepts in parallelism, and relate them to the tools available in .NET.

Read the article

Oracle Announces Oracle Big Data Appliance X3-2 and Enhanced Oracle Big Data Connectors

- by jgelhaus

Enables Customers to Easily Harness the Business Value of Big Data at Lower Cost Engineered System Simplifies Big Data for the Enterprise Oracle Big Data Appliance X3-2 hardware features the latest 8-core Intel® Xeon E5-2600 series of processors, and compared with previous generation, the 18 compute and storage servers with 648 TB raw storage now offer: 33 percent more processing power with 288 CPU cores; 33 percent more memory per node with 1.1 TB of main memory; and up to a 30 percent reduction in power and cooling Oracle Big Data Appliance X3-2 further simplifies implementation and management of big data by integrating all the hardware and software required to acquire, organize and analyze big data. It includes: Support for CDH4.1 including software upgrades developed collaboratively with Cloudera to simplify NameNode High Availability in Hadoop, eliminating the single point of failure in a Hadoop cluster; Oracle NoSQL Database Community Edition 2.0, the latest version that brings better Hadoop integration, elastic scaling and new APIs, including JSON and C support; The Oracle Enterprise Manager plug-in for Big Data Appliance that complements Cloudera Manager to enable users to more easily manage a Hadoop cluster; Updated distributions of Oracle Linux and Oracle Java Development Kit; An updated distribution of open source R, optimized to work with high performance multi-threaded math libraries Read More Data sheet: Oracle Big Data Appliance X3-2 Oracle Big Data Appliance: Datacenter Network Integration Big Data and Natural Language: Extracting Insight From Text Thomson Reuters Discusses Oracle's Big Data Platform Connectors Integrate Hadoop with Oracle Big Data Ecosystem Oracle Big Data Connectors is a suite of software built by Oracle to integrate Apache Hadoop with Oracle Database, Oracle Data Integrator, and Oracle R Distribution. Enhancements to Oracle Big Data Connectors extend these data integration capabilities. With updates to every connector, this release includes: Oracle SQL Connector for Hadoop Distributed File System, for high performance SQL queries on Hadoop data from Oracle Database, enhanced with increased automation and querying of Hive tables and now supported within the Oracle Data Integrator Application Adapter for Hadoop; Transparent access to the Hive Query language from R and introduction of new analytic techniques executing natively in Hadoop, enabling R developers to be more productive by increasing access to Hadoop in the R environment. Read More Data sheet: Oracle Big Data Connectors High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

Read the article

Mobile Java, shiny and new: Nokia Asha and Nokia SDK 2.0

- by terrencebarr

Nokia has announced a series of new S40 phones called “Asha” – mass-market devices with smart-phone features: Good-sized touch screens, 1 GHz processors, WiFi connectivity, social networking integration, and more. Prices starting around €60 retail. In case you don’t know, the S40 series is built on Java ME and has a huge deployed base in many parts of the world where price/performance is critical. Along with the new phones, Nokia is also making available the new Nokia SDK 2.0 for Java (beta), which enables developers to build rich Java applications with multi-touch, sensor support, an improved Maps API, and the Lightweight UI Toolkit (LWUIT) (more API & tools details). Furthermore, there is a host of developer information, the remote device access service, and even a porting guide to help you port your Android app to the new Asha platform. Last, but not least: More and better options to monetize your applications. Nokia has enabled in-app advertising and in-app purchasing, and improved the way applications can be discovered by customers. Nokia has seen downloads from the Nokia app store rise by 63%, now totaling billions. From what I’m hearing, the revenue opportunities on S40 for developers are often way better than what is typical for other smart-phone platforms (where competition is huge and consumers are fickle). Cheers, – Terrence Filed under: Mobile & Embedded Tagged: Asha Series, Java ME, Java ME SDK, Mobile Java, monetization, Nokia, S40

Read the article

Ubuntu 11.04 VM shows a black screen in VMware Player

- by Roel Veldhuizen

I have a Ubuntu Server 11.04 64 bit VM running on VMware Player 3.1.4 that only shows a black screen. No matter what I try, the screen remains black. The VM has worked the first time. When I reset the machine, it shows the VMware loader and a flickering _ for about a second. Then the screen turns black again. VM settings: Memory: 512MB Processors: 1 HD: 20GB CD: auto detect Floppy: auto detect Network adapter: NAT USB controller: present soundcard: auto detect printer: present display: auto detect I just created a fresh VM and the same happens, so it seems that the problem is consistent.

Search Results

Search found 740 results on 30 pages for 'processors'.

Page 14/30 | < Previous Page | 10 11 12 13 14 15 16 17 18 19 20 21 | Next Page >

- by djn

- by dks

- by Betamoo

- by Technical Bard

- by user249490

- by hgulyan

- by Jørgen Fogh

- by veda

- by Robert Kosara

- by Andriyev

- by davit-datuashvili

- by metdos

- by black_sabbath

- by black_sabbath

- by Honglin Su

- by mv

- by Dave

- by Dave

- by Rabbit2190

- by Reed

- by Andrew J. Brehm

- by Reed

- by jgelhaus

- by terrencebarr

- by Roel Veldhuizen

< Previous Page | 10 11 12 13 14 15 16 17 18 19 20 21 | Next Page >