Search Results

Search found 24498 results on 980 pages for 'lock pages in memory'.

Page 217/980 | < Previous Page | 213 214 215 216 217 218 219 220 221 222 223 224  | Next Page >

  • Gnome 3 freezes on logon on samsung RV 509

    - by Noufal
    I have a Samsung NP-RV509 A0FIN and I tried to install GNU/Linux with gnome 3.2 on it. I tried Fedora 16, Ubuntu 11.10 and Linux Mint 12 RC, but with no success. All of these freezes upon login into gnome shell. I think it is the problem with graphics driver, so I tried xorg-edgers ppa on my last installation, ie., Linux Mint. I also tried various intel graphics packages listed on Synaptic package manager, but no success again. My device configuration is as follows(obtained from windows 7): More details about my computer Component Details Subscore Base score Processor Intel(R) Pentium(R) CPU P6200 @ 2.13GHz 5.6 4.6 Memory (RAM) 4.00 GB 7.2 Graphics Intel(R) HD Graphics 4.6 Gaming graphics 1562 MB Total available graphics memory 5.2 Primary hard disk 12GB Free (50GB Total) 5.9 Windows 7 Ultimate System -------------------------------------------------------------------------------- Manufacturer SAMSUNG ELECTRONICS CO., LTD. Model RV409/RV509/RV709 Total amount of system memory 4.00 GB RAM System type 32-bit operating system Number of processor cores 2 64-bit capable Yes Storage -------------------------------------------------------------------------------- Total size of hard disk(s) 418 GB Disk partition (C:) 12 GB Free (50 GB Total) Media drive (D:) CD/DVD Disk partition (E:) 526 MB Free (191 GB Total) Disk partition (F:) 101 GB Free (177 GB Total) Graphics -------------------------------------------------------------------------------- Display adapter type Intel(R) HD Graphics Total available graphics memory 1562 MB Dedicated graphics memory 64 MB Dedicated system memory 0 MB Shared system memory 1498 MB Display adapter driver version 8.15.10.2202 Primary monitor resolution 1366x768 DirectX version DirectX 10 Network -------------------------------------------------------------------------------- Network Adapter Realtek PCIe GBE Family Controller Network Adapter Broadcom 802.11n Network Adapter Network Adapter Microsoft Virtual WiFi Miniport Adapter Notes -------------------------------------------------------------------------------- The gaming graphics score is based on the primary graphics adapter. If this system has linked or multiple graphics adapters, some software applications may see additional performance benefits. Any help is appreciated, and thanks in advance.

    Read the article

  • The C++ Standard Template Library as a BDB Database (part 1)

    - by Gregory Burd
    If you've used C++ you undoubtedly have used the Standard Template Libraries. Designed for in-memory management of data and collections of data this is a core aspect of all C++ programs. Berkeley DB is a database library with a variety of APIs designed to ease development, one of those APIs extends and makes use of the STL for persistent, transactional data storage. dbstl is an STL standard compatible API for Berkeley DB. You can make use of Berkeley DB via this API as if you are using C++ STL classes, and still make full use of Berkeley DB features. Being an STL library backed by a database, there are some important and useful features that dbstl can provide, while the C++ STL library can't. The following are a few typical use cases to use the dbstl extensions to the C++ STL for data storage. When data exceeds available physical memory.Berkeley DB dbstl can vastly improve performance when managing a dataset which is larger than available memory. Performance suffers when the data can't reside in memory because the OS is forced to use virtual memory and swap pages of memory to disk. Switching to BDB's dbstl improves performance while allowing you to keep using STL containers. When you need concurrent access to C++ STL containers.Few existing C++ STL implementations support concurrent access (create/read/update/delete) within a container, at best you'll find support for accessing different containers of the same type concurrently. With the Berkeley DB dbstl implementation you can concurrently access your data from multiple threads or processes with confidence in the outcome. When your objects are your database.You want to have object persistence in your application, and store objects in a database, and use the objects across different runs of your application without having to translate them to/from SQL. The dbstl is capable of storing complicated objects, even those not located on a continous chunk of memory space, directly to disk without any unnecessary overhead. These are a few reasons why you should consider using Berkeley DB's C++ STL support for your embedded database application. In the next few blog posts I'll show you a few examples of this approach, it's easy to use and easy to learn.

    Read the article

  • July, the 31 Days of SQL Server DMO’s – Day 27 (sys.dm_db_file_space_usage)

    - by Tamarick Hill
    The sys.dm_db_file_space usage DMV returns information about database file space usage.  This DMV was enhanced for the 2012 version to include 3 additional columns. Let’s query this DMV against our AdventureWorks2012 database and view the results. SELECT * FROM sys.dm_db_file_space_usage The column returned from this DMV are really self-explanatory, but I will give you a description, paraphrased from books online, below. The first three columns returned from this DMV represent the Database, File, and Filegroup for the current database context that executed the DMV query. The next column is the total_page_count which represents the total number of pages in the file. The allocated_extent_page_count represents the total number of pages in all extents that have been allocated. The unallocated_extent_page_count represents the number of pages in the unallocated extents within the file. The version_store_reserved_page_count column represents the number of pages that are allocated to the version store. The user_object_reserved_page_count represents the number of pages allocated for user objects. The internal_object_reserved_page_count represents the number of pages allocated for internal objects.  Lastly is the mixed_extent_page_count which represents the total number of pages that are part of mixed extents. This is a great DMV for retrieving usage space information from your database files. For more information about this DMV, please see the below Books Online link: http://msdn.microsoft.com/en-us/library/ms174412.aspx Follow me on Twitter @PrimeTimeDBA

    Read the article

  • Organizations &amp; Architecture UNISA Studies &ndash; Chap 7

    - by MarkPearl
    Learning Outcomes Name different device categories Discuss the functions and structure of I/.O modules Describe the principles of Programmed I/O Describe the principles of Interrupt-driven I/O Describe the principles of DMA Discuss the evolution characteristic of I/O channels Describe different types of I/O interface Explain the principles of point-to-point and multipoint configurations Discuss the way in which a FireWire serial bus functions Discuss the principles of InfiniBand architecture External Devices An external device attaches to the computer by a link to an I/O module. The link is used to exchange control, status, and data between the I/O module and the external device. External devices can be classified into 3 categories… Human readable – e.g. video display Machine readable – e.g. magnetic disk Communications – e.g. wifi card I/O Modules An I/O module has two major functions… Interface to the processor and memory via the system bus or central switch Interface to one or more peripheral devices by tailored data links Module Functions The major functions or requirements for an I/O module fall into the following categories… Control and timing Processor communication Device communication Data buffering Error detection I/O function includes a control and timing requirement, to coordinate the flow of traffic between internal resources and external devices. Processor communication involves the following… Command decoding Data Status reporting Address recognition The I/O device must be able to perform device communication. This communication involves commands, status information, and data. An essential task of an I/O module is data buffering due to the relative slow speeds of most external devices. An I/O module is often responsible for error detection and for subsequently reporting errors to the processor. I/O Module Structure An I/O module functions to allow the processor to view a wide range of devices in a simple minded way. The I/O module may hide the details of timing, formats, and the electro mechanics of an external device so that the processor can function in terms of simple reads and write commands. An I/O channel/processor is an I/O module that takes on most of the detailed processing burden, presenting a high-level interface to the processor. There are 3 techniques are possible for I/O operations Programmed I/O Interrupt[t I/O DMA Access Programmed I/O When a processor is executing a program and encounters an instruction relating to I/O it executes that instruction by issuing a command to the appropriate I/O module. With programmed I/O, the I/O module will perform the requested action and then set the appropriate bits in the I/O status register. The I/O module takes no further actions to alert the processor. I/O Commands To execute an I/O related instruction, the processor issues an address, specifying the particular I/O module and external device, and an I/O command. There are four types of I/O commands that an I/O module may receive when it is addressed by a processor… Control – used to activate a peripheral and tell it what to do Test – Used to test various status conditions associated with an I/O module and its peripherals Read – Causes the I/O module to obtain an item of data from the peripheral and place it in an internal buffer Write – Causes the I/O module to take an item of data form the data bus and subsequently transmit that data item to the peripheral The main disadvantage of this technique is it is a time consuming process that keeps the processor busy needlessly I/O Instructions With programmed I/O there is a close correspondence between the I/O related instructions that the processor fetches from memory and the I/O commands that the processor issues to an I/O module to execute the instructions. Typically there will be many I/O devices connected through I/O modules to the system – each device is given a unique identifier or address – when the processor issues an I/O command, the command contains the address of the address of the desired device, thus each I/O module must interpret the address lines to determine if the command is for itself. When the processor, main memory and I/O share a common bus, two modes of addressing are possible… Memory mapped I/O Isolated I/O (for a detailed explanation read page 245 of book) The advantage of memory mapped I/O over isolated I/O is that it has a large repertoire of instructions that can be used, allowing more efficient programming. The disadvantage of memory mapped I/O over isolated I/O is that valuable memory address space is sued up. Interrupts driven I/O Interrupt driven I/O works as follows… The processor issues an I/O command to a module and then goes on to do some other useful work The I/O module will then interrupts the processor to request service when is is ready to exchange data with the processor The processor then executes the data transfer and then resumes its former processing Interrupt Processing The occurrence of an interrupt triggers a number of events, both in the processor hardware and in software. When an I/O device completes an I/O operations the following sequence of hardware events occurs… The device issues an interrupt signal to the processor The processor finishes execution of the current instruction before responding to the interrupt The processor tests for an interrupt – determines that there is one – and sends an acknowledgement signal to the device that issues the interrupt. The acknowledgement allows the device to remove its interrupt signal The processor now needs to prepare to transfer control to the interrupt routine. To begin, it needs to save information needed to resume the current program at the point of interrupt. The minimum information required is the status of the processor and the location of the next instruction to be executed. The processor now loads the program counter with the entry location of the interrupt-handling program that will respond to this interrupt. It also saves the values of the process registers because the Interrupt operation may modify these The interrupt handler processes the interrupt – this includes examination of status information relating to the I/O operation or other event that caused an interrupt When interrupt processing is complete, the saved register values are retrieved from the stack and restored to the registers Finally, the PSW and program counter values from the stack are restored. Design Issues Two design issues arise in implementing interrupt I/O Because there will be multiple I/O modules, how does the processor determine which device issued the interrupt? If multiple interrupts have occurred, how does the processor decide which one to process? Addressing device recognition, 4 general categories of techniques are in common use… Multiple interrupt lines Software poll Daisy chain Bus arbitration For a detailed explanation of these approaches read page 250 of the textbook. Interrupt driven I/O while more efficient than simple programmed I/O still requires the active intervention of the processor to transfer data between memory and an I/O module, and any data transfer must traverse a path through the processor. Thus is suffers from two inherent drawbacks… The I/O transfer rate is limited by the speed with which the processor can test and service a device The processor is tied up in managing an I/O transfer; a number of instructions must be executed for each I/O transfer Direct Memory Access When large volumes of data are to be moved, an efficient technique is direct memory access (DMA) DMA Function DMA involves an additional module on the system bus. The DMA module is capable of mimicking the processor and taking over control of the system from the processor. It needs to do this to transfer data to and from memory over the system bus. DMA must the bus only when the processor does not need it, or it must force the processor to suspend operation temporarily (most common – referred to as cycle stealing). When the processor wishes to read or write a block of data, it issues a command to the DMA module by sending to the DMA module the following information… Whether a read or write is requested using the read or write control line between the processor and the DMA module The address of the I/O device involved, communicated on the data lines The starting location in memory to read from or write to, communicated on the data lines and stored by the DMA module in its address register The number of words to be read or written, communicated via the data lines and stored in the data count register The processor then continues with other work, it delegates the I/O operation to the DMA module which transfers the entire block of data, one word at a time, directly to or from memory without going through the processor. When the transfer is complete, the DMA module sends an interrupt signal to the processor, this the processor is involved only at the beginning and end of the transfer. I/O Channels and Processors Characteristics of I/O Channels As one proceeds along the evolutionary path, more and more of the I/O function is performed without CPU involvement. The I/O channel represents an extension of the DMA concept. An I/O channel ahs the ability to execute I/O instructions, which gives it complete control over I/O operations. In a computer system with such devices, the CPU does not execute I/O instructions – such instructions are stored in main memory to be executed by a special purpose processor in the I/O channel itself. Two types of I/O channels are common A selector channel controls multiple high-speed devices. A multiplexor channel can handle I/O with multiple characters as fast as possible to multiple devices. The external interface: FireWire and InfiniBand Types of Interfaces One major characteristic of the interface is whether it is serial or parallel parallel interface – there are multiple lines connecting the I/O module and the peripheral, and multiple bits are transferred simultaneously serial interface – there is only one line used to transmit data, and bits must be transmitted one at a time With new generation serial interfaces, parallel interfaces are becoming less common. In either case, the I/O module must engage in a dialogue with the peripheral. In general terms the dialog may look as follows… The I/O module sends a control signal requesting permission to send data The peripheral acknowledges the request The I/O module transfers data The peripheral acknowledges receipt of data For a detailed explanation of FireWire and InfiniBand technology read page 264 – 270 of the textbook

    Read the article

  • Restrictive routing best practices for Google App Engine with python?

    - by Aleksandr Makov
    Say I have a simple structure: app = webapp2.WSGIApplication([ (r'/', 'pages.login'), (r'/profile', 'pages.profile'), (r'/dashboard', 'pages.dash'), ], debug=True) Basically all pages require authentication except for the login. If visitor tries to reach a restrictive page and he isn't authorized (or lacks privileges) then he gets redirected to the login view. The question is about the routing design. Should I check the auth and ACL privs in each of the modules (pages.profile and pages.dash from example above), or just pass all requests through the single routing mechanism: app = webapp2.WSGIApplication([ (r'/', 'pages.login'), (r'/.+', 'router') ], debug=True) I'm still quite new to the GAE, but my app requires authentication as well as ACL. I'm aware that there's login directive on the server config level, but I don't know how it works and how I can tight it with my ACL logic and what's worse I cannot estimate time needed to get it running. Besides, it looks only to provide only 2 user groups: admin and user. In any case, that's the configuration I use: handlers: - url: /favicon.ico static_files: static/favicon.ico upload: static/favicon.ico - url: /static/* static_dir: static - url: .* script: main.app secure: always Or I miss something here and ACL can be set in the config file? Thanks.

    Read the article

  • More Denali Execution Plan Warning Goodies

    - by Dave Ballantyne
    In my last blog, I showed how the execution plan in denali has been enhanced by 2 new warnings ,conversion affecting cardinality and conversion affecting seek, which are shown when a data type conversion has happened either implicitly or explicitly. That is not all though, there is more .  Also added are two warnings when performance has been affected due to memory issues. Memory spills to tempdb are a costly operation and happen when SqlServer is under memory pressure and needs to free some up. For a long time you have been able to see these as warnings in a profiler trace as a sort or hash warning event,  but now they are included right in the execution plan.  Not only that but also you can see which operator caused the spill , not just which statement.  Pretty damn handy. Another cause of performance problems relating to memory are memory grant waits.  Here is an informative write up on them,  but simply speaking , SQLServer has to allocate a certain amount of memory for each statement. If it is unable to you get a “memory grant wait”.  Once again there are other methods of analyzing these,  but the plan now shows these too. Don't worry that’s not real production code There is one other new warning that is of interest to me, “Unmatched Indexes”.  Once I find out the conditions under which that fires ill blog about it.

    Read the article

  • Series On Embedded Development (Part 1)

    - by user12612705
    This is the first in a series of entries on developing applications for the embedded environment. Most of this information is relevant to any type of embedded development (and even for desktop and server too), not just Java. This information is based on a talk Hinkmond Wong and I gave at JavaOne 2012 entitled Reducing Dynamic Memory in Java Embedded Applications. One thing to remember when developing embeddded applications is that memory matters. Yes, memory matters in desktop and server environments as well, but there's just plain less of it in embedded devices. So I'm going to be talking about saving this precious resource as well as another precious resource, CPU cycles...and a bit about power too. CPU matters too, and again, in embedded devices, there's just plain less of it. What you'll find, no surprise, is that there's a trade-off between performance and memory. To get better performance, you need to use more memory, and to save more memory, you need to need to use more CPU cycles. I'll be discussing three Memory Reduction Categories: - Optionality, both build-time and runtime. Optionality is about providing options so you can get rid of the stuff you don't need and include the stuff you do need. - Tunability, which is about providing options so you can tune your application by trading performance for size, and vice-versa. - Efficiency, which is about balancing size savings with performance.

    Read the article

  • ORA-4030 Troubleshooting

    - by [email protected]
    QUICKLINK: Note 399497.1 FAQ ORA-4030 Note 1088087.1 : ORA-4030 Diagnostic Tools [Video]   Have you observed an ORA-0430 error reported in your alert log? ORA-4030 errors are raised when memory or resources are requested from the Operating System and the Operating System is unable to provide the memory or resources.   The arguments included with the ORA-4030 are often important to narrowing down the problem. For more specifics on the ORA-4030 error and scenarios that lead to this problem, see Note 399497.1 FAQ ORA-4030.   Looking for the best way to diagnose? There are several available diagnostic tools (error tracing, 11g Diagnosibility, OCM, Process Memory Guides, RDA, OSW, diagnostic scripts) that collectively can prove powerful for identifying the cause of the ORA-4030.    Error Tracing   The ORA-4030 error usually occurs on the client workstation and for this reason, a trace file and alert log entry may not have been generated on the server side.  It may be necessary to add additional tracing events to get initial diagnostics on the problem. To setup tracing to trap the ORA-4030, on the server use the following in SQLPlus: alter system set events '4030 trace name heapdump level 536870917;name errorstack level 3';Once the error reoccurs with the event set, you can turn off  tracing using the following command in SQLPlus:alter system set events '4030 trace name context off; name context off';NOTE:   See more diagnostics information to collect in Note 399497.1  11g DiagnosibilityStarting with Oracle Database 11g Release 1, the Diagnosability infrastructure was introduced which places traces and core files into a location controlled by the DIAGNOSTIC_DEST initialization parameter when an incident, such as an ORA-4030 occurs.  For earlier versions, the trace file will be written to either USER_DUMP_DEST (if the error was caught in a user process) or BACKGROUND_DUMP_DEST (if the error was caught in a background process like PMON or SMON). The trace file may contain vital information about what led to the error condition.    Note 443529.1 11g Quick Steps to Package and Send Critical Error Diagnostic Informationto Support[Video]  Oracle Configuration Manager (OCM) Oracle Configuration Manager (OCM) works with My Oracle Support to enable proactive support capability that helps you organize, collect and manage your Oracle configurations. Oracle Configuration Manager Quick Start Guide Note 548815.1: My Oracle Support Configuration Management FAQ Note 250434.1: BULLETIN: Learn More About My Oracle Support Configuration Manager    General Process Memory Guides   An ORA-4030 indicates a limit has been reached with respect to the Oracle process private memory allocation.    Each Operating System will handle memory allocations with Oracle slightly differently. Solaris     Note 163763.1Linux       Note 341782.1IBM AIX   Notes 166491.1 and 123754.1HP           Note 166490.1Windows Note 225349.1, Note 373602.1, Note 231159.1, Note 269495.1, Note 762031.1Generic    Note 169706.1   RDAThe RDA report will show more detailed information about the database and Server Configuration. Note 414966.1 RDA Documentation Index Download RDA -- refer to Note 314422.1 Remote Diagnostic Agent (RDA) 4 - Getting Started OS Watcher (OSW)This tool is designed to gather Operating System side statistics to compare with the findings from the database.  This is a key tool in cases where memory usage is higher than expected on the server while not experiencing ORA-4030 errors currently. Reference more details on setup and usage in Note 301137.1 OS Watcher User Guide Diagnostic Scripts   Refer to Note 1088087.1 : ORA-4030 Diagnostic Tools [Video] Common Causes/Solutions The ORA-4030 can occur for a variety of reasons.  Some common causes are:   * OS Memory limit reached such as physical memory and/or swap/virtual paging.   For instance, IBM AIX can experience ORA-4030 issues related to swap scenarios.  See Note 740603.1 10.2.0.4 not using large pages on AIX for more on that problem. Also reference Note 188149.1 for pointers on 10g and stack size issues.* OS limits reached (kernel or user shell limits) that limit overall, user level or process level memory * OS limit on PGA memory size due to SGA attach address           Reference: Note 1028623.6 SOLARIS How to Relocate the SGA* Oracle internal limit on functionality like PL/SQL varrays or bulk collections. ORA-4030 errors will include arguments like "pl/sql vc2" "pmucalm coll" "pmuccst: adt/re".  See Coding Pointers for pointers on application design to get around these issues* Application design causing limits to be reached* Bug - space leaks, heap leaks   ***For reference to the content in this blog, refer to Note.1088267.1 Master Note for Diagnosing ORA-4030

    Read the article

  • SPARC M7 Chip - 32 cores - Mind Blowing performance

    - by Angelo-Oracle
    The M7 Chip Oracle just announced its Next Generation Processor at the HotChips HC26 conference. As the Tech Lead in our Systems Division's Partner group, I had a front row seat to the extraordinary price performance advantage of Oracle current T5 and M6 based systems. Partner after partner tested  these systems and were impressed with it performance. Just read some of the quotes to see what our partner has been saying about our hardware. We just announced our next generation processor, the M7. This has 32 cores (up from 16-cores in T5 and 12-cores in M6). With 20 nm technology  this is our most advanced processor. The processor has more cores than anything else in the industry today. After the Sun acquisition Oracle has released 5 processors in 4 years and this is the 6th.  The S4 core  The M7 is built using the foundation of the S4 core. This is the next generation core technology. Like its predecessor, the S4 has 8 dynamic threads. It increases the frequency while maintaining the Pipeline depth. Each core has its own fine grain power estimator that keeps the core within its power envelop in 250 nano-sec granularity. Each core also includes Software in Silicon features for Application Acceleration Support. Each core includes features to improve Application Data Integrity, with almost no performance loss. The core also allows using part of the Virtual Address to store meta-data.  User-Level Synchronization Instructions are also part of the S4 core. Each core has 16 KB Instruction and 16 KB Data L1 cache. The Core Clusters  The cores on the M7 chip are organized in sets of 4-core clusters. The core clusters share  L2 cache.  All four cores in the complex share 256 KB of 4 way set associative L2 Instruction Cache, with over 1/2 TB/s of throughput. Two cores share 256 KB of 8 way set associative L2 Data Cache, with over 1/2 TB/s of throughput. With this innovative Core Cluster architecture, the M7 doubles core execution bandwidth. to maximize per-thread performance.  The Chip  Each  M7 chip has 8 sets of these core-clusters. The chip has 64 MB on-chip L3 cache. This L3 caches is shared among all the cores and is partitioned into 8 x 8 MB chunks. Each chunk is  8-way set associative cache. The aggregate bandwidth for the L3 cache on the chip is over 1.6TB/s. Each chip has 4 DDR4 memory controllers and can support upto 16 DDR4 DIMMs, allowing for 2 TB of RAM/chip. The chip also includes 4 internal links of PCIe Gen3 I/O controllers.  Each chip has 7 coherence links, allowing for 8 of these chips to be connected together gluelessly. Also 32 of these chips can be connected in an SMP configuration. A potential system with 32 chips will have 1024 cores and 8192 threads and 64 TB of RAM.  Software in Silicon The M7 chip has many built in Application Accelerators in Silicon. These features will be exposed to our Software partners using the SPARC Accelerator Program.  The M7  has built-in logic to decompress data at the speed of memory access. This means that applications can directly work on compressed data in memory increasing the data access rates. The VA Masking feature allows the use of part of the virtual address to store meta-data.  Realtime Application Data Integrity The Realtime Application Data Integrity feature helps applications safeguard against invalid, stale memory reference and buffer overflows. The first 4-bits if the Pointer can be used to store a version number and this version number is also maintained in the memory & cache lines. When a pointer accesses memory the hardware checks to make sure the two versions match. A SEGV signal is raised when there is a mismatch. This feature can be used by the Database, applications and the OS.  M7 Database In-Memory Query Accelerator The M7 chip also includes a In-Silicon Query Engines.  These accelerate tasks that work on In-Memory Columnar Vectors. Oracle In-Memory options stores data in Column Format. The M7 Query Engine can speed up In-Memory Format Conversion, Value and Range Comparisons and Set Membership lookups. This engine can work on Compressed data - this means not only are we accelerating the query performance but also increasing the memory bandwidth for queries.  SPARC Accelerated Program  At the Hotchips conference we also introduced the SPARC Accelerated Program to provide our partners and third part developers access to all the goodness of the M7's SPARC Application Acceleration features. Please get in touch with us if you are interested in knowing more about this program. 

    Read the article

  • How can I make Internet Explorer 6 render Web pages like Internet Explorer 11?

    - by gparyani
    Now, I know that this may seem like a bad question in that I can just upgrade to Internet Explorer 8, but I am sticking with IE6 in that IE8 removes valuable features, like the ability to save favorites offline and the fact that a file path turns into a Windows Explorer window and typing a Web address into Windows Explorer changes it into an IE window. I know that Internet Explorer 6 does a really bad job at rendering some pages. I know of the Google Chrome Frame extension that brings Chrome-style rendering into IE, but that will soon be discontinued. So, I tried another thing: I know that C:\Windows\System32\mshtml.dll contains the Trident rendering engine that is used by IE, so I tried something: I first backed up the original file by renaming it on Windows XP to mshtml-old.dll, then I tried to copy in the DLL from a computer running Windows 7 with Internet Explorer 10. I noticed that, after copying, the system had replaced the new DLL with the old one, but left the one I backed up intact. Is there any way I can get the system to not replace the DLL like that so that I can transfer in IE11's mshtml.dll into Windows XP and make IE6 render like IE11? I'm looking for an answer that describes how to tweak my system to make IE6 render like IE11 (or IE10), not one that tells me to upgrade IE or install another browser. I don't care how tedious the method is, just as long as it works.

    Read the article

  • How John Got 15x Improvement Without Really Trying

    - by rchrd
    The following article was published on a Sun Microsystems website a number of years ago by John Feo. It is still useful and worth preserving. So I'm republishing it here.  How I Got 15x Improvement Without Really Trying John Feo, Sun Microsystems Taking ten "personal" program codes used in scientific and engineering research, the author was able to get from 2 to 15 times performance improvement easily by applying some simple general optimization techniques. Introduction Scientific research based on computer simulation depends on the simulation for advancement. The research can advance only as fast as the computational codes can execute. The codes' efficiency determines both the rate and quality of results. In the same amount of time, a faster program can generate more results and can carry out a more detailed simulation of physical phenomena than a slower program. Highly optimized programs help science advance quickly and insure that monies supporting scientific research are used as effectively as possible. Scientific computer codes divide into three broad categories: ISV, community, and personal. ISV codes are large, mature production codes developed and sold commercially. The codes improve slowly over time both in methods and capabilities, and they are well tuned for most vendor platforms. Since the codes are mature and complex, there are few opportunities to improve their performance solely through code optimization. Improvements of 10% to 15% are typical. Examples of ISV codes are DYNA3D, Gaussian, and Nastran. Community codes are non-commercial production codes used by a particular research field. Generally, they are developed and distributed by a single academic or research institution with assistance from the community. Most users just run the codes, but some develop new methods and extensions that feed back into the general release. The codes are available on most vendor platforms. Since these codes are younger than ISV codes, there are more opportunities to optimize the source code. Improvements of 50% are not unusual. Examples of community codes are AMBER, CHARM, BLAST, and FASTA. Personal codes are those written by single users or small research groups for their own use. These codes are not distributed, but may be passed from professor-to-student or student-to-student over several years. They form the primordial ocean of applications from which community and ISV codes emerge. Government research grants pay for the development of most personal codes. This paper reports on the nature and performance of this class of codes. Over the last year, I have looked at over two dozen personal codes from more than a dozen research institutions. The codes cover a variety of scientific fields, including astronomy, atmospheric sciences, bioinformatics, biology, chemistry, geology, and physics. The sources range from a few hundred lines to more than ten thousand lines, and are written in Fortran, Fortran 90, C, and C++. For the most part, the codes are modular, documented, and written in a clear, straightforward manner. They do not use complex language features, advanced data structures, programming tricks, or libraries. I had little trouble understanding what the codes did or how data structures were used. Most came with a makefile. Surprisingly, only one of the applications is parallel. All developers have access to parallel machines, so availability is not an issue. Several tried to parallelize their applications, but stopped after encountering difficulties. Lack of education and a perception that parallelism is difficult prevented most from trying. I parallelized several of the codes using OpenMP, and did not judge any of the codes as difficult to parallelize. Even more surprising than the lack of parallelism is the inefficiency of the codes. I was able to get large improvements in performance in a matter of a few days applying simple optimization techniques. Table 1 lists ten representative codes [names and affiliation are omitted to preserve anonymity]. Improvements on one processor range from 2x to 15.5x with a simple average of 4.75x. I did not use sophisticated performance tools or drill deep into the program's execution character as one would do when tuning ISV or community codes. Using only a profiler and source line timers, I identified inefficient sections of code and improved their performance by inspection. The changes were at a high level. I am sure there is another factor of 2 or 3 in each code, and more if the codes are parallelized. The study’s results show that personal scientific codes are running many times slower than they should and that the problem is pervasive. Computational scientists are not sloppy programmers; however, few are trained in the art of computer programming or code optimization. I found that most have a working knowledge of some programming language and standard software engineering practices; but they do not know, or think about, how to make their programs run faster. They simply do not know the standard techniques used to make codes run faster. In fact, they do not even perceive that such techniques exist. The case studies described in this paper show that applying simple, well known techniques can significantly increase the performance of personal codes. It is important that the scientific community and the Government agencies that support scientific research find ways to better educate academic scientific programmers. The inefficiency of their codes is so bad that it is retarding both the quality and progress of scientific research. # cacheperformance redundantoperations loopstructures performanceimprovement 1 x x 15.5 2 x 2.8 3 x x 2.5 4 x 2.1 5 x x 2.0 6 x 5.0 7 x 5.8 8 x 6.3 9 2.2 10 x x 3.3 Table 1 — Area of improvement and performance gains of 10 codes The remainder of the paper is organized as follows: sections 2, 3, and 4 discuss the three most common sources of inefficiencies in the codes studied. These are cache performance, redundant operations, and loop structures. Each section includes several examples. The last section summaries the work and suggests a possible solution to the issues raised. Optimizing cache performance Commodity microprocessor systems use caches to increase memory bandwidth and reduce memory latencies. Typical latencies from processor to L1, L2, local, and remote memory are 3, 10, 50, and 200 cycles, respectively. Moreover, bandwidth falls off dramatically as memory distances increase. Programs that do not use cache effectively run many times slower than programs that do. When optimizing for cache, the biggest performance gains are achieved by accessing data in cache order and reusing data to amortize the overhead of cache misses. Secondary considerations are prefetching, associativity, and replacement; however, the understanding and analysis required to optimize for the latter are probably beyond the capabilities of the non-expert. Much can be gained simply by accessing data in the correct order and maximizing data reuse. 6 out of the 10 codes studied here benefited from such high level optimizations. Array Accesses The most important cache optimization is the most basic: accessing Fortran array elements in column order and C array elements in row order. Four of the ten codes—1, 2, 4, and 10—got it wrong. Compilers will restructure nested loops to optimize cache performance, but may not do so if the loop structure is too complex, or the loop body includes conditionals, complex addressing, or function calls. In code 1, the compiler failed to invert a key loop because of complex addressing do I = 0, 1010, delta_x IM = I - delta_x IP = I + delta_x do J = 5, 995, delta_x JM = J - delta_x JP = J + delta_x T1 = CA1(IP, J) + CA1(I, JP) T2 = CA1(IM, J) + CA1(I, JM) S1 = T1 + T2 - 4 * CA1(I, J) CA(I, J) = CA1(I, J) + D * S1 end do end do In code 2, the culprit is conditionals do I = 1, N do J = 1, N If (IFLAG(I,J) .EQ. 0) then T1 = Value(I, J-1) T2 = Value(I-1, J) T3 = Value(I, J) T4 = Value(I+1, J) T5 = Value(I, J+1) Value(I,J) = 0.25 * (T1 + T2 + T5 + T4) Delta = ABS(T3 - Value(I,J)) If (Delta .GT. MaxDelta) MaxDelta = Delta endif enddo enddo I fixed both programs by inverting the loops by hand. Code 10 has three-dimensional arrays and triply nested loops. The structure of the most computationally intensive loops is too complex to invert automatically or by hand. The only practical solution is to transpose the arrays so that the dimension accessed by the innermost loop is in cache order. The arrays can be transposed at construction or prior to entering a computationally intensive section of code. The former requires all array references to be modified, while the latter is cost effective only if the cost of the transpose is amortized over many accesses. I used the second approach to optimize code 10. Code 5 has four-dimensional arrays and loops are nested four deep. For all of the reasons cited above the compiler is not able to restructure three key loops. Assume C arrays and let the four dimensions of the arrays be i, j, k, and l. In the original code, the index structure of the three loops is L1: for i L2: for i L3: for i for l for l for j for k for j for k for j for k for l So only L3 accesses array elements in cache order. L1 is a very complex loop—much too complex to invert. I brought the loop into cache alignment by transposing the second and fourth dimensions of the arrays. Since the code uses a macro to compute all array indexes, I effected the transpose at construction and changed the macro appropriately. The dimensions of the new arrays are now: i, l, k, and j. L3 is a simple loop and easily inverted. L2 has a loop-carried scalar dependence in k. By promoting the scalar name that carries the dependence to an array, I was able to invert the third and fourth subloops aligning the loop with cache. Code 5 is by far the most difficult of the four codes to optimize for array accesses; but the knowledge required to fix the problems is no more than that required for the other codes. I would judge this code at the limits of, but not beyond, the capabilities of appropriately trained computational scientists. Array Strides When a cache miss occurs, a line (64 bytes) rather than just one word is loaded into the cache. If data is accessed stride 1, than the cost of the miss is amortized over 8 words. Any stride other than one reduces the cost savings. Two of the ten codes studied suffered from non-unit strides. The codes represent two important classes of "strided" codes. Code 1 employs a multi-grid algorithm to reduce time to convergence. The grids are every tenth, fifth, second, and unit element. Since time to convergence is inversely proportional to the distance between elements, coarse grids converge quickly providing good starting values for finer grids. The better starting values further reduce the time to convergence. The downside is that grids of every nth element, n > 1, introduce non-unit strides into the computation. In the original code, much of the savings of the multi-grid algorithm were lost due to this problem. I eliminated the problem by compressing (copying) coarse grids into continuous memory, and rewriting the computation as a function of the compressed grid. On convergence, I copied the final values of the compressed grid back to the original grid. The savings gained from unit stride access of the compressed grid more than paid for the cost of copying. Using compressed grids, the loop from code 1 included in the previous section becomes do j = 1, GZ do i = 1, GZ T1 = CA(i+0, j-1) + CA(i-1, j+0) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) S1 = T1 + T4 - 4 * CA1(i+0, j+0) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 enddo enddo where CA and CA1 are compressed arrays of size GZ. Code 7 traverses a list of objects selecting objects for later processing. The labels of the selected objects are stored in an array. The selection step has unit stride, but the processing steps have irregular stride. A fix is to save the parameters of the selected objects in temporary arrays as they are selected, and pass the temporary arrays to the processing functions. The fix is practical if the same parameters are used in selection as in processing, or if processing comprises a series of distinct steps which use overlapping subsets of the parameters. Both conditions are true for code 7, so I achieved significant improvement by copying parameters to temporary arrays during selection. Data reuse In the previous sections, we optimized for spatial locality. It is also important to optimize for temporal locality. Once read, a datum should be used as much as possible before it is forced from cache. Loop fusion and loop unrolling are two techniques that increase temporal locality. Unfortunately, both techniques increase register pressure—as loop bodies become larger, the number of registers required to hold temporary values grows. Once register spilling occurs, any gains evaporate quickly. For multiprocessors with small register sets or small caches, the sweet spot can be very small. In the ten codes presented here, I found no opportunities for loop fusion and only two opportunities for loop unrolling (codes 1 and 3). In code 1, unrolling the outer and inner loop one iteration increases the number of result values computed by the loop body from 1 to 4, do J = 1, GZ-2, 2 do I = 1, GZ-2, 2 T1 = CA1(i+0, j-1) + CA1(i-1, j+0) T2 = CA1(i+1, j-1) + CA1(i+0, j+0) T3 = CA1(i+0, j+0) + CA1(i-1, j+1) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) T5 = CA1(i+2, j+0) + CA1(i+1, j+1) T6 = CA1(i+1, j+1) + CA1(i+0, j+2) T7 = CA1(i+2, j+1) + CA1(i+1, j+2) S1 = T1 + T4 - 4 * CA1(i+0, j+0) S2 = T2 + T5 - 4 * CA1(i+1, j+0) S3 = T3 + T6 - 4 * CA1(i+0, j+1) S4 = T4 + T7 - 4 * CA1(i+1, j+1) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 CA(i+1, j+0) = CA1(i+1, j+0) + DD * S2 CA(i+0, j+1) = CA1(i+0, j+1) + DD * S3 CA(i+1, j+1) = CA1(i+1, j+1) + DD * S4 enddo enddo The loop body executes 12 reads, whereas as the rolled loop shown in the previous section executes 20 reads to compute the same four values. In code 3, two loops are unrolled 8 times and one loop is unrolled 4 times. Here is the before for (k = 0; k < NK[u]; k++) { sum = 0.0; for (y = 0; y < NY; y++) { sum += W[y][u][k] * delta[y]; } backprop[i++]=sum; } and after code for (k = 0; k < KK - 8; k+=8) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (y = 0; y < NY; y++) { sum0 += W[y][0][k+0] * delta[y]; sum1 += W[y][0][k+1] * delta[y]; sum2 += W[y][0][k+2] * delta[y]; sum3 += W[y][0][k+3] * delta[y]; sum4 += W[y][0][k+4] * delta[y]; sum5 += W[y][0][k+5] * delta[y]; sum6 += W[y][0][k+6] * delta[y]; sum7 += W[y][0][k+7] * delta[y]; } backprop[k+0] = sum0; backprop[k+1] = sum1; backprop[k+2] = sum2; backprop[k+3] = sum3; backprop[k+4] = sum4; backprop[k+5] = sum5; backprop[k+6] = sum6; backprop[k+7] = sum7; } for one of the loops unrolled 8 times. Optimizing for temporal locality is the most difficult optimization considered in this paper. The concepts are not difficult, but the sweet spot is small. Identifying where the program can benefit from loop unrolling or loop fusion is not trivial. Moreover, it takes some effort to get it right. Still, educating scientific programmers about temporal locality and teaching them how to optimize for it will pay dividends. Reducing instruction count Execution time is a function of instruction count. Reduce the count and you usually reduce the time. The best solution is to use a more efficient algorithm; that is, an algorithm whose order of complexity is smaller, that converges quicker, or is more accurate. Optimizing source code without changing the algorithm yields smaller, but still significant, gains. This paper considers only the latter because the intent is to study how much better codes can run if written by programmers schooled in basic code optimization techniques. The ten codes studied benefited from three types of "instruction reducing" optimizations. The two most prevalent were hoisting invariant memory and data operations out of inner loops. The third was eliminating unnecessary data copying. The nature of these inefficiencies is language dependent. Memory operations The semantics of C make it difficult for the compiler to determine all the invariant memory operations in a loop. The problem is particularly acute for loops in functions since the compiler may not know the values of the function's parameters at every call site when compiling the function. Most compilers support pragmas to help resolve ambiguities; however, these pragmas are not comprehensive and there is no standard syntax. To guarantee that invariant memory operations are not executed repetitively, the user has little choice but to hoist the operations by hand. The problem is not as severe in Fortran programs because in the absence of equivalence statements, it is a violation of the language's semantics for two names to share memory. Codes 3 and 5 are C programs. In both cases, the compiler did not hoist all invariant memory operations from inner loops. Consider the following loop from code 3 for (y = 0; y < NY; y++) { i = 0; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += delta[y] * I1[i++]; } } } Since dW[y][u] can point to the same memory space as delta for one or more values of y and u, assignment to dW[y][u][k] may change the value of delta[y]. In reality, dW and delta do not overlap in memory, so I rewrote the loop as for (y = 0; y < NY; y++) { i = 0; Dy = delta[y]; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += Dy * I1[i++]; } } } Failure to hoist invariant memory operations may be due to complex address calculations. If the compiler can not determine that the address calculation is invariant, then it can hoist neither the calculation nor the associated memory operations. As noted above, code 5 uses a macro to address four-dimensional arrays #define MAT4D(a,q,i,j,k) (double *)((a)->data + (q)*(a)->strides[0] + (i)*(a)->strides[3] + (j)*(a)->strides[2] + (k)*(a)->strides[1]) The macro is too complex for the compiler to understand and so, it does not identify any subexpressions as loop invariant. The simplest way to eliminate the address calculation from the innermost loop (over i) is to define a0 = MAT4D(a,q,0,j,k) before the loop and then replace all instances of *MAT4D(a,q,i,j,k) in the loop with a0[i] A similar problem appears in code 6, a Fortran program. The key loop in this program is do n1 = 1, nh nx1 = (n1 - 1) / nz + 1 nz1 = n1 - nz * (nx1 - 1) do n2 = 1, nh nx2 = (n2 - 1) / nz + 1 nz2 = n2 - nz * (nx2 - 1) ndx = nx2 - nx1 ndy = nz2 - nz1 gxx = grn(1,ndx,ndy) gyy = grn(2,ndx,ndy) gxy = grn(3,ndx,ndy) balance(n1,1) = balance(n1,1) + (force(n2,1) * gxx + force(n2,2) * gxy) * h1 balance(n1,2) = balance(n1,2) + (force(n2,1) * gxy + force(n2,2) * gyy)*h1 end do end do The programmer has written this loop well—there are no loop invariant operations with respect to n1 and n2. However, the loop resides within an iterative loop over time and the index calculations are independent with respect to time. Trading space for time, I precomputed the index values prior to the entering the time loop and stored the values in two arrays. I then replaced the index calculations with reads of the arrays. Data operations Ways to reduce data operations can appear in many forms. Implementing a more efficient algorithm produces the biggest gains. The closest I came to an algorithm change was in code 4. This code computes the inner product of K-vectors A(i) and B(j), 0 = i < N, 0 = j < M, for most values of i and j. Since the program computes most of the NM possible inner products, it is more efficient to compute all the inner products in one triply-nested loop rather than one at a time when needed. The savings accrue from reading A(i) once for all B(j) vectors and from loop unrolling. for (i = 0; i < N; i+=8) { for (j = 0; j < M; j++) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (k = 0; k < K; k++) { sum0 += A[i+0][k] * B[j][k]; sum1 += A[i+1][k] * B[j][k]; sum2 += A[i+2][k] * B[j][k]; sum3 += A[i+3][k] * B[j][k]; sum4 += A[i+4][k] * B[j][k]; sum5 += A[i+5][k] * B[j][k]; sum6 += A[i+6][k] * B[j][k]; sum7 += A[i+7][k] * B[j][k]; } C[i+0][j] = sum0; C[i+1][j] = sum1; C[i+2][j] = sum2; C[i+3][j] = sum3; C[i+4][j] = sum4; C[i+5][j] = sum5; C[i+6][j] = sum6; C[i+7][j] = sum7; }} This change requires knowledge of a typical run; i.e., that most inner products are computed. The reasons for the change, however, derive from basic optimization concepts. It is the type of change easily made at development time by a knowledgeable programmer. In code 5, we have the data version of the index optimization in code 6. Here a very expensive computation is a function of the loop indices and so cannot be hoisted out of the loop; however, the computation is invariant with respect to an outer iterative loop over time. We can compute its value for each iteration of the computation loop prior to entering the time loop and save the values in an array. The increase in memory required to store the values is small in comparison to the large savings in time. The main loop in Code 8 is doubly nested. The inner loop includes a series of guarded computations; some are a function of the inner loop index but not the outer loop index while others are a function of the outer loop index but not the inner loop index for (j = 0; j < N; j++) { for (i = 0; i < M; i++) { r = i * hrmax; R = A[j]; temp = (PRM[3] == 0.0) ? 1.0 : pow(r, PRM[3]); high = temp * kcoeff * B[j] * PRM[2] * PRM[4]; low = high * PRM[6] * PRM[6] / (1.0 + pow(PRM[4] * PRM[6], 2.0)); kap = (R > PRM[6]) ? high * R * R / (1.0 + pow(PRM[4]*r, 2.0) : low * pow(R/PRM[6], PRM[5]); < rest of loop omitted > }} Note that the value of temp is invariant to j. Thus, we can hoist the computation for temp out of the loop and save its values in an array. for (i = 0; i < M; i++) { r = i * hrmax; TEMP[i] = pow(r, PRM[3]); } [N.B. – the case for PRM[3] = 0 is omitted and will be reintroduced later.] We now hoist out of the inner loop the computations invariant to i. Since the conditional guarding the value of kap is invariant to i, it behooves us to hoist the computation out of the inner loop, thereby executing the guard once rather than M times. The final version of the code is for (j = 0; j < N; j++) { R = rig[j] / 1000.; tmp1 = kcoeff * par[2] * beta[j] * par[4]; tmp2 = 1.0 + (par[4] * par[4] * par[6] * par[6]); tmp3 = 1.0 + (par[4] * par[4] * R * R); tmp4 = par[6] * par[6] / tmp2; tmp5 = R * R / tmp3; tmp6 = pow(R / par[6], par[5]); if ((par[3] == 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp5; } else if ((par[3] == 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp4 * tmp6; } else if ((par[3] != 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp5; } else if ((par[3] != 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp4 * tmp6; } for (i = 0; i < M; i++) { kap = KAP[i]; r = i * hrmax; < rest of loop omitted > } } Maybe not the prettiest piece of code, but certainly much more efficient than the original loop, Copy operations Several programs unnecessarily copy data from one data structure to another. This problem occurs in both Fortran and C programs, although it manifests itself differently in the two languages. Code 1 declares two arrays—one for old values and one for new values. At the end of each iteration, the array of new values is copied to the array of old values to reset the data structures for the next iteration. This problem occurs in Fortran programs not included in this study and in both Fortran 77 and Fortran 90 code. Introducing pointers to the arrays and swapping pointer values is an obvious way to eliminate the copying; but pointers is not a feature that many Fortran programmers know well or are comfortable using. An easy solution not involving pointers is to extend the dimension of the value array by 1 and use the last dimension to differentiate between arrays at different times. For example, if the data space is N x N, declare the array (N, N, 2). Then store the problem’s initial values in (_, _, 2) and define the scalar names new = 2 and old = 1. At the start of each iteration, swap old and new to reset the arrays. The old–new copy problem did not appear in any C program. In programs that had new and old values, the code swapped pointers to reset data structures. Where unnecessary coping did occur is in structure assignment and parameter passing. Structures in C are handled much like scalars. Assignment causes the data space of the right-hand name to be copied to the data space of the left-hand name. Similarly, when a structure is passed to a function, the data space of the actual parameter is copied to the data space of the formal parameter. If the structure is large and the assignment or function call is in an inner loop, then copying costs can grow quite large. While none of the ten programs considered here manifested this problem, it did occur in programs not included in the study. A simple fix is always to refer to structures via pointers. Optimizing loop structures Since scientific programs spend almost all their time in loops, efficient loops are the key to good performance. Conditionals, function calls, little instruction level parallelism, and large numbers of temporary values make it difficult for the compiler to generate tightly packed, highly efficient code. Conditionals and function calls introduce jumps that disrupt code flow. Users should eliminate or isolate conditionls to their own loops as much as possible. Often logical expressions can be substituted for if-then-else statements. For example, code 2 includes the following snippet MaxDelta = 0.0 do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) if (Delta > MaxDelta) MaxDelta = Delta enddo enddo if (MaxDelta .gt. 0.001) goto 200 Since the only use of MaxDelta is to control the jump to 200 and all that matters is whether or not it is greater than 0.001, I made MaxDelta a boolean and rewrote the snippet as MaxDelta = .false. do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) MaxDelta = MaxDelta .or. (Delta .gt. 0.001) enddo enddo if (MaxDelta) goto 200 thereby, eliminating the conditional expression from the inner loop. A microprocessor can execute many instructions per instruction cycle. Typically, it can execute one or more memory, floating point, integer, and jump operations. To be executed simultaneously, the operations must be independent. Thick loops tend to have more instruction level parallelism than thin loops. Moreover, they reduce memory traffice by maximizing data reuse. Loop unrolling and loop fusion are two techniques to increase the size of loop bodies. Several of the codes studied benefitted from loop unrolling, but none benefitted from loop fusion. This observation is not too surpising since it is the general tendency of programmers to write thick loops. As loops become thicker, the number of temporary values grows, increasing register pressure. If registers spill, then memory traffic increases and code flow is disrupted. A thick loop with many temporary values may execute slower than an equivalent series of thin loops. The biggest gain will be achieved if the thick loop can be split into a series of independent loops eliminating the need to write and read temporary arrays. I found such an occasion in code 10 where I split the loop do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do into two disjoint loops do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) end do end do do i = 1, n do j = 1, m C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do Conclusions Over the course of the last year, I have had the opportunity to work with over two dozen academic scientific programmers at leading research universities. Their research interests span a broad range of scientific fields. Except for two programs that relied almost exclusively on library routines (matrix multiply and fast Fourier transform), I was able to improve significantly the single processor performance of all codes. Improvements range from 2x to 15.5x with a simple average of 4.75x. Changes to the source code were at a very high level. I did not use sophisticated techniques or programming tools to discover inefficiencies or effect the changes. Only one code was parallel despite the availability of parallel systems to all developers. Clearly, we have a problem—personal scientific research codes are highly inefficient and not running parallel. The developers are unaware of simple optimization techniques to make programs run faster. They lack education in the art of code optimization and parallel programming. I do not believe we can fix the problem by publishing additional books or training manuals. To date, the developers in questions have not studied the books or manual available, and are unlikely to do so in the future. Short courses are a possible solution, but I believe they are too concentrated to be much use. The general concepts can be taught in a three or four day course, but that is not enough time for students to practice what they learn and acquire the experience to apply and extend the concepts to their codes. Practice is the key to becoming proficient at optimization. I recommend that graduate students be required to take a semester length course in optimization and parallel programming. We would never give someone access to state-of-the-art scientific equipment costing hundreds of thousands of dollars without first requiring them to demonstrate that they know how to use the equipment. Yet the criterion for time on state-of-the-art supercomputers is at most an interesting project. Requestors are never asked to demonstrate that they know how to use the system, or can use the system effectively. A semester course would teach them the required skills. Government agencies that fund academic scientific research pay for most of the computer systems supporting scientific research as well as the development of most personal scientific codes. These agencies should require graduate schools to offer a course in optimization and parallel programming as a requirement for funding. About the Author John Feo received his Ph.D. in Computer Science from The University of Texas at Austin in 1986. After graduate school, Dr. Feo worked at Lawrence Livermore National Laboratory where he was the Group Leader of the Computer Research Group and principal investigator of the Sisal Language Project. In 1997, Dr. Feo joined Tera Computer Company where he was project manager for the MTA, and oversaw the programming and evaluation of the MTA at the San Diego Supercomputer Center. In 2000, Dr. Feo joined Sun Microsystems as an HPC application specialist. He works with university research groups to optimize and parallelize scientific codes. Dr. Feo has published over two dozen research articles in the areas of parallel parallel programming, parallel programming languages, and application performance.

    Read the article

  • Working around MySQL error "Deadlock found when trying to get lock; try restarting transaction"

    - by Anon Guy
    Hi all: I have a MySQL table with about 5,000,000 rows that are being constantly updated in small ways by parallel Perl processes connecting via DBI. The table has about 10 columns and several indexes. One fairly common operation gives rise to the following error sometimes: DBD::mysql::st execute failed: Deadlock found when trying to get lock; try restarting transaction at Db.pm line 276. The SQL statement that triggers the error is something like this: UPDATE file_table SET a_lock = 'process-1234' WHERE param1 = 'X' AND param2 = 'Y' AND param3 = 'Z' LIMIT 47 The error is triggered only sometimes. I'd estimate in 1% of calls or less. However, it never happened with a small table and has become more common as the database has grown. Note that I am using the a_lock field in file_table to ensure that the four near-identical processes I am running do not try and work on the same row. The limit is designed to break their work into small chunks. I haven't done much tuning on MySQL or DBD::mysql. MySQL is a standard Solaris deployment, and the database connection is set up as follows: my $dsn = "DBI:mysql:database=" . $DbConfig::database . ";host=${DbConfig::hostname};port=${DbConfig::port}"; my $dbh = DBI->connect($dsn, $DbConfig::username, $DbConfig::password, { RaiseError => 1, AutoCommit => 1 }) or die $DBI::errstr; I have seen online that several other people have reported similar errors and that this may be a genuine deadlock situation. I have two questions: What exactly about my situation is causing the error above? Is there a simple way to work around it or lessen its frequency? For example, how exactly do I go about "restarting transaction at Db.pm line 276"? Thanks in advance.

    Read the article

  • Binary search in a sorted (memory-mapped ?) file in Java

    - by sds
    I am struggling to port a Perl program to Java, and learning Java as I go. A central component of the original program is a Perl module that does string prefix lookups in a +500 GB sorted text file using binary search (essentially, "seek" to a byte offset in the middle of the file, backtrack to nearest newline, compare line prefix with the search string, "seek" to half/double that byte offset, repeat until found...) I have experimented with several database solutions but found that nothing beats this in sheer lookup speed with data sets of this size. Do you know of any existing Java library that implements such functionality? Failing that, could you point me to some idiomatic example code that does random access reads in text files? Alternatively, I am not familiar with the new (?) Java I/O libraries but would it be an option to memory-map the 500 GB text file (I'm on a 64-bit machine with memory to spare) and do binary search on the memory-mapped byte array? I would be very interested to hear any experiences you have to share about this and similar problems.

    Read the article

  • How can I render an in-memory UIViewController's view Landscape?

    - by Aaron
    I'm trying to render an in-memory (but not in hierarchy, yet) UIViewController's view into an in-memory image buffer so I can do some interesting transition animations. However, when I render the UIViewController's view into that buffer, it is always rendering as though the controller is in Portrait orientation, no matter the orientation of the rest of the app. How do I clue this controller in? My code in RootViewController looks like this: MyUIViewController* controller = [[MyUIViewController alloc] init]; int width = self.view.frame.size.width; int height = self.view.frame.size.height; int bitmapBytesPerRow = width * 4; unsigned char *offscreenData = calloc(bitmapBytesPerRow * height, sizeof(unsigned char)); CGColorSpaceRef colorSpace = CGColorSpaceCreateDeviceRGB(); CGContextRef offscreenContext = CGBitmapContextCreate(offscreenData, width, height, 8, bitmapBytesPerRow, colorSpace, kCGImageAlphaPremultipliedLast); CGContextTranslateCTM(offscreenContext, 0.0f, height); CGContextScaleCTM(offscreenContext, 1.0f, -1.0f); [(CALayer*)[controller.view layer] renderInContext:offscreenContext]; At that point, the offscreen memory buffers contents are portrait-oriented, even when the window is in landscape orientation. Ideas?

    Read the article

  • How can I store large amount of data from a database to XML (memory problem)?

    - by Andrija
    First, I had a problem with getting the data from the Database, it took too much memory and failed. I've set -Xmx1500M and I'm using scrolling ResultSet so that was taken care of. Now I need to make an XML from the data, but I can't put it in one file. At the moment, I'm doing it like this: while(rs.next()){ i++; xmlStringBuilder.append("\n\t<row>"); xmlStringBuilder.append("\n\t\t<ID>" + Util.transformToHTML(rs.getInt("id")) + "</ID>"); xmlStringBuilder.append("\n\t\t<JED_ID>" + Util.transformToHTML(rs.getInt("jed_id")) + "</JED_ID>"); xmlStringBuilder.append("\n\t\t<IME_PJ>" + Util.transformToHTML(rs.getString("ime_pj")) + "</IME_PJ>"); //etc. xmlStringBuilder.append("\n\t</row>"); if (i%100000 == 0){ //stores the data to a file with the name i.xml storeKBR(xmlStringBuilder.toString(),i); xmlStringBuilder= null; xmlStringBuilder= new StringBuilder(); } and it works; I get 12 100 MB files. Now, what I'd like to do is to do is have all that data in one file (which I then compress) but if just remove the if part, I go out of memory. I thought about trying to write to a file, closing it, then opening, but that wouldn't get me much since I'd have to load the file to memory when I open it. P.S. If there's a better way to release the Builder, do let me know :)

    Read the article

  • How to find out where a thread lock happend?

    - by SchlaWiener
    One of our company's Windows Forms application had a strange problem for several month. The app worked very reliable for most of our customers but on some PC's (mostly with a wireless lan connection) the app sometimes just didn't respond anymore. (You click on the UI and windows ask you to wait or kill the app). I wasn't able to track down the problem for a long time but now I figured out what happend. The app had this line of code // don't blame me for this. Wasn't my code :D Control.CheckForIllegalCrossThreadCalls = false and used some background threads to modify the controls. No I found a way to reproduce the application stopping responding bug on my dev machine and tracked it down to a line where I actually used Invoke() to run a task in the main thread. Me.Invoke(MyDelegate, arg1, arg2) Obviously there was a thread lock somewhere. After removing the Control.CheckForIllegalCrossThreadCalls = false statement and refactoring the whole programm to use Invoke() if modifying a control from a background thread, the problem is (hopefully) gone. However, I am wondering if there is a way to find such bugs without debugging every line of code (Even if I break into debugger after the app stops responding I can't tell what happend last, because the IDE didn't jump to the Invoke() statement) In other words: If my apps hangs how can I figure out which line of code has been executed last? Maybe even on the customers PC. I know VS2010 offers some backwards debugging feature, maybe that would be a solution, but currently I am using VS2008.

    Read the article

  • C#, Can I check on a lock without trying to acquire it?

    - by Biff MaGriff
    Hello, I have a lock in my c# web app that prevents users from running the update script once it has started. I was thinking I would put a notification in my master page to let the user know that the data isn't all there yet. Currently I do my locking like so. protected void butRefreshData_Click(object sender, EventArgs e) { Thread t = new Thread(new ParameterizedThreadStart(UpdateDatabase)); t.Start(this); //sleep for a bit to ensure that javascript has a chance to get rendered Thread.Sleep(100); } public static void UpdateDatabase(object con) { if (Monitor.TryEnter(myLock)) { Updater.RepopulateDatabase(); Monitor.Exit(myLock); } else { Common.RegisterStartupScript(con, AlreadyLockedJavaScript); } } And I do not want to do if(Monitor.TryEnter(myLock)) Monitor.Exit(myLock); else //show processing labal As I imagine there is a slight possibility that it might display the notification when it isn't actually running. Is there an alternative I can use? Edit: Hi Everyone, thanks a lot for your suggestions! Unfortunately I couldn't quite get them to work... However I combined the ideas on 2 answers and came up with my own solution.

    Read the article

  • Can I use Eclipse JDT to create new 'working copies' of source files in memory only?

    - by RYates
    I'm using Eclipse JDT to build a Java refactoring platform, for exploring different refactorings in memory before choosing one and saving it. I can create collections of working copies of the source files, edit them in memory, and commit the changes to disk using the JDT framework. However, I also want to generate new 'working copy' source files in memory as part of refactorings, and only create the corresponding real source file if I commit the working copy. I have seen various hints that this is possible, e.g. http://www.jarvana.com/jarvana/view/org/eclipse/jdt/doc/isv/3.3.0-v20070613/isv-3.3.0-v20070613.jar!/guide/jdt%5Fapi%5Fmanip.htm says "Note that the compilation unit does not need to exist in the Java model in order for a working copy to be created". So far I have only been able to create a new real file, i.e. ICompilationUnit newICompilationUnit = myPackage.createCompilationUnit(newName, "package piffle; public class Baz{private int i=0;}", false, null); This is not what I want. Does anyone know how to create a new 'working copy' source file, that does not appear in my file system until I commit it? Or any other mechanism to achieve the same thing?

    Read the article

  • How should I lock the table in this VB6 / Access application?

    - by Brian Hooper
    I'm working on a VB6 application using an Access database. The application writes messages to a log table from time to time. Several instances of the application may be running simultaneously and to distinguish them they each have their own run number. The run number is deduced from the log table thus... Set record_set = New ADODB.Recordset query_string = "SELECT MAX(RUN_NUMBER) + 1 AS NEW_RUN_NUMBER FROM ERROR_LOG" record_set.CursorLocation = adUseClient record_set.Open query_string, database_connection, adOpenStatic, , adCmdText record_set.MoveLast If IsNull(record_set.Fields("NEW_RUN_NUMBER")) Then run_number = 0 Else run_number = record_set.Fields("NEW_RUN_NUMBER") End If command_string = "INSERT INTO ERROR_LOG (RUN_NUMBER, SEVERITY, MESSAGE) " & _ " VALUES (" & Str$(run_number) & ", " & _ " " & Str$(SEVERITY_INFORMATION) & ", " & _ " 'Run Started'); " database_connection.Execute command_string Obviously there is a small gap between the calculation of the run number and the appearance of the new row in the database, and to prevent another instance getting access between the two operations I'd like to lock the table; something along the lines of SET TRANSACTION READ WRITE RESERVING ERROR_LOG FOR PROTECTED WRITE; How should I go about doing this? Would locking the recordset do any good (the row in the record set doesn't match any particular row in the database)?

    Read the article

  • Image.Save(..) throws a GDI+ exception because the memory stream is closed.

    - by Pure.Krome
    Hi folks, i've got some binary data which i want to save as an image. When i try to save the image, it throws an exception if the memory stream used to create the image, was closed before the save. The reason i do this is because i'm dynamically creating images and as such .. i need to use a memory stream. this is the code: [TestMethod] public void TestMethod1() { // Grab the binary data. byte[] data = File.ReadAllBytes("Chick.jpg"); // Read in the data but do not close, before using the stream. Stream originalBinaryDataStream = new MemoryStream(data); Bitmap image = new Bitmap(originalBinaryDataStream); image.Save(@"c:\test.jpg"); originalBinaryDataStream.Dispose(); // Now lets use a nice dispose, etc... Bitmap2 image2; using (Stream originalBinaryDataStream2 = new MemoryStream(data)) { image2 = new Bitmap(originalBinaryDataStream2); } image2.Save(@"C:\temp\pewpew.jpg"); // This throws the GDI+ exception. } Does anyone have any suggestions to how i could save an image with the stream closed? I cannot rely on the developers to remember to close the stream after the image is saved. In fact, the developer would have NO IDEA that the image was generated using a memory stream (because it happens in some other code, elsewhere). I'm really confused :(

    Read the article

  • How to lock non-browser clients from submitting a request?

    - by Thomas Kohl
    I want to block non-browser clients from accessing certain pages / successfully making a request. The website content is served to authenticated users. What happens is that our user gives his credentials to our website to 3rd party - it can be another website or a mobile application - that performs requests on his behalf. Say there is a form that the user fills out and sends a message. Can I protect this form so that the server processing the submission can tell whether the user has submitted it directly from the browser or not? I don't want to use CAPTCHA for usability reasons. Can I do it with some javascript?

    Read the article

  • Microbenchmark showing process-switching faster than thread-switching; what's wrong?

    - by Yang
    I have two simple microbenchmarks trying to measure thread- and process-switching overheads, but the process-switching overhead. The code is living here, and r1667 is pasted below: https://assorted.svn.sourceforge.net/svnroot/assorted/sandbox/trunk/src/c/process_switch_bench.c // on zs, ~2.1-2.4us/switch #include <stdlib.h> #include <fcntl.h> #include <stdint.h> #include <stdio.h> #include <semaphore.h> #include <unistd.h> #include <sys/wait.h> #include <sys/types.h> #include <sys/time.h> #include <pthread.h> uint32_t COUNTER; pthread_mutex_t LOCK; pthread_mutex_t START; sem_t *s0, *s1, *s2; void * threads ( void * unused ) { // Wait till we may fire away sem_wait(s2); for (;;) { pthread_mutex_lock(&LOCK); pthread_mutex_unlock(&LOCK); COUNTER++; sem_post(s0); sem_wait(s1); } return 0; } int64_t timeInMS () { struct timeval t; gettimeofday(&t, NULL); return ( (int64_t)t.tv_sec * 1000 + (int64_t)t.tv_usec / 1000 ); } int main ( int argc, char ** argv ) { int64_t start; pthread_t t1; pthread_mutex_init(&LOCK, NULL); COUNTER = 0; s0 = sem_open("/s0", O_CREAT, 0022, 0); if (s0 == 0) { perror("sem_open"); exit(1); } s1 = sem_open("/s1", O_CREAT, 0022, 0); if (s1 == 0) { perror("sem_open"); exit(1); } s2 = sem_open("/s2", O_CREAT, 0022, 0); if (s2 == 0) { perror("sem_open"); exit(1); } int x, y, z; sem_getvalue(s0, &x); sem_getvalue(s1, &y); sem_getvalue(s2, &z); printf("%d %d %d\n", x, y, z); pid_t pid = fork(); if (pid) { pthread_create(&t1, NULL, threads, NULL); pthread_detach(t1); // Get start time and fire away start = timeInMS(); sem_post(s2); sem_post(s2); // Wait for about a second sleep(1); // Stop thread pthread_mutex_lock(&LOCK); // Find out how much time has really passed. sleep won't guarantee me that // I sleep exactly one second, I might sleep longer since even after being // woken up, it can take some time before I gain back CPU time. Further // some more time might have passed before I obtained the lock! int64_t time = timeInMS() - start; // Correct the number of thread switches accordingly COUNTER = (uint32_t)(((uint64_t)COUNTER * 2 * 1000) / time); printf("Number of process switches in about one second was %u\n", COUNTER); printf("roughly %f microseconds per switch\n", 1000000.0 / COUNTER); // clean up kill(pid, 9); wait(0); sem_close(s0); sem_close(s1); sem_unlink("/s0"); sem_unlink("/s1"); sem_unlink("/s2"); } else { if (1) { sem_t *t = s0; s0 = s1; s1 = t; } threads(0); // never return } return 0; } https://assorted.svn.sourceforge.net/svnroot/assorted/sandbox/trunk/src/c/thread_switch_bench.c // From <http://stackoverflow.com/questions/304752/how-to-estimate-the-thread-context-switching-overhead> // on zs, ~4-5us/switch; tried making COUNTER updated only by one thread, but no difference #include <stdlib.h> #include <stdint.h> #include <stdio.h> #include <pthread.h> #include <unistd.h> #include <sys/time.h> uint32_t COUNTER; pthread_mutex_t LOCK; pthread_mutex_t START; pthread_cond_t CONDITION; void * threads ( void * unused ) { // Wait till we may fire away pthread_mutex_lock(&START); pthread_mutex_unlock(&START); int first=1; pthread_mutex_lock(&LOCK); // If I'm not the first thread, the other thread is already waiting on // the condition, thus Ihave to wake it up first, otherwise we'll deadlock if (COUNTER > 0) { pthread_cond_signal(&CONDITION); first=0; } for (;;) { if (first) COUNTER++; pthread_cond_wait(&CONDITION, &LOCK); // Always wake up the other thread before processing. The other // thread will not be able to do anything as long as I don't go // back to sleep first. pthread_cond_signal(&CONDITION); } pthread_mutex_unlock(&LOCK); return 0; } int64_t timeInMS () { struct timeval t; gettimeofday(&t, NULL); return ( (int64_t)t.tv_sec * 1000 + (int64_t)t.tv_usec / 1000 ); } int main ( int argc, char ** argv ) { int64_t start; pthread_t t1; pthread_t t2; pthread_mutex_init(&LOCK, NULL); pthread_mutex_init(&START, NULL); pthread_cond_init(&CONDITION, NULL); pthread_mutex_lock(&START); COUNTER = 0; pthread_create(&t1, NULL, threads, NULL); pthread_create(&t2, NULL, threads, NULL); pthread_detach(t1); pthread_detach(t2); // Get start time and fire away start = timeInMS(); pthread_mutex_unlock(&START); // Wait for about a second sleep(1); // Stop both threads pthread_mutex_lock(&LOCK); // Find out how much time has really passed. sleep won't guarantee me that // I sleep exactly one second, I might sleep longer since even after being // woken up, it can take some time before I gain back CPU time. Further // some more time might have passed before I obtained the lock! int64_t time = timeInMS() - start; // Correct the number of thread switches accordingly COUNTER = (uint32_t)(((uint64_t)COUNTER * 2 * 1000) / time); printf("Number of thread switches in about one second was %u\n", COUNTER); printf("roughly %f microseconds per switch\n", 1000000.0 / COUNTER); return 0; }

    Read the article

  • Parallelism in .NET – Part 4, Imperative Data Parallelism: Aggregation

    - by Reed
    In the article on simple data parallelism, I described how to perform an operation on an entire collection of elements in parallel.  Often, this is not adequate, as the parallel operation is going to be performing some form of aggregation. Simple examples of this might include taking the sum of the results of processing a function on each element in the collection, or finding the minimum of the collection given some criteria.  This can be done using the techniques described in simple data parallelism, however, special care needs to be taken into account to synchronize the shared data appropriately.  The Task Parallel Library has tools to assist in this synchronization. The main issue with aggregation when parallelizing a routine is that you need to handle synchronization of data.  Since multiple threads will need to write to a shared portion of data.  Suppose, for example, that we wanted to parallelize a simple loop that looked for the minimum value within a dataset: double min = double.MaxValue; foreach(var item in collection) { double value = item.PerformComputation(); min = System.Math.Min(min, value); } .csharpcode, .csharpcode pre { font-size: small; color: black; font-family: consolas, "Courier New", courier, monospace; background-color: #ffffff; /*white-space: pre;*/ } .csharpcode pre { margin: 0em; } .csharpcode .rem { color: #008000; } .csharpcode .kwrd { color: #0000ff; } .csharpcode .str { color: #006080; } .csharpcode .op { color: #0000c0; } .csharpcode .preproc { color: #cc6633; } .csharpcode .asp { background-color: #ffff00; } .csharpcode .html { color: #800000; } .csharpcode .attr { color: #ff0000; } .csharpcode .alt { background-color: #f4f4f4; width: 100%; margin: 0em; } .csharpcode .lnum { color: #606060; } This seems like a good candidate for parallelization, but there is a problem here.  If we just wrap this into a call to Parallel.ForEach, we’ll introduce a critical race condition, and get the wrong answer.  Let’s look at what happens here: // Buggy code! Do not use! double min = double.MaxValue; Parallel.ForEach(collection, item => { double value = item.PerformComputation(); min = System.Math.Min(min, value); }); This code has a fatal flaw: min will be checked, then set, by multiple threads simultaneously.  Two threads may perform the check at the same time, and set the wrong value for min.  Say we get a value of 1 in thread 1, and a value of 2 in thread 2, and these two elements are the first two to run.  If both hit the min check line at the same time, both will determine that min should change, to 1 and 2 respectively.  If element 1 happens to set the variable first, then element 2 sets the min variable, we’ll detect a min value of 2 instead of 1.  This can lead to wrong answers. Unfortunately, fixing this, with the Parallel.ForEach call we’re using, would require adding locking.  We would need to rewrite this like: // Safe, but slow double min = double.MaxValue; // Make a "lock" object object syncObject = new object(); Parallel.ForEach(collection, item => { double value = item.PerformComputation(); lock(syncObject) min = System.Math.Min(min, value); }); This will potentially add a huge amount of overhead to our calculation.  Since we can potentially block while waiting on the lock for every single iteration, we will most likely slow this down to where it is actually quite a bit slower than our serial implementation.  The problem is the lock statement – any time you use lock(object), you’re almost assuring reduced performance in a parallel situation.  This leads to two observations I’ll make: When parallelizing a routine, try to avoid locks. That being said: Always add any and all required synchronization to avoid race conditions. These two observations tend to be opposing forces – we often need to synchronize our algorithms, but we also want to avoid the synchronization when possible.  Looking at our routine, there is no way to directly avoid this lock, since each element is potentially being run on a separate thread, and this lock is necessary in order for our routine to function correctly every time. However, this isn’t the only way to design this routine to implement this algorithm.  Realize that, although our collection may have thousands or even millions of elements, we have a limited number of Processing Elements (PE).  Processing Element is the standard term for a hardware element which can process and execute instructions.  This typically is a core in your processor, but many modern systems have multiple hardware execution threads per core.  The Task Parallel Library will not execute the work for each item in the collection as a separate work item. Instead, when Parallel.ForEach executes, it will partition the collection into larger “chunks” which get processed on different threads via the ThreadPool.  This helps reduce the threading overhead, and help the overall speed.  In general, the Parallel class will only use one thread per PE in the system. Given the fact that there are typically fewer threads than work items, we can rethink our algorithm design.  We can parallelize our algorithm more effectively by approaching it differently.  Because the basic aggregation we are doing here (Min) is communitive, we do not need to perform this in a given order.  We knew this to be true already – otherwise, we wouldn’t have been able to parallelize this routine in the first place.  With this in mind, we can treat each thread’s work independently, allowing each thread to serially process many elements with no locking, then, after all the threads are complete, “merge” together the results. This can be accomplished via a different set of overloads in the Parallel class: Parallel.ForEach<TSource,TLocal>.  The idea behind these overloads is to allow each thread to begin by initializing some local state (TLocal).  The thread will then process an entire set of items in the source collection, providing that state to the delegate which processes an individual item.  Finally, at the end, a separate delegate is run which allows you to handle merging that local state into your final results. To rewriting our routine using Parallel.ForEach<TSource,TLocal>, we need to provide three delegates instead of one.  The most basic version of this function is declared as: public static ParallelLoopResult ForEach<TSource, TLocal>( IEnumerable<TSource> source, Func<TLocal> localInit, Func<TSource, ParallelLoopState, TLocal, TLocal> body, Action<TLocal> localFinally ) The first delegate (the localInit argument) is defined as Func<TLocal>.  This delegate initializes our local state.  It should return some object we can use to track the results of a single thread’s operations. The second delegate (the body argument) is where our main processing occurs, although now, instead of being an Action<T>, we actually provide a Func<TSource, ParallelLoopState, TLocal, TLocal> delegate.  This delegate will receive three arguments: our original element from the collection (TSource), a ParallelLoopState which we can use for early termination, and the instance of our local state we created (TLocal).  It should do whatever processing you wish to occur per element, then return the value of the local state after processing is completed. The third delegate (the localFinally argument) is defined as Action<TLocal>.  This delegate is passed our local state after it’s been processed by all of the elements this thread will handle.  This is where you can merge your final results together.  This may require synchronization, but now, instead of synchronizing once per element (potentially millions of times), you’ll only have to synchronize once per thread, which is an ideal situation. Now that I’ve explained how this works, lets look at the code: // Safe, and fast! double min = double.MaxValue; // Make a "lock" object object syncObject = new object(); Parallel.ForEach( collection, // First, we provide a local state initialization delegate. () => double.MaxValue, // Next, we supply the body, which takes the original item, loop state, // and local state, and returns a new local state (item, loopState, localState) => { double value = item.PerformComputation(); return System.Math.Min(localState, value); }, // Finally, we provide an Action<TLocal>, to "merge" results together localState => { // This requires locking, but it's only once per used thread lock(syncObj) min = System.Math.Min(min, localState); } ); Although this is a bit more complicated than the previous version, it is now both thread-safe, and has minimal locking.  This same approach can be used by Parallel.For, although now, it’s Parallel.For<TLocal>.  When working with Parallel.For<TLocal>, you use the same triplet of delegates, with the same purpose and results. Also, many times, you can completely avoid locking by using a method of the Interlocked class to perform the final aggregation in an atomic operation.  The MSDN example demonstrating this same technique using Parallel.For uses the Interlocked class instead of a lock, since they are doing a sum operation on a long variable, which is possible via Interlocked.Add. By taking advantage of local state, we can use the Parallel class methods to parallelize algorithms such as aggregation, which, at first, may seem like poor candidates for parallelization.  Doing so requires careful consideration, and often requires a slight redesign of the algorithm, but the performance gains can be significant if handled in a way to avoid excessive synchronization.

    Read the article

  • AWStats: Visits from IP address vs Crawlers

    - by user3651934
    I use AWStats in cPanel to see stats of my website. Under Hosts section I see one IP address that has visited 150 pages. I am not sure if one person would have visited 150 pages using a browser. But if these 150 pages have been visited using a software application, then should not it be listed under Robots/Spider section. So how do I determine if I should block a certain IP address that has visited several hundred pages of my website? Thanks

    Read the article

< Previous Page | 213 214 215 216 217 218 219 220 221 222 223 224  | Next Page >