Search Results

Search found 1238 results on 50 pages for 'dimensional modeling'.

Page 29/50 | < Previous Page | 25 26 27 28 29 30 31 32 33 34 35 36  | Next Page >

  • deteminant of matrix

    - by davit-datuashvili
    suppose there is given two dimensional array int a[][]=new int[4][4]; i am trying to find determinant of matrices please help i know how find it mathematical but i am trying to find it in programaticaly

    Read the article

  • How to calculate "holes" in timetable

    - by genesiss
    I've got a 2-dimensional array like this (it represents a timetable): http://www.shrani.si/f/28/L6/37YvFye/timetable.png Orange cells are lectures and whites are free time. How could I calculate number of free hours between lectures in the same day? (columns are days and rows are hours) For example, in this table the result should be: 2 for first column 0 for second colum -- The function returns 2 (because 2+0=2)

    Read the article

  • How do I create and read non-global variables that aren't destroyed at end of function?

    - by Paul Reilly
    I am attempting to code some plugins to use with MIDI sequencers but have hit a stumbling block. I can't use global-scope variables to store information because multiple instances of the .dll can exist which share memory. How do I create a class (for re-usability purposes in other plugins) containing 2 dimensional array and other variables the content of which is to be shared between functions? If that is possible, how would I read and write the data from the function in the framework where I do the processing?

    Read the article

  • How to build a tree from a list of items and their children recursively in php?

    - by k1lljoy
    I have a list of items stored in the DB, simplified schema is like this: id, parent, name I need to generate a tree structure (in a form of a multi-dimensional array) that can be infinite levels deep. Top level items would have parent = 0. Next level down would have parent equal to the the id of the parent item, fairly straight forward. What would be the best way to do this, while consuming as little resources as possible?

    Read the article

  • how to save related models with non-null foreign key

    - by Fortress
    I have a model called user which has_one email. I put the foreign key (NOT NULL) inside users table. Now I'm trying to save it in the following way: @email = Email.new(params[:email]) @email.user = User.new(params[:user]) @email.save This raises a db exception, because the foreign key constraint is not met (NULL is inserted into email_id). How can I elegantly solve this or is my data modeling wrong?

    Read the article

  • which is better in general, map or vector in c++?

    - by tsubasa
    As I know that accessing an element in vector takes constant time while in map takes logarithmic time. However, storing a map takes less memory than storing a vector. Therefore, I want to ask which one is better in general? I'm considering using one of those two in my program, which has about 1000 elements. I plan to use 3 dimensional vector, which would take 1000x1000x1000 elements.

    Read the article

  • Web UI for inputting a function from the reals to the reals, such as a probability distribution.

    - by dreeves
    I would like a web interface for a user to describe a one-dimensional real-valued function. I'm imagining the user being presented with a blank pair of axes and they can click anywhere to create points that are thick and draggable. Double-clicking a point, let's say, makes it disappear. The actual function should be shown in real time as an interpolation of the user-supplied points. Here's what this looks like implemented in Mathematica (though of course I'm looking for something in javascript):

    Read the article

  • What does this mean in AS3?

    - by uther-lightbringer
    Hello I've started to learning AS3 from one book and found something I don't understand. Ellipse(_board[row][column]).fill = getFill(row, column); _board is two dimensional array of Ellipse type, so I just dont understand why is Ellipse(Ellipse object) used when it apparently works without it, or I haven't seen any changes when I omitted it.

    Read the article

  • How value objects are saving and loading?

    - by yeraycaballero
    Since there isn't respositories for value objects. How can I load all value objects? Suppose we are modeling a blog application and we have this classes: Post (Entity) Comment (Value object) Tag (Value object) PostsRespository (Respository) I Know that when I save a new post, its tags are saving with it in the same table. But how could I load all tags of all posts. Has PostsRespository got a method to load all tags? I usually do it, but I want to know others opinions

    Read the article

  • How to extract data out of a specific PHP array

    - by user77413
    I have a multi-dimensional array that looks like this: The base array is indexed based on category ids from my catalog. $categories[category_id] Each base array has two underlying elements: ['parent_category_id'] ['sort_order'] I want to create a function that allows us to create a list of categories for a given parent_category_id in the correct sort order. Is this possible? Technically it is the same information, but the array is constructed in a weird way to extract that information.

    Read the article

  • Combining array values in multilevel array

    - by James Huckabone
    I have an array like so: array( 'a'=>array( 'a'=>3, 'f'=>5, 'sdf'=>0), 't'=>array( 'a'=>1, 'f'=>2, 'sdf'=>5), 'pps'=>array( 'a'=>1, 'f'=>2, 'sdf'=>3) ); Notice how the sub-arrays are the same for each top-level array. If I wanted to, what's the easiest way to combine the sub-arrays so that I'm left with a one-dimensional array like: array( 'a'=>5, 'f'=>9, 'sdf'=>8 );

    Read the article

  • Malthusian Growth Model in Ruby

    - by GregDean
    Hello. I am interested in modeling a Malthusian growth model in Ruby. Does anyone have any ideas, or are there any interesting libraries that cover this? Any help is appreciated. http://en.wikipedia.org/wiki/Malthusian_growth_model

    Read the article

  • fill array with binary numbers

    - by davit-datuashvili
    hi, first of all this is not homework!! my question is from book: Algorithms in C++ third edition by robert sedgewick question is: there is given array of size n by 2^n(two dimensional) and we should fill it by binary numbers with bits size exactly n or for example n=5 so result will be 00001 00010 00011 00100 00101 00110 00111 and so on we should put this sequence of bits into arrays please help me

    Read the article

  • DDD and avoiding CRUD

    - by g_b
    It seems that on most articles I read, CRUD is to be avoided in DDD as we are dealing with modeling business process and not data. However, I find it hard to see not to have CRUD operations on certain entities. For example, in a school grading system, before teachers can grade students, a SchoolYear has to be present or perhaps a GradingPeriod. I can't see how we can manage GradingPeriods without CRUD. Could someone enlighten me on this?

    Read the article

  • prefill a std::vector at initialization?

    - by user146780
    I want to create a vector of vector of a vector of double and want it to already have (32,32,16) elements, without manually pushing all of these back. Is there a way to do it during initialization? (I dont care what value gets pushed) Thanks I want a 3 dimensional array, first dimension has 32, second dimension has 32 and third dimension has 16 elements

    Read the article

  • Portal And Content - Content Integration - Best Practices

    - by Stefan Krantz
    Lately we have seen an increase in projects that have failed to either get user friendly content integration or non satisfactory performance. Our intention is to mitigate any knowledge gap that our previous post might have left you with, therefore this post will repeat some recommendation or reference back to old useful post. Moreover this post will help you understand ground up how to design, architect and implement business enabled, responsive and performing portals with complex requirements on business centric information publishing. Design the Information Model The key to successful portal deployments is Information modeling, it's a key task to understand the use case you designing for, therefore I have designed a set of question you need to ask yourself or your customer: Question: Who will own the content, IT or Business? Answer: BusinessQuestion: Who will publish the content, IT or Business? Answer: BusinessQuestion: Will there be multiple publishers? Answer: YesQuestion: Are the publishers computer scientist?Answer: NoQuestion: How often do the information changes, daily, weekly, monthly?Answer: Daily, weekly If your answers to the questions matches at least 2, we strongly recommend you design your content with following principles: Divide your pages in to logical sections, where each section is marked with its purpose Assign capabilities to each section, does it contain text, images, formatting and/or is it static and is populated through other contextual information Select editor/design element type WYSIWYG - Rich Text Plain Text - non-format text Image - Image object Static List - static list of formatted informationDynamic Data List - assembled information from multiple data files through CMIS query The result of such design map could look like following below examples: Based on the outcome of the required elements in the design column 3 from the left you will now simply design a data model in WebCenter Content - Site Studio by creating a Region Definition structure matching your design requirements.For more information on how to create a Region definition see following post: Region Definition Post - note see instruction 7 for details. Each region definition can now be used to instantiate data files, a data file will hold the actual data for each element in the region definition. Another way you can see this is to compare the region definition as an extension to the metadata model in WebCenter Content for each data file item. Design content templates With a solid dependable information model we can now proceed to template creation and page design, in this phase focuses on how to place the content sections from the region definition on the page via a Content Presenter template. Remember by creating content presenter templates you will leverage the latest and most integrated technology WebCenter has to offer. This phase is much easier since the you already have the information model and design wire-frames to base the logic on, however there is still few considerations to pay attention to: Base the template on ADF and make only necessary exceptions to markup when required Leverage ADF design components for Tabs, Accordions and other similar components, this way the design in the content published areas will comply with other design areas based on custom ADF taskflows There is no performance impact when using meta data or region definition based data All data access regardless of type, metadata or xml data it can be accessed via the Content Presenter - Node. See below for applied examples on how to access data Access metadata property from Document - #{node.propertyMap['myProp'].value}myProp in this example can be for instance (dDocName, dDocTitle, xComments or any other available metadata) Access element data from data file xml - #{node.propertyMap['[Region Definition Name]:[Element name]'].asTextHtml}Region Definition Name is the expect region definition that the current data file is instantiatingElement name is the element value you like to grab from the data file I recommend you read following  useful post on content template topic:CMIS queries and template creation - note see instruction 9 for detailsStatic List template rendering For more information on templates:Single Item Content TemplateMulti Item Content TemplateExpression Language Internationalization Considerations When integrating content assets via content presenter you by now probably understand that the content item/data file is wired to the page, what is also pretty common at this stage is that the content item/data file only support one language since its not practical or business friendly to mix that into a complex structure. Therefore you will be left with a very common dilemma that you will have to either build a complete new portal for each locale, which is not an good option! However with little bit of information modeling and clear naming convention this can be addressed. Basically you can simply make sure that all content item/data file are named with a predictable naming convention like "Content1_EN" for the English rendition and "Content1_ES" for the Spanish rendition. This way through simple none complex customizations you will be able to dynamically switch the actual content item/data file just before rendering. By following proposed approach above you not only enable a simple mechanism for internationalized content you also preserve the functionality in the content presenter to support business accessible run-time publishing of information on existing and new pages. I recommend you read following useful post on Internationalization topics:Internationalize with Content Presenter Integrate with Review & Approval processes Today the Review and approval functionality and configuration is based out of WebCenter Content - Criteria Workflows. Criteria Workflows uses the metadata of the checked in document to evaluate if the document is under any review/approval process. So for instance if a Criteria Workflow is configured to force any documents with Version = "2" or "higher" and Content Type is "Instructions", any matching content item version on check in will now enter the workflow before getting released for general access. Few things to consider when configuring Criteria Workflows: Make sure to not trigger on version one for Content Items that are Data Files - if you trigger on version 1 you will not only approve an empty document you will also have a content presenter pointing to a none existing document - since the document will only be available after successful completion of the workflow Approval workflows sometimes requires more complex criteria, the recommendation if that is the case is that the meta data triggering such criteria is automatically populated, this can be achieved through many approaches including Content Profiles Criteria workflows are configured and managed in WebCenter Content Administration Applets where you can configure one or more workflows. When you configured Criteria workflows the Content Presenter will support the editors with the approval process directly inline in the "Contribution mode" of the portal. In addition to approve/reject and details of the task, the content presenter natively support the user to view the current and future version of the change he/she is approving. See below for example: Architectural recommendation To support review&approval processes - minimize the amount of data files per page Each CMIS query can consume significant time depending on the complexity of the query - minimize the amount of CMIS queries per page Use Content Presenter Templates based on ADF - this way you minimize the design considerations and optimize the usage of caching Implement the page in as few Data files as possible - simplifies publishing process, increases performance and simplifies release process Named data file (node) or list of named nodes when integrating to pages increases performance vs. querying for data Named data file (node) or list of named nodes when integrating to pages enables business centric page creation and publishing and reduces the need for IT department interaction Summary Just because one architectural decision solves a business problem it doesn't mean its the right one, when designing portals all architecture has to be in harmony and not impacting each other. For instance the most technical complex solution is not always the best since it will most likely defeat the business accessibility, performance or both, therefore the best approach is to first design for simplicity that even a non-technical user can operate, after that consider the performance impact and final look at the technology challenges these brings and workaround them first with out-of-the-box features, after that design and develop functions to complement the short comings.

    Read the article

  • XBRL US Conference Highlights

    - by john.orourke(at)oracle.com
    Back in early November I had an opportunity to attend the XBRL US National Conference in Philadelphia.  At the event, XBRL US announced that Oracle had joined the initiative, so I had a chance to participate in a press conference and attend a number of sessions.  Oracle joined XBRL US so we can stay ahead of the standard and leverage it in our products, and to help drive awareness with customers and improve adoption of XBRL. There were roughly 250 attendees at the event, about half of which were vendors and consultants and the rest financial reporting staff from corporate filers.  Event sponsors included Ernst & Young, SWIFT and Fujitsu.  There were also a number of XBRL technology and service providers exhibiting at the conference.  On Monday Nov. 8th, the XBRL US Steering Committee meetings and Annual Members meeting and reception were held.  At the Annual Members meeting the big news was that current XBRL US President, Mark Bolgiano, is moving to a new position at Howard Hughes Medical Center.  Campbell Pryde, who had led the Taxonomy Development for XBRL US, is taking over as XBRL US President. Other items that were highlighted at the members meeting included: The US GAAP XBRL taxonomy is being used by over 1500 SEC filers and has now been handed over to the FASB to maintain and enhance 16 filer training events were held in 2010 XBRL Global Magazine was launched Corporate Actions proposal was submitted to the SEC with SWIFT in May XBRL Labs for iPhone, XBRL US Consistency Suite launched ISO 2022 Corporate Actions Alignment with XBRL achieved The XBRL Credit Rating taxonomy was accepted Tuesday Nov. 9th included Keynotes, General Sessions, Innovation Workshop for Governments and Securities Professionals, and an Opening Reception.  General sessions included: Lessons Learned from the SEC's rollout of XBRL.  More than 18,000 errors were identified in reviews of filings between June 2009 and September 2010.  Most of these related to negative values being used where they shouldn't have.  Also, the SEC feels there are too many taxonomy extensions being created - mostly in the Cash Flow Statements.  They emphasize using existing elements in the US GAAP taxonomy and advise filers not to  create extensions to improve the visual formatting of XBRL filings. Investors and XBRL - Setting the Standard for Data Quality.  In this panel discussion, the key learning was that CFA's, academics and the financial community are not using XBRL as expected.  The issues raised include the  accuracy and completeness of filings, number of taxonomy extensions, and limited number of tools available to help analyze XBRL data.  Another big issue that was raised is the lack of historic results in XBRL - most analysts need 10 quarters of historic data.  On the positive side, XBRL has the potential to eliminate re-keying of data and errors here and can improve analytic capabilities for financial analysts once more historic data is available and more companies are providing detailed tagging of their filings. A US Roadmap for XBRL Financial Reporting.  This was a panel discussion featuring Jeff Neumann(SEC), Campbell Pryde(XBRL US), and Louis Matherne(FASB).  Key points included the fact that XBRL is currently used by 1500 companies, with 8000 more companies coming in 2011.  XBRL for Mutual Fund Reporting will start in 2011 for 8000 funds, and a Credit Rating Taxonomy has now been submitted for review.  The XBRL tagging/filing process is improving each quarter - more education is helping here.  The FASB is looking at extensions to date, and potential additions to US GAAP taxonomy, while the SEC is evaluating filings for accuracy, consistency in tagging, and tools for analyzing data.  The big news is that the FASB 2011 US GAAP Taxonomy has been completed and reviewed by SEC.  The 2011 US GAAP Taxonomy supports new FASB accounting standards issued since 2009, has new taxonomy elements for certain industries (i.e airlines) and the elimination of 500 concepts.  (meaning they can't be used going forward but are still supported for historical comparison)  The 2011 US GAAP Taxonomy will be available for usage with Q2 2011 SEC filings.  More information about this can be found on the FASB web site.  http://www.fasb.org/home Accounting Firms and XBRL.  This session covered the Role of Audit Firms, which includes awareness and education, validation of XBRL filings, and in-house transition planning.  The main advice provided was that organizations should document XBRL mapping process, perform peer comparisons, and risk assessments on a regular basis. Wednesday Nov. 10th included more Keynotes, General Sessions on Corporate Actions, and XBRL Essentials Workshop Training for corporate filers.  The XBRL Essentials Training included: Getting Started Once you Have the Basics Detailed Footnote Tagging and Handling Tables Quality Control and Trust in the XBRL Process Bringing XBRL In-House:  What are the Options, What should you consider? The US GAAP Financial Reporting Taxonomy - Overview of the 2011 release The XBRL Essentials Training was well-attended with about 80 people.  This included a good overview of the SEC's XBRL mandate, limited liability issue, tagging levels, recommended planning process, internal vs. outsourced approach, and how to manage service providers.  I learned a lot from the session on detailed tagging.  This is the requirement that kicks in during a company's second year of XBRL filing with the SEC and applies to financial statements, footnotes and disclosures (it does not apply to MD&A, executive communications and other information).  The review of the Linkbase model, or dimensional table structure, was very interesting and can be complex to understand.  The key takeaway here is that using dimensional tables in XBRL filings can help limit the number of taxonomy extensions that are required.  The slides from this session are posted on the XBRL US web site. (http://xbrl.us/events/Pages/archive.aspx) For me, the main summary points and takeaways from the XBRL US conference are: XBRL for financial reporting has turned the corner and gone mainstream - with 1500 companies currently using it and 8000 more coming in 2011 The expected value is not being achieved by filers or consumers of XBRL data - this will improve when more companies are filing in XBRL, more history is available, and more software tools are available for analysis (hmm, sounds like an opportunity for Oracle) XBRL is becoming the global standard for all business communications beyond just the financials - i.e. adoption for mutual funds, corporate actions and others planned for the future If you would like to learn more about XBRL and the various training programs, services and software tools that are available check out the XBRL US web site and even better - become a member.  Here's a link:  http://xbrl.us/Pages/default.aspx

    Read the article

  • Oracle OpenWorld 2013 – Wrap up by Sven Bernhardt

    - by JuergenKress
    OOW 2013 is over and we’re heading home, so it is time to lean back and reflecting about the impressions we have from the conference. First of all: OOW was great! It was a pleasure to be a part of it. As already mentioned in our last blog article: It was the biggest OOW ever. Parallel to the conference the America’s Cup took place in San Francisco and the Oracle Team America won. Amazing job by the team and again congratulations from our side Back to the conference. The main topics for us are: Oracle SOA / BPM Suite 12c Adaptive Case management (ACM) Big Data Fast Data Cloud Mobile Below we will go a little more into detail, what are the key takeaways regarding the mentioned points: Oracle SOA / BPM Suite 12c During the five days at OOW, first details of the upcoming major release of Oracle SOA Suite 12c and Oracle BPM Suite 12c have been introduced. Some new key features are: Managed File Transfer (MFT) for transferring big files from a source to a target location Enhanced REST support by introducing a new REST binding Introduction of a generic cloud adapter, which can be used to connect to different cloud providers, like Salesforce Enhanced analytics with BAM, which has been totally reengineered (BAM Console now also runs in Firefox!) Introduction of templates (OSB pipelines, component templates, BPEL activities templates) EM as a single monitoring console OSB design-time integration into JDeveloper (Really great!) Enterprise modeling capabilities in BPM Composer These are only a few points from what is coming with 12c. We are really looking forward for the new realese to come out, because this seems to be really great stuff. The suite becomes more and more integrated. From 10g to 11g it was an evolution in terms of developing SOA-based applications. With 12c, Oracle continues it’s way – very impressive. Adaptive Case Management Another fantastic topic was Adaptive Case Management (ACM). The Oracle PMs did a great job especially at the demo grounds in showing the upcoming Case Management UI (will be available in 11g with the next BPM Suite MLR Patch), the roadmap and the differences between traditional business process modeling. They have been very busy during the conference because a lot of partners and customers have been interested Big Data Big Data is one of the current hype themes. Because of huge data amounts from different internal or external sources, the handling of these data becomes more and more challenging. Companies have a need for analyzing the data to optimize their business. The challenge is here: the amount of data is growing daily! To store and analyze the data efficiently, it is necessary to have a scalable and flexible infrastructure. Here it is important that hardware and software are engineered to work together. Therefore several new features of the Oracle Database 12c, like the new in-memory option, have been presented by Larry Ellison himself. From a hardware side new server machines like Fujitsu M10 or new processors, such as Oracle’s new M6-32 have been announced. The performance improvements, when using one of these hardware components in connection with the improved software solutions were really impressive. For more details about this, please take look at our previous blog post. Regarding Big Data, Oracle also introduced their Big Data architecture, which consists of: Oracle Big Data Appliance that is preconfigured with Hadoop Oracle Exdata which stores a huge amount of data efficently, to achieve optimal query performance Oracle Exalytics as a fast and scalable Business analytics system Analysis of the stored data can be performed using SQL, by streaming the data directly from Hadoop to an Oracle Database 12c. Alternatively the analysis can be directly implemented in Hadoop using “R”. In addition Oracle BI Tools can be used to analyze the data. Fast Data Fast Data is a complementary approach to Big Data. A huge amount of mostly unstructured data comes in via different channels with a high frequency. The analysis of these data streams is also important for companies, because the incoming data has to be analyzed regarding business-relevant patterns in real-time. Therefore these patterns must be identified efficiently and performant. To do so, in-memory grid solutions in combination with Oracle Coherence and Oracle Event Processing demonstrated very impressive how efficient real-time data processing can be. One example for Fast Data solutions that was shown during the OOW was the analysis of twitter streams regarding customer satisfaction. The feeds with negative words like “bad” or “worse” have been filtered and after a defined treshold has been reached in a certain timeframe, a business event was triggered. Cloud Another key trend in the IT market is of course Cloud Computing and what it means for companies and their businesses. Oracle announced their Cloud strategy and vision – companies can focus on their real business while all of the applications are available via Cloud. This also includes Oracle Database or Oracle Weblogic, so that companies can also build, deploy and run their own applications within the cloud. Three different approaches have been introduced: Infrastructure as a Service (IaaS) Platform as a Service (PaaS) Software as a Service (SaaS) Using the IaaS approach only the infrastructure components will be managed in the Cloud. Customers will be very flexible regarding memory, storage or number of CPUs because those parameters can be adjusted elastically. The PaaS approach means that besides the infrastructure also the platforms (such as databases or application servers) necessary for running applications will be provided within the Cloud. Here customers can also decide, if installation and management of these infrastructure components should be done by Oracle. The SaaS approach describes the most complete one, hence all applications a company uses are managed in the Cloud. Oracle is planning to provide all of their applications, like ERP systems or HR applications, as Cloud services. In conclusion this seems to be a very forward-thinking strategy, which opens up new possibilities for customers to manage their infrastructure and applications in a flexible, scalable and future-oriented manner. As you can see, our OOW days have been very very interresting. We collected many helpful informations for our projects. The new innovations presented at the confernce are great and being part of this was even greater! We are looking forward to next years’ conference! Links: http://www.oracle.com/openworld/index.html http://thecattlecrew.wordpress.com/2013/09/23/first-impressions-from-oracle-open-world-2013 SOA & BPM Partner Community For regular information on Oracle SOA Suite become a member in the SOA & BPM Partner Community for registration please visit www.oracle.com/goto/emea/soa (OPN account required) If you need support with your account please contact the Oracle Partner Business Center. Blog Twitter LinkedIn Facebook Wiki Mix Forum Technorati Tags: cattleCrew,Sven Bernhard,OOW2013,SOA Community,Oracle SOA,Oracle BPM,Community,OPN,Jürgen Kress

    Read the article

  • How John Got 15x Improvement Without Really Trying

    - by rchrd
    The following article was published on a Sun Microsystems website a number of years ago by John Feo. It is still useful and worth preserving. So I'm republishing it here.  How I Got 15x Improvement Without Really Trying John Feo, Sun Microsystems Taking ten "personal" program codes used in scientific and engineering research, the author was able to get from 2 to 15 times performance improvement easily by applying some simple general optimization techniques. Introduction Scientific research based on computer simulation depends on the simulation for advancement. The research can advance only as fast as the computational codes can execute. The codes' efficiency determines both the rate and quality of results. In the same amount of time, a faster program can generate more results and can carry out a more detailed simulation of physical phenomena than a slower program. Highly optimized programs help science advance quickly and insure that monies supporting scientific research are used as effectively as possible. Scientific computer codes divide into three broad categories: ISV, community, and personal. ISV codes are large, mature production codes developed and sold commercially. The codes improve slowly over time both in methods and capabilities, and they are well tuned for most vendor platforms. Since the codes are mature and complex, there are few opportunities to improve their performance solely through code optimization. Improvements of 10% to 15% are typical. Examples of ISV codes are DYNA3D, Gaussian, and Nastran. Community codes are non-commercial production codes used by a particular research field. Generally, they are developed and distributed by a single academic or research institution with assistance from the community. Most users just run the codes, but some develop new methods and extensions that feed back into the general release. The codes are available on most vendor platforms. Since these codes are younger than ISV codes, there are more opportunities to optimize the source code. Improvements of 50% are not unusual. Examples of community codes are AMBER, CHARM, BLAST, and FASTA. Personal codes are those written by single users or small research groups for their own use. These codes are not distributed, but may be passed from professor-to-student or student-to-student over several years. They form the primordial ocean of applications from which community and ISV codes emerge. Government research grants pay for the development of most personal codes. This paper reports on the nature and performance of this class of codes. Over the last year, I have looked at over two dozen personal codes from more than a dozen research institutions. The codes cover a variety of scientific fields, including astronomy, atmospheric sciences, bioinformatics, biology, chemistry, geology, and physics. The sources range from a few hundred lines to more than ten thousand lines, and are written in Fortran, Fortran 90, C, and C++. For the most part, the codes are modular, documented, and written in a clear, straightforward manner. They do not use complex language features, advanced data structures, programming tricks, or libraries. I had little trouble understanding what the codes did or how data structures were used. Most came with a makefile. Surprisingly, only one of the applications is parallel. All developers have access to parallel machines, so availability is not an issue. Several tried to parallelize their applications, but stopped after encountering difficulties. Lack of education and a perception that parallelism is difficult prevented most from trying. I parallelized several of the codes using OpenMP, and did not judge any of the codes as difficult to parallelize. Even more surprising than the lack of parallelism is the inefficiency of the codes. I was able to get large improvements in performance in a matter of a few days applying simple optimization techniques. Table 1 lists ten representative codes [names and affiliation are omitted to preserve anonymity]. Improvements on one processor range from 2x to 15.5x with a simple average of 4.75x. I did not use sophisticated performance tools or drill deep into the program's execution character as one would do when tuning ISV or community codes. Using only a profiler and source line timers, I identified inefficient sections of code and improved their performance by inspection. The changes were at a high level. I am sure there is another factor of 2 or 3 in each code, and more if the codes are parallelized. The study’s results show that personal scientific codes are running many times slower than they should and that the problem is pervasive. Computational scientists are not sloppy programmers; however, few are trained in the art of computer programming or code optimization. I found that most have a working knowledge of some programming language and standard software engineering practices; but they do not know, or think about, how to make their programs run faster. They simply do not know the standard techniques used to make codes run faster. In fact, they do not even perceive that such techniques exist. The case studies described in this paper show that applying simple, well known techniques can significantly increase the performance of personal codes. It is important that the scientific community and the Government agencies that support scientific research find ways to better educate academic scientific programmers. The inefficiency of their codes is so bad that it is retarding both the quality and progress of scientific research. # cacheperformance redundantoperations loopstructures performanceimprovement 1 x x 15.5 2 x 2.8 3 x x 2.5 4 x 2.1 5 x x 2.0 6 x 5.0 7 x 5.8 8 x 6.3 9 2.2 10 x x 3.3 Table 1 — Area of improvement and performance gains of 10 codes The remainder of the paper is organized as follows: sections 2, 3, and 4 discuss the three most common sources of inefficiencies in the codes studied. These are cache performance, redundant operations, and loop structures. Each section includes several examples. The last section summaries the work and suggests a possible solution to the issues raised. Optimizing cache performance Commodity microprocessor systems use caches to increase memory bandwidth and reduce memory latencies. Typical latencies from processor to L1, L2, local, and remote memory are 3, 10, 50, and 200 cycles, respectively. Moreover, bandwidth falls off dramatically as memory distances increase. Programs that do not use cache effectively run many times slower than programs that do. When optimizing for cache, the biggest performance gains are achieved by accessing data in cache order and reusing data to amortize the overhead of cache misses. Secondary considerations are prefetching, associativity, and replacement; however, the understanding and analysis required to optimize for the latter are probably beyond the capabilities of the non-expert. Much can be gained simply by accessing data in the correct order and maximizing data reuse. 6 out of the 10 codes studied here benefited from such high level optimizations. Array Accesses The most important cache optimization is the most basic: accessing Fortran array elements in column order and C array elements in row order. Four of the ten codes—1, 2, 4, and 10—got it wrong. Compilers will restructure nested loops to optimize cache performance, but may not do so if the loop structure is too complex, or the loop body includes conditionals, complex addressing, or function calls. In code 1, the compiler failed to invert a key loop because of complex addressing do I = 0, 1010, delta_x IM = I - delta_x IP = I + delta_x do J = 5, 995, delta_x JM = J - delta_x JP = J + delta_x T1 = CA1(IP, J) + CA1(I, JP) T2 = CA1(IM, J) + CA1(I, JM) S1 = T1 + T2 - 4 * CA1(I, J) CA(I, J) = CA1(I, J) + D * S1 end do end do In code 2, the culprit is conditionals do I = 1, N do J = 1, N If (IFLAG(I,J) .EQ. 0) then T1 = Value(I, J-1) T2 = Value(I-1, J) T3 = Value(I, J) T4 = Value(I+1, J) T5 = Value(I, J+1) Value(I,J) = 0.25 * (T1 + T2 + T5 + T4) Delta = ABS(T3 - Value(I,J)) If (Delta .GT. MaxDelta) MaxDelta = Delta endif enddo enddo I fixed both programs by inverting the loops by hand. Code 10 has three-dimensional arrays and triply nested loops. The structure of the most computationally intensive loops is too complex to invert automatically or by hand. The only practical solution is to transpose the arrays so that the dimension accessed by the innermost loop is in cache order. The arrays can be transposed at construction or prior to entering a computationally intensive section of code. The former requires all array references to be modified, while the latter is cost effective only if the cost of the transpose is amortized over many accesses. I used the second approach to optimize code 10. Code 5 has four-dimensional arrays and loops are nested four deep. For all of the reasons cited above the compiler is not able to restructure three key loops. Assume C arrays and let the four dimensions of the arrays be i, j, k, and l. In the original code, the index structure of the three loops is L1: for i L2: for i L3: for i for l for l for j for k for j for k for j for k for l So only L3 accesses array elements in cache order. L1 is a very complex loop—much too complex to invert. I brought the loop into cache alignment by transposing the second and fourth dimensions of the arrays. Since the code uses a macro to compute all array indexes, I effected the transpose at construction and changed the macro appropriately. The dimensions of the new arrays are now: i, l, k, and j. L3 is a simple loop and easily inverted. L2 has a loop-carried scalar dependence in k. By promoting the scalar name that carries the dependence to an array, I was able to invert the third and fourth subloops aligning the loop with cache. Code 5 is by far the most difficult of the four codes to optimize for array accesses; but the knowledge required to fix the problems is no more than that required for the other codes. I would judge this code at the limits of, but not beyond, the capabilities of appropriately trained computational scientists. Array Strides When a cache miss occurs, a line (64 bytes) rather than just one word is loaded into the cache. If data is accessed stride 1, than the cost of the miss is amortized over 8 words. Any stride other than one reduces the cost savings. Two of the ten codes studied suffered from non-unit strides. The codes represent two important classes of "strided" codes. Code 1 employs a multi-grid algorithm to reduce time to convergence. The grids are every tenth, fifth, second, and unit element. Since time to convergence is inversely proportional to the distance between elements, coarse grids converge quickly providing good starting values for finer grids. The better starting values further reduce the time to convergence. The downside is that grids of every nth element, n > 1, introduce non-unit strides into the computation. In the original code, much of the savings of the multi-grid algorithm were lost due to this problem. I eliminated the problem by compressing (copying) coarse grids into continuous memory, and rewriting the computation as a function of the compressed grid. On convergence, I copied the final values of the compressed grid back to the original grid. The savings gained from unit stride access of the compressed grid more than paid for the cost of copying. Using compressed grids, the loop from code 1 included in the previous section becomes do j = 1, GZ do i = 1, GZ T1 = CA(i+0, j-1) + CA(i-1, j+0) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) S1 = T1 + T4 - 4 * CA1(i+0, j+0) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 enddo enddo where CA and CA1 are compressed arrays of size GZ. Code 7 traverses a list of objects selecting objects for later processing. The labels of the selected objects are stored in an array. The selection step has unit stride, but the processing steps have irregular stride. A fix is to save the parameters of the selected objects in temporary arrays as they are selected, and pass the temporary arrays to the processing functions. The fix is practical if the same parameters are used in selection as in processing, or if processing comprises a series of distinct steps which use overlapping subsets of the parameters. Both conditions are true for code 7, so I achieved significant improvement by copying parameters to temporary arrays during selection. Data reuse In the previous sections, we optimized for spatial locality. It is also important to optimize for temporal locality. Once read, a datum should be used as much as possible before it is forced from cache. Loop fusion and loop unrolling are two techniques that increase temporal locality. Unfortunately, both techniques increase register pressure—as loop bodies become larger, the number of registers required to hold temporary values grows. Once register spilling occurs, any gains evaporate quickly. For multiprocessors with small register sets or small caches, the sweet spot can be very small. In the ten codes presented here, I found no opportunities for loop fusion and only two opportunities for loop unrolling (codes 1 and 3). In code 1, unrolling the outer and inner loop one iteration increases the number of result values computed by the loop body from 1 to 4, do J = 1, GZ-2, 2 do I = 1, GZ-2, 2 T1 = CA1(i+0, j-1) + CA1(i-1, j+0) T2 = CA1(i+1, j-1) + CA1(i+0, j+0) T3 = CA1(i+0, j+0) + CA1(i-1, j+1) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) T5 = CA1(i+2, j+0) + CA1(i+1, j+1) T6 = CA1(i+1, j+1) + CA1(i+0, j+2) T7 = CA1(i+2, j+1) + CA1(i+1, j+2) S1 = T1 + T4 - 4 * CA1(i+0, j+0) S2 = T2 + T5 - 4 * CA1(i+1, j+0) S3 = T3 + T6 - 4 * CA1(i+0, j+1) S4 = T4 + T7 - 4 * CA1(i+1, j+1) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 CA(i+1, j+0) = CA1(i+1, j+0) + DD * S2 CA(i+0, j+1) = CA1(i+0, j+1) + DD * S3 CA(i+1, j+1) = CA1(i+1, j+1) + DD * S4 enddo enddo The loop body executes 12 reads, whereas as the rolled loop shown in the previous section executes 20 reads to compute the same four values. In code 3, two loops are unrolled 8 times and one loop is unrolled 4 times. Here is the before for (k = 0; k < NK[u]; k++) { sum = 0.0; for (y = 0; y < NY; y++) { sum += W[y][u][k] * delta[y]; } backprop[i++]=sum; } and after code for (k = 0; k < KK - 8; k+=8) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (y = 0; y < NY; y++) { sum0 += W[y][0][k+0] * delta[y]; sum1 += W[y][0][k+1] * delta[y]; sum2 += W[y][0][k+2] * delta[y]; sum3 += W[y][0][k+3] * delta[y]; sum4 += W[y][0][k+4] * delta[y]; sum5 += W[y][0][k+5] * delta[y]; sum6 += W[y][0][k+6] * delta[y]; sum7 += W[y][0][k+7] * delta[y]; } backprop[k+0] = sum0; backprop[k+1] = sum1; backprop[k+2] = sum2; backprop[k+3] = sum3; backprop[k+4] = sum4; backprop[k+5] = sum5; backprop[k+6] = sum6; backprop[k+7] = sum7; } for one of the loops unrolled 8 times. Optimizing for temporal locality is the most difficult optimization considered in this paper. The concepts are not difficult, but the sweet spot is small. Identifying where the program can benefit from loop unrolling or loop fusion is not trivial. Moreover, it takes some effort to get it right. Still, educating scientific programmers about temporal locality and teaching them how to optimize for it will pay dividends. Reducing instruction count Execution time is a function of instruction count. Reduce the count and you usually reduce the time. The best solution is to use a more efficient algorithm; that is, an algorithm whose order of complexity is smaller, that converges quicker, or is more accurate. Optimizing source code without changing the algorithm yields smaller, but still significant, gains. This paper considers only the latter because the intent is to study how much better codes can run if written by programmers schooled in basic code optimization techniques. The ten codes studied benefited from three types of "instruction reducing" optimizations. The two most prevalent were hoisting invariant memory and data operations out of inner loops. The third was eliminating unnecessary data copying. The nature of these inefficiencies is language dependent. Memory operations The semantics of C make it difficult for the compiler to determine all the invariant memory operations in a loop. The problem is particularly acute for loops in functions since the compiler may not know the values of the function's parameters at every call site when compiling the function. Most compilers support pragmas to help resolve ambiguities; however, these pragmas are not comprehensive and there is no standard syntax. To guarantee that invariant memory operations are not executed repetitively, the user has little choice but to hoist the operations by hand. The problem is not as severe in Fortran programs because in the absence of equivalence statements, it is a violation of the language's semantics for two names to share memory. Codes 3 and 5 are C programs. In both cases, the compiler did not hoist all invariant memory operations from inner loops. Consider the following loop from code 3 for (y = 0; y < NY; y++) { i = 0; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += delta[y] * I1[i++]; } } } Since dW[y][u] can point to the same memory space as delta for one or more values of y and u, assignment to dW[y][u][k] may change the value of delta[y]. In reality, dW and delta do not overlap in memory, so I rewrote the loop as for (y = 0; y < NY; y++) { i = 0; Dy = delta[y]; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += Dy * I1[i++]; } } } Failure to hoist invariant memory operations may be due to complex address calculations. If the compiler can not determine that the address calculation is invariant, then it can hoist neither the calculation nor the associated memory operations. As noted above, code 5 uses a macro to address four-dimensional arrays #define MAT4D(a,q,i,j,k) (double *)((a)->data + (q)*(a)->strides[0] + (i)*(a)->strides[3] + (j)*(a)->strides[2] + (k)*(a)->strides[1]) The macro is too complex for the compiler to understand and so, it does not identify any subexpressions as loop invariant. The simplest way to eliminate the address calculation from the innermost loop (over i) is to define a0 = MAT4D(a,q,0,j,k) before the loop and then replace all instances of *MAT4D(a,q,i,j,k) in the loop with a0[i] A similar problem appears in code 6, a Fortran program. The key loop in this program is do n1 = 1, nh nx1 = (n1 - 1) / nz + 1 nz1 = n1 - nz * (nx1 - 1) do n2 = 1, nh nx2 = (n2 - 1) / nz + 1 nz2 = n2 - nz * (nx2 - 1) ndx = nx2 - nx1 ndy = nz2 - nz1 gxx = grn(1,ndx,ndy) gyy = grn(2,ndx,ndy) gxy = grn(3,ndx,ndy) balance(n1,1) = balance(n1,1) + (force(n2,1) * gxx + force(n2,2) * gxy) * h1 balance(n1,2) = balance(n1,2) + (force(n2,1) * gxy + force(n2,2) * gyy)*h1 end do end do The programmer has written this loop well—there are no loop invariant operations with respect to n1 and n2. However, the loop resides within an iterative loop over time and the index calculations are independent with respect to time. Trading space for time, I precomputed the index values prior to the entering the time loop and stored the values in two arrays. I then replaced the index calculations with reads of the arrays. Data operations Ways to reduce data operations can appear in many forms. Implementing a more efficient algorithm produces the biggest gains. The closest I came to an algorithm change was in code 4. This code computes the inner product of K-vectors A(i) and B(j), 0 = i < N, 0 = j < M, for most values of i and j. Since the program computes most of the NM possible inner products, it is more efficient to compute all the inner products in one triply-nested loop rather than one at a time when needed. The savings accrue from reading A(i) once for all B(j) vectors and from loop unrolling. for (i = 0; i < N; i+=8) { for (j = 0; j < M; j++) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (k = 0; k < K; k++) { sum0 += A[i+0][k] * B[j][k]; sum1 += A[i+1][k] * B[j][k]; sum2 += A[i+2][k] * B[j][k]; sum3 += A[i+3][k] * B[j][k]; sum4 += A[i+4][k] * B[j][k]; sum5 += A[i+5][k] * B[j][k]; sum6 += A[i+6][k] * B[j][k]; sum7 += A[i+7][k] * B[j][k]; } C[i+0][j] = sum0; C[i+1][j] = sum1; C[i+2][j] = sum2; C[i+3][j] = sum3; C[i+4][j] = sum4; C[i+5][j] = sum5; C[i+6][j] = sum6; C[i+7][j] = sum7; }} This change requires knowledge of a typical run; i.e., that most inner products are computed. The reasons for the change, however, derive from basic optimization concepts. It is the type of change easily made at development time by a knowledgeable programmer. In code 5, we have the data version of the index optimization in code 6. Here a very expensive computation is a function of the loop indices and so cannot be hoisted out of the loop; however, the computation is invariant with respect to an outer iterative loop over time. We can compute its value for each iteration of the computation loop prior to entering the time loop and save the values in an array. The increase in memory required to store the values is small in comparison to the large savings in time. The main loop in Code 8 is doubly nested. The inner loop includes a series of guarded computations; some are a function of the inner loop index but not the outer loop index while others are a function of the outer loop index but not the inner loop index for (j = 0; j < N; j++) { for (i = 0; i < M; i++) { r = i * hrmax; R = A[j]; temp = (PRM[3] == 0.0) ? 1.0 : pow(r, PRM[3]); high = temp * kcoeff * B[j] * PRM[2] * PRM[4]; low = high * PRM[6] * PRM[6] / (1.0 + pow(PRM[4] * PRM[6], 2.0)); kap = (R > PRM[6]) ? high * R * R / (1.0 + pow(PRM[4]*r, 2.0) : low * pow(R/PRM[6], PRM[5]); < rest of loop omitted > }} Note that the value of temp is invariant to j. Thus, we can hoist the computation for temp out of the loop and save its values in an array. for (i = 0; i < M; i++) { r = i * hrmax; TEMP[i] = pow(r, PRM[3]); } [N.B. – the case for PRM[3] = 0 is omitted and will be reintroduced later.] We now hoist out of the inner loop the computations invariant to i. Since the conditional guarding the value of kap is invariant to i, it behooves us to hoist the computation out of the inner loop, thereby executing the guard once rather than M times. The final version of the code is for (j = 0; j < N; j++) { R = rig[j] / 1000.; tmp1 = kcoeff * par[2] * beta[j] * par[4]; tmp2 = 1.0 + (par[4] * par[4] * par[6] * par[6]); tmp3 = 1.0 + (par[4] * par[4] * R * R); tmp4 = par[6] * par[6] / tmp2; tmp5 = R * R / tmp3; tmp6 = pow(R / par[6], par[5]); if ((par[3] == 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp5; } else if ((par[3] == 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp4 * tmp6; } else if ((par[3] != 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp5; } else if ((par[3] != 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp4 * tmp6; } for (i = 0; i < M; i++) { kap = KAP[i]; r = i * hrmax; < rest of loop omitted > } } Maybe not the prettiest piece of code, but certainly much more efficient than the original loop, Copy operations Several programs unnecessarily copy data from one data structure to another. This problem occurs in both Fortran and C programs, although it manifests itself differently in the two languages. Code 1 declares two arrays—one for old values and one for new values. At the end of each iteration, the array of new values is copied to the array of old values to reset the data structures for the next iteration. This problem occurs in Fortran programs not included in this study and in both Fortran 77 and Fortran 90 code. Introducing pointers to the arrays and swapping pointer values is an obvious way to eliminate the copying; but pointers is not a feature that many Fortran programmers know well or are comfortable using. An easy solution not involving pointers is to extend the dimension of the value array by 1 and use the last dimension to differentiate between arrays at different times. For example, if the data space is N x N, declare the array (N, N, 2). Then store the problem’s initial values in (_, _, 2) and define the scalar names new = 2 and old = 1. At the start of each iteration, swap old and new to reset the arrays. The old–new copy problem did not appear in any C program. In programs that had new and old values, the code swapped pointers to reset data structures. Where unnecessary coping did occur is in structure assignment and parameter passing. Structures in C are handled much like scalars. Assignment causes the data space of the right-hand name to be copied to the data space of the left-hand name. Similarly, when a structure is passed to a function, the data space of the actual parameter is copied to the data space of the formal parameter. If the structure is large and the assignment or function call is in an inner loop, then copying costs can grow quite large. While none of the ten programs considered here manifested this problem, it did occur in programs not included in the study. A simple fix is always to refer to structures via pointers. Optimizing loop structures Since scientific programs spend almost all their time in loops, efficient loops are the key to good performance. Conditionals, function calls, little instruction level parallelism, and large numbers of temporary values make it difficult for the compiler to generate tightly packed, highly efficient code. Conditionals and function calls introduce jumps that disrupt code flow. Users should eliminate or isolate conditionls to their own loops as much as possible. Often logical expressions can be substituted for if-then-else statements. For example, code 2 includes the following snippet MaxDelta = 0.0 do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) if (Delta > MaxDelta) MaxDelta = Delta enddo enddo if (MaxDelta .gt. 0.001) goto 200 Since the only use of MaxDelta is to control the jump to 200 and all that matters is whether or not it is greater than 0.001, I made MaxDelta a boolean and rewrote the snippet as MaxDelta = .false. do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) MaxDelta = MaxDelta .or. (Delta .gt. 0.001) enddo enddo if (MaxDelta) goto 200 thereby, eliminating the conditional expression from the inner loop. A microprocessor can execute many instructions per instruction cycle. Typically, it can execute one or more memory, floating point, integer, and jump operations. To be executed simultaneously, the operations must be independent. Thick loops tend to have more instruction level parallelism than thin loops. Moreover, they reduce memory traffice by maximizing data reuse. Loop unrolling and loop fusion are two techniques to increase the size of loop bodies. Several of the codes studied benefitted from loop unrolling, but none benefitted from loop fusion. This observation is not too surpising since it is the general tendency of programmers to write thick loops. As loops become thicker, the number of temporary values grows, increasing register pressure. If registers spill, then memory traffic increases and code flow is disrupted. A thick loop with many temporary values may execute slower than an equivalent series of thin loops. The biggest gain will be achieved if the thick loop can be split into a series of independent loops eliminating the need to write and read temporary arrays. I found such an occasion in code 10 where I split the loop do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do into two disjoint loops do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) end do end do do i = 1, n do j = 1, m C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do Conclusions Over the course of the last year, I have had the opportunity to work with over two dozen academic scientific programmers at leading research universities. Their research interests span a broad range of scientific fields. Except for two programs that relied almost exclusively on library routines (matrix multiply and fast Fourier transform), I was able to improve significantly the single processor performance of all codes. Improvements range from 2x to 15.5x with a simple average of 4.75x. Changes to the source code were at a very high level. I did not use sophisticated techniques or programming tools to discover inefficiencies or effect the changes. Only one code was parallel despite the availability of parallel systems to all developers. Clearly, we have a problem—personal scientific research codes are highly inefficient and not running parallel. The developers are unaware of simple optimization techniques to make programs run faster. They lack education in the art of code optimization and parallel programming. I do not believe we can fix the problem by publishing additional books or training manuals. To date, the developers in questions have not studied the books or manual available, and are unlikely to do so in the future. Short courses are a possible solution, but I believe they are too concentrated to be much use. The general concepts can be taught in a three or four day course, but that is not enough time for students to practice what they learn and acquire the experience to apply and extend the concepts to their codes. Practice is the key to becoming proficient at optimization. I recommend that graduate students be required to take a semester length course in optimization and parallel programming. We would never give someone access to state-of-the-art scientific equipment costing hundreds of thousands of dollars without first requiring them to demonstrate that they know how to use the equipment. Yet the criterion for time on state-of-the-art supercomputers is at most an interesting project. Requestors are never asked to demonstrate that they know how to use the system, or can use the system effectively. A semester course would teach them the required skills. Government agencies that fund academic scientific research pay for most of the computer systems supporting scientific research as well as the development of most personal scientific codes. These agencies should require graduate schools to offer a course in optimization and parallel programming as a requirement for funding. About the Author John Feo received his Ph.D. in Computer Science from The University of Texas at Austin in 1986. After graduate school, Dr. Feo worked at Lawrence Livermore National Laboratory where he was the Group Leader of the Computer Research Group and principal investigator of the Sisal Language Project. In 1997, Dr. Feo joined Tera Computer Company where he was project manager for the MTA, and oversaw the programming and evaluation of the MTA at the San Diego Supercomputer Center. In 2000, Dr. Feo joined Sun Microsystems as an HPC application specialist. He works with university research groups to optimize and parallelize scientific codes. Dr. Feo has published over two dozen research articles in the areas of parallel parallel programming, parallel programming languages, and application performance.

    Read the article

< Previous Page | 25 26 27 28 29 30 31 32 33 34 35 36  | Next Page >