compilers - Page 19 - Developer IT

What is the current status of LOGO? (The programming language)

- by Workshop Alex

In another Q I saw someone mention LOGO and it reminded me of some programming language from the past, mostly used for educational purposes. Basically, you would have to program a turtle with a pen through it's back. By telling it where to move, the pen would draw lines. It could also lift the pen to move without drawing lines. I have fond memories of this language, since it was one of the first I've ever used, about 30 years ago. (Yeah, I'm old.) Well, I only programmed with LOGO for two days or so, but it got me hooked to programming. But I wonder if the LOGO information on it's Wikipedia page is still correct. And more importantly, are there versions that will create .NET binaries? Are there only LOGO Interpreters and no compilers? What is the current status of this educational language? And more interestingly, are there more experts here at SO who have experimented with LOGO in the past? Yeah, I know. Nowadays this language is a bit antique but I got some warm and comfortable memories when I remembered this interesting language from my history. For a teenager back then, it was fun!

Read the article

Communication between lexer and parser

- by FredOverflow

Every time I write a simple lexer and parser, I stumble upon the same question: how should the lexer and the parser communicate? I see four different approaches: The lexer eagerly converts the entire input string into a vector of tokens. Once this is done, the vector is fed to the parser which converts it into a tree. This is by far the simplest solution to implement, but since all tokens are stored in memory, it wastes a lot of space. Each time the lexer finds a token, it invokes a function on the parser, passing the current token. In my experience, this only works if the parser can naturally be implemented as a state machine like LALR parsers. By contrast, I don't think it would work at all for recursive descent parsers. Each time the parser needs a token, it asks the lexer for the next one. This is very easy to implement in C# due to the yield keyword, but quite hard in C++ which doesn't have it. The lexer and parser communicate through an asynchronous queue. This is commonly known under the title "producer/consumer", and it should simplify the communication between the lexer and the parser a lot. Does it also outperform the other solutions on multicores? Or is lexing too trivial? Is my analysis sound? Are there other approaches I haven't thought of? What is used in real-world compilers? It would be really cool if compiler writers like Eric Lippert could shed some light on this issue.

Read the article

Some questions about special operators i've never seen in C++ code.

- by toto

I have downloaded the Phoenix SDK June 2008 (Tools for compilers) and when I'm reading the code of the Hello sample, I really feel lost. public ref class Hello { //-------------------------------------------------------------------------- // // Description: // // Class Variables. // // Remarks: // // A normal compiler would have more flexible means for holding // on to all this information, but in our case it's simplest (if // somewhat inelegant) if we just keep references to all the // structures we'll need to access as classstatic variables. // //-------------------------------------------------------------------------- static Phx::ModuleUnit ^ module; static Phx::Targets::Runtimes::Runtime ^ runtime; static Phx::Targets::Architectures::Architecture ^ architecture; static Phx::Lifetime ^ lifetime; static Phx::Types::Table ^ typeTable; static Phx::Symbols::Table ^ symbolTable; static Phx::Phases::PhaseConfiguration ^ phaseConfiguration; 2 Questions : What's that ref keyword? What is that sign ^ ? What is it doing protected: virtual void Execute ( Phx::Unit ^ unit ) override; }; override is a C++ keyword too? It's colored as such in my Visual Studio. I really want to play with this framework, but this advanced C++ is really an obstacle right now. Thank you.

Read the article

java: libraries for immutable functional-style data structures

- by Jason S

This is very similar to another question (Functional Data Structures in Java) but the answers there are not particularly useful. I need to use immutable versions of the standard Java collections (e.g. HashMap / TreeMap / ArrayList / LinkedList / HashSet / TreeSet). By "immutable" I mean immutable in the functional sense (e.g. purely functional data structures), where updating operations on the data structure do not change the original data, but instead return a new instance of the same kind of data structure. Also typically new and old instances of the data structure will share immutable data to be efficient in time and space. From what I can tell my options include: Functional Java Scala Clojure but I'm not sure whether any of these are particularly appealing to me. I have a few requirements/desirements: the collections in question should be usable directly in Java (with the appropriate libraries in the classpath). FJ would work for me; I'm not sure if I can use Scala's or Clojure's data structures in Java w/o having to use the compilers/interpreters from those languages and w/o having to write Scala or Clojure code. Core operations on lists/maps/sets should be possible w/o having to create function objects with confusing syntaxes (FJ looks slightly iffy) They should be efficient in time and space. I'm looking for a library which ideally has done some performance testing. FJ's TreeMap is based on a red-black tree, not sure how that rates. Documentation / tutorials should be good enough so someone can get started quickly using the data structures. FJ fails on that front. Any suggestions?

Read the article

C++ Templates: Convincing self against code bloat

- by ArunSaha

I have heard about code bloats in context of C++ templates. I know that is not the case with modern C++ compilers. But, I want to construct an example and convince myself. Lets say we have a class template< typename T, size_t N > class Array { public: T * data(); private: T elems_[ N }; }; template< typename T, size_t N > T * Array<T>::data() { return elems_; } Further, let's say types.h contains typedef Array< int, 100 > MyArray; x.cpp contains MyArray ArrayX; and y.cpp contains MyArray ArrayY; Now, how can I verify that the code space for MyArray::data() is same for both ArrayX and ArrayY? What else I should know and verify from this (or other similar simple) examples? If there is any g++ specific tips, I am interested for that too. PS: Regarding bloat, I am concerned even for the slightest of bloats, since I come from embedded context.

Read the article

Tips for submitting a library to Boost?

- by AraK

Hi everyone, Summer is coming, and a group of friends and I are getting ready for it :) We decided to build a compile-time Arbitrary precision Unsigned Integers. We would like to provide a set of integers algorithms(functions) with the library. We have seen a number of requests for such a library(SoC2010, C++0x Standard Library wishlist). Also, a regular run-time bigint is requested usually with that, but we don't want to go into the hassle of memory management. The idea came to me from a library called TTMath, unfortunately this library works only on specific platforms because Assembly was used extensively in the library. We would like to write a standard library, depending on the C++ standard library and Boost. Also, we would like to use the available C++0x facilities in current compilers like user-defined literals and others. This would technically make the library non-standard for a while, but we think that it is a matter of time the new standards will be official. Your hints on the whole process including design, implementation, documentation, maintainable of the library are more than welcom. We are a group of students and fresh graduates who are looking for something interesting in the summer, but we see that Boost is full of gurus and we don't want to forget something too obvious. We are communicating on-line, so there is no shared white-boards :( Thanks,

Read the article

complete nub.. iostream file not found

- by user1742389

folks I am almost completely new to programming so please bear with me. I am using the first example from lydia.com c++ videos and failing. I am using Xcode 4.5.1 with a c++ command line project instead of eclipse and I am getting an error on compile of iostream file not found. the code is simple and I will include exactly what I have at the end of this message. I thought that iostream was a standard header that came with all even remotely recent versions of c++ compilers and am shocked to get this error and I cannot find any way to fix this. please tell me whats going on. #include <iostream> #include <stdio.h> #include <sstream> #include <vector> int main(int argc, char ** argv) { stringstream version; version << "GCC Version"; _GNUC_<<"."<<_GNUC_MINOR_<<"."<<_GNUC_PATCHLEVEL_<<_"\nVersion String: " <<_VERSION_; cout <<version.string() endl; vector<string> v={"one","two","three"}; for ( s : v ) { cout << s <<endl; } // insert code here... printf("Hello, World!\n"); return 0; } Thanks.

Read the article

Creating serializeable unique compile-time identifiers for arbitrary UDT's.

- by Endiannes

I would like a generic way to create unique compile-time identifiers for any C++ user defined types. for example: unique_id<my_type>::value == 0 // true unique_id<other_type>::value == 1 // true I've managed to implement something like this using preprocessor meta programming, the problem is, serialization is not consistent. For instance if the class template unique_id is instantiated with other_type first, then any serialization in previous revisions of my program will be invalidated. I've searched for solutions to this problem, and found several ways to implement this with non-consistent serialization if the unique values are compile-time constants. If RTTI or similar methods, like boost::sp_typeinfo are used, then the unique values are obviously not compile-time constants and extra overhead is present. An ad-hoc solution to this problem would be, instantiating all of the unique_id's in a separate header in the correct order, but this causes additional maintenance and boilerplate code, which is not different than using an enum unique_id{my_type, other_type};. A good solution to this problem would be using user-defined literals, unfortunately, as far as I know, no compiler supports them at this moment. The syntax would be 'my_type'_id; 'other_type'_id; with udl's. I'm hoping somebody knows a trick that allows implementing serialize-able unique identifiers in C++ with the current standard (C++03/C++0x), I would be happy if it works with the latest stable MSVC and GNU-G++ compilers, although I expect if there is a solution, it's not portable.

Read the article

how to emulate thread local storage at user space in C++ ?

- by vprajan

I am working on a mobile platform over Nucleus RTOS. It uses Nucleus Threading system but it doesn't have support for explicit thread local storage i.e, TlsAlloc, TlsSetValue, TlsGetValue, TlsFree APIs. The platform doesn't have user space pthreads as well. I found that __thread storage modifier is present in most of the C++ compilers. But i don't know how to make it work for my kind of usage. How does __thread keyword can be mapped with explicit thread local storage? I read many articles but nothing is so clear for giving me the following basic information will __thread variable different for each thread ? How to write to that and read from it ? does each thread has exactly one copy of the variable ? following is the pthread based implementation: pthread_key_t m_key; struct Data : Noncopyable { Data(T* value, void* owner) : value(value), owner(owner) {} int* value; }; inline ThreadSpecific() { int error = pthread_key_create(&m_key, destroy); if (error) CRASH(); } inline ~ThreadSpecific() { pthread_key_delete(m_key); // Does not invoke destructor functions. } inline T* get() { Data* data = static_cast<Data*>(pthread_getspecific(m_key)); return data ? data->value : 0; } inline void set(T* ptr) { ASSERT(!get()); pthread_setspecific(m_key, new Data(ptr, this)); } How to make the above code use __thread way to set & get specific value ? where/when does the create & delete happen? If this is not possible, how to write custom pthread_setspecific, pthread_getspecific kind of APIs. I tried using a C++ global map and index it uniquely for each thread and retrieved data from it. But it didn't work well.

Read the article

When is it possible to override top-level bindings in (R7RS) scheme?

- by Marc

I have read the current draft of the forthcoming R7RS scheme standard (small language), but I don't understand under which conditions it is not an error to redefine top-level bindings. I guess that it is possible to define or set! a binding that has been introduced at the top-level of a program a second time. But what about imported bindings from an external library? Is it possible to override these bindings by the standard? On page 26/27 of the report, it says: The top level of a program may also include import declarations. In a library declaration, it is an error to import the same identifier more than once with different bindings, or to redefine or mutate an imported binding with define, define-syntax or set!. However, a REPL should permit these actions. Does it mean that redefining is only an error when it does happen in libraries for imported bindings? I understand that it prohibits optimisations by compilers if the compiler does not know whether, say + still means the built-in addition or is any other user-specified error. But from this perspective, it does not make sense to restrict forbidding to rebind on the library level, when it would also make sense (at least) for imported bindings in programs. P.S.: As this is all about the environment of a scheme program: am I right in saying that environments are not first class citizens because one cannot get hold of the current environment? (Which, in turn, allows a compiled program to forget about the chosen names of the bindings.)

Read the article

Sequence Point and Evaluation Order( Preincrement)

- by Josh

There was a debate today among some of my colleagues and I wanted to clarify it. It is about the evaluation order and the sequence point in an expression. It is clearly stated in the standard that C/C++ does not have a left-to-right evaluation in an expression unlike languages like Java which is guaranteed to have a sequencial left-to-right order. So, in the below expression, the evaluation of the leftmost operand(B) in the binary operation is sequenced before the evaluation of the rightmost operand(C): A = B B_OP C The following expression according, to CPPReference under the subsection Sequenced-before rules(Undefined Behaviour) and Bjarne's TCPPL 3rd ed, is an UB x = x++ + 1; It could be interpreted as the compilers like BUT the expression below is said to be clearly a well defined behaviour in C++11 x = ++x + 1; So, if the above expression is well defined, what is the "fate" of this? array[x] = ++x; It seems the evaluation of a post-increment and post-decrement is not defined but the pre-increment and the pre-decrement is defined. NOTE: This is not used in a real-life code. Clang 3.4 and GCC 4.8 clearly warns about both the pre- and post-increment sequence point.

Read the article

What new Unicode functions are there in C++0x?

- by luiscubal

It has been mentioned in several sources that C++0x will include better language-level support for Unicode(including types and literals). If the language is going to add these new features, it's only natural to assume that the standard library will as well. However, I am currently unable to find any references to the new standard library. I expected to find out the answer for these answers: Does the new library provide standard methods to convert UTF-8 to UTF-16, etc.? Does the new library allowing writing UTF-8 to files, to the console (or from files, from the console). If so, can we use cout or will we need something else? Does the new library include "basic" functionality such as: discovering the byte count and length of a UTF-8 string, converting to upper-case/lower-case(does this consider the influence of locales?) Finally, are any of these functions are available in any popular compilers such as GCC or Visual Studio? I have tried to look for information, but I can't seem to find anything? I am actually starting to think that maybe these things aren't even decided yet(I am aware that C++0x is a work in progress).

Read the article

PC to Macbook Pro Transition - Getting (re)started?

- by Torus Linvald

I'm in my second computer science course right now. I've enjoyed programming so far, but really have just scraped my way by. I've not done much programming outside of required class work. For similar reasons, I never really invested in downloading/learning software to help me program (IDE's, editors, compilers, etc). I know it sounds tedious, but my current setup is: notepad++ for coding; Filezilla to transfer .cpp & .h files to school's aludra/unix and compiling; unix tells me where my bugs are and I go back to notepad++ to debug; repeat until done. This isn't fun - and I know it could be easier. But I put it off knowing that I was soon going to switch to a Mac. And, tomorrow, I'm switching. So... How should I set up my Macbook for the best programming experience? What IDEs and editors and debuggers and so on should I download? How will Mac programming differ from PC? I'm open to all ideas and comments, even the most basic. (Background - I'm learning/programming in C++ right now. Next semester, my classes switch to Java. I'm also going to take a class in web development, with HTML/CSS/Javascript/PHP. My new laptop will be a late 2009 Macbook Pro with Leopard, or maybe Snow Leopard. Free would be preferrable for all programs.) Thank you all.

Read the article

Overhead of calling tiny functions from a tight inner loop? [C++]

- by John

Say you see a loop like this one: for(int i=0; i<thing.getParent().getObjectModel().getElements(SOME_TYPE).count(); ++i) { thing.getData().insert( thing.GetData().Count(), thing.getParent().getObjectModel().getElements(SOME_TYPE)[i].getName() ); } if this was Java I'd probably not think twice. But in performance-critical sections of C++, it makes me want to tinker with it... however I don't know if the compiler is smart enough to make it futile. This is a made up example but all it's doing is inserting strings into a container. Please don't assume any of these are STL types, think in general terms about the following: Is having a messy condition in the for loop going to get evaluated each time, or only once? If those get methods are simply returning references to member variables on the objects, will they be inlined away? Would you expect custom [] operators to get optimized at all? In other words is it worth the time (in performance only, not readability) to convert it to something like: ElementContainer &source = thing.getParent().getObjectModel().getElements(SOME_TYPE); int num = source.count(); Store &destination = thing.getData(); for(int i=0;i<num;++i) { destination.insert(thing.GetData().Count(), source[i].getName(); } Remember, this is a tight loop, called millions of times a second. What I wonder is if all this will shave a couple of cycles per loop or something more substantial? Yes I know the quote about "premature optimisation". And I know that profiling is important. But this is a more general question about modern compilers, Visual Studio in particular.

Read the article

C++: What is the size of an object of an empty class?

- by Ashwin

I was wondering what could be the size of an object of an empty class. It surely could not be 0 bytes since it should be possible to reference and point to it like any other object. But, how big is such an object? I used this small program: #include <iostream> using namespace std; class Empty {}; int main() { Empty e; cerr << sizeof(e) << endl; return 0; } The output I got on both Visual C++ and Cygwin-g++ compilers was 1 byte! This was a little surprising to me since I was expecting it to be of the size of the machine word (32 bits or 4 bytes). Can anyone explain why the size of 1 byte? Why not 4 bytes? Is this dependent on compiler or the machine too? Also, can someone give a more cogent reason for why an empty class object will not be of size 0 bytes?

Read the article

C++ const-reference semantics?

- by Kristoffer

Consider the sample application below. It demonstrates what I would call a flawed class design. #include <iostream> using namespace std; struct B { B() : m_value(1) {} long m_value; }; struct A { const B& GetB() const { return m_B; } void Foo(const B &b) { // assert(this != &b); m_B.m_value += b.m_value; m_B.m_value += b.m_value; } protected: B m_B; }; int main(int argc, char* argv[]) { A a; cout << "Original value: " << a.GetB().m_value << endl; cout << "Expected value: 3" << endl; a.Foo(a.GetB()); cout << "Actual value: " << a.GetB().m_value << endl; return 0; } Output: Original value: 1 Expected value: 3 Actual value: 4 Obviously, the programmer is fooled by the constness of b. By mistake b points to this, which yields the undesired behavior. My question: What const-rules should you follow when designing getters/setters? My suggestion: Never return a reference to a member variable if it can be set by reference through a member function. Hence, either return by value or pass parameters by value. (Modern compilers will optimize away the extra copy anyway.)

Read the article

Do NOT Change "Copy Local” project references to false, unless understand subsequences.

- by Michael Freidgeim

To optimize performance of visual studio build I've found multiple recommendations to change CopyLocal property for dependent dlls to false,e.g. From http://stackoverflow.com/questions/690033/best-practices-for-large-solutions-in-visual-studio-2008 CopyLocal? For sure turn this offhttp://stackoverflow.com/questions/280751/what-is-the-best-practice-for-copy-local-and-with-project-referencesAlways set the Copy Local property to false and enforce this via a custom msbuild stephttp://codebetter.com/patricksmacchia/2007/06/20/benefit-from-the-c-and-vb-net-compilers-perf/BenefitBenefitMy advice is to always set ‘Copy Local’ to falseSome time ago we've tried to change the setting to false, and found that it causes problem for deployment of top-level projects.Recently I've followed the suggestion and changed the settings for middle-level projects. It didn't cause immediate issues, but I was warned by Readify Consultant Colin Savage about possible errors during deploymentsI haven't undone the changes immediately and we found a few issues during testing.There are many scenarios, when you need to have Copy Local’ left to True.The concerns are highlighted in some stack overflow answers, but they have small number of votes.Top-level projects: set copy local = true.First of all, it doesn't work correctly for top-level projects, i.e. executables or web sites.As pointed in the answer http://stackoverflow.com/a/6529461/52277for all the references in the one at the top set copy local = true.Alternatively you have to change output directory as it's described in http://www.simple-talk.com/dotnet/.net-framework/partitioning-your-code-base-through-.net-assemblies-and-visual-studio-projects/If you set ‘ Copy Local = false’, VS will, unless you tell it otherwise, place each assembly alone in its own .\bin\Debugdirectory. Because of this, you will need to configure VS to place assemblies together in the same directory. To do so, for each VS project, go to VS > Project Properties > Build tab > Output path, and set the Ouput path to ..\bin\Debugfor debug configuration, and ..\bin\Release for release configuration.Second-level dependencies: set copy local = true.Another example when copylocal =false fails on run-time, is when top level assembly doesn't directly referenced one of indirect dependencies.E..g. Top-level assembly A has reference to assembly B with copylocal =true, but assembly B has reference to assembly C with copylocal =false. Most likely assembly C will be missing on runtime and will cause errors E.g. http://stackoverflow.com/questions/602765/when-should-copy-local-be-set-to-true-and-when-should-it-not?lq=1Copy local is important for deployment scenarios and tools. As a general rule you should use CopyLocal=True and http://stackoverflow.com/questions/602765/when-should-copy-local-be-set-to-true-and-when-should-it-not?lq=1 Unfortunately there are some quirks and CopyLocal won't necessary work as expected for assembly references in secondary assemblies structured as shown below.MainApp.exe MyLibrary.dll ThirdPartyLibrary.dll (if in the GAC CopyLocal won't copy to MainApp bin folder)This makes xcopy deployments difficult . .Reflection called DLLs dependencies: set copy local = true.E.g user can see error "ISystem.Reflection.ReflectionTypeLoadException: Unable to load one or more of the requested types. Retrieve the LoaderExceptions property for more information."The fix for the issue is recommended in http://stackoverflow.com/a/6200173/52277"I solved this issue by setting the Copy Local attribute of my project's references to true."In general, the problems with investigation of deployment issues may overweight the benefits of reduced build time. Setting the Copy Local to false without considering deployment issues is not a good idea.

Read the article

How can I back up my ubuntu system?

- by Eloff

I'm sure there's a lot of questions on here similar to this, and I've been reading them, but I still feel this warrants a new question. I want nightly, incremental backups (full disk images would waste a lot of space - unless compressed somehow.) Preferably rotating or deleting old backups when running out of space or after a fixed number of backups. I want to be able to quickly and painlessly restore my system from these backups. This is my first time running ubuntu as my main development machine and I know from my experience with it as a server and in virtual machines that I regularly manage to make it unbootable or damage it to the point of being unable to rescue it. So how would you recommend I do this? There are so many options out there I really don't know where to start. There seems to be a vocal school of thought that it's sufficient to backup your home directory and the list of installed packages from the package manager. I've already installed lots of things from source, or outside of the package manager (development tools, ides, compilers, graphics drivers, etc.) So at the very least, if I do not back up the operating system itself I need to grab all config files, all program binaries, all created but required files, etc. I'd rather backup too much than too little - an ubuntu install is tiny anyway. Also this drastically reduces the restore time, which would cost me more in my time than the extra storage space. I tried using Deja Dup to backup the root partition, excluding some things like /mnt /media /dev /proc etc. Although many websites assured me you can backup a running linux system this way - that seems to be false as it complained that it could not backup the following files: /boot/System.map-3.0.0-17-generic /boot/System.map-3.2.0-22-generic /boot/vmcoreinfo-3.0.0-17-generic /boot/vmlinuz-3.0.0-17-generic /boot/vmlinuz-3.2.0-22-generic /etc/.pwd.lock /etc/NetworkManager/system-connections/LAN Connection /etc/apparmor.d/cache/lightdm-guest-session /etc/apparmor.d/cache/sbin.dhclient /etc/apparmor.d/cache/usr.bin.evince /etc/apparmor.d/cache/usr.lib.telepathy /etc/apparmor.d/cache/usr.sbin.cupsd /etc/apparmor.d/cache/usr.sbin.tcpdump /etc/apt/trustdb.gpg /etc/at.deny /etc/ati/inst_path_default /etc/ati/inst_path_override /etc/chatscripts /etc/cups/ssl /etc/cups/subscriptions.conf /etc/cups/subscriptions.conf.O /etc/default/cacerts /etc/fuse.conf /etc/group- /etc/gshadow /etc/gshadow- /etc/mtab.fuselock /etc/passwd- /etc/ppp/chap-secrets /etc/ppp/pap-secrets /etc/ppp/peers /etc/security/opasswd /etc/shadow /etc/shadow- /etc/ssl/private /etc/sudoers /etc/sudoers.d/README /etc/ufw/after.rules /etc/ufw/after6.rules /etc/ufw/before.rules /etc/ufw/before6.rules /lib/ufw/user.rules /lib/ufw/user6.rules /lost+found /root /run/crond.reboot /run/cups/certs /run/lightdm /run/lock/whoopsie/lock /run/udisks /var/backups/group.bak /var/backups/gshadow.bak /var/backups/passwd.bak /var/backups/shadow.bak /var/cache/apt/archives/lock /var/cache/cups/job.cache /var/cache/cups/job.cache.O /var/cache/cups/ppds.dat /var/cache/debconf/passwords.dat /var/cache/ldconfig /var/cache/lightdm/dmrc /var/crash/_usr_lib_x86_64-linux-gnu_colord_colord.102.crash /var/lib/apt/lists/lock /var/lib/dpkg/lock /var/lib/dpkg/triggers/Lock /var/lib/lightdm /var/lib/mlocate/mlocate.db /var/lib/polkit-1 /var/lib/sudo /var/lib/urandom/random-seed /var/lib/ureadahead/pack /var/lib/ureadahead/run.pack /var/log/btmp /var/log/installer/casper.log /var/log/installer/debug /var/log/installer/partman /var/log/installer/syslog /var/log/installer/version /var/log/lightdm/lightdm.log /var/log/lightdm/x-0-greeter.log /var/log/lightdm/x-0.log /var/log/speech-dispatcher /var/log/upstart/alsa-restore.log /var/log/upstart/alsa-restore.log.1.gz /var/log/upstart/console-setup.log /var/log/upstart/console-setup.log.1.gz /var/log/upstart/container-detect.log /var/log/upstart/container-detect.log.1.gz /var/log/upstart/hybrid-gfx.log /var/log/upstart/hybrid-gfx.log.1.gz /var/log/upstart/modemmanager.log /var/log/upstart/modemmanager.log.1.gz /var/log/upstart/module-init-tools.log /var/log/upstart/module-init-tools.log.1.gz /var/log/upstart/procps-static-network-up.log /var/log/upstart/procps-static-network-up.log.1.gz /var/log/upstart/procps-virtual-filesystems.log /var/log/upstart/procps-virtual-filesystems.log.1.gz /var/log/upstart/rsyslog.log /var/log/upstart/rsyslog.log.1.gz /var/log/upstart/ureadahead.log /var/log/upstart/ureadahead.log.1.gz /var/spool/anacron/cron.daily /var/spool/anacron/cron.monthly /var/spool/anacron/cron.weekly /var/spool/cron/atjobs /var/spool/cron/atspool /var/spool/cron/crontabs /var/spool/cups

Read the article

How John Got 15x Improvement Without Really Trying

- by rchrd

The following article was published on a Sun Microsystems website a number of years ago by John Feo. It is still useful and worth preserving. So I'm republishing it here. How I Got 15x Improvement Without Really Trying John Feo, Sun Microsystems Taking ten "personal" program codes used in scientific and engineering research, the author was able to get from 2 to 15 times performance improvement easily by applying some simple general optimization techniques. Introduction Scientific research based on computer simulation depends on the simulation for advancement. The research can advance only as fast as the computational codes can execute. The codes' efficiency determines both the rate and quality of results. In the same amount of time, a faster program can generate more results and can carry out a more detailed simulation of physical phenomena than a slower program. Highly optimized programs help science advance quickly and insure that monies supporting scientific research are used as effectively as possible. Scientific computer codes divide into three broad categories: ISV, community, and personal. ISV codes are large, mature production codes developed and sold commercially. The codes improve slowly over time both in methods and capabilities, and they are well tuned for most vendor platforms. Since the codes are mature and complex, there are few opportunities to improve their performance solely through code optimization. Improvements of 10% to 15% are typical. Examples of ISV codes are DYNA3D, Gaussian, and Nastran. Community codes are non-commercial production codes used by a particular research field. Generally, they are developed and distributed by a single academic or research institution with assistance from the community. Most users just run the codes, but some develop new methods and extensions that feed back into the general release. The codes are available on most vendor platforms. Since these codes are younger than ISV codes, there are more opportunities to optimize the source code. Improvements of 50% are not unusual. Examples of community codes are AMBER, CHARM, BLAST, and FASTA. Personal codes are those written by single users or small research groups for their own use. These codes are not distributed, but may be passed from professor-to-student or student-to-student over several years. They form the primordial ocean of applications from which community and ISV codes emerge. Government research grants pay for the development of most personal codes. This paper reports on the nature and performance of this class of codes. Over the last year, I have looked at over two dozen personal codes from more than a dozen research institutions. The codes cover a variety of scientific fields, including astronomy, atmospheric sciences, bioinformatics, biology, chemistry, geology, and physics. The sources range from a few hundred lines to more than ten thousand lines, and are written in Fortran, Fortran 90, C, and C++. For the most part, the codes are modular, documented, and written in a clear, straightforward manner. They do not use complex language features, advanced data structures, programming tricks, or libraries. I had little trouble understanding what the codes did or how data structures were used. Most came with a makefile. Surprisingly, only one of the applications is parallel. All developers have access to parallel machines, so availability is not an issue. Several tried to parallelize their applications, but stopped after encountering difficulties. Lack of education and a perception that parallelism is difficult prevented most from trying. I parallelized several of the codes using OpenMP, and did not judge any of the codes as difficult to parallelize. Even more surprising than the lack of parallelism is the inefficiency of the codes. I was able to get large improvements in performance in a matter of a few days applying simple optimization techniques. Table 1 lists ten representative codes [names and affiliation are omitted to preserve anonymity]. Improvements on one processor range from 2x to 15.5x with a simple average of 4.75x. I did not use sophisticated performance tools or drill deep into the program's execution character as one would do when tuning ISV or community codes. Using only a profiler and source line timers, I identified inefficient sections of code and improved their performance by inspection. The changes were at a high level. I am sure there is another factor of 2 or 3 in each code, and more if the codes are parallelized. The study’s results show that personal scientific codes are running many times slower than they should and that the problem is pervasive. Computational scientists are not sloppy programmers; however, few are trained in the art of computer programming or code optimization. I found that most have a working knowledge of some programming language and standard software engineering practices; but they do not know, or think about, how to make their programs run faster. They simply do not know the standard techniques used to make codes run faster. In fact, they do not even perceive that such techniques exist. The case studies described in this paper show that applying simple, well known techniques can significantly increase the performance of personal codes. It is important that the scientific community and the Government agencies that support scientific research find ways to better educate academic scientific programmers. The inefficiency of their codes is so bad that it is retarding both the quality and progress of scientific research. # cacheperformance redundantoperations loopstructures performanceimprovement 1 x x 15.5 2 x 2.8 3 x x 2.5 4 x 2.1 5 x x 2.0 6 x 5.0 7 x 5.8 8 x 6.3 9 2.2 10 x x 3.3 Table 1 — Area of improvement and performance gains of 10 codes The remainder of the paper is organized as follows: sections 2, 3, and 4 discuss the three most common sources of inefficiencies in the codes studied. These are cache performance, redundant operations, and loop structures. Each section includes several examples. The last section summaries the work and suggests a possible solution to the issues raised. Optimizing cache performance Commodity microprocessor systems use caches to increase memory bandwidth and reduce memory latencies. Typical latencies from processor to L1, L2, local, and remote memory are 3, 10, 50, and 200 cycles, respectively. Moreover, bandwidth falls off dramatically as memory distances increase. Programs that do not use cache effectively run many times slower than programs that do. When optimizing for cache, the biggest performance gains are achieved by accessing data in cache order and reusing data to amortize the overhead of cache misses. Secondary considerations are prefetching, associativity, and replacement; however, the understanding and analysis required to optimize for the latter are probably beyond the capabilities of the non-expert. Much can be gained simply by accessing data in the correct order and maximizing data reuse. 6 out of the 10 codes studied here benefited from such high level optimizations. Array Accesses The most important cache optimization is the most basic: accessing Fortran array elements in column order and C array elements in row order. Four of the ten codes—1, 2, 4, and 10—got it wrong. Compilers will restructure nested loops to optimize cache performance, but may not do so if the loop structure is too complex, or the loop body includes conditionals, complex addressing, or function calls. In code 1, the compiler failed to invert a key loop because of complex addressing do I = 0, 1010, delta_x IM = I - delta_x IP = I + delta_x do J = 5, 995, delta_x JM = J - delta_x JP = J + delta_x T1 = CA1(IP, J) + CA1(I, JP) T2 = CA1(IM, J) + CA1(I, JM) S1 = T1 + T2 - 4 * CA1(I, J) CA(I, J) = CA1(I, J) + D * S1 end do end do In code 2, the culprit is conditionals do I = 1, N do J = 1, N If (IFLAG(I,J) .EQ. 0) then T1 = Value(I, J-1) T2 = Value(I-1, J) T3 = Value(I, J) T4 = Value(I+1, J) T5 = Value(I, J+1) Value(I,J) = 0.25 * (T1 + T2 + T5 + T4) Delta = ABS(T3 - Value(I,J)) If (Delta .GT. MaxDelta) MaxDelta = Delta endif enddo enddo I fixed both programs by inverting the loops by hand. Code 10 has three-dimensional arrays and triply nested loops. The structure of the most computationally intensive loops is too complex to invert automatically or by hand. The only practical solution is to transpose the arrays so that the dimension accessed by the innermost loop is in cache order. The arrays can be transposed at construction or prior to entering a computationally intensive section of code. The former requires all array references to be modified, while the latter is cost effective only if the cost of the transpose is amortized over many accesses. I used the second approach to optimize code 10. Code 5 has four-dimensional arrays and loops are nested four deep. For all of the reasons cited above the compiler is not able to restructure three key loops. Assume C arrays and let the four dimensions of the arrays be i, j, k, and l. In the original code, the index structure of the three loops is L1: for i L2: for i L3: for i for l for l for j for k for j for k for j for k for l So only L3 accesses array elements in cache order. L1 is a very complex loop—much too complex to invert. I brought the loop into cache alignment by transposing the second and fourth dimensions of the arrays. Since the code uses a macro to compute all array indexes, I effected the transpose at construction and changed the macro appropriately. The dimensions of the new arrays are now: i, l, k, and j. L3 is a simple loop and easily inverted. L2 has a loop-carried scalar dependence in k. By promoting the scalar name that carries the dependence to an array, I was able to invert the third and fourth subloops aligning the loop with cache. Code 5 is by far the most difficult of the four codes to optimize for array accesses; but the knowledge required to fix the problems is no more than that required for the other codes. I would judge this code at the limits of, but not beyond, the capabilities of appropriately trained computational scientists. Array Strides When a cache miss occurs, a line (64 bytes) rather than just one word is loaded into the cache. If data is accessed stride 1, than the cost of the miss is amortized over 8 words. Any stride other than one reduces the cost savings. Two of the ten codes studied suffered from non-unit strides. The codes represent two important classes of "strided" codes. Code 1 employs a multi-grid algorithm to reduce time to convergence. The grids are every tenth, fifth, second, and unit element. Since time to convergence is inversely proportional to the distance between elements, coarse grids converge quickly providing good starting values for finer grids. The better starting values further reduce the time to convergence. The downside is that grids of every nth element, n > 1, introduce non-unit strides into the computation. In the original code, much of the savings of the multi-grid algorithm were lost due to this problem. I eliminated the problem by compressing (copying) coarse grids into continuous memory, and rewriting the computation as a function of the compressed grid. On convergence, I copied the final values of the compressed grid back to the original grid. The savings gained from unit stride access of the compressed grid more than paid for the cost of copying. Using compressed grids, the loop from code 1 included in the previous section becomes do j = 1, GZ do i = 1, GZ T1 = CA(i+0, j-1) + CA(i-1, j+0) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) S1 = T1 + T4 - 4 * CA1(i+0, j+0) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 enddo enddo where CA and CA1 are compressed arrays of size GZ. Code 7 traverses a list of objects selecting objects for later processing. The labels of the selected objects are stored in an array. The selection step has unit stride, but the processing steps have irregular stride. A fix is to save the parameters of the selected objects in temporary arrays as they are selected, and pass the temporary arrays to the processing functions. The fix is practical if the same parameters are used in selection as in processing, or if processing comprises a series of distinct steps which use overlapping subsets of the parameters. Both conditions are true for code 7, so I achieved significant improvement by copying parameters to temporary arrays during selection. Data reuse In the previous sections, we optimized for spatial locality. It is also important to optimize for temporal locality. Once read, a datum should be used as much as possible before it is forced from cache. Loop fusion and loop unrolling are two techniques that increase temporal locality. Unfortunately, both techniques increase register pressure—as loop bodies become larger, the number of registers required to hold temporary values grows. Once register spilling occurs, any gains evaporate quickly. For multiprocessors with small register sets or small caches, the sweet spot can be very small. In the ten codes presented here, I found no opportunities for loop fusion and only two opportunities for loop unrolling (codes 1 and 3). In code 1, unrolling the outer and inner loop one iteration increases the number of result values computed by the loop body from 1 to 4, do J = 1, GZ-2, 2 do I = 1, GZ-2, 2 T1 = CA1(i+0, j-1) + CA1(i-1, j+0) T2 = CA1(i+1, j-1) + CA1(i+0, j+0) T3 = CA1(i+0, j+0) + CA1(i-1, j+1) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) T5 = CA1(i+2, j+0) + CA1(i+1, j+1) T6 = CA1(i+1, j+1) + CA1(i+0, j+2) T7 = CA1(i+2, j+1) + CA1(i+1, j+2) S1 = T1 + T4 - 4 * CA1(i+0, j+0) S2 = T2 + T5 - 4 * CA1(i+1, j+0) S3 = T3 + T6 - 4 * CA1(i+0, j+1) S4 = T4 + T7 - 4 * CA1(i+1, j+1) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 CA(i+1, j+0) = CA1(i+1, j+0) + DD * S2 CA(i+0, j+1) = CA1(i+0, j+1) + DD * S3 CA(i+1, j+1) = CA1(i+1, j+1) + DD * S4 enddo enddo The loop body executes 12 reads, whereas as the rolled loop shown in the previous section executes 20 reads to compute the same four values. In code 3, two loops are unrolled 8 times and one loop is unrolled 4 times. Here is the before for (k = 0; k < NK[u]; k++) { sum = 0.0; for (y = 0; y < NY; y++) { sum += W[y][u][k] * delta[y]; } backprop[i++]=sum; } and after code for (k = 0; k < KK - 8; k+=8) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (y = 0; y < NY; y++) { sum0 += W[y][0][k+0] * delta[y]; sum1 += W[y][0][k+1] * delta[y]; sum2 += W[y][0][k+2] * delta[y]; sum3 += W[y][0][k+3] * delta[y]; sum4 += W[y][0][k+4] * delta[y]; sum5 += W[y][0][k+5] * delta[y]; sum6 += W[y][0][k+6] * delta[y]; sum7 += W[y][0][k+7] * delta[y]; } backprop[k+0] = sum0; backprop[k+1] = sum1; backprop[k+2] = sum2; backprop[k+3] = sum3; backprop[k+4] = sum4; backprop[k+5] = sum5; backprop[k+6] = sum6; backprop[k+7] = sum7; } for one of the loops unrolled 8 times. Optimizing for temporal locality is the most difficult optimization considered in this paper. The concepts are not difficult, but the sweet spot is small. Identifying where the program can benefit from loop unrolling or loop fusion is not trivial. Moreover, it takes some effort to get it right. Still, educating scientific programmers about temporal locality and teaching them how to optimize for it will pay dividends. Reducing instruction count Execution time is a function of instruction count. Reduce the count and you usually reduce the time. The best solution is to use a more efficient algorithm; that is, an algorithm whose order of complexity is smaller, that converges quicker, or is more accurate. Optimizing source code without changing the algorithm yields smaller, but still significant, gains. This paper considers only the latter because the intent is to study how much better codes can run if written by programmers schooled in basic code optimization techniques. The ten codes studied benefited from three types of "instruction reducing" optimizations. The two most prevalent were hoisting invariant memory and data operations out of inner loops. The third was eliminating unnecessary data copying. The nature of these inefficiencies is language dependent. Memory operations The semantics of C make it difficult for the compiler to determine all the invariant memory operations in a loop. The problem is particularly acute for loops in functions since the compiler may not know the values of the function's parameters at every call site when compiling the function. Most compilers support pragmas to help resolve ambiguities; however, these pragmas are not comprehensive and there is no standard syntax. To guarantee that invariant memory operations are not executed repetitively, the user has little choice but to hoist the operations by hand. The problem is not as severe in Fortran programs because in the absence of equivalence statements, it is a violation of the language's semantics for two names to share memory. Codes 3 and 5 are C programs. In both cases, the compiler did not hoist all invariant memory operations from inner loops. Consider the following loop from code 3 for (y = 0; y < NY; y++) { i = 0; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += delta[y] * I1[i++]; } } } Since dW[y][u] can point to the same memory space as delta for one or more values of y and u, assignment to dW[y][u][k] may change the value of delta[y]. In reality, dW and delta do not overlap in memory, so I rewrote the loop as for (y = 0; y < NY; y++) { i = 0; Dy = delta[y]; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += Dy * I1[i++]; } } } Failure to hoist invariant memory operations may be due to complex address calculations. If the compiler can not determine that the address calculation is invariant, then it can hoist neither the calculation nor the associated memory operations. As noted above, code 5 uses a macro to address four-dimensional arrays #define MAT4D(a,q,i,j,k) (double *)((a)->data + (q)*(a)->strides[0] + (i)*(a)->strides[3] + (j)*(a)->strides[2] + (k)*(a)->strides[1]) The macro is too complex for the compiler to understand and so, it does not identify any subexpressions as loop invariant. The simplest way to eliminate the address calculation from the innermost loop (over i) is to define a0 = MAT4D(a,q,0,j,k) before the loop and then replace all instances of *MAT4D(a,q,i,j,k) in the loop with a0[i] A similar problem appears in code 6, a Fortran program. The key loop in this program is do n1 = 1, nh nx1 = (n1 - 1) / nz + 1 nz1 = n1 - nz * (nx1 - 1) do n2 = 1, nh nx2 = (n2 - 1) / nz + 1 nz2 = n2 - nz * (nx2 - 1) ndx = nx2 - nx1 ndy = nz2 - nz1 gxx = grn(1,ndx,ndy) gyy = grn(2,ndx,ndy) gxy = grn(3,ndx,ndy) balance(n1,1) = balance(n1,1) + (force(n2,1) * gxx + force(n2,2) * gxy) * h1 balance(n1,2) = balance(n1,2) + (force(n2,1) * gxy + force(n2,2) * gyy)*h1 end do end do The programmer has written this loop well—there are no loop invariant operations with respect to n1 and n2. However, the loop resides within an iterative loop over time and the index calculations are independent with respect to time. Trading space for time, I precomputed the index values prior to the entering the time loop and stored the values in two arrays. I then replaced the index calculations with reads of the arrays. Data operations Ways to reduce data operations can appear in many forms. Implementing a more efficient algorithm produces the biggest gains. The closest I came to an algorithm change was in code 4. This code computes the inner product of K-vectors A(i) and B(j), 0 = i < N, 0 = j < M, for most values of i and j. Since the program computes most of the NM possible inner products, it is more efficient to compute all the inner products in one triply-nested loop rather than one at a time when needed. The savings accrue from reading A(i) once for all B(j) vectors and from loop unrolling. for (i = 0; i < N; i+=8) { for (j = 0; j < M; j++) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (k = 0; k < K; k++) { sum0 += A[i+0][k] * B[j][k]; sum1 += A[i+1][k] * B[j][k]; sum2 += A[i+2][k] * B[j][k]; sum3 += A[i+3][k] * B[j][k]; sum4 += A[i+4][k] * B[j][k]; sum5 += A[i+5][k] * B[j][k]; sum6 += A[i+6][k] * B[j][k]; sum7 += A[i+7][k] * B[j][k]; } C[i+0][j] = sum0; C[i+1][j] = sum1; C[i+2][j] = sum2; C[i+3][j] = sum3; C[i+4][j] = sum4; C[i+5][j] = sum5; C[i+6][j] = sum6; C[i+7][j] = sum7; }} This change requires knowledge of a typical run; i.e., that most inner products are computed. The reasons for the change, however, derive from basic optimization concepts. It is the type of change easily made at development time by a knowledgeable programmer. In code 5, we have the data version of the index optimization in code 6. Here a very expensive computation is a function of the loop indices and so cannot be hoisted out of the loop; however, the computation is invariant with respect to an outer iterative loop over time. We can compute its value for each iteration of the computation loop prior to entering the time loop and save the values in an array. The increase in memory required to store the values is small in comparison to the large savings in time. The main loop in Code 8 is doubly nested. The inner loop includes a series of guarded computations; some are a function of the inner loop index but not the outer loop index while others are a function of the outer loop index but not the inner loop index for (j = 0; j < N; j++) { for (i = 0; i < M; i++) { r = i * hrmax; R = A[j]; temp = (PRM[3] == 0.0) ? 1.0 : pow(r, PRM[3]); high = temp * kcoeff * B[j] * PRM[2] * PRM[4]; low = high * PRM[6] * PRM[6] / (1.0 + pow(PRM[4] * PRM[6], 2.0)); kap = (R > PRM[6]) ? high * R * R / (1.0 + pow(PRM[4]*r, 2.0) : low * pow(R/PRM[6], PRM[5]); < rest of loop omitted > }} Note that the value of temp is invariant to j. Thus, we can hoist the computation for temp out of the loop and save its values in an array. for (i = 0; i < M; i++) { r = i * hrmax; TEMP[i] = pow(r, PRM[3]); } [N.B. – the case for PRM[3] = 0 is omitted and will be reintroduced later.] We now hoist out of the inner loop the computations invariant to i. Since the conditional guarding the value of kap is invariant to i, it behooves us to hoist the computation out of the inner loop, thereby executing the guard once rather than M times. The final version of the code is for (j = 0; j < N; j++) { R = rig[j] / 1000.; tmp1 = kcoeff * par[2] * beta[j] * par[4]; tmp2 = 1.0 + (par[4] * par[4] * par[6] * par[6]); tmp3 = 1.0 + (par[4] * par[4] * R * R); tmp4 = par[6] * par[6] / tmp2; tmp5 = R * R / tmp3; tmp6 = pow(R / par[6], par[5]); if ((par[3] == 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp5; } else if ((par[3] == 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp4 * tmp6; } else if ((par[3] != 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp5; } else if ((par[3] != 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp4 * tmp6; } for (i = 0; i < M; i++) { kap = KAP[i]; r = i * hrmax; < rest of loop omitted > } } Maybe not the prettiest piece of code, but certainly much more efficient than the original loop, Copy operations Several programs unnecessarily copy data from one data structure to another. This problem occurs in both Fortran and C programs, although it manifests itself differently in the two languages. Code 1 declares two arrays—one for old values and one for new values. At the end of each iteration, the array of new values is copied to the array of old values to reset the data structures for the next iteration. This problem occurs in Fortran programs not included in this study and in both Fortran 77 and Fortran 90 code. Introducing pointers to the arrays and swapping pointer values is an obvious way to eliminate the copying; but pointers is not a feature that many Fortran programmers know well or are comfortable using. An easy solution not involving pointers is to extend the dimension of the value array by 1 and use the last dimension to differentiate between arrays at different times. For example, if the data space is N x N, declare the array (N, N, 2). Then store the problem’s initial values in (_, _, 2) and define the scalar names new = 2 and old = 1. At the start of each iteration, swap old and new to reset the arrays. The old–new copy problem did not appear in any C program. In programs that had new and old values, the code swapped pointers to reset data structures. Where unnecessary coping did occur is in structure assignment and parameter passing. Structures in C are handled much like scalars. Assignment causes the data space of the right-hand name to be copied to the data space of the left-hand name. Similarly, when a structure is passed to a function, the data space of the actual parameter is copied to the data space of the formal parameter. If the structure is large and the assignment or function call is in an inner loop, then copying costs can grow quite large. While none of the ten programs considered here manifested this problem, it did occur in programs not included in the study. A simple fix is always to refer to structures via pointers. Optimizing loop structures Since scientific programs spend almost all their time in loops, efficient loops are the key to good performance. Conditionals, function calls, little instruction level parallelism, and large numbers of temporary values make it difficult for the compiler to generate tightly packed, highly efficient code. Conditionals and function calls introduce jumps that disrupt code flow. Users should eliminate or isolate conditionls to their own loops as much as possible. Often logical expressions can be substituted for if-then-else statements. For example, code 2 includes the following snippet MaxDelta = 0.0 do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) if (Delta > MaxDelta) MaxDelta = Delta enddo enddo if (MaxDelta .gt. 0.001) goto 200 Since the only use of MaxDelta is to control the jump to 200 and all that matters is whether or not it is greater than 0.001, I made MaxDelta a boolean and rewrote the snippet as MaxDelta = .false. do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) MaxDelta = MaxDelta .or. (Delta .gt. 0.001) enddo enddo if (MaxDelta) goto 200 thereby, eliminating the conditional expression from the inner loop. A microprocessor can execute many instructions per instruction cycle. Typically, it can execute one or more memory, floating point, integer, and jump operations. To be executed simultaneously, the operations must be independent. Thick loops tend to have more instruction level parallelism than thin loops. Moreover, they reduce memory traffice by maximizing data reuse. Loop unrolling and loop fusion are two techniques to increase the size of loop bodies. Several of the codes studied benefitted from loop unrolling, but none benefitted from loop fusion. This observation is not too surpising since it is the general tendency of programmers to write thick loops. As loops become thicker, the number of temporary values grows, increasing register pressure. If registers spill, then memory traffic increases and code flow is disrupted. A thick loop with many temporary values may execute slower than an equivalent series of thin loops. The biggest gain will be achieved if the thick loop can be split into a series of independent loops eliminating the need to write and read temporary arrays. I found such an occasion in code 10 where I split the loop do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do into two disjoint loops do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) end do end do do i = 1, n do j = 1, m C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do Conclusions Over the course of the last year, I have had the opportunity to work with over two dozen academic scientific programmers at leading research universities. Their research interests span a broad range of scientific fields. Except for two programs that relied almost exclusively on library routines (matrix multiply and fast Fourier transform), I was able to improve significantly the single processor performance of all codes. Improvements range from 2x to 15.5x with a simple average of 4.75x. Changes to the source code were at a very high level. I did not use sophisticated techniques or programming tools to discover inefficiencies or effect the changes. Only one code was parallel despite the availability of parallel systems to all developers. Clearly, we have a problem—personal scientific research codes are highly inefficient and not running parallel. The developers are unaware of simple optimization techniques to make programs run faster. They lack education in the art of code optimization and parallel programming. I do not believe we can fix the problem by publishing additional books or training manuals. To date, the developers in questions have not studied the books or manual available, and are unlikely to do so in the future. Short courses are a possible solution, but I believe they are too concentrated to be much use. The general concepts can be taught in a three or four day course, but that is not enough time for students to practice what they learn and acquire the experience to apply and extend the concepts to their codes. Practice is the key to becoming proficient at optimization. I recommend that graduate students be required to take a semester length course in optimization and parallel programming. We would never give someone access to state-of-the-art scientific equipment costing hundreds of thousands of dollars without first requiring them to demonstrate that they know how to use the equipment. Yet the criterion for time on state-of-the-art supercomputers is at most an interesting project. Requestors are never asked to demonstrate that they know how to use the system, or can use the system effectively. A semester course would teach them the required skills. Government agencies that fund academic scientific research pay for most of the computer systems supporting scientific research as well as the development of most personal scientific codes. These agencies should require graduate schools to offer a course in optimization and parallel programming as a requirement for funding. About the Author John Feo received his Ph.D. in Computer Science from The University of Texas at Austin in 1986. After graduate school, Dr. Feo worked at Lawrence Livermore National Laboratory where he was the Group Leader of the Computer Research Group and principal investigator of the Sisal Language Project. In 1997, Dr. Feo joined Tera Computer Company where he was project manager for the MTA, and oversaw the programming and evaluation of the MTA at the San Diego Supercomputer Center. In 2000, Dr. Feo joined Sun Microsystems as an HPC application specialist. He works with university research groups to optimize and parallelize scientific codes. Dr. Feo has published over two dozen research articles in the areas of parallel parallel programming, parallel programming languages, and application performance.

Read the article

Windows Azure Use Case: New Development

- by BuckWoody

This is one in a series of posts on when and where to use a distributed architecture design in your organization's computing needs. You can find the main post here: http://blogs.msdn.com/b/buckwoody/archive/2011/01/18/windows-azure-and-sql-azure-use-cases.aspx Description: Computing platforms evolve over time. Originally computers were directed by hardware wiring - that, the “code” was the path of the wiring that directed an electrical signal from one component to another, or in some cases a physical switch controlled the path. From there software was developed, first in a very low machine language, then when compilers were created, computer languages could more closely mimic written statements. These language statements can be compiled into the lower-level machine language still used by computers today. Microprocessors replaced logic circuits, sometimes with fewer instructions (Reduced Instruction Set Computing, RISC) and sometimes with more instructions (Complex Instruction Set Computing, CISC). The reason this history is important is that along each technology advancement, computer code has adapted. Writing software for a RISC architecture is significantly different than developing for a CISC architecture. And moving to a Distributed Architecture like Windows Azure also has specific implementation details that our code must follow. But why make a change? As I’ve described, we need to make the change to our code to follow advances in technology. There’s no point in change for its own sake, but as a new paradigm offers benefits to our users, it’s important for us to leverage those benefits where it makes sense. That’s most often done in new development projects. It’s a far simpler task to take a new project and adapt it to Windows Azure than to try and retrofit older code designed in a previous computing environment. We can still use the same coding languages (.NET, Java, C++) to write code for Windows Azure, but we need to think about the architecture of that code on a new project so that it runs in the most efficient, cost-effective way in a Distributed Architecture. As we receive new requests from the organization for new projects, a distributed architecture paradigm belongs in the decision matrix for the platform target. Implementation: When you are designing new applications for Windows Azure (or any distributed architecture) there are many important details to consider. But at the risk of over-simplification, there are three main concepts to learn and architect within the new code: Stateless Programming - Stateless program is a prime concept within distributed architectures. Rather than each server owning the complete processing cycle, the information from an operation that needs to be retained (the “state”) should be persisted to another location c(like storage) common to all machines involved in the process. An interesting learning process for Stateless Programming (although not unique to this language type) is to learn Functional Programming. Server-Side Processing - Along with developing using a Stateless Design, the closer you can locate the code processing to the data, the less expensive and faster the code will run. When you control the network layer, this is less important, since you can send vast amounts of data between the server and client, allowing the client to perform processing. In a distributed architecture, you don’t always own the network, so it’s performance is unpredictable. Also, you may not be able to control the platform the user is on (such as a smartphone, PC or tablet), so it’s imperative to deliver only results and graphical elements where possible. Token-Based Authentication - Also called “Claims-Based Authorization”, this code practice means instead of allowing a user to log on once and then running code in that context, a more granular level of security is used. A “token” or “claim”, often represented as a Certificate, is sent along for a series or even one request. In other words, every call to the code is authenticated against the token, rather than allowing a user free reign within the code call. While this is more work initially, it can bring a greater level of security, and it is far more resilient to disconnections. Resources: See the references of “Nondistributed Deployment” and “Distributed Deployment” at the top of this article for more information with graphics: http://msdn.microsoft.com/en-us/library/ee658120.aspx Stack Overflow has a good thread on functional programming: http://stackoverflow.com/questions/844536/advantages-of-stateless-programming Another good discussion on Stack Overflow on server-side processing is here: http://stackoverflow.com/questions/3064018/client-side-or-server-side-processing Claims Based Authorization is described here: http://msdn.microsoft.com/en-us/magazine/ee335707.aspx

Read the article

Windows Azure Use Case: New Development

- by BuckWoody

This is one in a series of posts on when and where to use a distributed architecture design in your organization's computing needs. You can find the main post here: http://blogs.msdn.com/b/buckwoody/archive/2011/01/18/windows-azure-and-sql-azure-use-cases.aspx Description: Computing platforms evolve over time. Originally computers were directed by hardware wiring - that, the “code” was the path of the wiring that directed an electrical signal from one component to another, or in some cases a physical switch controlled the path. From there software was developed, first in a very low machine language, then when compilers were created, computer languages could more closely mimic written statements. These language statements can be compiled into the lower-level machine language still used by computers today. Microprocessors replaced logic circuits, sometimes with fewer instructions (Reduced Instruction Set Computing, RISC) and sometimes with more instructions (Complex Instruction Set Computing, CISC). The reason this history is important is that along each technology advancement, computer code has adapted. Writing software for a RISC architecture is significantly different than developing for a CISC architecture. And moving to a Distributed Architecture like Windows Azure also has specific implementation details that our code must follow. But why make a change? As I’ve described, we need to make the change to our code to follow advances in technology. There’s no point in change for its own sake, but as a new paradigm offers benefits to our users, it’s important for us to leverage those benefits where it makes sense. That’s most often done in new development projects. It’s a far simpler task to take a new project and adapt it to Windows Azure than to try and retrofit older code designed in a previous computing environment. We can still use the same coding languages (.NET, Java, C++) to write code for Windows Azure, but we need to think about the architecture of that code on a new project so that it runs in the most efficient, cost-effective way in a Distributed Architecture. As we receive new requests from the organization for new projects, a distributed architecture paradigm belongs in the decision matrix for the platform target. Implementation: When you are designing new applications for Windows Azure (or any distributed architecture) there are many important details to consider. But at the risk of over-simplification, there are three main concepts to learn and architect within the new code: Stateless Programming - Stateless program is a prime concept within distributed architectures. Rather than each server owning the complete processing cycle, the information from an operation that needs to be retained (the “state”) should be persisted to another location c(like storage) common to all machines involved in the process. An interesting learning process for Stateless Programming (although not unique to this language type) is to learn Functional Programming. Server-Side Processing - Along with developing using a Stateless Design, the closer you can locate the code processing to the data, the less expensive and faster the code will run. When you control the network layer, this is less important, since you can send vast amounts of data between the server and client, allowing the client to perform processing. In a distributed architecture, you don’t always own the network, so it’s performance is unpredictable. Also, you may not be able to control the platform the user is on (such as a smartphone, PC or tablet), so it’s imperative to deliver only results and graphical elements where possible. Token-Based Authentication - Also called “Claims-Based Authorization”, this code practice means instead of allowing a user to log on once and then running code in that context, a more granular level of security is used. A “token” or “claim”, often represented as a Certificate, is sent along for a series or even one request. In other words, every call to the code is authenticated against the token, rather than allowing a user free reign within the code call. While this is more work initially, it can bring a greater level of security, and it is far more resilient to disconnections. Resources: See the references of “Nondistributed Deployment” and “Distributed Deployment” at the top of this article for more information with graphics: http://msdn.microsoft.com/en-us/library/ee658120.aspx Stack Overflow has a good thread on functional programming: http://stackoverflow.com/questions/844536/advantages-of-stateless-programming Another good discussion on Stack Overflow on server-side processing is here: http://stackoverflow.com/questions/3064018/client-side-or-server-side-processing Claims Based Authorization is described here: http://msdn.microsoft.com/en-us/magazine/ee335707.aspx

Read the article

Subterranean IL: Custom modifiers

- by Simon Cooper

In IL, volatile is an instruction prefix used to set a memory barrier at that instruction. However, in C#, volatile is applied to a field to indicate that all accesses on that field should be prefixed with volatile. As I mentioned in my previous post, this means that the field definition needs to store this information somehow, as such a field could be accessed from another assembly. However, IL does not have a concept of a 'volatile field'. How is this information stored? Attributes The standard way of solving this is to apply a VolatileAttribute or similar to the field; this extra metadata notifies the C# compiler that all loads and stores to that field should use the volatile prefix. However, there is a problem with this approach, namely, the .NET C++ compiler. C++ allows methods to be overloaded using properties, like volatile or const, on the parameters; this is perfectly legal C++: public ref class VolatileMethods { void Method(int *i) {} void Method(volatile int *i) {} } If volatile was specified using a custom attribute, then the VolatileMethods class wouldn't be compilable to IL, as there is nothing to differentiate the two methods from each other. This is where custom modifiers come in. Custom modifiers Custom modifiers are similar to custom attributes, but instead of being applied to an IL element separately to its declaration, they are embedded within the field or parameter's type signature itself. The VolatileMethods class would be compiled to the following IL: .class public VolatileMethods { .method public instance void Method(int32* i) {} .method public instance void Method( int32 modreq( [mscorlib]System.Runtime.CompilerServices.IsVolatile)* i) {} } The modreq([mscorlib]System.Runtime.CompilerServices.IsVolatile) is the custom modifier. This adds a TypeDef or TypeRef token to the signature of the field or parameter, and even though they are mostly ignored by the CLR when it's executing the program, this allows methods and fields to be overloaded in ways that wouldn't be allowed using attributes. Because the modifiers are part of the signature, they need to be fully specified when calling such a method in IL: call instance void Method( int32 modreq([mscorlib]System.Runtime.CompilerServices.IsVolatile)*) There are two ways of applying modifiers; modreq specifies required modifiers (like IsVolatile), and modopt specifies optional modifiers that can be ignored by compilers (like IsLong or IsConst). The type specified as the modifier argument are simple placeholders; if you have a look at the definitions of IsVolatile and IsLong they are completely empty. They exist solely to be referenced by a modifier. Custom modifiers are used extensively by the C++ compiler to specify concepts that aren't expressible in IL, but still need to be taken into account when calling method overloads. C++ and C# That's all very well and good, but how does this affect C#? Well, the C++ compiler uses modreq(IsVolatile) to specify volatility on both method parameters and fields, as it would be slightly odd to have the same concept represented using a modifier or attribute depending on what it was applied to. Once you've compiled your C++ project, it can then be referenced and used from C#, so the C# compiler has to recognise the modreq(IsVolatile) custom modifier applied to fields, and vice versa. So, even though you can't overload fields or parameters with volatile using C#, volatile needs to be expressed using a custom modifier rather than an attribute to guarentee correct interoperability and behaviour with any C++ dlls that happen to come along. Next up: a closer look at attributes, and how certain attributes compile in unexpected ways.

Read the article

Why is there no service-oriented language?

- by Wolfgang

Edit: To avoid further confusion: I am not talking about web services and such. I am talking about structuring applications internally, it's not about how computers communicate. It's about programming languages, compilers and how the imperative programming paradigm is extended. Original: In the imperative programming field, we saw two paradigms in the past 20 years (or more): object-oriented (OO), and service-oriented (SO) aka. component-based (CB). Both paradigms extend the imperative programming paradigm by introducing their own notion of modules. OO calls them objects (and classes) and lets them encapsulates both data (fields) and procedures (methods) together. SO, in contrast, separates data (records, beans, ...) from code (components, services). However, only OO has programming languages which natively support its paradigm: Smalltalk, C++, Java and all other JVM-compatibles, C# and all other .NET-compatibles, Python etc. SO has no such native language. It only comes into existence on top of procedural languages or OO languages: COM/DCOM (binary, C, C++), CORBA, EJB, Spring, Guice (all Java), ... These SO frameworks clearly suffer from the missing native language support of their concepts. They start using OO classes to represent services and records. This leads to designs where there is a clear distinction between classes that have methods only (services) and those that have fields only (records). Inheritance between services or records is then simulated by inheritance of classes. Technically, its not kept so strictly but in general programmers are adviced to make classes to play only one of the two roles. They use additional, external languages to represent the missing parts: IDL's, XML configurations, Annotations in Java code, or even embedded DSL like in Guice. This is especially needed, but not limited to, since the composition of services is not part of the service code itself. In OO, objects create other objects so there is no need for such facilities but for SO there is because services don't instantiate or configure other services. They establish an inner-platform effect on top of OO (early EJB, CORBA) where the programmer has to write all the code that is needed to "drive" SO. Classes represent only a part of the nature of a service and lots of classes have to be written to form a service together. All that boiler plate is necessary because there is no SO compiler which would do it for the programmer. This is just like some people did it in C for OO when there was no C++. You just pass the record which holds the data of the object as a first parameter to the procedure which is the method. In a OO language this parameter is implicit and the compiler produces all the code that we need for virtual functions etc. For SO, this is clearly missing. Especially the newer frameworks extensively use AOP or introspection to add the missing parts to a OO language. This doesn't bring the necessary language expressiveness but avoids the boiler platform code described in the previous point. Some frameworks use code generation to produce the boiler plate code. Configuration files in XML or annotations in OO code is the source of information for this. Not all of the phenomena that I mentioned above can be attributed to SO but I hope it clearly shows that there is a need for a SO language. Since this paradigm is so popular: why isn't there one? Or maybe there are some academic ones but at least the industry doesn't use one.

Read the article

Changes in Language Punctuation [closed]

- by Wes Miller

More social curiosity than actual programming question... (I got shot for posting this on Stack Overflow. They sent me here. At least i hope here is where they meant.) Based on the few responses I got before the content police ran me off Stack Overflow, I should note that I am legally blind and neatness and consistency in programming are my best friends. A thousand years ago when I took my first programming class (Fortran 66) and a mere 500 years ago when I tokk my first C and C++ classes, there were some pretty standard punctuation practices across languages. I saw them in Basic (shudder), PL/1, PL/AS, Rexx even Pascal. Ok, APL2 is not part of this discussion. Each language has its own peculiar punctuation. Pascal's periods, Fortran's comma separated do loops, almost everybody else's semicolons. As I learned it, each language also has KEYWORDS (if, for, do, while, until, etc.) which are set off by whitespace (or the left margin) if, etc. Each language has function, subroutines of whatever they're called. Some built-in some user coded. They were set off by function_name( parameters );. As in sqrt( x ) or rand( y ); Lately, there seems to be a new set of punctuation rules. Especially in c++ where initializers get glued onto the end of variable declarations int x(0); or auto_ptr p(new gizmo); This usually, briefly fools me into thinking someone is declaring a function prototype or using a function as a integer. Then "if" and 'for' seems to have grown parens; if(true) for(;;), etc. Since when did keywords become functions. I realize some people think they ARE functions with iterators as parameters. But if "for" is a function, where did the arg separating commas go? And finally, functions seem to have shed their parens; sqrt (2) select (...) I know, I koow, loosening whitespace rules is good. Keep reading. Question: when did the old ways disappear and this new way come into vogue? Does anyone besides me find it irritating to read and that the information that the placement of punctuation used to convey is gone? I know full well that K&R put the { at the end of the "if" or "for" to save a byte here and there. Can't use that excuse here. Space as an excuse for loss of readability died as HDD space soared past 100 MiB. Your thoughts are solicited. If there is a good reason to do this, I'll gladly learn it and maybe in another 50 years I'll get used to it. Of course it's good that compilers recognize these (IMHO) typos and keep right on going, but just because you CAN code it that way doesn't mean you HAVE to, right?

Read the article

What are the disadvantages of self-encapsulation?

- by Dave Jarvis

Background Tony Hoare's billion dollar mistake was the invention of null. Subsequently, a lot of code has become riddled with null pointer exceptions (segfaults) when software developers try to use (dereference) uninitialized variables. In 1989, Wirfs-Brock and Wikerson wrote: Direct references to variables severely limit the ability of programmers to re?ne existing classes. The programming conventions described here structure the use of variables to promote reusable designs. We encourage users of all object-oriented languages to follow these conventions. Additionally, we strongly urge designers of object-oriented languages to consider the effects of unrestricted variable references on reusability. Problem A lot of software, especially in Java, but likely in C# and C++, often uses the following pattern: public class SomeClass { private String someAttribute; public SomeClass() { this.someAttribute = "Some Value"; } public void someMethod() { if( this.someAttribute.equals( "Some Value" ) ) { // do something... } } public void setAttribute( String s ) { this.someAttribute = s; } public String getAttribute() { return this.someAttribute; } } Sometimes a band-aid solution is used by checking for null throughout the code base: public void someMethod() { assert this.someAttribute != null; if( this.someAttribute.equals( "Some Value" ) ) { // do something... } } public void anotherMethod() { assert this.someAttribute != null; if( this.someAttribute.equals( "Some Default Value" ) ) { // do something... } } The band-aid does not always avoid the null pointer problem: a race condition exists. The race condition is mitigated using: public void anotherMethod() { String someAttribute = this.someAttribute; assert someAttribute != null; if( someAttribute.equals( "Some Default Value" ) ) { // do something... } } Yet that requires two statements (assignment to local copy and check for null) every time a class-scoped variable is used to ensure it is valid. Self-Encapsulation Ken Auer's Reusability Through Self-Encapsulation (Pattern Languages of Program Design, Addison Wesley, New York, pp. 505-516, 1994) advocated self-encapsulation combined with lazy initialization. The result, in Java, would resemble: public class SomeClass { private String someAttribute; public SomeClass() { setAttribute( "Some Value" ); } public void someMethod() { if( getAttribute().equals( "Some Value" ) ) { // do something... } } public void setAttribute( String s ) { this.someAttribute = s; } public String getAttribute() { String someAttribute = this.someAttribute; if( someAttribute == null ) { setAttribute( createDefaultValue() ); } return someAttribute; } protected String createDefaultValue() { return "Some Default Value"; } } All duplicate checks for null are superfluous: getAttribute() ensures the value is never null at a single location within the containing class. Efficiency arguments should be fairly moot -- modern compilers and virtual machines can inline the code when possible. As long as variables are never referenced directly, this also allows for proper application of the Open-Closed Principle. Question What are the disadvantages of self-encapsulation, if any? (Ideally, I would like to see references to studies that contrast the robustness of similarly complex systems that use and don't use self-encapsulation, as this strikes me as a fairly straightforward testable hypothesis.)

Search Results

Search found 535 results on 22 pages for 'compilers'.

Page 19/22 | < Previous Page | 15 16 17 18 19 20 21 22 | Next Page >

- by Workshop Alex

- by FredOverflow

- by toto

- by Jason S

- by ArunSaha

- by AraK

- by user1742389

- by Endiannes

- by vprajan

- by Marc

- by Josh

- by luiscubal

- by Torus Linvald

- by John

- by Ashwin

- by Kristoffer

- by Michael Freidgeim

- by Eloff

- by rchrd

- by BuckWoody

- by BuckWoody

- by Simon Cooper

- by Wolfgang

- by Wes Miller

- by Dave Jarvis

< Previous Page | 15 16 17 18 19 20 21 22 | Next Page >