optimizations - Page 13

String Length Evaluating Incorrectly

- by Justin R.

My coworker and I are debugging an issue in a WCF service he's working on where a string's length isn't being evaluated correctly. He is running this method to unit test a method in his WCF service: // Unit test method public void RemoveAppGroupTest() { string addGroup = "TestGroup"; string status = string.Empty; string message = string.Empty; appActiveDirectoryServicesClient.RemoveAppGroup("AOD", addGroup, ref status, ref message); } // Inside the WCF service [OperationBehavior(Impersonation = ImpersonationOption.Required)] public void RemoveAppGroup(string AppName, string GroupName, ref string Status, ref string Message) { string accessOnDemandDomain = "MyDomain"; RemoveAppGroupFromDomain(AppName, accessOnDemandDomain, GroupName, ref Status, ref Message); } public AppActiveDirectoryDomain(string AppName, string DomainName) { if (string.IsNullOrEmpty(AppName)) { throw new ArgumentNullException("AppName", "You must specify an application name"); } } We tried to step into the .NET source code to see what value string.IsNullOrEmpty was receiving, but the IDE printed this message when we attempted to evaluate the variable: 'Cannot obtain value of local or argument 'value' as it is not available at this instruction pointer, possibly because it has been optimized away.' (None of the projects involved have optimizations enabled). So, we decided to try explicitly setting the value of the variable inside the method itself, immediately before the length check -- but that didn't help. // Lets try this again. public AppActiveDirectoryDomain(string AppName, string DomainName) { // Explicitly set the value for testing purposes. AppName = "AOD"; if (AppName == null) { throw new ArgumentNullException("AppName", "You must specify an application name"); } if (AppName.Length == 0) { // This exception gets thrown, even though it obviously isn't a zero length string. throw new ArgumentNullException("AppName", "You must specify an application name"); } } We're really pulling our hair out on this one. Has anyone else experienced behavior like this? Any tips on debugging it?

Read the article

ControlCollection extension method optimazation

- by Johan Leino

Hi, got question regarding an extension method that I have written that looks like this: public static IEnumerable<T> FindControlsOfType<T>(this ControlCollection instance) where T : class { T control; foreach (Control ctrl in instance) { if ((control = ctrl as T) != null) { yield return control; } foreach (T child in FindControlsOfType<T>(ctrl.Controls)) { yield return child; } } } public static IEnumerable<T> FindControlsOfType<T>(this ControlCollection instance, Func<T, bool> match) where T : class { return FindControlsOfType<T>(instance).Where(match); } The idea here is to find all controls that match a specifc criteria (hence the Func<..) in the controls collection. My question is: Does the second method (that has the Func) first call the first method to find all the controls of type T and then performs the where condition or does the "runtime" optimize the call to perform the where condition on the "whole" enumeration (if you get what I mean). secondly, are there any other optimizations that I can do to the code to perform better. An example can look like this: var checkbox = this.Controls.FindControlsOfType<MyCustomCheckBox>( ctrl => ctrl.CustomProperty == "Test" ) .FirstOrDefault();

Read the article

How to optimize Conway's game of life for CUDA?

- by nlight

I've written this CUDA kernel for Conway's game of life: global void gameOfLife(float* returnBuffer, int width, int height) { unsigned int x = blockIdx.x*blockDim.x + threadIdx.x; unsigned int y = blockIdx.y*blockDim.y + threadIdx.y; float p = tex2D(inputTex, x, y); float neighbors = 0; neighbors += tex2D(inputTex, x+1, y); neighbors += tex2D(inputTex, x-1, y); neighbors += tex2D(inputTex, x, y+1); neighbors += tex2D(inputTex, x, y-1); neighbors += tex2D(inputTex, x+1, y+1); neighbors += tex2D(inputTex, x-1, y-1); neighbors += tex2D(inputTex, x-1, y+1); neighbors += tex2D(inputTex, x+1, y-1); __syncthreads(); float final = 0; if(neighbors < 2) final = 0; else if(neighbors 3) final = 0; else if(p != 0) final = 1; else if(neighbors == 3) final = 1; __syncthreads(); returnBuffer[x + y*width] = final; } I am looking for errors/optimizations. Parallel programming is quite new to me and I am not sure if I get how to do it right. The rest of the app is: Memcpy input array to a 2d texture inputTex stored in a CUDA array. Output is memcpy-ed from global memory to host and then dealt with. As you can see a thread deals with a single pixel. I am unsure if that is the fastest way as some sources suggest doing a row or more per thread. If I understand correctly NVidia themselves say that the more threads, the better. I would love advice on this on someone with practical experience.

Read the article

Visual Studio crashes consistently on web-related projects

- by Traveling Tech Guy

Hi, I have a brand new VS2010 installed on a Win2008R2 machine. I started getting this error when debugging a WCF service project: "Attempted to read or write protected memory. This is often an indication that other memory is corrupt." When I started developing a web site a week later, this became consistent - I can't debug it. The stack dump reads: at Microsoft.VisualStudio.WebHost.Host.ProcessRequest(Connection conn) at Microsoft.VisualStudio.WebHost.Server.OnSocketAccept(Object acceptedSocket) at System.Threading.QueueUserWorkItemCallback.WaitCallback_Context(Object state) at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean ignoreSyncCtx) at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem() at System.Threading.ThreadPoolWorkQueue.Dispatch() at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback() I tried searching online, and some recommend turning off the "Suppress JIT Optimizations" in the Debugging options - this dos not seem to make a difference. Clearly the problem is with the built in web server. But am I doing something wrong? Is there something I can do? Or is this a known bug? Thanks for your time, Guy Update 12/31: Today I tried using CassiniDev as a replacement to the original VS2010 WebServer - exact same result. My suspicion is that there's some internal conflict between VS2010, Windows Server 2008R2 and maybe the fact that it's a 64 bit OS. I switched to using IIS as my debug server - and that seems to work, with some annoying side effects. My conclusion: do not use a 64 bit server system as your dev machine. Develop on 32bit - deploy to 64bit. Side conclusion: there are some scenarios Microsoft's QA doesn't test.

Read the article

Why are compilers so stupid?

- by martinus

I always wonder why compilers can't figure out simple things that are obvious to the human eye. They do lots of simple optimizations, but never something even a little bit complex. For example, this code takes about 6 seconds on my computer to print the value zero (using java 1.6): int x = 0; for (int i = 0; i < 100 * 1000 * 1000 * 1000; ++i) { x += x + x + x + x + x; } System.out.println(x); It is totally obvious that x is never changed so no matter how often you add 0 to itself it stays zero. So the compiler could in theory replace this with System.out.println(0). Or even better, this takes 23 seconds: public int slow() { String s = "x"; for (int i = 0; i < 100000; ++i) { s += "x"; } return 10; } First the compiler could notice that I am actually creating a string s of 100000 "x" so it could automatically use s StringBuilder instead, or even better directly replace it with the resulting string as it is always the same. Second, It does not recognize that I do not actually use the string at all, so the whole loop could be discarded! Why, after so much manpower is going into fast compilers, are they still so relatively dumb? EDIT: Of course these are stupid examples that should never be used anywhere. But whenever I have to rewrite a beautiful and very readable code into something unreadable so that the compiler is happy and produces fast code, I wonder why compilers or some other automated tool can't do this work for me.

Read the article

Determine target architecture of binary file in Linux (library or executable)

- by Fernando Miguélez

We have an issue related to a Java application running under a (rather old) FC3 on a Advantech POS board with a Via C3 processor. The java application has several compiled shared libs that are accessed via JNI. Via C3 processor is suppossed to be i686 compatible. Some time ago after installing Ubuntu 6.10 on a MiniItx board with the same processor I found out that the previous statement is not 100% true. The Ubuntu kernel hanged on startup due to the lack of some specific and optional instructions of the i686 set in the C3 processor. These instructions missing in C3 implementation of i686 set are used by default by GCC compiler when using i686 optimizations. The solution in this case was to go with a i386 compiled version of Ubuntu distribution. The base problem with the Java application is that the FC3 distribution was installed on the HD by cloning from an image of the HD of another PC, this time an Intel P4. Afterwards the distribution needed some hacking to have it running such as replacing some packages (such as the kernel one) with the i383 compiled version. The problem is that after working for a while the system completely hangs without a trace. I am afraid that some i686 code is left somewhere in the system and could be executed randomly at any time (for example after recovering from suspend mode or something like that). My question is: Is there any tool or way to find out at what specific architecture is an binary file (executable or library) aimed provided that "file" does not give so much information?

Read the article

How do I find all paths through a set of given nodes in a DAG?

- by Hanno Fietz

I have a list of items (blue nodes below) which are categorized by the users of my application. The categories themselves can be grouped and categorized themselves. The resulting structure can be represented as a Directed Acyclic Graph (DAG) where the items are sinks at the bottom of the graph's topology and the top categories are sources. Note that while some of the categories might be well defined, a lot is going to be user defined and might be very messy. Example: On that structure, I want to perform the following operations: find all items (sinks) below a particular node (all items in Europe) find all paths (if any) that pass through all of a set of n nodes (all items sent via SMTP from example.com) find all nodes that lie below all of a set of nodes (intersection: goyish brown foods) The first seems quite straightforward: start at the node, follow all possible paths to the bottom and collect the items there. However, is there a faster approach? Remembering the nodes I already passed through probably helps avoiding unnecessary repetition, but are there more optimizations? How do I go about the second one? It seems that the first step would be to determine the height of each node in the set, as to determine at which one(s) to start and then find all paths below that which include the rest of the set. But is this the best (or even a good) approach? The graph traversal algorithms listed at Wikipedia all seem to be concerned with either finding a particular node or the shortest or otherwise most effective route between two nodes. I think both is not what I want, or did I just fail to see how this applies to my problem? Where else should I read?

Read the article

Am I understanding premature optimization correctly?

- by Ed Mazur

I've been struggling with an application I'm writing and I think I'm beginning to see that my problem is premature optimization. The perfectionist side of me wants to make everything optimal and perfect the first time through, but I'm finding this is complicating the design quite a bit. Instead of writing small, testable functions that do one simple thing well, I'm leaning towards cramming in as much functionality as possible in order to be more efficient. For example, I'm avoiding multiple trips to the database for the same piece of information at the cost of my code becoming more complex. One part of me wants to just not worry about redundant database calls. It would make it easier to write correct code and the amount of data being fetched is small anyway. The other part of me feels very dirty and unclean doing this. :-) I'm leaning towards just going to the database multiple times, which I think is the right move here. It's more important that I finish the project and I feel like I'm getting hung up because of optimizations like this. My question is: is this the right strategy to be using when avoiding premature optimization?

Read the article

ControlCollection extension method optimization

- by Johan Leino

Hi, got question regarding an extension method that I have written that looks like this: public static IEnumerable<T> FindControlsOfType<T>(this ControlCollection instance) where T : class { T control; foreach (Control ctrl in instance) { if ((control = ctrl as T) != null) { yield return control; } foreach (T child in FindControlsOfType<T>(ctrl.Controls)) { yield return child; } } } public static IEnumerable<T> FindControlsOfType<T>(this ControlCollection instance, Func<T, bool> match) where T : class { return FindControlsOfType<T>(instance).Where(match); } The idea here is to find all controls that match a specifc criteria (hence the Func<..) in the controls collection. My question is: Does the second method (that has the Func) first call the first method to find all the controls of type T and then performs the where condition or does the "runtime" optimize the call to perform the where condition on the "whole" enumeration (if you get what I mean). secondly, are there any other optimizations that I can do to the code to perform better. An example can look like this: var checkbox = this.Controls.FindControlsOfType<MyCustomCheckBox>( ctrl => ctrl.CustomProperty == "Test" ) .FirstOrDefault();

Read the article

How would I instruct extconf.rb to use additional g++ optimization flags, and which are advisable?

- by mohawkjohn

I'm using Rice to write a C++ extension for a Ruby gem. The extension is in the form of a shared object (.so) file. This requires 'mkmf-rice' instead of 'mkmf', but the two (AFAIK) are pretty similar. By default, the compiler uses the flags -g -O2. Personally, I find this kind of silly, since it's hard to debug with any optimization enabled. I've resorted to editing the Makefile to take out the flags I don't like (e.g., removing -fPIC -shared when I need to debug using main() instead of Ruby's hooks). But I figure there's got to be a better way. I know I can just do $CPPFLAGS += " -DRICE" to add additional flags. But how do I remove things without editing the Makefile directly? A secondary question: what optimizations are safe for shared objects loaded by Ruby? Can I do things like -funroll-loops? What do you all recommend? It's a scientific computing project, so the faster the better. Memory is not much of an issue. Many thanks!

Read the article

What runs faster? Wordpress or Drupal 6.x?

- by electblake

So... I run a pretty large Wordpress blog. Currently it gets around 20k+ pageviews a day, and its always a struggle to keep the bad boy running quickly - I currently run a vps.net with CentOS 5.3 I am also Drupal developer by trade so I love the CMS Framework for its versatility and the portability (I can take work from one site and implement on another with great ease) MY QUESTION IS: What is faster then? Wordpress 3.x & Drupal 6.x I'd love to migrate my site to Drupal to be able to roll out new features etc (which I find awkward to do in Wordpress) but I am scared that Drupal may not be able to handle the traffic. Any opinions? I know that some major players use Drupal - as Dries documents well on his blog but I'm not under any illusions that Drupal can be a real hog. Thanks for any/all help! Please try to avoid server optimization talk unless it pertains to Wordpress or Drupal 6.x specifically, I love to learn more about optimizations but I do want to sort out which platform is quicker :) p.s - I realize the fastest option is to use a lower-level framework (with less overhead) like CakePHP etc but assume that isn't an option ;)

Read the article

What single software development tool do you think holds the most value?

- by Phobis

Every day I realize how much I love Visual Studio for .NET development.... but, I believe that Resharper, may hold a value that surpasses Visual Studio's (I am using VS 2005 for WPF/WCF development). I decided it would be great to compile a list of the most valuable tools for software development. These can be applications/plug-ins anything that you think holds GREAT value. Also, please explain the benefits of the tool that you are posting. Resharper: Intergrated Unit testing "Camel Hump" code auto completion Find "usings" (inverse of "Go to Deceleration") Code formating and member rearranging Assembly and namespace inclusion (based on your code) Check for common optimizations and possible bugs in code and suggests/rewrites the code for you (things like null checking, redundant delegate creation, inverting if statements, etc...); Tells you when code and be more generic (may suggest things like "use this interface instead" if your code never refers to something specific on an object) Helps you see code that is not being used and will clean any unused members. File structure view helps you jump around the regions of your file (this is really awesome and clean). Class searching (you can use things like camel humps) Asks you which partial file to open once you find a class. It also has it's own plugin support, so you can do things like FxCop, documentation and relfector (all free). This thing has so much I don't think I hit 10% of it yet :) [When I get time, I will try to add more... feel free to help me out]

Read the article

different thread accessing MemoryStream

- by Wayne

There's a bit of code which writes data to a MemoryStream object directly into it's data buffer by calling GetBuffer(). It also uses and updates the Position and SetLength() properties appropriately. This code works purposes 99.9999% of the time. Literally. Only every so many 100,000's of iterations it will barf. The specific problem is that the memory.Position property suddenly returns zero instead of the appropriate value. However, code was added that checks for the 0 and throws an exception which include log of the MemoryStream properties like Position and Length in a separate method. Those return the correct value. Further addition shows that when this rare condition occurs, the memory.Position only has zero inside this particular method. Okay. Obviously, this must be a threading issue. But this code is well locked. However, the nature of this software is that it's organized by "tasks" with a scheduler and so any one of several actual O/S thread may run this code at any give time--but never more than one at a time. So it's my guess that ordinarily it so happens that the same thread keeps getting used for this method and then on a rare occasion a different thread get used. Then due to compiler optimizations, the different thread never gets the correct value. It gets a "stale" value. Ordinarily in a situation like this I would apply a "volatile" keyword to the variable in question. But that (those) variables are inside the MemoryStream object. Does anyone have any other idea? Or does this mean we have to implement our own MemoryStream object? (Just like we end up having to do with practically every collection in .NET?) It's a shame to have such an awesome platform as .NET and have virtually the entire system useless as-is for seriously parallelized applications. If I'm wrong or you have other ideas, please advise. Sincerely, Wayne

Read the article

Eliminate full table scan due to BETWEEN (and GROUP BY)

- by Dave Jarvis

Description According to the explain command, there is a range that is causing a query to perform a full table scan (160k rows). How do I keep the range condition and reduce the scanning? I expect the culprit to be: Y.YEAR BETWEEN 1900 AND 2009 AND Code Here is the code that has the range condition (the STATION_DISTRICT is likely superfluous). SELECT COUNT(1) as MEASUREMENTS, AVG(D.AMOUNT) as AMOUNT, Y.YEAR as YEAR, MAKEDATE(Y.YEAR,1) as AMOUNT_DATE FROM CITY C, STATION S, STATION_DISTRICT SD, YEAR_REF Y FORCE INDEX(YEAR_IDX), MONTH_REF M, DAILY D WHERE -- For a specific city ... -- C.ID = 10663 AND -- Find all the stations within a specific unit radius ... -- 6371.009 * SQRT( POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) + (COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) * POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= 50 AND -- Get the station district identification for the matching station. -- S.STATION_DISTRICT_ID = SD.ID AND -- Gather all known years for that station ... -- Y.STATION_DISTRICT_ID = SD.ID AND -- The data before 1900 is shaky; insufficient after 2009. -- Y.YEAR BETWEEN 1900 AND 2009 AND -- Filtered by all known months ... -- M.YEAR_REF_ID = Y.ID AND -- Whittled down by category ... -- M.CATEGORY_ID = '003' AND -- Into the valid daily climate data. -- M.ID = D.MONTH_REF_ID AND D.DAILY_FLAG_ID <> 'M' GROUP BY Y.YEAR Update The SQL is performing a full table scan, which results in MySQL performing a "copy to tmp table", as shown here: +----+-------------+-------+--------+-----------------------------------+--------------+---------+-------------------------------+--------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+--------+-----------------------------------+--------------+---------+-------------------------------+--------+-------------+ | 1 | SIMPLE | C | const | PRIMARY | PRIMARY | 4 | const | 1 | | | 1 | SIMPLE | Y | range | YEAR_IDX | YEAR_IDX | 4 | NULL | 160422 | Using where | | 1 | SIMPLE | SD | eq_ref | PRIMARY | PRIMARY | 4 | climate.Y.STATION_DISTRICT_ID | 1 | Using index | | 1 | SIMPLE | S | eq_ref | PRIMARY | PRIMARY | 4 | climate.SD.ID | 1 | Using where | | 1 | SIMPLE | M | ref | PRIMARY,YEAR_REF_IDX,CATEGORY_IDX | YEAR_REF_IDX | 8 | climate.Y.ID | 54 | Using where | | 1 | SIMPLE | D | ref | INDEX | INDEX | 8 | climate.M.ID | 11 | Using where | +----+-------------+-------+--------+-----------------------------------+--------------+---------+-------------------------------+--------+-------------+ Related http://dev.mysql.com/doc/refman/5.0/en/how-to-avoid-table-scan.html http://dev.mysql.com/doc/refman/5.0/en/where-optimizations.html http://stackoverflow.com/questions/557425/optimize-sql-that-uses-between-clause Thank you!

Read the article

Best way to run multiple queries per second on database, performance wise?

- by Michael Joell

I am currently using Java to insert and update data multiple times per second. Never having used databases with Java, I am not sure what is required, and how to get the best performance. I currently have a method for each type of query I need to do (for example, update a row in a database). I also have a method to create the database connection. Below is my simplified code. public static void addOneForUserInChannel(String channel, String username) throws SQLException { Connection dbConnection = null; PreparedStatement ps = null; String updateSQL = "UPDATE " + channel + "_count SET messages = messages + 1 WHERE username = ?"; try { dbConnection = getDBConnection(); ps = dbConnection.prepareStatement(updateSQL); ps.setString(1, username); ps.executeUpdate(); } catch(SQLException e) { System.out.println(e.getMessage()); } finally { if(ps != null) { ps.close(); } if(dbConnection != null) { dbConnection.close(); } } } And my DB connection private static Connection getDBConnection() { Connection dbConnection = null; try { Class.forName(DB_DRIVER); } catch (ClassNotFoundException e) { System.out.println(e.getMessage()); } try { dbConnection = DriverManager.getConnection(DB_CONNECTION, DB_USER,DB_PASSWORD); return dbConnection; } catch (SQLException e) { System.out.println(e.getMessage()); } return dbConnection; } This seems to be working fine for now, with about 1-2 queries per second, but I am worried that once I expand and it is running many more, I might have some issues. My questions: Is there a way to have a persistent database connection throughout the entire run time of the process? If so, should I do this? Are there any other optimizations that I should do to help with performance? Thanks

Read the article

My program is spending most of its time in objc_msgSend. Does that mean that Objective-C has bad per

- by Paperflyer

Hello Stackoverflow. I have written an application that has a number of custom views and generally draws a lot of lines and bitmaps. Since performance is somewhat critical for the application, I spent a good amount of time optimizing draw performance. Now, activity monitor tells me that my application is usually using about 12% CPU and Instrument (the profiler) says that a whopping 10% CPU is spent in objc_msgSend (mostly in drawing related system calls). On the one hand, I am glad about this since it means that my drawing is about as fast as it gets and my optimizations where a huge success. On the other hand, it seems to imply that the only thing that is still using my CPU is the Objective-C overhead for messages (objc_msgSend). Hence, that if I had written the application in, say, Carbon, its performance would be drastically better. Now I am tempted to conclude that Objective-C is a language with bad performance, even though Cocoa seems to be awfully efficient since it can apparently draw faster than Objective-C can send messages. So, is Objective-C really a language with bad performance? What do you think about that?

Read the article

Obfuscator for .NET assembly (Maybe just a C++ obfuscator?)

- by Pirate for Profit

The software company I work for is using a ton of open source LGPL/BSD/MIT C++ code that we have written wrappers around to port "helper classes" into a .NET assembly, via C++/CLI. These libraries have wrapped old cryptic APIs into easy-to-use ones based on common sense, and will be very helpful for a lot of different tasks will be included in many future client's applications, and we might even license it to other software companies in the same field. So naturally we are tasked with looking into solutions for securing the code from prying eyes. What we're trying to do is stop the casual observer from seeing what's going on. Now I have hacked some crazy shit in EverQuest and other video games in my day so I know with enough tireless effort anything can be done. But we don't want to make it easy for whomever. To the point, besides the Visual Studio compiler's optimizations, is there's a C++ obfuscator or .NET assembly obfuscator (after it's been built o.O) or something that would scramble everything up, re-arrange data structures, string constants, etc. idk? And if such a thing exists, we'd be curious to know how that would impact performance, as some sections of code are time critical (funny saying that using a managed M$ framework).

Read the article

Phonegap: Will my mobile app 'feel' faster or slower once ported to phonegap?

- by user15872

So I'm designing everything in mobile Safari and I know that phonegap is essentially a stripped webview but... Question: Will my application will run better in phonegap? (revised below) a)I imagine my navigation and core app will load faster as the scripts and images are on the hard drive. Is this True? b)I assume since they've been working on it for 2 years now that they may have made some optimizations to make it quicker than just an average safari window. Is this true? (Assuming both html5/js/css code bases are pretty much the same and app is running on iOS.) Update: Sorry, I meant to compare apples to slightly different apples. Question 1 revised: Will my app see any performance benefits running with in a phonegap environment vs standard mobile safari? (compare mobile - to mobile) 1b) In what ways, other than loading time has phonegap optimized performance over standard mobile safari? Follow ups: 1) Are there any pitfalls, other than large libraries, that may cause phonegap to suffer a serious performance hit vs stand mobile safari? 2) Can I mix native and webview rendering? (i.e the top half of my app is rendered in with html/css/js and the bottom half native)

Read the article

DB optimization to use it as a queue

- by anony

We have a table called worktable which has some columns(key(primary key), ptime, aname, status, content) we have something called producer which puts in rows in this table and we have consumer which does an order-by on the key column and fetches the first row which has status as 'pending'. The consumer does some processing on this row: 1. updates status to "processing" 2. does some processing using content 3. deletes the row we are facing contention issues when we try to run multiple consumers(probably due to the order-by which does a full table scan)... using Advanced queues would be our next step but before we go there we want to check what is the max throughput we can achieve with multiple consumers and producer on the table. What are the optimizations we can do to get the best numbers possible? Can we do an in-memory processing where a consumer fetches 1000 rows at a time processes and deletes? will that improve? What are other possibilities? partitioning of table? parallelization? Index organized tables?...

Read the article

Find all A^x in a given range

- by Austin Henley

I need to find all monomials in the form AX that when evaluated falls within a range from m to n. It is safe to say that the base A is greater than 1, the power X is greater than 2, and only integers need to be used. For example, in the range 50 to 100, the solutions would be: 2^6 3^4 4^3 My first attempt to solve this was to brute force all combinations of A and X that make "sense." However this becomes too slow when used for very large numbers in a big range since these solutions are used in part of much more intensive processing. Here is the code: def monoSearch(min, max): base = 2 power = 3 while 1: while base**power < max: if base**power > min: print "Found " + repr(base) + "^" + repr(power) + " = " + repr(base**power) power = power + 1 base = base + 1 power = 3 if base**power > max: break I could remove one base**power by saving the value in a temporary variable but I don't think that would make a drastic effect. I also wondered if using logarithms would be better or if there was a closed form expression for this. I am open to any optimizations or alternatives to finding the solutions.

Read the article

NHibernate Performance Optimization | Suggestions invited!!!

- by user336749

Hi, I’m facing an issue with NHibernate performance and can you please suggest me some optimizations? Below mentioned is a small summary of my application architecture I have a windows service which is listening to a messaging bus. On receiving a message the service creates an object out of which a property is the received xml snippet and saves the message to the DB (uses NH). There is a WPF UI with a readonly connection to the DB, and on refresh of the UI it displays the objects on the screen. While the UI does a refresh, it retrieves the xml and deserializes it , from which the object’s properties are derived and binded to the screen. For example assume an xml XXX is received by the service, it deserializes the xml , creates the book object and save it to the DB and a property/column is SCHEMA which contains the xml snippet. The UI while refreshed searches all book objects by ID and creates the book objects out of the xml which is being saved (yes, the xml is the constructor param). Now my issue is that the refresh takes more than 2 minutes to display say 50 book objects. I analyzed it using the NHibernate profiler, and found that the time spend within the DB is negligible, however time spent to create the entities is proportionally huge(10ms:1990 ms).I guess it’s due to the fairly huge size of xml snippet and it’s deserialization. My question is, how can I improve the performance. I dispose sessions after every refresh and is not lazy loading (please note that the time spend in DB is negligible). On every refresh it’s possible that all objects are updated by some downstream systems or maybe one of them are updated.Can I implement some sort of caching mechanism in this case? Thanks in advance for any suggestions. Regards, -Mike

Read the article

Performance issue between builds

- by DeadMG

I've been developing a small indie game in my spare time and have run across an inexplicable issue. Some builds of the game will randomly run several hundred frames per second slower than other builds. For example, when rendering some text and no 3D scene, I can achieve 1800FPS on my own hardware. Add one 3D sphere (10k verts, pixel shaded), achieve 1700 FPS. Add two more spheres, achieve 800 FPS. Remove all spheres, achieve 1100FPS- even though the code now renders the same scene as I previously achieved at 1800FPS, which is just the FPS counter being rendered. I've tried rebuilding and cleaning the project and rebooting the compiler. This is in Release mode and I turned on all the optimizations I could find. Any suggestions as to the cause? I ran a quick profile, and Visual Studio seems to think that over 90% of my time was spent in D3D9_43.dll, suggesting that it's not a bug in my app, which doesn't explain why it manifests in only some builds. I rebooted my machine and it's back up to 1800FPS. I think it's a bug in the DirectX SDK tools (amongst many others). Going to delete this question.

Read the article

Data transformation question

- by tkm

I have data composed of a list of employers and a list of workers. Each has a many-to-many relationship with the other (so an employer can have many workers, and a worker can have many employers). The way the data is retrieved (and given to me) is as follows: each employer has an array of workers. In other words: employer n has: worker x, worker y etc. So I have a bunch of employer objects each containing an array of workers. I need to transform this data (and basically invert the relationship). I need to have a bunch of worker objects, each containing and array of employers. In other words: worker x has: employer n1, employer n2 etc. The context is hypothetical so please don't comment on why I need this or why I am doing it this way. I would really just like help on the algorithm to perform this transformation (there isn't that much data so I would prefer readability over complex optimizations which reduce complexity). (Oh and I am using Java, but pseudocode would be fine). Thanks!

Read the article

Windows CE Chat Transcript (March 30, 2010)

- by Bruce Eitman

For those of you who missed the chat today, here is the raw transcript. By raw, I mean that I copied and pasted the discussion without any edits. This is divided into two parts, the top part is the answers from the Microsoft Experts and the bottom part is the questions from the audience. Answers from Microsoft: Karel Danihelka [MS] (Expert)[2010-3-30 12:2]: Hi everyone, my name is Karel Danihelka and I am developer in partner response team. Sing Wee [MS] (Expert)[2010-3-30 12:2]: Hi, I'm Sing Wee, part of the CoreOS/BSP Test Team. GLanger_MS (Expert)[2010-3-30 12:2]: Hi, I'm Glen Langer, program manager on the Core Team. Karel Danihelka [MS] (Expert)[2010-3-30 12:3]: Q: I need to implement hardware timers on my windows CE 6.0 device to trigger events at microsecond intervals. Where should i start? A: Until you are using CPU with GHz frequency your only chance is use interrupt handler and implement all funcionality there. But it will be really tricky and may reduce system performance. If period will be near to millisecond timeframe you can use normal thread wait for event pattern. Karel Danihelka [MS] (Expert)[2010-3-30 12:5]: Q: I want to partition my NAND Flash device. One partition to use for hive ragistry and the other for the apps and data. The only way to do it is programmatically or setting some registry values ? A: It need to be set in registry - generally you need mark this partition as boot partition. Karel Danihelka [MS] (Expert)[2010-3-30 12:7]: Q: My CPU is Intel celeron M processor 1Ghz. A: In this case you can try use normal approach - in interrupt handler return SYSINTR and start thread in device driver which will spin thread waiting on event attached to this SYSINTR. Karel Danihelka [MS] (Expert)[2010-3-30 12:7]: Q: If i need to implement it using interrupt handlers, What are all the files that I should look at? A: Good quesiton - I would recommend documentation and there was BSP development book to download for free. mikehall_ms (Moderator)[2010-3-30 12:8]: Q: Hi guys, what's the formal way to report bugs back to the core team / product team? The mechanism of calling the support phone number every time is really onerous and time-consuming. Is there another mechanism? A: Using product support is the formal way to report bugs/issues - Product support can then create an issue that can be tracked. Karel Danihelka [MS] (Expert)[2010-3-30 12:9]: Q: But the operation for creating the partitions ? A: This is tricky - if you will make it autopartition & autoformat it will be created by filesystem. But generally it depends on your boot loader. mikehall_ms (Moderator)[2010-3-30 12:10]: Q: Is Windows Phone 7 related to Windows CE? If so, can you tell me what version of Windows CE is the basis? Is it in fact the new version of Windows Mobile? A: At MIX 2010 Charlie Kindel presented a session that described some of the core technologies that make up Windows Phone 7 Series, including the underlying operating system (Windows CE) and the new ISV programming model based on Silverlight and .NET - check out the Mix Online Videos to get more information. davbo-msft (Moderator)[2010-3-30 12:10]: Q: Is Windows Phone 7 related to Windows CE? If so, can you tell me what version of Windows CE is the basis? Is it in fact the new version of Windows Mobile? A: This forum is to discussed released products in the industry. Windows Mobile & Windows CE are based on the same Windows CE Kernel/system. Windows CE is focused on deliverying the OS for embedded customers in the market where Windows Mobile is focused on deliverying compelling Windows Phone platform. davbo-msft (Moderator)[2010-3-30 12:11]: Q: Is Windows Phone 7 related to Windows CE? If so, can you tell me what version of Windows CE is the basis? Is it in fact the new version of Windows Mobile? A: http://en.wikipedia.org/wiki/Windows_CE Wikipedia gives a good breakdown of the version history. Travis Hobrla [MS] (Expert)[2010-3-30 12:13]: Q: I created a OS design with KITL and kernel debugger enabled. But I am unable to connect to the target for debugging. I am getting the following error when i try to connect with the device. PB Debugger Cannot initialize the Kernel Debugger. PB Debugger Debugger could not initialize connection. PB Debugger The Kernel Debugger is waiting to connect with target. PB Debugger The Kernel Debugger has been disconnected successfully. A: One possibility is that a rogue cesvchost.exe has co-opted the debugger. I am assuming this is CE 6.0? Can you try exiting visual studio and manually killing the cesvchost.exe process from the Task Manager? davbo-msft (Moderator)[2010-3-30 12:14]: Q: Hi guys, what's the formal way to report bugs back to the core team / product team? The mechanism of calling the support phone number every time is really onerous and time-consuming. Is there another mechanism? A: For info on contacting Microsoft support refer to the support page on the Embedded website: http://msdn.microsoft.com/en-us/windowsembedded/dd897633.aspx Sing Wee [MS] (Expert)[2010-3-30 12:16]: Q: Do u mean ISR/IST implementation? How can i register an interrupt? What kind of interrupt should i register? A: A good introduction to interrupts in WinCE 6.0 can be found here (aside from the documentation on MSDN): http://download.microsoft.com/download/9/c/f/9cffaa58-4000-48d6-a4b2-5fed9e4e6410/Chapter%206%20-%20Developing%20Device%20Drivers.pdf mikehall_ms (Moderator)[2010-3-30 12:16]: Q: What will be different in Windows Compact 7 from CE 6.0? A: Unfortunately we cannot discuss unreleased products on this chat - keep an eye on the Windows Embedded web site and blogs to keep up to date with product announcements. Travis Hobrla [MS] (Expert)[2010-3-30 12:16]: Q: I am using CE 6.0. There is no cesvchost process running in my system. A: What operating system are you using? Karel Danihelka [MS] (Expert)[2010-3-30 12:16]: Q: So...I have to modify file system code to create 2 partition at system startup ?!! I haven't understood.... A: You don't need to modify code, there are registry settings to achive this (look to documentation). But you may need to create partition table in boot loader. Unfortunatelly there isn't simple way how to do it. davbo-msft (Moderator)[2010-3-30 12:18]: Q: I would like to get a handle to a Silverlight screen section, is that possiable? A: Windows Embedded CE 6.0 R3 includes Sliverlight for Windows Embedded. Refer to New Features overview on the embedded web site. http://www.microsoft.com/windowsembedded/en-us/products/windowsce/default.mspx. Silverlight - The power of Silverlight brought to Windows Embedded CE to create rich applications and user interfaces is new part of Windows CE Embedded. mikehall_ms (Moderator)[2010-3-30 12:20]: Q: The link for developing device drivers is not working. can u please check that? A: http://msdn.microsoft.com/en-us/library/ms923714.aspx davbo-msft (Moderator)[2010-3-30 12:20]: Q: I would like to get a handle to a Silverlight screen section, is that possiable? A: Sorry misunderstood the question I thought you were asking if embedded CE could handle Silverlight. Please repost so that the question goes back into the active queue because once answered no way to put the status back to open. Travis Hobrla [MS] (Expert)[2010-3-30 12:21]: Q: sorry! Windows XP SP3 A: Can you try exiting VS2005 and confirming cesvchost.exe is not running, then renaming C:\Documents and Settings\USERNAME\Local Settings\Application Data\Microsoft\CoreCon\1.0 to 1.0_backup, then restarting VS2005? Sing Wee [MS] (Expert)[2010-3-30 12:24]: Q: Can I have the book's name please? A: I believe the downloadable version is related to the last link I sent. If you go to the following website, I believe you can download the whole thing: http://msdn.microsoft.com/en-us/windowsembedded/ce/cc294468.aspx davbo-msft (Moderator)[2010-3-30 12:24]: Q: If one has an image on a Silverlight page, it seems to be cached. How would one refresh that cache after changing the underlying image? A: change the URI of the image or use a writeable bitmap if they want to manually toggle the pixels Sing Wee [MS] (Expert)[2010-3-30 12:25]: A: Whoops, hit [ENTER] too early. On the right side, you'll see there an "Exam Preparation Kit" link that can be downloaded in several different languages. Sing Wee [MS] (Expert)[2010-3-30 12:25]: Q: Can I have the book's name please? A: Whoops, hit [ENTER] too early. On the right side, you'll see there an "Exam Preparation Kit" link that can be downloaded in several different languages. Karel Danihelka [MS] (Expert)[2010-3-30 12:26]: Q: I have a NAND Flash on my target device. On this flash I have the hive registry and an application.I have observed that when the NAND flash is fully, the system startup time is longer....is there a degradation of NAND use that influences the startup time ? Why ? A: Yes - on boot flash abstraction library (old one) read metadata from all sectors to rebuild physical - logical mapping table. davbo-msft (Moderator)[2010-3-30 12:27]: Q: I would like to get a handle to a Silverlight screen section, is that possiable? A: Need addition info on this question. Can you provide more details on what you are trying to do in Silverlight? davbo-msft (Moderator)[2010-3-30 12:27]: Q: If one has an image on a Silverlight page, it seems to be cached. How would one refresh that cache after changing the underlying image? A: Additonal Info: if you want to manually touch the pixels use WriteableBitmap if you want to use the underlying HWND then use IXRVisualHost::GetHWND() davbo-msft (Moderator)[2010-3-30 12:28]: Q: Writable bitmap, is there an example of the syntax? A: if you want to manually touch the pixels use WriteableBitmap if you want to use the underlying HWND then use IXRVisualHost::GetHWND() davbo-msft (Moderator)[2010-3-30 12:29]: Q: I would like to get a handle to a Silverlight screen section, is that possiable? A: Can I get more information about this question about what you are trying to accomplish in Silverlight? davbo-msft (Moderator)[2010-3-30 12:31]: Q: IXRVisualHost::GetHWND() exactly what I needed Thanks, A: Your welcome Sing Wee [MS] (Expert)[2010-3-30 12:31]: Q: ok. thanks for the book's link A: No problem. Travis Hobrla [MS] (Expert)[2010-3-30 12:32]: Q: Typically for SoC devices you name your hardware specific libraries in the form "SOCDIRNAME_LIBNAME". In our platform "OMAP35XX_TPS659XX_TI_V1" if you do this we cause the catalog parser to die... For example if we have a library "Musbfn_OMAP35XX_TPS659XX_TI_V1.dll" entering this in the catalogs pbcxml file in a <module> section causes the XML parser to fail with : Error 3 The 'urn:Microsoft.PlatformBuilder/Catalog:Module' element is invalid - The value '012345678901234567890123456789.dll' is invalid according to its datatype 'urn:Microsoft.PlatformBuilder/Catalog:CatalogFileName' - The actual length is greater than the MaxLength value. A: There are a couple workarounds I can think of. I believe the Module element is only used when doing SYSGEN parsing to make sure dependent SYSGENs are present when the item is selected, so I believe it is optional to the catalog. The other obvious workaround is to shorten the soc name. I realize neither of these solutions is ideal. This is not something we anticipated when we tested CE6.0, sorry. Travis Hobrla [MS] (Expert)[2010-3-30 12:33]: Q: I am getting this error only when I select the KdStub as the debugger in Target device connectivity. A: Right, but KdStub is the debugger that you should use. Have you tried the steps I suggested? Travis Hobrla [MS] (Expert)[2010-3-30 12:36]: Q: If I select Active KTIL, My OS doesn't boots. It says "loading NK.EXE at 0x<xxxxx> location" after that nothing comes in the debug log. A: Can you look at the serial debug output and see what is happening there? Often it can give you a clue to the KITL driver malfunctioning. Travis Hobrla [MS] (Expert)[2010-3-30 12:38]: Q: I have tried that and I am getting the same error. A: I am assuming you have a device created in Target -> Connectivity Options in Platform Builder. What are the Kernel download / Kernel transport for your device? Travis Hobrla [MS] (Expert)[2010-3-30 12:40]: Q: KITL: *** Device Name CEPC56059 *** WARN: KITL will run in polling mode VBridge:: built on [Jul 10 2009] time [10:20:14] VBridgeInit()...TX = [16384] bytes -- Rx = [16384] bytes Tx buffer [0xA1B84860] to [0xA1B88860]. Rx buffer [0xA1B88880] to [0xA1B8C880]. VBridge:: NK add MAC: [0-60-65-2-DA-FB] Connecting to Desktop KITL: Connected host IP: 1 Port: 1086 .. this is the output of the serial debug A: This looks reasonable and does not give clues as to why boot would halt at that point. If you capture a network trace or turn on KITL debug zones via dpCurSettings in kitl.dll, do you see KITL active after this? Travis Hobrla [MS] (Expert)[2010-3-30 12:41]: Q: Both is happening via Ethernet. A: Only thing I have left to suggest is a Platform Builder installation Repair, then. Karel Danihelka [MS] (Expert)[2010-3-30 12:42]: Q: Hi, I saw that the ATADISK is quite generic and des not have any optimizations. Do you have any advice to consider while tryin to improve the performance of it? A: If I remember correctly sample code has support for some specific hardware controllers (little obsolete now). This should be good start point (if you will not decide take existing driver as sample and write you own). Travis Hobrla [MS] (Expert)[2010-3-30 12:44]: Q: I didn't do that. I have to try. A: I think that's the next valid step. You need to figure out whether KITL is hanging or the device - use instrumented serial debug messages and network trace to determine this. Sing Wee [MS] (Expert)[2010-3-30 12:46]: Q: KITL: *** Device Name CEPC56059 *** WARN: KITL will run in polling mode VBridge:: built on [Jul 10 2009] time [10:20:14] VBridgeInit()...TX = [16384] bytes -- Rx = [16384] bytes Tx buffer [0xA1B84860] to [0xA1B88860]. Rx buffer [0xA1B88880] to [0xA1B8C880]. VBridge:: NK add MAC: [0-60-65-2-DA-FB] Connecting to Desktop KITL: Connected host IP: 1 Port: 1086 .. this is the output of the serial debug A: Neo, have you by any chance tried looking into your firewall to see if it might be blocking traffic on any particular ports? Wireshark/netmon might be able to help you here if that's the issue. davbo-msft (Moderator)[2010-3-30 12:48]: Q: I lost spell check, how can i get it back A: Hello - can you give additional details about your question? Is this related to a Windows CE Embedded application? masatos_MSFT (Expert)[2010-3-30 12:51]: Q: When attempting to run the CETK cellcore tests the documentation states the pre-requisites include "stinger.ini", "ltk.ini" but windows CE doesn't provide these or document what they fully need to contain. Implicitly you also need "datatrans.xml" which isn't supplied. If you get around this error and steal these from Windows Mobile instead, when you try and run the CETK tests you get a data abort in radiometricsdll.dll. How should we invoke the cellcore parts of CETK? A: Hi Pev, what version of Windows CE and CETK are you using? I do not have the expertise to answer this question, but can find somebody who can. Travis Hobrla [MS] (Expert)[2010-3-30 12:52]: Q: I don't see a kitlcore.dll in my OS. is my debug image fails to load because of that? A: kitl.dll should be all that's needed, kitlcore.lib is linked into that. Travis Hobrla [MS] (Expert)[2010-3-30 12:55]: Q: I've got a platform (not developed by myself) where I2C bus support has to be provided through the OAL as the kernel needs to talk to devices such as the power management IC and gas gauge so a 'proper' I2C driver hanging off device manager isn't possible. This happens to be a polled driver, so obviously it hits the system hard when either under lots of traffic or an error condition occurs and the driver constantly polls. I originally thought that there was no straightforward way to make such code interrupt driven in the kernel (as it's a cludge) but I realised that that's exactly what ETHDBG drivers do. Is there any reason why I shouldn't have a go at implementing a similar mechanism for our kernel resident I2C driver? If not, are there any obvious pitfalls - I've not seen any other BSP's do this in the past... A: You can make a 'proper' driver that calls down into the OAL to do the actual I2C transactions. Alternatively you can build an interrupt-based version in the OAL where you handle everything in the ISR. There is nothing wrong with that so long as the rest of your drivers and app threads can handle longer times with interrupts off while you are servicing I2C interrupts. Sing Wee [MS] (Expert)[2010-3-30 12:55]: Q: I am having trouble with my mouse, I have the microsoft wireless mobile mouse 3000, when I push the scroll button I am suppose to have autoscroll instead it shows other web pages,Can you help me out tell me what to do!!! A: Sorry, this current chat is about Windows Embedded Compact. Hope you're able to find an answer to your question elsewhere. davbo-msft (Moderator)[2010-3-30 12:56]: Q: Is the Silverlight Animation "Spline" a BezierSpline? A: Spline - http://msdn.microsoft.com/en-us/library/ee501495.aspx<BR< a>> masatos_MSFT (Expert)[2010-3-30 12:57]: Thanks for the info Pev. I will follow up with the CETK experts here and get back to you. davbo-msft (Moderator)[2010-3-30 12:59]: Q: Spline- bad link A: http://msdn.microsoft.com/en-us/library/ee501495.aspx davbo-msft (Moderator)[2010-3-30 13:0]: Q: Sorry, got to tirm the ">" A: No Worries http://msdn.microsoft.com/en-us/library/ee501495.aspx davbo-msft (Moderator)[2010-3-30 13:0]: Hello everyone, we are just about out of time. Thank you for joining us for our Windows Embedded CE 6.0 chat today! <http://www.Microsoft.com/Embedded>; A special thank you to the product group members for coming out. The transcript of today’s chat will be posted online as soon as possible, to <http://msdn.microsoft.com/en-us/chats>;. We’ll see you again for another chat next month. Please check <http://msdn.microsoft.com/en-us/chats>; for the list of upcoming chats. If you still have unanswered questions, let me suggest that you post them on one of our newsgroups on <http://msdn.microsoft.com/en-us/windowsembedded/ce/default.aspx> -Windows Embedded CE 6.0 R3 Now Available! <http://msdn.microsoft.com/windowsembedded/ce/dd630616.aspx>; davbo-msft (Moderator)[2010-3-30 13:1]: Q: hi everybody. I would like to know if there is something know about a bug in RTC API (VOIP), especially when using SIP. According the to the analysis with application verifyier there is a heap link in rtcdllmedia.dll. All of the unreleased chunks seem to have a size of 6560 bytes. A: I will follow up with the Networking Team for a response. davbo-msft (Moderator)[2010-3-30 13:1]: Q: Hi, we've problems with debugging of applications (= breakpoints in Platform Builder will be ignored) over KITL on Windows CE 5.0, if the PDB files are large (over 60MB). Are there any limitations to size of the PDB files? A: I will follow up with the tools team for a response and post with the transcript. Sing Wee [MS] (Expert)[2010-3-30 13:1]: Q: I am unable to use the target control in my development environment. any ideas? A: Make sure you have SYSGEN_SHELL=1 set in your build environment. davbo-msft (Moderator)[2010-3-30 13:3]: Q: what are the main differences between Object Store and RAM disk ? They are both in RAM...are there performance differences ? access differences ? A: I will follow up with the Core Team and get a response posted with the transcript to MSDN The Questions [2010-3-30 12:57]: Thanks for the info Pev. I will follow up with the CETK experts here and get back to you. [2010-3-30 12:59]: [2010-3-30 13:0]: [2010-3-30 13:0]: Hello everyone, we are just about out of time. Thank you for joining us for our Windows Embedded CE 6.0 chat today! <http://www.Microsoft.com/Embedded>; A special thank you to the product group members for coming out. The transcript of today’s chat will be posted online as soon as possible, to <http://msdn.microsoft.com/en-us/chats>;. We’ll see you again for another chat next month. Please check <http://msdn.microsoft.com/en-us/chats>; for the list of upcoming chats. If you still have unanswered questions, let me suggest that you post them on one of our newsgroups on <http://msdn.microsoft.com/en-us/windowsembedded/ce/default.aspx> -Windows Embedded CE 6.0 R3 Now Available! <http://msdn.microsoft.com/windowsembedded/ce/dd630616.aspx>; [2010-3-30 13:1]: [2010-3-30 13:1]: [2010-3-30 13:1]: [2010-3-30 13:3]: neo (Guest)[2010-3-30 11:37]: Hi all KellyG (Guest)[2010-3-30 11:37]: Hi KellyG (Guest)[2010-3-30 11:37]: I have a question unrelated to windows Ce embedded, can you please help me?? neo (Guest)[2010-3-30 11:38]: I need to implement hardware timers on my windows CE 6.0 device to trigger events at microsecond intervals! c neo (Guest)[2010-3-30 11:38]: yes. post it. May be i cud give a try KellyG (Guest)[2010-3-30 11:38]: My Product key listed on my tower is not the product key I need for microsoft office, but that is the only product key listed. neo (Guest)[2010-3-30 11:39]: I hope this is a chat for windows embedded. please post ur queries in office forums KellyG (Guest)[2010-3-30 11:39]: it is but i could not find a forum for office neo (Guest)[2010-3-30 11:40]: I think moderators will help u out. @ davbo-msft: can u help this guy? neo (Guest)[2010-3-30 11:41]: Q: I need to implement hardware timers on my windows CE 6.0 device to trigger events at microsecond intervals. Where should i start? davbo-msft (Moderator)[2010-3-30 11:50]: Our chat today covers the topic of Windows Embedded CE! 1. This chat will last for one hour. During this hour, our Experts will respond to as many questions as they can. Please understand that there may be some questions we cannot respond to due to lack of information or because the information is not yet public. 2. We encourage you to submit questions for our Experts. To do so, type your questions in the send box, select the “ask the Experts” box and click SEND. Questions sent directly to the Guest Chat room will not be answered by the Experts, but we encourage other community members to assist. 3. We ask that you stay on topic for the duration of the chat. This helps the Guests and Experts follow the conversation more easily. We invite you to ask off topic questions after this chat is over, but not during. 4. Please abide by the Chat Code of Conduct. Chat code of conduct: <http://msdn.microsoft.com/chats/chatroom.aspx?ctl=hlp#Conduct>; Pev (Guest)[2010-3-30 11:54]: Evening! davbo-msft (Moderator)[2010-3-30 11:54]: Hello everyone this is Dave Boyce - I worked in the Multimedia area for Windows CE. neo (Guest)[2010-3-30 11:55]: hello dave neo (Guest)[2010-3-30 11:55]: The chat code of conduct link is not working! Pev (Guest)[2010-3-30 11:56]: Best be polite just in case then ;-) neo (Guest)[2010-3-30 11:56]: davbo-msft (Moderator)[2010-3-30 11:57]: I'll check out the issue w/ the link paolopat (Guest)[2010-3-30 12:0]: Hello davbo-msft (Moderator)[2010-3-30 12:0]: We are pleased to welcome our Experts for today’s chat. I will have them introduce themselves now. Chat will begin in a couple of minutes. <http://www.Microsoft.com/Embedded>; paolopat (Guest)[2010-3-30 12:3]: Hello Experts ! neo (Guest)[2010-3-30 12:3]: Welcome all! paolopat (Guest)[2010-3-30 12:3]: Q: I want to partition my NAND Flash device. One partition to use for hive ragistry and the other for the apps and data. The only way to do it is programmatically or setting some registry values ? neo (Guest)[2010-3-30 12:3]: Q: I need to implement hardware timers on my windows CE 6.0 device to trigger events at microsecond intervals. Where should i start? neo (Guest)[2010-3-30 12:5]: Q: My CPU is Intel celeron M processor 1Ghz. Pev (Guest)[2010-3-30 12:5]: neo: if your silicon has multiple general purpose timers, pick one that's not in use for the system timer / profiler and set it up to trigger irqs for your purpose. You can't guarantee hard realtime type responses though... GarySwalling (Guest)[2010-3-30 12:5]: Q: Is Windows Phone 7 related to Windows CE? If so, can you tell me what version of Windows CE is the basis? Is it in fact the new version of Windows Mobile? Pev (Guest)[2010-3-30 12:6]: Q: Hi guys, what's the formal way to report bugs back to the core team / product team? The mechanism of calling the support phone number every time is really onerous and time-consuming. Is there another mechanism? paolopat (Guest)[2010-3-30 12:6]: Q: But the operation for creating the partitions ? neo (Guest)[2010-3-30 12:6]: Q: If i need to implement it using interrupt handlers, What are all the files that I should look at? GPM (Guest)[2010-3-30 12:6]: Q: I would like to get a handle to a Silverlight screen section, is that possiable? Jhony (Guest)[2010-3-30 12:7]: Q: I created a OS design with KITL and kernel debugger enabled. But I am unable to connect to the target for debugging. I am getting the following error when i try to connect with the device. PB Debugger Cannot initialize the Kernel Debugger. PB Debugger Debugger could not initialize connection. PB Debugger The Kernel Debugger is waiting to connect with target. PB Debugger The Kernel Debugger has been disconnected successfully. Charles (Guest)[2010-3-30 12:7]: What will be different in Windows Compact 7 from CE 6.0? neo (Guest)[2010-3-30 12:8]: Can I have the book's name please? kiefs_dev (Guest)[2010-3-30 12:8]: Q: hi everybody. I would like to know if there is something know about a bug in RTC API (VOIP), especially when using SIP. According the to the analysis with application verifyier there is a heap link in rtcdllmedia.dll. All of the unreleased chunks seem to have a size of 6560 bytes. paolopat (Guest)[2010-3-30 12:10]: Q: So...I have to modify file system code to create 2 partition at system startup ?!! I haven't understood.... neo (Guest)[2010-3-30 12:10]: Q: Do u mean ISR/IST implementation? How can i register an interrupt? What kind of interrupt should i register? neo (Guest)[2010-3-30 12:11]: Q: Can I have the book's name please? Charles (Guest)[2010-3-30 12:11]: Q: What will be different in Windows Compact 7 from CE 6.0? PaulT (Guest)[2010-3-30 12:11]: neo: I'd say that you really need the docs for YOUR BSP, not generic documents for BSPs in general. Each BSP may be architected differently. If you're using the CEPC BSP, then the documentation that comes with Platform Builder is a reasonable place to look. GPM (Guest)[2010-3-30 12:11]: Q: If one has an image on a Silverlight page, it seems to be cached. How would one refresh that cache after changing the underlying image? Elektrobit (Guest)[2010-3-30 12:12]: Q: Hi, we've problems with debugging of applications (= breakpoints in Platform Builder will be ignored) over KITL on Windows CE 5.0, if the PDB files are large (over 60MB). Are there any limitations to size of the PDB files? Jhony (Guest)[2010-3-30 12:15]: Q: I am using CE 6.0. There is no cesvchost process running in my system. alexquisi (Guest)[2010-3-30 12:15]: Q: Hi, I saw that the ATADISK is quite generic and des not have any optimizations. Do you have any advice to consider while tryin to improve the performance of it? Jhony (Guest)[2010-3-30 12:17]: Windows XP service pack 1 Jhony (Guest)[2010-3-30 12:17]: Q: sorry! Windows XP SP3 neo (Guest)[2010-3-30 12:19]: Q: The link for developing device drivers is not working. can u please check that? paolopat (Guest)[2010-3-30 12:20]: Q: I have a NAND Flash on my target device. On this flash I have the hive registry and an application.I have observed that when the NAND flash is fully, the system startup time is longer....is there a degradation of NAND use that influences the startup time ? Why ? Pev (Guest)[2010-3-30 12:20]: Q: When attempting to run the CETK cellcore tests the documentation states the pre-requisites include "stinger.ini", "ltk.ini" but windows CE doesn't provide these or document what they fully need to contain. Implicitly you also need "datatrans.xml" which isn't supplied. If you get around this error and steal these from Windows Mobile instead, when you try and run the CETK tests you get a data abort in radiometricsdll.dll. How should we invoke the cellcore parts of CETK? Pev (Guest)[2010-3-30 12:21]: Hi all, Pev (Guest)[2010-3-30 12:21]: oops Pev (Guest)[2010-3-30 12:21]: :-D Pev (Guest)[2010-3-30 12:24]: Typically for SoC devices you name your hardware specific libraries in the form "SOCDIRNAME_LIBNAME". In our platform "OMAP35XX_TPS659XX_TI_V1" if you do this we cause the catalog parser to die... For example if we have a library "Musbfn_OMAP35XX_TPS659XX_TI_V1.dll" entering this in the catalogs pbcxml file in a <module> section causes the XML parser to fail with : Pev (Guest)[2010-3-30 12:25]: Q: Error 3 The 'urn:Microsoft.PlatformBuilder/Catalog:Module' element is invalid - The value '012345678901234567890123456789.dll' is invalid according to its datatype 'urn:Microsoft.PlatformBuilder/Catalog:CatalogFileName' - The actual length is greater than the MaxLength value. GPM (Guest)[2010-3-30 12:25]: Q: Writable bitmap, is there an example of the syntax? Pev (Guest)[2010-3-30 12:25]: Q: Typically for SoC devices you name your hardware specific libraries in the form "SOCDIRNAME_LIBNAME". In our platform "OMAP35XX_TPS659XX_TI_V1" if you do this we cause the catalog parser to die... For example if we have a library "Musbfn_OMAP35XX_TPS659XX_TI_V1.dll" entering this in the catalogs pbcxml file in a <module> section causes the XML parser to fail with : Error 3 The 'urn:Microsoft.PlatformBuilder/Catalog:Module' element is invalid - The value '012345678901234567890123456789.dll' is invalid according to its datatype 'urn:Microsoft.PlatformBuilder/Catalog:CatalogFileName' - The actual length is greater than the MaxLength value. Pev (Guest)[2010-3-30 12:25]: sorry, messed up submission there! GPM (Guest)[2010-3-30 12:26]: Q: I would like to get a handle to a Silverlight screen section, is that possiable? GPM (Guest)[2010-3-30 12:28]: Q: IXRVisualHost::GetHWND() exactly what I needed Thanks, PaulT (Guest)[2010-3-30 12:29]: GPM: You don't have to keep submitting the questions. The chat experts have an application that they're using to follow the chat and all Ask the Experts questions are logged. Jhony (Guest)[2010-3-30 12:29]: Q: I am getting this error only when I select the KdStub as the debugger in Target device connectivity. neo (Guest)[2010-3-30 12:30]: Q: ok. thanks for the book's link Pev (Guest)[2010-3-30 12:31]: Hm, did those two I submitted get picked up by anyone? neo (Guest)[2010-3-30 12:33]: Q: If I select Active KTIL, My OS doesn't boots. It says "loading NK.EXE at 0x<xxxxx> location" after that nothing comes in the debug log. PaulT (Guest)[2010-3-30 12:34]: Pev: PaulT (Guest)[2010-3-30 12:35]: Pev: I'm sure they did. The guys who are actually on the chat may not be experts in that part of things. That's usually the explanation when you don't get an answer in 10 minutes or so. Pev (Guest)[2010-3-30 12:36]: Ah, fair enough Susie (Guest)[2010-3-30 12:36]: My Outlook Express incoming mail is corrput. No ONE has been able to fix the problem, Dell or Norton. I have dial up I'm in a rural area Jhony (Guest)[2010-3-30 12:37]: Q: I have tried that and I am getting the same error. Pev (Guest)[2010-3-30 12:37]: Susie : Use Thunderbird instead :-D PaulT (Guest)[2010-3-30 12:37]: Susie: Sorry, but this chat is not about Windows, but Embedded (like what runs on a phone). Your best chance is to find a local expert or talk to your ISP. paolopat (Guest)[2010-3-30 12:37]: Q: what are the main differences between Object Store and RAM disk ? They are both in RAM...are there performance differences ? access differences ? Susie (Guest)[2010-3-30 12:38]: My computer knowlege is very limited, what is Thunderbird? Pev (Guest)[2010-3-30 12:38]: A different email client :-D neo (Guest)[2010-3-30 12:39]: Q: KITL: *** Device Name CEPC56059 *** WARN: KITL will run in polling mode VBridge:: built on [Jul 10 2009] time [10:20:14] VBridgeInit()...TX = [16384] bytes -- Rx = [16384] bytes Tx buffer [0xA1B84860] to [0xA1B88860]. Rx buffer [0xA1B88880] to [0xA1B8C880]. VBridge:: NK add MAC: [0-60-65-2-DA-FB] Connecting to Desktop KITL: Connected host IP: 1 Port: 1086 .. this is the output of the serial debug Susie (Guest)[2010-3-30 12:39]: Do I need to uninstall Outlook Express youngboyzie (Guest)[2010-3-30 12:39]: I need to start battery calibration for my new battery for my dell inspiron 1525 laptop and should be able to reach the BIOS screen by hitting f2 but this isnt working... help? Pev (Guest)[2010-3-30 12:39]: Nah, you can run it instead - you'll still need help from your ISP to configure it I expect Jhony (Guest)[2010-3-30 12:39]: Q: Both is happening via Ethernet. PaulT (Guest)[2010-3-30 12:40]: youngboyzie: You're off-topic. This is not a general chat for Windows and certainly not for Dell. You'll have to ask Dell how to get to setup; it's their machine. bill (Guest)[2010-3-30 12:41]: I lost spell check, how can i get et back neo (Guest)[2010-3-30 12:42]: Q: I didn't do that. I have to try. Jhony (Guest)[2010-3-30 12:43]: Q: Ok. I will do it then. bill (Guest)[2010-3-30 12:43]: Q: I lost spell check, how can i get it back PaulT (Guest)[2010-3-30 12:44]: bill: This isn't a general Windows chat. There are some Web forums that you might try. GarySwalling (Guest)[2010-3-30 12:45]: Q: Thanks, I found the Phone 7 presentation at http://live.visitmix.com/MIX10/Sessions/CL13 GPM (Guest)[2010-3-30 12:45]: Q: Is the Silverlight Animation "Spline" a BezierSpline? neo (Guest)[2010-3-30 12:46]: Q: ok. I'll do it. thanks Pev (Guest)[2010-3-30 12:46]: Whowever was asking about KITL connection : I've had this loads in the past. I think I started debugging last time by using wireshark to see what was happening on the network then setting up the OAL_ETHER and OAL_FUNC and OAL_VERBOSE as well as OAL_KITL flags to see what was actually happening in the driver.... Pev (Guest)[2010-3-30 12:47]: I'd generally make sure that you're testing though a 10baseT hub (instead of anything faster) and forcing Active KITL in polled mode too... neo (Guest)[2010-3-30 12:48]: Q: I disabled the firewall in my PC. Pev (Guest)[2010-3-30 12:50]: Q: I've got a platform (not developed by myself) where I2C bus support has to be provided through the OAL as the kernel needs to talk to devices such as the power management IC and gas gauge so a 'proper' I2C driver hanging off device manager isn't possible. This happens to be a polled driver, so obviously it hits the system hard when either under lots of traffic or an error condition occurs and the driver constantly polls. I originally thought that there was no straightforward way to make such code interrupt driven in the kernel (as it's a cludge) but I realised that that's exactly what ETHDBG drivers do. Is there any reason why I shouldn't have a go at implementing a similar mechanism for our kernel resident I2C driver? If not, are there any obvious pitfalls - I've not seen any other BSP's do this in the past... neo (Guest)[2010-3-30 12:51]: Q: I don't see a kitlcore.dll in my OS. is my debug image fails to load because of that? Pev (Guest)[2010-3-30 12:52]: Q: Hi masatos, I'm using Windows Embedded CE 6.0 with R3 and patched to feb 2010's QFE's (with it's associated CETK version) this is a machine with only CE 6.0 on (no conflicts with earlier CE or WM...) neo (Guest)[2010-3-30 12:54]: ok. Got it neo (Guest)[2010-3-30 12:54]: Q: ok. Got it Roundman (Guest)[2010-3-30 12:55]: Q: I am having trouble with my mouse, I have the microsoft wireless mobile mouse 3000, when I push the scroll button I am suppose to have autoscroll instead it shows other web pages,Can you help me out tell me what to do!!! Pev (Guest)[2010-3-30 12:55]: Hey neo, debugging kitl issues is really frustrating but dont lose heart :-) neo (Guest)[2010-3-30 12:55]: @ pev : u fixed the problem of KITL after that? neo (Guest)[2010-3-30 12:56]: I am getting the same error again and again. I even cleaned my environment and tried in a fresh PC. But didn't succeed yet Pev (Guest)[2010-3-30 12:56]: Well, eventually - my experiences probably won't help you as different platforms have different reasons for doing that neo (Guest)[2010-3-30 12:56]: I think so PaulT (Guest)[2010-3-30 12:58]: neo: Have you searched the old messages in microsoft.public.windowsce.platbuilder? It seems to me that there was a packet size situation where it was possible to have problems with KITL connections based on a setting on the PC. Google Groups, groups.google.com, Advanced Groups Search will allow you to search a single newsgroup or a set of newsgroups easily. GPM (Guest)[2010-3-30 12:58]: Q: Spline- bad link PaulT (Guest)[2010-3-30 12:58]: GPM: without the > at the end does it work? It seems to for me... neo (Guest)[2010-3-30 12:58]: But some times if i try to connect to the device again. The Image information is seen in the serial debug. what does that mean?Download BIN file information: ----------------------------------------------------- [0]: Base Address=0x220000 Length=0x18DAADC Received a broadcast message !CheckUDP: Not UDP (proto = 0x00000001) after this i am getting the old errors. PB debugger cannot initialize ... GPM (Guest)[2010-3-30 12:59]: Q: Sorry, got to tirm the ">" neo (Guest)[2010-3-30 12:59]: ok paul. I ll look into that. davbo-msft (Moderator)[2010-3-30 13:0]: Hello everyone, we are just about out of time. Thank you for joining us for our Windows Embedded CE 6.0 chat today! <http://www.Microsoft.com/Embedded>; A special thank you to the product group members for coming out. The transcript of today’s chat will be posted online as soon as possible, to <http://msdn.microsoft.com/en-us/chats>;. We’ll see you again for another chat next month. Please check <http://msdn.microsoft.com/en-us/chats>; for the list of upcoming chats. If you still have unanswered questions, let me suggest that you post them on one of our newsgroups on <http://msdn.microsoft.com/en-us/windowsembedded/ce/default.aspx> -Windows Embedded CE 6.0 R3 Now Available! <http://msdn.microsoft.com/windowsembedded/ce/dd630616.aspx>; neo (Guest)[2010-3-30 13:1]: Q: I am unable to use the target control in my development environment. any ideas? Pev (Guest)[2010-3-30 13:1]: Sure, if KITL isn't connected target control won't work as it runs over kitl... neo (Guest)[2010-3-30 13:1]: ok .thanks pev neo (Guest)[2010-3-30 13:2]: yes. sysgen_shell is set to 1 neo (Guest)[2010-3-30 13:2]: Q: yes. sysgen_shell is set to 1 Marcelovk (Guest)[2010-3-30 13:2]: Q: Is there any way to extract the default command lines of the tests in CETK? I want to have it running unconnected from the desktop. Copyright © 2010 – Bruce Eitman All Rights Reserved

Read the article

How John Got 15x Improvement Without Really Trying

- by rchrd

The following article was published on a Sun Microsystems website a number of years ago by John Feo. It is still useful and worth preserving. So I'm republishing it here. How I Got 15x Improvement Without Really Trying John Feo, Sun Microsystems Taking ten "personal" program codes used in scientific and engineering research, the author was able to get from 2 to 15 times performance improvement easily by applying some simple general optimization techniques. Introduction Scientific research based on computer simulation depends on the simulation for advancement. The research can advance only as fast as the computational codes can execute. The codes' efficiency determines both the rate and quality of results. In the same amount of time, a faster program can generate more results and can carry out a more detailed simulation of physical phenomena than a slower program. Highly optimized programs help science advance quickly and insure that monies supporting scientific research are used as effectively as possible. Scientific computer codes divide into three broad categories: ISV, community, and personal. ISV codes are large, mature production codes developed and sold commercially. The codes improve slowly over time both in methods and capabilities, and they are well tuned for most vendor platforms. Since the codes are mature and complex, there are few opportunities to improve their performance solely through code optimization. Improvements of 10% to 15% are typical. Examples of ISV codes are DYNA3D, Gaussian, and Nastran. Community codes are non-commercial production codes used by a particular research field. Generally, they are developed and distributed by a single academic or research institution with assistance from the community. Most users just run the codes, but some develop new methods and extensions that feed back into the general release. The codes are available on most vendor platforms. Since these codes are younger than ISV codes, there are more opportunities to optimize the source code. Improvements of 50% are not unusual. Examples of community codes are AMBER, CHARM, BLAST, and FASTA. Personal codes are those written by single users or small research groups for their own use. These codes are not distributed, but may be passed from professor-to-student or student-to-student over several years. They form the primordial ocean of applications from which community and ISV codes emerge. Government research grants pay for the development of most personal codes. This paper reports on the nature and performance of this class of codes. Over the last year, I have looked at over two dozen personal codes from more than a dozen research institutions. The codes cover a variety of scientific fields, including astronomy, atmospheric sciences, bioinformatics, biology, chemistry, geology, and physics. The sources range from a few hundred lines to more than ten thousand lines, and are written in Fortran, Fortran 90, C, and C++. For the most part, the codes are modular, documented, and written in a clear, straightforward manner. They do not use complex language features, advanced data structures, programming tricks, or libraries. I had little trouble understanding what the codes did or how data structures were used. Most came with a makefile. Surprisingly, only one of the applications is parallel. All developers have access to parallel machines, so availability is not an issue. Several tried to parallelize their applications, but stopped after encountering difficulties. Lack of education and a perception that parallelism is difficult prevented most from trying. I parallelized several of the codes using OpenMP, and did not judge any of the codes as difficult to parallelize. Even more surprising than the lack of parallelism is the inefficiency of the codes. I was able to get large improvements in performance in a matter of a few days applying simple optimization techniques. Table 1 lists ten representative codes [names and affiliation are omitted to preserve anonymity]. Improvements on one processor range from 2x to 15.5x with a simple average of 4.75x. I did not use sophisticated performance tools or drill deep into the program's execution character as one would do when tuning ISV or community codes. Using only a profiler and source line timers, I identified inefficient sections of code and improved their performance by inspection. The changes were at a high level. I am sure there is another factor of 2 or 3 in each code, and more if the codes are parallelized. The study’s results show that personal scientific codes are running many times slower than they should and that the problem is pervasive. Computational scientists are not sloppy programmers; however, few are trained in the art of computer programming or code optimization. I found that most have a working knowledge of some programming language and standard software engineering practices; but they do not know, or think about, how to make their programs run faster. They simply do not know the standard techniques used to make codes run faster. In fact, they do not even perceive that such techniques exist. The case studies described in this paper show that applying simple, well known techniques can significantly increase the performance of personal codes. It is important that the scientific community and the Government agencies that support scientific research find ways to better educate academic scientific programmers. The inefficiency of their codes is so bad that it is retarding both the quality and progress of scientific research. # cacheperformance redundantoperations loopstructures performanceimprovement 1 x x 15.5 2 x 2.8 3 x x 2.5 4 x 2.1 5 x x 2.0 6 x 5.0 7 x 5.8 8 x 6.3 9 2.2 10 x x 3.3 Table 1 — Area of improvement and performance gains of 10 codes The remainder of the paper is organized as follows: sections 2, 3, and 4 discuss the three most common sources of inefficiencies in the codes studied. These are cache performance, redundant operations, and loop structures. Each section includes several examples. The last section summaries the work and suggests a possible solution to the issues raised. Optimizing cache performance Commodity microprocessor systems use caches to increase memory bandwidth and reduce memory latencies. Typical latencies from processor to L1, L2, local, and remote memory are 3, 10, 50, and 200 cycles, respectively. Moreover, bandwidth falls off dramatically as memory distances increase. Programs that do not use cache effectively run many times slower than programs that do. When optimizing for cache, the biggest performance gains are achieved by accessing data in cache order and reusing data to amortize the overhead of cache misses. Secondary considerations are prefetching, associativity, and replacement; however, the understanding and analysis required to optimize for the latter are probably beyond the capabilities of the non-expert. Much can be gained simply by accessing data in the correct order and maximizing data reuse. 6 out of the 10 codes studied here benefited from such high level optimizations. Array Accesses The most important cache optimization is the most basic: accessing Fortran array elements in column order and C array elements in row order. Four of the ten codes—1, 2, 4, and 10—got it wrong. Compilers will restructure nested loops to optimize cache performance, but may not do so if the loop structure is too complex, or the loop body includes conditionals, complex addressing, or function calls. In code 1, the compiler failed to invert a key loop because of complex addressing do I = 0, 1010, delta_x IM = I - delta_x IP = I + delta_x do J = 5, 995, delta_x JM = J - delta_x JP = J + delta_x T1 = CA1(IP, J) + CA1(I, JP) T2 = CA1(IM, J) + CA1(I, JM) S1 = T1 + T2 - 4 * CA1(I, J) CA(I, J) = CA1(I, J) + D * S1 end do end do In code 2, the culprit is conditionals do I = 1, N do J = 1, N If (IFLAG(I,J) .EQ. 0) then T1 = Value(I, J-1) T2 = Value(I-1, J) T3 = Value(I, J) T4 = Value(I+1, J) T5 = Value(I, J+1) Value(I,J) = 0.25 * (T1 + T2 + T5 + T4) Delta = ABS(T3 - Value(I,J)) If (Delta .GT. MaxDelta) MaxDelta = Delta endif enddo enddo I fixed both programs by inverting the loops by hand. Code 10 has three-dimensional arrays and triply nested loops. The structure of the most computationally intensive loops is too complex to invert automatically or by hand. The only practical solution is to transpose the arrays so that the dimension accessed by the innermost loop is in cache order. The arrays can be transposed at construction or prior to entering a computationally intensive section of code. The former requires all array references to be modified, while the latter is cost effective only if the cost of the transpose is amortized over many accesses. I used the second approach to optimize code 10. Code 5 has four-dimensional arrays and loops are nested four deep. For all of the reasons cited above the compiler is not able to restructure three key loops. Assume C arrays and let the four dimensions of the arrays be i, j, k, and l. In the original code, the index structure of the three loops is L1: for i L2: for i L3: for i for l for l for j for k for j for k for j for k for l So only L3 accesses array elements in cache order. L1 is a very complex loop—much too complex to invert. I brought the loop into cache alignment by transposing the second and fourth dimensions of the arrays. Since the code uses a macro to compute all array indexes, I effected the transpose at construction and changed the macro appropriately. The dimensions of the new arrays are now: i, l, k, and j. L3 is a simple loop and easily inverted. L2 has a loop-carried scalar dependence in k. By promoting the scalar name that carries the dependence to an array, I was able to invert the third and fourth subloops aligning the loop with cache. Code 5 is by far the most difficult of the four codes to optimize for array accesses; but the knowledge required to fix the problems is no more than that required for the other codes. I would judge this code at the limits of, but not beyond, the capabilities of appropriately trained computational scientists. Array Strides When a cache miss occurs, a line (64 bytes) rather than just one word is loaded into the cache. If data is accessed stride 1, than the cost of the miss is amortized over 8 words. Any stride other than one reduces the cost savings. Two of the ten codes studied suffered from non-unit strides. The codes represent two important classes of "strided" codes. Code 1 employs a multi-grid algorithm to reduce time to convergence. The grids are every tenth, fifth, second, and unit element. Since time to convergence is inversely proportional to the distance between elements, coarse grids converge quickly providing good starting values for finer grids. The better starting values further reduce the time to convergence. The downside is that grids of every nth element, n > 1, introduce non-unit strides into the computation. In the original code, much of the savings of the multi-grid algorithm were lost due to this problem. I eliminated the problem by compressing (copying) coarse grids into continuous memory, and rewriting the computation as a function of the compressed grid. On convergence, I copied the final values of the compressed grid back to the original grid. The savings gained from unit stride access of the compressed grid more than paid for the cost of copying. Using compressed grids, the loop from code 1 included in the previous section becomes do j = 1, GZ do i = 1, GZ T1 = CA(i+0, j-1) + CA(i-1, j+0) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) S1 = T1 + T4 - 4 * CA1(i+0, j+0) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 enddo enddo where CA and CA1 are compressed arrays of size GZ. Code 7 traverses a list of objects selecting objects for later processing. The labels of the selected objects are stored in an array. The selection step has unit stride, but the processing steps have irregular stride. A fix is to save the parameters of the selected objects in temporary arrays as they are selected, and pass the temporary arrays to the processing functions. The fix is practical if the same parameters are used in selection as in processing, or if processing comprises a series of distinct steps which use overlapping subsets of the parameters. Both conditions are true for code 7, so I achieved significant improvement by copying parameters to temporary arrays during selection. Data reuse In the previous sections, we optimized for spatial locality. It is also important to optimize for temporal locality. Once read, a datum should be used as much as possible before it is forced from cache. Loop fusion and loop unrolling are two techniques that increase temporal locality. Unfortunately, both techniques increase register pressure—as loop bodies become larger, the number of registers required to hold temporary values grows. Once register spilling occurs, any gains evaporate quickly. For multiprocessors with small register sets or small caches, the sweet spot can be very small. In the ten codes presented here, I found no opportunities for loop fusion and only two opportunities for loop unrolling (codes 1 and 3). In code 1, unrolling the outer and inner loop one iteration increases the number of result values computed by the loop body from 1 to 4, do J = 1, GZ-2, 2 do I = 1, GZ-2, 2 T1 = CA1(i+0, j-1) + CA1(i-1, j+0) T2 = CA1(i+1, j-1) + CA1(i+0, j+0) T3 = CA1(i+0, j+0) + CA1(i-1, j+1) T4 = CA1(i+1, j+0) + CA1(i+0, j+1) T5 = CA1(i+2, j+0) + CA1(i+1, j+1) T6 = CA1(i+1, j+1) + CA1(i+0, j+2) T7 = CA1(i+2, j+1) + CA1(i+1, j+2) S1 = T1 + T4 - 4 * CA1(i+0, j+0) S2 = T2 + T5 - 4 * CA1(i+1, j+0) S3 = T3 + T6 - 4 * CA1(i+0, j+1) S4 = T4 + T7 - 4 * CA1(i+1, j+1) CA(i+0, j+0) = CA1(i+0, j+0) + DD * S1 CA(i+1, j+0) = CA1(i+1, j+0) + DD * S2 CA(i+0, j+1) = CA1(i+0, j+1) + DD * S3 CA(i+1, j+1) = CA1(i+1, j+1) + DD * S4 enddo enddo The loop body executes 12 reads, whereas as the rolled loop shown in the previous section executes 20 reads to compute the same four values. In code 3, two loops are unrolled 8 times and one loop is unrolled 4 times. Here is the before for (k = 0; k < NK[u]; k++) { sum = 0.0; for (y = 0; y < NY; y++) { sum += W[y][u][k] * delta[y]; } backprop[i++]=sum; } and after code for (k = 0; k < KK - 8; k+=8) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (y = 0; y < NY; y++) { sum0 += W[y][0][k+0] * delta[y]; sum1 += W[y][0][k+1] * delta[y]; sum2 += W[y][0][k+2] * delta[y]; sum3 += W[y][0][k+3] * delta[y]; sum4 += W[y][0][k+4] * delta[y]; sum5 += W[y][0][k+5] * delta[y]; sum6 += W[y][0][k+6] * delta[y]; sum7 += W[y][0][k+7] * delta[y]; } backprop[k+0] = sum0; backprop[k+1] = sum1; backprop[k+2] = sum2; backprop[k+3] = sum3; backprop[k+4] = sum4; backprop[k+5] = sum5; backprop[k+6] = sum6; backprop[k+7] = sum7; } for one of the loops unrolled 8 times. Optimizing for temporal locality is the most difficult optimization considered in this paper. The concepts are not difficult, but the sweet spot is small. Identifying where the program can benefit from loop unrolling or loop fusion is not trivial. Moreover, it takes some effort to get it right. Still, educating scientific programmers about temporal locality and teaching them how to optimize for it will pay dividends. Reducing instruction count Execution time is a function of instruction count. Reduce the count and you usually reduce the time. The best solution is to use a more efficient algorithm; that is, an algorithm whose order of complexity is smaller, that converges quicker, or is more accurate. Optimizing source code without changing the algorithm yields smaller, but still significant, gains. This paper considers only the latter because the intent is to study how much better codes can run if written by programmers schooled in basic code optimization techniques. The ten codes studied benefited from three types of "instruction reducing" optimizations. The two most prevalent were hoisting invariant memory and data operations out of inner loops. The third was eliminating unnecessary data copying. The nature of these inefficiencies is language dependent. Memory operations The semantics of C make it difficult for the compiler to determine all the invariant memory operations in a loop. The problem is particularly acute for loops in functions since the compiler may not know the values of the function's parameters at every call site when compiling the function. Most compilers support pragmas to help resolve ambiguities; however, these pragmas are not comprehensive and there is no standard syntax. To guarantee that invariant memory operations are not executed repetitively, the user has little choice but to hoist the operations by hand. The problem is not as severe in Fortran programs because in the absence of equivalence statements, it is a violation of the language's semantics for two names to share memory. Codes 3 and 5 are C programs. In both cases, the compiler did not hoist all invariant memory operations from inner loops. Consider the following loop from code 3 for (y = 0; y < NY; y++) { i = 0; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += delta[y] * I1[i++]; } } } Since dW[y][u] can point to the same memory space as delta for one or more values of y and u, assignment to dW[y][u][k] may change the value of delta[y]. In reality, dW and delta do not overlap in memory, so I rewrote the loop as for (y = 0; y < NY; y++) { i = 0; Dy = delta[y]; for (u = 0; u < NU; u++) { for (k = 0; k < NK[u]; k++) { dW[y][u][k] += Dy * I1[i++]; } } } Failure to hoist invariant memory operations may be due to complex address calculations. If the compiler can not determine that the address calculation is invariant, then it can hoist neither the calculation nor the associated memory operations. As noted above, code 5 uses a macro to address four-dimensional arrays #define MAT4D(a,q,i,j,k) (double *)((a)->data + (q)*(a)->strides[0] + (i)*(a)->strides[3] + (j)*(a)->strides[2] + (k)*(a)->strides[1]) The macro is too complex for the compiler to understand and so, it does not identify any subexpressions as loop invariant. The simplest way to eliminate the address calculation from the innermost loop (over i) is to define a0 = MAT4D(a,q,0,j,k) before the loop and then replace all instances of *MAT4D(a,q,i,j,k) in the loop with a0[i] A similar problem appears in code 6, a Fortran program. The key loop in this program is do n1 = 1, nh nx1 = (n1 - 1) / nz + 1 nz1 = n1 - nz * (nx1 - 1) do n2 = 1, nh nx2 = (n2 - 1) / nz + 1 nz2 = n2 - nz * (nx2 - 1) ndx = nx2 - nx1 ndy = nz2 - nz1 gxx = grn(1,ndx,ndy) gyy = grn(2,ndx,ndy) gxy = grn(3,ndx,ndy) balance(n1,1) = balance(n1,1) + (force(n2,1) * gxx + force(n2,2) * gxy) * h1 balance(n1,2) = balance(n1,2) + (force(n2,1) * gxy + force(n2,2) * gyy)*h1 end do end do The programmer has written this loop well—there are no loop invariant operations with respect to n1 and n2. However, the loop resides within an iterative loop over time and the index calculations are independent with respect to time. Trading space for time, I precomputed the index values prior to the entering the time loop and stored the values in two arrays. I then replaced the index calculations with reads of the arrays. Data operations Ways to reduce data operations can appear in many forms. Implementing a more efficient algorithm produces the biggest gains. The closest I came to an algorithm change was in code 4. This code computes the inner product of K-vectors A(i) and B(j), 0 = i < N, 0 = j < M, for most values of i and j. Since the program computes most of the NM possible inner products, it is more efficient to compute all the inner products in one triply-nested loop rather than one at a time when needed. The savings accrue from reading A(i) once for all B(j) vectors and from loop unrolling. for (i = 0; i < N; i+=8) { for (j = 0; j < M; j++) { sum0 = 0.0; sum1 = 0.0; sum2 = 0.0; sum3 = 0.0; sum4 = 0.0; sum5 = 0.0; sum6 = 0.0; sum7 = 0.0; for (k = 0; k < K; k++) { sum0 += A[i+0][k] * B[j][k]; sum1 += A[i+1][k] * B[j][k]; sum2 += A[i+2][k] * B[j][k]; sum3 += A[i+3][k] * B[j][k]; sum4 += A[i+4][k] * B[j][k]; sum5 += A[i+5][k] * B[j][k]; sum6 += A[i+6][k] * B[j][k]; sum7 += A[i+7][k] * B[j][k]; } C[i+0][j] = sum0; C[i+1][j] = sum1; C[i+2][j] = sum2; C[i+3][j] = sum3; C[i+4][j] = sum4; C[i+5][j] = sum5; C[i+6][j] = sum6; C[i+7][j] = sum7; }} This change requires knowledge of a typical run; i.e., that most inner products are computed. The reasons for the change, however, derive from basic optimization concepts. It is the type of change easily made at development time by a knowledgeable programmer. In code 5, we have the data version of the index optimization in code 6. Here a very expensive computation is a function of the loop indices and so cannot be hoisted out of the loop; however, the computation is invariant with respect to an outer iterative loop over time. We can compute its value for each iteration of the computation loop prior to entering the time loop and save the values in an array. The increase in memory required to store the values is small in comparison to the large savings in time. The main loop in Code 8 is doubly nested. The inner loop includes a series of guarded computations; some are a function of the inner loop index but not the outer loop index while others are a function of the outer loop index but not the inner loop index for (j = 0; j < N; j++) { for (i = 0; i < M; i++) { r = i * hrmax; R = A[j]; temp = (PRM[3] == 0.0) ? 1.0 : pow(r, PRM[3]); high = temp * kcoeff * B[j] * PRM[2] * PRM[4]; low = high * PRM[6] * PRM[6] / (1.0 + pow(PRM[4] * PRM[6], 2.0)); kap = (R > PRM[6]) ? high * R * R / (1.0 + pow(PRM[4]*r, 2.0) : low * pow(R/PRM[6], PRM[5]); < rest of loop omitted > }} Note that the value of temp is invariant to j. Thus, we can hoist the computation for temp out of the loop and save its values in an array. for (i = 0; i < M; i++) { r = i * hrmax; TEMP[i] = pow(r, PRM[3]); } [N.B. – the case for PRM[3] = 0 is omitted and will be reintroduced later.] We now hoist out of the inner loop the computations invariant to i. Since the conditional guarding the value of kap is invariant to i, it behooves us to hoist the computation out of the inner loop, thereby executing the guard once rather than M times. The final version of the code is for (j = 0; j < N; j++) { R = rig[j] / 1000.; tmp1 = kcoeff * par[2] * beta[j] * par[4]; tmp2 = 1.0 + (par[4] * par[4] * par[6] * par[6]); tmp3 = 1.0 + (par[4] * par[4] * R * R); tmp4 = par[6] * par[6] / tmp2; tmp5 = R * R / tmp3; tmp6 = pow(R / par[6], par[5]); if ((par[3] == 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp5; } else if ((par[3] == 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * tmp4 * tmp6; } else if ((par[3] != 0.0) && (R > par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp5; } else if ((par[3] != 0.0) && (R <= par[6])) { for (i = 1; i <= imax1; i++) KAP[i] = tmp1 * TEMP[i] * tmp4 * tmp6; } for (i = 0; i < M; i++) { kap = KAP[i]; r = i * hrmax; < rest of loop omitted > } } Maybe not the prettiest piece of code, but certainly much more efficient than the original loop, Copy operations Several programs unnecessarily copy data from one data structure to another. This problem occurs in both Fortran and C programs, although it manifests itself differently in the two languages. Code 1 declares two arrays—one for old values and one for new values. At the end of each iteration, the array of new values is copied to the array of old values to reset the data structures for the next iteration. This problem occurs in Fortran programs not included in this study and in both Fortran 77 and Fortran 90 code. Introducing pointers to the arrays and swapping pointer values is an obvious way to eliminate the copying; but pointers is not a feature that many Fortran programmers know well or are comfortable using. An easy solution not involving pointers is to extend the dimension of the value array by 1 and use the last dimension to differentiate between arrays at different times. For example, if the data space is N x N, declare the array (N, N, 2). Then store the problem’s initial values in (_, _, 2) and define the scalar names new = 2 and old = 1. At the start of each iteration, swap old and new to reset the arrays. The old–new copy problem did not appear in any C program. In programs that had new and old values, the code swapped pointers to reset data structures. Where unnecessary coping did occur is in structure assignment and parameter passing. Structures in C are handled much like scalars. Assignment causes the data space of the right-hand name to be copied to the data space of the left-hand name. Similarly, when a structure is passed to a function, the data space of the actual parameter is copied to the data space of the formal parameter. If the structure is large and the assignment or function call is in an inner loop, then copying costs can grow quite large. While none of the ten programs considered here manifested this problem, it did occur in programs not included in the study. A simple fix is always to refer to structures via pointers. Optimizing loop structures Since scientific programs spend almost all their time in loops, efficient loops are the key to good performance. Conditionals, function calls, little instruction level parallelism, and large numbers of temporary values make it difficult for the compiler to generate tightly packed, highly efficient code. Conditionals and function calls introduce jumps that disrupt code flow. Users should eliminate or isolate conditionls to their own loops as much as possible. Often logical expressions can be substituted for if-then-else statements. For example, code 2 includes the following snippet MaxDelta = 0.0 do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) if (Delta > MaxDelta) MaxDelta = Delta enddo enddo if (MaxDelta .gt. 0.001) goto 200 Since the only use of MaxDelta is to control the jump to 200 and all that matters is whether or not it is greater than 0.001, I made MaxDelta a boolean and rewrote the snippet as MaxDelta = .false. do J = 1, N do I = 1, M < code omitted > Delta = abs(OldValue ? NewValue) MaxDelta = MaxDelta .or. (Delta .gt. 0.001) enddo enddo if (MaxDelta) goto 200 thereby, eliminating the conditional expression from the inner loop. A microprocessor can execute many instructions per instruction cycle. Typically, it can execute one or more memory, floating point, integer, and jump operations. To be executed simultaneously, the operations must be independent. Thick loops tend to have more instruction level parallelism than thin loops. Moreover, they reduce memory traffice by maximizing data reuse. Loop unrolling and loop fusion are two techniques to increase the size of loop bodies. Several of the codes studied benefitted from loop unrolling, but none benefitted from loop fusion. This observation is not too surpising since it is the general tendency of programmers to write thick loops. As loops become thicker, the number of temporary values grows, increasing register pressure. If registers spill, then memory traffic increases and code flow is disrupted. A thick loop with many temporary values may execute slower than an equivalent series of thin loops. The biggest gain will be achieved if the thick loop can be split into a series of independent loops eliminating the need to write and read temporary arrays. I found such an occasion in code 10 where I split the loop do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do into two disjoint loops do i = 1, n do j = 1, m A24(j,i)= S24(j,i) * T24(j,i) + S25(j,i) * U25(j,i) B24(j,i)= S24(j,i) * T25(j,i) + S25(j,i) * U24(j,i) A25(j,i)= S24(j,i) * C24(j,i) + S25(j,i) * V24(j,i) B25(j,i)= S24(j,i) * U25(j,i) + S25(j,i) * V25(j,i) end do end do do i = 1, n do j = 1, m C24(j,i)= S26(j,i) * T26(j,i) + S27(j,i) * U26(j,i) D24(j,i)= S26(j,i) * T27(j,i) + S27(j,i) * V26(j,i) C25(j,i)= S27(j,i) * S28(j,i) + S26(j,i) * U28(j,i) D25(j,i)= S27(j,i) * T28(j,i) + S26(j,i) * V28(j,i) end do end do Conclusions Over the course of the last year, I have had the opportunity to work with over two dozen academic scientific programmers at leading research universities. Their research interests span a broad range of scientific fields. Except for two programs that relied almost exclusively on library routines (matrix multiply and fast Fourier transform), I was able to improve significantly the single processor performance of all codes. Improvements range from 2x to 15.5x with a simple average of 4.75x. Changes to the source code were at a very high level. I did not use sophisticated techniques or programming tools to discover inefficiencies or effect the changes. Only one code was parallel despite the availability of parallel systems to all developers. Clearly, we have a problem—personal scientific research codes are highly inefficient and not running parallel. The developers are unaware of simple optimization techniques to make programs run faster. They lack education in the art of code optimization and parallel programming. I do not believe we can fix the problem by publishing additional books or training manuals. To date, the developers in questions have not studied the books or manual available, and are unlikely to do so in the future. Short courses are a possible solution, but I believe they are too concentrated to be much use. The general concepts can be taught in a three or four day course, but that is not enough time for students to practice what they learn and acquire the experience to apply and extend the concepts to their codes. Practice is the key to becoming proficient at optimization. I recommend that graduate students be required to take a semester length course in optimization and parallel programming. We would never give someone access to state-of-the-art scientific equipment costing hundreds of thousands of dollars without first requiring them to demonstrate that they know how to use the equipment. Yet the criterion for time on state-of-the-art supercomputers is at most an interesting project. Requestors are never asked to demonstrate that they know how to use the system, or can use the system effectively. A semester course would teach them the required skills. Government agencies that fund academic scientific research pay for most of the computer systems supporting scientific research as well as the development of most personal scientific codes. These agencies should require graduate schools to offer a course in optimization and parallel programming as a requirement for funding. About the Author John Feo received his Ph.D. in Computer Science from The University of Texas at Austin in 1986. After graduate school, Dr. Feo worked at Lawrence Livermore National Laboratory where he was the Group Leader of the Computer Research Group and principal investigator of the Sisal Language Project. In 1997, Dr. Feo joined Tera Computer Company where he was project manager for the MTA, and oversaw the programming and evaluation of the MTA at the San Diego Supercomputer Center. In 2000, Dr. Feo joined Sun Microsystems as an HPC application specialist. He works with university research groups to optimize and parallelize scientific codes. Dr. Feo has published over two dozen research articles in the areas of parallel parallel programming, parallel programming languages, and application performance.

Search Results

Search found 454 results on 19 pages for 'optimizations'.

Page 13/19 | < Previous Page | 9 10 11 12 13 14 15 16 17 18 19 | Next Page >

- by Justin R.

- by Johan Leino

- by nlight

- by Traveling Tech Guy

- by martinus

- by Fernando Miguélez

- by Hanno Fietz

- by Ed Mazur

- by Johan Leino

- by mohawkjohn

- by electblake

- by Phobis

- by Wayne

- by Dave Jarvis

- by Michael Joell

- by Paperflyer

- by Pirate for Profit

- by user15872

- by anony

- by Austin Henley

- by user336749

- by DeadMG

- by tkm

- by Bruce Eitman

- by rchrd

< Previous Page | 9 10 11 12 13 14 15 16 17 18 19 | Next Page >