Search Results

Search found 1962 results on 79 pages for 'slightly offtopic'.

Page 71/79 | < Previous Page | 67 68 69 70 71 72 73 74 75 76 77 78 | Next Page >

Fun with Aggregates

- by Paul White

There are interesting things to be learned from even the simplest queries. For example, imagine you are given the task of writing a query to list AdventureWorks product names where the product has at least one entry in the transaction history table, but fewer than ten. One possible query to meet that specification is: SELECT p.Name FROM Production.Product AS p JOIN Production.TransactionHistory AS th ON p.ProductID = th.ProductID GROUP BY p.ProductID, p.Name HAVING COUNT_BIG(*) < 10; That query correctly returns 23 rows (execution plan and data sample shown below): The execution plan looks a bit different from the written form of the query: the base tables are accessed in reverse order, and the aggregation is performed before the join. The general idea is to read all rows from the history table, compute the count of rows grouped by ProductID, merge join the results to the Product table on ProductID, and finally filter to only return rows where the count is less than ten. This ‘fully-optimized’ plan has an estimated cost of around 0.33 units. The reason for the quote marks there is that this plan is not quite as optimal as it could be – surely it would make sense to push the Filter down past the join too? To answer that, let’s look at some other ways to formulate this query. This being SQL, there are any number of ways to write logically-equivalent query specifications, so we’ll just look at a couple of interesting ones. The first query is an attempt to reverse-engineer T-SQL from the optimized query plan shown above. It joins the result of pre-aggregating the history table to the Product table before filtering: SELECT p.Name FROM ( SELECT th.ProductID, cnt = COUNT_BIG(*) FROM Production.TransactionHistory AS th GROUP BY th.ProductID ) AS q1 JOIN Production.Product AS p ON p.ProductID = q1.ProductID WHERE q1.cnt < 10; Perhaps a little surprisingly, we get a slightly different execution plan: The results are the same (23 rows) but this time the Filter is pushed below the join! The optimizer chooses nested loops for the join, because the cardinality estimate for rows passing the Filter is a bit low (estimate 1 versus 23 actual), though you can force a merge join with a hint and the Filter still appears below the join. In yet another variation, the < 10 predicate can be ‘manually pushed’ by specifying it in a HAVING clause in the “q1” sub-query instead of in the WHERE clause as written above. The reason this predicate can be pushed past the join in this query form, but not in the original formulation is simply an optimizer limitation – it does make efforts (primarily during the simplification phase) to encourage logically-equivalent query specifications to produce the same execution plan, but the implementation is not completely comprehensive. Moving on to a second example, the following query specification results from phrasing the requirement as “list the products where there exists fewer than ten correlated rows in the history table”: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID HAVING COUNT_BIG(*) < 10 ); Unfortunately, this query produces an incorrect result (86 rows): The problem is that it lists products with no history rows, though the reasons are interesting. The COUNT_BIG(*) in the EXISTS clause is a scalar aggregate (meaning there is no GROUP BY clause) and scalar aggregates always produce a value, even when the input is an empty set. In the case of the COUNT aggregate, the result of aggregating the empty set is zero (the other standard aggregates produce a NULL). To make the point really clear, let’s look at product 709, which happens to be one for which no history rows exist: -- Scalar aggregate SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = 709; -- Vector aggregate SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = 709 GROUP BY th.ProductID; The estimated execution plans for these two statements are almost identical: You might expect the Stream Aggregate to have a Group By for the second statement, but this is not the case. The query includes an equality comparison to a constant value (709), so all qualified rows are guaranteed to have the same value for ProductID and the Group By is optimized away. In fact there are some minor differences between the two plans (the first is auto-parameterized and qualifies for trivial plan, whereas the second is not auto-parameterized and requires cost-based optimization), but there is nothing to indicate that one is a scalar aggregate and the other is a vector aggregate. This is something I would like to see exposed in show plan so I suggested it on Connect. Anyway, the results of running the two queries show the difference at runtime: The scalar aggregate (no GROUP BY) returns a result of zero, whereas the vector aggregate (with a GROUP BY clause) returns nothing at all. Returning to our EXISTS query, we could ‘fix’ it by changing the HAVING clause to reject rows where the scalar aggregate returns zero: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID HAVING COUNT_BIG(*) BETWEEN 1 AND 9 ); The query now returns the correct 23 rows: Unfortunately, the execution plan is less efficient now – it has an estimated cost of 0.78 compared to 0.33 for the earlier plans. Let’s try adding a redundant GROUP BY instead of changing the HAVING clause: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY th.ProductID HAVING COUNT_BIG(*) < 10 ); Not only do we now get correct results (23 rows), this is the execution plan: I like to compare that plan to quantum physics: if you don’t find it shocking, you haven’t understood it properly :) The simple addition of a redundant GROUP BY has resulted in the EXISTS form of the query being transformed into exactly the same optimal plan we found earlier. What’s more, in SQL Server 2008 and later, we can replace the odd-looking GROUP BY with an explicit GROUP BY on the empty set: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () HAVING COUNT_BIG(*) < 10 ); I offer that as an alternative because some people find it more intuitive (and it perhaps has more geek value too). Whichever way you prefer, it’s rather satisfying to note that the result of the sub-query does not exist for a particular correlated value where a vector aggregate is used (the scalar COUNT aggregate always returns a value, even if zero, so it always ‘EXISTS’ regardless which ProductID is logically being evaluated). The following query forms also produce the optimal plan and correct results, so long as a vector aggregate is used (you can probably find more equivalent query forms): WHERE Clause SELECT p.Name FROM Production.Product AS p WHERE ( SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () ) < 10; APPLY SELECT p.Name FROM Production.Product AS p CROSS APPLY ( SELECT NULL FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () HAVING COUNT_BIG(*) < 10 ) AS ca (dummy); FROM Clause SELECT q1.Name FROM ( SELECT p.Name, cnt = ( SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () ) FROM Production.Product AS p ) AS q1 WHERE q1.cnt < 10; This last example uses SUM(1) instead of COUNT and does not require a vector aggregate…you should be able to work out why :) SELECT q.Name FROM ( SELECT p.Name, cnt = ( SELECT SUM(1) FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID ) FROM Production.Product AS p ) AS q WHERE q.cnt < 10; The semantics of SQL aggregates are rather odd in places. It definitely pays to get to know the rules, and to be careful to check whether your queries are using scalar or vector aggregates. As we have seen, query plans do not show in which ‘mode’ an aggregate is running and getting it wrong can cause poor performance, wrong results, or both. © 2012 Paul White Twitter: @SQL_Kiwi email: [email protected]

Read the article
Accessing SharePoint 2010 Data with REST/OData on Windows Phone 7

- by Jan Tielens

Consuming SharePoint 2010 data in Windows Phone 7 applications using the CTP version of the developer tools is quite a challenge. The issue is that the SharePoint 2010 data is not anonymously available; users need to authenticate to be able to access the data. When I first tried to access SharePoint 2010 data from my first Hello-World-type Windows Phone 7 application I thought “Hey, this should be easy!” because Windows Phone 7 development based on Silverlight and SharePoint 2010 has a Client Object Model for Silverlight. Unfortunately you can’t use the Client Object Model of SharePoint 2010 on the Windows Phone platform; there’s a reference to an assembly that’s not available (System.Windows.Browser). My second thought was “OK, no problem!” because SharePoint 2010 also exposes a REST/OData API to access SharePoint data. Using the REST API in SharePoint 2010 is as easy as making a web request for a URL (in which you specify the data you’d like to retrieve), e.g. http://yoursiteurl/_vti_bin/listdata.svc/Announcements. This is very easy to accomplish in a Silverlight application that’s running in the context of a page in a SharePoint site, because the credentials of the currently logged on user are automatically picked up and passed to the WCF service. But a Windows Phone application is of course running outside of the SharePoint site’s page, so the application should build credentials that have to be passed to SharePoint’s WCF service. This turns out to be a small challenge in Silverlight 3, the WebClient doesn’t support authentication; there is a Credentials property but when you set it and make the request you get a NotImplementedException exception. Probably this issued will be solved in the very near future, since Silverlight 4 does support authentication, and there’s already a WCF Data Services download that uses this new platform feature of Silverlight 4. So when Windows Phone platform switches to Silverlight 4, you can just use the WebClient to get the data. Even more, if the OData Client Library for Windows Phone 7 gets updated after that, things should get even easier! By the way: the things I’m writing in this paragraph are just assumptions that I make which make a lot of sense IMHO, I don’t have any info all of this will happen, but I really hope so. So are SharePoint developers out of the Windows Phone development game until they get this fixed? Well luckily not, when the HttpWebRequest class is being used instead, you can pass credentials! Using the HttpWebRequest class is slightly more complex than using the WebClient class, but the end result is that you have access to your precious SharePoint 2010 data. The following code snippet is getting all the announcements of an Annoucements list in a SharePoint site: HttpWebRequest webReq = (HttpWebRequest)HttpWebRequest.Create("http://yoursite/_vti_bin/listdata.svc/Announcements");webReq.Credentials = new NetworkCredential("username", "password"); webReq.BeginGetResponse( (result) => { HttpWebRequest asyncReq = (HttpWebRequest)result.AsyncState; XDocument xdoc = XDocument.Load( ((HttpWebResponse)asyncReq.EndGetResponse(result)).GetResponseStream()); XNamespace ns = "http://www.w3.org/2005/Atom"; var items = from item in xdoc.Root.Elements(ns + "entry") select new { Title = item.Element(ns + "title").Value }; this.Dispatcher.BeginInvoke(() => { foreach (var item in items) MessageBox.Show(item.Title); }); }, webReq); When you try this in a Windows Phone 7 application, make sure you add a reference to the System.Xml.Linq assembly, because the code uses Linq to XML to parse the resulting Atom feed, so the Title of every announcement is being displayed in a MessageBox. Check out my previous post if you’d like to see a more polished sample Windows Phone 7 application that displays SharePoint 2010 data.When you plan to use this technique, it’s of course a good idea to encapsulate the code doing the request, so it becomes really easy to get the data that you need. In the following code snippet you can find the GetAtomFeed method that gets the contents of any Atom feed, even if you need to authenticate to get access to the feed. delegate void GetAtomFeedCallback(Stream responseStream); public MainPage(){ InitializeComponent(); SupportedOrientations = SupportedPageOrientation.Portrait | SupportedPageOrientation.Landscape; string url = "http://yoursite/_vti_bin/listdata.svc/Announcements"; string username = "username"; string password = "password"; string domain = ""; GetAtomFeed(url, username, password, domain, (s) => { XNamespace ns = "http://www.w3.org/2005/Atom"; XDocument xdoc = XDocument.Load(s); var items = from item in xdoc.Root.Elements(ns + "entry") select new { Title = item.Element(ns + "title").Value }; this.Dispatcher.BeginInvoke(() => { foreach (var item in items) { MessageBox.Show(item.Title); } }); });} private static void GetAtomFeed(string url, string username, string password, string domain, GetAtomFeedCallback cb){ HttpWebRequest webReq = (HttpWebRequest)HttpWebRequest.Create(url); webReq.Credentials = new NetworkCredential(username, password, domain); webReq.BeginGetResponse( (result) => { HttpWebRequest asyncReq = (HttpWebRequest)result.AsyncState; HttpWebResponse resp = (HttpWebResponse)asyncReq.EndGetResponse(result); cb(resp.GetResponseStream()); }, webReq);}

Read the article
techniques for an AI for a highly cramped turn-based tactics game

- by Adam M.

I'm trying to write an AI for a tactics game in the vein of Final Fantasy Tactics or Vandal Hearts. I can't change the game rules in any way, only upgrade the AI. I have experience programming AI for classic board games (basically minimax and its variants), but I think the branching factor is too great for the approach to be reasonable here. I'll describe the game and some current AI flaws that I'd like to fix. I'd like to hear ideas for applicable techniques. I'm a decent enough programmer, so I only need the ideas, not an implementation (though that's always appreciated). I'd rather not expend effort chasing (too many) dead ends, so although speculation and brainstorming are good and probably helpful, I'd prefer to hear from somebody with actual experience solving this kind of problem. For those who know it, the game is the land battle mini-game in Sid Meier's Pirates! (2004) and you can skim/skip the next two paragraphs. For those who don't, here's briefly how it works. The battle is turn-based and takes place on a 16x16 grid. There are three terrain types: clear (no hindrance), forest (hinders movement, ranged attacks, and sight), and rock (impassible, but does not hinder attacks or sight). The map is randomly generated with roughly equal amounts of each type of terrain. Because there are many rock and forest tiles, movement is typically very cramped. This is tactically important. The terrain is not flat; higher terrain gives minor bonuses. The terrain is known to both sides. The player is always the attacker and the AI is always the defender, so it's perfectly valid for the AI to set up a defensive position and just wait. The player wins by killing all defenders or by getting a unit to the city gates (a tile on the other side of the map). There are very few units on each side, usually 4-8. Because of this, it's crucial not to take damage without gaining some advantage from it. Units can take multiple actions per turn. All units on one side move before any units on the other side. Order of execution is important, and interleaving of actions between units is often useful. Units have melee and ranged attacks. Melee attacks vary widely in strength; ranged attacks have the same strength but vary in range. The main challenges I face are these: Lots of useful move combinations start with a "useless" move that gains no immediate advantage, or even loses advantage, in order to set up a powerful flank attack in the future. And, since the player units are stronger and have longer range, the AI pretty much always has to take some losses before they can start to gain kills. The AI must be able to look ahead to distinguish between sacrificial actions that provide a future benefit and those that don't. Because the terrain is so cramped, most of the tactics come down to achieving good positioning with multiple units that work together to defend an area. For instance, two defenders can often dominate a narrow pass by positioning themselves so an enemy unit attempting to pass must expose itself to a flank attack. But one defender in the same pass would be useless, and three units can defend a slightly larger pass. Etc. The AI should be able to figure out where the player must go to reach the city gates and how to best position its few units to cover the approaches, shifting, splitting, or combining them appropriately as the player moves. Because flank attacks are extremely deadly (and engineering flank attacks is key to the player strategy), the AI should be competent at moving its units so that they cover each other's flanks unless the sacrifice of a unit would give a substantial benefit. They should also be able to force flank attacks on players, for instance by threatening a unit from two different directions such that responding to one threat exposes the flank to the other. The AI should attack if possible, but sometimes there are no good ways to approach the player's position. In that case, the AI should be able to recognize this and set up a defensive position of its own. But the AI shouldn't be vulnerable to a trivial exploit where the player repeatedly opens and closes a hole in his defense and shoots at the AI as it approaches and retreats. That is, the AI should ideally be able to recognize that the player is capable of establishing a solid defense of an area, even if the defense is not currently in place. (I suppose if a good unit allocation algorithm existed, as needed for the second bullet point, the AI could run it on the player units to see where they could defend.) Because it's important to choose a good order of action and interleave actions between units, it's not as simple as just finding the best move for each unit in turn. All of these can be accomplished with a minimax search in theory, but the search space is too large, so specialized techniques are needed. I thought about techniques such as influence mapping, but I don't see how to use the technique to great effect. I thought about assigning goals to the units. This can help them work together in some limited way, and the problem of "how do I accomplish this goal?" is easier to solve than "how do I win this battle?", but assigning good goals is a hard problem in itself, because it requires knowing whether the goal is achievable and whether it's a good use of resources. So, does anyone have specific ideas for techniques that can help cleverize this AI? Update: I found a related question on Stackoverflow: http://stackoverflow.com/questions/3133273/ai-for-a-final-fantasy-tactics-like-game The selected answer gives a decent approach to choosing between alternative actions, but it doesn't seem to have much ability to look into the future and discern beneficial sacrifices from wasteful ones. It also focuses on a single unit at a time and it's not clear how it could be extended to support cooperation between units in defending or attacking.

Read the article
Charms and the App Bar

- by Dennis Vroegop

Ok. I admit. I made a mistake in the last post about our planespotter app. I have dedicated a full part of the hub to Social. I also had a section called Friends but that made sense since I said that “Friends” is a special group of people that connect to each other through our app and only our app. Social however is sharing our spots with Twitter, Facebook and so on. Now, we could write that functionality in our app in a different section but there is one small problem with that: users don’t expect that. Ok, I admit. The mistake was quite deliberate to give me an excuse to write this part. But still: the mistake is one I see a lot. People are trying to do stuff in their application that they shouldn’t be doing. This always strike me as slightly odd: why do some work when others have already done it for you and you can just use it? After all: good developers are lazy (lazy people will always try to find the easiest way to do something and in development land this usually means the cleanest and best to support way…) So. What is that part that Microsoft has done for us and we don’t have to do ourselves? The answer lies on the right hand of your Win8 screen: This is a screenshot of my tablet (as you can see I am writing this right now….) When I swipe my finger from out of the screen on the right inside the screen (or move the mouse to the upper right corner) this menu will appear. Next to settings and the start menu button we’ll find the Search and the Share charms. These are two ways that your app can share the information it contains with the rest of the world, or at least: the rest of your system. So don’t write a Search feature in your app. Don’t write a Share feature in your app. It’s here already. Users, once they are used to Windows 8, will use that feature and expect it to work. If it doesn’t, they won’t like your app and you can kiss you dreams of everlasting fame goodbye. So use these two. What are they? Well, simply they are parts of a contract. In your app you say somewhere in code that you are supporting Search and Share. So when the user selects Share the system will interrogate the current app in the foreground if it supports this feature. Your app will say “But why, yes, I do!” Then the system will ask the app “Ok then, wisecrack, then share!” and you will have to provide the system with some information about the format. Other applications have subscribed to be at the receiving end of the Share contract. They have told the system that they support Sharing (receiving) and which formats they understand. If one or more of them support the formats you specify, the user will see them. The user clicks / taps on the app of their choice and data is moved from your app to the new one. So if you say you support Facebook and Twitter users can post data from your app to these networks by selecting Share. The same applies to Search. Don’t make a “search” button in your app but use the contract to tell the system that you support search and use that instead. Users will be grateful (remember that bar with men/women/creatures that are waiting for you?) The more and more people get to know Windows 8, the more they will use this. And if you are one of the people who wrote an app that helped them learn the system, well, that’s even better. So. We don’t have a Share or a Search button. We do have other buttons. Most important: we probably need a “New Spot” button. And a “Filter” might be useful. Or someway to open the camera so you can add a picture to the spot. Where will be put those? The answer is the “Appbar” . This is a application / context aware menu that slides up from the bottom of the screen when you move your finger / mouse from below the screen into it. From above downwards works just as well. Here you see an example of the appbar from the People app. (click on it for a larger version). This appears whenever you slide your finger up from below of down from above. This is where you put your commands. Remember, this is context aware so this menu will change when you are in different parts of your app or when you have selected different items. There are a few conventions when you create this appbar. First, the items on the right are “General” items, meaning they have little to do with what is on the screen right now. I think this would be a great place to add our “New Spot” icon. On the far left are items associated with the current selected item or screen. So if you have a spot selected, the button for Add Photo should be visible here and on the left hand side. Not everything is as clear as this, but this is what you should strive for. Group items together. And please note: this is the only place in Metro design where we are allowed to use lines as separators. So when you want to separate a group of icons from another group, add a line. Also note the simplicity of the buttons. No colors, no lights or shadows, no 3D. After a couple of years of fancy almost realistic looking icons people have finally decided that hey, this is a virtual world: it’s ok to look virtual as well. So make things as readable and clear as possible and don’t try to duplicate nature. It’s all about the information, remember? (If you don’t remember I’d like to point you to a older blog post of mine about the what and why of Metro). So.. think about the buttons a bit and think about Share and Search. What will you put there? Remember: this is the way the users interact with your apps and while you shouldn’t judge a book by its covers when it comes to people, this isn’t entirely so when it comes to apps. People DO judge an app by its looks and the way it feels. Take advantage of that. History has learned that a crappy app with a GREAT user interface gets better reviews than a GREAT app with a lousy UI… I know: developers will find this extremely unfair but that’s the world we live in (No, I am not saying you should deliver rubbish apps). Next time: we’ll start by building the darn thing!

Read the article
SortedDictionary and SortedList

- by Simon Cooper

Apart from Dictionary<TKey, TValue>, there's two other dictionaries in the BCL - SortedDictionary<TKey, TValue> and SortedList<TKey, TValue>. On the face of it, these two classes do the same thing - provide an IDictionary<TKey, TValue> interface where the iterator returns the items sorted by the key. So what's the difference between them, and when should you use one rather than the other? (as in my previous post, I'll assume you have some basic algorithm & datastructure knowledge) SortedDictionary We'll first cover SortedDictionary. This is implemented as a special sort of binary tree called a red-black tree. Essentially, it's a binary tree that uses various constraints on how the nodes of the tree can be arranged to ensure the tree is always roughly balanced (for more gory algorithmical details, see the wikipedia link above). What I'm concerned about in this post is how the .NET SortedDictionary is actually implemented. In .NET 4, behind the scenes, the actual implementation of the tree is delegated to a SortedSet<KeyValuePair<TKey, TValue>>. One example tree might look like this: Each node in the above tree is stored as a separate SortedSet<T>.Node object (remember, in a SortedDictionary, T is instantiated to KeyValuePair<TKey, TValue>): class Node { public bool IsRed; public T Item; public SortedSet<T>.Node Left; public SortedSet<T>.Node Right; } The SortedSet only stores a reference to the root node; all the data in the tree is accessed by traversing the Left and Right node references until you reach the node you're looking for. Each individual node can be physically stored anywhere in memory; what's important is the relationship between the nodes. This is also why there is no constructor to SortedDictionary or SortedSet that takes an integer representing the capacity; there are no internal arrays that need to be created and resized. This may seen trivial, but it's an important distinction between SortedDictionary and SortedList that I'll cover later on. And that's pretty much it; it's a standard red-black tree. Plenty of webpages and datastructure books cover the algorithms behind the tree itself far better than I could. What's interesting is the comparions between SortedDictionary and SortedList, which I'll cover at the end. As a side point, SortedDictionary has existed in the BCL ever since .NET 2. That means that, all through .NET 2, 3, and 3.5, there has been a bona-fide sorted set class in the BCL (called TreeSet). However, it was internal, so it couldn't be used outside System.dll. Only in .NET 4 was this class exposed as SortedSet. SortedList Whereas SortedDictionary didn't use any backing arrays, SortedList does. It is implemented just as the name suggests; two arrays, one containing the keys, and one the values (I've just used random letters for the values): The items in the keys array are always guarenteed to be stored in sorted order, and the value corresponding to each key is stored in the same index as the key in the values array. In this example, the value for key item 5 is 'z', and for key item 8 is 'm'. Whenever an item is inserted or removed from the SortedList, a binary search is run on the keys array to find the correct index, then all the items in the arrays are shifted to accomodate the new or removed item. For example, if the key 3 was removed, a binary search would be run to find the array index the item was at, then everything above that index would be moved down by one: and then if the key/value pair {7, 'f'} was added, a binary search would be run on the keys to find the index to insert the new item, and everything above that index would be moved up to accomodate the new item: If another item was then added, both arrays would be resized (to a length of 10) before the new item was added to the arrays. As you can see, any insertions or removals in the middle of the list require a proportion of the array contents to be moved; an O(n) operation. However, if the insertion or removal is at the end of the array (ie the largest key), then it's only O(log n); the cost of the binary search to determine it does actually need to be added to the end (excluding the occasional O(n) cost of resizing the arrays to fit more items). As a side effect of using backing arrays, SortedList offers IList Keys and Values views that simply use the backing keys or values arrays, as well as various methods utilising the array index of stored items, which SortedDictionary does not (and cannot) offer. The Comparison So, when should you use one and not the other? Well, here's the important differences: Memory usage SortedDictionary and SortedList have got very different memory profiles. SortedDictionary... has a memory overhead of one object instance, a bool, and two references per item. On 64-bit systems, this adds up to ~40 bytes, not including the stored item and the reference to it from the Node object. stores the items in separate objects that can be spread all over the heap. This helps to keep memory fragmentation low, as the individual node objects can be allocated wherever there's a spare 60 bytes. In contrast, SortedList... has no additional overhead per item (only the reference to it in the array entries), however the backing arrays can be significantly larger than you need; every time the arrays are resized they double in size. That means that if you add 513 items to a SortedList, the backing arrays will each have a length of 1024. To conteract this, the TrimExcess method resizes the arrays back down to the actual size needed, or you can simply assign list.Capacity = list.Count. stores its items in a continuous block in memory. If the list stores thousands of items, this can cause significant problems with Large Object Heap memory fragmentation as the array resizes, which SortedDictionary doesn't have. Performance Operations on a SortedDictionary always have O(log n) performance, regardless of where in the collection you're adding or removing items. In contrast, SortedList has O(n) performance when you're altering the middle of the collection. If you're adding or removing from the end (ie the largest item), then performance is O(log n), same as SortedDictionary (in practice, it will likely be slightly faster, due to the array items all being in the same area in memory, also called locality of reference). So, when should you use one and not the other? As always with these sort of things, there are no hard-and-fast rules. But generally, if you: need to access items using their index within the collection are populating the dictionary all at once from sorted data aren't adding or removing keys once it's populated then use a SortedList. But if you: don't know how many items are going to be in the dictionary are populating the dictionary from random, unsorted data are adding & removing items randomly then use a SortedDictionary. The default (again, there's no definite rules on these sort of things!) should be to use SortedDictionary, unless there's a good reason to use SortedList, due to the bad performance of SortedList when altering the middle of the collection.

Read the article
SQL SERVER – Weekly Series – Memory Lane – #050

- by Pinal Dave

Here is the list of selected articles of SQLAuthority.com across all these years. Instead of just listing all the articles I have selected a few of my most favorite articles and have listed them here with additional notes below it. Let me know which one of the following is your favorite article from memory lane. 2007 Executing Remote Stored Procedure – Calling Stored Procedure on Linked Server In this example we see two different methods of how to call Stored Procedures remotely. Connection Property of SQL Server Management Studio SSMS A very simple example of the how to build connection properties for SQL Server with the help of SSMS. Sample Example of RANKING Functions – ROW_NUMBER, RANK, DENSE_RANK, NTILE SQL Server has a total of 4 ranking functions. Ranking functions return a ranking value for each row in a partition. All the ranking functions are non-deterministic. T-SQL Script to Add Clustered Primary Key Jr. DBA asked me three times in a day, how to create Clustered Primary Key. I gave him following sample example. That was the last time he asked “How to create Clustered Primary Key to table?” 2008 2008 – TRIM() Function – User Defined Function SQL Server does not have functions which can trim leading or trailing spaces of any string at the same time. SQL does have LTRIM() and RTRIM() which can trim leading and trailing spaces respectively. SQL Server 2008 also does not have TRIM() function. User can easily use LTRIM() and RTRIM() together and simulate TRIM() functionality. http://www.youtube.com/watch?v=1-hhApy6MHM 2009 Earlier I have written two different articles on the subject Remove Bookmark Lookup. This article is as part 3 of original article. Please read the first two articles here before continuing reading this article. Query Optimization – Remove Bookmark Lookup – Remove RID Lookup – Remove Key Lookup Query Optimization – Remove Bookmark Lookup – Remove RID Lookup – Remove Key Lookup – Part 2 Query Optimization – Remove Bookmark Lookup – Remove RID Lookup – Remove Key Lookup – Part 3 Interesting Observation – Query Hint – FORCE ORDER SQL Server never stops to amaze me. As regular readers of this blog already know that besides conducting corporate training, I work on large-scale projects on query optimizations and server tuning projects. In one of the recent projects, I have noticed that a Junior Database Developer used the query hint Force Order; when I asked for details, I found out that the basic concept was not properly understood by him. Queries Waiting for Memory Allocation to Execute In one of the recent projects, I was asked to create a report of queries that are waiting for memory allocation. The reason was that we were doubtful regarding whether the memory was sufficient for the application. The following query can be useful in similar cases. Queries that do not have to wait on a memory grant will not appear in the result set of following query. 2010 Quickest Way to Identify Blocking Query and Resolution – Dirty Solution As the title suggests, this is quite a dirty solution; it’s not as elegant as you expect. However, it works totally fine. Simple Explanation of Data Type Precedence While I was working on creating a question for SQL SERVER – SQL Quiz – The View, The Table and The Clustered Index Confusion, I had actually created yet another question along with this question. However, I felt that the one which is posted on the SQL Quiz is much better than this one because what makes that more challenging question is that it has a multiple answer. Encrypted Stored Procedure and Activity Monitor I recently had received questionable if any stored procedure is encrypted can we see its definition in Activity Monitor.Answer is - No. Let us do a quick test. Let us create following Stored Procedure and then launch the Activity Monitor and check the text. Indexed View always Use Index on Table A single table can have maximum 249 non clustered indexes and 1 clustered index. In SQL Server 2008, a single table can have maximum 999 non clustered indexes and 1 clustered index. It is widely believed that a table can have only 1 clustered index, and this belief is true. I have some questions for all of you. Let us assume that I am creating view from the table itself and then create a clustered index on it. In my view, I am selecting the complete table itself. 2011 Detecting Database Case Sensitive Property using fn_helpcollations() I received a question on how to determine the case sensitivity of the database. The quick answer to this is to identify the collation of the database and check the properties of the collation. I have previously written how one can identify database collation. Once you have figured out the collation of the database, you can put that in the WHERE condition of the following T-SQL and then check the case sensitivity from the description. Server Side Paging in SQL Server CE (Compact Edition) SQL Server Denali is coming up with new T-SQL of Paging. I have written about the same earlier.SQL SERVER – Server Side Paging in SQL Server Denali – A Better Alternative, SQL SERVER – Server Side Paging in SQL Server Denali Performance Comparison, SQL SERVER – Server Side Paging in SQL Server Denali – Part2 What is very interesting is that SQL Server CE 4.0 have the same feature introduced. Here is the quick example of the same. To run the script in the example, you will have to do installWebmatrix 4.0 and download sample database. Once done you can run following script. Why I am Going to Attend PASS Summit Unite 2011 The four-day event will be marked by a lot of learning, sharing, and networking, which will help me increase both my knowledge and contacts. Every year, PASS Summit provides me a golden opportunity to build my network as well as to identify and meet potential customers or employees. 2012 Manage Help Settings – CTRL + ALT + F1 This is very interesting read as my daughter once accidently came across a screen in SQL Server Management Studio. It took me 2-3 minutes to figure out how she has created the same screen. Recover the Accidentally Renamed Table “I accidentally renamed table in my SSMS. I was scrolling very fast and I made mistakes. It was either because I double clicked or clicked on F2 (shortcut key for renaming). However, I have made the mistake and now I have no idea how to fix this. If you have renamed the table, I think you pretty much is out of luck. Here are few things which you can do which can give you an idea about what your table name can be if you are lucky. Identify Numbers of Non Clustered Index on Tables for Entire Database Here is the script which will give you numbers of non clustered indexes on any table in entire database. Identify Most Resource Intensive Queries – SQL in Sixty Seconds #029 – Video Here is the complete complete script which I have used in the SQL in Sixty Seconds Video. Thanks Harsh for important Tip in the comment. http://www.youtube.com/watch?v=3kDHC_Tjrns Advanced Data Quality Services with Melissa Data – Azure Data Market For the purposes of the review, I used a database I had in an Excel spreadsheet with name and address information. Upon a cursory inspection, there are miscellaneous problems with these records; some addresses are missing ZIP codes, others missing a city, and some records are slightly misspelled or have unparsed suites. With DQS, I can easily add a knowledge base to help standardize my values, such as for state abbreviations. But how do I know that my address is correct? Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Memory Lane, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL, Technology

Read the article
Microsoft Business Intelligence Seminar 2011

- by DavidWimbush

I was lucky enough to attend the maiden presentation of this at Microsoft Reading yesterday. It was pretty gripping stuff not only because of what was said but also because of what could only be hinted at. Here's what I took away from the day. (Disclaimer: I'm not a BI guru, just a reasonably experienced BI developer, so I may have misunderstood or misinterpreted a few things. Particularly when so much of the talk was about the vision and subtle hints of what is coming. Please comment if you think I've got anything wrong. I'm also not going to even try to cover Master Data Services as I struggled to imagine how you would actually use it.) I was a bit worried when I learned that the whole day was going to be presented by one guy but Rafal Lukawiecki is a very engaging speaker. He's going to be presenting this about 20 times around the world over the coming months. If you get a chance to hear him speak, I say go for it. No doubt some of the hints will become clearer as Denali gets closer to RTM. Firstly, things are definitely happening in the SQL Server Reporting and BI world. Traditionally IT would build a data warehouse, then cubes on top of that, and then publish them in a structured and controlled way. But, just as with many IT projects in general, by the time it's finished the business has moved on and the system no longer meets their requirements. This not sustainable and something more agile is needed but there has to be some control. Apparently we're going to be hearing the catchphrase 'Balancing agility with control' a lot. More users want more access to more data. Can they define what they want? Of course not, but they'll recognise it when they see it. It's estimated that only 28% of potential BI users have meaningful access to the data they need, so there is a real pent-up demand. The answer looks like: give them some self-service tools so they can experiment and see what works, and then IT can help to support the results. It's estimated that 32% of Excel users are comfortable with its analysis tools such as pivot tables. It's the power user's preferred tool. Why fight it? That's why PowerPivot is an Excel add-in and that's why they released a Data Mining add-in for it as well. It does appear that the strategy is going to be to use Reporting Services (in SharePoint mode), PowerPivot, and possibly something new (smiles and hints but no details) to create reports and explore data. Everything will be published and managed in SharePoint which gives users the ability to mash-up, share and socialise what they've found out. SharePoint also gives IT tools to understand what people are looking at and where to concentrate effort. If PowerPivot report X becomes widely used, it's time to check that it shows what they think it does and perhaps get it a bit more under central control. There was more SharePoint detail that went slightly over my head regarding where Excel Services and Excel Web Application fit in, the differences between them, and the suggestion that it is likely they will one day become one (but not in the immediate future). That basic pattern is set to be expanded upon by further exploiting Vertipaq (the columnar indexing engine that enables PowerPivot to store and process a lot of data fast and in a small memory footprint) to provide scalability 'from the desktop to the data centre', and some yet to be detailed advances in 'frictionless deployment' (part of which is about making the difference between local and the cloud pretty much irrelevant). Excel looks like becoming Microsoft's primary BI client. It already has: the ability to consume cubes strong visualisation tools slicers (which are part of Excel not PowerPivot) a data mining add-in PowerPivot A major hurdle for self-service BI is presenting the data in a consumable format. You can't just give users PowerPivot and a server with a copy of the OLTP database(s). Building cubes is labour intensive and doesn't always give the user what they need. This is where the BI Semantic Model (BISM) comes in. I gather it's a layer of metadata you define that can combine multiple data sources (and types of data source) into a clear 'interface' that users can work with. It comes with a new query language called DAX. SSAS cubes are unlikely to go away overnight because, with their pre-calculated results, they are still the most efficient way to work with really big data sets. A few other random titbits that came up: Reporting Services is going to get some good new stuff in Denali. Keep an eye on www.projectbotticelli.com for the slides. You can also view last year's seminar sessions which covered a lot of the same ground as far as the overall strategy is concerned. They plan to add more material as Denali's features are publicly exposed. Check out the PASS keynote address for a showing of Yahoo's SQL BI servers. Apparently they wheeled the rack out on stage still plugged in and running! Check out the Excel 2010 Data Mining Add-Ins. 32 bit only at present but 64 bit is on the way. There are lots of data sets, many of them free, at the Windows Azure Marketplace Data Market (where you can also get ESRI shape files). If you haven't already seen it, have a look at the Silverlight Pivot Viewer (http://weblogs.asp.net/scottgu/archive/2010/06/29/silverlight-pivotviewer-now-available.aspx). The Bing Maps Data Connector is worth a look if you're into spatial stuff (http://www.bing.com/community/site_blogs/b/maps/archive/2010/07/13/data-connector-sql-server-2008-spatial-amp-bing-maps.aspx).

Read the article
Improving Manageability of Virtual Environments

- by Jeff Victor

Boot Environments for Solaris 10 Branded Zones Until recently, Solaris 10 Branded Zones on Solaris 11 suffered one notable regression: Live Upgrade did not work. The individual packaging and patching tools work correctly, but the ability to upgrade Solaris while the production workload continued running did not exist. A recent Solaris 11 SRU (Solaris 11.1 SRU 6.4) restored most of that functionality, although with a slightly different concept, different commands, and without all of the feature details. This new method gives you the ability to create and manage multiple boot environments (BEs) for a Solaris 10 Branded Zone, and modify the active or any inactive BE, and to do so while the production workload continues to run. Background In case you are new to Solaris: Solaris includes a set of features that enables you to create a bootable Solaris image, called a Boot Environment (BE). This newly created image can be modified while the original BE is still running your workload(s). There are many benefits, including improved uptime and the ability to reboot into (or downgrade to) an older BE if a newer one has a problem. In Solaris 10 this set of features was named Live Upgrade. Solaris 11 applies the same basic concepts to the new packaging system (IPS) but there isn't a specific name for the feature set. The features are simply part of IPS. Solaris 11 Boot Environments are not discussed in this blog entry. Although a Solaris 10 system can have multiple BEs, until recently a Solaris 10 Branded Zone (BZ) in a Solaris 11 system did not have this ability. This limitation was addressed recently, and that enhancement is the subject of this blog entry. This new implementation uses two concepts. The first is the use of a ZFS clone for each BE. This makes it very easy to create a BE, or many BEs. This is a distinct advantage over the Live Upgrade feature set in Solaris 10, which had a practical limitation of two BEs on a system, when using UFS. The second new concept is a very simple mechanism to indicate the BE that should be booted: a ZFS property. The new ZFS property is named com.oracle.zones.solaris10:activebe (isn't that creative? ). It's important to note that the property is inherited from the original BE's file system to any BEs you create. In other words, all BEs in one zone have the same value for that property. When the (Solaris 11) global zone boots the Solaris 10 BZ, it boots the BE that has the name that is stored in the activebe property. Here is a quick summary of the actions you can use to manage these BEs: To create a BE: Create a ZFS clone of the zone's root dataset To activate a BE: Set the ZFS property of the root dataset to indicate the BE To add a package or patch to an inactive BE: Mount the inactive BE Add packages or patches to it Unmount the inactive BE To list the available BEs: Use the "zfs list" command. To destroy a BE: Use the "zfs destroy" command. Preparation Before you can use the new features, you will need a Solaris 10 BZ on a Solaris 11 system. You can use these three steps - on a real Solaris 11.1 server or in a VirtualBox guest running Solaris 11.1 - to create a Solaris 10 BZ. The Solaris 11.1 environment must be at SRU 6.4 or newer. Create a flash archive on the Solaris 10 system s10# flarcreate -n s10-system /net/zones/archives/s10-system.flar Configure the Solaris 10 BZ on the Solaris 11 system s11# zonecfg -z s10z Use 'create' to begin configuring a new zone. zonecfg:s10z create -t SYSsolaris10 zonecfg:s10z set zonepath=/zones/s10z zonecfg:s10z exit s11# zoneadm list -cv ID NAME STATUS PATH BRAND IP 0 global running / solaris shared - s10z configured /zones/s10z solaris10 excl Install the zone from the flash archive s11# zoneadm -z s10z install -a /net/zones/archives/s10-system.flar -p You can find more information about the migration of Solaris 10 environments to Solaris 10 Branded Zones in the documentation. The rest of this blog entry demonstrates the commands you can use to accomplish the aforementioned actions related to BEs. New features in action Note that the demonstration of the commands occurs in the Solaris 10 BZ, as indicated by the shell prompt "s10z# ". Many of these commands can be performed in the global zone instead, if you prefer. If you perform them in the global zone, you must change the ZFS file system names. Create The only complicated action is the creation of a BE. In the Solaris 10 BZ, create a new "boot environment" - a ZFS clone. You can assign any name to the final portion of the clone's name, as long as it meets the requirements for a ZFS file system name. s10z# zfs snapshot rpool/ROOT/zbe-0@snap s10z# zfs clone -o mountpoint=/ -o canmount=noauto rpool/ROOT/zbe-0@snap rpool/ROOT/newBE cannot mount 'rpool/ROOT/newBE' on '/': directory is not empty filesystem successfully created, but not mounted You can safely ignore that message: we already know that / is not empty! We have merely told ZFS that the default mountpoint for the clone is the root directory. List the available BEs and active BE Because each BE is represented by a clone of the rpool/ROOT dataset, listing the BEs is as simple as listing the clones. s10z# zfs list -r rpool/ROOT NAME USED AVAIL REFER MOUNTPOINT rpool/ROOT 3.55G 42.9G 31K legacy rpool/ROOT/zbe-0 1K 42.9G 3.55G / rpool/ROOT/newBE 3.55G 42.9G 3.55G / The output shows that two BEs exist. Their names are "zbe-0" and "newBE". You can tell Solaris that one particular BE should be used when the zone next boots by using a ZFS property. Its name is com.oracle.zones.solaris10:activebe. The value of that property is the name of the clone that contains the BE that should be booted. s10z# zfs get com.oracle.zones.solaris10:activebe rpool/ROOT NAME PROPERTY VALUE SOURCE rpool/ROOT com.oracle.zones.solaris10:activebe zbe-0 local Change the active BE When you want to change the BE that will be booted next time, you can just change the activebe property on the rpool/ROOT dataset. s10z# zfs get com.oracle.zones.solaris10:activebe rpool/ROOT NAME PROPERTY VALUE SOURCE rpool/ROOT com.oracle.zones.solaris10:activebe zbe-0 local s10z# zfs set com.oracle.zones.solaris10:activebe=newBE rpool/ROOT s10z# zfs get com.oracle.zones.solaris10:activebe rpool/ROOT NAME PROPERTY VALUE SOURCE rpool/ROOT com.oracle.zones.solaris10:activebe newBE local s10z# shutdown -y -g0 -i6 After the zone has rebooted: s10z# zfs get com.oracle.zones.solaris10:activebe rpool/ROOT rpool/ROOT com.oracle.zones.solaris10:activebe newBE local s10z# zfs mount rpool/ROOT/newBE / rpool/export /export rpool/export/home /export/home rpool /rpool Mount the original BE to see that it's still there. s10z# zfs mount -o mountpoint=/mnt rpool/ROOT/zbe-0 s10z# ls /mnt Desktop export platform Documents export.backup.20130607T214951Z proc S10Flar home rpool TT_DB kernel sbin bin lib system boot lost+found tmp cdrom mnt usr dev net var etc opt Patch an inactive BE At this point, you can modify the original BE. If you would prefer to modify the new BE, you can restore the original value to the activebe property and reboot, and then mount the new BE to /mnt (or another empty directory) and modify it. Let's mount the original BE so we can modify it. (The first command is only needed if you haven't already mounted that BE.) s10z# zfs mount -o mountpoint=/mnt rpool/ROOT/zbe-0 s10z# patchadd -R /mnt -M /var/sadm/spool 104945-02 Note that the typical usage will be: Create a BE Mount the new (inactive) BE Use the package and patch tools to update the new BE Unmount the new BE Reboot Delete an inactive BE ZFS clones are children of their parent file systems. In order to destroy the parent, you must first "promote" the child. This reverses the parent-child relationship. (For more information on this, see the documentation.) The original rpool/ROOT file system is the parent of the clones that you create as BEs. In order to destroy an earlier BE that is that parent of other BEs, you must first promote one of the child BEs to be the ZFS parent. Only then can you destroy the original BE. Fortunately, this is easier to do than to explain: s10z# zfs promote rpool/ROOT/newBE s10z# zfs destroy rpool/ROOT/zbe-0 s10z# zfs list -r rpool/ROOT NAME USED AVAIL REFER MOUNTPOINT rpool/ROOT 3.56G 269G 31K legacy rpool/ROOT/newBE 3.56G 269G 3.55G / Documentation This feature is so new, it is not yet described in the Solaris 11 documentation. However, MOS note 1558773.1 offers some details. Conclusion With this new feature, you can add and patch packages to boot environments of a Solaris 10 Branded Zone. This ability improves the manageability of these zones, and makes their use more practical. It also means that you can use the existing P2V tools with earlier Solaris 10 updates, and modify the environments after they become Solaris 10 Branded Zones.

Read the article
Anatomy of a .NET Assembly - CLR metadata 1

- by Simon Cooper

Before we look at the bytes comprising the CLR-specific data inside an assembly, we first need to understand the logical format of the metadata (For this post I only be looking at simple pure-IL assemblies; mixed-mode assemblies & other things complicates things quite a bit). Metadata streams Most of the CLR-specific data inside an assembly is inside one of 5 streams, which are analogous to the sections in a PE file. The name of each section in a PE file starts with a ., and the name of each stream in the CLR metadata starts with a #. All but one of the streams are heaps, which store unstructured binary data. The predefined streams are: #~ Also called the metadata stream, this stream stores all the information on the types, methods, fields, properties and events in the assembly. Unlike the other streams, the metadata stream has predefined contents & structure. #Strings This heap is where all the namespace, type & member names are stored. It is referenced extensively from the #~ stream, as we'll be looking at later. #US Also known as the user string heap, this stream stores all the strings used in code directly. All the strings you embed in your source code end up in here. This stream is only referenced from method bodies. #GUID This heap exclusively stores GUIDs used throughout the assembly. #Blob This heap is for storing pure binary data - method signatures, generic instantiations, that sort of thing. Items inside the heaps (#Strings, #US, #GUID and #Blob) are indexed using a simple binary offset from the start of the heap. At that offset is a coded integer giving the length of that item, then the item's bytes immediately follow. The #GUID stream is slightly different, in that GUIDs are all 16 bytes long, so a length isn't required. Metadata tables The #~ stream contains all the assembly metadata. The metadata is organised into 45 tables, which are binary arrays of predefined structures containing information on various aspects of the metadata. Each entry in a table is called a row, and the rows are simply concatentated together in the file on disk. For example, each row in the TypeRef table contains: A reference to where the type is defined (most of the time, a row in the AssemblyRef table). An offset into the #Strings heap with the name of the type An offset into the #Strings heap with the namespace of the type. in that order. The important tables are (with their table number in hex): 0x2: TypeDef 0x4: FieldDef 0x6: MethodDef 0x14: EventDef 0x17: PropertyDef Contains basic information on all the types, fields, methods, events and properties defined in the assembly. 0x1: TypeRef The details of all the referenced types defined in other assemblies. 0xa: MemberRef The details of all the referenced members of types defined in other assemblies. 0x9: InterfaceImpl Links the types defined in the assembly with the interfaces that type implements. 0xc: CustomAttribute Contains information on all the attributes applied to elements in this assembly, from method parameters to the assembly itself. 0x18: MethodSemantics Links properties and events with the methods that comprise the get/set or add/remove methods of the property or method. 0x1b: TypeSpec 0x2b: MethodSpec These tables provide instantiations of generic types and methods for each usage within the assembly. There are several ways to reference a single row within a table. The simplest is to simply specify the 1-based row index (RID). The indexes are 1-based so a value of 0 can represent 'null'. In this case, which table the row index refers to is inferred from the context. If the table can't be determined from the context, then a particular row is specified using a token. This is a 4-byte value with the most significant byte specifying the table, and the other 3 specifying the 1-based RID within that table. This is generally how a metadata table row is referenced from the instruction stream in method bodies. The third way is to use a coded token, which we will look at in the next post. So, back to the bytes Now we've got a rough idea of how the metadata is logically arranged, we can now look at the bytes comprising the start of the CLR data within an assembly: The first 8 bytes of the .text section are used by the CLR loader stub. After that, the CLR-specific data starts with the CLI header. I've highlighted the important bytes in the diagram. In order, they are: The size of the header. As the header is a fixed size, this is always 0x48. The CLR major version. This is always 2, even for .NET 4 assemblies. The CLR minor version. This is always 5, even for .NET 4 assemblies, and seems to be ignored by the runtime. The RVA and size of the metadata header. In the diagram, the RVA 0x20e4 corresponds to the file offset 0x2e4 Various flags specifying if this assembly is pure-IL, whether it is strong name signed, and whether it should be run as 32-bit (this is how the CLR differentiates between x86 and AnyCPU assemblies). A token pointing to the entrypoint of the assembly. In this case, 06 (the last byte) refers to the MethodDef table, and 01 00 00 refers to to the first row in that table. (after a gap) RVA of the strong name signature hash, which comes straight after the CLI header. The RVA 0x2050 corresponds to file offset 0x250. The rest of the CLI header is mainly used in mixed-mode assemblies, and so is zeroed in this pure-IL assembly. After the CLI header comes the strong name hash, which is a SHA-1 hash of the assembly using the strong name key. After that comes the bodies of all the methods in the assembly concatentated together. Each method body starts off with a header, which I'll be looking at later. As you can see, this is a very small assembly with only 2 methods (an instance constructor and a Main method). After that, near the end of the .text section, comes the metadata, containing a metadata header and the 5 streams discussed above. We'll be looking at this in the next post. Conclusion The CLI header data doesn't have much to it, but we've covered some concepts that will be important in later posts - the logical structure of the CLR metadata and the overall layout of CLR data within the .text section. Next, I'll have a look at the contents of the #~ stream, and how the table data is arranged on disk.

Read the article
Developing a Cost Model for Cloud Applications

- by BuckWoody

Note - please pay attention to the date of this post. As much as I attempt to make the information below accurate, the nature of distributed computing means that components, units and pricing will change over time. The definitive costs for Microsoft Windows Azure and SQL Azure are located here, and are more accurate than anything you will see in this post: http://www.microsoft.com/windowsazure/offers/ When writing software that is run on a Platform-as-a-Service (PaaS) offering like Windows Azure / SQL Azure, one of the questions you must answer is how much the system will cost. I will not discuss the comparisons between on-premise costs (which are nigh impossible to calculate accurately) versus cloud costs, but instead focus on creating a general model for estimating costs for a given application. You should be aware that there are (at this writing) two billing mechanisms for Windows and SQL Azure: “Pay-as-you-go” or consumption, and “Subscription” or commitment. Conceptually, you can consider the former a pay-as-you-go cell phone plan, where you pay by the unit used (at a slightly higher rate) and the latter as a standard cell phone plan where you commit to a contract and thus pay lower rates. In this post I’ll stick with the pay-as-you-go mechanism for simplicity, which should be the maximum cost you would pay. From there you may be able to get a lower cost if you use the other mechanism. In any case, the model you create should hold. Developing a good cost model is essential. As a developer or architect, you’ll most certainly be asked how much something will cost, and you need to have a reliable way to estimate that. Businesses and Organizations have been used to paying for servers, software licenses, and other infrastructure as an up-front cost, and power, people to the systems and so on as an ongoing (and sometimes not factored) cost. When presented with a new paradigm like distributed computing, they may not understand the true cost/value proposition, and that’s where the architect and developer can guide the conversation to make a choice based on features of the application versus the true costs. The two big buckets of use-types for these applications are customer-based and steady-state. In the customer-based use type, each successful use of the program results in a sale or income for your organization. Perhaps you’ve written an application that provides the spot-price of foo, and your customer pays for the use of that application. In that case, once you’ve estimated your cost for a successful traversal of the application, you can build that into the price you charge the user. It’s a standard restaurant model, where the price of the meal is determined by the cost of making it, plus any profit you can make. In the second use-type, the application will be used by a more-or-less constant number of processes or users and no direct revenue is attached to the system. A typical example is a customer-tracking system used by the employees within your company. In this case, the cost model is often created “in reverse” - meaning that you pilot the application, monitor the use (and costs) and that cost is held steady. This is where the comparison with an on-premise system becomes necessary, even though it is more difficult to estimate those on-premise true costs. For instance, do you know exactly how much cost the air conditioning is because you have a team of system administrators? This may sound trivial, but that, along with the insurance for the building, the wiring, and every other part of the system is in fact a cost to the business. There are three primary methods that I’ve been successful with in estimating the cost. None are perfect, all are demand-driven. The general process is to lay out a matrix of: components units cost per unit and then multiply that times the usage of the system, based on which components you use in the program. That sounds a bit simplistic, but using those metrics in a calculation becomes more detailed. In all of the methods that follow, you need to know your application. The components for a PaaS include computing instances, storage, transactions, bandwidth and in the case of SQL Azure, database size. In most cases, architects start with the first model and progress through the other methods to gain accuracy. Simple Estimation The simplest way to calculate costs is to architect the application (even UML or on-paper, no coding involved) and then estimate which of the components you’ll use, and how much of each will be used. Microsoft provides two tools to do this - one is a simple slider-application located here: http://www.microsoft.com/windowsazure/pricing-calculator/ The other is a tool you download to create an “Return on Investment” (ROI) spreadsheet, which has the advantage of leading you through various questions to estimate what you plan to use, located here: https://roianalyst.alinean.com/msft/AutoLogin.do?d=176318219048082115 You can also just create a spreadsheet yourself with a structure like this: Program Element Azure Component Unit of Measure Cost Per Unit Estimated Use of Component Total Cost Per Component Cumulative Cost Of course, the consideration with this model is that it is difficult to predict a system that is not running or hasn’t even been developed. Which brings us to the next model type. Measure and Project A more accurate model is to actually write the code for the application, using the Software Development Kit (SDK) which can run entirely disconnected from Azure. The code should be instrumented to estimate the use of the application components, logging to a local file on the development system. A series of unit and integration tests should be run, which will create load on the test system. You can use standard development concepts to track this usage, and even use Windows Performance Monitor counters. The best place to start with this method is to use the Windows Azure Diagnostics subsystem in your code, which you can read more about here: http://blogs.msdn.com/b/sumitm/archive/2009/11/18/introducing-windows-azure-diagnostics.aspx This set of API’s greatly simplifies tracking the application, and in fact you can use this information for more than just a cost model. After you have the tracking logs, you can plug the numbers into ay of the tools above, which should give a representative cost or in some cases a unit cost. The consideration with this model is that the SDK fabric is not a one-to-one comparison with performance on the actual Windows Azure fabric. Those differences are usually smaller, but they do need to be considered. Also, you may not be able to accurately predict the load on the system, which might lead to an architectural change, which changes the model. This leads us to the next, most accurate method for a cost model. Sample and Estimate Using standard statistical and other predictive math, once the application is deployed you will get a bill each month from Microsoft for your Azure usage. The bill is quite detailed, and you can export the data from it to do analysis, and using methods like regression and so on project out into the future what the costs will be. I normally advise that the architect also extrapolate a unit cost from those metrics as well. This is the information that should be reported back to the executives that pay the bills: the past cost, future projected costs, and unit cost “per click” or “per transaction”, as your case warrants. The challenge here is in the model itself - statistical methods are not foolproof, and the larger the sample (in this case I recommend the entire population, not a smaller sample) is key. References and Tools Articles: http://blogs.msdn.com/b/patrick_butler_monterde/archive/2010/02/10/windows-azure-billing-overview.aspx http://technet.microsoft.com/en-us/magazine/gg213848.aspx http://blog.codingoutloud.com/2011/06/05/azure-faq-how-much-will-it-cost-me-to-run-my-application-on-windows-azure/ http://blogs.msdn.com/b/johnalioto/archive/2010/08/25/10054193.aspx http://geekswithblogs.net/iupdateable/archive/2010/02/08/qampa-how-can-i-calculate-the-tco-and-roi-when.aspx Other Tools: http://cloud-assessment.com/ http://communities.quest.com/community/cloud_tools

Read the article
Source-control 'wet-work'?

- by Phil Factor

When a design or creative work is flawed beyond remedy, it is often best to destroy it and start again. The other day, I lost the code to a long and intricate SQL batch I was working on. I’d thought it was impossible, but it happened. With all the technology around that is designed to prevent this occurring, this sort of accident has become a rare event. If it weren’t for a deranged laptop, and my distraction, the code wouldn’t have been lost this time. As always, I sighed, had a soothing cup of tea, and typed it all in again. The new code I hastily tapped in was much better: I’d held in my head the essence of how the code should work rather than the details: I now knew for certain the start point, the end, and how it should be achieved. Instantly the detritus of half-baked thoughts fell away and I was able to write logical code that performed better. Because I could work so quickly, I was able to hold the details of all the columns and variables in my head, and the dynamics of the flow of data. It was, in fact, easier and quicker to start from scratch rather than tidy up and refactor the existing code with its inevitable fumbling and half-baked ideas. What a shame that technology is now so good that developers rarely experience the cleansing shock of losing one’s code and having to rewrite it from scratch. If you’ve never accidentally lost your code, then it is worth doing it deliberately once for the experience. Creative people have, until Technology mistakenly prevented it, torn up their drafts or sketches, threw them in the bin, and started again from scratch. Leonardo’s obsessive reworking of the Mona Lisa was renowned because it was so unusual: Most artists have been utterly ruthless in destroying work that didn’t quite make it. Authors are particularly keen on writing afresh, and the results are generally positive. Lawrence of Arabia actually lost the entire 250,000 word manuscript of ‘The Seven Pillars of Wisdom’ by accidentally leaving it on a train at Reading station, before rewriting a much better version. Now, any writer or artist is seduced by technology into altering or refining their work rather than casting it dramatically in the bin or setting a light to it on a bonfire, and rewriting it from the blank page. It is easy to pick away at a flawed work, but the real creative process is far more brutal. Once, many years ago whilst running a software house that supplied commercial software to local businesses, I’d been supervising an accounting system for a farming cooperative. No packaged system met their needs, and it was all hand-cut code. For us, it represented a breakthrough as it was for a government organisation, and success would guarantee more contracts. As you’ve probably guessed, the code got mangled in a disk crash just a week before the deadline for delivery, and the many backups all proved to be entirely corrupted by a faulty tape drive. There were some fragments left on individual machines, but they were all of different versions. The developers were in despair. Strangely, I managed to re-write the bulk of a three-month project in a manic and caffeine-soaked weekend. Sure, that elegant universally-applicable input-form routine was‘nt quite so elegant, but it didn’t really need to be as we knew what forms it needed to support. Yes, the code lacked architectural elegance and reusability. By dawn on Monday, the application passed its integration tests. The developers rose to the occasion after I’d collapsed, and tidied up what I’d done, though they were reproachful that some of the style and elegance had gone out of the application. By the delivery date, we were able to install it. It was a smaller, faster application than the beta they’d seen and the user-interface had a new, rather Spartan, appearance that we swore was done to conform to the latest in user-interface guidelines. (we switched to Helvetica font to look more ‘Bauhaus’ ). The client was so delighted that he forgave the new bugs that had crept in. I still have the disk that crashed, up in the attic. In IT, we have had mixed experiences from complete re-writes. Lotus 123 never really recovered from a complete rewrite from assembler into C, Borland made the mistake with Arago and Quattro Pro and Netscape’s complete rewrite of their Navigator 4 browser was a white-knuckle ride. In all cases, the decision to rewrite was a result of extreme circumstances where no other course of action seemed possible. The rewrite didn’t come out of the blue. I prefer to remember the rewrite of Minix by young Linus Torvalds, or the rewrite of Bitkeeper by a slightly older Linus. The rewrite of CP/M didn’t do too badly either, did it? Come to think of it, the guy who decided to rewrite the windowing system of the Xerox Star never regretted the decision. I’ll agree that one should often resist calls for a rewrite. One of the worst habits of the more inexperienced programmer is to denigrate whatever code he or she inherits, and then call loudly for a complete rewrite. They are buoyed up by the mistaken belief that they can do better. This, however, is a different psychological phenomenon, more related to the idea of some motorcyclists that they are operating on infinite lives, or the occasional squaddies that if they charge the machine-guns determinedly enough all will be well. Grim experience brings out the humility in any experienced programmer. I’m referring to quite different circumstances here. Where a team knows the requirements perfectly, are of one mind on methodology and coding standards, and they already have a solution, then what is wrong with considering a complete rewrite? Rewrites are so painful in the early stages, until that point where one realises the payoff, that even I quail at the thought. One needs a natural disaster to push one over the edge. The trouble is that source-control systems, and disaster recovery systems, are just too good nowadays. If I were to lose this draft of this very blog post, I know I’d rewrite it much better. However, if you read this, you’ll know I didn’t have the nerve to delete it and start again. There was a time that one prayed that unreliable hardware would deliver you from an unmaintainable mess of a codebase, but now technology has made us almost entirely immune to such a merciful act of God. An old friend of mine with long experience in the software industry has long had the idea of the ‘source-control wet-work’, where one hires a malicious hacker in some wild eastern country to hack into one’s own source control system to destroy all trace of the source to an application. Alas, backup systems are just too good to make this any more than a pipedream. Somehow, it would be difficult to promote the idea. As an alternative, could one construct a source control system that, on doing all the code-quality metrics, would systematically destroy all trace of source code that failed the quality test? Alas, I can’t see many managers buying into the idea. In reading the full story of the near-loss of Toy Story 2, it set me thinking. It turned out that the lucky restoration of the code wasn’t the happy ending one first imagined it to be, because they eventually came to the conclusion that the plot was fundamentally flawed and it all had to be rewritten anyway. Was this an early case of the ‘source-control wet-job’?’ It is very hard nowadays to do a rapid U-turn in a development project because we are far too prone to cling to our existing source-code.

Read the article
Finally, upgrade from Nokia X3 to Samsung Galaxy S III

This time, something slightly different but nonetheless not less interesting, hopefully. Living on a remote island like Mauritius, ill-praised 'Cyber Island' in the Indian Ocean, has its advantages in life style and relaxed environment to life in but in terms of technological aspects it can be quite a nightmare. Well, I guess this might be different story to report about... one day. Cyber Island Mauritius Despite it's shiny advertisement as Cyber Island and business in ICT hub to Africa, Mauritius is not on the latest track of available models in computer hardware or, in the context of this article, cellulars or smart-phone, or communication technology in general. Okay, I have to admit that this statement is only partly true. Money can buy, even here in Mauritius. Luckily, there are ways and ways to deal with this outcry of modern, read: technological, civilisation issues. Online shopping you might think? Yes, for sure, until you discover in your checkout procedure that a small island in the Indian Ocean isn't a preferred destination for delivery and the precious time you spent on putting your items into your cart and feeding your personal level of anticipation gets ruined on the last stint. Ordering from abroad saves you money Anyway, I got in touch with my personal courier and luckily there were some extra-kilos left in the luggage. First obstacle sorted, we have a Transporter! Okay, on the next occasion off to Amazon online and using their Prime service for fast delivery. Actually, the order was placed on Saturday evening and everything got delivered on Tuesday morning - nice job in less than 72 hours. Okay, among the items of that shopping rush I ordered a shiny Samsung Galaxy S III 16GB in oceanic blue - did I mention, that you hardly get a blue model in Mauritius? - for my BWE. Interesting side-notes: First, Amazon Germany dropped the prices for roughly 30% on the S3, and we got the 16GB model for less than 500 Euro (or approx. Rs. 19.500,-) compared to the usual Rs. 27.000,- on the local market. It even varies whether the local price is inclusive or exclusive VAT (15%). Second, since a while she was bothering me to get an iPhone and an iPad for her, fair enough I thought, decent hardware, posh design and reliable services. Until we watched the 'magical' introduction of Samsung's new models at the IFA exhibition, she read the bashing comments on Google+ on the iPhone 5 and I gave her a brief summary on the law suit between Apple and Samsung in the USA. So, yes, Samsung USA is right, the next big thing is already here - literally. My BWE loves the look and touch of the Galaxy S3. And for me it was more cost-effective in terms of purchases done at the App Store, ups, Play Store. Transfer of contacts, text messages and media files Okay, now that the hardware is in place, how to transfer all those contacts, text messages, media files, etc. between those two devices? In the past, I used to use the Nokia Communication Suite between various models but now for Android? Well, as usual Google and Bing are reliable friends and among the first hits I came across an article about How to Transfer Contacts from Nokia to Android. Couldn't be easier, right? Well, sort of... my main Windows systems are already running on Windows 8, and this actually caused problems with the mobile/smart-phone device drivers. The article provides the download for an older version 1.10 which upgrades to 2.11 (as time of writing this entry) but both couldn't get the Galaxy S3 and the Nokia connected. Shame on me... the product page clearly doesn't mention Windows 8 (for now) and Windows 8 isn't available for the general audience at all... After I took a spare machine running on Windows Vista everything went smooth. Software installed, upgrade done, device drivers for Android automatically downloaded and installed, and the same painless routine for the Nokia part. I think, I rebooted the system twice during the whole setup procedure but hey, it was more or less a distraction while coding some stuff in ASP.NET MVC and Telerik Kendo UI. The transfer of contacts and text messages was done via Wondershare MobileGo for Android, and all media files by moving the additional microSD card from one device to the other. But even without an external SD card, it would have been very easy to copy the files via Windows Explorer directly. Little catch and excellent service Fine, we are almost done and the only step left is to shift the SIM card... Ouch, gotcha! The X3 uses a standard size SIM card while the S III only accepts microSIM form factor. What an irony, bigger smartphone needs smaller SIM card. Luckily, the next showroom of Emtel is just 5 mins away up the road, and the service staff over there know their job. Finally, after roughly 10 mins of paper work, activation and small chit-chat, the S3 came to life on the mobile network. Owning a smart-phone now and knowing that my BWE would like to interact more on social networks away from home, especially to upload pictures and provide local 'check-ins', I activated a data package for her in advance, too. Even that it is Saturday, everything was already done and ready to be used. Nice bonus: The Emtel clerk directly offered me to set up the configuration for the Emtel data services, yes sure, go ahead, this saves me to search for that in the settings. Okay, spoiler-alert here, setting a static APN to access the Emtel network and the internet wouldn't be a challenge. But hey, she already had the phone in her hands and I could keep my eyes on the children. Well done, Emtel! Resume Thanks to the useful software package by Wondershare is was a hands-free experience to transfer all the data from a Nokia mobile on Symbian S60 to a Samsung Galaxy S III on Android Ice Cream Sandwich (ICS). In the future, this wont be a serious issue at all anymore thanks to synchronisation services and cloud storage. And for now, I'm only waiting for the official upgrades for Jelly Bean.

Read the article
DRY and SRP

- by Timothy Klenke

Originally posted on: http://geekswithblogs.net/TimothyK/archive/2014/06/11/dry-and-srp.aspxKent Beck’s XP Simplicity Rules (aka Four Rules of Simple Design) are a prioritized list of rules that when applied to your code generally yield a great design. As you’ll see from the above link the list has slightly evolved over time. I find today they are usually listed as: All Tests Pass Don’t Repeat Yourself (DRY) Express Intent Minimalistic These are prioritized. If your code doesn’t work (rule 1) then everything else is forfeit. Go back to rule one and get the code working before worrying about anything else. Over the years the community have debated whether the priority of rules 2 and 3 should be reversed. Some say a little duplication in the code is OK as long as it helps express intent. I’ve debated it myself. This recent post got me thinking about this again, hence this post. I don’t think it is fair to compare “Expressing Intent” against “DRY”. This is a comparison of apples to oranges. “Expressing Intent” is a principal of code quality. “Repeating Yourself” is a code smell. A code smell is merely an indicator that there might be something wrong with the code. It takes further investigation to determine if a violation of an underlying principal of code quality has actually occurred. For example “using nouns for method names”, “using verbs for property names”, or “using Booleans for parameters” are all code smells that indicate that code probably isn’t doing a good job at expressing intent. They are usually very good indicators. But what principle is the code smell of Duplication pointing to and how good of an indicator is it? Duplication in the code base is bad for a couple reasons. If you need to make a change and that needs to be made in a number of locations it is difficult to know if you have caught all of them. This can lead to bugs if/when one of those locations is overlooked. By refactoring the code to remove all duplication there will be left with only one place to change, thereby eliminating this problem. With most projects the code becomes the single source of truth for a project. If a production code base is inconsistent with a five year old requirements or design document the production code that people are currently living with is usually declared as the current reality (or truth). Requirement or design documents at this age in a project life cycle are usually of little value. Although comparing production code to external documentation is usually straight forward, duplication within the code base muddles this declaration of truth. When code is duplicated small discrepancies will creep in between the two copies over time. The question then becomes which copy is correct? As different factions debate how the software should work, trust in the software and the team behind it erodes. The code smell of Duplication points to a violation of the “Single Source of Truth” principle. Let me define that as: A stakeholder’s requirement for a software change should never cause more than one class to change. Violation of the Single Source of Truth principle will always result in duplication in the code. However, the inverse is not always true. Duplication in the code does not necessarily indicate that there is a violation of the Single Source of Truth principle. To illustrate this, let’s look at a retail system where the system will (1) send a transaction to a bank and (2) print a receipt for the customer. Although these are two separate features of the system, they are closely related. The reason for printing the receipt is usually to provide an audit trail back to the bank transaction. Both features use the same data: amount charged, account number, transaction date, customer name, retail store name, and etcetera. Because both features use much of the same data, there is likely to be a lot of duplication between them. This duplication can be removed by making both features use the same data access layer. Then start coming the divergent requirements. The receipt stakeholder wants a change so that the account number has the last few digits masked out to protect the customer’s privacy. That can be solve with a small IF statement whilst still eliminating all duplication in the system. Then the bank wants to take a picture of the customer as well as capture their signature and/or PIN number for enhanced security. Then the receipt owner wants to pull data from a completely different system to report the customer’s loyalty program point total. After a while you realize that the two stakeholders have somewhat similar, but ultimately different responsibilities. They have their own reasons for pulling the data access layer in different directions. Then it dawns on you, the Single Responsibility Principle: There should never be more than one reason for a class to change. In this example we have two stakeholders giving two separate reasons for the data access class to change. It is clear violation of the Single Responsibility Principle. That’s a problem because it can often lead the project owner pitting the two stakeholders against each other in a vein attempt to get them to work out a mutual single source of truth. But that doesn’t exist. There are two completely valid truths that the developers need to support. How is this to be supported and honour the Single Responsibility Principle? The solution is to duplicate the data access layer and let each stakeholder control their own copy. The Single Source of Truth and Single Responsibility Principles are very closely related. SST tells you when to remove duplication; SRP tells you when to introduce it. They may seem to be fighting each other, but really they are not. The key is to clearly identify the different responsibilities (or sources of truth) over a system. Sometimes there is a single person with that responsibility, other times there are many. This can be especially difficult if the same person has dual responsibilities. They might not even realize they are wearing multiple hats. In my opinion Single Source of Truth should be listed as the second rule of simple design with Express Intent at number three. Investigation of the DRY code smell should yield to the proper application SST, without violating SRP. When necessary leave duplication in the system and let the class names express the different people that are responsible for controlling them. Knowing all the people with responsibilities over a system is the higher priority because you’ll need to know this before you can express it. Although it may be a code smell when there is duplication in the code, it does not necessarily mean that the coder has chosen to be expressive over DRY or that the code is bad.

Read the article
External File Upload Optimizations for Windows Azure

- by rgillen

[Cross posted from here: http://rob.gillenfamily.net/post/External-File-Upload-Optimizations-for-Windows-Azure.aspx] I’m wrapping up a bit of the work we’ve been doing on data movement optimizations for cloud computing and the latest set of data yielded some interesting points I thought I’d share. The work done here is not really rocket science but may, in some ways, be slightly counter-intuitive and therefore seemed worthy of posting. Summary: for those who don’t like to read detailed posts or don’t have time, the synopsis is that if you are uploading data to Azure, block your data (even down to 1MB) and upload in parallel. Set your block size based on your source file size, but if you must choose a fixed value, use 1MB. Following the above will result in significant performance gains… upwards of 10x-24x and a reduction in overall file transfer time of upwards of 90% (eg, uploading a 1GB file averaged 46.37 minutes prior to optimizations and averaged 1.86 minutes afterwards). Detail: For those of you who want more detail, or think that the claims at the end of the preceding paragraph are over-reaching, what follows is information and code supporting these claims. As the title would indicate, these tests were run from our research facility pointing to the Azure cloud (specifically US North Central as it is physically closest to us) and do not represent intra-cloud results… we have performed intra-cloud tests and the overall results are similar in notion but the data rates are significantly different as well as the tipping points for the various block sizes… this will be detailed separately). We started by building a very simple console application that would loop through a directory and upload each file to Azure storage. This application used the shipping storage client library from the 1.1 version of the azure tools. The only real variation from the client library is that we added code to collect and record the duration (in ms) and size (in bytes) for each file transferred. The code is available here. We then created a directory that had a collection of files for the following sizes: 2KB, 32KB, 64KB, 128KB, 512KB, 1MB, 5MB, 10MB, 25MB, 50MB, 100MB, 250MB, 500MB, 750MB, and 1GB (50 files for each size listed). These files contained randomly-generated binary data and do not benefit from compression (a separate discussion topic). Our file generation tool is available here. The baseline was established by running the application described above against the directory containing all of the data files. This application uploads the files in a random order so as to avoid transferring all of the files of a given size sequentially and thereby spreading the affects of periodic Internet delays across the collection of results. We then ran some scripts to split the resulting data and generate some reports. The raw data collected for our non-optimized tests is available via the links in the Related Resources section at the bottom of this post. For each file size, we calculated the average upload time (and standard deviation) and the average transfer rate (and standard deviation). As you likely are aware, transferring data across the Internet is susceptible to many transient delays which can cause anomalies in the resulting data. It is for this reason that we randomized the order of source file processing as well as executed the tests 50x for each file size. We expect that these steps will yield a sufficiently balanced set of results. Once the baseline was collected and analyzed, we updated the test harness application with some methods to split the source file into user-defined block sizes and then to upload those blocks in parallel (using the PutBlock() method of Azure storage). The parallelization was handled by simply relying on the Parallel Extensions to .NET to provide a Parallel.For loop (see linked source for specific implementation details in Program.cs, line 173 and following… less than 100 lines total). Once all of the blocks were uploaded, we called PutBlockList() to assemble/commit the file in Azure storage. For each block transferred, the MD5 was calculated and sent ensuring that the bits that arrived matched was was intended. The timer for the blocked/parallelized transfer method wraps the entire process (source file splitting, block transfer, MD5 validation, file committal). A diagram of the process is as follows: We then tested the affects of blocking & parallelizing the transfers by running the updated application against the same source set and did a parameter sweep on the block size including 256KB, 512KB, 1MB, 2MB, and 4MB (our assumption was that anything lower than 256KB wasn’t worth the trouble and 4MB is the maximum size of a block supported by Azure). The raw data for the parallel tests is available via the links in the Related Resources section at the bottom of this post. This data was processed and then compared against the single-threaded / non-optimized transfer numbers and the results were encouraging. The Excel version of the results is available here. Two semi-obvious points need to be made prior to reviewing the data. The first is that if the block size is larger than the source file size you will end up with a “negative optimization” due to the overhead of attempting to block and parallelize. The second is that as the files get smaller, the clock-time cost of blocking and parallelizing (overhead) is more apparent and can tend towards negative optimizations. For this reason (and is supported in the raw data provided in the linked worksheet) the charts and dialog below ignore source file sizes less than 1MB. (click chart for full size image) The chart above illustrates some interesting points about the results: When the block size is smaller than the source file, performance increases but as the block size approaches and then passes the source file size, you see decreasing benefit to the point of negative gains (see the values for the 1MB file size) For some of the moderately-sized source files, small blocks (256KB) are best As the size of the source file gets larger (see values for 50MB and up), the smallest block size is not the most efficient (presumably due, at least in part, to the increased number of blocks, increased number of individual transfer requests, and reassembly/committal costs). Once you pass the 250MB source file size, the difference in rate for 1MB to 4MB blocks is more-or-less constant The 1MB block size gives the best average improvement (~16x) but the optimal approach would be to vary the block size based on the size of the source file. (click chart for full size image) The above is another view of the same data as the prior chart just with the axis changed (x-axis represents file size and plotted data shows improvement by block size). It again highlights the fact that the 1MB block size is probably the best overall size but highlights the benefits of some of the other block sizes at different source file sizes. This last chart shows the change in total duration of the file uploads based on different block sizes for the source file sizes. Nothing really new here other than this view of the data highlights the negative affects of poorly choosing a block size for smaller files. Summary What we have found so far is that blocking your file uploads and uploading them in parallel results in significant performance improvements. Further, utilizing extension methods and the Task Parallel Library (.NET 4.0) make short work of altering the shipping client library to provide this functionality while minimizing the amount of change to existing applications that might be using the client library for other interactions. Related Resources Source code for upload test application Source code for random file generator ODatas feed of raw data from non-optimized transfer tests Experiment Metadata Experiment Datasets 2KB Uploads 32KB Uploads 64KB Uploads 128KB Uploads 256KB Uploads 512KB Uploads 1MB Uploads 5MB Uploads 10MB Uploads 25MB Uploads 50MB Uploads 100MB Uploads 250MB Uploads 500MB Uploads 750MB Uploads 1GB Uploads Raw Data OData feeds of raw data from blocked/parallelized transfer tests Experiment Metadata Experiment Datasets Raw Data 256KB Blocks 512KB Blocks 1MB Blocks 2MB Blocks 4MB Blocks Excel worksheet showing summarizations and comparisons

Read the article
Finally, upgrade from Nokia X3 to Samsung Galaxy S III

This time, something slightly different but nonetheless not less interesting, hopefully. Living on a remote island like Mauritius, ill-praised 'Cyber Island' in the Indian Ocean, has its advantages in life style and relaxed environment to life in but in terms of technological aspects it can be quite a nightmare. Well, I guess this might be different story to report about... one day. Cyber Island Mauritius Despite it's shiny advertisement as Cyber Island and business in ICT hub to Africa, Mauritius is not on the latest track of available models in computer hardware or, in the context of this article, cellulars or smart-phone, or communication technology in general. Okay, I have to admit that this statement is only partly true. Money can buy, even here in Mauritius. Luckily, there are ways and ways to deal with this outcry of modern, read: technological, civilisation issues. Online shopping you might think? Yes, for sure, until you discover in your checkout procedure that a small island in the Indian Ocean isn't a preferred destination for delivery and the precious time you spent on putting your items into your cart and feeding your personal level of anticipation gets ruined on the last stint. Ordering from abroad saves you money Anyway, I got in touch with my personal courier and luckily there were some extra-kilos left in the luggage. First obstacle sorted, we have a Transporter! Okay, on the next occasion off to Amazon online and using their Prime service for fast delivery. Actually, the order was placed on Saturday evening and everything got delivered on Tuesday morning - nice job in less than 72 hours. Okay, among the items of that shopping rush I ordered a shiny Samsung Galaxy S III 16GB in oceanic blue - did I mention, that you hardly get a blue model in Mauritius? - for my BWE. Interesting side-notes: First, Amazon Germany dropped the prices for roughly 30% on the S3, and we got the 16GB model for less than 500 Euro (or approx. Rs. 19.500,-) compared to the usual Rs. 27.000,- on the local market. It even varies whether the local price is inclusive or exclusive VAT (15%). Second, since a while she was bothering me to get an iPhone and an iPad for her, fair enough I thought, decent hardware, posh design and reliable services. Until we watched the 'magical' introduction of Samsung's new models at the IFA exhibition, she read the bashing comments on Google+ on the iPhone 5 and I gave her a brief summary on the law suit between Apple and Samsung in the USA. So, yes, Samsung USA is right, the next big thing is already here - literally. My BWE loves the look and touch of the Galaxy S3. And for me it was more cost-effective in terms of purchases done at the App Store, ups, Play Store. Transfer of contacts, text messages and media files Okay, now that the hardware is in place, how to transfer all those contacts, text messages, media files, etc. between those two devices? In the past, I used to use the Nokia Communication Suite between various models but now for Android? Well, as usual Google and Bing are reliable friends and among the first hits I came across an article about How to Transfer Contacts from Nokia to Android. Couldn't be easier, right? Well, sort of... my main Windows systems are already running on Windows 8, and this actually caused problems with the mobile/smart-phone device drivers. The article provides the download for an older version 1.10 which upgrades to 2.11 (as time of writing this entry) but both couldn't get the Galaxy S3 and the Nokia connected. Shame on me... the product page clearly doesn't mention Windows 8 (for now) and Windows 8 isn't available for the general audience at all... After I took a spare machine running on Windows Vista everything went smooth. Software installed, upgrade done, device drivers for Android automatically downloaded and installed, and the same painless routine for the Nokia part. I think, I rebooted the system twice during the whole setup procedure but hey, it was more or less a distraction while coding some stuff in ASP.NET MVC and Telerik Kendo UI. The transfer of contacts and text messages was done via Wondershare MobileGo for Android, and all media files by moving the additional microSD card from one device to the other. But even without an external SD card, it would have been very easy to copy the files via Windows Explorer directly. Little catch and excellent service Fine, we are almost done and the only step left is to shift the SIM card... Ouch, gotcha! The X3 uses a standard size SIM card while the S III only accepts microSIM form factor. What an irony, bigger smartphone needs smaller SIM card. Luckily, the next showroom of Emtel is just 5 mins away up the road, and the service staff over there know their job. Finally, after roughly 10 mins of paper work, activation and small chit-chat, the S3 came to life on the mobile network. Owning a smart-phone now and knowing that my BWE would like to interact more on social networks away from home, especially to upload pictures and provide local 'check-ins', I activated a data package for her in advance, too. Even that it is Saturday, everything was already done and ready to be used. Nice bonus: The Emtel clerk directly offered me to set up the configuration for the Emtel data services, yes sure, go ahead, this saves me to search for that in the settings. Okay, spoiler-alert here, setting a static APN to access the Emtel network and the internet wouldn't be a challenge. But hey, she already had the phone in her hands and I could keep my eyes on the children. Well done, Emtel! Resume Thanks to the useful software package by Wondershare is was a hands-free experience to transfer all the data from a Nokia mobile on Symbian S60 to a Samsung Galaxy S III on Android Ice Cream Sandwich (ICS). In the future, this wont be a serious issue at all anymore thanks to synchronisation services and cloud storage. And for now, I'm only waiting for the official upgrades for Jelly Bean.

Read the article
I see no LOBs!

- by Paul White

Is it possible to see LOB (large object) logical reads from STATISTICS IO output on a table with no LOB columns? I was asked this question today by someone who had spent a good fraction of their afternoon trying to work out why this was occurring – even going so far as to re-run DBCC CHECKDB to see if any corruption had taken place. The table in question wasn’t particularly pretty – it had grown somewhat organically over time, with new columns being added every so often as the need arose. Nevertheless, it remained a simple structure with no LOB columns – no TEXT or IMAGE, no XML, no MAX types – nothing aside from ordinary INT, MONEY, VARCHAR, and DATETIME types. To add to the air of mystery, not every query that ran against the table would report LOB logical reads – just sometimes – but when it did, the query often took much longer to execute. Ok, enough of the pre-amble. I can’t reproduce the exact structure here, but the following script creates a table that will serve to demonstrate the effect: IF OBJECT_ID(N'dbo.Test', N'U') IS NOT NULL DROP TABLE dbo.Test GO CREATE TABLE dbo.Test ( row_id NUMERIC IDENTITY NOT NULL, col01 NVARCHAR(450) NOT NULL, col02 NVARCHAR(450) NOT NULL, col03 NVARCHAR(450) NOT NULL, col04 NVARCHAR(450) NOT NULL, col05 NVARCHAR(450) NOT NULL, col06 NVARCHAR(450) NOT NULL, col07 NVARCHAR(450) NOT NULL, col08 NVARCHAR(450) NOT NULL, col09 NVARCHAR(450) NOT NULL, col10 NVARCHAR(450) NOT NULL, CONSTRAINT [PK dbo.Test row_id] PRIMARY KEY CLUSTERED (row_id) ) ; The next script loads the ten variable-length character columns with one-character strings in the first row, two-character strings in the second row, and so on down to the 450th row: WITH Numbers AS ( -- Generates numbers 1 - 450 inclusive SELECT TOP (450) n = ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM master.sys.columns C1, master.sys.columns C2, master.sys.columns C3 ORDER BY n ASC ) INSERT dbo.Test WITH (TABLOCKX) SELECT REPLICATE(N'A', N.n), REPLICATE(N'B', N.n), REPLICATE(N'C', N.n), REPLICATE(N'D', N.n), REPLICATE(N'E', N.n), REPLICATE(N'F', N.n), REPLICATE(N'G', N.n), REPLICATE(N'H', N.n), REPLICATE(N'I', N.n), REPLICATE(N'J', N.n) FROM Numbers AS N ORDER BY N.n ASC ; Once those two scripts have run, the table contains 450 rows and 10 columns of data like this: Most of the time, when we query data from this table, we don’t see any LOB logical reads, for example: -- Find the maximum length of the data in -- column 5 for a range of rows SELECT result = MAX(DATALENGTH(T.col05)) FROM dbo.Test AS T WHERE row_id BETWEEN 50 AND 100 ; But with a different query… -- Read all the data in column 1 SELECT result = MAX(DATALENGTH(T.col01)) FROM dbo.Test AS T ; …suddenly we have 49 LOB logical reads, as well as the ‘normal’ logical reads we would expect. The Explanation If we had tried to create this table in SQL Server 2000, we would have received a warning message to say that future INSERT or UPDATE operations on the table might fail if the resulting row exceeded the in-row storage limit of 8060 bytes. If we needed to store more data than would fit in an 8060 byte row (including internal overhead) we had to use a LOB column – TEXT, NTEXT, or IMAGE. These special data types store the large data values in a separate structure, with just a small pointer left in the original row. Row Overflow SQL Server 2005 introduced a feature called row overflow, which allows one or more variable-length columns in a row to move to off-row storage if the data in a particular row would otherwise exceed 8060 bytes. You no longer receive a warning when creating (or altering) a table that might need more than 8060 bytes of in-row storage; if SQL Server finds that it can no longer fit a variable-length column in a particular row, it will silently move one or more of these columns off the row into a separate allocation unit. Only variable-length columns can be moved in this way (for example the (N)VARCHAR, VARBINARY, and SQL_VARIANT types). Fixed-length columns (like INTEGER and DATETIME for example) never move into ‘row overflow’ storage. The decision to move a column off-row is done on a row-by-row basis – so data in a particular column might be stored in-row for some table records, and off-row for others. In general, if SQL Server finds that it needs to move a column into row-overflow storage, it moves the largest variable-length column record for that row. Note that in the case of an UPDATE statement that results in the 8060 byte limit being exceeded, it might not be the column that grew that is moved! Sneaky LOBs Anyway, that’s all very interesting but I don’t want to get too carried away with the intricacies of row-overflow storage internals. The point is that it is now possible to define a table with non-LOB columns that will silently exceed the old row-size limit and result in ordinary variable-length columns being moved to off-row storage. Adding new columns to a table, expanding an existing column definition, or simply storing more data in a column than you used to – all these things can result in one or more variable-length columns being moved off the row. Note that row-overflow storage is logically quite different from old-style LOB and new-style MAX data type storage – individual variable-length columns are still limited to 8000 bytes each – you can just have more of them now. Having said that, the physical mechanisms involved are very similar to full LOB storage – a column moved to row-overflow leaves a 24-byte pointer record in the row, and the ‘separate storage’ I have been talking about is structured very similarly to both old-style LOBs and new-style MAX types. The disadvantages are also the same: when SQL Server needs a row-overflow column value it needs to follow the in-row pointer a navigate another chain of pages, just like retrieving a traditional LOB. And Finally… In the example script presented above, the rows with row_id values from 402 to 450 inclusive all exceed the total in-row storage limit of 8060 bytes. A SELECT that references a column in one of those rows that has moved to off-row storage will incur one or more lob logical reads as the storage engine locates the data. The results on your system might vary slightly depending on your settings, of course; but in my tests only column 1 in rows 402-450 moved off-row. You might like to play around with the script – updating columns, changing data type lengths, and so on – to see the effect on lob logical reads and which columns get moved when. You might even see row-overflow columns moving back in-row if they are updated to be smaller (hint: reduce the size of a column entry by at least 1000 bytes if you hope to see this). Be aware that SQL Server will not warn you when it moves ‘ordinary’ variable-length columns into overflow storage, and it can have dramatic effects on performance. It makes more sense than ever to choose column data types sensibly. If you make every column a VARCHAR(8000) or NVARCHAR(4000), and someone stores data that results in a row needing more than 8060 bytes, SQL Server might turn some of your column data into pseudo-LOBs – all without saying a word. Finally, some people make a distinction between ordinary LOBs (those that can hold up to 2GB of data) and the LOB-like structures created by row-overflow (where columns are still limited to 8000 bytes) by referring to row-overflow LOBs as SLOBs. I find that quite appealing, but the ‘S’ stands for ‘small’, which makes expanding the whole acronym a little daft-sounding…small large objects anyone? © Paul White 2011 email: [email protected] twitter: @SQL_Kiwi

Read the article
The Presentation Isn't Over Until It's Over

- by Phil Factor

The senior corporate dignitaries settled into their seats looking important in a blue-suited sort of way. The lights dimmed as I strode out in front to give my presentation. I had ten vital minutes to make my pitch. I was about to dazzle the top management of a large software company who were considering the purchase of my software product. I would present them with a dazzling synthesis of diagrams, graphs, followed by a live demonstration of my software projected from my laptop. My preparation had been meticulous: It had to be: A year’s hard work was at stake, so I’d prepared it to perfection. I stood up and took them all in, with a gaze of sublime confidence. Then the laptop expired. There are several possible alternative plans of action when this happens A. Stare at the smoking laptop vacuously, flapping ones mouth slowly up and down B. Stand frozen like a statue, locked in indecision between fright and flight. C. Run out of the room, weeping D. Pretend that this was all planned E. Abandon the presentation in favour of a stilted and tedious dissertation about the software F. Shake your fist at the sky, and curse the sense of humour of your preferred deity I started for a few seconds on plan B, normally referred to as the ‘Rabbit in the headlamps of the car’ technique. Suddenly, a little voice inside my head spoke. It spoke the famous inane words of Yogi Berra; ‘The game isn't over until it's over.’ ‘Too right’, I thought. What to do? I ran through the alternatives A-F inclusive in my mind but none appealed to me. I was completely unprepared for this. Nowadays, longevity has since taught me more than I wanted to know about the wacky sense of humour of fate, and I would have taken two laptops. I hadn’t, but decided to do the presentation anyway as planned. I started out ignoring the dead laptop, but pretending, instead that it was still working. The audience looked startled. They were expecting plan B to be succeeded by plan C, I suspect. They weren’t used to denial on this scale. After my introductory talk, which didn’t require any visuals, I came to the diagram that described the application I’d written. I’d taken ages over it and it was hot stuff. Well, it would have been had it been projected onto the screen. It wasn’t. Before I describe what happened then, I must explain that I have thespian tendencies. My triumph as Professor Higgins in My Fair Lady at the local operatic society is now long forgotten, but I remember at the time of my finest performance, the moment that, glancing up over the vast audience of moist-eyed faces at the during the poignant scene between Eliza and Higgins at the end, I realised that I had a talent that one day could possibly be harnessed for commercial use I just talked about the diagram as if it was there, but throwing in some extra description. The audience nodded helpfully when I’d done enough. Emboldened, I began a sort of mime, well, more of a ballet, to represent each slide as I came to it. Heaven knows I’d done my preparation and, in my mind’s eye, I could see every detail, but I had to somehow project the reality of that vision to the audience, much the same way any actor playing Macbeth should do the ghost of Banquo. My desperation gave me a manic energy. If you’ve ever demonstrated a windows application entirely by mime, gesture and florid description, you’ll understand the scale of the challenge, but then I had nothing to lose. With a brief sentence of description here and there, and arms flailing whilst outlining the size and shape of graphs and diagrams, I used the many tricks of mime, gesture and body-language learned from playing Captain Hook, or the Sheriff of Nottingham in pantomime. I set out determinedly on my desperate venture. There wasn’t time to do anything but focus on the challenge of the task: the world around me narrowed down to ten faces and my presentation: ten souls who had to be hypnotized into seeing a Windows application: one that was slick, well organized and functional I don’t remember the details. Eight minutes of my life are gone completely. I was a thespian berserker. I know however that I followed the basic plan of building the presentation in a carefully controlled crescendo until the dazzling finale where the results were displayed on-screen. ‘And here you see the results, neatly formatted and grouped carefully to enhance the significance of the figures, together with running trend-graphs!’ I waved a mime to signify an animated window-opening, and looked up, in my first pause, to gaze defiantly at the audience. It was a sight I’ll never forget. Ten pairs of eyes were gazing in rapt attention at the imaginary window, and several pairs of eyes were glancing at the imaginary graphs and figures. I hadn’t had an audience like that since my starring role in Beauty and the Beast. At that moment, I realized that my desperate ploy might work. I sat down, slightly winded, when my ten minutes were up. For the first and last time in my life, the audience of a ‘PowerPoint’ presentation burst into spontaneous applause. ‘Any questions?’ ‘Yes, Have you got an agent?’ Yes, in case you’re wondering, I got the deal. They bought the software product from me there and then. However, it was a life-changing experience for me and I have never ever again trusted technology as part of a presentation. Even if things can’t go wrong, they’ll go wrong and they’ll kill the flow of what you’re presenting. if you can’t do something without the techno-props, then you shouldn’t do it. The greatest lesson of all is that great presentations require preparation and ‘stage-presence’ rather than fancy graphics. They’re a great supporting aid, but they should never dominate to the point that you’re lost without them.

Read the article
When is a Seek not a Seek?

- by Paul White

The following script creates a single-column clustered table containing the integers from 1 to 1,000 inclusive. IF OBJECT_ID(N'tempdb..#Test', N'U') IS NOT NULL DROP TABLE #Test ; GO CREATE TABLE #Test ( id INTEGER PRIMARY KEY CLUSTERED ); ; INSERT #Test (id) SELECT V.number FROM master.dbo.spt_values AS V WHERE V.[type] = N'P' AND V.number BETWEEN 1 AND 1000 ; Let’s say we need to find the rows with values from 100 to 170, excluding any values that divide exactly by 10. One way to write that query would be: SELECT T.id FROM #Test AS T WHERE T.id IN ( 101,102,103,104,105,106,107,108,109, 111,112,113,114,115,116,117,118,119, 121,122,123,124,125,126,127,128,129, 131,132,133,134,135,136,137,138,139, 141,142,143,144,145,146,147,148,149, 151,152,153,154,155,156,157,158,159, 161,162,163,164,165,166,167,168,169 ) ; That query produces a pretty efficient-looking query plan: Knowing that the source column is defined as an INTEGER, we could also express the query this way: SELECT T.id FROM #Test AS T WHERE T.id >= 101 AND T.id <= 169 AND T.id % 10 > 0 ; We get a similar-looking plan: If you look closely, you might notice that the line connecting the two icons is a little thinner than before. The first query is estimated to produce 61.9167 rows – very close to the 63 rows we know the query will return. The second query presents a tougher challenge for SQL Server because it doesn’t know how to predict the selectivity of the modulo expression (T.id % 10 > 0). Without that last line, the second query is estimated to produce 68.1667 rows – a slight overestimate. Adding the opaque modulo expression results in SQL Server guessing at the selectivity. As you may know, the selectivity guess for a greater-than operation is 30%, so the final estimate is 30% of 68.1667, which comes to 20.45 rows. The second difference is that the Clustered Index Seek is costed at 99% of the estimated total for the statement. For some reason, the final SELECT operator is assigned a small cost of 0.0000484 units; I have absolutely no idea why this is so, or what it models. Nevertheless, we can compare the total cost for both queries: the first one comes in at 0.0033501 units, and the second at 0.0034054. The important point is that the second query is costed very slightly higher than the first, even though it is expected to produce many fewer rows (20.45 versus 61.9167). If you run the two queries, they produce exactly the same results, and both complete so quickly that it is impossible to measure CPU usage for a single execution. We can, however, compare the I/O statistics for a single run by running the queries with STATISTICS IO ON: Table '#Test'. Scan count 63, logical reads 126, physical reads 0. Table '#Test'. Scan count 01, logical reads 002, physical reads 0. The query with the IN list uses 126 logical reads (and has a ‘scan count’ of 63), while the second query form completes with just 2 logical reads (and a ‘scan count’ of 1). It is no coincidence that 126 = 63 * 2, by the way. It is almost as if the first query is doing 63 seeks, compared to one for the second query. In fact, that is exactly what it is doing. There is no indication of this in the graphical plan, or the tool-tip that appears when you hover your mouse over the Clustered Index Seek icon. To see the 63 seek operations, you have click on the Seek icon and look in the Properties window (press F4, or right-click and choose from the menu): The Seek Predicates list shows a total of 63 seek operations – one for each of the values from the IN list contained in the first query. I have expanded the first seek node to show the details; it is seeking down the clustered index to find the entry with the value 101. Each of the other 62 nodes expands similarly, and the same information is contained (even more verbosely) in the XML form of the plan. Each of the 63 seek operations starts at the root of the clustered index B-tree and navigates down to the leaf page that contains the sought key value. Our table is just large enough to need a separate root page, so each seek incurs 2 logical reads (one for the root, and one for the leaf). We can see the index depth using the INDEXPROPERTY function, or by using the a DMV: SELECT S.index_type_desc, S.index_depth FROM sys.dm_db_index_physical_stats ( DB_ID(N'tempdb'), OBJECT_ID(N'tempdb..#Test', N'U'), 1, 1, DEFAULT ) AS S ; Let’s look now at the Properties window when the Clustered Index Seek from the second query is selected: There is just one seek operation, which starts at the root of the index and navigates the B-tree looking for the first key that matches the Start range condition (id >= 101). It then continues to read records at the leaf level of the index (following links between leaf-level pages if necessary) until it finds a row that does not meet the End range condition (id <= 169). Every row that meets the seek range condition is also tested against the Residual Predicate highlighted above (id % 10 > 0), and is only returned if it matches that as well. You will not be surprised that the single seek (with a range scan and residual predicate) is much more efficient than 63 singleton seeks. It is not 63 times more efficient (as the logical reads comparison would suggest), but it is around three times faster. Let’s run both query forms 10,000 times and measure the elapsed time: DECLARE @i INTEGER, @n INTEGER = 10000, @s DATETIME = GETDATE() ; SET NOCOUNT ON; SET STATISTICS XML OFF; ; WHILE @n > 0 BEGIN SELECT @i = T.id FROM #Test AS T WHERE T.id IN ( 101,102,103,104,105,106,107,108,109, 111,112,113,114,115,116,117,118,119, 121,122,123,124,125,126,127,128,129, 131,132,133,134,135,136,137,138,139, 141,142,143,144,145,146,147,148,149, 151,152,153,154,155,156,157,158,159, 161,162,163,164,165,166,167,168,169 ) ; SET @n -= 1; END ; PRINT DATEDIFF(MILLISECOND, @s, GETDATE()) ; GO DECLARE @i INTEGER, @n INTEGER = 10000, @s DATETIME = GETDATE() ; SET NOCOUNT ON ; WHILE @n > 0 BEGIN SELECT @i = T.id FROM #Test AS T WHERE T.id >= 101 AND T.id <= 169 AND T.id % 10 > 0 ; SET @n -= 1; END ; PRINT DATEDIFF(MILLISECOND, @s, GETDATE()) ; On my laptop, running SQL Server 2008 build 4272 (SP2 CU2), the IN form of the query takes around 830ms and the range query about 300ms. The main point of this post is not performance, however – it is meant as an introduction to the next few parts in this mini-series that will continue to explore scans and seeks in detail. When is a seek not a seek? When it is 63 seeks © Paul White 2011 email: [email protected] twitter: @SQL_kiwi

Read the article
CodePlex Daily Summary for Monday, July 15, 2013

CodePlex Daily Summary for Monday, July 15, 2013Popular ReleasesMVC Forum: MVC Forum v1.0: Finally version 1.0 is here! We have been fixing a few bugs, and making sure the release is as stable as possible. We have also changed the way configuration of the application works, mostly how to add your own code or replace some of the standard code with your own. If you download and use our software, please give us some sort of feedback, good or bad!SharePoint 2013 TypeScript Definitions: Release 1.1: TypeScript 0.9 support SharePoint TypeScript Definitions are now compliant with the new version of TypeScript TypeScript generics are now used for defining collection classes (derivatives of SP.ClientCollection object) Improved coverage Added mQuery definitions (m$) Added SPClientAutoFill definitions SP.Utilities namespace is now fully covered SP.UI modal dialog definitions improved CSR definitions improved, added some missing methods and context properties, mostly related to list ...GoAgent GUI: GoAgent GUI ??? 1.0.0: ????GoAgent GUI????，???????????.Net Framework 4.0 ???????: Windows 8 x64 Windows 8 x86 Windows 7 x64 Windows 7 x86 ???????： Windows 8.1 Preview (x86/x64) ???????： Windows XP ????： ????????GoAgent????，????????，?????????????，???????????????????，??????????，????。PiGraph: PiGraph 2.0.8.13: C?p nh?t:Các l?i dã s?a: S?a l?i không nh?p du?c s? âm. L?i tabindex trong giao di?n Thêm hàm Các l?i chua kh?c ph?c: L?i ghi chú nh?p nháy màu. L?i khung ghi chú vu?t ra kh?i biên khi luu file. Luu ý:N?u không kh?i d?ng duoc chuong trình, b?n nên c?p nh?t driver card d? h?a phiên b?n m?i nh?t: AMD Graphics Drivers NVIDIA Driver Xem yêu c?u h? th?ngD3D9Client: D3D9Client R12 for Orbiter Beta: D3D9Client release for orbiter BetaVidCoder: 1.4.23: New in 1.4.23 Added French translation. Fixed non-x264 video encoders not sticking in video tab. New in 1.4 Updated HandBrake core to 0.9.9 Blu-ray subtitle (PGS) support Additional framerates: 30, 50, 59.94, 60 Additional sample rates: 8, 11.025, 12 and 16 kHz Additional higher bitrates for audio Same as Source Constant Framerate 24-bit FLAC encoding Added Windows Phone 8 and Apple TV 3 presets Introduced process isolation for encodes. Now if HandBrake crashes, VidCoder will ...Project Server 2013 Event Handler Admin Tool: PSI Event Admin Tool: Download & exract the File. Use LoggerAdmin to upload the event handlers in project server 2013. PSIEventLogger\LoggerAdmin\bin\DebugGherkin editor: Gherkin Editor Beta 2: Fix issue #7 and add some refactoring and code cleanupNew-NuGetPackage PowerShell Script: New-NuGetPackage.ps1 PowerShell Script v1.2: Show nuget gallery to push to when prompting user if they want to push their package.Site Templates By Steve: SharePoint 2010 CORE Site Theme By Steve WSP: Great Site Theme to start with from Steve. See project home page for install instructions. This is a nice centered, mega-menu, fixed width masterpage with styles. Remember to update the mega menu lists.SharePoint Solution Installer: SharePoint Solution Installer V1.2.8: setup2013.exe now supports CompatibilityLevel to target specific hive Use setup.exe for SP2007 & SP2010. Use setup2013.exe for SP2013.TBox - tool to make developer's life easier.: TBox 1.021: 1)Add console unit tests runner, to run unit tests in parallel from console. Also this new sub tool can save valid xml nunit report. It can help you in continuous integration. 2)Fix build scripts.LifeInSharepoint Modern UI Update: Version 2: Some minor improvements, including Audience Targeting support for quick launch links. Also removing all NextDocs references.Virtual Photonics: VTS MATLAB Package 1.0.13 Beta: This is the beta release of the MATLAB package with examples of using the VTS libraries within MATLAB. Changes for this release: Added two new examples to vts_solver_demo.m that: 1) generates and plots R(lambda) at a given rho, and chromophore concentrations assuming a power law for scattering, and 2) solves inverse problem for R(lambda) at given rho. This example solves for concentrations of HbO2, Hb and H20 given simulated measurements created using Nurbs scaled Monte Carlo and inverted u...Advanced Resource Tab for Blend: Advanced Resource Tab: This is the first alpha release of the advanced resource tab for Blend for Visual Studio 2012.Microsoft Ajax Minifier: Microsoft Ajax Minifier 4.96: Fix for issue #19957: EXE should output the name of the file(s) being minified. Discussion #449181: throw a Sev-2 warning when trailing commas are detected on an Array literal. Perfectly legal to do so, but the behavior ends up working differently on different browsers, so throw a cross-browser warning. Add a few more known global names for improved ES6 compatibility update Nuget package to version 2.5 and automatically add the AjaxMin.targets to your project when you update the package...Outlook 2013 Add-In: Categories and Colors: This new version has a major change in the drawing of the list items: - Using owner drawn code to format the appointments using GDI (some flickering may occur, but it looks a little bit better IMHO, with separate sections). - Added category color support (if more than one category, only one color will be shown). Here, the colors Outlook uses are slightly different than the ones available in System.Drawing, so I did a first approach matching. ;-) - Added appointment status support (to show fr...Columbus Remote Desktop: 2.0 Sapphire: Added configuration settings Added update notifications Added ability to disable GPU acceleration Fixed connection bugsLINQ to Twitter: LINQ to Twitter v2.1.07: Supports .NET 3.5, .NET 4.0, .NET 4.5, Silverlight 4.0, Windows Phone 7.1, Windows Phone 8, Client Profile, Windows 8, and Windows Azure. 100% Twitter API coverage. Also supports Twitter API v1.1! Also on NuGet.DotNetNuke® Community Edition CMS: 06.02.08: Major Highlights Fixed issue where the application throws an Unhandled Error and an HTTP Response Code of 200 when the connection to the database is lost. Security FixesNone Updated Modules/Providers ModulesNone ProvidersNoneNew Projects[.Net Intl] harroc_c;mallar_a;olouso_f: The goal of this project is to create a web crawler and a web front who allows you to search in your index. You will create a mini (or large!) search engine basButterfly Storage: Butterfly Storage is a data access technology based on object-oriented database model for Windows Store applications.KaveCompany: KaveCompleave that girl alone: a team project!MyClrProfiler: This project helps you learn about and develop your own CLR profiler.NETDeob: Deobfuscate obfuscated .NET files easilyProgram stomatologie: SummarySimple Graph Library: Simple portable class library for graphs data structures. .NET, Silverlight 4/5, Windows Phone, Windows RT, Xbox 360T6502 Emulator: T6502 is a 6502 emulator written in TypeScript using AngularJS. The goal is well-organized, readable code over performance.WP8 File Access Webserver: C# HTTP server and web application on Windows Phone 8. Implements file access, browsing and downloading.wpadk: wpadk????wp7?????? ?????????，?????、SDK、wpadk?????????????。??????????????????。??????????????????，????wpadk?????????????????????????????????????。xlmUnit: xlmUnit, Unit Testing

Read the article
Developing Schema Compare for Oracle (Part 1)

- by Simon Cooper

SQL Compare is one of Red Gate's most successful SQL Server tools; it allows developers and DBAs to compare and synchronize the contents of their databases. Although similar tools exist for Oracle, they are quite noticeably lacking in the usability and stability that SQL Compare is known for in the SQL Server world. We could see a real need for a usable schema comparison tools for Oracle, and so the Schema Compare for Oracle project was born. Over the next few weeks, as we come up to release of v1, I'll be doing a series of posts on the development of Schema Compare for Oracle. For the first post, I thought I would start with the main pitfalls that we stumbled across when developing the product, especially from a SQL Server background. 1. Schemas and Databases The most obvious difference is that the concept of a 'database' is quite different between Oracle and SQL Server. On SQL Server, one server instance has multiple databases, each with separate schemas. There is typically little communication between separate databases, and most databases are no more than about 1000-2000 objects. This means SQL Compare can register an entire database in a reasonable amount of time, and cross-database dependencies probably won't be an issue. It is a quite different scene under Oracle, however. The terms 'database' and 'instance' are used interchangeably, (although technically 'database' refers to the datafiles on disk, and 'instance' the running Oracle process that reads & writes to the database), and a database is a single conceptual entity. This immediately presents problems, as it is infeasible to register an entire database as we do in SQL Compare; in my Oracle install, using the standard recommended options, there are 63975 system objects. If we tried to register all those, not only would it take hours, but the client would probably run out of memory before we finished. As a result, we had to allow people to specify what schemas they wanted to register. This decision had quite a few knock-on effects for the design, which I will cover in a future post. 2. Connecting to Oracle The next obvious difference is in actually connecting to Oracle – in SQL Server, you can specify a server and database, and off you go. On Oracle things are slightly more complicated. SIDs, Service Names, and TNS A database (the files on disk) must have a unique identifier for the databases on the system, called the SID. It also has a global database name, which consists of a name (which doesn't have to match the SID) and a domain. Alternatively, you can identify a database using a service name, which normally has a 1-to-1 relationship with instances, but may not if, for example, using RAC (Real Application Clusters) for redundancy and failover. You specify the computer and instance you want to connect to using TNS (Transparent Network Substrate). The user-visible parts are a config file (tnsnames.ora) on the client machine that specifies how to connect to an instance. For example, the entry for one of my test instances is: SC_11GDB1 = (DESCRIPTION = (ADDRESS_LIST = (ADDRESS = (PROTOCOL = TCP)(HOST = simonctest)(PORT = 1521)) ) (CONNECT_DATA = (SID = 11gR1db1) ) ) This gives the hostname, port, and SID of the instance I want to connect to, and associates it with a name (SC_11GDB1). The tnsnames syntax also allows you to specify failover, multiple descriptions and address lists, and client load balancing. You can then specify this TNS identifier as the data source in a connection string. Although using ODP.NET (the .NET dlls provided by Oracle) was fine for internal prototype builds, once we released the EAP we discovered that this simply wasn't an acceptable solution for installs on other people's machines. Due to .NET assembly strong naming, users had to have installed on their machines the exact same version of the ODP.NET dlls as we had on our build server. We couldn't ship the ODP.NET dlls with our installer as the Oracle license agreement prohibited this, and we didn't want to force users to install another Oracle client just so they can run our program. To be able to list the TNS entries in the connection dialog, we also had to locate and parse the tnsnames.ora file, which was complicated by users with several Oracle client installs and intricate TNS entries. After much swearing at our computers, we eventually decided to use a third party Oracle connection library from Devart that we could ship with our program; this could use whatever client version was installed, parse the TNS entries for us, and also had the nice feature of being able to connect to an Oracle server without having any client installed at all. Unfortunately, their current license agreement prevents us from shipping an Oracle SDK, but that's a bridge we'll cross when we get to it. 3. Running synchronization scripts The most important difference is that in Oracle, DDL is non-transactional; you cannot rollback DDL statements like you can on SQL Server. Although we considered various solutions to this, including using the flashback archive or recycle bin, or generating an undo script, no reliable method of completely undoing a half-executed sync script has yet been found; so in this case we simply have to trust that the DBA or developer will check and verify the script before running it. However, before we got to that stage, we had to get the scripts to run in the first place... To run a synchronization script from SQL Compare we essentially pass the script over to the SqlCommand.ExecuteNonQuery method. However, when we tried to do the same for an OracleConnection we got a very strange error – 'ORA-00911: invalid character', even when running the most basic CREATE TABLE command. After much hair-pulling and Googling, we discovered that Oracle has got some very strange behaviour with semicolons at the end of statements. To understand what's going on, we need to take a quick foray into SQL and PL/SQL. PL/SQL is not T-SQL In SQL Server, T-SQL is the language used to interface with the database. It has DDL, DML, control flow, and many other nice features (like Turing-completeness) that you can mix and match in the same script. In Oracle, DDL SQL and PL/SQL are two completely separate languages, with different syntax, different datatypes and different execution engines within the instance. Oracle SQL is much more like 'pure' ANSI SQL, with no state, no control flow, and only the basic DML commands. PL/SQL is the Turing-complete language, but can only do DML and DCL (i.e. BEGIN TRANSATION commands). Any DDL or SQL commands that aren't recognised by the PL/SQL engine have to be passed back to the SQL engine via an EXECUTE IMMEDIATE command. In PL/SQL, a semicolons is a valid token used to delimit the end of a statement. In SQL, a semicolon is not a valid token (even though the Oracle documentation gives them at the end of the syntax diagrams) . When you execute the command CREATE TABLE table1 (COL1 NUMBER); in SQL*Plus the semicolon on the end is a command to SQL*Plus to execute the preceding statement on the server; it strips off the semicolon before passing it on. SQL Developer does a similar thing. When executing a PL/SQL block, however, the syntax is like so: BEGIN INSERT INTO table1 VALUES (1); INSERT INTO table1 VALUES (2); END; / In this case, the semicolon is accepted by the PL/SQL engine as a statement delimiter, and instead the / is the command to SQL*Plus to execute the current block. This explains the ORA-00911 error we got when trying to run the CREATE TABLE command – the server is complaining about the semicolon on the end. This also means that there is no SQL syntax to execute more than one DDL command in the same OracleCommand. Therefore, we would have to do a round-trip to the server for every command we want to execute. Obviously, this would cause lots of network traffic and be very slow on slow or congested networks. Our first attempt at a solution was to wrap every SQL statement (without semicolon) inside an EXECUTE IMMEDIATE command in a PL/SQL block and pass that to the server to execute. One downside of this solution is that we get no feedback as to how the script execution is going; we're currently evaluating better solutions to this thorny issue. Next up: Dependencies; how we solved the problem of being unable to register the entire database, and the knock-on effects to the whole product.

Read the article
DTracing TCP congestion control

- by user12820842

In a previous post, I showed how we can use DTrace to probe TCP receive and send window events. TCP receive and send windows are in effect both about flow-controlling how much data can be received - the receive window reflects how much data the local TCP is prepared to receive, while the send window simply reflects the size of the receive window of the peer TCP. Both then represent flow control as imposed by the receiver. However, consider that without the sender imposing flow control, and a slow link to a peer, TCP will simply fill up it's window with sent segments. Dealing with multiple TCP implementations filling their peer TCP's receive windows in this manner, busy intermediate routers may drop some of these segments, leading to timeout and retransmission, which may again lead to drops. This is termed congestion, and TCP has multiple congestion control strategies. We can see that in this example, we need to have some way of adjusting how much data we send depending on how quickly we receive acknowledgement - if we get ACKs quickly, we can safely send more segments, but if acknowledgements come slowly, we should proceed with more caution. More generally, we need to implement flow control on the send side also. Slow Start and Congestion Avoidance From RFC2581, let's examine the relevant variables: "The congestion window (cwnd) is a sender-side limit on the amount of data the sender can transmit into the network before receiving an acknowledgment (ACK). Another state variable, the slow start threshold (ssthresh), is used to determine whether the slow start or congestion avoidance algorithm is used to control data transmission" Slow start is used to probe the network's ability to handle transmission bursts both when a connection is first created and when retransmission timers fire. The latter case is important, as the fact that we have effectively lost TCP data acts as a motivator for re-probing how much data the network can handle from the sending TCP. The congestion window (cwnd) is initialized to a relatively small value, generally a low multiple of the sending maximum segment size. When slow start kicks in, we will only send that number of bytes before waiting for acknowledgement. When acknowledgements are received, the congestion window is increased in size until cwnd reaches the slow start threshold ssthresh value. For most congestion control algorithms the window increases exponentially under slow start, assuming we receive acknowledgements. We send 1 segment, receive an ACK, increase the cwnd by 1 MSS to 2*MSS, send 2 segments, receive 2 ACKs, increase the cwnd by 2*MSS to 4*MSS, send 4 segments etc. When the congestion window exceeds the slow start threshold, congestion avoidance is used instead of slow start. During congestion avoidance, the congestion window is generally updated by one MSS for each round-trip-time as opposed to each ACK, and so cwnd growth is linear instead of exponential (we may receive multiple ACKs within a single RTT). This continues until congestion is detected. If a retransmit timer fires, congestion is assumed and the ssthresh value is reset. It is reset to a fraction of the number of bytes outstanding (unacknowledged) in the network. At the same time the congestion window is reset to a single max segment size. Thus, we initiate slow start until we start receiving acknowledgements again, at which point we can eventually flip over to congestion avoidance when cwnd ssthresh. Congestion control algorithms differ most in how they handle the other indication of congestion - duplicate ACKs. A duplicate ACK is a strong indication that data has been lost, since they often come from a receiver explicitly asking for a retransmission. In some cases, a duplicate ACK may be generated at the receiver as a result of packets arriving out-of-order, so it is sensible to wait for multiple duplicate ACKs before assuming packet loss rather than out-of-order delivery. This is termed fast retransmit (i.e. retransmit without waiting for the retransmission timer to expire). Note that on Oracle Solaris 11, the congestion control method used can be customized. See here for more details. In general, 3 or more duplicate ACKs indicate packet loss and should trigger fast retransmit . It's best not to revert to slow start in this case, as the fact that the receiver knew it was missing data suggests it has received data with a higher sequence number, so we know traffic is still flowing. Falling back to slow start would be excessive therefore, so fast recovery is used instead. Observing slow start and congestion avoidance The following script counts TCP segments sent when under slow start (cwnd ssthresh). #!/usr/sbin/dtrace -s #pragma D option quiet tcp:::connect-request / start[args[1]-cs_cid] == 0/ { start[args[1]-cs_cid] = 1; } tcp:::send / start[args[1]-cs_cid] == 1 && args[3]-tcps_cwnd tcps_cwnd_ssthresh / { @c["Slow start", args[2]-ip_daddr, args[4]-tcp_dport] = count(); } tcp:::send / start[args[1]-cs_cid] == 1 && args[3]-tcps_cwnd args[3]-tcps_cwnd_ssthresh / { @c["Congestion avoidance", args[2]-ip_daddr, args[4]-tcp_dport] = count(); } As we can see the script only works on connections initiated since it is started (using the start[] associative array with the connection ID as index to set whether it's a new connection (start[cid] = 1). From there we simply differentiate send events where cwnd ssthresh (congestion avoidance). Here's the output taken when I accessed a YouTube video (where rport is 80) and from an FTP session where I put a large file onto a remote system. # dtrace -s tcp_slow_start.d ^C ALGORITHM RADDR RPORT #SEG Slow start 10.153.125.222 20 6 Slow start 138.3.237.7 80 14 Slow start 10.153.125.222 21 18 Congestion avoidance 10.153.125.222 20 1164 We see that in the case of the YouTube video, slow start was exclusively used. Most of the segments we sent in that case were likely ACKs. Compare this case - where 14 segments were sent using slow start - to the FTP case, where only 6 segments were sent before we switched to congestion avoidance for 1164 segments. In the case of the FTP session, the FTP data on port 20 was predominantly sent with congestion avoidance in operation, while the FTP session relied exclusively on slow start. For the default congestion control algorithm - "newreno" - on Solaris 11, slow start will increase the cwnd by 1 MSS for every acknowledgement received, and by 1 MSS for each RTT in congestion avoidance mode. Different pluggable congestion control algorithms operate slightly differently. For example "highspeed" will update the slow start cwnd by the number of bytes ACKed rather than the MSS. And to finish, here's a neat oneliner to visually display the distribution of congestion window values for all TCP connections to a given remote port using a quantization. In this example, only port 80 is in use and we see the majority of cwnd values for that port are in the 4096-8191 range. # dtrace -n 'tcp:::send { @q[args[4]-tcp_dport] = quantize(args[3]-tcps_cwnd); }' dtrace: description 'tcp:::send ' matched 10 probes ^C 80 value ------------- Distribution ------------- count -1 | 0 0 |@@@@@@ 5 1 | 0 2 | 0 4 | 0 8 | 0 16 | 0 32 | 0 64 | 0 128 | 0 256 | 0 512 | 0 1024 | 0 2048 |@@@@@@@@@ 8 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@ 23 8192 | 0

Read the article
Source-control 'wet-work'?

- by Phil Factor

When a design or creative work is flawed beyond remedy, it is often best to destroy it and start again. The other day, I lost the code to a long and intricate SQL batch I was working on. I’d thought it was impossible, but it happened. With all the technology around that is designed to prevent this occurring, this sort of accident has become a rare event. If it weren’t for a deranged laptop, and my distraction, the code wouldn’t have been lost this time. As always, I sighed, had a soothing cup of tea, and typed it all in again. The new code I hastily tapped in was much better: I’d held in my head the essence of how the code should work rather than the details: I now knew for certain the start point, the end, and how it should be achieved. Instantly the detritus of half-baked thoughts fell away and I was able to write logical code that performed better. Because I could work so quickly, I was able to hold the details of all the columns and variables in my head, and the dynamics of the flow of data. It was, in fact, easier and quicker to start from scratch rather than tidy up and refactor the existing code with its inevitable fumbling and half-baked ideas. What a shame that technology is now so good that developers rarely experience the cleansing shock of losing one’s code and having to rewrite it from scratch. If you’ve never accidentally lost your code, then it is worth doing it deliberately once for the experience. Creative people have, until Technology mistakenly prevented it, torn up their drafts or sketches, threw them in the bin, and started again from scratch. Leonardo’s obsessive reworking of the Mona Lisa was renowned because it was so unusual: Most artists have been utterly ruthless in destroying work that didn’t quite make it. Authors are particularly keen on writing afresh, and the results are generally positive. Lawrence of Arabia actually lost the entire 250,000 word manuscript of ‘The Seven Pillars of Wisdom’ by accidentally leaving it on a train at Reading station, before rewriting a much better version. Now, any writer or artist is seduced by technology into altering or refining their work rather than casting it dramatically in the bin or setting a light to it on a bonfire, and rewriting it from the blank page. It is easy to pick away at a flawed work, but the real creative process is far more brutal. Once, many years ago whilst running a software house that supplied commercial software to local businesses, I’d been supervising an accounting system for a farming cooperative. No packaged system met their needs, and it was all hand-cut code. For us, it represented a breakthrough as it was for a government organisation, and success would guarantee more contracts. As you’ve probably guessed, the code got mangled in a disk crash just a week before the deadline for delivery, and the many backups all proved to be entirely corrupted by a faulty tape drive. There were some fragments left on individual machines, but they were all of different versions. The developers were in despair. Strangely, I managed to re-write the bulk of a three-month project in a manic and caffeine-soaked weekend. Sure, that elegant universally-applicable input-form routine was‘nt quite so elegant, but it didn’t really need to be as we knew what forms it needed to support. Yes, the code lacked architectural elegance and reusability. By dawn on Monday, the application passed its integration tests. The developers rose to the occasion after I’d collapsed, and tidied up what I’d done, though they were reproachful that some of the style and elegance had gone out of the application. By the delivery date, we were able to install it. It was a smaller, faster application than the beta they’d seen and the user-interface had a new, rather Spartan, appearance that we swore was done to conform to the latest in user-interface guidelines. (we switched to Helvetica font to look more ‘Bauhaus’ ). The client was so delighted that he forgave the new bugs that had crept in. I still have the disk that crashed, up in the attic. In IT, we have had mixed experiences from complete re-writes. Lotus 123 never really recovered from a complete rewrite from assembler into C, Borland made the mistake with Arago and Quattro Pro and Netscape’s complete rewrite of their Navigator 4 browser was a white-knuckle ride. In all cases, the decision to rewrite was a result of extreme circumstances where no other course of action seemed possible. The rewrite didn’t come out of the blue. I prefer to remember the rewrite of Minix by young Linus Torvalds, or the rewrite of Bitkeeper by a slightly older Linus. The rewrite of CP/M didn’t do too badly either, did it? Come to think of it, the guy who decided to rewrite the windowing system of the Xerox Star never regretted the decision. I’ll agree that one should often resist calls for a rewrite. One of the worst habits of the more inexperienced programmer is to denigrate whatever code he or she inherits, and then call loudly for a complete rewrite. They are buoyed up by the mistaken belief that they can do better. This, however, is a different psychological phenomenon, more related to the idea of some motorcyclists that they are operating on infinite lives, or the occasional squaddies that if they charge the machine-guns determinedly enough all will be well. Grim experience brings out the humility in any experienced programmer. I’m referring to quite different circumstances here. Where a team knows the requirements perfectly, are of one mind on methodology and coding standards, and they already have a solution, then what is wrong with considering a complete rewrite? Rewrites are so painful in the early stages, until that point where one realises the payoff, that even I quail at the thought. One needs a natural disaster to push one over the edge. The trouble is that source-control systems, and disaster recovery systems, are just too good nowadays. If I were to lose this draft of this very blog post, I know I’d rewrite it much better. However, if you read this, you’ll know I didn’t have the nerve to delete it and start again. There was a time that one prayed that unreliable hardware would deliver you from an unmaintainable mess of a codebase, but now technology has made us almost entirely immune to such a merciful act of God. An old friend of mine with long experience in the software industry has long had the idea of the ‘source-control wet-work’, where one hires a malicious hacker in some wild eastern country to hack into one’s own source control system to destroy all trace of the source to an application. Alas, backup systems are just too good to make this any more than a pipedream. Somehow, it would be difficult to promote the idea. As an alternative, could one construct a source control system that, on doing all the code-quality metrics, would systematically destroy all trace of source code that failed the quality test? Alas, I can’t see many managers buying into the idea. In reading the full story of the near-loss of Toy Story 2, it set me thinking. It turned out that the lucky restoration of the code wasn’t the happy ending one first imagined it to be, because they eventually came to the conclusion that the plot was fundamentally flawed and it all had to be rewritten anyway. Was this an early case of the ‘source-control wet-job’?’ It is very hard nowadays to do a rapid U-turn in a development project because we are far too prone to cling to our existing source-code.

Read the article
Fun with Aggregates

- by Paul White

There are interesting things to be learned from even the simplest queries. For example, imagine you are given the task of writing a query to list AdventureWorks product names where the product has at least one entry in the transaction history table, but fewer than ten. One possible query to meet that specification is: SELECT p.Name FROM Production.Product AS p JOIN Production.TransactionHistory AS th ON p.ProductID = th.ProductID GROUP BY p.ProductID, p.Name HAVING COUNT_BIG(*) < 10; That query correctly returns 23 rows (execution plan and data sample shown below): The execution plan looks a bit different from the written form of the query: the base tables are accessed in reverse order, and the aggregation is performed before the join. The general idea is to read all rows from the history table, compute the count of rows grouped by ProductID, merge join the results to the Product table on ProductID, and finally filter to only return rows where the count is less than ten. This ‘fully-optimized’ plan has an estimated cost of around 0.33 units. The reason for the quote marks there is that this plan is not quite as optimal as it could be – surely it would make sense to push the Filter down past the join too? To answer that, let’s look at some other ways to formulate this query. This being SQL, there are any number of ways to write logically-equivalent query specifications, so we’ll just look at a couple of interesting ones. The first query is an attempt to reverse-engineer T-SQL from the optimized query plan shown above. It joins the result of pre-aggregating the history table to the Product table before filtering: SELECT p.Name FROM ( SELECT th.ProductID, cnt = COUNT_BIG(*) FROM Production.TransactionHistory AS th GROUP BY th.ProductID ) AS q1 JOIN Production.Product AS p ON p.ProductID = q1.ProductID WHERE q1.cnt < 10; Perhaps a little surprisingly, we get a slightly different execution plan: The results are the same (23 rows) but this time the Filter is pushed below the join! The optimizer chooses nested loops for the join, because the cardinality estimate for rows passing the Filter is a bit low (estimate 1 versus 23 actual), though you can force a merge join with a hint and the Filter still appears below the join. In yet another variation, the < 10 predicate can be ‘manually pushed’ by specifying it in a HAVING clause in the “q1” sub-query instead of in the WHERE clause as written above. The reason this predicate can be pushed past the join in this query form, but not in the original formulation is simply an optimizer limitation – it does make efforts (primarily during the simplification phase) to encourage logically-equivalent query specifications to produce the same execution plan, but the implementation is not completely comprehensive. Moving on to a second example, the following query specification results from phrasing the requirement as “list the products where there exists fewer than ten correlated rows in the history table”: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID HAVING COUNT_BIG(*) < 10 ); Unfortunately, this query produces an incorrect result (86 rows): The problem is that it lists products with no history rows, though the reasons are interesting. The COUNT_BIG(*) in the EXISTS clause is a scalar aggregate (meaning there is no GROUP BY clause) and scalar aggregates always produce a value, even when the input is an empty set. In the case of the COUNT aggregate, the result of aggregating the empty set is zero (the other standard aggregates produce a NULL). To make the point really clear, let’s look at product 709, which happens to be one for which no history rows exist: -- Scalar aggregate SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = 709; -- Vector aggregate SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = 709 GROUP BY th.ProductID; The estimated execution plans for these two statements are almost identical: You might expect the Stream Aggregate to have a Group By for the second statement, but this is not the case. The query includes an equality comparison to a constant value (709), so all qualified rows are guaranteed to have the same value for ProductID and the Group By is optimized away. In fact there are some minor differences between the two plans (the first is auto-parameterized and qualifies for trivial plan, whereas the second is not auto-parameterized and requires cost-based optimization), but there is nothing to indicate that one is a scalar aggregate and the other is a vector aggregate. This is something I would like to see exposed in show plan so I suggested it on Connect. Anyway, the results of running the two queries show the difference at runtime: The scalar aggregate (no GROUP BY) returns a result of zero, whereas the vector aggregate (with a GROUP BY clause) returns nothing at all. Returning to our EXISTS query, we could ‘fix’ it by changing the HAVING clause to reject rows where the scalar aggregate returns zero: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID HAVING COUNT_BIG(*) BETWEEN 1 AND 9 ); The query now returns the correct 23 rows: Unfortunately, the execution plan is less efficient now – it has an estimated cost of 0.78 compared to 0.33 for the earlier plans. Let’s try adding a redundant GROUP BY instead of changing the HAVING clause: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY th.ProductID HAVING COUNT_BIG(*) < 10 ); Not only do we now get correct results (23 rows), this is the execution plan: I like to compare that plan to quantum physics: if you don’t find it shocking, you haven’t understood it properly :) The simple addition of a redundant GROUP BY has resulted in the EXISTS form of the query being transformed into exactly the same optimal plan we found earlier. What’s more, in SQL Server 2008 and later, we can replace the odd-looking GROUP BY with an explicit GROUP BY on the empty set: SELECT p.Name FROM Production.Product AS p WHERE EXISTS ( SELECT * FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () HAVING COUNT_BIG(*) < 10 ); I offer that as an alternative because some people find it more intuitive (and it perhaps has more geek value too). Whichever way you prefer, it’s rather satisfying to note that the result of the sub-query does not exist for a particular correlated value where a vector aggregate is used (the scalar COUNT aggregate always returns a value, even if zero, so it always ‘EXISTS’ regardless which ProductID is logically being evaluated). The following query forms also produce the optimal plan and correct results, so long as a vector aggregate is used (you can probably find more equivalent query forms): WHERE Clause SELECT p.Name FROM Production.Product AS p WHERE ( SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () ) < 10; APPLY SELECT p.Name FROM Production.Product AS p CROSS APPLY ( SELECT NULL FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () HAVING COUNT_BIG(*) < 10 ) AS ca (dummy); FROM Clause SELECT q1.Name FROM ( SELECT p.Name, cnt = ( SELECT COUNT_BIG(*) FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID GROUP BY () ) FROM Production.Product AS p ) AS q1 WHERE q1.cnt < 10; This last example uses SUM(1) instead of COUNT and does not require a vector aggregate…you should be able to work out why :) SELECT q.Name FROM ( SELECT p.Name, cnt = ( SELECT SUM(1) FROM Production.TransactionHistory AS th WHERE th.ProductID = p.ProductID ) FROM Production.Product AS p ) AS q WHERE q.cnt < 10; The semantics of SQL aggregates are rather odd in places. It definitely pays to get to know the rules, and to be careful to check whether your queries are using scalar or vector aggregates. As we have seen, query plans do not show in which ‘mode’ an aggregate is running and getting it wrong can cause poor performance, wrong results, or both. © 2012 Paul White Twitter: @SQL_Kiwi email: [email protected]

Read the article
Free Document/Content Management System Using SharePoint 2010

- by KunaalKapoor

That’s right, it’s true. You can use the free version of SharePoint 2010 to meet your document and content management needs and even run your public facing website or an internal knowledge bank. SharePoint Foundation 2010 is free. It may not have all the features that you get in the enterprise license but it still has enough to cater to your needs to build a document management system and replace age old file shares or folders. I’ve built a dozen content management sites for internal and public use exploiting SharePoint. There are hundreds of web content management systems out there (see CMS Matrix). On one hand we have commercial platforms like SharePoint, SiteCore, and Ektron etc. which are the most frequently used and on the other hand there are free options like WordPress, Drupal, Joomla, and Plone etc. which are pretty common popular as well. But I would be very surprised if anyone was able to find a single CMS platform that is all things to all people. Infact not a lot of people consider SharePoint’s free version under the free CMS side but its high time organizations benefit from this. Through this blog post I wanted to present SharePoint Foundation as an option for running a FREE CMS platform. Even if you knew that there is a free version of SharePoint, what most people don’t realize is that SharePoint Foundation is a great option for running web sites of all kinds – not just team sites. It is a great option for many reasons, but in reality it is supported by Microsoft, and above all it is FREE (yay!), and it is extremely easy to get started. From a functionality perspective – it’s hard to beat SharePoint. Even the free version, SharePoint Foundation, offers simple data connectivity (through BCS), cross browser support, accessibility, support for Office Web Apps, blogs, wikis, templates, document support, health analyzer, support for presence, and MUCH more.I often get asked: “Can I use SharePoint 2010 as a document management system?” The answer really depends on · What are your specific requirements? · What systems you currently have in place for managing documents. · And of course how much money you have J Benefits? Not many large organizations have benefited from SharePoint yet. For some it has been an IT project to see what they can achieve with it, for others it has been used as a collaborative platform or in many cases an extended intranet. SharePoint 2010 has changed the game slightly as the improvements that Microsoft have made have been noted by organizations, and we are seeing a lot of companies starting to build specific business applications using SharePoint as the basis, and nearly every business process will require documents at some stage. If you require a document management system and have SharePoint in place then it can be a relatively straight forward decision to use SharePoint, as long as you have reviewed the considerations just discussed. The collaborative nature of SharePoint 2010 is also a massive advantage, as specific departmental or project sites can be created quickly and easily that allow workers to interact in a variety of different ways using one source of information. This also benefits an organization with regards to how they manage the knowledge that they have, as if all of their information is in one source then it is naturally easier to search and manage. Is SharePoint right for your organization? As just discussed, this can only be determined after defining your requirements and also planning a longer term strategy for how you will manage your documents and information. A key factor to look at is how the users would interact with the system and how much value would it get for your organization. The amount of data and documents that organizations are creating is increasing rapidly each year. Therefore the ability to archive this information, whilst keeping the ability to know what you have and where it is, is vital to any organizations management of their information life cycle. SharePoint is best used for the initial life of business documents where they need to be referenced and accessed after time. It is often beneficial to archive these to overcome for storage and performance issues. FREE CMS – SharePoint, Really? In order to show some of the completely of what comes with this free version of SharePoint 2010, I thought it would make sense to use Wikipedia (since every one trusts it as a credible source). Wikipedia shows that a web content management system typically has the following components: Document Management: - CMS software may provide a means of managing the life cycle of a document from initial creation time, through revisions, publication, archive, and document destruction. SharePoint is king when it comes to document management. Version history, exclusive check-out, security, publication, workflow, and so much more. Content Virtualization: - CMS software may provide a means of allowing each user to work within a virtual copy of the entire Web site, document set, and/or code base. This enables changes to multiple interdependent resources to be viewed and/or executed in-context prior to submission. Through the use of versioning, each content manager can preview, publish, and roll-back content of pages, wiki entries, blog posts, documents, or any other type of content stored in SharePoint. The idea of each user having an entire copy of the website virtualized is a bit odd to me – not sure why anyone would need that for anything but the simplest of websites. Automated Templates: - Create standard output templates that can be automatically applied to new and existing content, allowing the appearance of all content to be changed from one central place. Through the use of Master Pages and Themes, SharePoint provides the ability to change the entire look and feel of site. Of course, the older brother version of SharePoint – SharePoint Server 2010 – also introduces the concept of Page Layouts which allows page template level customization and even switching the layout of an individual page using different page templates. I think many organizations really think they want this but rarely end up using this bit of functionality. Easy Edits: - Once content is separated from the visual presentation of a site, it usually becomes much easier and quicker to edit and manipulate. Most WCMS software includes WYSIWYG editing tools allowing non-technical individuals to create and edit content. This is probably easier described with a screen cap of a vanilla SharePoint Foundation page in edit mode. Notice the page editing toolbar, the multiple layout options… It’s actually easier to use than Microsoft Word. Workflow management: - Workflow is the process of creating cycles of sequential and parallel tasks that must be accomplished in the CMS. For example, a content creator can submit a story, but it is not published until the copy editor cleans it up and the editor-in-chief approves it. Workflow, it’s in there. In fact, the same workflow engine is running under SharePoint Foundation that is running under the other versions of SharePoint. The primary difference is that with SharePoint Foundation – you need to configure the workflows yourself. Web Standards: - Active WCMS software usually receives regular updates that include new feature sets and keep the system up to current web standards. SharePoint is in the fourth major iteration under Microsoft with the 2010 release. In addition to the innovation that Microsoft continuously adds, you have the entire global ecosystem available. Scalable Expansion: - Available in most modern WCMSs is the ability to expand a single implementation (one installation on one server) across multiple domains. SharePoint Foundation can run multiple sites using multiple URLs on a single server install. Even more powerful, SharePoint Foundation is scalable and can be part of a multi-server farm to ensure that it will handle any amount of traffic that can be thrown at it. Delegation & Security: - Some CMS software allows for various user groups to have limited privileges over specific content on the website, spreading out the responsibility of content management. SharePoint Foundation provides very granular security capabilities. Read @ http://msdn.microsoft.com/en-us/library/ee537811.aspx Content Syndication: - CMS software often assists in content distribution by generating RSS and Atom data feeds to other systems. They may also e-mail users when updates are available as part of the workflow process. SharePoint Foundation nails it. With RSS syndication and email alerts available out of the box, content syndication is already in the platform. Multilingual Support: - Ability to display content in multiple languages. SharePoint Foundation 2010 supports more than 40 languages. Read More Read more @ http://msdn.microsoft.com/en-us/library/dd776256(v=office.12).aspxYou can download the free version from http://www.microsoft.com/en-us/download/details.aspx?id=5970

Read the article
Why Is Vertical Resolution Monitor Resolution so Often a Multiple of 360?

- by Jason Fitzpatrick

Stare at a list of monitor resolutions long enough and you might notice a pattern: many of the vertical resolutions, especially those of gaming or multimedia displays, are multiples of 360 (720, 1080, 1440, etc.) But why exactly is this the case? Is it arbitrary or is there something more at work? Today’s Question & Answer session comes to us courtesy of SuperUser—a subdivision of Stack Exchange, a community-driven grouping of Q&A web sites. The Question SuperUser reader Trojandestroy recently noticed something about his display interface and needs answers: YouTube recently added 1440p functionality, and for the first time I realized that all (most?) vertical resolutions are multiples of 360. Is this just because the smallest common resolution is 480×360, and it’s convenient to use multiples? (Not doubting that multiples are convenient.) And/or was that the first viewable/conveniently sized resolution, so hardware (TVs, monitors, etc) grew with 360 in mind? Taking it further, why not have a square resolution? Or something else unusual? (Assuming it’s usual enough that it’s viewable). Is it merely a pleasing-the-eye situation? So why have the display be a multiple of 360? The Answer SuperUser contributor User26129 offers us not just an answer as to why the numerical pattern exists but a history of screen design in the process: Alright, there are a couple of questions and a lot of factors here. Resolutions are a really interesting field of psychooptics meeting marketing. First of all, why are the vertical resolutions on youtube multiples of 360. This is of course just arbitrary, there is no real reason this is the case. The reason is that resolution here is not the limiting factor for Youtube videos – bandwidth is. Youtube has to re-encode every video that is uploaded a couple of times, and tries to use as little re-encoding formats/bitrates/resolutions as possible to cover all the different use cases. For low-res mobile devices they have 360×240, for higher res mobile there’s 480p, and for the computer crowd there is 360p for 2xISDN/multiuser landlines, 720p for DSL and 1080p for higher speed internet. For a while there were some other codecs than h.264, but these are slowly being phased out with h.264 having essentially ‘won’ the format war and all computers being outfitted with hardware codecs for this. Now, there is some interesting psychooptics going on as well. As I said: resolution isn’t everything. 720p with really strong compression can and will look worse than 240p at a very high bitrate. But on the other side of the spectrum: throwing more bits at a certain resolution doesn’t magically make it better beyond some point. There is an optimum here, which of course depends on both resolution and codec. In general: the optimal bitrate is actually proportional to the resolution. So the next question is: what kind of resolution steps make sense? Apparently, people need about a 2x increase in resolution to really see (and prefer) a marked difference. Anything less than that and many people will simply not bother with the higher bitrates, they’d rather use their bandwidth for other stuff. This has been researched quite a long time ago and is the big reason why we went from 720×576 (415kpix) to 1280×720 (922kpix), and then again from 1280×720 to 1920×1080 (2MP). Stuff in between is not a viable optimization target. And again, 1440P is about 3.7MP, another ~2x increase over HD. You will see a difference there. 4K is the next step after that. Next up is that magical number of 360 vertical pixels. Actually, the magic number is 120 or 128. All resolutions are some kind of multiple of 120 pixels nowadays, back in the day they used to be multiples of 128. This is something that just grew out of LCD panel industry. LCD panels use what are called line drivers, little chips that sit on the sides of your LCD screen that control how bright each subpixel is. Because historically, for reasons I don’t really know for sure, probably memory constraints, these multiple-of-128 or multiple-of-120 resolutions already existed, the industry standard line drivers became drivers with 360 line outputs (1 per subpixel). If you would tear down your 1920×1080 screen, I would be putting money on there being 16 line drivers on the top/bottom and 9 on one of the sides. Oh hey, that’s 16:9. Guess how obvious that resolution choice was back when 16:9 was ‘invented’. Then there’s the issue of aspect ratio. This is really a completely different field of psychology, but it boils down to: historically, people have believed and measured that we have a sort of wide-screen view of the world. Naturally, people believed that the most natural representation of data on a screen would be in a wide-screen view, and this is where the great anamorphic revolution of the ’60s came from when films were shot in ever wider aspect ratios. Since then, this kind of knowledge has been refined and mostly debunked. Yes, we do have a wide-angle view, but the area where we can actually see sharply – the center of our vision – is fairly round. Slightly elliptical and squashed, but not really more than about 4:3 or 3:2. So for detailed viewing, for instance for reading text on a screen, you can utilize most of your detail vision by employing an almost-square screen, a bit like the screens up to the mid-2000s. However, again this is not how marketing took it. Computers in ye olden days were used mostly for productivity and detailed work, but as they commoditized and as the computer as media consumption device evolved, people didn’t necessarily use their computer for work most of the time. They used it to watch media content: movies, television series and photos. And for that kind of viewing, you get the most ‘immersion factor’ if the screen fills as much of your vision (including your peripheral vision) as possible. Which means widescreen. But there’s more marketing still. When detail work was still an important factor, people cared about resolution. As many pixels as possible on the screen. SGI was selling almost-4K CRTs! The most optimal way to get the maximum amount of pixels out of a glass substrate is to cut it as square as possible. 1:1 or 4:3 screens have the most pixels per diagonal inch. But with displays becoming more consumery, inch-size became more important, not amount of pixels. And this is a completely different optimization target. To get the most diagonal inches out of a substrate, you want to make the screen as wide as possible. First we got 16:10, then 16:9 and there have been moderately successful panel manufacturers making 22:9 and 2:1 screens (like Philips). Even though pixel density and absolute resolution went down for a couple of years, inch-sizes went up and that’s what sold. Why buy a 19″ 1280×1024 when you can buy a 21″ 1366×768? Eh… I think that about covers all the major aspects here. There’s more of course; bandwidth limits of HDMI, DVI, DP and of course VGA played a role, and if you go back to the pre-2000s, graphics memory, in-computer bandwdith and simply the limits of commercially available RAMDACs played an important role. But for today’s considerations, this is about all you need to know. Have something to add to the explanation? Sound off in the the comments. Want to read more answers from other tech-savvy Stack Exchange users? Check out the full discussion thread here.

Read the article

Search Results

Search found 1962 results on 79 pages for 'slightly offtopic'.

Page 71/79 | < Previous Page | 67 68 69 70 71 72 73 74 75 76 77 78 | Next Page >

- by Paul White

- by Jan Tielens

- by Adam M.

- by Dennis Vroegop

- by Simon Cooper

- by Pinal Dave

- by DavidWimbush

- by Jeff Victor

- by Simon Cooper

- by BuckWoody

- by Phil Factor

- by Timothy Klenke

- by rgillen

- by Paul White

- by Phil Factor

- by Paul White

- by Simon Cooper

- by user12820842

- by Phil Factor

- by Paul White

- by KunaalKapoor

- by Jason Fitzpatrick

< Previous Page | 67 68 69 70 71 72 73 74 75 76 77 78 | Next Page >