Search Results

Search found 9254 results on 371 pages for 'approach'.

Page 84/371 | < Previous Page | 80 81 82 83 84 85 86 87 88 89 90 91 | Next Page >

How can I create a rules engine without using eval() or exec()?

- by Angela

I have a simple rules/conditions table in my database which is used to generate alerts for one of our systems. I want to create a rules engine or a domain specific language. A simple rule stored in this table would be..(omitting the relationships here) if temp > 40 send email Please note there would be many more such rules. A script runs once daily to evaluate these rules and perform the necessary actions. At the beginning, there was only one rule, so we had the script in place to only support that rule. However we now need to make it more scalable to support different conditions/rules. I have looked into rules engines , but I hope to achieve this in some simple pythonic way. At the moment, I have only come up with eval/exec and I know that is not the most recommended approach. So, what would be the best way to accomplish this?? ( The rules are stored as data in database so each object like "temperature", condition like "/=..etc" , value like "40,50..etc" and action like "email, sms, etc.." are stored in the database, i retrieve this to form the condition...if temp 50 send email, that was my idea to then use exec or eval on them to make it live code..but not sure if this is the right approach )

Read the article
Passing XML data and user from coldfusion page to .NET page

- by Mark Rullo

I'd appreciate some input on this situation, I can't figure out the best way to do this. I have some data that's being prepared for me in a ColdFusion app and in an IFrame within the CF app we want to display some graphs (not strictly an image, it's an entire page) being generated on the .NET side of things. I'd like to pass XML data from the CF side to .NET as well as the user. On the .NET side I'm putting the data in a session so the user can sift through it without the need to have it re-queried and re-passed from CF. What I've tried: Generating XML with CF, putting it in a hidden form field, auto-submitting (with JS) a the form to the .NET side. The issue I'm having with this approach is the encoding being done on the form post. The data has entries like <entry data="hello & goodbye">. It's an issue because it's being URL encdeded, Posted, and when I get it on the .NET side I get <entry data="hello & goodbye"> which isn't properly formed XML. What I'd like to avoid: An intermediary DB approach (dropping the data in a DB on CF, picking it up with .NET) I'd like to only display what is passed to the page. I have security concerns with the data, it's very sensitive. Passing the data to a webservice, returning a GUID, forwarding the user with a URL Parameter to access the passed in data. I think that'd be risky if someone happened on a link to that data. I can't take that risk. I was thinking of passing the data with JSON, but I'm very unfamiliar with it. Thoughts? Thanks for your time folks.

Read the article
c++ simple conditional logging

- by Sunny

Disclaimer: I'm not a c++ developer, I can only do basic things. (I understand pointers, just my knowledge is so rusty, I haven't touch c/c++ for about 20 years :) ) The setup: I have an Outlook addin, written in C#/.Net 1.1. It uses a c++ shim to load. Usually, this works pretty well, and I use in my c# code nlog for logging purposes. But sometimes, the addin fails to load, i.t. it does not hit the managed code at all for me to be able to investigate the problem from the log files. So, I need to hook some basic logging into the c++ shim - just writing in a file. I need to make it as simple as possible for our users to enable. Actually I would prefer not to ship it by default. I was thinking about something, which will check if a specific dll is present (the logging dll), and if so, to use it. Otherwise, it will just not log anything. That way, when I have a user with such a problems, I can send him only the logging dll, the user will save it in the runtime directory, and I'll have the file. I guess this have to be done with some form a factory solution, which returns either a dummy logger, or if the dll is found, a real one. Another option would be to make some simple logger, and rebuild the shim with or w/o using it, based on directives. This is not the desirable approach, because the shim needs to be signed, and I have to instruct the user to make a backup copy of the "real" one, then restore when done, etc., instead of just saving and deleting a dll. I'd appreciate any good suggestion how to approach it, together with links or sample code how to go after this. Cheers

Read the article
synchronizing reads to a java collection

- by jeff

so i want to have an arraylist that stores a series of stock quotes. but i keep track of bid price, ask price and last price for each. of course at any time, the bid ask or last of a given stock can change. i have one thread that updates the prices and one that reads them. i want to make sure that when reading no other thread is updating a price. so i looked at synchronized collection. but that seems to only prevent reading while another thread is adding or deleting an entry to the arraylist. so now i'm onto the wrapper approach: public class Qte_List { private final ArrayList<Qte> the_list; public void UpdateBid(String p_sym, double p_bid){ synchronized (the_list){ Qte q = Qte.FindBySym(the_list, p_sym); q.bid=p_bid;} } public double ReadBid(String p_sym){ synchronized (the_list){ Qte q = Qte.FindBySym(the_list, p_sym); return q.bid;} } so what i want to accomplish with this is only one thread can be doing anything - reading or updating an the_list's contents - at one time. am i approach this right? thanks.

Read the article
Options for organizing android app with multiple independent apps

- by lazyguy

Problem Definition: We have a fairly large app which has multiple use cases such that they are all independent of each other. For example lets say we have a1, a2, a3 & a4 modules that are independent apps or use cases for our main app 'A'. The independent a1, a2, a3, a4 are all purchasable apps such that the user goes to our website instead of play store and activate either a1 or a2 by paying some fees on our website. So basically App 'A' is a free app in play-store and is sort of Dashboard with buttons to launch a1, a2, a3, a4. When the user click on lets say a1 button then we will check if a1 is already installed and launch it but if it is not present then give the user a link to download it. Option 1: Have a main app 'A' and a1, a2, a3, a4 as library project. But with this approach the main app A is too big in size. Option 2: Have a1, a2, a3, a4 build as separate .apk and then put in the assets folder of main app 'A' and then install them as needed. Again size of main app A is bigger. Option 3: Upload a1, a2, a3, a4 to a third party website or play store and download from it as needed. This way the main app remains lighter. Observation: In all these approaches there will be an independent app installed with its own icon on users phone. So basically user can launch from either the Dashboard (which will eventually launch an intent from Activity in a1 app) or user can directly launch app a1. Follow-up Question: Is there any other solution that anyone can suggest to tackle this kind of problem? Another things is by going this approach app a1, a2, a3, a4 can be developed & tested independently of each other.

Read the article
What to set the scalar type to contain a byte []. Entity in MVC2

- by Brad8118

I'm trying out the EF 4.0 and using the Model first approach. I'd like to store images into the database and I'm not sure of the best type for the scalar in the entity. I currently have it(the image scalar type) setup as a binary. From what I have been reading the best way to store the image in the db is a byte[]. So I'm assuming that binary is the way to go. If there is a better way I'd switch. In my controller I have: //file from client to store in the db HttpPostedFileBase file = Request.Files[inputTagName]; if (file.ContentLength > 0) { keyToAdd.Image = new byte[file.ContentLength]; file.InputStream.Write(keyToAdd.Image, 0, file.ContentLength); } This builds fine but when I run it I get an exception writing the stream to keyToAdd.Image. The exception is something like: Method does not exist. Any ideas? Note that when using a EF 4.0 model first approach I only have int16, int32, double, string, decimal, binary, byte, DateTime, Double, Single, and SByte as available types. Thanks

Read the article
Common protected data member in base class?

- by EXP0

I have a base class and several derived classes. The derived classes use some common data, can I just put those common data as protected member of the base class? I know the protected member breaks encapsulation sometimes, so I wonder if there is any good approach. Here is a specific example: class Base{ public: virtual void foo() = 0; void printData(); protected: std::vector<std::string> mData; } class Dr1 : public Base{ public: virtual void foo(); //could change mData } class Dr2 : public Base{ public: virtual void foo(); //could change mData } If I put mData into Dr1 and Dr2 as private member, then I need to put it in both of them, and I can not have printData() in Base since printData() need access to mData unless I make printData() virtual and have identical function in both Dr1 and Dr2, which doesn't make much sense to me. Is there a better way to approach this without using protected member? Thank you.

Read the article
Handling text menu in Python

- by PulpFiction

Hi all. I am trying to create a text based menu in Python. Here is the code: #!/usr/bin/env python def testcaseOutput(): print '1. Add. 2. Subtract. 3. Divide. 4. Multiply' try: answer = int(raw_input('Enter a value (1 - 4) >. ')) except ValueError: print 'Invalid input. Enter a value between 1 -4 .' testcaseOutput() if answer in range(1, 5): return answer else: print 'Invalid input. Enter a value between 1 - 4.' testcaseOutput() My question: When the user enters an invalid input, i.e. not a number, I want this function to get called again. So I used the recursive approach which I think is bad design. I use that approach again in the if answer in range(1, 5). Is there any other way to handle this? I need the prompt called again when there is an invalid input. Also, is there any way I can club the two constraints: check whether input is a number and check whether the number is in the range(1,5) together? As you can see, I am checking that individually.

Read the article
Should I use a class in this: Reading a XML file using lxml.

- by PulpFiction

Hi everyone. This question is in continuation to my previous question, in which I asked about passing around an ElementTree. I need to read the XML files only and to solve this, I decided to create a global ElementTree and then parse it wherever required. My question is: Is this an acceptable practice? I heard global variables are bad. If I don't make it global, I was suggested to make a class. But do I really need to create a class? What benefits would I have from that approach. Note that I would be handling only one ElementTree instance per run, the operations are read-only. If I don't use a class, how and where do I declare that ElementTree so that it available globally? (Note that I would be importing this module) Please answer this question in the respect that I am a beginner to development, and at this stage I can't figure out whether to use a class or just go with the functional style programming approach.

Read the article
How to make a piece of WPF content take up the entire application window

- by Bojin Li

I'm working on an application that contains a number of content areas. I want to implement a behavior such that in response to user input, any of these content areas can be toggled to fit the entire application window, and optionally back to its original position again. I experimented with several approaches and none of them seem optimal for me. Here's what I tried to do: Use the ClipToBoundsProperty on the content I want to make "Full Screen": Doesn't work because only the CanvasPanel seems to fully respect this property. The application need to be localized so I would really like to avoid the CanvasPanel. Use a Grid and collapse the other content areas, such that only the one I want to see is visible, hence taking up the entire screen: This will probably work but doesn't seem easy to implement nor maintain. The "Full Screen" content area could be several levels deep, for example residing inside a Tabcontrol, so I would have to hide the tab headers too etc. Reconstruct the content area in a separate view and display it while hiding the rest: Seems easy enough to do with DataTemplates and my ViewModel objects, but any GUI/View only states are not preserved using this approach. Somehow "lift" the GUI/View I want to "Full Screen" into the separate view and display it while hiding the rest: I don't know how to do this or even if this is possible. Anyway if anyone knows a better approach I would love to know about it. Thanks a lot!

Read the article
Passing huge amounts of data as an hexadecimal (0x123AB...) parameter of a clr stored procedure in s

- by user193655

I post this question has followup of This question, since the thread is not recieving more answers. I'm trying to understand if it is possible to pass as a parameter of a CLR stored procedure a large amount of data as "0x5352532F...". This is to avoid to send the data directly to the CLR stored procedure, instead of sending ti to a temporary DB field and from there passing it as varbinary(max) parmeter to the CLR stored procedure. I have a triple question: 1) is it possible, if yes how? Let's say i want to pass a pdf file to the CLR stored procedure (not the path, the full bits that make up the file). Something like: exec MyCLRStoredProcs.dbo.insertfile @file_remote_path ='c:\temp\test_file.txt' , @file_contents=0x4D5A90000300000004000.... --(this long list is the file content) where insertfile is a stored proc that writes to the server path (at file_remote_path) the binary data I pass as (file_contents). 2) is it there corruption risk of adopting this approach (or it is the same approach that sql server uses behind the scenes)? 3) how to convert the content of a file into the "0x23423..." hexadecimal representation

Read the article
SQL queries to determine all values that would satisfy an arbitrary query

- by jasterm007

I'm trying to figure out how to efficiently run a set of queries that will provide a new table of all values that would return results for an arbitrary query. Say my table has a schema like: id name age city What is an efficient way to list all values that would return results for an arbitrary query, say "NOT city=X AND age BETWEEN Y and Z"? My naive approach for this would be to use a script and recurse through all possible combinations of {city, age, age} and see which SELECTs return more than 0 results, but that seems incredibly inefficient. I've also tried building large joins on {city, age, age} as well and basically using that table as an argument list to the query, but that quickly becomes an impossibility for queries on many columns. For simple conjunctive equality queries, i.e. "name=X and age=Y", this is much simpler, as I can do something like SELECT name, age, count(*) AS count FROM main GROUP BY name, age HAVING count > 0 But I'm having difficulty coming up with a general approach for anything more complicated than that. Any pointers in the right direction would be most helpful, thanks.

Read the article
Is the below thread pool implementation correct(C#3.0)

- by Newbie

Hi Experts, For the first time ever I have implemented thread pooling and I found it to be working. But I am not very sure about the way I have done is the appropriate way it is supposed to be. Would you people mind in spending some valuable time to check and let me know if my approach is correct or not? If you people find that the approach is incorrect , could you please help me out in writing the correct version. I have basicaly read How to use thread pool and based on what ever I have understood I have developed the below program as per my need public class Calculation { #region Private variable declaration ManualResetEvent[] factorManualResetEvent = null; #endregion public void Compute() { factorManualResetEvent = new ManualResetEvent[2]; for (int i = 0; i < 2; i++){ factorManualResetEvent[i] = new ManualResetEvent(false); ThreadPool.QueueUserWorkItem(ThreadPoolCallback, i);} //Wait for all the threads to complete WaitHandle.WaitAll(factorManualResetEvent); //Proceed with the next task(s) NEXT_TASK_TO_BE_EXECUTED(); } #region Private Methods // Wrapper method for use with thread pool. public void ThreadPoolCallback(Object threadContext) { int threadIndex = (int)threadContext; Method1(); Method2(); factorManualResetEvent[threadIndex].Set(); } private void Method1 () { //Code of method 1} private void Method2 () { //Code of method 2 } #endregion }

Read the article
Field to display Previous 30 Day Total

- by whytheq

I've got this table: CREATE TABLE #Data1 ( [Market] VARCHAR(100) NOT NULL, [Operator] VARCHAR(100) NOT NULL, [Date] DATETIME NOT NULL, [Measure] VARCHAR(100) NOT NULL, [Amount] NUMERIC(36,10) NOT NULL, --new calculated fields [DailyAvg_30days] NUMERIC(38,6) NULL DEFAULT 0 ) I've populated all the fields apart from DailyAvg_30days. This field needs to show the total for the preceding 30 days e.g. 1. if Date for a particular record is 2nd Dec then it will be the total for the period 3rd Nov - 2nd Dec inclusive. 2. if Date for a particular record is 1st Dec then it will be the total for the period 2nd Nov - 1st Dec inclusive. My attempt to try to find these totals before updating the table is as follows: SELECT a.[Market], a.[Operator], a.[Date], a.[Measure], a.[Amount], [DailyAvg_30days] = SUM(b.[Amount]) FROM #Data1 a INNER JOIN #Data1 b ON a.[Market] = b.[Market] AND a.[Operator] = b.[Operator] AND a.[Measure] = b.[Measure] AND a.[Date] >= b.[Date]-30 AND a.[Date] <= b.[Date] GROUP BY a.[Market], a.[Operator], a.[Date], a.[Measure], a.[Amount] ORDER BY 1,2,4,3 Is this a valid approach or do I need to approach this from a different angle?

Read the article
find the difference between two very large list

- by user157195

I have two large list(could be a hundred million items), the source of each list can be either from a database table or a flat file. both lists are of comparable sizes, both unsorted. I need to find the difference between them. so I have 3 scenarios: 1. List1 is a database table(assume each row simply have one item(key) that is a string), List2 is a large file. 2. Both lists are from 2 db tables. 3. both lists are from two files. in case 2, I plan to use: select a.item from MyTable a where a.item not in (select b.item form MyTable b) this clearly is inefficient, is there a better way? Another approach is: I plan to sort each list, and then walk down both of them to find the diff. If the list is from a file, I have to read it into a db table first, then use db sorting to output the list. Is the run time complexity still O(nlogn) in db sorting? either approach is a pain and seems would be very slow when the list involved has hundreds of millions of items. any suggestions?

Read the article
List of values as keys for a Map

- by thr

I have lists of variable length where each item can be one of four unique, that I need to use as keys for another object in a map. Assume that each value can be either 0, 1, 2 or 3 (it's not integer in my real code, but a lot easier to explain this way) so a few examples of key lists could be: [1, 0, 2, 3] [3, 2, 1] [1, 0, 0, 1, 1, 3] [2, 3, 1, 1, 2] [1, 2] So, to re-iterate: each item in the list can be either 0, 1, 2 or 3 and there can be any number of items in a list. My first approach was to try to hash the contents of the array, using the built in GetHashCode() in .NET to combine the hash of each element. But since this would return an int I would have to deal with collisions manually (two equal int values are identical to a Dictionary). So my second approach was to use a quad tree, breaking down each item in the list into a Node that has four pointers (one for each possible value) to the next four possible values (with the root node representing [], an empty list), inserting [1, 0, 2] => Foo, [1, 3] => Bar and [1, 0] => Baz into this tree would look like this: Grey nodes nodes being unused pointers/nodes. Though I worry about the performance of this setup, but there will be no need to deal with hash collisions and the tree won't become to deep (there will mostly be lists with 2-6 items stored, rarely over 6). Is there some other magic way to store items with lists of values as keys that I have missed?

Read the article
Loading jQuery Consistently in a .NET Web App

- by Rick Strahl

One thing that frequently comes up in discussions when using jQuery is how to best load the jQuery library (as well as other commonly used and updated libraries) in a Web application. Specifically the issue is the one of versioning and making sure that you can easily update and switch versions of script files with application wide settings in one place and having your script usage reflect those settings in the entire application on all pages that use the script. Although I use jQuery as an example here, the same concepts can be applied to any script library - for example in my Web libraries I use the same approach for jQuery.ui and my own internal jQuery support library. The concepts used here can be applied both in WebForms and MVC. Loading jQuery Properly From CDN Before we look at a generic way to load jQuery via some server logic, let me first point out my preferred way to embed jQuery into the page. I use the Google CDN to load jQuery and then use a fallback URL to handle the offline or no Internet connection scenario. Why use a CDN? CDN links tend to be loaded more quickly since they are very likely to be cached in user's browsers already as jQuery CDN is used by many, many sites on the Web. Using a CDN also removes load from your Web server and puts the load bearing on the CDN provider - in this case Google - rather than on your Web site. On the downside, CDN links gives the provider (Google, Microsoft) yet another way to track users through their Web usage. Here's how I use jQuery CDN plus a fallback link on my WebLog for example: <!DOCTYPE HTML> <html> <head> <script src="//ajax.googleapis.com/ajax/libs/jquery/1.6.4/jquery.min.js"></script> <script> if (typeof (jQuery) == 'undefined') document.write(unescape("%3Cscript " + "src='/Weblog/wwSC.axd?r=Westwind.Web.Controls.Resources.jquery.js' %3E%3C/script%3E")); </script> <title>Rick Strahl's Web Log</title> ... </head> You can see that the CDN is referenced first, followed by a small script block that checks to see whether jQuery was loaded (jQuery object exists). If it didn't load another script reference is added to the document dynamically pointing to a backup URL. In this case my backup URL points at a WebResource in my Westwind.Web assembly, but the URL can also be local script like src="/scripts/jquery.min.js". Important: Use the proper Protocol/Scheme for for CDN Urls [updated based on comments] If you're using a CDN to load an external script resource you should always make sure that the script is loaded with the same protocol as the parent page to avoid mixed content warnings by the browser. You don't want to load a script link to an http:// resource when you're on an https:// page. The easiest way to use this is by using a protocol relative URL: <script src="//ajax.googleapis.com/ajax/libs/jquery/1.6.4/jquery.min.js"></script> which is an easy way to load resources from other domains. This URL syntax will automatically use the parent page's protocol (or more correctly scheme). As long as the remote domains support both http:// and https:// access this should work. BTW this also works in CSS (with some limitations) and links. BTW, I didn't know about this until it was pointed out in the comments. This is a very useful feature for many things - ah the benefits of my blog to myself :-) Version Numbers When you use a CDN you notice that you have to reference a specific version of jQuery. When using local files you may not have to do this as you can rename your private copy of jQuery.js, but for CDN the references are always versioned. The version number is of course very important to ensure you getting the version you have tested with, but it's also important to the provider because it ensures that cached content is always correct. If an existing file was updated the updates might take a very long time to get past the locally cached content and won't refresh properly. The version number ensures you get the right version and not some cached content that has been changed but not updated in your cache. On the other hand version numbers also mean that once you decide to use a new version of the script you now have to change all your script references in your pages. Depending on whether you use some sort of master/layout page or not this may or may not be easy in your application. Even if you do use master/layout pages, chances are that you probably have a few of them and at the very least all of those have to be updated for the scripts. If you use individual pages for all content this issue then spreads to all of your pages. Search and Replace in Files will do the trick, but it's still something that's easy to forget and worry about. Personaly I think it makes sense to have a single place where you can specify common script libraries that you want to load and more importantly which versions thereof and where they are loaded from. Loading Scripts via Server Code Script loading has always been important to me and as long as I can remember I've always built some custom script loading routines into my Web frameworks. WebForms makes this fairly easy because it has a reasonably useful script manager (ClientScriptManager and the ScriptManager) which allow injecting script into the page easily from anywhere in the Page cycle. What's nice about these components is that they allow scripts to be injected by controls so components can wrap up complex script/resource dependencies more easily without having to require long lists of CSS/Scripts/Image includes. In MVC or pure script driven applications like Razor WebPages the process is more raw, requiring you to embed script references in the right place. But its also more immediate - it lets you know exactly which versions of scripts to use because you have to manually embed them. In WebForms with different controls loading resources this often can get confusing because it's quite possible to load multiple versions of the same script library into a page, the results of which are less than optimal… In this post I look a simple routine that embeds jQuery into the page based on a few application wide configuration settings. It returns only a string of the script tags that can be manually embedded into a Page template. It's a small function that merely a string of the script tags shown at the begging of this post along with some options on how that string is comprised. You'll be able to specify in one place which version loads and then all places where the help function is used will automatically reflect this selection. Options allow specification of the jQuery CDN Url, the fallback Url and where jQuery should be loaded from (script folder, Resource or CDN in my case). While this is specific to jQuery you can apply this to other resources as well. For example I use a similar approach with jQuery.ui as well using practically the same semantics. Providing Resources in ControlResources In my Westwind.Web Web utility library I have a class called ControlResources which is responsible for holding resource Urls, resource IDs and string contants that reference those resource IDs. The library also provides a few helper methods for loading common scriptscripts into a Web page. There are specific versions for WebForms which use the ClientScriptManager/ScriptManager and script link methods that can be used in any .NET technology that can embed an expression into the output template (or code for that matter). The ControlResources class contains mostly static content - references to resources mostly. But it also contains a few static properties that configure script loading: A Script LoadMode (CDN, Resource, or script url) A default CDN Url A fallback url They are static properties in the ControlResources class: public class ControlResources { /// <summary> /// Determines what location jQuery is loaded from /// </summary> public static JQueryLoadModes jQueryLoadMode = JQueryLoadModes.ContentDeliveryNetwork; /// <summary> /// jQuery CDN Url on Google /// </summary> public static string jQueryCdnUrl = "//ajax.googleapis.com/ajax/libs/jquery/1.6.4/jquery.min.js"; /// <summary> /// jQuery CDN Url on Google /// </summary> public static string jQueryUiCdnUrl = "//ajax.googleapis.com/ajax/libs/jqueryui/1.8.16/jquery-ui.min.js"; /// <summary> /// jQuery UI fallback Url if CDN is unavailable or WebResource is used /// Note: The file needs to exist and hold the minimized version of jQuery ui /// </summary> public static string jQueryUiLocalFallbackUrl = "~/scripts/jquery-ui.min.js"; } These static properties are fixed values that can be changed at application startup to reflect your preferences. Since they're static they are application wide settings and respected across the entire Web application running. It's best to set these default in Application_Init or similar startup code if you need to change them for your application: protected void Application_Start(object sender, EventArgs e) { // Force jQuery to be loaded off Google Content Network ControlResources.jQueryLoadMode = JQueryLoadModes.ContentDeliveryNetwork; // Allow overriding of the Cdn url ControlResources.jQueryCdnUrl = "http://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"; // Route to our own internal handler App.OnApplicationStart(); } With these basic settings in place you can then embed expressions into a page easily. In WebForms use: <!DOCTYPE html> <html> <head runat="server"> <%= ControlResources.jQueryLink() %> <script src="scripts/ww.jquery.min.js"></script> </head> In Razor use: <!DOCTYPE html> <html> <head> @Html.Raw(ControlResources.jQueryLink()) <script src="scripts/ww.jquery.min.js"></script> </head> Note that in Razor you need to use @Html.Raw() to force the string NOT to escape. Razor by default escapes string results and this ensures that the HTML content is properly expanded as raw HTML text. Both the WebForms and Razor output produce: <!DOCTYPE html> <html> <head> <script src="http://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js" type="text/javascript"></script> <script type="text/javascript"> if (typeof (jQuery) == 'undefined') document.write(unescape("%3Cscript src='/WestWindWebToolkitWeb/WebResource.axd?d=-b6oWzgbpGb8uTaHDrCMv59VSmGhilZP5_T_B8anpGx7X-PmW_1eu1KoHDvox-XHqA1EEb-Tl2YAP3bBeebGN65tv-7-yAimtG4ZnoWH633pExpJor8Qp1aKbk-KQWSoNfRC7rQJHXVP4tC0reYzVw2&t=634535391996872492' type='text/javascript'%3E%3C/script%3E"));</script> <script src="scripts/ww.jquery.min.js"></script> </head> which produces the desired effect for both CDN load and fallback URL. The implementation of jQueryLink is pretty basic of course: /// <summary> /// Inserts a script link to load jQuery into the page based on the jQueryLoadModes settings /// of this class. Default load is by CDN plus WebResource fallback /// </summary> /// <param name="url"> /// An optional explicit URL to load jQuery from. Url is resolved. /// When specified no fallback is applied /// </param> /// <returns>full script tag and fallback script for jQuery to load</returns> public static string jQueryLink(JQueryLoadModes jQueryLoadMode = JQueryLoadModes.Default, string url = null) { string jQueryUrl = string.Empty; string fallbackScript = string.Empty; if (jQueryLoadMode == JQueryLoadModes.Default) jQueryLoadMode = ControlResources.jQueryLoadMode; if (!string.IsNullOrEmpty(url)) jQueryUrl = WebUtils.ResolveUrl(url); else if (jQueryLoadMode == JQueryLoadModes.WebResource) { Page page = new Page(); jQueryUrl = page.ClientScript.GetWebResourceUrl(typeof(ControlResources), ControlResources.JQUERY_SCRIPT_RESOURCE); } else if (jQueryLoadMode == JQueryLoadModes.ContentDeliveryNetwork) { jQueryUrl = ControlResources.jQueryCdnUrl; if (!string.IsNullOrEmpty(jQueryCdnUrl)) { // check if jquery loaded - if it didn't we're not online and use WebResource fallbackScript = @"<script type=""text/javascript"">if (typeof(jQuery) == 'undefined') document.write(unescape(""%3Cscript src='{0}' type='text/javascript'%3E%3C/script%3E""));</script>"; fallbackScript = string.Format(fallbackScript, WebUtils.ResolveUrl(ControlResources.jQueryCdnFallbackUrl)); } } string output = "<script src=\"" + jQueryUrl + "\" type=\"text/javascript\"></script>"; // add in the CDN fallback script code if (!string.IsNullOrEmpty(fallbackScript)) output += "\r\n" + fallbackScript + "\r\n"; return output; } There's one dependency here on WebUtils.ResolveUrl() which resolves Urls without access to a Page/Control (another one of those features that should be in the runtime, not in the WebForms or MVC engine). You can see there's only a little bit of logic in this code that deals with potentially different load modes. I can load scripts from a Url, WebResources or - my preferred way - from CDN. Based on the static settings the scripts to embed are composed to be returned as simple string <script> tag(s). I find this extremely useful especially when I'm not connected to the internet so that I can quickly swap in a local jQuery resource instead of loading from CDN. While CDN loading with the fallback works it can be a bit slow as the CDN is probed first before the fallback kicks in. Switching quickly in one place makes this trivial. It also makes it very easy once a new version of jQuery rolls around to move up to the new version and ensure that all pages are using the new version immediately. I'm not trying to make this out as 'the' definite way to load your resources, but rather provide it here as a pointer so you can maybe apply your own logic to determine where scripts come from and how they load. You could even automate this some more by using configuration settings or reading the locations/preferences out of some sort of data/metadata store that can be dynamically updated instead via recompilation. FWIW, I use a very similar approach for loading jQuery UI and my own ww.jquery library - the same concept can be applied to any kind of script you might be loading from different locations. Hopefully some of you find this a useful addition to your toolset. Resources Google CDN for jQuery Full ControlResources Source Code ControlResource Documentation Westwind.Web NuGet This method is part of the Westwind.Web library of the West Wind Web Toolkit or you can grab the Web library from NuGet and add to your Visual Studio project. This package includes a host of Web related utilities and script support features. © Rick Strahl, West Wind Technologies, 2005-2011Posted in ASP.NET jQuery Tweet (function() { var po = document.createElement('script'); po.type = 'text/javascript'; po.async = true; po.src = 'https://apis.google.com/js/plusone.js'; var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(po, s); })();

Read the article
.NET HTML Sanitation for rich HTML Input

- by Rick Strahl

Recently I was working on updating a legacy application to MVC 4 that included free form text input. When I set up the new site my initial approach was to not allow any rich HTML input, only simple text formatting that would respect a few simple HTML commands for bold, lists etc. and automatically handles line break processing for new lines and paragraphs. This is typical for what I do with most multi-line text input in my apps and it works very well with very little development effort involved. Then the client sprung another note: Oh by the way we have a bunch of customers (real estate agents) who need to post complete HTML documents. Oh uh! There goes the simple theory. After some discussion and pleading on my part (<snicker>) to try and avoid this type of raw HTML input because of potential XSS issues, the client decided to go ahead and allow raw HTML input anyway. There has been lots of discussions on this subject on StackOverFlow (and here and here) but to after reading through some of the solutions I didn't really find anything that would work even closely for what I needed. Specifically we need to be able to allow just about any HTML markup, with the exception of script code. Remote CSS and Images need to be loaded, links need to work and so. While the 'legit' HTML posted by these agents is basic in nature it does span most of the full gamut of HTML (4). Most of the solutions XSS prevention/sanitizer solutions I found were way to aggressive and rendered the posted output unusable mostly because they tend to strip any externally loaded content. In short I needed a custom solution. I thought the best solution to this would be to use an HTML parser - in this case the Html Agility Pack - and then to run through all the HTML markup provided and remove any of the blacklisted tags and a number of attributes that are prone to JavaScript injection. There's much discussion on whether to use blacklists vs. whitelists in the discussions mentioned above, but I found that whitelists can make sense in simple scenarios where you might allow manual HTML input, but when you need to allow a larger array of HTML functionality a blacklist is probably easier to manage as the vast majority of elements and attributes could be allowed. Also white listing gets a bit more complex with HTML5 and the new proliferation of new HTML tags and most new tags generally don't affect XSS issues directly. Pure whitelisting based on elements and attributes also doesn't capture many edge cases (see some of the XSS cheat sheets listed below) so even with a white list, custom logic is still required to handle many of those edge cases. The Microsoft Web Protection Library (AntiXSS) My first thought was to check out the Microsoft AntiXSS library. Microsoft has an HTML Encoding and Sanitation library in the Microsoft Web Protection Library (formerly AntiXSS Library) on CodePlex, which provides stricter functions for whitelist encoding and sanitation. Initially I thought the Sanitation class and its static members would do the trick for me,but I found that this library is way too restrictive for my needs. Specifically the Sanitation class strips out images and links which rendered the full HTML from our real estate clients completely useless. I didn't spend much time with it, but apparently I'm not alone if feeling this library is not really useful without some way to configure operation. To give you an example of what didn't work for me with the library here's a small and simple HTML fragment that includes script, img and anchor tags. I would expect the script to be stripped and everything else to be left intact. Here's the original HTML:var value = "<b>Here</b> <script>alert('hello')</script> we go. Visit the " + "<a href='http://west-wind.com'>West Wind</a> site. " + "<img src='http://west-wind.com/images/new.gif' /> " ; and the code to sanitize it with the AntiXSS Sanitize class:@Html.Raw(Microsoft.Security.Application.Sanitizer.GetSafeHtmlFragment(value)) This produced a not so useful sanitized string: Here we go. Visit the <a>West Wind</a> site. While it removed the <script> tag (good) it also removed the href from the link and the image tag altogether (bad). In some situations this might be useful, but for most tasks I doubt this is the desired behavior. While links can contain javascript: references and images can 'broadcast' information to a server, without configuration to tell the library what to restrict this becomes useless to me. I couldn't find any way to customize the white list, nor is there code available in this 'open source' library on CodePlex. Using Html Agility Pack for HTML Parsing The WPL library wasn't going to cut it. After doing a bit of research I decided the best approach for a custom solution would be to use an HTML parser and inspect the HTML fragment/document I'm trying to import. I've used the HTML Agility Pack before for a number of apps where I needed an HTML parser without requiring an instance of a full browser like the Internet Explorer Application object which is inadequate in Web apps. In case you haven't checked out the Html Agility Pack before, it's a powerful HTML parser library that you can use from your .NET code. It provides a simple, parsable HTML DOM model to full HTML documents or HTML fragments that let you walk through each of the elements in your document. If you've used the HTML or XML DOM in a browser before you'll feel right at home with the Agility Pack. Blacklist based HTML Parsing to strip XSS Code For my purposes of HTML sanitation, the process involved is to walk the HTML document one element at a time and then check each element and attribute against a blacklist. There's quite a bit of argument of what's better: A whitelist of allowed items or a blacklist of denied items. While whitelists tend to be more secure, they also require a lot more configuration. In the case of HTML5 a whitelist could be very extensive. For what I need, I only want to ensure that no JavaScript is executed, so a blacklist includes the obvious <script> tag plus any tag that allows loading of external content including <iframe>, <object>, <embed> and <link> etc. <form> is also excluded to avoid posting content to a different location. I also disallow <head> and <meta> tags in particular for my case, since I'm only allowing posting of HTML fragments. There is also some internal logic to exclude some attributes or attributes that include references to JavaScript or CSS expressions. The default tag blacklist reflects my use case, but is customizable and can be added to. Here's my HtmlSanitizer implementation:using System.Collections.Generic; using System.IO; using System.Xml; using HtmlAgilityPack; namespace Westwind.Web.Utilities { public class HtmlSanitizer { public HashSet<string> BlackList = new HashSet<string>() { { "script" }, { "iframe" }, { "form" }, { "object" }, { "embed" }, { "link" }, { "head" }, { "meta" } }; /// <summary> /// Cleans up an HTML string and removes HTML tags in blacklist /// </summary> /// <param name="html"></param> /// <returns></returns> public static string SanitizeHtml(string html, params string[] blackList) { var sanitizer = new HtmlSanitizer(); if (blackList != null && blackList.Length > 0) { sanitizer.BlackList.Clear(); foreach (string item in blackList) sanitizer.BlackList.Add(item); } return sanitizer.Sanitize(html); } /// <summary> /// Cleans up an HTML string by removing elements /// on the blacklist and all elements that start /// with onXXX . /// </summary> /// <param name="html"></param> /// <returns></returns> public string Sanitize(string html) { var doc = new HtmlDocument(); doc.LoadHtml(html); SanitizeHtmlNode(doc.DocumentNode); //return doc.DocumentNode.WriteTo(); string output = null; // Use an XmlTextWriter to create self-closing tags using (StringWriter sw = new StringWriter()) { XmlWriter writer = new XmlTextWriter(sw); doc.DocumentNode.WriteTo(writer); output = sw.ToString(); // strip off XML doc header if (!string.IsNullOrEmpty(output)) { int at = output.IndexOf("?>"); output = output.Substring(at + 2); } writer.Close(); } doc = null; return output; } private void SanitizeHtmlNode(HtmlNode node) { if (node.NodeType == HtmlNodeType.Element) { // check for blacklist items and remove if (BlackList.Contains(node.Name)) { node.Remove(); return; } // remove CSS Expressions and embedded script links if (node.Name == "style") { if (string.IsNullOrEmpty(node.InnerText)) { if (node.InnerHtml.Contains("expression") || node.InnerHtml.Contains("javascript:")) node.ParentNode.RemoveChild(node); } } // remove script attributes if (node.HasAttributes) { for (int i = node.Attributes.Count - 1; i >= 0; i--) { HtmlAttribute currentAttribute = node.Attributes[i]; var attr = currentAttribute.Name.ToLower(); var val = currentAttribute.Value.ToLower(); span style="background: white; color: green">// remove event handlers if (attr.StartsWith("on")) node.Attributes.Remove(currentAttribute); // remove script links else if ( //(attr == "href" || attr== "src" || attr == "dynsrc" || attr == "lowsrc") && val != null && val.Contains("javascript:")) node.Attributes.Remove(currentAttribute); // Remove CSS Expressions else if (attr == "style" && val != null && val.Contains("expression") || val.Contains("javascript:") || val.Contains("vbscript:")) node.Attributes.Remove(currentAttribute); } } } // Look through child nodes recursively if (node.HasChildNodes) { for (int i = node.ChildNodes.Count - 1; i >= 0; i--) { SanitizeHtmlNode(node.ChildNodes[i]); } } } } } Please note: Use this as a starting point only for your own parsing and review the code for your specific use case! If your needs are less lenient than mine were you can you can make this much stricter by not allowing src and href attributes or CSS links if your HTML doesn't allow it. You can also check links for external URLs and disallow those - lots of options. The code is simple enough to make it easy to extend to fit your use cases more specifically. It's also quite easy to make this code work using a WhiteList approach if you want to go that route. The code above is semi-generic for allowing full featured HTML fragments that only disallow script related content. The Sanitize method walks through each node of the document and then recursively drills into all of its children until the entire document has been traversed. Note that the code here uses an XmlTextWriter to write output - this is done to preserve XHTML style self-closing tags which are otherwise left as non-self-closing tags. The sanitizer code scans for blacklist elements and removes those elements not allowed. Note that the blacklist is configurable either in the instance class as a property or in the static method via the string parameter list. Additionally the code goes through each element's attributes and looks for a host of rules gleaned from some of the XSS cheat sheets listed at the end of the post. Clearly there are a lot more XSS vulnerabilities, but a lot of them apply to ancient browsers (IE6 and versions of Netscape) - many of these glaring holes (like CSS expressions - WTF IE?) have been removed in modern browsers. What a Pain To be honest this is NOT a piece of code that I wanted to write. I think building anything related to XSS is better left to people who have far more knowledge of the topic than I do. Unfortunately, I was unable to find a tool that worked even closely for me, or even provided a working base. For the project I was working on I had no choice and I'm sharing the code here merely as a base line to start with and potentially expand on for specific needs. It's sad that Microsoft Web Protection Library is currently such a train wreck - this is really something that should come from Microsoft as the systems vendor or possibly a third party that provides security tools. Luckily for my application we are dealing with a authenticated and validated users so the user base is fairly well known, and relatively small - this is not a wide open Internet application that's directly public facing. As I mentioned earlier in the post, if I had my way I would simply not allow this type of raw HTML input in the first place, and instead rely on a more controlled HTML input mechanism like MarkDown or even a good HTML Edit control that can provide some limits on what types of input are allowed. Alas in this case I was overridden and we had to go forward and allow *any* raw HTML posted. Sometimes I really feel sad that it's come this far - how many good applications and tools have been thwarted by fear of XSS (or worse) attacks? So many things that could be done *if* we had a more secure browser experience and didn't have to deal with every little script twerp trying to hack into Web pages and obscure browser bugs. So much time wasted building secure apps, so much time wasted by others trying to hack apps… We're a funny species - no other species manages to waste as much time, effort and resources as we humans do :-) Resources Code on GitHub Html Agility Pack XSS Cheat Sheet XSS Prevention Cheat Sheet Microsoft Web Protection Library (AntiXss) StackOverflow Links: http://stackoverflow.com/questions/341872/html-sanitizer-for-net http://blog.stackoverflow.com/2008/06/safe-html-and-xss/ http://code.google.com/p/subsonicforums/source/browse/trunk/SubSonic.Forums.Data/HtmlScrubber.cs?r=61© Rick Strahl, West Wind Technologies, 2005-2012Posted in Security HTML ASP.NET JavaScript Tweet !function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0];if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src="//platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs"); (function() { var po = document.createElement('script'); po.type = 'text/javascript'; po.async = true; po.src = 'https://apis.google.com/js/plusone.js'; var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(po, s); })();

Read the article
Inheritance Mapping Strategies with Entity Framework Code First CTP5 Part 1: Table per Hierarchy (TPH)

- by mortezam

A simple strategy for mapping classes to database tables might be “one table for every entity persistent class.” This approach sounds simple enough and, indeed, works well until we encounter inheritance. Inheritance is such a visible structural mismatch between the object-oriented and relational worlds because object-oriented systems model both “is a” and “has a” relationships. SQL-based models provide only "has a" relationships between entities; SQL database management systems don’t support type inheritance—and even when it’s available, it’s usually proprietary or incomplete. There are three different approaches to representing an inheritance hierarchy: Table per Hierarchy (TPH): Enable polymorphism by denormalizing the SQL schema, and utilize a type discriminator column that holds type information. Table per Type (TPT): Represent "is a" (inheritance) relationships as "has a" (foreign key) relationships. Table per Concrete class (TPC): Discard polymorphism and inheritance relationships completely from the SQL schema.I will explain each of these strategies in a series of posts and this one is dedicated to TPH. In this series we'll deeply dig into each of these strategies and will learn about "why" to choose them as well as "how" to implement them. Hopefully it will give you a better idea about which strategy to choose in a particular scenario. Inheritance Mapping with Entity Framework Code FirstAll of the inheritance mapping strategies that we discuss in this series will be implemented by EF Code First CTP5. The CTP5 build of the new EF Code First library has been released by ADO.NET team earlier this month. EF Code-First enables a pretty powerful code-centric development workflow for working with data. I’m a big fan of the EF Code First approach, and I’m pretty excited about a lot of productivity and power that it brings. When it comes to inheritance mapping, not only Code First fully supports all the strategies but also gives you ultimate flexibility to work with domain models that involves inheritance. The fluent API for inheritance mapping in CTP5 has been improved a lot and now it's more intuitive and concise in compare to CTP4. A Note For Those Who Follow Other Entity Framework ApproachesIf you are following EF's "Database First" or "Model First" approaches, I still recommend to read this series since although the implementation is Code First specific but the explanations around each of the strategies is perfectly applied to all approaches be it Code First or others. A Note For Those Who are New to Entity Framework and Code-FirstIf you choose to learn EF you've chosen well. If you choose to learn EF with Code First you've done even better. To get started, you can find a great walkthrough by Scott Guthrie here and another one by ADO.NET team here. In this post, I assume you already setup your machine to do Code First development and also that you are familiar with Code First fundamentals and basic concepts. You might also want to check out my other posts on EF Code First like Complex Types and Shared Primary Key Associations. A Top Down Development ScenarioThese posts take a top-down approach; it assumes that you’re starting with a domain model and trying to derive a new SQL schema. Therefore, we start with an existing domain model, implement it in C# and then let Code First create the database schema for us. However, the mapping strategies described are just as relevant if you’re working bottom up, starting with existing database tables. I’ll show some tricks along the way that help you dealing with nonperfect table layouts. Let’s start with the mapping of entity inheritance. -- The Domain ModelIn our domain model, we have a BillingDetail base class which is abstract (note the italic font on the UML class diagram below). We do allow various billing types and represent them as subclasses of BillingDetail class. As for now, we support CreditCard and BankAccount: Implement the Object Model with Code First As always, we start with the POCO classes. Note that in our DbContext, I only define one DbSet for the base class which is BillingDetail. Code First will find the other classes in the hierarchy based on Reachability Convention. public abstract class BillingDetail { public int BillingDetailId { get; set; } public string Owner { get; set; } public string Number { get; set; } } public class BankAccount : BillingDetail { public string BankName { get; set; } public string Swift { get; set; } } public class CreditCard : BillingDetail { public int CardType { get; set; } public string ExpiryMonth { get; set; } public string ExpiryYear { get; set; } } public class InheritanceMappingContext : DbContext { public DbSet<BillingDetail> BillingDetails { get; set; } } This object model is all that is needed to enable inheritance with Code First. If you put this in your application you would be able to immediately start working with the database and do CRUD operations. Before going into details about how EF Code First maps this object model to the database, we need to learn about one of the core concepts of inheritance mapping: polymorphic and non-polymorphic queries. Polymorphic Queries LINQ to Entities and EntitySQL, as object-oriented query languages, both support polymorphic queries—that is, queries for instances of a class and all instances of its subclasses, respectively. For example, consider the following query: IQueryable<BillingDetail> linqQuery = from b in context.BillingDetails select b; List<BillingDetail> billingDetails = linqQuery.ToList(); Or the same query in EntitySQL: string eSqlQuery = @"SELECT VAlUE b FROM BillingDetails AS b"; ObjectQuery<BillingDetail> objectQuery = ((IObjectContextAdapter)context).ObjectContext .CreateQuery<BillingDetail>(eSqlQuery); List<BillingDetail> billingDetails = objectQuery.ToList(); linqQuery and eSqlQuery are both polymorphic and return a list of objects of the type BillingDetail, which is an abstract class but the actual concrete objects in the list are of the subtypes of BillingDetail: CreditCard and BankAccount. Non-polymorphic QueriesAll LINQ to Entities and EntitySQL queries are polymorphic which return not only instances of the specific entity class to which it refers, but all subclasses of that class as well. On the other hand, Non-polymorphic queries are queries whose polymorphism is restricted and only returns instances of a particular subclass. In LINQ to Entities, this can be specified by using OfType<T>() Method. For example, the following query returns only instances of BankAccount: IQueryable<BankAccount> query = from b in context.BillingDetails.OfType<BankAccount>() select b; EntitySQL has OFTYPE operator that does the same thing: string eSqlQuery = @"SELECT VAlUE b FROM OFTYPE(BillingDetails, Model.BankAccount) AS b"; In fact, the above query with OFTYPE operator is a short form of the following query expression that uses TREAT and IS OF operators: string eSqlQuery = @"SELECT VAlUE TREAT(b as Model.BankAccount) FROM BillingDetails AS b WHERE b IS OF(Model.BankAccount)"; (Note that in the above query, Model.BankAccount is the fully qualified name for BankAccount class. You need to change "Model" with your own namespace name.) Table per Class Hierarchy (TPH)An entire class hierarchy can be mapped to a single table. This table includes columns for all properties of all classes in the hierarchy. The concrete subclass represented by a particular row is identified by the value of a type discriminator column. You don’t have to do anything special in Code First to enable TPH. It's the default inheritance mapping strategy: This mapping strategy is a winner in terms of both performance and simplicity. It’s the best-performing way to represent polymorphism—both polymorphic and nonpolymorphic queries perform well—and it’s even easy to implement by hand. Ad-hoc reporting is possible without complex joins or unions. Schema evolution is straightforward. Discriminator Column As you can see in the DB schema above, Code First has to add a special column to distinguish between persistent classes: the discriminator. This isn’t a property of the persistent class in our object model; it’s used internally by EF Code First. By default, the column name is "Discriminator", and its type is string. The values defaults to the persistent class names —in this case, “BankAccount” or “CreditCard”. EF Code First automatically sets and retrieves the discriminator values. TPH Requires Properties in SubClasses to be Nullable in the Database TPH has one major problem: Columns for properties declared by subclasses will be nullable in the database. For example, Code First created an (INT, NULL) column to map CardType property in CreditCard class. However, in a typical mapping scenario, Code First always creates an (INT, NOT NULL) column in the database for an int property in persistent class. But in this case, since BankAccount instance won’t have a CardType property, the CardType field must be NULL for that row so Code First creates an (INT, NULL) instead. If your subclasses each define several non-nullable properties, the loss of NOT NULL constraints may be a serious problem from the point of view of data integrity. TPH Violates the Third Normal FormAnother important issue is normalization. We’ve created functional dependencies between nonkey columns, violating the third normal form. Basically, the value of Discriminator column determines the corresponding values of the columns that belong to the subclasses (e.g. BankName) but Discriminator is not part of the primary key for the table. As always, denormalization for performance can be misleading, because it sacrifices long-term stability, maintainability, and the integrity of data for immediate gains that may be also achieved by proper optimization of the SQL execution plans (in other words, ask your DBA). Generated SQL QueryLet's take a look at the SQL statements that EF Code First sends to the database when we write queries in LINQ to Entities or EntitySQL. For example, the polymorphic query for BillingDetails that you saw, generates the following SQL statement: SELECT [Extent1].[Discriminator] AS [Discriminator], [Extent1].[BillingDetailId] AS [BillingDetailId], [Extent1].[Owner] AS [Owner], [Extent1].[Number] AS [Number], [Extent1].[BankName] AS [BankName], [Extent1].[Swift] AS [Swift], [Extent1].[CardType] AS [CardType], [Extent1].[ExpiryMonth] AS [ExpiryMonth], [Extent1].[ExpiryYear] AS [ExpiryYear] FROM [dbo].[BillingDetails] AS [Extent1] WHERE [Extent1].[Discriminator] IN ('BankAccount','CreditCard') Or the non-polymorphic query for the BankAccount subclass generates this SQL statement: SELECT [Extent1].[BillingDetailId] AS [BillingDetailId], [Extent1].[Owner] AS [Owner], [Extent1].[Number] AS [Number], [Extent1].[BankName] AS [BankName], [Extent1].[Swift] AS [Swift] FROM [dbo].[BillingDetails] AS [Extent1] WHERE [Extent1].[Discriminator] = 'BankAccount' Note how Code First adds a restriction on the discriminator column and also how it only selects those columns that belong to BankAccount entity. Change Discriminator Column Data Type and Values With Fluent API Sometimes, especially in legacy schemas, you need to override the conventions for the discriminator column so that Code First can work with the schema. The following fluent API code will change the discriminator column name to "BillingDetailType" and the values to "BA" and "CC" for BankAccount and CreditCard respectively: protected override void OnModelCreating(System.Data.Entity.ModelConfiguration.ModelBuilder modelBuilder) { modelBuilder.Entity<BillingDetail>() .Map<BankAccount>(m => m.Requires("BillingDetailType").HasValue("BA")) .Map<CreditCard>(m => m.Requires("BillingDetailType").HasValue("CC")); } Also, changing the data type of discriminator column is interesting. In the above code, we passed strings to HasValue method but this method has been defined to accepts a type of object: public void HasValue(object value); Therefore, if for example we pass a value of type int to it then Code First not only use our desired values (i.e. 1 & 2) in the discriminator column but also changes the column type to be (INT, NOT NULL): modelBuilder.Entity<BillingDetail>() .Map<BankAccount>(m => m.Requires("BillingDetailType").HasValue(1)) .Map<CreditCard>(m => m.Requires("BillingDetailType").HasValue(2)); SummaryIn this post we learned about Table per Hierarchy as the default mapping strategy in Code First. The disadvantages of the TPH strategy may be too serious for your design—after all, denormalized schemas can become a major burden in the long run. Your DBA may not like it at all. In the next post, we will learn about Table per Type (TPT) strategy that doesn’t expose you to this problem. References ADO.NET team blog Java Persistence with Hibernate book a { text-decoration: none; } a:visited { color: Blue; } .title { padding-bottom: 5px; font-family: Segoe UI; font-size: 11pt; font-weight: bold; padding-top: 15px; } .code, .typeName { font-family: consolas; } .typeName { color: #2b91af; } .padTop5 { padding-top: 5px; } .padTop10 { padding-top: 10px; } p.MsoNormal { margin-top: 0in; margin-right: 0in; margin-bottom: 10.0pt; margin-left: 0in; line-height: 115%; font-size: 11.0pt; font-family: "Calibri" , "sans-serif"; }

Read the article
Guide to reduce TFS database growth using the Test Attachment Cleaner

- by terje

Recently there has been several reports on TFS databases growing too fast and growing too big. Notable this has been observed when one has started to use more features of the Testing system. Also, the TFS 2010 handles test results differently from TFS 2008, and this leads to more data stored in the TFS databases. As a consequence of this there has been released some tools to remove unneeded data in the database, and also some fixes to correct for bugs which has been found and corrected during this process. Further some preventive practices and maintenance rules should be adopted. A lot of people have blogged about this, among these are: Anu’s very important blog post here describes both the problem and solutions to handle it. She describes both the Test Attachment Cleaner tool, and also some QFE/CU releases to fix some underlying bugs which prevented the tool from being fully effective. Brian Harry’s blog post here describes the problem too This forum thread describes the problem with some solution hints. Ravi Shanker’s blog post here describes best practices on solving this (TBP) Grant Holidays blogpost here describes strategies to use the Test Attachment Cleaner both to detect space problems and how to rectify them. The problem can be divided into the following areas: Publishing of test results from builds Publishing of manual test results and their attachments in particular Publishing of deployment binaries for use during a test run Bugs in SQL server preventing total cleanup of data (All the published data above is published into the TFS database as attachments.) The test results will include all data being collected during the run. Some of this data can grow rather large, like IntelliTrace logs and video recordings. Also the pushing of binaries which happen for automated test runs, including tests run during a build using code coverage which will include all the files in the deployment folder, contributes a lot to the size of the attached data. In order to handle this systematically, I have set up a 3-stage process: Find out if you have a database space issue Set up your TFS server to minimize potential database issues If you have the “problem”, clean up the database and otherwise keep it clean Analyze the data Are your database( s) growing ? Are unused test results growing out of proportion ? To find out about this you need to query your TFS database for some of the information, and use the Test Attachment Cleaner (TAC) to obtain some more detailed information. If you don’t have too many databases you can use the SQL Server reports from within the Management Studio to analyze the database and table sizes. Or, you can use a set of queries . I find queries often faster to use because I can tweak them the way I want them. But be aware that these queries are non-documented and non-supported and may change when the product team wants to change them. If you have multiple Project Collections, find out which might have problems: (Disclaimer: The queries below work on TFS 2010. They will not work on Dev-11, since the table structure have been changed. I will try to update them for Dev-11 when it is released.) Open a SQL Management Studio session onto the SQL Server where you have your TFS Databases. Use the query below to find the Project Collection databases and their sizes, in descending size order. use master select DB_NAME(database_id) AS DBName, (size/128) SizeInMB FROM sys.master_files where type=0 and substring(db_name(database_id),1,4)='Tfs_' and DB_NAME(database_id)<>'Tfs_Configuration' order by size desc Doing this on one of our SQL servers gives the following results: It is pretty easy to see on which collection to start the work Find out which tables are possibly too large Keep a special watch out for the Tfs_Attachment table. Use the script at the bottom of Grant’s blog to find the table sizes in descending size order. In our case we got this result: From Grant’s blog we learnt that the tbl_Content is in the Version Control category, so the major only big issue we have here is the tbl_AttachmentContent. Find out which team projects have possibly too large attachments In order to use the TAC to find and eventually delete attachment data we need to find out which team projects have these attachments. The team project is a required parameter to the TAC. Use the following query to find this, replace the collection database name with whatever applies in your case: use Tfs_DefaultCollection select p.projectname, sum(a.compressedlength)/1024/1024 as sizeInMB from dbo.tbl_Attachment as a inner join tbl_testrun as tr on a.testrunid=tr.testrunid inner join tbl_project as p on p.projectid=tr.projectid group by p.projectname order by sum(a.compressedlength) desc In our case we got this result (had to remove some names), out of more than 100 team projects accumulated over quite some years: As can be seen here it is pretty obvious the “Byggtjeneste – Projects” are the main team project to take care of, with the ones on lines 2-4 as the next ones. Check which attachment types takes up the most space It can be nice to know which attachment types takes up the space, so run the following query: use Tfs_DefaultCollection select a.attachmenttype, sum(a.compressedlength)/1024/1024 as sizeInMB from dbo.tbl_Attachment as a inner join tbl_testrun as tr on a.testrunid=tr.testrunid inner join tbl_project as p on p.projectid=tr.projectid group by a.attachmenttype order by sum(a.compressedlength) desc We then got this result: From this it is pretty obvious that the problem here is the binary files, as also mentioned in Anu’s blog. Check which file types, by their extension, takes up the most space Run the following query use Tfs_DefaultCollection select SUBSTRING(filename,len(filename)-CHARINDEX('.',REVERSE(filename))+2,999)as Extension, sum(compressedlength)/1024 as SizeInKB from tbl_Attachment group by SUBSTRING(filename,len(filename)-CHARINDEX('.',REVERSE(filename))+2,999) order by sum(compressedlength) desc This gives a result like this: Now you should have collected enough information to tell you what to do – if you got to do something, and some of the information you need in order to set up your TAC settings file, both for a cleanup and for scheduled maintenance later. Get your TFS server and environment properly set up Even if you have got the problem or if have yet not got the problem, you should ensure the TFS server is set up so that the risk of getting into this problem is minimized. To ensure this you should install the following set of updates and components. The assumption is that your TFS Server is at SP1 level. Install the QFE for KB2608743 – which also contains detailed instructions on its use, download from here. The QFE changes the default settings to not upload deployed binaries, which are used in automated test runs. Binaries will still be uploaded if: Code coverage is enabled in the test settings. You change the UploadDeploymentItem to true in the testsettings file. Be aware that this might be reset back to false by another user which haven't installed this QFE. The hotfix should be installed to The build servers (the build agents) The machine hosting the Test Controller Local development computers (Visual Studio) Local test computers (MTM) It is not required to install it to the TFS Server, test agents or the build controller – it has no effect on these programs. If you use the SQL Server 2008 R2 you should also install the CU 10 (or later). This CU fixes a potential problem of hanging “ghost” files. This seems to happen only in certain trigger situations, but to ensure it doesn’t bite you, it is better to make sure this CU is installed. There is no such CU for SQL Server 2008 pre-R2 Work around: If you suspect hanging ghost files, they can be – with some mental effort, deduced from the ghost counters using the following SQL query: use master SELECT DB_NAME(database_id) as 'database',OBJECT_NAME(object_id) as 'objectname', index_type_desc,ghost_record_count,version_ghost_record_count,record_count,avg_record_size_in_bytes FROM sys.dm_db_index_physical_stats (DB_ID(N'<DatabaseName>'), OBJECT_ID(N'<TableName>'), NULL, NULL , 'DETAILED') The problem is a stalled ghost cleanup process. Restarting the SQL server after having stopped all components that depends on it, like the TFS Server and SPS services – that is all applications that connect to the SQL server. Then restart the SQL server, and finally start up all dependent processes again. (I would guess a complete server reboot would do the trick too.) After this the ghost cleanup process will run properly again. The fix will come in the next CU cycle for SQL Server R2 SP1. The R2 pre-SP1 and R2 SP1 have separate maintenance cycles, and are maintained individually. Each have its own set of CU’s. When it comes I will add the link here to that CU. The "hanging ghost file” issue came up after one have run the TAC, and deleted enourmes amount of data. The SQL Server can get into this hanging state (without the QFE) in certain cases due to this. And of course, install and set up the Test Attachment Cleaner command line power tool. This should be done following some guidelines from Ravi Shanker: “When you run TAC, ensure that you are deleting small chunks of data at regular intervals (say run TAC every night at 3AM to delete data that is between age 730 to 731 days) – this will ensure that small amounts of data are being deleted and SQL ghosted record cleanup can catch up with the number of deletes performed. “ This rule minimizes the risk of the ghosted hang problem to occur, and further makes it easier for the SQL server ghosting process to work smoothly. “Run DBCC SHRINKDB post the ghosted records are cleaned up to physically reclaim the space on the file system” This is the last step in a 3 step process of removing SQL server data. First they are logically deleted. Then they are cleaned out by the ghosting process, and finally removed using the shrinkdb command. Cleaning out the attachments The TAC is run from the command line using a set of parameters and controlled by a settingsfile. The parameters point out a server uri including the team project collection and also point at a specific team project. So in order to run this for multiple team projects regularly one has to set up a script to run the TAC multiple times, once for each team project. When you install the TAC there is a very useful readme file in the same directory. When the deployment binaries are published to the TFS server, ALL items are published up from the deployment folder. That often means much more files than you would assume are necessary. This is a brute force technique. It works, but you need to take care when cleaning up. Grant has shown how their settings file looks in his blog post, removing all attachments older than 180 days , as long as there are no active workitems connected to them. This setting can be useful to clean out all items, both in a clean-up once operation, and in a general There are two scenarios we need to consider: Cleaning up an existing overgrown database Maintaining a server to avoid an overgrown database using scheduled TAC 1. Cleaning up a database which has grown too big due to these attachments. This job is a “Once” job. We do this once and then move on to make sure it won’t happen again, by taking the actions in 2) below. In this scenario you should only consider the large files. Your goal should be to simply reduce the size, and don’t bother about the smaller stuff. That can be left a scheduled TAC cleanup ( 2 below). Here you can use a very general settings file, and just remove the large attachments, or you can choose to remove any old items. Grant’s settings file is an example of the last one. A settings file to remove only large attachments could look like this:  <DeletionCriteria> <TestRun /> <Attachment> <SizeInMB GreaterThan="10" /> </Attachment> </DeletionCriteria> Or like this: If you want only to remove dll’s and pdb’s about that size, add an Extensions-section. Without that section, all extensions will be deleted.  <DeletionCriteria> <TestRun /> <Attachment> <SizeInMB GreaterThan="10" /> <Extensions> <Include value="dll" /> <Include value="pdb" /> </Extensions> </Attachment> </DeletionCriteria> Before you start up your scheduled maintenance, you should clear out all older items. 2. Scheduled maintenance using the TAC If you run a schedule every night, and remove old items, and also remove them in small batches. It is important to run this often, like every night, in order to keep the number of deleted items low. That way the SQL ghost process works better. One approach could be to delete all items older than some number of days, let’s say 180 days. This could be combined with restricting it to keep attachments with active or resolved bugs. Doing this every night ensures that only small amounts of data is deleted.  <DeletionCriteria> <TestRun> <AgeInDays OlderThan="180" /> </TestRun> <Attachment /> <LinkedBugs> <Exclude state="Active" /> <Exclude state="Resolved"/> </LinkedBugs> </DeletionCriteria> In my experience there are projects which are left with active or resolved workitems, akthough no further work is done. It can be wise to have a cleanup process with no restrictions on linked bugs at all. Note that you then have to remove the whole LinkedBugs section. A approach which could work better here is to do a two step approach, use the schedule above to with no LinkedBugs as a sweeper cleaning task taking away all data older than you could care about. Then have another scheduled TAC task to take out more specifically attachments that you are not likely to use. This task could be much more specific, and based on your analysis clean out what you know is troublesome data.  <DeletionCriteria> <TestRun > <AgeInDays OlderThan="30" /> </TestRun> <Attachment> <SizeInMB GreaterThan="10" /> <Extensions> <Include value="iTrace"/> <Include value="dll"/> <Include value="pdb"/> <Include value="wmv"/> </Extensions> </Attachment> <LinkedBugs> <Exclude state="Active" /> <Exclude state="Resolved" /> </LinkedBugs> </DeletionCriteria> The readme document for the TAC says that it recognizes “internal” extensions, but it does recognize any extension. To run the tool do the following command: tcmpt attachmentcleanup /collection:your_tfs_collection_url /teamproject:your_team_project /settingsfile:path_to_settingsfile /outputfile:%temp%/teamproject.tcmpt.log /mode:delete Shrinking the database You could run a shrink database command after the TAC has run in cases where there are a lot of data being deleted. In this case you SHOULD do it, to free up all that space. But, after the shrink operation you should do a rebuild indexes, since the shrink operation will leave the database in a very fragmented state, which will reduce performance. Note that you need to rebuild indexes, reorganizing is not enough. For smaller amounts of data you should NOT shrink the database, since the data will be reused by the SQL server when it need to add more records. In fact, it is regarded as a bad practice to shrink the database regularly. So on a daily maintenance schedule you should NOT shrink the database. To shrink the database you do a DBCC SHRINKDATABASE command, and then follow up with a DBCC INDEXDEFRAG afterwards. I find the easiest way to do this is to create a SQL Maintenance plan including the Shrink Database Task and the Rebuild Index Task and just execute it when you need to do this.

Read the article
Abstracting functionality

- by Ralf Westphal

Originally posted on: http://geekswithblogs.net/theArchitectsNapkin/archive/2014/08/22/abstracting-functionality.aspxWhat is more important than data? Functionality. Yes, I strongly believe we should switch to a functionality over data mindset in programming. Or actually switch back to it. Focus on functionality Functionality once was at the core of software development. Back when algorithms were the first thing you heard about in CS classes. Sure, data structures, too, were important - but always from the point of view of algorithms. (Niklaus Wirth gave one of his books the title “Algorithms + Data Structures” instead of “Data Structures + Algorithms” for a reason.) The reason for the focus on functionality? Firstly, because software was and is about doing stuff. Secondly because sufficient performance was hard to achieve, and only thirdly memory efficiency. But then hardware became more powerful. That gave rise to a new mindset: object orientation. And with it functionality was devalued. Data took over its place as the most important aspect. Now discussions revolved around structures motivated by data relationships. (John Beidler gave his book the title “Data Structures and Algorithms: An Object Oriented Approach” instead of the other way around for a reason.) Sure, this data could be embellished with functionality. But nevertheless functionality was second. When you look at (domain) object models what you mostly find is (domain) data object models. The common object oriented approach is: data aka structure over functionality. This is true even for the most modern modeling approaches like Domain Driven Design. Look at the literature and what you find is recommendations on how to get data structures right: aggregates, entities, value objects. I´m not saying this is what object orientation was invented for. But I´m saying that´s what I happen to see across many teams now some 25 years after object orientation became mainstream through C++, Delphi, and Java. But why should we switch back? Because software development cannot become truly agile with a data focus. The reason for that lies in what customers need first: functionality, behavior, operations. To be clear, that´s not why software is built. The purpose of software is to be more efficient than the alternative. Money mainly is spent to get a certain level of quality (e.g. performance, scalability, security etc.). But without functionality being present, there is nothing to work on the quality of. What customers want is functionality of a certain quality. ASAP. And tomorrow new functionality needs to be added, existing functionality needs to be changed, and quality needs to be increased. No customer ever wanted data or structures. Of course data should be processed. Data is there, data gets generated, transformed, stored. But how the data is structured for this to happen efficiently is of no concern to the customer. Ask a customer (or user) whether she likes the data structured this way or that way. She´ll say, “I don´t care.” But ask a customer (or user) whether he likes the functionality and its quality this way or that way. He´ll say, “I like it” (or “I don´t like it”). Build software incrementally From this very natural focus of customers and users on functionality and its quality follows we should develop software incrementally. That´s what Agility is about. Deliver small increments quickly and often to get frequent feedback. That way less waste is produced, and learning can take place much easier (on the side of the customer as well as on the side of developers). An increment is some added functionality or quality of functionality.[1] So as it turns out, Agility is about functionality over whatever. But software developers’ thinking is still stuck in the object oriented mindset of whatever over functionality. Bummer. I guess that (at least partly) explains why Agility always hits a glass ceiling in projects. It´s a clash of mindsets, of cultures. Driving software development by demanding small increases in functionality runs against thinking about software as growing (data) structures sprinkled with functionality. (Excuse me, if this sounds a bit broad-brush. But you get my point.) The need for abstraction In the end there need to be data structures. Of course. Small and large ones. The phrase functionality over data does not deny that. It´s not functionality instead of data or something. It´s just over, i.e. functionality should be thought of first. It´s a tad more important. It´s what the customer wants. That´s why we need a way to design functionality. Small and large. We need to be able to think about functionality before implementing it. We need to be able to reason about it among team members. We need to be able to communicate our mental models of functionality not just by speaking about them, but also on paper. Otherwise reasoning about it does not scale. We learned thinking about functionality in the small using flow charts, Nassi-Shneiderman diagrams, pseudo code, or UML sequence diagrams. That´s nice and well. But it does not scale. You can use these tools to describe manageable algorithms. But it does not work for the functionality triggered by pressing the “1-Click Order” on an amazon product page for example. There are several reasons for that, I´d say. Firstly, the level of abstraction over code is negligible. It´s essentially non-existent. Drawing a flow chart or writing pseudo code or writing actual code is very, very much alike. All these tools are about control flow like code is.[2] In addition all tools are computationally complete. They are about logic which is expressions and especially control statements. Whatever you code in Java you can fully (!) describe using a flow chart. And then there is no data. They are about control flow and leave out the data altogether. Thus data mostly is assumed to be global. That´s shooting yourself in the foot, as I hope you agree. Even if it´s functionality over data that does not mean “don´t think about data”. Right to the contrary! Functionality only makes sense with regard to data. So data needs to be in the picture right from the start - but it must not dominate the thinking. The above tools fail on this. Bottom line: So far we´re unable to reason in a scalable and abstract manner about functionality. That´s why programmers are so driven to start coding once they are presented with a problem. Programming languages are the only tool they´ve learned to use to reason about functional solutions. Or, well, there might be exceptions. Mathematical notation and SQL may have come to your mind already. Indeed they are tools on a higher level of abstraction than flow charts etc. That´s because they are declarative and not computationally complete. They leave out details - in order to deliver higher efficiency in devising overall solutions. We can easily reason about functionality using mathematics and SQL. That´s great. Except for that they are domain specific languages. They are not general purpose. (And they don´t scale either, I´d say.) Bummer. So to be more precise we need a scalable general purpose tool on a higher than code level of abstraction not neglecting data. Enter: Flow Design. Abstracting functionality using data flows I believe the solution to the problem of abstracting functionality lies in switching from control flow to data flow. Data flow very naturally is not about logic details anymore. There are no expressions and no control statements anymore. There are not even statements anymore. Data flow is declarative by nature. With data flow we get rid of all the limiting traits of former approaches to modeling functionality. In addition, nomen est omen, data flows include data in the functionality picture. With data flows, data is visibly flowing from processing step to processing step. Control is not flowing. Control is wherever it´s needed to process data coming in. That´s a crucial difference and needs some rewiring in your head to be fully appreciated.[2] Since data flows are declarative they are not the right tool to describe algorithms, though, I´d say. With them you don´t design functionality on a low level. During design data flow processing steps are black boxes. They get fleshed out during coding. Data flow design thus is more coarse grained than flow chart design. It starts on a higher level of abstraction - but then is not limited. By nesting data flows indefinitely you can design functionality of any size, without losing sight of your data. Data flows scale very well during design. They can be used on any level of granularity. And they can easily be depicted. Communicating designs using data flows is easy and scales well, too. The result of functional design using data flows is not algorithms (too low level), but processes. Think of data flows as descriptions of industrial production lines. Data as material runs through a number of processing steps to be analyzed, enhances, transformed. On the top level of a data flow design might be just one processing step, e.g. “execute 1-click order”. But below that are arbitrary levels of flows with smaller and smaller steps. That´s not layering as in “layered architecture”, though. Rather it´s a stratified design à la Abelson/Sussman. Refining data flows is not your grandpa´s functional decomposition. That was rooted in control flows. Refining data flows does not suffer from the limits of functional decomposition against which object orientation was supposed to be an antidote. Summary I´ve been working exclusively with data flows for functional design for the past 4 years. It has changed my life as a programmer. What once was difficult is now easy. And, no, I´m not using Clojure or F#. And I´m not a async/parallel execution buff. Designing the functionality of increments using data flows works great with teams. It produces design documentation which can easily be translated into code - in which then the smallest data flow processing steps have to be fleshed out - which is comparatively easy. Using a systematic translation approach code can mirror the data flow design. That way later on the design can easily be reproduced from the code if need be. And finally, data flow designs play well with object orientation. They are a great starting point for class design. But that´s a story for another day. To me data flow design simply is one of the missing links of systematic lightweight software design. There are also other artifacts software development can produce to get feedback, e.g. process descriptions, test cases. But customers can be delighted more easily with code based increments in functionality. ? No, I´m not talking about the endless possibilities this opens for parallel processing. Data flows are useful independently of multi-core processors and Actor-based designs. That´s my whole point here. Data flows are good for reasoning and evolvability. So forget about any special frameworks you might need to reap benefits from data flows. None are necessary. Translating data flow designs even into plain of Java is possible. ?

Read the article
64-bit Archives Needed

- by user9154181

A little over a year ago, we received a question from someone who was trying to build software on Solaris. He was getting errors from the ar command when creating an archive. At that time, the ar command on Solaris was a 32-bit command. There was more than 2GB of data, and the ar command was hitting the file size limit for a 32-bit process that doesn't use the largefile APIs. Even in 2011, 2GB is a very large amount of code, so we had not heard this one before. Most of our toolchain was extended to handle 64-bit sized data back in the 1990's, but archives were not changed, presumably because there was no perceived need for it. Since then of course, programs have continued to get larger, and in 2010, the time had finally come to investigate the issue and find a way to provide for larger archives. As part of that process, I had to do a deep dive into the archive format, and also do some Unix archeology. I'm going to record what I learned here, to document what Solaris does, and in the hope that it might help someone else trying to solve the same problem for their platform. Archive Format Details Archives are hardly cutting edge technology. They are still used of course, but their basic form hasn't changed in decades. Other than to fix a bug, which is rare, we don't tend to touch that code much. The archive file format is described in /usr/include/ar.h, and I won't repeat the details here. Instead, here is a rough overview of the archive file format, implemented by System V Release 4 (SVR4) Unix systems such as Solaris: Every archive starts with a "magic number". This is a sequence of 8 characters: "!<arch>\n". The magic number is followed by 1 or more members. A member starts with a fixed header, defined by the ar_hdr structure in/usr/include/ar.h. Immediately following the header comes the data for the member. Members must be padded at the end with newline characters so that they have even length. The requirement to pad members to an even length is a dead giveaway as to the age of the archive format. It tells you that this format dates from the 1970's, and more specifically from the era of 16-bit systems such as the PDP-11 that Unix was originally developed on. A 32-bit system would have required 4 bytes, and 64-bit systems such as we use today would probably have required 8 bytes. 2 byte alignment is a poor choice for ELF object archive members. 32-bit objects require 4 byte alignment, and 64-bit objects require 64-bit alignment. The link-editor uses mmap() to process archives, and if the members have the wrong alignment, we have to slide (copy) them to the correct alignment before we can access the ELF data structures inside. The archive format requires 2 byte padding, but it doesn't prohibit more. The Solaris ar command takes advantage of this, and pads ELF object members to 8 byte boundaries. Anything else is padded to 2 as required by the format. The archive header (ar_hdr) represents all numeric values using an ASCII text representation rather than as binary integers. This means that an archive that contains only text members can be viewed using tools such as cat, more, or a text editor. The original designers of this format clearly thought that archives would be used for many file types, and not just for objects. Things didn't turn out that way of course — nearly all archives contain relocatable objects for a single operating system and machine, and are used primarily as input to the link-editor (ld). Archives can have special members that are created by the ar command rather than being supplied by the user. These special members are all distinguished by having a name that starts with the slash (/) character. This is an unambiguous marker that says that the user could not have supplied it. The reason for this is that regular archive members are given the plain name of the file that was inserted to create them, and any path components are stripped off. Slash is the delimiter character used by Unix to separate path components, and as such cannot occur within a plain file name. The ar command hides the special members from you when you list the contents of an archive, so most users don't know that they exist. There are only two possible special members: A symbol table that maps ELF symbols to the object archive member that provides it, and a string table used to hold member names that exceed 15 characters. The '/' convention for tagging special members provides room for adding more such members should the need arise. As I will discuss below, we took advantage of this fact to add an alternate 64-bit symbol table special member which is used in archives that are larger than 4GB. When an archive contains ELF object members, the ar command builds a special archive member known as the symbol table that maps all ELF symbols in the object to the archive member that provides it. The link-editor uses this symbol table to determine which symbols are provided by the objects in that archive. If an archive has a symbol table, it will always be the first member in the archive, immediately following the magic number. Unlike member headers, symbol tables do use binary integers to represent offsets. These integers are always stored in big-endian format, even on a little endian host such as x86. The archive header (ar_hdr) provides 15 characters for representing the member name. If any member has a name that is longer than this, then the real name is written into a special archive member called the string table, and the member's name field instead contains a slash (/) character followed by a decimal representation of the offset of the real name within the string table. The string table is required to precede all normal archive members, so it will be the second member if the archive contains a symbol table, and the first member otherwise. The archive format is not designed to make finding a given member easy. Such operations move through the archive from front to back examining each member in turn, and run in O(n) time. This would be bad if archives were commonly used in that manner, but in general, they are not. Typically, the ar command is used to build an new archive from scratch, inserting all the objects in one operation, and then the link-editor accesses the members in the archive in constant time by using the offsets provided by the symbol table. Both of these operations are reasonably efficient. However, listing the contents of a large archive with the ar command can be rather slow. Factors That Limit Solaris Archive Size As is often the case, there was more than one limiting factor preventing Solaris archives from growing beyond the 32-bit limits of 2GB (32-bit signed) and 4GB (32-bit unsigned). These limits are listed in the order they are hit as archive size grows, so the earlier ones mask those that follow. The original Solaris archive file format can handle sizes up to 4GB without issue. However, the ar command was delivered as a 32-bit executable that did not use the largefile APIs. As such, the ar command itself could not create a file larger than 2GB. One can solve this by building ar with the largefile APIs which would allow it to reach 4GB, but a simpler and better answer is to deliver a 64-bit ar, which has the ability to scale well past 4GB. Symbol table offsets are stored as 32-bit big-endian binary integers, which limits the maximum archive size to 4GB. To get around this limit requires a different symbol table format, or an extension mechanism to the current one, similar in nature to the way member names longer than 15 characters are handled in member headers. The size field in the archive member header (ar_hdr) is an ASCII string capable of representing a 32-bit unsigned value. This places a 4GB size limit on the size of any individual member in an archive. In considering format extensions to get past these limits, it is important to remember that very few archives will require the ability to scale past 4GB for many years. The old format, while no beauty, continues to be sufficient for its purpose. This argues for a backward compatible fix that allows newer versions of Solaris to produce archives that are compatible with older versions of the system unless the size of the archive exceeds 4GB. Archive Format Differences Among Unix Variants While considering how to extend Solaris archives to scale to 64-bits, I wanted to know how similar archives from other Unix systems are to those produced by Solaris, and whether they had already solved the 64-bit issue. I've successfully moved archives between different Unix systems before with good luck, so I knew that there was some commonality. If it turned out that there was already a viable defacto standard for 64-bit archives, it would obviously be better to adopt that rather than invent something new. The archive file format is not formally standardized. However, the ar command and archive format were part of the original Unix from Bell Labs. Other systems started with that format, extending it in various often incompatible ways, but usually with the same common shared core. Most of these systems use the same magic number to identify their archives, despite the fact that their archives are not always fully compatible with each other. It is often true that archives can be copied between different Unix variants, and if the member names are short enough, the ar command from one system can often read archives produced on another. In practice, it is rare to find an archive containing anything other than objects for a single operating system and machine type. Such an archive is only of use on the type of system that created it, and is only used on that system. This is probably why cross platform compatibility of archives between Unix variants has never been an issue. Otherwise, the use of the same magic number in archives with incompatible formats would be a problem. I was able to find information for a number of Unix variants, described below. These can be divided roughly into three tribes, SVR4 Unix, BSD Unix, and IBM AIX. Solaris is a SVR4 Unix, and its archives are completely compatible with those from the other members of that group (GNU/Linux, HP-UX, and SGI IRIX). AIX AIX is an exception to rule that Unix archive formats are all based on the original Bell labs Unix format. It appears that AIX supports 2 formats (small and big), both of which differ in fundamental ways from other Unix systems: These formats use a different magic number than the standard one used by Solaris and other Unix variants. They include support for removing archive members from a file without reallocating the file, marking dead areas as unused, and reusing them when new archive items are inserted. They have a special table of contents member (File Member Header) which lets you find out everything that's in the archive without having to actually traverse the entire file. Their symbol table members are quite similar to those from other systems though. Their member headers are doubly linked, containing offsets to both the previous and next members. Of the Unix systems described here, AIX has the only format I saw that will have reasonable insert/delete performance for really large archives. Everyone else has O(n) performance, and are going to be slow to use with large archives. BSD BSD has gone through 4 versions of archive format, which are described in their manpage. They use the same member header as SVR4, but their symbol table format is different, and their scheme for long member names puts the name directly after the member header rather than into a string table. GNU/Linux The GNU toolchain uses the SVR4 format, and is compatible with Solaris. HP-UX HP-UX seems to follow the SVR4 model, and is compatible with Solaris. IRIX IRIX has 32 and 64-bit archives. The 32-bit format is the standard SVR4 format, and is compatible with Solaris. The 64-bit format is the same, except that the symbol table uses 64-bit integers. IRIX assumes that an archive contains objects of a single ELFCLASS/MACHINE, and any archive containing ELFCLASS64 objects receives a 64-bit symbol table. Although they only use it for 64-bit objects, nothing in the archive format limits it to ELFCLASS64. It would be perfectly valid to produce a 64-bit symbol table in an archive containing 32-bit objects, text files, or anything else. Tru64 Unix (Digital/Compaq/HP) Tru64 Unix uses a format much like ours, but their symbol table is a hash table, making specific symbol lookup much faster. The Solaris link-editor uses archives by examining the entire symbol table looking for unsatisfied symbols for the link, and not by looking up individual symbols, so there would be no benefit to Solaris from such a hash table. The Tru64 ld must use a different approach in which the hash table pays off for them. Widening the existing SVR4 archive symbol tables rather than inventing something new is the simplest path forward. There is ample precedent for this approach in the ELF world. When ELF was extended to support 64-bit objects, the approach was largely to take the existing data structures, and define 64-bit versions of them. We called the old set ELF32, and the new set ELF64. My guess is that there was no need to widen the archive format at that time, but had there been, it seems obvious that this is how it would have been done. The Implementation of 64-bit Solaris Archives As mentioned earlier, there was no desire to improve the fundamental nature of archives. They have always had O(n) insert/delete behavior, and for the most part it hasn't mattered. AIX made efforts to improve this, but those efforts did not find widespread adoption. For the purposes of link-editing, which is essentially the only thing that archives are used for, the existing format is adequate, and issues of backward compatibility trump the desire to do something technically better. Widening the existing symbol table format to 64-bits is therefore the obvious way to proceed. For Solaris 11, I implemented that, and I also updated the ar command so that a 64-bit version is run by default. This eliminates the 2 most significant limits to archive size, leaving only the limit on an individual archive member. We only generate a 64-bit symbol table if the archive exceeds 4GB, or when the new -S option to the ar command is used. This maximizes backward compatibility, as an archive produced by Solaris 11 is highly likely to be less than 4GB in size, and will therefore employ the same format understood by older versions of the system. The main reason for the existence of the -S option is to allow us to test the 64-bit format without having to construct huge archives to do so. I don't believe it will find much use outside of that. Other than the new ability to create and use extremely large archives, this change is largely invisible to the end user. When reading an archive, the ar command will transparently accept either form of symbol table. Similarly, the ELF library (libelf) has been updated to understand either format. Users of libelf (such as the link-editor ld) do not need to be modified to use the new format, because these changes are encapsulated behind the existing functions provided by libelf. As mentioned above, this work did not lift the limit on the maximum size of an individual archive member. That limit remains fixed at 4GB for now. This is not because we think objects will never get that large, for the history of computing says otherwise. Rather, this is based on an estimation that single relocatable objects of that size will not appear for a decade or two. A lot can change in that time, and it is better not to overengineer things by writing code that will sit and rot for years without being used. It is not too soon however to have a plan for that eventuality. When the time comes when this limit needs to be lifted, I believe that there is a simple solution that is consistent with the existing format. The archive member header size field is an ASCII string, like the name, and as such, the overflow scheme used for long names can also be used to handle the size. The size string would be placed into the archive string table, and its offset in the string table would then be written into the archive header size field using the same format "/ddd" used for overflowed names.

Read the article
PASS: Bylaw Changes

- by Bill Graziano

While you’re reading this, a post should be going up on the PASS blog on the plans to change our bylaws. You should be able to find our old bylaws, our proposed bylaws and a red-lined version of the changes. We plan to listen to feedback until March 31st. At that point we’ll decide whether to vote on these changes or take other action. The executive summary is that we’re adding a restriction to prevent more than two people from the same company on the Board and eliminating the Board’s Officer Appointment Committee to have Officers directly elected by the Board. This second change better matches how officer elections have been conducted in the past. The Gritty Details Our scope was to change bylaws to match how PASS actually works and tackle a limited set of issues. Changing the bylaws is hard. We’ve been working on these changes since the March board meeting last year. At that meeting we met and talked through the issues we wanted to address. In years past the Board has tried to come up with language and then we’ve discussed and negotiated to get to the result. In March, we gave HQ guidance on what we wanted and asked them to come up with a starting point. Hannes worked on building us an initial set of changes that we could work our way through. Discussing changes like this over email is difficult wasn’t very productive. We do a much better job on this at the in-person Board meetings. Unfortunately there are only 2 or 3 of those a year. In August we met in Nashville and spent time discussing the changes. That was also the day after we released the slate for the 2010 election. The discussion around that colored what we talked about in terms of these changes. We talked very briefly at the Summit and again reviewed and revised the changes at the Board meeting in January. This is the result of those changes and discussions. We made numerous small changes to clean up language and make wording more clear. We also made two big changes. Director Employment Restrictions The first is that only two people from the same company can serve on the Board at the same time. The actual language in section VI.3 reads: A maximum of two (2) Directors who are employed by, or who are joint owners or partners in, the same for-profit venture, company, organization, or other legal entity, may concurrently serve on the PASS Board of Directors at any time. The definition of “employed” is at the sole discretion of the Board. And what a mess this turns out to be in practice. Our membership is a hodgepodge of interlocking relationships. Let’s say three Board members get together and start a blog service for SQL Server bloggers. It’s technically for-profit. Let’s assume it makes $8 in the first year. Does that trigger this clause? (Technically yes.) We had a horrible time trying to write language that covered everything. All the sample bylaws that we found were just as vague as this. That led to the third clause in this section. The first sentence reads: The Board of Directors reserves the right, strictly on a case-by-case basis, to overrule the requirements of Section VI.3 by majority decision for any single Director’s conflict of employment. We needed some way to handle the trivial issues and exercise some judgment. It seems like a public vote is the best way. This discloses the relationship and gets each Board member on record on the issue. In practice I think this clause will rarely be used. I think this entire section will only be invoked for actual employment issues and not for small side projects. In either case we have the mechanisms in place to handle it in a public, transparent way. That’s the first and third clauses. The second clause says that if your situation changes and you fall afoul of this restriction you need to notify the Board. The clause further states that if this new job means a Board members violates the “two-per-company” rule the Board may request their resignation. The Board can also allow the person to continue serving with a majority vote. I think this will also take some judgment. Consider a person switching jobs that leads to three people from the same company. I’m very likely to ask for someone to resign if all three are two weeks into a two year term. I’m unlikely to ask anyone to resign if one is two weeks away from ending their term. In either case, the decision will be a public vote that we can be held accountable for. One concern that was raised was whether this would affect someone choosing to accept a job. I think that’s a choice for them to make. PASS is clearly stating its intent that only two directors from any one organization should serve at any time. Once these bylaws are approved, this policy should not come as a surprise to any potential or current Board members considering a job change. This clause isn’t perfect. The biggest hole is business relationships that aren’t defined above. Let’s say that two employees from company “X” serve on the Board. What happens if I accept a full-time consulting contract with that company? Let’s assume I’m working directly for one of the two existing Board members. That doesn’t violate section VI.3. But I think it’s clearly the kind of relationship we’d like to prevent. Unfortunately that was even harder to write than what we have now. I fully expect that in the next revision of the bylaws we’ll address this. It just didn’t make it into this one. Officer Elections The officer election process received a slightly different rewrite. Our goal was to codify in the bylaws the actual process we used to elect the officers. The officers are the President, Executive Vice-President (EVP) and Vice-President of Marketing. The Immediate Past President (IPP) is also an officer but isn’t elected. The IPP serves in that role for two years after completing their term as President. We do that for continuity’s sake. Some organizations have a President-elect that serves for one or two years. The group that founded PASS chose to have an IPP. When I started on the Board, the Nominating Committee (NomCom) selected the slate for the at-large directors and the slate for the officers. There was always one candidate for each officer position. It wasn’t really an election so much as the NomCom decided who the next person would be for each officer position. Behind the scenes the Board worked to select the best people for the role. In June 2009 that process was changed to bring it line with what actually happens. An Officer Appointment Committee was created that was a subset of the Board. That committee would take time to interview the candidates and present a slate to the Board for approval. The majority vote of the Board would determine the officers for the next two years. In practice the Board itself interviewed the candidates and conducted the elections. That means it was time to change the bylaws again. Section VII.2 and VII.3 spell out the process used to select the officers. We use the phrase “Officer Appointment” to separate it from the Director election but the end result is that the Board elects the officers. Section VII.3 starts: Officers shall be appointed bi-annually by a majority of all the voting members of the Board of Directors. Everything else revolves around that sentence. We use the word appoint but they truly are elected. There are details in the bylaws for term limits, minimum requirements for President (1 prior term as an officer), tie breakers and filling vacancies. In practice we will have an election for President, then an election for EVP and then an election for VP Marketing. That means that losing candidates will be able to fall down the ladder and run for the next open position. Another point to note is that officers aren’t at-large directors. That means if a current sitting officer loses all three elections they are off the Board. Having Board member votes public will help with the transparency of this approach. This process has a number of positive and negatives. The biggest concern I expect to hear is that our members don’t directly choose the officers. I’m going to try and list all the positives and negatives of this approach. Many non-profits value continuity and are slower to change than a business. On the plus side this promotes that. On the negative side this promotes that. If we change too slowly the members complain that we aren’t responsive. If we change too quickly we make mistakes and fail at various things. We’ve been criticized for both of those lately so I’m not entirely sure where to draw the line. My rough assumption to this point is that we’re going too slow on governance and too quickly on becoming “more than a Summit.” This approach creates competition in the officer elections. If you are an at-large director there is no consequence to losing an election. If you are an officer the only way to stay on the Board is to win an officer election or an at-large election. If you are an officer and lose an election you can always run for the next office down. This makes it very easy for multiple people to contest an election. There is value in a person moving through the officer positions up to the Presidency. Having the Board select the officers promotes this. The down side is that it takes a LOT of time to get to the Presidency. We’ve had good people struggle with burnout. We’ve had lots of discussion around this. The process as we’ve described it here makes it possible for someone to move quickly through the ranks but doesn’t prevent people from working their way up through each role. We talked long and hard about having the officers elected by the members. We had a self-imposed deadline to complete these changes prior to elections this summer. The other challenge was that our original goal was to make the bylaws reflect our actual process rather than create a new one. I believe we accomplished this goal. We ran out of time to consider this option in the detail it needs. Having member elections for officers needs a number of problems solved. We would need a way for candidates to fall through the election. This is what promotes competition. Without this few people would risk an election and we’ll be back to one candidate per slot. We need to do this without having multiple elections. We may be able to copy what other organizations are doing but I was surprised at how little I could find on other organizations. We also need a way for people that lose an officer election to win an at-large election. Otherwise we’ll have very little competition for officers. This brings me to an area that I think we as a Board haven’t done a good job. We haven’t built a strong process to tell you who is doing a good job and who isn’t. This is a double-edged sword. I don’t want to highlight Board members that are failing. That’s not a good way to get people to volunteer and run for the Board. But I also need a way let the members make an informed choice about who is doing a good job and would make a good officer. Encouraging Board members to blog, publishing minutes and making votes public helps in that regard but isn’t the final answer. I don’t know what the final answer is yet. I do know that the Board members themselves are uniquely positioned to know which other Board members are doing good work. They know who speaks up in meetings, who works to build consensus, who has good ideas and who works with the members. What I Could Do Better I’ve learned a lot writing this about how we communicated with our members. The next time we revise the bylaws I’d do a few things differently. The biggest change would be to provide better documentation. The March 2009 minutes provide a very detailed look into what changes we wanted to make to the bylaws. Looking back, I’m a little surprised at how closely they matched our final changes and covered the various arguments. If you just read those you’d get 90% of what we eventually changed. Nearly everything else was just details around implementation. I’d also consider publishing a scope document defining exactly what we were doing any why. I think it really helped that we had a limited, defined goal in mind. I don’t think we did a good job communicating that goal outside the meeting minutes though. That said, I wish I’d blogged more after the August and January meeting. I think it would have helped more people to know that this change was coming and to be ready for it. Conclusion These changes address two big concerns that the Board had. First, it prevents a single organization from dominating the Board. Second, it codifies and clearly spells out how officers are elected. This is the process that was previously followed but it was somewhat murky. These changes bring clarity to this and clearly explain the process the Board will follow. We’re going to listen to feedback until March 31st. At that time we’ll decide whether to approve these changes. I’m also assuming that we’ll start another round of changes in the next year or two. Are there other issues in the bylaws that we should tackle in the future?

Read the article
How to simulate inner join on very large files in java (without running out of memory)

- by Constantin

I am trying to simulate SQL joins using java and very large text files (INNER, RIGHT OUTER and LEFT OUTER). The files have already been sorted using an external sort routine. The issue I have is I am trying to find the most efficient way to deal with the INNER join part of the algorithm. Right now I am using two Lists to store the lines that have the same key and iterate through the set of lines in the right file once for every line in the left file (provided the keys still match). In other words, the join key is not unique in each file so would need to account for the Cartesian product situations ... left_01, 1 left_02, 1 right_01, 1 right_02, 1 right_03, 1 left_01 joins to right_01 using key 1 left_01 joins to right_02 using key 1 left_01 joins to right_03 using key 1 left_02 joins to right_01 using key 1 left_02 joins to right_02 using key 1 left_02 joins to right_03 using key 1 My concern is one of memory. I will run out of memory if i use the approach below but still want the inner join part to work fairly quickly. What is the best approach to deal with the INNER join part keeping in mind that these files may potentially be huge public class Joiner { private void join(BufferedReader left, BufferedReader right, BufferedWriter output) throws Throwable { BufferedReader _left = left; BufferedReader _right = right; BufferedWriter _output = output; Record _leftRecord; Record _rightRecord; _leftRecord = read(_left); _rightRecord = read(_right); while( _leftRecord != null && _rightRecord != null ) { if( _leftRecord.getKey() < _rightRecord.getKey() ) { write(_output, _leftRecord, null); _leftRecord = read(_left); } else if( _leftRecord.getKey() > _rightRecord.getKey() ) { write(_output, null, _rightRecord); _rightRecord = read(_right); } else { List<Record> leftList = new ArrayList<Record>(); List<Record> rightList = new ArrayList<Record>(); _leftRecord = readRecords(leftList, _leftRecord, _left); _rightRecord = readRecords(rightList, _rightRecord, _right); for( Record equalKeyLeftRecord : leftList ){ for( Record equalKeyRightRecord : rightList ){ write(_output, equalKeyLeftRecord, equalKeyRightRecord); } } } } if( _leftRecord != null ) { write(_output, _leftRecord, null); _leftRecord = read(_left); while(_leftRecord != null) { write(_output, _leftRecord, null); _leftRecord = read(_left); } } else { if( _rightRecord != null ) { write(_output, null, _rightRecord); _rightRecord = read(_right); while(_rightRecord != null) { write(_output, null, _rightRecord); _rightRecord = read(_right); } } } _left.close(); _right.close(); _output.flush(); _output.close(); } private Record read(BufferedReader reader) throws Throwable { Record record = null; String data = reader.readLine(); if( data != null ) { record = new Record(data.split("\t")); } return record; } private Record readRecords(List<Record> list, Record record, BufferedReader reader) throws Throwable { int key = record.getKey(); list.add(record); record = read(reader); while( record != null && record.getKey() == key) { list.add(record); record = read(reader); } return record; } private void write(BufferedWriter writer, Record left, Record right) throws Throwable { String leftKey = (left == null ? "null" : Integer.toString(left.getKey())); String leftData = (left == null ? "null" : left.getData()); String rightKey = (right == null ? "null" : Integer.toString(right.getKey())); String rightData = (right == null ? "null" : right.getData()); writer.write("[" + leftKey + "][" + leftData + "][" + rightKey + "][" + rightData + "]\n"); } public static void main(String[] args) { try { BufferedReader leftReader = new BufferedReader(new FileReader("LEFT.DAT")); BufferedReader rightReader = new BufferedReader(new FileReader("RIGHT.DAT")); BufferedWriter output = new BufferedWriter(new FileWriter("OUTPUT.DAT")); Joiner joiner = new Joiner(); joiner.join(leftReader, rightReader, output); } catch (Throwable e) { e.printStackTrace(); } } } After applying the ideas from the proposed answer, I changed the loop to this private void join(RandomAccessFile left, RandomAccessFile right, BufferedWriter output) throws Throwable { long _pointer = 0; RandomAccessFile _left = left; RandomAccessFile _right = right; BufferedWriter _output = output; Record _leftRecord; Record _rightRecord; _leftRecord = read(_left); _rightRecord = read(_right); while( _leftRecord != null && _rightRecord != null ) { if( _leftRecord.getKey() < _rightRecord.getKey() ) { write(_output, _leftRecord, null); _leftRecord = read(_left); } else if( _leftRecord.getKey() > _rightRecord.getKey() ) { write(_output, null, _rightRecord); _pointer = _right.getFilePointer(); _rightRecord = read(_right); } else { long _tempPointer = 0; int key = _leftRecord.getKey(); while( _leftRecord != null && _leftRecord.getKey() == key ) { _right.seek(_pointer); _rightRecord = read(_right); while( _rightRecord != null && _rightRecord.getKey() == key ) { write(_output, _leftRecord, _rightRecord ); _tempPointer = _right.getFilePointer(); _rightRecord = read(_right); } _leftRecord = read(_left); } _pointer = _tempPointer; } } if( _leftRecord != null ) { write(_output, _leftRecord, null); _leftRecord = read(_left); while(_leftRecord != null) { write(_output, _leftRecord, null); _leftRecord = read(_left); } } else { if( _rightRecord != null ) { write(_output, null, _rightRecord); _rightRecord = read(_right); while(_rightRecord != null) { write(_output, null, _rightRecord); _rightRecord = read(_right); } } } _left.close(); _right.close(); _output.flush(); _output.close(); } UPDATE While this approach worked, it was terribly slow and so I have modified this to create files as buffers and this works very well. Here is the update ... private long getMaxBufferedLines(File file) throws Throwable { long freeBytes = Runtime.getRuntime().freeMemory() / 2; return (freeBytes / (file.length() / getLineCount(file))); } private void join(File left, File right, File output, JoinType joinType) throws Throwable { BufferedReader leftFile = new BufferedReader(new FileReader(left)); BufferedReader rightFile = new BufferedReader(new FileReader(right)); BufferedWriter outputFile = new BufferedWriter(new FileWriter(output)); long maxBufferedLines = getMaxBufferedLines(right); Record leftRecord; Record rightRecord; leftRecord = read(leftFile); rightRecord = read(rightFile); while( leftRecord != null && rightRecord != null ) { if( leftRecord.getKey().compareTo(rightRecord.getKey()) < 0) { if( joinType == JoinType.LeftOuterJoin || joinType == JoinType.LeftExclusiveJoin || joinType == JoinType.FullExclusiveJoin || joinType == JoinType.FullOuterJoin ) { write(outputFile, leftRecord, null); } leftRecord = read(leftFile); } else if( leftRecord.getKey().compareTo(rightRecord.getKey()) > 0 ) { if( joinType == JoinType.RightOuterJoin || joinType == JoinType.RightExclusiveJoin || joinType == JoinType.FullExclusiveJoin || joinType == JoinType.FullOuterJoin ) { write(outputFile, null, rightRecord); } rightRecord = read(rightFile); } else if( leftRecord.getKey().compareTo(rightRecord.getKey()) == 0 ) { String key = leftRecord.getKey(); List<File> rightRecordFileList = new ArrayList<File>(); List<Record> rightRecordList = new ArrayList<Record>(); rightRecordList.add(rightRecord); rightRecord = consume(key, rightFile, rightRecordList, rightRecordFileList, maxBufferedLines); while( leftRecord != null && leftRecord.getKey().compareTo(key) == 0 ) { processRightRecords(outputFile, leftRecord, rightRecordFileList, rightRecordList, joinType); leftRecord = read(leftFile); } // need a dispose for deleting files in list } else { throw new Exception("DATA IS NOT SORTED"); } } if( leftRecord != null ) { if( joinType == JoinType.LeftOuterJoin || joinType == JoinType.LeftExclusiveJoin || joinType == JoinType.FullExclusiveJoin || joinType == JoinType.FullOuterJoin ) { write(outputFile, leftRecord, null); } leftRecord = read(leftFile); while(leftRecord != null) { if( joinType == JoinType.LeftOuterJoin || joinType == JoinType.LeftExclusiveJoin || joinType == JoinType.FullExclusiveJoin || joinType == JoinType.FullOuterJoin ) { write(outputFile, leftRecord, null); } leftRecord = read(leftFile); } } else { if( rightRecord != null ) { if( joinType == JoinType.RightOuterJoin || joinType == JoinType.RightExclusiveJoin || joinType == JoinType.FullExclusiveJoin || joinType == JoinType.FullOuterJoin ) { write(outputFile, null, rightRecord); } rightRecord = read(rightFile); while(rightRecord != null) { if( joinType == JoinType.RightOuterJoin || joinType == JoinType.RightExclusiveJoin || joinType == JoinType.FullExclusiveJoin || joinType == JoinType.FullOuterJoin ) { write(outputFile, null, rightRecord); } rightRecord = read(rightFile); } } } leftFile.close(); rightFile.close(); outputFile.flush(); outputFile.close(); } public void processRightRecords(BufferedWriter outputFile, Record leftRecord, List<File> rightFiles, List<Record> rightRecords, JoinType joinType) throws Throwable { for(File rightFile : rightFiles) { BufferedReader rightReader = new BufferedReader(new FileReader(rightFile)); Record rightRecord = read(rightReader); while(rightRecord != null){ if( joinType == JoinType.LeftOuterJoin || joinType == JoinType.RightOuterJoin || joinType == JoinType.FullOuterJoin || joinType == JoinType.InnerJoin ) { write(outputFile, leftRecord, rightRecord); } rightRecord = read(rightReader); } rightReader.close(); } for(Record rightRecord : rightRecords) { if( joinType == JoinType.LeftOuterJoin || joinType == JoinType.RightOuterJoin || joinType == JoinType.FullOuterJoin || joinType == JoinType.InnerJoin ) { write(outputFile, leftRecord, rightRecord); } } } /** * consume all records having key (either to a single list or multiple files) each file will * store a buffer full of data. The right record returned represents the outside flow (key is * already positioned to next one or null) so we can't use this record in below while loop or * within this block in general when comparing current key. The trick is to keep consuming * from a List. When it becomes empty, re-fill it from the next file until all files have * been consumed (and the last node in the list is read). The next outside iteration will be * ready to be processed (either it will be null or it points to the next biggest key * @throws Throwable * */ private Record consume(String key, BufferedReader reader, List<Record> records, List<File> files, long bufferMaxRecordLines ) throws Throwable { boolean processComplete = false; Record record = records.get(records.size() - 1); while(!processComplete){ long recordCount = records.size(); if( record.getKey().compareTo(key) == 0 ){ record = read(reader); while( record != null && record.getKey().compareTo(key) == 0 && recordCount < bufferMaxRecordLines ) { records.add(record); recordCount++; record = read(reader); } } processComplete = true; // if record is null, we are done if( record != null ) { // if the key has changed, we are done if( record.getKey().compareTo(key) == 0 ) { // Same key means we have exhausted the buffer. // Dump entire buffer into a file. The list of file // pointers will keep track of the files ... processComplete = false; dumpBufferToFile(records, files); records.clear(); records.add(record); } } } return record; } /** * Dump all records in List of Record objects to a file. Then, add that * file to List of File objects * * NEED TO PLACE A LIMIT ON NUMBER OF FILE POINTERS (check size of file list) * * @param records * @param files * @throws Throwable */ private void dumpBufferToFile(List<Record> records, List<File> files) throws Throwable { String prefix = "joiner_" + files.size() + 1; String suffix = ".dat"; File file = File.createTempFile(prefix, suffix, new File("cache")); BufferedWriter writer = new BufferedWriter(new FileWriter(file)); for( Record record : records ) { writer.write( record.dump() ); } files.add(file); writer.flush(); writer.close(); }

Read the article
How to install SQL Server 2005 Configuration Manager without installing SQL Server Management Studio

- by Arnold Zokas

Hi, I need to configure SQL Server aliases on a public-facing production server. To do that, I need to install SQL Server Configuration Manager. I was not able to find a standalone installer for that, so I am having to install SQL Server 2005 Client Components. This approach is not ideal as we don't want to have SSMS on an public-facing production server. Is there a way to install SQL Server 2005 Configuration Manager without installing SQL Server Management Studio? Thanks, Arnold

Read the article

Search Results

Search found 9254 results on 371 pages for 'approach'.

Page 84/371 | < Previous Page | 80 81 82 83 84 85 86 87 88 89 90 91 | Next Page >

- by Angela

- by Mark Rullo

- by Sunny

- by jeff

- by lazyguy

- by Brad8118

- by EXP0

- by PulpFiction

- by PulpFiction

- by Bojin Li

- by user193655

- by jasterm007

- by Newbie

- by whytheq

- by user157195

- by thr

- by Rick Strahl

- by Rick Strahl

- by mortezam

- by terje

- by Ralf Westphal

- by user9154181

- by Bill Graziano

- by Constantin

- by Arnold Zokas

< Previous Page | 80 81 82 83 84 85 86 87 88 89 90 91 | Next Page >