Search Results

Search found 1281 results on 52 pages for 'joes 2 pros'.

Page 51/52 | < Previous Page | 47 48 49 50 51 52 | Next Page >

Developing Mobile Applications: Web, Native, or Hybrid?

- by Michelle Kimihira

Authors: Joe Huang, Senior Principal Product Manager, Oracle Mobile Application Development Framework and Carlos Chang, Senior Principal Product Director The proliferation of mobile devices and platforms represents a game-changing technology shift on a number of levels. Companies must decide not only the best strategic use of mobile platforms, but also how to most efficiently implement them. Inevitably, this conversation devolves to the developers, who face the task of developing and supporting mobile applications—not a simple task in light of the number of devices and platforms. Essentially, developers can choose from the following three different application approaches, each with its own set of pros and cons. Native Applications: This refers to apps built for and installed on a specific platform, such as iOS or Android, using a platform-specific software development kit (SDK). For example, apps for Apple’s iPhone and iPad are designed to run specifically on iOS and are written in Xcode/Objective-C. Android has its own variation of Java, Windows uses C#, and so on. Native apps written for one platform cannot be deployed on another. Native apps offer fast performance and access to native-device services but require additional resources to develop and maintain each platform, which can be expensive and time consuming. Mobile Web Applications: Unlike native apps, mobile web apps are not installed on the device; rather, they are accessed via a Web browser. These are server-side applications that render HTML, typically adjusting the design depending on the type of device making the request. There are no program coding constraints for writing server-side apps—they can be written in Java, C, PHP, etc., it doesn’t matter. Instead, the server detects what type of mobile browser is pinging the server and adjusts accordingly. For example, it can deliver fully JavaScript and CSS-enabled content to smartphone browsers, while downgrading gracefully to basic HTML for feature phone browsers. Mobile apps work across platforms, but are limited to what you can do through a browser and require Internet connectivity. For certain types of applications, these constraints may not be an issue. Oracle supports mobile web applications via ADF Faces (for tablets) and ADF Mobile browser (Trinidad) for smartphone and feature phones. Hybrid Applications: As the name implies, hybrid apps combine technologies from native and mobile Web apps to gain the benefits each. For example, these apps are installed on a device, like their pure native app counterparts, while the user interface (UI) is based on HTML5. This UI runs locally within the native container, which usually leverages the device’s browser engine. The advantage of using HTML5 is a consistent, cross-platform UI that works well on most devices. Combining this with the native container, which is installed on-device, provides mobile users with access to local device services, such as camera, GPS, and local device storage. Native apps may offer greater flexibility in integrating with device native services. However, since hybrid applications already provide device integrations that typical enterprise applications need, this is typically less of an issue. The new Oracle ADF Mobile release is an HTML5 and Java hybrid framework that targets mobile app development to iOS and Android from one code base. So, Which is the Best Approach? The short answer is – the best choice depends on the type of application you are developing. For instance, animation-intensive apps such as games would favor native apps, while hybrid applications may be better suited for enterprise mobile apps because they provide multi-platform support. Just for starters, the following issues must be considered when choosing a development path. Application Complexity: How complex is the application? A quick app that accesses a database or Web service for some data to display? You can keep it simple, and a mobile Web app may suffice. However, for a mobile/field worker type of applications that supports mission critical functionality, hybrid or native applications are typically needed. Richness of User Interactivity: What type of user experience is required for the application? Mobile browser-based app that’s optimized for mobile UI may suffice for quick lookup or productivity type of applications. However, hybrid/native application would typically be required to deliver highly interactive user experiences needed for field-worker type of applications. For example, interactive BI charts/graphs, maps, voice/email integration, etc. In the most extreme case like gaming applications, native applications may be necessary to deliver the highly animated and graphically intensive user experience. Performance: What type of performance is required by the application functionality? For instance, for real-time look up of data over the network, mobile app performance depends on network latency and server infrastructure capabilities. If consistent performance is required, data would typically need to be cached, which is supported on hybrid or native applications only. Connectivity and Availability: What sort of connectivity will your application require? Does the app require Web access all the time in order to always retrieve the latest data from the server? Or do the requirements dictate offline support? While native and hybrid apps can be built to operate offline, Web mobile apps require Web connectivity. Multi-platform Requirements: The terms “consumerization of IT” and BYOD (bring your own device) effectively mean that the line between the consumer and the enterprise devices have become blurred. Employees are bringing their personal mobile devices to work and are often expecting that they work in the corporate network and access back-office applications. Even if companies restrict access to the big dogs: (iPad, iPhone, Android phones and tablets, possibly Windows Phone and tablets), trying to support each platform natively will require increasing resources and domain expertise with each new language/platform. And let’s not forget the maintenance costs, involved in upgrading new versions of each platform. Where multi-platform support is needed, Web mobile or hybrid apps probably have the advantage. Going native, and trying to support multiple operating systems may be cost prohibitive with existing resources and developer skills. Device-Services Access: If your app needs to access local device services, such as the camera, contacts app, accelerometer, etc., then your choices are limited to native or hybrid applications. Fragmentation: Apple controls Apple iOS and the only concern is what version iOS is running on any given device. Not so Android, which is open source. There are many, many versions and variants of Android running on different devices, which can be a nightmare for app developers trying to support different devices running different flavors of Android. (Is it an Amazon Kindle Fire? a Samsung Galaxy? A Barnes & Noble Nook?) This is a nightmare scenario for native apps—on the other hand, a mobile Web or hybrid app, when properly designed, can shield you from these complexities because they are based on common frameworks. Resources: How many developers can you dedicate to building and supporting mobile application development? What are their existing skills sets? If you’re considering native application development due to the complexity of the application under development, factor the costs of becoming proficient on a each platform’s OS and programming language. Add another platform, and that’s another language, another SDK. On the other side of the equation, Web mobile or hybrid applications are simpler to make, and readily support more platforms, but there may be performance trade-offs. Conclusion This only scratches the surface. However, I hope to have suggested some food for thought in choosing your mobile development strategy. Do your due diligence, search the Web, read up on mobile, talk to peers, attend events. The development team at Oracle is working hard on mobile technologies to help customers extend enterprise applications to mobile faster and effectively. To learn more on what Oracle has to offer, check out the Oracle ADF Mobile (hybrid) and ADF Faces/ADF Mobile browser (Web Mobile) solutions from Oracle. Additional Information Blog: ADF Blog Product Information on OTN: ADF Mobile Product Information on Oracle.com: Oracle Fusion Middleware Follow us on Twitter and Facebook Subscribe to our regular Fusion Middleware Newsletter

Read the article
Why do we (really) program to interfaces?

- by Kyle Burns

One of the earliest lessons I was taught in Enterprise development was "always program against an interface". This was back in the VB6 days and I quickly learned that no code would be allowed to move to the QA server unless my business objects and data access objects each are defined as an interface and have a matching implementation class. Why? "It's more reusable" was one answer. "It doesn't tie you to a specific implementation" a slightly more knowing answer. And let's not forget the discussion ending "it's a standard". The problem with these responses was that senior people didn't really understand the reason we were doing the things we were doing and because of that, we were entirely unable to realize the intent behind the practice - we simply used interfaces and had a bunch of extra code to maintain to show for it. It wasn't until a few years later that I finally heard the term "Inversion of Control". Simply put, "Inversion of Control" takes the creation of objects that used to be within the control (and therefore a responsibility of) of your component and moves it to some outside force. For example, consider the following code which follows the old "always program against an interface" rule in the manner of many corporate development shops: 1: ICatalog catalog = new Catalog(); 2: Category[] categories = catalog.GetCategories(); In this example, I met the requirement of the rule by declaring the variable as ICatalog, but I didn't hit "it doesn't tie you to a specific implementation" because I explicitly created an instance of the concrete Catalog object. If I want to test the functionality of the code I just wrote I have to have an environment in which Catalog can be created along with any of the resources upon which it depends (e.g. configuration files, database connections, etc) in order to test my functionality. That's a lot of setup work and one of the things that I think ultimately discourages real buy-in of unit testing in many development shops. So how do I test my code without needing Catalog to work? A very primitive approach I've seen is to change the line the instantiates catalog to read: 1: ICatalog catalog = new FakeCatalog(); once the test is run and passes, the code is switched back to the real thing. This obviously poses a huge risk for introducing test code into production and in my opinion is worse than just keeping the dependency and its associated setup work. Another popular approach is to make use of Factory methods which use an object whose "job" is to know how to obtain a valid instance of the object. Using this approach, the code may look something like this: 1: ICatalog catalog = CatalogFactory.GetCatalog(); The code inside the factory is responsible for deciding "what kind" of catalog is needed. This is a far better approach than the previous one, but it does make projects grow considerably because now in addition to the interface, the real implementation, and the fake implementation(s) for testing you have added a minimum of one factory (or at least a factory method) for each of your interfaces. Once again, developers say "that's too complicated and has me writing a bunch of useless code" and quietly slip back into just creating a new Catalog and chalking any test failures up to "it will probably work on the server". This is where software intended specifically to facilitate Inversion of Control comes into play. There are many libraries that take on the Inversion of Control responsibilities in .Net and most of them have many pros and cons. From this point forward I'll discuss concepts from the standpoint of the Unity framework produced by Microsoft's Patterns and Practices team. I'm primarily focusing on this library because it questions about it inspired this posting. At Unity's core and that of most any IoC framework is a catalog or registry of components. This registry can be configured either through code or using the application's configuration file and in the most simple terms says "interface X maps to concrete implementation Y". It can get much more complicated, but I want to keep things at the "what does it do" level instead of "how does it do it". The object that exposes most of the Unity functionality is the UnityContainer. This object exposes methods to configure the catalog as well as the Resolve<T> method which is used to obtain an instance of the type represented by T. When using the Resolve<T> method, Unity does not necessarily have to just "new up" the requested object, but also can track dependencies of that object and ensure that the entire dependency chain is satisfied. There are three basic ways that I have seen Unity used within projects. Those are through classes directly using the Unity container, classes requiring injection of dependencies, and classes making use of the Service Locator pattern. The first usage of Unity is when classes are aware of the Unity container and directly call its Resolve method whenever they need the services advertised by an interface. The up side of this approach is that IoC is utilized, but the down side is that every class has to be aware that Unity is being used and tied directly to that implementation. Many developers don't like the idea of as close a tie to specific IoC implementation as is represented by using Unity within all of your classes and for the most part I agree that this isn't a good idea. As an alternative, classes can be designed for Dependency Injection. Dependency Injection is where a force outside the class itself manipulates the object to provide implementations of the interfaces that the class needs to interact with the outside world. This is typically done either through constructor injection where the object has a constructor that accepts an instance of each interface it requires or through property setters accepting the service providers. When using dependency, I lean toward the use of constructor injection because I view the constructor as being a much better way to "discover" what is required for the instance to be ready for use. During resolution, Unity looks for an injection constructor and will attempt to resolve instances of each interface required by the constructor, throwing an exception of unable to meet the advertised needs of the class. The up side of this approach is that the needs of the class are very clearly advertised and the class is unaware of which IoC container (if any) is being used. The down side of this approach is that you're required to maintain the objects passed to the constructor as instance variables throughout the life of your object and that objects which coordinate with many external services require a lot of additional constructor arguments (this gets ugly and may indicate a need for refactoring). The final way that I've seen and used Unity is to make use of the ServiceLocator pattern, of which the Patterns and Practices team has also provided a Unity-compatible implementation. When using the ServiceLocator, your class calls ServiceLocator.Retrieve in places where it would have called Resolve on the Unity container. Like using Unity directly, it does tie you directly to the ServiceLocator implementation and makes your code aware that dependency injection is taking place, but it does have the up side of giving you the freedom to swap out the underlying IoC container if necessary. I'm not hugely concerned with hiding IoC entirely from the class (I view this as a "nice to have"), so the single biggest problem that I see with the ServiceLocator approach is that it provides no way to proactively advertise needs in the way that constructor injection does, allowing more opportunity for difficult to track runtime errors. This blog entry has not been intended in any way to be a definitive work on IoC, but rather as something to spur thought about why we program to interfaces and some ways to reach the intended value of the practice instead of having it just complicate your code. I hope that it helps somebody begin or continue a journey away from being a "Cargo Cult Programmer".

Read the article
WIF, ADFS 2 and WCF–Part 2: The Service

- by Your DisplayName here!

OK – so let’s first start with a simple WCF service and connect that to ADFS 2 for authentication. The service itself simply echoes back the user’s claims – just so we can make sure it actually works and to see how the ADFS 2 issuance rules emit claims for the service: [ServiceContract(Namespace = "urn:leastprivilege:samples")] public interface IService { [OperationContract] List<ViewClaim> GetClaims(); } public class Service : IService { public List<ViewClaim> GetClaims() { var id = Thread.CurrentPrincipal.Identity as IClaimsIdentity; return (from c in id.Claims select new ViewClaim { ClaimType = c.ClaimType, Value = c.Value, Issuer = c.Issuer, OriginalIssuer = c.OriginalIssuer }).ToList(); } } The ViewClaim data contract is simply a DTO that holds the claim information. Next is the WCF configuration – let’s have a look step by step. First I mapped all my http based services to the federation binding. This is achieved by using .NET 4.0’s protocol mapping feature (this can be also done the 3.x way – but in that scenario all services will be federated): <protocolMapping> <add scheme="http" binding="ws2007FederationHttpBinding" /> </protocolMapping> Next, I provide a standard configuration for the federation binding: <bindings> <ws2007FederationHttpBinding> <binding> <security mode="TransportWithMessageCredential"> <message establishSecurityContext="false"> <issuerMetadata address="https://server/adfs/services/trust/mex" /> </message> </security> </binding> </ws2007FederationHttpBinding> </bindings> This binding points to our ADFS 2 installation metadata endpoint. This is all that is needed for svcutil (aka “Add Service Reference”) to generate the required client configuration. I also chose mixed mode security (SSL + basic message credential) for best performance. This binding also disables session – you can control that via the establishSecurityContext setting on the binding. This has its pros and cons. Something for a separate blog post, I guess. Next, the behavior section adds support for metadata and WIF: <behaviors> <serviceBehaviors> <behavior> <serviceMetadata httpsGetEnabled="true" /> <federatedServiceHostConfiguration /> </behavior> </serviceBehaviors> </behaviors> The next step is to add the WIF specific configuration (in <microsoft.identityModel />). First we need to specify the key material that we will use to decrypt the incoming tokens. This is optional for web applications but for web services you need to protect the proof key – so this is mandatory (at least for symmetric proof keys, which is the default): <serviceCertificate> <certificateReference storeLocation="LocalMachine" storeName="My" x509FindType="FindBySubjectDistinguishedName" findValue="CN=Service" /> </serviceCertificate> You also have to specify which incoming tokens you trust. This is accomplished by registering the thumbprint of the signing keys you want to accept. You get this information from the signing certificate configured in ADFS 2: <issuerNameRegistry type="...ConfigurationBasedIssuerNameRegistry"> <trustedIssuers> <add thumbprint="d1 … db" name="ADFS" /> </trustedIssuers> </issuerNameRegistry> The last step (promised) is to add the allowed audience URIs to the configuration – WCF clients use (by default – and we’ll come back to this) the endpoint address of the service: <audienceUris> <add value="https://machine/soapadfs/service.svc" /> </audienceUris> OK – that’s it – now we have a basic WCF service that uses ADFS 2 for authentication. The next step will be to set-up ADFS to issue tokens for this service. Afterwards we can explore various options on how to use this service from a client. Stay tuned… (if you want to have a look at the full source code or peek at the upcoming parts – you can download the complete solution here)

Read the article
The True Cost of a Solution

- by D'Arcy Lussier

I had a Twitter chat recently with someone suggesting Oracle and SQL Server were losing out to OSS (Open Source Software) in the enterprise due to their issues with scaling or being too generic (one size fits all). I challenged that a bit, as my experience with enterprise sized clients has been different – adverse to OSS but receptive to an established vendor. The response I got was: Found it easier to influence change by showing how X can’t solve our problems or X is extremely costly to scale. Money talks. I think this is definitely the right approach for anyone pitching an alternate or alien technology as part of a solution: identify the issue, identify the solution, then present pros and cons including a cost/benefit analysis. What can happen though is we get tunnel vision and don’t present a full view of the costs associated with a solution. An “Acura”te Example (I’m so clever…) This is my dream vehicle, a Crystal Black Pearl coloured Acura MDX with the SH-AWD package! We’re a family of 4 (5 if my daughters ever get their wish of adding a dog), and I’ve always wanted a luxury type of vehicle, so this is a perfect replacement in a few years when our Rav 4 has hit the 8 – 10 year mark. MSRP – $62,890 But as we all know, that’s not *really* the cost of the vehicle. There’s taxes and fees added on, there’s the extended warranty if I choose to purchase it, there’s the finance rate that needs to be factored in… MSRP – $62,890 Taxes – $7,546 Warranty - $2,500 SubTotal – $72,936 Finance Charge – $ 1094.04 Grand Total – $74,030 Well! Glad we did that exercise – we discovered an extra $11k added on to the MSRP! Well now we have our true price…or do we? Lifetime of the Vehicle I’m expecting to have this vehicle for 7 – 10 years. While the hard cost of the vehicle is known and dealt with, the costs to run and maintain the vehicle are on top of this. I did some research, and here’s what I’ve found: Fuel and Mileage Gas prices are high as it is for regular fuel, but getting into an MDX will require that I *only* purchase premium fuel, which comes at a premium price. I need to expect my bill at the pump to be higher. Comparing the MDX to my 2007 Rav4 also shows I’ll be gassing up more often. The Rav4 has a city MPG of 21, while the MDX plummets to 16! The MDX does have a bigger fuel tank though, so all in all the number of times I hit the pumps might even out. Still, I estimate I’ll be spending approximately $8000 – $10000 more on gas over a 10 year period than my current Rav4. Service Options Limited Although I have options with my Toyota here in Winnipeg (we have 4 Toyota dealerships), I do go to my original dealer for any service work. Still, I like the fact that I have options. However, there’s only one Acura dealership in all of Winnipeg! So if, for whatever reason, I’m not satisfied with the level of service I’m stuck. Non Warranty Service Work Also let’s not forget that there’s a bulk of work required every year that is *not* covered under warranty – oil changes, tire rotations, brake pads, etc. I expect I’ll need to get new tires at the 5 years mark as well, which can easily be $1200 – $1500 (I just paid $1000 for new tires for the Rav4 and we’re at the 5 year mark). Now these aren’t going to be *new* costs that I’m not used to from our existing vehicles, but they should still be factored in. I’d budget $500/year, or $5000 over the 10 years I’ll own the vehicle. Final Assessment So let’s re-assess the true cost of my dream MDX: MSRP $62,890 Taxes $7,546 Warranty $2,500 Finance Charge $1094 Gas $10,000 Service Work $5000 Grand Total $89,030 So now I have a better idea of 10 year cost overall, and I’ve identified some concerns with local service availability. And there’s now much more to consider over the original $62,890 price tag. Tying This Back to Technology Solutions The process that we just went through is no different than what organizations do when considering implementing a new system, technology, or technology based solution, within their environments. It’s easy to tout the short term cost savings of particular product/platform/technology in a vacuum. But its when you consider the wider impact that the true cost comes into play. Let’s create a scenario: A company is not happy with its current data reporting suite. An employee suggests moving to an open source solution. The selling points are: - Because its open source its free - The organization would have access to the source code so they could alter it however they wished - It provided features not available with the current reporting suite At first this sounds great to the management and executive, but then they start asking some questions and uncover more information: - The OSS product is built on a technology not used anywhere within the organization - There are no vendors offering product support for the OSS product - The OSS product requires a specific server platform to operate on, one that’s not standard in the organization All of a sudden, the true cost of implementing this solution is starting to become clearer. The company might save money on licensing costs, but their training costs would increase significantly – developers would need to learn how to develop in the technology the OSS solution was built on, IT staff must learn how to set up and maintain a new server platform within their existing infrastructure, and if a problem was found there was no vendor to contact for support. The true cost of implementing a “free” OSS solution is actually spinning up a project to implement it within the organization – no small cost. And that’s just the short-term cost. Now the organization must ensure they maintain trained staff who can make changes to the OSS reporting solution and IT staff that will stay knowledgeable in the new server platform. If those skills are very niche, then higher labour costs could be incurred if those people are hard to find or if trained employees use that knowledge as leverage for higher pay. Maybe a vendor exists that will contract out support, but then there are those costs to consider as well. And let’s not forget end-user training – in our example, anyone that runs reports will need to be trained on how to use the new system. Here’s the Point We still tend to look at software in an “off the shelf” kind of way. It’s very easy to say “oh, this product is better than vendor x’s product – and its free because its OSS!” but the reality is that implementing any new technology within an organization has a cost regardless of the retail price of the product. Training, integration, support – these are real costs that impact an organization and span multiple departments. Whether you’re pitching an improved business process, a new system, or a new technology, you need to consider the bigger picture costs of implementation. What you define as success (in our example, having better reporting functionality) might not be what others define as success if implementing your solution causes them issues. A true enterprise solution needs to consider the entire enterprise.

Read the article
Testing Workflows – Test-First

- by Timothy Klenke

Originally posted on: http://geekswithblogs.net/TimothyK/archive/2014/05/30/testing-workflows-ndash-test-first.aspxThis is the second of two posts on some common strategies for approaching the job of writing tests. The previous post covered test-after workflows where as this will focus on test-first. Each workflow presented is a method of attack for adding tests to a project. The more tools in your tool belt the better. So here is a partial list of some test-first methodologies. Ping Pong Ping Pong is a methodology commonly used in pair programing. One developer will write a new failing test. Then they hand the keyboard to their partner. The partner writes the production code to get the test passing. The partner then writes the next test before passing the keyboard back to the original developer. The reasoning behind this testing methodology is to facilitate pair programming. That is to say that this testing methodology shares all the benefits of pair programming, including ensuring multiple team members are familiar with the code base (i.e. low bus number). Test Blazer Test Blazing, in some respects, is also a pairing strategy. The developers don’t work side by side on the same task at the same time. Instead one developer is dedicated to writing tests at their own desk. They write failing test after failing test, never touching the production code. With these tests they are defining the specification for the system. The developer most familiar with the specifications would be assigned this task. The next day or later in the same day another developer fetches the latest test suite. Their job is to write the production code to get those tests passing. Once all the tests pass they fetch from source control the latest version of the test project to get the newer tests. This methodology has some of the benefits of pair programming, namely lowering the bus number. This can be good way adding an extra developer to a project without slowing it down too much. The production coder isn’t slowed down writing tests. The tests are in another project from the production code, so there shouldn’t be any merge conflicts despite two developers working on the same solution. This methodology is also a good test for the tests. Can another developer figure out what system should do just by reading the tests? This question will be answered as the production coder works there way through the test blazer’s tests. Test Driven Development (TDD) TDD is a highly disciplined practice that calls for a new test and an new production code to be written every few minutes. There are strict rules for when you should be writing test or production code. You start by writing a failing (red) test, then write the simplest production code possible to get the code working (green), then you clean up the code (refactor). This is known as the red-green-refactor cycle. The goal of TDD isn’t the creation of a suite of tests, however that is an advantageous side effect. The real goal of TDD is to follow a practice that yields a better design. The practice is meant to push the design toward small, decoupled, modularized components. This is generally considered a better design that large, highly coupled ball of mud. TDD accomplishes this through the refactoring cycle. Refactoring is only possible to do safely when tests are in place. In order to use TDD developers must be trained in how to look for and repair code smells in the system. Through repairing these sections of smelly code (i.e. a refactoring) the design of the system emerges. For further information on TDD, I highly recommend the series “Is TDD Dead?”. It discusses its pros and cons and when it is best used. Acceptance Test Driven Development (ATDD) Whereas TDD focuses on small unit tests that concentrate on a small piece of the system, Acceptance Tests focuses on the larger integrated environment. Acceptance Tests usually correspond to user stories, which come directly from the customer. The unit tests focus on the inputs and outputs of smaller parts of the system, which are too low level to be of interest to the customer. ATDD generally uses the same tools as TDD. However, ATDD uses fewer mocks and test doubles than TDD. ATDD often complements TDD; they aren’t competing methods. A full test suite will usually consist of a large number of unit (created via TDD) tests and a smaller number of acceptance tests. Behaviour Driven Development (BDD) BDD is more about audience than workflow. BDD pushes the testing realm out towards the client. Developers, managers and the client all work together to define the tests. Typically different tooling is used for BDD than acceptance and unit testing. This is done because the audience is not just developers. Tools using the Gherkin family of languages allow for test scenarios to be described in an English format. Other tools such as MSpec or FitNesse also strive for highly readable behaviour driven test suites. Because these tests are public facing (viewable by people outside the development team), the terminology usually changes. You can’t get away with the same technobabble you can with unit tests written in a programming language that only developers understand. For starters, they usually aren’t called tests. Usually they’re called “examples”, “behaviours”, “scenarios”, or “specifications”. This may seem like a very subtle difference, but I’ve seen this small terminology change have a huge impact on the acceptance of the process. Many people have a bias that testing is something that comes at the end of a project. When you say we need to define the tests at the start of the project many people will immediately give that a lower priority on the project schedule. But if you say we need to define the specification or behaviour of the system before we can start, you’ll get more cooperation. Keep these test-first and test-after workflows in your tool belt. With them you’ll be able to find new opportunities to apply them.

Read the article
ANTS CLR and Memory Profiler In Depth Review (Part 2 of 2 – Memory Profiler)

- by ToStringTheory

One of the things that people might not know about me, is my obsession to make my code as efficient as possible. Many people might not realize how much of a task or undertaking that this might be, but it is surely a task as monumental as climbing Mount Everest, except this time it is a challenge for the mind… In trying to make code efficient, there are many different factors that play a part – size of project or solution, tiers, language used, experience and training of the programmer, technologies used, maintainability of the code – the list can go on for quite some time. I spend quite a bit of time when developing trying to determine what is the best way to implement a feature to accomplish the efficiency that I look to achieve. One program that I have recently come to learn about – Red Gate ANTS Performance (CLR) and Memory profiler gives me tools to accomplish that job more efficiently as well. In this review, I am going to cover some of the features of the ANTS memory profiler set by compiling some hideous example code to test against. Notice As a member of the Geeks With Blogs Influencers program, one of the perks is the ability to review products, in exchange for a free license to the program. I have not let this affect my opinions of the product in any way, and Red Gate nor Geeks With Blogs has tried to influence my opinion regarding this product in any way. Introduction – Part 2 In my last post, I reviewed the feature packed Red Gate ANTS Performance Profiler. Separate from the Red Gate Performance Profiler is the Red Gate ANTS Memory Profiler – a simple, easy to use utility for checking how your application is handling memory management… A tool that I wish I had had many times in the past. This post will be focusing on the ANTS Memory Profiler and its tool set. The memory profiler has a large assortment of features just like the Performance Profiler, with the new session looking nearly exactly alike: ANTS Memory Profiler Memory profiling is not something that I have to do very often… In the past, the few cases I’ve had to find a memory leak in an application I have usually just had to trace the code of the operations being performed to look for oddities… Sadly, I have come across more undisposed/non-using’ed IDisposable objects, usually from ADO.Net than I would like to ever see. Support is not fun, however using ANTS Memory Profiler makes this task easier. For this round of testing, I am going to use the same code from my previous example, using the WPF application. This time, I will choose the ‘Profile Memory’ option from the ANTS menu in Visual Studio, which launches the solution in its currently configured state/start-up project, and then launches the ANTS Memory Profiler to help. It prepopulates all of the fields with the current project information, and all I have to do is select the ‘Start Profiling’ option. When the window comes up, it is actually quite barren, just giving ideas on how to work the profiler. You start by getting to the point in your application that you want to profile, and then taking a ‘Memory Snapshot’. This performs a full garbage collection, and snapshots the managed heap. Using the same WPF app as before, I will go ahead and take a snapshot now. As you can see, ANTS is already giving me lots of information regarding the snapshot, however this is just a snapshot. The whole point of the profiler is to perform an action, usually one where a memory problem is being noticed, and then take another snapshot and perform a diff between them to see what has changed. I am going to go ahead and generate 5000 primes, and then take another snapshot: As you can see, ANTS is already giving me a lot of new information about this snapshot compared to the last. Information such as difference in memory usage, fragmentation, class usage, etc… If you take more snapshots, you can use the dropdown at the top to set your actual comparison snapshots. If you beneath the timeline, you will see a breadcrumb trail showing how best to approach profiling memory using ANTS. When you first do the comparison, you start on the Summary screen. You can either use the charts at the bottom, or switch to the class list screen to get to the next step. Here is the class list screen: As you can see, it lists information about all of the instances between the snapshots, as well as at the bottom giving you a way to filter by telling ANTS what your problem is. I am going to go ahead and select the Int16[] to look at the Instance Categorizer Using the instance categorizer, you can travel backwards to see where all of the instances are coming from. It may be hard to see in this image, but hopefully the lightbox (click on it) will help: I can see that all of these instances are rooted to the application through the UI TextBlock control. This image will probably be even harder to see, however using the ‘Instance Retention Graph’, you can trace an objects memory inheritance up the chain to see its roots as well. This is a simple example, as this is simply a known element. Usually you would be profiling an actual problem, and comparing those differences. I know in the past, I have spotted a problem where a new context was created per page load, and it was rooted into the application through an event. As the application began to grow, performance and reliability problems started to emerge. A tool like this would have been a great way to identify the problem quickly. Overview Overall, I think that the Red Gate ANTS Memory Profiler is a great utility for debugging those pesky leaks. 3 Biggest Pros: Easy to use interface with lots of options for configuring profiling session Intuitive and helpful interface for drilling down from summary, to instance, to root graphs ANTS provides an API for controlling the profiler. Not many options, but still helpful. 2 Biggest Cons: Inability to automatically snapshot the memory by interval Lack of complete integration with Visual Studio via an extension panel Ratings Ease of Use (9/10) – I really do believe that they have brought simplicity to the once difficult task of memory profiling. I especially liked how it stepped you further into the drilldown by directing you towards the best options. Effectiveness (10/10) – I believe that the profiler does EXACTLY what it purports to do. Features (7/10) – A really great set of features all around in the application, however, I would like to see some ability for automatically triggering snapshots based on intervals or framework level items such as events. Customer Service (10/10) – My entire experience with Red Gate personnel has been nothing but good. their people are friendly, helpful, and happy! UI / UX (9/10) – The interface is very easy to get around, and all of the options are easy to find. With a little bit of poking around, you’ll be optimizing Hello World in no time flat! Overall (9/10) – Overall, I am happy with the Memory Profiler and its features, as well as with the service I received when working with the Red Gate personnel. Thank you for reading up to here, or skipping ahead – I told you it would be shorter! Please, if you do try the product, drop me a message and let me know what you think! I would love to hear any opinions you may have on the product. Code Feel free to download the code I used above – download via DropBox

Read the article
Rosewill RSV-S5 and it's transferespeeds

- by DoomStone

I have just bought a Rosewill RSV-S5, I have installed 5x1,5Tb Western Digital Green disks in it. After that have I created a Raid5 on them all with the software that followed with the hardware. Not the raid it self works fine, but it is SLOW, I can only obtain a maximum of 25 MB/s, and if SABnzbd+ is downloading with 5 MB/s is it having a hard time streaming a normal DIVX (700 mb) movie. Is this normal or is there something wrong? Edit: should be able to handle 3 Gbps = 384 megabytes / second Edit 2: As you can see am I only downloading with 3,76 MB/s and I'm trying to watch V s02e08 (720p), but it is completely unwatchable, as I can see 30 sec, and the it buffers for 20 sec. Edit: Other information there might be required I'm running Windows Server 2008 R2, optimized for program performance. Windows is installed on a 60GB SSD. I have a 50 Mb/s internet connection and a 1 Gb/s LAN, all connected with Cat6 Ethernet cables. The MCE is using a Gigabyte EP35C-DS3R motherboard with 2 GB DDR2 ram. Edit 3: I have used chunk sizes for 128 KB Edit 4: I found this on newegg Pros: Enclosure for 5x2TB hard drive is fine. This is basically a rebranded San Digital TR5M-B product. For support Rosewill tells you to contact San Digital. No direct support from Silicon Image for the computer raid card. Cons: Includes computer Silicon Image 3132 raid card, extremely slow raid 5 write (our tests ~10MB/s). Compare to regular internal local drive write 30-60MB/s. We basically dumped the Sil3132 card and replaced with High Point RocketRaid 622 card for extra $69.99. Note for RR622, turn off ECRC (end to end CRC check) for card to work on IBM xserver. What took 12hrs to copy now took 2-3hrs. San Digital realized the problem and has the newer model TR5M-BP TowerRaid Plus that comes with High Point RocketRaid 622 card. Rosewill should discontinue this product and go with TR5M-BP. Could not get Silicon Image raid management software to work with complicated 2008R2 server with 10 NICs, application doesn't know how to talk to localhost port with all those NICs. No updates from Silicon Image and support from San Digital ignored. Gave up on Sil3132 card. Save yourself from a lot of headaches, get the RR622 card too if you are going to buy this product. Other Thoughts: The newer model is TR5M-BP TowerRaid Plus, comes with High Point RocketRaid 622 raid card for the PC instead of Silicon Image Sil3132. According to San Digital, raid 5 performance for Sil3132 read 80MB/s write 19MB/s, and RR622 read 154MB/s write 149MB/s. Our RR622 tests gave (8TB raid 5) write ~80-110MB/s copying 40GB file took 8mins. So I have now ordered a HighPoint RocketRAID 622 2P ext SATA III and hopes that it will solve my problems.

Read the article
What's up with LDoms: Part 1 - Introduction & Basic Concepts

- by Stefan Hinker

LDoms - the correct name is Oracle VM Server for SPARC - have been around for quite a while now. But to my surprise, I get more and more requests to explain how they work or to give advise on how to make good use of them. This made me think that writing up a few articles discussing the different features would be a good idea. Now - I don't intend to rewrite the LDoms Admin Guide or to copy and reformat the (hopefully) well known "Beginners Guide to LDoms" by Tony Shoumack from 2007. Those documents are very recommendable - especially the Beginners Guide, although based on LDoms 1.0, is still a good place to begin with. However, LDoms have come a long way since then, and I hope to contribute to their adoption by discussing how they work and what features there are today. In this and the following posts, I will use the term "LDoms" as a common abbreviation for Oracle VM Server for SPARC, just because it's a lot shorter and easier to type (and presumably, read). So, just to get everyone on the same baseline, lets briefly discuss the basic concepts of virtualization with LDoms. LDoms make use of a hypervisor as a layer of abstraction between real, physical hardware and virtual hardware. This virtual hardware is then used to create a number of guest systems which each behave very similar to a system running on bare metal: Each has its own OBP, each will install its own copy of the Solaris OS and each will see a certain amount of CPU, memory, disk and network resources available to it. Unlike some other type 1 hypervisors running on x86 hardware, the SPARC hypervisor is embedded in the system firmware and makes use both of supporting functions in the sun4v SPARC instruction set as well as the overall CPU architecture to fulfill its function. The CMT architecture of the supporting CPUs (T1 through T4) provide a large number of cores and threads to the OS. For example, the current T4 CPU has eight cores, each running 8 threads, for a total of 64 threads per socket. To the OS, this looks like 64 CPUs. The SPARC hypervisor, when creating guest systems, simply assigns a certain number of these threads exclusively to one guest, thus avoiding the overhead of having to schedule OS threads to CPUs, as do typical x86 hypervisors. The hypervisor only assigns CPUs and then steps aside. It is not involved in the actual work being dispatched from the OS to the CPU, all it does is maintain isolation between different guests. Likewise, memory is assigned exclusively to individual guests. Here, the hypervisor provides generic mappings between the physical hardware addresses and the guest's views on memory. Again, the hypervisor is not involved in the actual memory access, it only maintains isolation between guests. During the inital setup of a system with LDoms, you start with one special domain, called the Control Domain. Initially, this domain owns all the hardware available in the system, including all CPUs, all RAM and all IO resources. If you'd be running the system un-virtualized, this would be what you'd be working with. To allow for guests, you first resize this initial domain (also called a primary domain in LDoms speak), assigning it a small amount of CPU and memory. This frees up most of the available CPU and memory resources for guest domains. IO is a little more complex, but very straightforward. When LDoms 1.0 first came out, the only way to provide IO to guest systems was to create virtual disk and network services and attach guests to these services. In the meantime, several different ways to connect guest domains to IO have been developed, the most recent one being SR-IOV support for network devices released in version 2.2 of Oracle VM Server for SPARC. I will cover these more advanced features in detail later. For now, lets have a short look at the initial way IO was virtualized in LDoms: For virtualized IO, you create two services, one "Virtual Disk Service" or vds, and one "Virtual Switch" or vswitch. You can, of course, also create more of these, but that's more advanced than I want to cover in this introduction. These IO services now connect real, physical IO resources like a disk LUN or a networt port to the virtual devices that are assigned to guest domains. For disk IO, the normal case would be to connect a physical LUN (or some other storage option that I'll discuss later) to one specific guest. That guest would be assigned a virtual disk, which would appear to be just like a real LUN to the guest, while the IO is actually routed through the virtual disk service down to the physical device. For network, the vswitch acts very much like a real, physical ethernet switch - you connect one physical port to it for outside connectivity and define one or more connections per guest, just like you would plug cables between a real switch and a real system. For completeness, there is another service that provides console access to guest domains which mimics the behavior of serial terminal servers. The connections between the virtual devices on the guest's side and the virtual IO services in the primary domain are created by the hypervisor. It uses so called "Logical Domain Channels" or LDCs to create point-to-point connections between all of these devices and services. These LDCs work very similar to high speed serial connections and are configured automatically whenever the Control Domain adds or removes virtual IO. To see all this in action, now lets look at a first example. I will start with a newly installed machine and configure the control domain so that it's ready to create guest systems. In a first step, after we've installed the software, let's start the virtual console service and downsize the primary domain. root@sun # ldm list NAME STATE FLAGS CONS VCPU MEMORY UTIL UPTIME primary active -n-c-- UART 512 261632M 0.3% 2d 13h 58m root@sun # ldm add-vconscon port-range=5000-5100 \ primary-console primary root@sun # svcadm enable vntsd root@sun # svcs vntsd STATE STIME FMRI online 9:53:21 svc:/ldoms/vntsd:default root@sun # ldm set-vcpu 16 primary root@sun # ldm set-mau 1 primary root@sun # ldm start-reconf primary root@sun # ldm set-memory 7680m primary root@sun # ldm add-config initial root@sun # shutdown -y -g0 -i6 So what have I done: I've defined a range of ports (5000-5100) for the virtual network terminal service and then started that service. The vnts will later provide console connections to guest systems, very much like serial NTS's do in the physical world. Next, I assigned 16 vCPUs (on this platform, a T3-4, that's two cores) to the primary domain, freeing the rest up for future guest systems. I also assigned one MAU to this domain. A MAU is a crypto unit in the T3 CPU. These need to be explicitly assigned to domains, just like CPU or memory. (This is no longer the case with T4 systems, where crypto is always available everywhere.) Before I reassigned the memory, I started what's called a "delayed reconfiguration" session. That avoids actually doing the change right away, which would take a considerable amount of time in this case. Instead, I'll need to reboot once I'm all done. I've assigned 7680MB of RAM to the primary. That's 8GB less the 512MB which the hypervisor uses for it's own private purposes. You can, depending on your needs, work with less. I'll spend a dedicated article on sizing, discussing the pros and cons in detail. Finally, just before the reboot, I saved my work on the ILOM, to make this configuration available after a powercycle of the box. (It'll always be available after a simple reboot, but the ILOM needs to know the configuration of the hypervisor after a power-cycle, before the primary domain is booted.) Now, lets create a first disk service and a first virtual switch which is connected to the physical network device igb2. We will later use these to connect virtual disks and virtual network ports of our guest systems to real world storage and network. root@sun # ldm add-vds primary-vds root@sun # ldm add-vswitch net-dev=igb2 switch-primary primary You are free to choose whatever names you like for the virtual disk service and the virtual switch. I strongly recommend that you choose names that make sense to you and describe the function of each service in the context of your implementation. For the vswitch, for example, you could choose names like "admin-vswitch" or "production-network" etc. This already concludes the configuration of the control domain. We've freed up considerable amounts of CPU and RAM for guest systems and created the necessary infrastructure - console, vts and vswitch - so that guests systems can actually interact with the outside world. The system is now ready to create guests, which I'll describe in the next section. For further reading, here are some recommendable links: The LDoms 2.2 Admin Guide The "Beginners Guide to LDoms" The LDoms Information Center on MOS LDoms on OTN

Read the article
Craftsmanship is ALL that Matters

- by Wayne Molina

Today, I'm going to talk about a touchy subject: the notion of working in a company that doesn't use the prescribed "best practices" in its software development endeavours. Over the years I have, using a variety of pseudonyms, asked this question on popular programming forums. Although I always add in some minor variation of the story to avoid suspicion that it's the same person posting, the crux of the tale remains the same: A Programmer’s Tale A junior software developer has just started a new job at an average company, creating average line-of-business applications for internal use (the most typical scenario programmers find themselves in). This hypothetical newbie has spent a lot of time reading up on the "theory" of software development, devouring books, blogs and screencasts from well-known and respected software developers in the community in order to broaden his knowledge and "do what the pros do". He begins his new job, eager to apply what he's learned on a real-world project only to discover that his new teammates doesn't use any of those concepts and techniques. They hack their way through development, or in a best-case scenario use some homebrew, thrown-together semblance of a framework for their applications that follows not one of the best practices suggested by the “elite” in the software community - things like TDD (TDD as a "best practice" is the only subjective part of this post, but it's included here due to a very large following of respected developers who consider it one), the SOLID principles, well-known and venerable tools, even version control in a worst case and truly nightmarish scenario. Our protagonist is frustrated that he isn't doing things the "proper" way - a way he's spent personal time digesting and learning about and, more importantly, a way that some of the top developers in the industry advocate - and turns to a forum to ask the advice of his peers. Invariably the answer I, in the guise of the concerned newbie, will receive is that A) I don't know anything and should just shut my mouth and sling code the bad way like everybody else on the team, and B) These "best practices" are fade or a joke, and the only thing that matters is shipping software to your customers. I am here today to say that anyone who says this, or anything like it, is not only full of crap but indicative of exactly the type of “developer” that has helped to give our industry a bad name. Here is why: One Who Knows Nothing, Understands Nothing On one hand, you have the cognoscenti of the .NET development world. Guys like James Avery, Jeremy Miller, Ayende Rahien and Rob Conery; all well-respected and noted programmers that are pretty much our version of celebrities. These guys write blogs, books, and post videos outlining the "correct" way of writing software to make sure it not only works but is maintainable and extensible and a joy to work with. They tout the virtues of the SOLID principles, or of using TDD/BDD, or using a mature ORM like NHibernate, Subsonic or even Entity Framework. On the other hand, you have Joe Everyman, Lead Software Developer at Initrode Corporation - in our hypothetical story Joe is the junior developer's new boss. Joe's been with Initrode for 10 years, starting as the company’s very first programmer and over the years building up a little fiefdom of his own until at the present he’s in charge of all Initrode’s software development. Joe writes code the same way he always has, without bothering to learn much, if anything. He looked at NHibernate once and found it was "too hard", so he uses a primitive implementation of the TableDataGateway pattern as a wrapper around SqlClient.SqlConnection and SqlClient.SqlCommand instead of an actual ORM (or, in a better case scenario, has created his own ORM); the thought of using LINQ or Entity Framework or really anything other than his own hastily homebrew solution has never occurred to him. He doesn't understand TDD and considers “testing” to be using the .NET debugger to step through code, or simply loading up an app and entering some values to see if it works. He doesn't really understand SOLID, and he doesn't care to. He's worked as a programmer for years, and that's all that counts. Right? WRONG. Who would you rather trust? Someone with years of experience and who writes books, creates well-known software and is akin to a celebrity, or someone with no credibility outside their own minute environment who throws around their clout and company seniority as the "proof" of their ability? Joe Everyman may have years of experience at Initrode as a programmer, and says to do things "his way" but someone like Jeremy Miller or Ayende Rahien have years of experience at companies just like Initrode, THEY know ten times more than Joe Everyman knows or could ever hope to know, and THEY say to do things "this way". Here's another way of thinking about it: If you wanted to get into politics and needed advice on the best way to do it, would you rather listen to the mayor of Hicktown, USA or Barack Obama? One is a small-time nobody while the other is very well-known and, as such, would probably have much more accurate and beneficial advice. NOTE: The selection of Barack Obama as an example in no way, shape, or form suggests a political affiliation or political bent to this post or blog, and no political innuendo should be mistakenly read from it; the intent was merely to compare a small-time persona with a well-known persona in a non-software field. Feel free to replace the name "Barack Obama" with any well-known Congressman, Senator or US President of your choice. DIY Considered Harmful I will say right now that the homebrew development environment is the WORST one for an aspiring programmer, because it relies on nothing outside it's own little box - no useful skill outside of the small pond. If you are forced to use some half-baked, homebrew ORM created by your Director of Software, you are not learning anything valuable you can take with you in the future; now, if you plan to stay at Initrode for 10 years like Joe Everyman, this is fine and dandy. However if, like most of us, you want to advance your career outside a very narrow space you will do more harm than good by sticking it out in an environment where you, to be frank, know better than everybody else because you are aware of alternative and, in almost most cases, better tools for the job. A junior developer who understands why the SOLID principles are good to follow, or why TDD is beneficial, or who knows that it's better to use NHibernate/Subsonic/EF/LINQ/well-known ORM versus some in-house one knows better than a senior developer with 20 years experience who doesn't understand any of that, plain and simple. Anyone who disagrees is either a liar, or someone who, just like Joe Everyman, Lead Developer, relies on seniority and tenure rather than adapting their knowledge as things evolve. In many cases, the Joe Everymans of the world act this way out of fear - they cannot possibly fathom that a “junior” could know more than them; after all, they’ve spent 10 or more years in the same company, doing the same job, cranking out the same shoddy software. And here comes a newbie who hasn’t spent 10+ years doing the same things, with a fresh and often radical take on the craft, and Joe Everyman is afraid he might have to put some real effort into his career again instead of just pointing to his 10 years of service at Initrode as “proof” that he’s good, or that he might have to learn something new to improve; in most cases the problem is Joe Everyman, and by extension Initrode itself, has a mentality of just being “good enough”, and mediocrity is the rule of the day. A Thorn Bush is No Place for a Phoenix My advice is that if you work on a team where they don't use the best practices that some of the most famous developers in our field say is the "right" way to do things (and have legions of people who agree), and YOU are aware of these practices and can see why they work, then LEAVE the company. Find a company where they DO care about quality, and craftsmanship, otherwise you will never be happy. There is no point in "dumbing" yourself down to the level of your co-workers and slinging code without care to craftsmanship. In 95% of these situations there will be no point in bringing it to the attention of Joe Everyman because he won't listen; he might even get upset that someone is trying to "upstage" him and fire the newbie, and replace someone with loads of untapped potential with a drone that will just nod affirmatively and grind out the tasks assigned without question. Find a company that has people smart enough to listen to the "best and brightest", and be happy. Do not, I repeat, DO NOT waste away in a job working for ignorant people. At the end of the day software development IS a craft, and a level of craftsmanship is REQUIRED for any serious professional. When you have knowledgeable people with the credibility to back it up saying one thing, and small-time people who are, to put it bluntly, nobodies in the field saying and doing something totally different because they can't comprehend it, leave the nobodies to their own devices to fade into obscurity. Work for a company that uses REAL software engineering techniques and really cares about craftsmanship. The biggest issue affecting our career, and the reason software development has never been the respected, white-collar career it was meant to be, is because hacks and charlatans can pass themselves off as professional programmers without following a lick of good advice from programmers much better at the craft than they are. These modern day snake-oil salesmen entrench themselves in companies by hoodwinking non-technical businesspeople and customers with their shoddy wares, end up in senior/lead/executive positions, and push their lack of knowledge on everybody unfortunate enough to work with/for/under them, crushing any dissent or voices of reason and change under their tyrannical heel and leaving behind a trail of dismayed and, often, unemployed junior developers who were made examples of to keep up the facade and avoid the shadow of doubt being cast upon them. To sum this up another way: If you surround yourself with learned people, you will learn. Surround yourself with ignorant people who can't, as the saying goes, see the forest through the trees, and you'll learn nothing of any real value. There is more to software development than just writing code, and the end goal should not be just "shipping software", it should be shipping software that is extensible, maintainable, and above all else software whose creation has broadened your knowledge in some capacity, even if a minor one. An eager newbie who knows theory and thirsts for knowledge can easily be moulded and taught the advanced topics, but the same can't be said of someone who only cares about the finish line. This industry needs more people espousing the benefits of software craftsmanship and proper software engineering techniques, and less Joe Everymans who are unwilling to adapt or foster new ways of thinking. Conclusion - I Cast “Protection from Fire” I am fairly certain this post will spark some controversy and might even invite the flames. Please keep in mind these are opinions and nothing more. A little healthy rant and subsequent flamewar can be good for the soul once in a while. To paraphrase The Godfather: It helps to get rid of the bad blood.

Read the article
C#/.NET Little Wonders: The Predicate, Comparison, and Converter Generic Delegates

- by James Michael Hare

Once again, in this series of posts I look at the parts of the .NET Framework that may seem trivial, but can help improve your code by making it easier to write and maintain. The index of all my past little wonders posts can be found here. In the last three weeks, we examined the Action family of delegates (and delegates in general), the Func family of delegates, and the EventHandler family of delegates and how they can be used to support generic, reusable algorithms and classes. This week I will be completing my series on the generic delegates in the .NET Framework with a discussion of three more, somewhat less used, generic delegates: Predicate<T>, Comparison<T>, and Converter<TInput, TOutput>. These are older generic delegates that were introduced in .NET 2.0, mostly for use in the Array and List<T> classes. Though older, it’s good to have an understanding of them and their intended purpose. In addition, you can feel free to use them yourself, though obviously you can also use the equivalents from the Func family of delegates instead. Predicate<T> – delegate for determining matches The Predicate<T> delegate was a very early delegate developed in the .NET 2.0 Framework to determine if an item was a match for some condition in a List<T> or T[]. The methods that tend to use the Predicate<T> include: Find(), FindAll(), FindLast() Uses the Predicate<T> delegate to finds items, in a list/array of type T, that matches the given predicate. FindIndex(), FindLastIndex() Uses the Predicate<T> delegate to find the index of an item, of in a list/array of type T, that matches the given predicate. The signature of the Predicate<T> delegate (ignoring variance for the moment) is: 1: public delegate bool Predicate<T>(T obj); So, this is a delegate type that supports any method taking an item of type T and returning bool. In addition, there is a semantic understanding that this predicate is supposed to be examining the item supplied to see if it matches a given criteria. 1: // finds first even number (2) 2: var firstEven = Array.Find(numbers, n => (n % 2) == 0); 3: 4: // finds all odd numbers (1, 3, 5, 7, 9) 5: var allEvens = Array.FindAll(numbers, n => (n % 2) == 1); 6: 7: // find index of first multiple of 5 (4) 8: var firstFiveMultiplePos = Array.FindIndex(numbers, n => (n % 5) == 0); This delegate has typically been succeeded in LINQ by the more general Func family, so that Predicate<T> and Func<T, bool> are logically identical. Strictly speaking, though, they are different types, so a delegate reference of type Predicate<T> cannot be directly assigned to a delegate reference of type Func<T, bool>, though the same method can be assigned to both. 1: // SUCCESS: the same lambda can be assigned to either 2: Predicate<DateTime> isSameDayPred = dt => dt.Date == DateTime.Today; 3: Func<DateTime, bool> isSameDayFunc = dt => dt.Date == DateTime.Today; 4: 5: // ERROR: once they are assigned to a delegate type, they are strongly 6: // typed and cannot be directly assigned to other delegate types. 7: isSameDayPred = isSameDayFunc; When you assign a method to a delegate, all that is required is that the signature matches. This is why the same method can be assigned to either delegate type since their signatures are the same. However, once the method has been assigned to a delegate type, it is now a strongly-typed reference to that delegate type, and it cannot be assigned to a different delegate type (beyond the bounds of variance depending on Framework version, of course). Comparison<T> – delegate for determining order Just as the Predicate<T> generic delegate was birthed to give Array and List<T> the ability to perform type-safe matching, the Comparison<T> was birthed to give them the ability to perform type-safe ordering. The Comparison<T> is used in Array and List<T> for: Sort() A form of the Sort() method that takes a comparison delegate; this is an alternate way to custom sort a list/array from having to define custom IComparer<T> classes. The signature for the Comparison<T> delegate looks like (without variance): 1: public delegate int Comparison<T>(T lhs, T rhs); The goal of this delegate is to compare the left-hand-side to the right-hand-side and return a negative number if the lhs < rhs, zero if they are equal, and a positive number if the lhs > rhs. Generally speaking, null is considered to be the smallest value of any reference type, so null should always be less than non-null, and two null values should be considered equal. In most sort/ordering methods, you must specify an IComparer<T> if you want to do custom sorting/ordering. The Array and List<T> types, however, also allow for an alternative Comparison<T> delegate to be used instead, essentially, this lets you perform the custom sort without having to have the custom IComparer<T> class defined. It should be noted, however, that the LINQ OrderBy(), and ThenBy() family of methods do not support the Comparison<T> delegate (though one could easily add their own extension methods to create one, or create an IComparer() factory class that generates one from a Comparison<T>). So, given this delegate, we could use it to perform easy sorts on an Array or List<T> based on custom fields. Say for example we have a data class called Employee with some basic employee information: 1: public sealed class Employee 2: { 3: public string Name { get; set; } 4: public int Id { get; set; } 5: public double Salary { get; set; } 6: } And say we had a List<Employee> that contained data, such as: 1: var employees = new List<Employee> 2: { 3: new Employee { Name = "John Smith", Id = 2, Salary = 37000.0 }, 4: new Employee { Name = "Jane Doe", Id = 1, Salary = 57000.0 }, 5: new Employee { Name = "John Doe", Id = 5, Salary = 60000.0 }, 6: new Employee { Name = "Jane Smith", Id = 3, Salary = 59000.0 } 7: }; Now, using the Comparison<T> delegate form of Sort() on the List<Employee>, we can sort our list many ways: 1: // sort based on employee ID 2: employees.Sort((lhs, rhs) => Comparer<int>.Default.Compare(lhs.Id, rhs.Id)); 3: 4: // sort based on employee name 5: employees.Sort((lhs, rhs) => string.Compare(lhs.Name, rhs.Name)); 6: 7: // sort based on salary, descending (note switched lhs/rhs order for descending) 8: employees.Sort((lhs, rhs) => Comparer<double>.Default.Compare(rhs.Salary, lhs.Salary)); So again, you could use this older delegate, which has a lot of logical meaning to it’s name, or use a generic delegate such as Func<T, T, int> to implement the same sort of behavior. All this said, one of the reasons, in my opinion, that Comparison<T> isn’t used too often is that it tends to need complex lambdas, and the LINQ ability to order based on projections is much easier to use, though the Array and List<T> sorts tend to be more efficient if you want to perform in-place ordering. Converter<TInput, TOutput> – delegate to convert elements The Converter<TInput, TOutput> delegate is used by the Array and List<T> delegate to specify how to convert elements from an array/list of one type (TInput) to another type (TOutput). It is used in an array/list for: ConvertAll() Converts all elements from a List<TInput> / TInput[] to a new List<TOutput> / TOutput[]. The delegate signature for Converter<TInput, TOutput> is very straightforward (ignoring variance): 1: public delegate TOutput Converter<TInput, TOutput>(TInput input); So, this delegate’s job is to taken an input item (of type TInput) and convert it to a return result (of type TOutput). Again, this is logically equivalent to a newer Func delegate with a signature of Func<TInput, TOutput>. In fact, the latter is how the LINQ conversion methods are defined. So, we could use the ConvertAll() syntax to convert a List<T> or T[] to different types, such as: 1: // get a list of just employee IDs 2: var empIds = employees.ConvertAll(emp => emp.Id); 3: 4: // get a list of all emp salaries, as int instead of double: 5: var empSalaries = employees.ConvertAll(emp => (int)emp.Salary); Note that the expressions above are logically equivalent to using LINQ’s Select() method, which gives you a lot more power: 1: // get a list of just employee IDs 2: var empIds = employees.Select(emp => emp.Id).ToList(); 3: 4: // get a list of all emp salaries, as int instead of double: 5: var empSalaries = employees.Select(emp => (int)emp.Salary).ToList(); The only difference with using LINQ is that many of the methods (including Select()) are deferred execution, which means that often times they will not perform the conversion for an item until it is requested. This has both pros and cons in that you gain the benefit of not performing work until it is actually needed, but on the flip side if you want the results now, there is overhead in the behind-the-scenes work that support deferred execution (it’s supported by the yield return / yield break keywords in C# which define iterators that maintain current state information). In general, the new LINQ syntax is preferred, but the older Array and List<T> ConvertAll() methods are still around, as is the Converter<TInput, TOutput> delegate. Sidebar: Variance support update in .NET 4.0 Just like our descriptions of Func and Action, these three early generic delegates also support more variance in assignment as of .NET 4.0. Their new signatures are: 1: // comparison is contravariant on type being compared 2: public delegate int Comparison<in T>(T lhs, T rhs); 3: 4: // converter is contravariant on input and covariant on output 5: public delegate TOutput Contravariant<in TInput, out TOutput>(TInput input); 6: 7: // predicate is contravariant on input 8: public delegate bool Predicate<in T>(T obj); Thus these delegates can now be assigned to delegates allowing for contravariance (going to a more derived type) or covariance (going to a less derived type) based on whether the parameters are input or output, respectively. Summary Today, we wrapped up our generic delegates discussion by looking at three lesser-used delegates: Predicate<T>, Comparison<T>, and Converter<TInput, TOutput>. All three of these tend to be replaced by their more generic Func equivalents in LINQ, but that doesn’t mean you shouldn’t understand what they do or can’t use them for your own code, as they do contain semantic meanings in their names that sometimes get lost in the more generic Func name. Tweet Technorati Tags: C#,CSharp,.NET,Little Wonders,delegates,generics,Predicate,Converter,Comparison

Read the article
How I Work: A Cloud Developer's Workstation

- by BuckWoody

I've written here a little about how I work during the day, including things like using a stand-up desk (still doing that, by the way). Inspired by a Twitter conversation yesterday, I thought I might explain how I set up my computing environment. First, a couple of important points. I work in Cloud Computing, specifically (but not limited to) Windows Azure. Windows Azure has features to run a Virtual Machine (IaaS), run code without having to control a Virtual Machine (PaaS) and use databases, video streaming, Hadoop and more (a kind of SaaS for tech pros). As such, my designs run the gamut of on-premises, VM's in the Cloud, and software that I write for a platform. I focus on data primarily, meaning that I design a lot of systems that use an RDBMS (like SQL Server or Windows Azure Databases) or a NoSQL approach (MongoDB on Azure or large-scale Key-Value Pairs in Table storage) and even Hadoop and R, and also Cloud Numerics in F#. All that being said, those things inform my choices below. Hardware I have a Lenovo X220 tablet/laptop which I really like a great deal - it's a light, tough, extremely fast system. When I travel, that's the system I take. It has 8GB of RAM, and an SSD drive. I sometimes use that to develop or work at a client's site, on the road, or in the living room when I'm not in my home office. My main system is a GateWay DX430017 - I've maxed it out on RAM, and I have two 1TB drives in it. It's not only my workstation for work; I leave it on all the time and it streams our videos, music and books. I have about 3400 e-books, and I've just started using Calibre to stream the library. I run Windows 8 on it so I can set up Hyper-V images, since Windows Azure allows me to move regular Hyper-V disks back and forth to the Cloud. That's where all my "servers" are, when I have to use an IaaS approach. The reason I use a desktop-style system rather than a laptop only approach is that a good part of my job is setting up architectures to solve really big, complex problems. That means I have to simulate entire networks on-premises, along with the Hybrid Cloud approach I use a lot. I need a lot of disk space and memory for that, and I use two huge monitors on my stand-up desk. I could probably use 10 monitors if I had the room for them. Also, since it's our home system as well, I leave it on all the time and it doesn't travel. Software For the software for my systems, it's important to keep in mind that I not only write code, but I design databases, teach, present, and create Linux and other environments. Windows 8 - While the jury is out for me on the new interface, the context-sensitive search, integrated everything, and speed is just hands-down the right choice. I've evaluated a server OS, Linux, even an Apple, but I just am not as efficient on those as I am with Windows 8. Visual Studio Ultimate - I develop primarily in .NET (C# and F# mostly) and I use the Team Foundation Server in the cloud, and I'm asked to do everything from UI to Services, so I need everything. Windows Azure SDK, Windows Azure Training Kit - I need the first to set up my Azure PaaS coding, and the second has all the info I need for PaaS, IaaS and SaaS. This is primarily how I get paid. :) SQL Server Developer Edition - While I might install Oracle, MySQL and Postgres on my VM's, the "outside" environment is SQL Server for an RDBMS. I install the Developer Edition because it has the same features as Enterprise Edition, and comes with all the client tools and documentation. Microsoft Office - Even if I didn't work here, this is what I would use. I've just grown too accustomed to doing business this way to change, so my advice is always "use what works", and this does. The parts I use are: OneNote (and a Math Add-in) - I do almost everything - and I mean everything in OneNote. I can code, do high-end math, present, design, collaborate and more. All my notebooks are on my Skydrive. I can use them from any system, anywhere. If you take the time to learn this program, you'll be hooked. Excel with PowerPivot - Don't make that face. Excel is the world's database, and every Data Scientist I know - even the ones where I teach at the University of Washington - know it, use it, and love it. Outlook - Primary communications, CRM and contact tool. I have all of my social media hooked up to it, so when I get an e-mail from you, I see everything, see all the history we've had on e-mail, find you on a map and more. Lync - I was fine with LiveMeeting, although it has it's moments. For me, the Lync client is tres-awesome. I use this throughout my day, present on it, stay in contact with colleagues and the folks on the dev team (who wish I didn't have it) and more. PowerPoint - Once again, don't make that face. Whenever I see someone complaining about PowerPoint, I have 100% of the time found they don't know how to use it. If you suck at presenting or creating content, don't blame PowerPoint. Works great on my machine. :) Zoomit - Magnifier - On Windows 7 (and 8 as well) there's a built-in magnifier, but I install Zoomit out of habit. It enlarges the screen. If you don't use one of these tools (or their equivalent on some other OS) then you're presenting/teaching wrong, and you should stop presenting/teaching until you get them and learn how to show people what you can see on your tiny, tiny monitor. :) Cygwin - Unix for Windows. OK, that's not true, but it's mostly that. I grew up on mainframes and Unix (IBM and HP, thank you) and I can't imagine life without sed, awk, grep, vim, and bash. I also tend to take a lot of the "Science" and "Development" and "Database" packages in it as well. PuTTY - Speaking of Unix, when I need to connect to my Linux VM's in Windows Azure, I want to do it securely. This is the tool for that. Notepad++ - Somewhere between torturing myself in vim and luxuriating in OneNote is Notepad++. Everyone has a favorite text editor; this one is mine. Too many features to name, and it's free. Browsers - I install Chrome, Firefox and of course IE. I know it's in vogue to rant on IE, but I tend to think for myself a great deal, and I've had few (none) problems with it. The others I have for the haterz that make sites that won't run in IE. Visio - I've used a lot of design packages, but none have the extreme meta-data edit capabilities of Visio. I don't use this all the time - it can be rather heavy, but what it does it does really well. I also present this way when I'm not using PowerPoint. Yup, I just bring up Visio and diagram away as I'm chatting with clients. Depending on what we're covering, this can be the right tool for that. Tweetdeck - The AIR one, not that new disaster they came out with. I live on social media, since you, dear readers, are my cube-mates. When I get tired of you all, I close Tweetdeck. When I need help or someone needs help from me, or if I want to see a picture of a cat while I'm coding, I bring it up. It's up most all day and night. Windows Media Player - I listen to Trance or Classical when I code, and I find music managers overbearing and extra. I just use what comes in the box, and it works great for me. R - F# and Cloud Numerics now allows me to load in R libraries (yay!) and I use this for statistical work on big data loads. Microsoft Math - One of the most amazing, free, rich, amazing, awesome, amazing calculators out there. I get the 64-bit version for quick math conversions, plots and formula-checks. Python - I know, right? Who knew that the scientific community loved Python so much. But they do. I use 2.7; not as much runs with 3+. I also use IronPython in Visual Studio, or I edit in Notepad++ Camstudio recorder - Windows PSR - In much of my training, and all of my teaching at the UW, I need to show a process on a screen. Camstudio records screen and voice, and it's free. If I need to make static training, I use the Windows PSR tool that's built right in. It's ostensibly for problem duplication, but I use it to record for training. OK - your turn. Post a link to your blog entry below, and tell me how you set your system up.

Read the article
Is the Cloud ready for an Enterprise Java web application? Seeking a JEE hosting advice.

- by Jakub Holý

Greetings to all the smart people around here! I'd like to ask whether it is feasible or a good idea at all to deploy a Java enterprise web application to a Cloud such as Amazon EC2. More exactly, I'm looking for infrastructure options for an application that shall handle few hundred users with long but neither CPU nor memory intensive sessions. I'm considering dedicated servers, virtual private servers (VPSs) and EC2. I've noticed that there is a project called JBoss Cloud so people are working on enabling such a deployment, on the other hand it doesn't seem to be mature yet and I'm not sure that the cloud is ready for this kind of applications, which differs from the typical cloud-based applications like Twitter. Would you recommend to deploy it to the cloud? What are the pros and cons? The application is a Java EE 5 web application whose main function is to enable users to compose their own customized Product by combining the available Parts. It uses stateless and stateful session beans and JPA for persistence of entities to a RDBMS and fetches information about Parts from the company's inventory system via a web service. Aside of external users it's used also by few internal ones, who are authenticated against the company's LDAP. The application should handle around 300-400 concurrent users building their product and should be reasonably scalable and available though these qualities are only of a medium importance at this stage. I've proposed an architecture consisting of a firewall (FW) and load balancer supporting sticky sessions and https (in the Cloud this would be replaced with EC2's Elastic Load Balancing service and FW on the app. servers, in a physical architecture the load-balancer would be a HW), then two physical clustered application servers combined with web servers (so that if one fails, a user doesn't loose his/her long built product) and finally a database server. The DB server would need a slave backup instance that can replace the master instance if it fails. This should provide reasonable availability and fault tolerance and provide good scalability as long as a single RDBMS can keep with the load, which should be OK for quite a while because most of the operations are done in the memory using a stateful bean and only occasionally stored or retrieved from the DB and the amount of data is low too. A problematic part could be the dependency on the remote inventory system webservice but with good caching of its outputs in the application it should be OK too. Unfortunately I've only vague idea of the system resources (memory size, number and speed of CPUs/cores) that such an "average Java EE application" for few hundred users needs. My rough and mostly unfounded estimate based on actual Amazon offerings is that 1.7GB and a single, 2-core "modern CPU" with speed around 2.5GHz (the High-CPU Medium Instance) should be sufficient for any of the two application servers (since we can handle higher load by provisioning more of them). Alternatively I would consider using the Large instance (64b, 7.5GB RAM, 2 cores at 1GHz) So my question is whether such a deployment to the cloud is technically and financially feasible or whether dedicated/VPS servers would be a better option and whether there are some real-world experiences with something similar. Thank you very much! /Jakub Holy PS: I've found the JBoss EAP in a Cloud Case Study that shows that it is possible to deploy a real-world Java EE application to the EC2 cloud but unfortunately there're no details regarding topology, instance types, or anything :-(

Read the article
SQL Cruise Alaska 2011

- by Grant Fritchey

I had the extreme good fortune to get sent on the last SQL Cruise to Alaska. I love my job. In case you don't what this is, SQL Cruise is a trip on a cruise ship during which you get to attend classes while on the boat, learning all about SQL Server and related topics as well as network with the instructors and the other Cruisers. Frankly, it's amazing. Classes ran from Monday, 5/30, to Saturday, 6/4. The networking was constant, between classes, at night on cruise ship, out on excursions in Alaskan rainforests and while snorkeling in ocean waters. Here's a run down of the experience from my point of view. Because I couldn't travel out 2 days early, I missed the BBQ that occurred the day before the cruise when many of the Cruisers received their swag bags. Some of that swag came from Red Gate. I researched what was useful on a cruise like this and purchased small flashlights and binoculars for all the Cruisers. The flashlights were because, depending on your cabin, ships can be very dark. The binoculars were so that the cruisers could watch all the beautiful landscape as it flowed by. I would have liked to have been there when the bags were opened, but I heard from several people that they appreciated the gifts. Cruisers "In" the hot tub. Pictured: Marjory Woody, Michele Grondin, Kyle Brandt, Grant Fritchey, John Halunen Sunday I went to board the ship with my wife. We had a bit of an adventure because I messed up our documents. It all worked out and we got on board to meet up at the back of the boat at one of the outdoor bars with the other Cruisers, thanks to tweets letting everyone know where to go. That was the end of electronic coordination on the trip (connectivity in Alaska was horrible for everyone except AT&T). The Cruisers were a great bunch of people and it was a real honor to meet them and get to spend time with them. After everyone settled into their cabins, our very first activity was a contest, sponsored by Red Gate. The Cruisers, in an effort to get to know each other and the ship, were required to go all over taking various photographs, some of them hilarious. The winning team of three would all win prizes. Some of the significant others helped out and I tagged along with a team that tied for first but lost the coin toss. The winning team consisted of Christina Leo (blog|twitter), Ryan Malcom (twitter), Neil Hambly (blog|twitter). They then had to do math and identify the cabin with the lowest prime number, oh, and get a picture of it and be the first to get back up to the bar where we were waiting. Christina came in first and very happily carried home an Ipad2. Ryan won a 1TB portable hard drive and Neil won a wireless mouse (picture below, note my special SQL Server Central Friday Shirt. Thanks Steve (blog|twitter)). Winners: Christina Leo, Neil Hambly, Ryan Malcolm. Just Lucky: Grant Fritchey Monday morning classes started. Buck Woody (blog|twitter) was a special guest speaker on this cruise. His theme was "Three C's on the High Seas: Career, Communication and Cloud." The first session was all on Career. I'm not going to type out all my notes from the session, but let's just say, if you get the chance to hear Buck talk about how to manage your career, I suggest you attend. I have a ton of blog posts that I'll be putting together over the next several months (yes, months) both here and over on ScaryDBA. I also have a bunch of work I'm going to be doing to get my career performance bumped up a notch or two (and let's face it, that won't be easy). Later on Monday, Tim Ford (blog|twitter) did a session on DMOs. Specifically the session was on Tim's Period Table of DMOs that he has put together, and how to use some of the more interesting DMOs in your day to day job. It was a great session, packed with good information. Next, Brent Ozar (blog|twitter) did a session on how to monitor and guide SAN configuration for the DBA that doesn't have access to the SAN. That was some seriously useful information. Tuesday morning we only had a single class. Kendra Little (blog|twitter) taught us all about "No Lock for Yes Fun". It was all about the different transaction isolation levels and how they work. There is so often confusion in this area and Kendra does a great job in clarifying the information. Also, she tosses in her excellent drawings to liven up the presentation. Then it was excursion time in Juneau. My wife and I, along with several other Cruisers, took a hike up around the Mendenhall Glacier. It was absolutely beautiful weather and walking through the Alaskan rain forest was a treat. Our guide, Jason, was a great guy and it was a good day of hiking. Wednesday was an all day excursion in Skagway. My wife and I took the "Ghost and Good Time Girls" walking tour that ended up at a bar that used to be a brothel, the Red Onion. It was a great history of the town. We went back out and hit a few museums and exhibits. We also hiked up the side of the mountain to see the Dewey Lake and some great views of the town. Finally we hiked out to the far side of town to see the Gold Rush cemetery. Hiking done we went back to the boat and had a quiet dinner on our own. Thursday we cruised through Glacier Bay and saw at least four different glaciers including sitting next to the Marjory Glacier for about an hour. It was amazing. Then it got better. We went into class with Buck again, this time to talk about Communication. Again, I've got pages of notes that I'm going to be referring back to for some time to come. This was an excellent opportunity to learn. Snorkelers: Nicole Bertrand, Aaron Bertrand, Grant Fritchey, Neil Hambly, Christina Leo, John Robel, Yanni Robel, Tim Ford Friday we pulled into Ketchikan. A bunch of us went snorkeling. Yes, snorkeling. Yes, in Alaska. Yes, snorkeling in the ocean in Alaska. It was fantastic. They had us put on 7mm thick wet suits (an adventure all by itself) so it was basically warm the entire time we were in the water (except for the occasional squirt of cold water down my back). Before we got in the water a bald eagle flew up and landed about 15 feet in front of us, which was just an incredible event. Then our guide pointed out about 14 other eagles in the area, hanging out in the trees. Wow! The water was pretty clear and there was a ton of things to see. That was absolutely a blast. Back on the boat I presented a session called Execution Plans: The Deep Dive (note the nautical theme). It seemed to go over well and I had several good questions come out of the session that will lead to new blog posts. After I presented, it was Aaron Bertrand's (blog|twitter) turn. He did a session on "What's New in Denali" that provided a lot of great information. He was able to incorporate new things straight out of Tech-Ed, so this was expanded beyond his usual presentation. The man really knows what he's talking about and communicates it well. Saturday we were travelling so there was time for a bunch of classes. Jeremiah Peschka (blog|twitter) did a great overview of some of the NoSQL databases and what they should be used for. The session was called "The Database is Dead" but it was really about how there are specific uses for these databases that SQL Server doesn't fill, but also that these databases can't replace SQL Server in other areas. Again, good material. Brent Ozar presented again with a session on Defensive Indexing. It was an overview of how indexes work and a deep dive into how to apply them appropriately in your databases to better support access. A good session, as you would expect. Then we pulled into Victoria, BC, in Canada and had a nice dinner with several of the Cruisers, including Denny Cherry (blog|twitter). After that it was back to Seattle on Sunday. By the way, the Science Fiction Museum in Seattle isn't a Science Fiction Museum any more. I was very disappointed to discover this. Overall, it was a great experience. I'm extremely appreciative of Red Gate for sending me and for Tim, Brent, Kendra and Jeremiah for having me. The other Cruisers were all amazing people and it was an honor & privilege to meet them and spend time with them. While this was a seriously fun time, it was also a very serious training opportunity with solid information coming from seasoned industry pros.

Read the article
C#/.NET Little Wonders: Skip() and Take()

- by James Michael Hare

Once again, in this series of posts I look at the parts of the .NET Framework that may seem trivial, but can help improve your code by making it easier to write and maintain. The index of all my past little wonders posts can be found here. I’ve covered many valuable methods from System.Linq class library before, so you already know it’s packed with extension-method goodness. Today I’d like to cover two small families I’ve neglected to mention before: Skip() and Take(). While these methods seem so simple, they are an easy way to create sub-sequences for IEnumerable<T>, much the way GetRange() creates sub-lists for List<T>. Skip() and SkipWhile() The Skip() family of methods is used to ignore items in a sequence until either a certain number are passed, or until a certain condition becomes false. This makes the methods great for starting a sequence at a point possibly other than the first item of the original sequence. The Skip() family of methods contains the following methods (shown below in extension method syntax): Skip(int count) Ignores the specified number of items and returns a sequence starting at the item after the last skipped item (if any). SkipWhile(Func<T, bool> predicate) Ignores items as long as the predicate returns true and returns a sequence starting with the first item to invalidate the predicate (if any). SkipWhile(Func<T, int, bool> predicate) Same as above, but passes not only the item itself to the predicate, but also the index of the item. For example: 1: var list = new[] { 3.14, 2.72, 42.0, 9.9, 13.0, 101.0 }; 2: 3: // sequence contains { 2.72, 42.0, 9.9, 13.0, 101.0 } 4: var afterSecond = list.Skip(1); 5: Console.WriteLine(string.Join(", ", afterSecond)); 6: 7: // sequence contains { 42.0, 9.9, 13.0, 101.0 } 8: var afterFirstDoubleDigit = list.SkipWhile(v => v < 10.0); 9: Console.WriteLine(string.Join(", ", afterFirstDoubleDigit)); Note that the SkipWhile() stops skipping at the first item that returns false and returns from there to the rest of the sequence, even if further items in that sequence also would satisfy the predicate (otherwise, you’d probably be using Where() instead, of course). If you do use the form of SkipWhile() which also passes an index into the predicate, then you should keep in mind that this is the index of the item in the sequence you are calling SkipWhile() from, not the index in the original collection. That is, consider the following: 1: var list = new[] { 1.0, 1.1, 1.2, 2.2, 2.3, 2.4 }; 2: 3: // Get all items < 10, then 4: var whatAmI = list 5: .Skip(2) 6: .SkipWhile((i, x) => i > x); For this example the result above is 2.4, and not 1.2, 2.2, 2.3, 2.4 as some might expect. The key is knowing what the index is that’s passed to the predicate in SkipWhile(). In the code above, because Skip(2) skips 1.0 and 1.1, the sequence passed to SkipWhile() begins at 1.2 and thus it considers the “index” of 1.2 to be 0 and not 2. This same logic applies when using any of the extension methods that have an overload that allows you to pass an index into the delegate, such as SkipWhile(), TakeWhile(), Select(), Where(), etc. It should also be noted, that it’s fine to Skip() more items than exist in the sequence (an empty sequence is the result), or even to Skip(0) which results in the full sequence. So why would it ever be useful to return Skip(0) deliberately? One reason might be to return a List<T> as an immutable sequence. Consider this class: 1: public class MyClass 2: { 3: private List<int> _myList = new List<int>(); 4: 5: // works on surface, but one can cast back to List<int> and mutate the original... 6: public IEnumerable<int> OneWay 7: { 8: get { return _myList; } 9: } 10: 11: // works, but still has Add() etc which throw at runtime if accidentally called 12: public ReadOnlyCollection<int> AnotherWay 13: { 14: get { return new ReadOnlyCollection<int>(_myList); } 15: } 16: 17: // immutable, can't be cast back to List<int>, doesn't have methods that throw at runtime 18: public IEnumerable<int> YetAnotherWay 19: { 20: get { return _myList.Skip(0); } 21: } 22: } This code snippet shows three (among many) ways to return an internal sequence in varying levels of immutability. Obviously if you just try to return as IEnumerable<T> without doing anything more, there’s always the danger the caller could cast back to List<T> and mutate your internal structure. You could also return a ReadOnlyCollection<T>, but this still has the mutating methods, they just throw at runtime when called instead of giving compiler errors. Finally, you can return the internal list as a sequence using Skip(0) which skips no items and just runs an iterator through the list. The result is an iterator, which cannot be cast back to List<T>. Of course, there’s many ways to do this (including just cloning the list, etc.) but the point is it illustrates a potential use of using an explicit Skip(0). Take() and TakeWhile() The Take() and TakeWhile() methods can be though of as somewhat of the inverse of Skip() and SkipWhile(). That is, while Skip() ignores the first X items and returns the rest, Take() returns a sequence of the first X items and ignores the rest. Since they are somewhat of an inverse of each other, it makes sense that their calling signatures are identical (beyond the method name obviously): Take(int count) Returns a sequence containing up to the specified number of items. Anything after the count is ignored. TakeWhile(Func<T, bool> predicate) Returns a sequence containing items as long as the predicate returns true. Anything from the point the predicate returns false and beyond is ignored. TakeWhile(Func<T, int, bool> predicate) Same as above, but passes not only the item itself to the predicate, but also the index of the item. So, for example, we could do the following: 1: var list = new[] { 1.0, 1.1, 1.2, 2.2, 2.3, 2.4 }; 2: 3: // sequence contains 1.0 and 1.1 4: var firstTwo = list.Take(2); 5: 6: // sequence contains 1.0, 1.1, 1.2 7: var underTwo = list.TakeWhile(i => i < 2.0); The same considerations for SkipWhile() with index apply to TakeWhile() with index, of course. Using Skip() and Take() for sub-sequences A few weeks back, I talked about The List<T> Range Methods and showed how they could be used to get a sub-list of a List<T>. This works well if you’re dealing with List<T>, or don’t mind converting to List<T>. But if you have a simple IEnumerable<T> sequence and want to get a sub-sequence, you can also use Skip() and Take() to much the same effect: 1: var list = new List<double> { 1.0, 1.1, 1.2, 2.2, 2.3, 2.4 }; 2: 3: // results in List<T> containing { 1.2, 2.2, 2.3 } 4: var subList = list.GetRange(2, 3); 5: 6: // results in sequence containing { 1.2, 2.2, 2.3 } 7: var subSequence = list.Skip(2).Take(3); I say “much the same effect” because there are some differences. First of all GetRange() will throw if the starting index or the count are greater than the number of items in the list, but Skip() and Take() do not. Also GetRange() is a method off of List<T>, thus it can use direct indexing to get to the items much more efficiently, whereas Skip() and Take() operate on sequences and may actually have to walk through the items they skip to create the resulting sequence. So each has their pros and cons. My general rule of thumb is if I’m already working with a List<T> I’ll use GetRange(), but for any plain IEnumerable<T> sequence I’ll tend to prefer Skip() and Take() instead. Summary The Skip() and Take() families of LINQ extension methods are handy for producing sub-sequences from any IEnumerable<T> sequence. Skip() will ignore the specified number of items and return the rest of the sequence, whereas Take() will return the specified number of items and ignore the rest of the sequence. Similarly, the SkipWhile() and TakeWhile() methods can be used to skip or take items, respectively, until a given predicate returns false. Technorati Tags: C#, CSharp, .NET, LINQ, IEnumerable<T>, Skip, Take, SkipWhile, TakeWhile

Read the article
Parallel Classloading Revisited: Fully Concurrent Loading

- by davidholmes

Java 7 introduced support for parallel classloading. A description of that project and its goals can be found here: http://openjdk.java.net/groups/core-libs/ClassLoaderProposal.html The solution for parallel classloading was to add to each class loader a ConcurrentHashMap, referenced through a new field, parallelLockMap. This contains a mapping from class names to Objects to use as a classloading lock for that class name. This was then used in the following way: protected Class loadClass(String name, boolean resolve) throws ClassNotFoundException { synchronized (getClassLoadingLock(name)) { // First, check if the class has already been loaded Class c = findLoadedClass(name); if (c == null) { long t0 = System.nanoTime(); try { if (parent != null) { c = parent.loadClass(name, false); } else { c = findBootstrapClassOrNull(name); } } catch (ClassNotFoundException e) { // ClassNotFoundException thrown if class not found // from the non-null parent class loader } if (c == null) { // If still not found, then invoke findClass in order // to find the class. long t1 = System.nanoTime(); c = findClass(name); // this is the defining class loader; record the stats sun.misc.PerfCounter.getParentDelegationTime().addTime(t1 - t0); sun.misc.PerfCounter.getFindClassTime().addElapsedTimeFrom(t1); sun.misc.PerfCounter.getFindClasses().increment(); } } if (resolve) { resolveClass(c); } return c; } } Where getClassLoadingLock simply does: protected Object getClassLoadingLock(String className) { Object lock = this; if (parallelLockMap != null) { Object newLock = new Object(); lock = parallelLockMap.putIfAbsent(className, newLock); if (lock == null) { lock = newLock; } } return lock; } This approach is very inefficient in terms of the space used per map and the number of maps. First, there is a map per-classloader. As per the code above under normal delegation the current classloader creates and acquires a lock for the given class, checks if it is already loaded, then asks its parent to load it; the parent in turn creates another lock in its own map, checks if the class is already loaded and then delegates to its parent and so on till the boot loader is invoked for which there is no map and no lock. So even in the simplest of applications, you will have two maps (in the system and extensions loaders) for every class that has to be loaded transitively from the application's main class. If you knew before hand which loader would actually load the class the locking would only need to be performed in that loader. As it stands the locking is completely unnecessary for all classes loaded by the boot loader. Secondly, once loading has completed and findClass will return the class, the lock and the map entry is completely unnecessary. But as it stands, the lock objects and their associated entries are never removed from the map. It is worth understanding exactly what the locking is intended to achieve, as this will help us understand potential remedies to the above inefficiencies. Given this is the support for parallel classloading, the class loader itself is unlikely to need to guard against concurrent load attempts - and if that were not the case it is likely that the classloader would need a different means to protect itself rather than a lock per class. Ultimately when a class file is located and the class has to be loaded, defineClass is called which calls into the VM - the VM does not require any locking at the Java level and uses its own mutexes for guarding its internal data structures (such as the system dictionary). The classloader locking is primarily needed to address the following situation: if two threads attempt to load the same class, one will initiate the request through the appropriate loader and eventually cause defineClass to be invoked. Meanwhile the second attempt will block trying to acquire the lock. Once the class is loaded the first thread will release the lock, allowing the second to acquire it. The second thread then sees that the class has now been loaded and will return that class. Neither thread can tell which did the loading and they both continue successfully. Consider if no lock was acquired in the classloader. Both threads will eventually locate the file for the class, read in the bytecodes and call defineClass to actually load the class. In this case the first to call defineClass will succeed, while the second will encounter an exception due to an attempted redefinition of an existing class. It is solely for this error condition that the lock has to be used. (Note that parallel capable classloaders should not need to be doing old deadlock-avoidance tricks like doing a wait() on the lock object\!). There are a number of obvious things we can try to solve this problem and they basically take three forms: Remove the need for locking. This might be achieved by having a new version of defineClass which acts like defineClassIfNotPresent - simply returning an existing Class rather than triggering an exception. Increase the coarseness of locking to reduce the number of lock objects and/or maps. For example, using a single shared lockMap instead of a per-loader lockMap. Reduce the lifetime of lock objects so that entries are removed from the map when no longer needed (eg remove after loading, use weak references to the lock objects and cleanup the map periodically). There are pros and cons to each of these approaches. Unfortunately a significant "con" is that the API introduced in Java 7 to support parallel classloading has essentially mandated that these locks do in fact exist, and they are accessible to the application code (indirectly through the classloader if it exposes them - which a custom loader might do - and regardless they are accessible to custom classloaders). So while we can reason that we could do parallel classloading with no locking, we can not implement this without breaking the specification for parallel classloading that was put in place for Java 7. Similarly we might reason that we can remove a mapping (and the lock object) because the class is already loaded, but this would again violate the specification because it can be reasoned that the following assertion should hold true: Object lock1 = loader.getClassLoadingLock(name); loader.loadClass(name); Object lock2 = loader.getClassLoadingLock(name); assert lock1 == lock2; Without modifying the specification, or at least doing some creative wordsmithing on it, options 1 and 3 are precluded. Even then there are caveats, for example if findLoadedClass is not atomic with respect to defineClass, then you can have concurrent calls to findLoadedClass from different threads and that could be expensive (this is also an argument against moving findLoadedClass outside the locked region - it may speed up the common case where the class is already loaded, but the cost of re-executing after acquiring the lock could be prohibitive. Even option 2 might need some wordsmithing on the specification because the specification for getClassLoadingLock states "returns a dedicated object associated with the specified class name". The question is, what does "dedicated" mean here? Does it mean unique in the sense that the returned object is only associated with the given class in the current loader? Or can the object actually guard loading of multiple classes, possibly across different class loaders? So it seems that changing the specification will be inevitable if we wish to do something here. In which case lets go for something that more cleanly defines what we want to be doing: fully concurrent class-loading. Note: defineClassIfNotPresent is already implemented in the VM as find_or_define_class. It is only used if the AllowParallelDefineClass flag is set. This gives us an easy hook into existing VM mechanics. Proposal: Fully Concurrent ClassLoaders The proposal is that we expand on the notion of a parallel capable class loader and define a "fully concurrent parallel capable class loader" or fully concurrent loader, for short. A fully concurrent loader uses no synchronization in loadClass and the VM uses the "parallel define class" mechanism. For a fully concurrent loader getClassLoadingLock() can return null (or perhaps not - it doesn't matter as we won't use the result anyway). At present we have not made any changes to this method. All the parallel capable JDK classloaders become fully concurrent loaders. This doesn't require any code re-design as none of the mechanisms implemented rely on the per-name locking provided by the parallelLockMap. This seems to give us a path to remove all locking at the Java level during classloading, while retaining full compatibility with Java 7 parallel capable loaders. Fully concurrent loaders will still encounter the performance penalty associated with concurrent attempts to find and prepare a class's bytecode for definition by the VM. What this penalty is depends on the number of concurrent load attempts possible (a function of the number of threads and the application logic, and dependent on the number of processors), and the costs associated with finding and preparing the bytecodes. This obviously has to be measured across a range of applications. Preliminary webrevs: http://cr.openjdk.java.net/~dholmes/concurrent-loaders/webrev.hotspot/ http://cr.openjdk.java.net/~dholmes/concurrent-loaders/webrev.jdk/ Please direct all comments to the mailing list [email protected].

Read the article
Explicit method tables in C# instead of OO - good? bad?

- by FunctorSalad

Hi! I hope the title doesn't sound too subjective; I absolutely do not mean to start a debate on OO in general. I'd merely like to discuss the basic pros and cons for different ways of solving the following sort of problem. Let's take this minimal example: you want to express an abstract datatype T with functions that may take T as input, output, or both: f1 : Takes a T, returns an int f2 : Takes a string, returns a T f3 : Takes a T and a double, returns another T I'd like to avoid downcasting and any other dynamic typing. I'd also like to avoid mutation whenever possible. 1: Abstract-class-based attempt abstract class T { abstract int f1(); // We can't have abstract constructors, so the best we can do, as I see it, is: abstract void f2(string s); // The convention would be that you'd replace calls to the original f2 by invocation of the nullary constructor of the implementing type, followed by invocation of f2. f2 would need to have side-effects to be of any use. // f3 is a problem too: abstract T f3(double d); // This doesn't express that the return value is of the *same* type as the object whose method is invoked; it just expresses that the return value is *some* T. } 2: Parametric polymorphism and an auxilliary class (all implementing classes of TImpl will be singleton classes): abstract class TImpl<T> { abstract int f1(T t); abstract T f2(string s); abstract T f3(T t, double d); } We no longer express that some concrete type actually implements our original spec -- an implementation is simply a type Foo for which we happen to have an instance of TImpl. This doesn't seem to be a problem: If you want a function that works on arbitrary implementations, you just do something like: // Say we want to return a Bar given an arbitrary implementation of our abstract type Bar bar<T>(TImpl<T> ti, T t); At this point, one might as well skip inheritance and singletons altogether and use a 3 First-class function table class /* or struct, even */ TDictT<T> { readonly Func<T,int> f1; readonly Func<string,T> f2; readonly Func<T,double,T> f3; TDict( ... ) { this.f1 = f1; this.f2 = f2; this.f3 = f3; } } Bar bar<T>(TDict<T> td; T t); Though I don't see much practical difference between #2 and #3. Example Implementation class MyT { /* raw data structure goes here; this class needn't have any methods */ } // It doesn't matter where we put the following; could be a static method of MyT, or some static class collecting dictionaries static readonly TDict<MyT> MyTDict = new TDict<MyT>( (t) => /* body of f1 goes here */ , // f2 (s) => /* body of f2 goes here */, // f3 (t,d) => /* body of f3 goes here */ ); Thoughts? #3 is unidiomatic, but it seems rather safe and clean. One question is whether there are any performance concerns with it. I don't usually need dynamic dispatch, and I'd prefer if these function bodies get statically inlined in places where the concrete implementing type is known statically. Is #2 better in that regard?

Read the article
what persistence layer (xml or mysql) should i use for this xml data?

- by fayer

i wonder how i could store a xml structure in a persistence layer. cause the relational data looks like: <entity id="1000070"> <name>apple</name> <entities> <entity id="7002870"> <name>mac</name> <entities> <entity id="7002907"> <name>leopard</name> <entities> <entity id="7024080"> <name>safari</name> </entity> <entity id="7024701"> <name>finder</name> </entity> </entities> </entity> </entities> </entity> <entity id="7024080"> <name>iphone</name> <entities> <entity id="7024080"> <name>3g</name> </entity> <entity id="7024701"> <name>3gs</name> </entity> </entities> </entity> <entity id="7024080"> <name>ipad</name> </entity> </entities> </entity> as you can see, it has no static structure but a dynamical one. mac got 2 descendant levels while iphone got 1 and ipad got 0. i wonder how i could store this data the best way? what are my options. cause it seems impossible to store it in a mysql database due to this dynamical structure. is the only way to store it as a xml file then? is the speed of getting information (xpath/xquery/simplexml) from a xml file worse or greater than from mysql? what are the pros and cons? do i have other options? is storing information in xml files, suited for a lot of users accessing it at the same time? would be great with feedbacks!! thanks! EDIT: now i noticed that i could use something called xml database to store xml data. could someone shed a light on this issue? cause apparently its not as simple as just store data in a xml file?

Read the article
database design help for game / user levels / progress

- by sprugman

Sorry this got long and all prose-y. I'm creating my first truly gamified web app and could use some help thinking about how to structure the data. The Set-up Users need to accomplish tasks in each of several categories before they can move up a level. I've got my Users, Tasks, and Categories tables, and a UserTasks table which joins the three. ("User 3 has added Task 42 in Category 8. Now they've completed it.") That's all fine and working wonderfully. The Challenge I'm not sure of the best way to track the progress in the individual categories toward each level. The "business" rules are: You have to achieve a certain number of points in each category to move up. If you get the number of points needed in Cat 8, but still have other work to do to complete the level, any new Cat 8 points count toward your overall score, but don't "roll over" into the next level. The number of Categories is small (five currently) and unlikely to change often, but by no means absolutely fixed. The number of points needed to level-up will vary per level, probably by a formula, or perhaps a lookup table. So the challenge is to track each user's progress toward the next level in each category. I've thought of a few potential approaches: Possible Solutions Add a column to the users table for each category and reset them all to zero each time a user levels-up. Have a separate UserProgress table with a row for each category for each user and the number of points they have. (Basically a Many-to-Many version of #1.) Add a userLevel column to the UserTasks table and use that to derive their progress with some kind of SUM statement. Their current level will be a simple int in the User table. Pros & Cons (1) seems like by far the most straightforward, but it's also the least flexible. Perhaps I could use a naming convention based on the category ids to help overcome some of that. (With code like "select cats; for each cat, get the value from Users.progress_{cat.id}.") It's also the one where I lose the most data -- I won't know which points counted toward leveling up. I don't have a need in mind for that, so maybe I don't care about that. (2) seems complicated: every time I add or subtract a user or a category, I have to maintain the other table. I foresee synchronization challenges. (3) Is somewhere in between -- cleaner than #2, but less intuitive than #1. In order to find out where a user is, I'd have mildly complex SQL like: SELECT categoryId, SUM(points) from UserTasks WHERE userId={user.id} & countsTowardLevel={user.level} groupBy categoryId Hmm... that doesn't seem so bad. I think I'm talking myself into #3 here, but would love any input, advice or other ideas.

Read the article
Laissez les bon temps rouler! (Microsoft BI Conference 2010)

- by smisner

"Laissez les bons temps rouler" is a Cajun phrase that I heard frequently when I lived in New Orleans in the mid-1990s. It means "Let the good times roll!" and encapsulates a feeling of happy expectation. As I met with many of my peers and new acquaintances at the Microsoft BI Conference last week, this phrase kept running through my mind as people spoke about their plans in their respective businesses, the benefits and opportunities that the recent releases in the BI stack are providing, and their expectations about the future of the BI stack. Notwithstanding some jabs here and there to point out the platform is neither perfect now nor will be anytime soon (along with admissions that the competitors are also not perfect), and notwithstanding several missteps by the event organizers (which I don't care to enumerate), the overarching mood at the conference was positive. It was a refreshing change from the doom and gloom hovering over several conferences that I attended in 2009. Although many people expect economic hardships to continue over the coming year or so, everyone I know in the BI field is busier than ever and expects to stay busy for quite a while. Self-Service BI Self-service was definitely a theme of the BI conference. In the keynote, Ted Kummert opened with a look back to a fairy tale vision of self-service BI that he told in 2008. At that time, the fairy tale future was a time when "every end user was able to use BI technologies within their job in order to move forward more effectively" and transitioned to the present time in which SQL Server 2008 R2, Office 2010, and SharePoint 2010 are available to deliver managed self-service BI. This set of technologies is presumably poised to address the needs of the 80% of users that Kummert said do not use BI today. He proceeded to outline a series of activities that users ought to be able to do themselves--from simple changes to a report like formatting or an addtional data visualization to integration of an additional data source. The keynote then continued with a series of demonstrations of both current and future technology in support of self-service BI. Some highlights that interested me: PowerPivot, of course, is the flagship product for self-service BI in the Microsoft BI stack. In the TechEd keynote, which was open to the BI conference attendees, Amir Netz (twitter) impressed the audience by demonstrating interactivity with a workbook containing 100 million rows. He upped the ante at the BI keynote with his demonstration of a future-state PowerPivot workbook containing over 2 billion records. It's important to note that this volume of data is being processed by a server engine, and not in the PowerPivot client engine. (Yes, I think it's impressive, but none of my clients are typically wrangling with 2 billion records at a time. Maybe they're thinking too small. This ability to work quickly with large data sets has greater implications for BI solutions than for self-service BI, in my opinion.) Amir also demonstrated KPIs for the future PowerPivot, which appeared to be easier to implement than in any other Microsoft product that supports KPIs, apart from simple KPIs in SharePoint. (My initial reaction is that we have one more place to build KPIs. Great. It's confusing enough. I haven't seen how well those KPIs integrate with other BI tools, which will be important for adoption.) One more PowerPivot feature that Amir showed was a graphical display of the lineage for calculations. (This is hugely practical, especially if you build up calculations incrementally. You can more easily follow the logic from calculation to calculation. Furthermore, if you need to make a change to one calculation, you can assess the impact on other calculations.) Another product demonstration will be available within the next 30 days--Pivot for Reporting Services. If you haven't seen this technology yet, check it out at www.getpivot.com. (It definitely has a wow factor, but I'm skeptical about its practicality. However, I'm looking forward to trying it out with data that I understand.) Michael Tejedor (twitter) demonstrated a feature that I think is really interesting and not emphasized nearly enough--overshadowed by PowerPivot, no doubt. That feature is the Microsoft Business Intelligence Indexing Connector, which enables search of the content of Excel workbooks and Reporting Services reports. (This capability existed in MOSS 2007, but was more cumbersome to implement. The search results in SharePoint 2010 are not only cooler, but more useful by describing whether the content is found in a table or a chart, for example.) This may yet be the dawning of the age of self-service BI - a phrase I've heard repeated from time to time over the last decade - but I think BI professionals are likely to stay busy for a long while, and need not start looking for a new line of work. Kummert repeatedly referenced strategic BI solutions in contrast to self-service BI to emphasize that self-service BI is not a replacement for the services that BI professionals provide. After all, self-service BI does not appear magically on user desktops (or whatever device they want to use). A supporting infrastructure is necessary, and grows in complexity in proportion to the need to simplify BI for users. It's one thing to hear the party line touted by Microsoft employees at the BI keynote, but it's another to hear from the people who are responsible for implementing and supporting it within an organization. Rob Collie (blog | twitter), Kasper de Jonge (blog | twitter), Vidas Matelis (site | twitter), and I were invited to join Andrew Brust (blog | twitter) as he led a Birds of a Feather session at TechEd entitled "PowerPivot: Is It the BI Deal-Changer for Developers and IT Pros?" I would single out the prevailing concern in this session as the issue of control. On one side of this issue were those who were concerned that they would lose control once PowerPivot is implemented. On the other side were those who believed that data should be freely accessible to users in PowerPivot, and even acknowledgment that users would get the data they want even if it meant they would have to manually enter into a workbook to have it ready for analysis. For another viewpoint on how PowerPivot played out at the conference, see Rob Collie's observations. Collaborative BI I have been intrigued by the notion of collaborative BI for a very long time. Before I discovered BI, I was a Lotus Notes developer and later a manager of developers, working in a software company that enabled collaboration in the legal industry. Not only did I help create collaborative systems for our clients, I created a complete project management from the ground up to collaboratively manage our custom development work. In that case, collaboration involved my team, my client contacts, and me. I was also able to produce my own BI from that system as well, but didn't know that's what I was doing at the time. Only in recent years has SharePoint begun to catch up with the capabilities that I had with Lotus Notes more than a decade ago. Eventually, I had the opportunity at that job to formally investigate BI as another product offering for our software, and the rest - as they say - is history. I built my first data warehouse with Scott Cameron (who has also ventured into the authoring world by writing Analysis Services 2008 Step by Step and was at the BI Conference last week where I got to reminisce with him for a bit) and that began a career that I never imagined at the time. Fast forward to 2010, and I'm still lauding the virtues of collaborative BI, if only the tools will catch up to my vision! Thus, I was anxious to see what Donald Farmer (blog | twitter) and Rita Sallam of Gartner had to say on the subject in their session "Collaborative Decision Making." As I suspected, the tools aren't quite there yet, but the vendors are moving in the right direction. One thing I liked about this session was a non-Microsoft perspective of the state of the industry with regard to collaborative BI. In addition, this session included a better demonstration of SharePoint collaborative BI capabilities than appeared in the BI keynote. Check out the video in the link to the session to see the demonstration. One of the use cases that was demonstrated was linking from information to a person, because, as Donald put it, "People don't trust data, they trust people." The Microsoft BI Stack in General A question I hear all the time from students when I'm teaching is how to know what tools to use when there is overlap between products in the BI stack. I've never taken the time to codify my thoughts on the subject, but saw that my friend Dan Bulos provided good insight on this topic from a variety of perspectives in his session, "So Many BI Tools, So Little Time." I thought one of his best points was that ideally you should be able to design in your tool of choice, and then deploy to your tool of choice. Unfortunately, the ideal is yet to become real across the platform. The closest we come is with the RDL in Reporting Services which can be produced from two different tools (Report Builder or Business Intelligence Development Studio's Report Designer), manually, or by a third-party or custom application. I have touted the idea for years (and publicly said so about 5 years ago) that eventually more products would be RDL producers or consumers, but we aren't there yet. Maybe in another 5 years. Another interesting session that covered the BI stack against a backdrop of competitive products was delivered by Andrew Brust. Andrew did a marvelous job of consolidating a lot of information in a way that clearly communicated how various vendors' offerings compared to the Microsoft BI stack. He also made a particularly compelling argument about how the existence of an ecosystem around the Microsoft BI stack provided innovation and opportunities lacking for other vendors. Check out his presentation, "How Does the Microsoft BI Stack...Stack Up?" Expo Hall I had planned to spend more time in the Expo Hall to see who was doing new things with the BI stack, but didn't manage to get very far. Each time I set out on an exploratory mission, I got caught up in some fascinating conversations with one or more of my peers. I find interacting with people that I meet at conferences just as important as attending sessions to learn something new. There were a couple of items that really caught me eye, however, that I'll share here. Pragmatic Works. Whether you develop SSIS packages, build SSAS cubes, or author SSRS reports (or all of the above), you really must take a look at BI Documenter. Brian Knight (twitter) walked me through the key features, and I must say I was impressed. Once you've seen what this product can do, you won't want to document your BI projects any other way. You can download a free single-user database edition, or choose from more feature-rich standard or professional editions. Microsoft Press ebooks. I also stopped by the O'Reilly Media booth to meet some folks that one of my acquisitions editors at Microsoft Press recommended. In case you haven't heard, Microsoft Press has partnered with O'Reilly Media for distribution and publishing. Apart from my interest in learning more about O'Reilly Media as an author, an advertisement in their booth caught me eye which I think is a really great move. When you buy Microsoft Press ebooks through the O'Reilly web site, you can receive it in any (or all) of the following formats where possible: PDF, epub, .mobi for Kindle and .apk for Android. You also have lifetime DRM-free access to the ebooks. As someone who is an avid collector of books, I fnd myself running out of room for storage. In addition, I travel a lot, and it's hard to lug my reference library with me. Today's e-reader options make the move to digital books a more viable way to grow my library. Having a variety of formats means I am not limited to a single device, and lifetime access means I don't have to worry about keeping track of where I've stored my files. Because the e-books are DRM-free, I can copy and paste when I'm compiling notes, and I can print pages when necessary. That's a winning combination in my mind! Overall, I was pleased with the BI conference. There were many more sessions that I couldn't attend, either because the room was full when I got there or there were multiple sessions running concurrently that I wanted to see. Fortunately, many of the sessions are accessible for viewing online at http://www.msteched.com/2010/NorthAmerica along with the TechEd sessions. You can spot the BI sessions by the yellow skyline on the title slide of the presentation as shown below. Share this post: email it! | bookmark it! | digg it! | reddit! | kick it! | live it!

Read the article
CLSF & CLK 2013 Trip Report by Jeff Liu

- by jamesmorris

This is a contributed post from Jeff Liu, lead XFS developer for the Oracle mainline Linux kernel team. Recently, I attended both the China Linux Storage and Filesystem workshop (CLSF), and the China Linux Kernel conference (CLK), which were held in Shanghai. Here are the highlights for both events. CLSF - 17th October XFS update (led by Jeff Liu) XFS keeps rapid progress with a lot of changes, especially focused on the infrastructure/performance improvements as well as new feature development. This can be reflected with a sample statistics among XFS/Ext4+JBD2/Btrfs via: # git diff --stat --minimal -C -M v3.7..v3.12-rc4 -- fs/xfs|fs/ext4+fs/jbd2|fs/btrfs XFS: 141 files changed, 27598 insertions(+), 19113 deletions(-) Ext4+JBD2: 39 files changed, 10487 insertions(+), 5454 deletions(-) Btrfs: 70 files changed, 19875 insertions(+), 8130 deletions(-) What made up those changes in XFS? Self-describing metadata(CRC32c). This is a new feature and it contributed about 70% code changes, it can be enabled via `mkfs.xfs -m crc=1 /dev/xxx` for v5 superblock. Transaction log space reservation improvements. With this change, we can calculate the log space reservation at mount time rather than runtime to reduce the the CPU overhead. User namespace support. So both XFS and USERNS can be enabled on kernel configuration begin from Linux 3.10. Thanks Dwight Engen's efforts for this thing. Split project/group quota inodes. Originally, project quota can not be enabled with group quota at the same time because they were share the same quota file inode, now it works but only for v5 super block. i.e, CRC enabled. CONFIG_XFS_WARN, an new lightweight runtime debugger which can be deployed in production environment. Readahead log object recovery, this change can speed up the log replay progress significantly. Speculative preallocation inode tracking, clearing and throttling. The main purpose is to deal with inodes with post-EOF space due to speculative preallocation, support improved quota management to free up a significant amount of unwritten space when at or near EDQUOT. It support backgroup scanning which occurs on a longish interval(5 mins by default, tunable), and on-demand scanning/trimming via ioctl(2). Bitter arguments ensued from this session, especially for the comparison between Ext4 and Btrfs in different areas, I have to spent a whole morning of the 1st day answering those questions. We basically agreed on XFS is the best choice in Linux nowadays because: Stable, XFS has a good record in stability in the past 10 years. Fengguang Wu who lead the 0-day kernel test project also said that he has observed less error than other filesystems in the past 1+ years, I own it to the XFS upstream code reviewer, they always performing serious code review as well as testing. Good performance for large/small files, XFS does not works very well for small files has already been an old story for years. Best choice (maybe) for distributed PB filesystems. e.g, Ceph recommends delopy OSD daemon on XFS because Ext4 has limited xattr size. Best choice for large storage (>16TB). Ext4 does not support a single file more than around 15.95TB. Scalability, any objection to XFS is best in this point? :) XFS is better to deal with transaction concurrency than Ext4, why? The maximum size of the log in XFS is 2038MB compare to 128MB in Ext4. Misc. Ext4 is widely used and it has been proved fast/stable in various loads and scenarios, XFS just need more customers, and Btrfs is still on the road to be a manhood. Ceph Introduction (Led by Li Wang) This a hot topic. Li gave us a nice introduction about the design as well as their current works. Actually, Ceph client has been included in Linux kernel since 2.6.34 and supported by Openstack since Folsom but it seems that it has not yet been widely deployment in production environment. Their major work is focus on the inline data support to separate the metadata and data storage, reduce the file access time, i.e, a file access need communication twice, fetch the metadata from MDS and then get data from OSD, and also, the small file access is limited by the network latency. The solution is, for the small files they would like to store the data at metadata so that when accessing a small file, the metadata server can push both metadata and data to the client at the same time. In this way, they can reduce the overhead of calculating the data offset and save the communication to OSD. For this feature, they have only run some small scale testing but really saw noticeable improvements. Test environment: Intel 2 CPU 12 Core, 64GB RAM, Ubuntu 12.04, Ceph 0.56.6 with 200GB SATA disk, 15 OSD, 1 MDS, 1 MON. The sequence read performance for 1K size files improved about 50%. I have asked Li and Zheng Yan (the core developer of Ceph, who also worked on Btrfs) whether Ceph is really stable and can be deployed at production environment for large scale PB level storage, but they can not give a positive answer, looks Ceph even does not spread over Dreamhost (subject to confirmation). From Li, they only deployed Ceph for a small scale storage(32 nodes) although they'd like to try 6000 nodes in the future. Improve Linux swap for Flash storage (led by Shaohua Li) Because of high density, low power and low price, flash storage (SSD) is a good candidate to partially replace DRAM. A quick answer for this is using SSD as swap. But Linux swap is designed for slow hard disk storage, so there are a lot of challenges to efficiently use SSD for swap. SWAPOUT swap_map scan swap_map is the in-memory data structure to track swap disk usage, but it is a slow linear scan. It will become a bottleneck while finding many adjacent pages in the use of SSD. Shaohua Li have changed it to a cluster(128K) list, resulting in O(1) algorithm. However, this apporoach needs restrictive cluster alignment and only enabled for SSD. IO pattern In most cases, the swap io is in interleaved pattern because of mutiple reclaimers or a free cluster is shared by all reclaimers. Even though block layer can merge interleaved IO to some extent, but we cannot count on it completely. Hence the per-cpu cluster is added base on the previous change, it can help reclaimer do sequential IO and the block layer will be easier to merge IO. TLB flush: If we're reclaiming one active page, we should first move the page from active lru list to inactive lru list, and then reclaim the page from inactive lru to swap it out. During the process, we need to clear PTE twice: first is 'A'(ACCESS) bit, second is 'P'(PRESENT) bit. Processors need to send lots of ipi which make the TLB flush really expensive. Some works have been done to improve this, including rework smp_call_functiom_many() or remove the first TLB flush in x86, but there still have some arguments here and only parts of works have been pushed to mainline. SWAPIN: Page fault does iodepth=1 sync io, but it's a little waste if only issue a page size's IO. The obvious solution is doing swap readahead. But the current in-kernel swap readahead is arbitary(always 8 pages), and it always doesn't perform well for both random and sequential access workload. Shaohua introduced a new flag for madvise(MADV_WILLNEED) to do swap prefetch, so the changes happen in userspace API and leave the in-kernel readahead unchanged(but I think some improvement can also be done here). SWAP discard As we know, discard is important for SSD write throughout, but the current swap discard implementation is synchronous. He changed it to async discard which allow discard and write run in the same time. Meanwhile, the unit of discard is also optimized to cluster. Misc: lock contention For many concurrent swapout and swapin , the lock contention such as anon_vma or swap_lock is high, so he changed the swap_lock to a per-swap lock. But there still have some lock contention in very high speed SSD because of swapcache address_space lock. Zproject (led by Bob Liu) Bob gave us a very nice introduction about the current memory compression status. Now there are 3 projects(zswap/zram/zcache) which all aim at smooth swap IO storm and promote performance, but they all have their own pros and cons. ZSWAP It is implemented based on frontswap API and it uses a dynamic allocater named Zbud to allocate free pages. Zbud means pairs of zpages are "buddied" and it can only store at most two compressed pages in one page frame, so the max compress ratio is 50%. Each page frame is lru-linked and can do shink in memory pressure. If the compressed memory pool reach its limitation, shink or reclaim happens. It decompress the page frame into two new allocated pages and then write them to real swap device, but it can fail when allocating the two pages. ZRAM Acts as a compressed ramdisk and used as swap device, and it use zsmalloc as its allocator which has high density but may have fragmentation issues. Besides, page reclaim is hard since it will need more pages to uncompress and free just one page. ZRAM is preferred by embedded system which may not have any real swap device. Now both ZRAM and ZSWAP are in driver/staging tree, and in the mm community there are some disscussions of merging ZRAM into ZSWAP or viceversa, but no agreement yet. ZCACHE Handles file page compression but it is removed out of staging recently. From industry (led by Tang Jie, LSI) An LSI engineer introduced several new produces to us. The first is raid5/6 cards that it use full stripe writes to improve performance. The 2nd one he introduced is SandForce flash controller, who can understand data file types (data entropy) to reduce write amplification (WA) for nearly all writes. It's called DuraWrite and typical WA is 0.5. What's more, if enable its Dynamic Logical Capacity function module, the controller can do data compression which is transparent to upper layer. LSI testing shows that with this virtual capacity enables 1x TB drive can support up to 2x TB capacity, but the application must monitor free flash space to maintain optimal performance and to guard against free flash space exhaustion. He said the most useful application is for datebase. Another thing I think it's worth to mention is that a NV-DRAM memory in NMR/Raptor which is directly exposed to host system. Applications can directly access the NV-DRAM via a memory address - using standard system call mmap(). He said that it is very useful for database logging now. This kind of NVM produces are beginning to appear in recent years, and it is said that Samsung is building a research center in China for related produces. IMHO, NVM will bring an effect to current os layer especially on file system, e.g. its journaling may need to redesign to fully utilize these nonvolatile memory. OCFS2 (led by Canquan Shen) Without a doubt, HuaWei is the biggest contributor to OCFS2 in the past two years. They have posted 46 upstream patches and 39 patches have been merged. Their current project is based on 32/64 nodes cluster, but they also tried 128 nodes at the experimental stage. The major work they are working is to support ATS (atomic test and set), it can be works with DLM at the same time. Looks this idea is inspired by the vmware VMFS locking, i.e, http://blogs.vmware.com/vsphere/2012/05/vmfs-locking-uncovered.html CLK - 18th October 2013 Improving Linux Development with Better Tools (Andi Kleen) This talk focused on how to find/solve bugs along with the Linux complexity growing. Generally, we can do this with the following kind of tools: Static code checkers tools. e.g, sparse, smatch, coccinelle, clang checker, checkpatch, gcc -W/LTO, stanse. This can help check a lot of things, simple mistakes, complex problems, but the challenges are: some are very slow, false positives, may need a concentrated effort to get false positives down. Especially, no static checker I found can follow indirect calls (“OO in C”, common in kernel): struct foo_ops { int (*do_foo)(struct foo *obj); } foo->do_foo(foo); Dynamic runtime checkers, e.g, thread checkers, kmemcheck, lockdep. Ideally all kernel code would come with a test suite, then someone could run all the dynamic checkers. Fuzzers/test suites. e.g, Trinity is a great tool, it finds many bugs, but needs manual model for each syscall. Modern fuzzers around using automatic feedback, but notfor kernel yet: http://taviso.decsystem.org/making_software_dumber.pdf Debuggers/Tracers to understand code, e.g, ftrace, can dump on events/oops/custom triggers, but still too much overhead in many cases to run always during debug. Tools to read/understand source, e.g, grep/cscope work great for many cases, but do not understand indirect pointers (OO in C model used in kernel), give us all “do_foo” instances: struct foo_ops { int (*do_foo)(struct foo *obj); } = { .do_foo = my_foo }; foo>do_foo(foo); That would be great to have a cscope like tool that understands this based on types/initializers XFS: The High Performance Enterprise File System (Jeff Liu) [slides] I gave a talk for introducing the disk layout, unique features, as well as the recent changes. The slides include some charts to reflect the performances between XFS/Btrfs/Ext4 for small files. About a dozen users raised their hands when I asking who has experienced with XFS. I remembered that when I asked the same question in LinuxCon/Japan, only 3 people raised their hands, but they are Chris Mason, Ric Wheeler, and another attendee. The attendee questions were mainly focused on stability, and comparison with other file systems. Linux Containers (Feng Gao) The speaker introduced us that the purpose for those kind of namespaces, include mount/UTS/IPC/Network/Pid/User, as well as the system API/ABI. For the userspace tools, He mainly focus on the Libvirt LXC rather than us(LXC). Libvirt LXC is another userspace container management tool, implemented as one type of libvirt driver, it can manage containers, create namespace, create private filesystem layout for container, Create devices for container and setup resources controller via cgroup. In this talk, Feng also mentioned another two possible new namespaces in the future, the 1st is the audit, but not sure if it should be assigned to user namespace or not. Another is about syslog, but the question is do we really need it? In-memory Compression (Bob Liu) Same as CLSF, a nice introduction that I have already mentioned above. Misc There were some other talks related to ACPI based memory hotplug, smart wake-affinity in scheduler etc., but my head is not big enough to record all those things. -- Jeff Liu

Read the article
C#/.NET Fundamentals: Choosing the Right Collection Class

- by James Michael Hare

The .NET Base Class Library (BCL) has a wide array of collection classes at your disposal which make it easy to manage collections of objects. While it's great to have so many classes available, it can be daunting to choose the right collection to use for any given situation. As hard as it may be, choosing the right collection can be absolutely key to the performance and maintainability of your application! This post will look at breaking down any confusion between each collection and the situations in which they excel. We will be spending most of our time looking at the System.Collections.Generic namespace, which is the recommended set of collections. The Generic Collections: System.Collections.Generic namespace The generic collections were introduced in .NET 2.0 in the System.Collections.Generic namespace. This is the main body of collections you should tend to focus on first, as they will tend to suit 99% of your needs right up front. It is important to note that the generic collections are unsynchronized. This decision was made for performance reasons because depending on how you are using the collections its completely possible that synchronization may not be required or may be needed on a higher level than simple method-level synchronization. Furthermore, concurrent read access (all writes done at beginning and never again) is always safe, but for concurrent mixed access you should either synchronize the collection or use one of the concurrent collections. So let's look at each of the collections in turn and its various pros and cons, at the end we'll summarize with a table to help make it easier to compare and contrast the different collections. The Associative Collection Classes Associative collections store a value in the collection by providing a key that is used to add/remove/lookup the item. Hence, the container associates the value with the key. These collections are most useful when you need to lookup/manipulate a collection using a key value. For example, if you wanted to look up an order in a collection of orders by an order id, you might have an associative collection where they key is the order id and the value is the order. The Dictionary<TKey,TVale> is probably the most used associative container class. The Dictionary<TKey,TValue> is the fastest class for associative lookups/inserts/deletes because it uses a hash table under the covers. Because the keys are hashed, the key type should correctly implement GetHashCode() and Equals() appropriately or you should provide an external IEqualityComparer to the dictionary on construction. The insert/delete/lookup time of items in the dictionary is amortized constant time - O(1) - which means no matter how big the dictionary gets, the time it takes to find something remains relatively constant. This is highly desirable for high-speed lookups. The only downside is that the dictionary, by nature of using a hash table, is unordered, so you cannot easily traverse the items in a Dictionary in order. The SortedDictionary<TKey,TValue> is similar to the Dictionary<TKey,TValue> in usage but very different in implementation. The SortedDictionary<TKey,TValye> uses a binary tree under the covers to maintain the items in order by the key. As a consequence of sorting, the type used for the key must correctly implement IComparable<TKey> so that the keys can be correctly sorted. The sorted dictionary trades a little bit of lookup time for the ability to maintain the items in order, thus insert/delete/lookup times in a sorted dictionary are logarithmic - O(log n). Generally speaking, with logarithmic time, you can double the size of the collection and it only has to perform one extra comparison to find the item. Use the SortedDictionary<TKey,TValue> when you want fast lookups but also want to be able to maintain the collection in order by the key. The SortedList<TKey,TValue> is the other ordered associative container class in the generic containers. Once again SortedList<TKey,TValue>, like SortedDictionary<TKey,TValue>, uses a key to sort key-value pairs. Unlike SortedDictionary, however, items in a SortedList are stored as an ordered array of items. This means that insertions and deletions are linear - O(n) - because deleting or adding an item may involve shifting all items up or down in the list. Lookup time, however is O(log n) because the SortedList can use a binary search to find any item in the list by its key. So why would you ever want to do this? Well, the answer is that if you are going to load the SortedList up-front, the insertions will be slower, but because array indexing is faster than following object links, lookups are marginally faster than a SortedDictionary. Once again I'd use this in situations where you want fast lookups and want to maintain the collection in order by the key, and where insertions and deletions are rare. The Non-Associative Containers The other container classes are non-associative. They don't use keys to manipulate the collection but rely on the object itself being stored or some other means (such as index) to manipulate the collection. The List<T> is a basic contiguous storage container. Some people may call this a vector or dynamic array. Essentially it is an array of items that grow once its current capacity is exceeded. Because the items are stored contiguously as an array, you can access items in the List<T> by index very quickly. However inserting and removing in the beginning or middle of the List<T> are very costly because you must shift all the items up or down as you delete or insert respectively. However, adding and removing at the end of a List<T> is an amortized constant operation - O(1). Typically List<T> is the standard go-to collection when you don't have any other constraints, and typically we favor a List<T> even over arrays unless we are sure the size will remain absolutely fixed. The LinkedList<T> is a basic implementation of a doubly-linked list. This means that you can add or remove items in the middle of a linked list very quickly (because there's no items to move up or down in contiguous memory), but you also lose the ability to index items by position quickly. Most of the time we tend to favor List<T> over LinkedList<T> unless you are doing a lot of adding and removing from the collection, in which case a LinkedList<T> may make more sense. The HashSet<T> is an unordered collection of unique items. This means that the collection cannot have duplicates and no order is maintained. Logically, this is very similar to having a Dictionary<TKey,TValue> where the TKey and TValue both refer to the same object. This collection is very useful for maintaining a collection of items you wish to check membership against. For example, if you receive an order for a given vendor code, you may want to check to make sure the vendor code belongs to the set of vendor codes you handle. In these cases a HashSet<T> is useful for super-quick lookups where order is not important. Once again, like in Dictionary, the type T should have a valid implementation of GetHashCode() and Equals(), or you should provide an appropriate IEqualityComparer<T> to the HashSet<T> on construction. The SortedSet<T> is to HashSet<T> what the SortedDictionary<TKey,TValue> is to Dictionary<TKey,TValue>. That is, the SortedSet<T> is a binary tree where the key and value are the same object. This once again means that adding/removing/lookups are logarithmic - O(log n) - but you gain the ability to iterate over the items in order. For this collection to be effective, type T must implement IComparable<T> or you need to supply an external IComparer<T>. Finally, the Stack<T> and Queue<T> are two very specific collections that allow you to handle a sequential collection of objects in very specific ways. The Stack<T> is a last-in-first-out (LIFO) container where items are added and removed from the top of the stack. Typically this is useful in situations where you want to stack actions and then be able to undo those actions in reverse order as needed. The Queue<T> on the other hand is a first-in-first-out container which adds items at the end of the queue and removes items from the front. This is useful for situations where you need to process items in the order in which they came, such as a print spooler or waiting lines. So that's the basic collections. Let's summarize what we've learned in a quick reference table. Collection Ordered? Contiguous Storage? Direct Access? Lookup Efficiency Manipulate Efficiency Notes Dictionary No Yes Via Key Key: O(1) O(1) Best for high performance lookups. SortedDictionary Yes No Via Key Key: O(log n) O(log n) Compromise of Dictionary speed and ordering, uses binary search tree. SortedList Yes Yes Via Key Key: O(log n) O(n) Very similar to SortedDictionary, except tree is implemented in an array, so has faster lookup on preloaded data, but slower loads. List No Yes Via Index Index: O(1) Value: O(n) O(n) Best for smaller lists where direct access required and no ordering. LinkedList No No No Value: O(n) O(1) Best for lists where inserting/deleting in middle is common and no direct access required. HashSet No Yes Via Key Key: O(1) O(1) Unique unordered collection, like a Dictionary except key and value are same object. SortedSet Yes No Via Key Key: O(log n) O(log n) Unique ordered collection, like SortedDictionary except key and value are same object. Stack No Yes Only Top Top: O(1) O(1)* Essentially same as List<T> except only process as LIFO Queue No Yes Only Front Front: O(1) O(1) Essentially same as List<T> except only process as FIFO The Original Collections: System.Collections namespace The original collection classes are largely considered deprecated by developers and by Microsoft itself. In fact they indicate that for the most part you should always favor the generic or concurrent collections, and only use the original collections when you are dealing with legacy .NET code. Because these collections are out of vogue, let's just briefly mention the original collection and their generic equivalents: ArrayList A dynamic, contiguous collection of objects. Favor the generic collection List<T> instead. Hashtable Associative, unordered collection of key-value pairs of objects. Favor the generic collection Dictionary<TKey,TValue> instead. Queue First-in-first-out (FIFO) collection of objects. Favor the generic collection Queue<T> instead. SortedList Associative, ordered collection of key-value pairs of objects. Favor the generic collection SortedList<T> instead. Stack Last-in-first-out (LIFO) collection of objects. Favor the generic collection Stack<T> instead. In general, the older collections are non-type-safe and in some cases less performant than their generic counterparts. Once again, the only reason you should fall back on these older collections is for backward compatibility with legacy code and libraries only. The Concurrent Collections: System.Collections.Concurrent namespace The concurrent collections are new as of .NET 4.0 and are included in the System.Collections.Concurrent namespace. These collections are optimized for use in situations where multi-threaded read and write access of a collection is desired. The concurrent queue, stack, and dictionary work much as you'd expect. The bag and blocking collection are more unique. Below is the summary of each with a link to a blog post I did on each of them. ConcurrentQueue Thread-safe version of a queue (FIFO). For more information see: C#/.NET Little Wonders: The ConcurrentStack and ConcurrentQueue ConcurrentStack Thread-safe version of a stack (LIFO). For more information see: C#/.NET Little Wonders: The ConcurrentStack and ConcurrentQueue ConcurrentBag Thread-safe unordered collection of objects. Optimized for situations where a thread may be bother reader and writer. For more information see: C#/.NET Little Wonders: The ConcurrentBag and BlockingCollection ConcurrentDictionary Thread-safe version of a dictionary. Optimized for multiple readers (allows multiple readers under same lock). For more information see C#/.NET Little Wonders: The ConcurrentDictionary BlockingCollection Wrapper collection that implement producers & consumers paradigm. Readers can block until items are available to read. Writers can block until space is available to write (if bounded). For more information see C#/.NET Little Wonders: The ConcurrentBag and BlockingCollection Summary The .NET BCL has lots of collections built in to help you store and manipulate collections of data. Understanding how these collections work and knowing in which situations each container is best is one of the key skills necessary to build more performant code. Choosing the wrong collection for the job can make your code much slower or even harder to maintain if you choose one that doesn’t perform as well or otherwise doesn’t exactly fit the situation. Remember to avoid the original collections and stick with the generic collections. If you need concurrent access, you can use the generic collections if the data is read-only, or consider the concurrent collections for mixed-access if you are running on .NET 4.0 or higher. Tweet Technorati Tags: C#,.NET,Collecitons,Generic,Concurrent,Dictionary,List,Stack,Queue,SortedList,SortedDictionary,HashSet,SortedSet

Read the article
Running SSIS packages from C#

- by Piotr Rodak

Most of the developers and DBAs know about two ways of deploying packages: You can deploy them to database server and run them using SQL Server Agent job or you can deploy the packages to file system and run them using dtexec.exe utility. Both approaches have their pros and cons. However I would like to show you that there is a third way (sort of) that is often overlooked, and it can give you capabilities the ‘traditional’ approaches can’t. I have been working for a few years with applications that run packages from host applications that are implemented in .NET. As you know, SSIS provides programming model that you can use to implement more flexible solutions. SSIS applications are usually thought to be batch oriented, with fairly rigid architecture and processing model, with fixed timeframes when the packages are executed to process data. It doesn’t to be the case, you don’t have to limit yourself to batch oriented architecture. I have very good experiences with service oriented architectures processing large amounts of data. These applications are more complex than what I would like to show here, but the principle stays the same: you can execute packages as a service, on ad-hoc basis. You can also implement and schedule various signals, HTTP calls, file drops, time schedules, Tibco messages and other to run the packages. You can implement event handler that will trigger execution of SSIS when a certain event occurs in StreamInsight stream. This post is just a small example of how you can use the API and other features to create a service that can run SSIS packages on demand. I thought it might be a good idea to implement a restful service that would listen to requests and execute appropriate actions. As it turns out, it is trivial in C#. The application is implemented as console application for the ease of debugging and running. In reality, you might want to implement the application as Windows service. To begin, you have to reference namespace System.ServiceModel.Web and then add a few lines of code: Uri baseAddress = new Uri("http://localhost:8011/"); WebServiceHost svcHost = new WebServiceHost(typeof(PackRunner), baseAddress); try { svcHost.Open(); Console.WriteLine("Service is running"); Console.WriteLine("Press enter to stop the service."); Console.ReadLine(); svcHost.Close(); } catch (CommunicationException cex) { Console.WriteLine("An exception occurred: {0}", cex.Message); svcHost.Abort(); } The interesting lines are 3, 7 and 13. In line 3 you create a WebServiceHost object. In line 7 you start listening on the defined URL and then in line 13 you shut down the service. As you have noticed, the WebServiceHost constructor is accepting type of an object (here: PackRunner) that will be instantiated as singleton and subsequently used to process the requests. This is the class where you put your logic, but to tell WebServiceHost how to use it, the class must implement an interface which declares methods to be used by the host. The interface itself must be ornamented with attribute ServiceContract. [ServiceContract] public interface IPackRunner { [OperationContract] [WebGet(UriTemplate = "runpack?package={name}")] string RunPackage1(string name); [OperationContract] [WebGet(UriTemplate = "runpackwithparams?package={name}&rows={rows}")] string RunPackage2(string name, int rows); } Each method that is going to be used by WebServiceHost has to have attribute OperationContract, as well as WebGet or WebInvoke attribute. The detailed discussion of the available options is outside of scope of this post. I also recommend using more descriptive names to methods . Then, you have to provide the implementation of the interface: public class PackRunner : IPackRunner { ... There are two methods defined in this class. I think that since the full code is attached to the post, I will show only the more interesting method, the RunPackage2. /// <summary> /// Runs package and sets some of its variables. /// </summary> /// <param name="name">Name of the package</param> /// <param name="rows">Number of rows to export</param> /// <returns></returns> public string RunPackage2(string name, int rows) { try { string pkgLocation = ConfigurationManager.AppSettings["PackagePath"]; pkgLocation = Path.Combine(pkgLocation, name.Replace("\"", "")); Console.WriteLine(); Console.WriteLine("Calling package {0} with parameter {1}.", name, rows); Application app = new Application(); Package pkg = app.LoadPackage(pkgLocation, null); pkg.Variables["User::ExportRows"].Value = rows; DTSExecResult pkgResults = pkg.Execute(); Console.WriteLine(); Console.WriteLine(pkgResults.ToString()); if (pkgResults == DTSExecResult.Failure) { Console.WriteLine(); Console.WriteLine("Errors occured during execution of the package:"); foreach (DtsError er in pkg.Errors) Console.WriteLine("{0}: {1}", er.ErrorCode, er.Description); Console.WriteLine(); return "Errors occured during execution. Contact your support."; } Console.WriteLine(); Console.WriteLine(); return "OK"; } catch (Exception ex) { Console.WriteLine(ex); return ex.ToString(); } } The method accepts package name and number of rows to export. The packages are deployed to the file system. The path to the packages is configured in the application configuration file. This way, you can implement multiple services on the same machine, provided you also configure the URL for each instance appropriately. To run a package, you have to reference Microsoft.SqlServer.Dts.Runtime namespace. This namespace is implemented in Microsoft.SQLServer.ManagedDTS.dll which in my case was installed in the folder “C:\Program Files (x86)\Microsoft SQL Server\100\SDK\Assemblies”. Once you have done it, you can create an instance of Microsoft.SqlServer.Dts.Runtime.Application as in line 18 in the above snippet. It may be a good idea to create the Application object in the constructor of the PackRunner class, to avoid necessity of recreating it each time the service is invoked. Then, in line 19 you see that an instance of Microsoft.SqlServer.Dts.Runtime.Package is created. The method LoadPackage in its simplest form just takes package file name as the first parameter. Before you run the package, you can set its variables to certain values. This is a great way of configuring your packages without all the hassle with dtsConfig files. In the above code sample, variable “User:ExportRows” is set to value of the parameter “rows” of the method. Eventually, you execute the package. The method doesn’t throw exceptions, you have to test the result of execution yourself. If the execution wasn’t successful, you can examine collection of errors exposed by the package. These are the familiar errors you often see during development and debugging of the package. I you run the package from the code, you have opportunity to persist them or log them using your favourite logging framework. The package itself is very simple; it connects to my AdventureWorks database and saves number of rows specified in variable “User::ExportRows” to a file. You should know that before you run the package, you can change its connection strings, logging, events and many more. I attach solution with the test service, as well as a project with two test packages. To test the service, you have to run it and wait for the message saying that the host is started. Then, just type (or copy and paste) the below command to your browser. http://localhost:8011/runpackwithparams?package=%22ExportEmployees.dtsx%22&rows=12 When everything works fine, and you modified the package to point to your AdventureWorks database, you should see "OK” wrapped in xml: I stopped the database service to simulate invalid connection string situation. The output of the request is different now: And the service console window shows more information: As you see, implementing service oriented ETL framework is not a very difficult task. You have ability to configure the packages before you run them, you can implement logging that is consistent with the rest of your system. In application I have worked with we also have resource monitoring and execution control. We don’t allow to run more than certain number of packages to run simultaneously. This ensures we don’t strain the server and we use memory and CPUs efficiently. The attached zip file contains two projects. One is the package runner. It has to be executed with administrative privileges as it registers HTTP namespace. The other project contains two simple packages. This is really a cool thing, you should check it out!

Read the article
Laissez les bon temps rouler! (Microsoft BI Conference 2010)

- by smisner

Laissez les bons temps rouler" is a Cajun phrase that I heard frequently when I lived in New Orleans in the mid-1990s. It means "Let the good times roll!" and encapsulates a feeling of happy expectation. As I met with many of my peers and new acquaintances at the Microsoft BI Conference last week, this phrase kept running through my mind as people spoke about their plans in their respective businesses, the benefits and opportunities that the recent releases in the BI stack are providing, and their expectations about the future of the BI stack.Notwithstanding some jabs here and there to point out the platform is neither perfect now nor will be anytime soon (along with admissions that the competitors are also not perfect), and notwithstanding several missteps by the event organizers (which I don't care to enumerate), the overarching mood at the conference was positive. It was a refreshing change from the doom and gloom hovering over several conferences that I attended in 2009. Although many people expect economic hardships to continue over the coming year or so, everyone I know in the BI field is busier than ever and expects to stay busy for quite a while.Self-Service BISelf-service was definitely a theme of the BI conference. In the keynote, Ted Kummert opened with a look back to a fairy tale vision of self-service BI that he told in 2008. At that time, the fairy tale future was a time when "every end user was able to use BI technologies within their job in order to move forward more effectively" and transitioned to the present time in which SQL Server 2008 R2, Office 2010, and SharePoint 2010 are available to deliver managed self-service BI.This set of technologies is presumably poised to address the needs of the 80% of users that Kummert said do not use BI today. He proceeded to outline a series of activities that users ought to be able to do themselves--from simple changes to a report like formatting or an addtional data visualization to integration of an additional data source. The keynote then continued with a series of demonstrations of both current and future technology in support of self-service BI. Some highlights that interested me:PowerPivot, of course, is the flagship product for self-service BI in the Microsoft BI stack. In the TechEd keynote, which was open to the BI conference attendees, Amir Netz (twitter) impressed the audience by demonstrating interactivity with a workbook containing 100 million rows. He upped the ante at the BI keynote with his demonstration of a future-state PowerPivot workbook containing over 2 billion records. It's important to note that this volume of data is being processed by a server engine, and not in the PowerPivot client engine. (Yes, I think it's impressive, but none of my clients are typically wrangling with 2 billion records at a time. Maybe they're thinking too small. This ability to work quickly with large data sets has greater implications for BI solutions than for self-service BI, in my opinion.)Amir also demonstrated KPIs for the future PowerPivot, which appeared to be easier to implement than in any other Microsoft product that supports KPIs, apart from simple KPIs in SharePoint. (My initial reaction is that we have one more place to build KPIs. Great. It's confusing enough. I haven't seen how well those KPIs integrate with other BI tools, which will be important for adoption.)One more PowerPivot feature that Amir showed was a graphical display of the lineage for calculations. (This is hugely practical, especially if you build up calculations incrementally. You can more easily follow the logic from calculation to calculation. Furthermore, if you need to make a change to one calculation, you can assess the impact on other calculations.)Another product demonstration will be available within the next 30 days--Pivot for Reporting Services. If you haven't seen this technology yet, check it out at www.getpivot.com. (It definitely has a wow factor, but I'm skeptical about its practicality. However, I'm looking forward to trying it out with data that I understand.)Michael Tejedor (twitter) demonstrated a feature that I think is really interesting and not emphasized nearly enough--overshadowed by PowerPivot, no doubt. That feature is the Microsoft Business Intelligence Indexing Connector, which enables search of the content of Excel workbooks and Reporting Services reports. (This capability existed in MOSS 2007, but was more cumbersome to implement. The search results in SharePoint 2010 are not only cooler, but more useful by describing whether the content is found in a table or a chart, for example.)This may yet be the dawning of the age of self-service BI - a phrase I've heard repeated from time to time over the last decade - but I think BI professionals are likely to stay busy for a long while, and need not start looking for a new line of work. Kummert repeatedly referenced strategic BI solutions in contrast to self-service BI to emphasize that self-service BI is not a replacement for the services that BI professionals provide. After all, self-service BI does not appear magically on user desktops (or whatever device they want to use). A supporting infrastructure is necessary, and grows in complexity in proportion to the need to simplify BI for users.It's one thing to hear the party line touted by Microsoft employees at the BI keynote, but it's another to hear from the people who are responsible for implementing and supporting it within an organization. Rob Collie (blog | twitter), Kasper de Jonge (blog | twitter), Vidas Matelis (site | twitter), and I were invited to join Andrew Brust (blog | twitter) as he led a Birds of a Feather session at TechEd entitled "PowerPivot: Is It the BI Deal-Changer for Developers and IT Pros?" I would single out the prevailing concern in this session as the issue of control. On one side of this issue were those who were concerned that they would lose control once PowerPivot is implemented. On the other side were those who believed that data should be freely accessible to users in PowerPivot, and even acknowledgment that users would get the data they want even if it meant they would have to manually enter into a workbook to have it ready for analysis. For another viewpoint on how PowerPivot played out at the conference, see Rob Collie's observations.Collaborative BII have been intrigued by the notion of collaborative BI for a very long time. Before I discovered BI, I was a Lotus Notes developer and later a manager of developers, working in a software company that enabled collaboration in the legal industry. Not only did I help create collaborative systems for our clients, I created a complete project management from the ground up to collaboratively manage our custom development work. In that case, collaboration involved my team, my client contacts, and me. I was also able to produce my own BI from that system as well, but didn't know that's what I was doing at the time. Only in recent years has SharePoint begun to catch up with the capabilities that I had with Lotus Notes more than a decade ago. Eventually, I had the opportunity at that job to formally investigate BI as another product offering for our software, and the rest - as they say - is history. I built my first data warehouse with Scott Cameron (who has also ventured into the authoring world by writing Analysis Services 2008 Step by Step and was at the BI Conference last week where I got to reminisce with him for a bit) and that began a career that I never imagined at the time.Fast forward to 2010, and I'm still lauding the virtues of collaborative BI, if only the tools will catch up to my vision! Thus, I was anxious to see what Donald Farmer (blog | twitter) and Rita Sallam of Gartner had to say on the subject in their session "Collaborative Decision Making." As I suspected, the tools aren't quite there yet, but the vendors are moving in the right direction. One thing I liked about this session was a non-Microsoft perspective of the state of the industry with regard to collaborative BI. In addition, this session included a better demonstration of SharePoint collaborative BI capabilities than appeared in the BI keynote. Check out the video in the link to the session to see the demonstration. One of the use cases that was demonstrated was linking from information to a person, because, as Donald put it, "People don't trust data, they trust people."The Microsoft BI Stack in GeneralA question I hear all the time from students when I'm teaching is how to know what tools to use when there is overlap between products in the BI stack. I've never taken the time to codify my thoughts on the subject, but saw that my friend Dan Bulos provided good insight on this topic from a variety of perspectives in his session, "So Many BI Tools, So Little Time." I thought one of his best points was that ideally you should be able to design in your tool of choice, and then deploy to your tool of choice. Unfortunately, the ideal is yet to become real across the platform. The closest we come is with the RDL in Reporting Services which can be produced from two different tools (Report Builder or Business Intelligence Development Studio's Report Designer), manually, or by a third-party or custom application. I have touted the idea for years (and publicly said so about 5 years ago) that eventually more products would be RDL producers or consumers, but we aren't there yet. Maybe in another 5 years.Another interesting session that covered the BI stack against a backdrop of competitive products was delivered by Andrew Brust. Andrew did a marvelous job of consolidating a lot of information in a way that clearly communicated how various vendors' offerings compared to the Microsoft BI stack. He also made a particularly compelling argument about how the existence of an ecosystem around the Microsoft BI stack provided innovation and opportunities lacking for other vendors. Check out his presentation, "How Does the Microsoft BI Stack...Stack Up?"Expo HallI had planned to spend more time in the Expo Hall to see who was doing new things with the BI stack, but didn't manage to get very far. Each time I set out on an exploratory mission, I got caught up in some fascinating conversations with one or more of my peers. I find interacting with people that I meet at conferences just as important as attending sessions to learn something new. There were a couple of items that really caught me eye, however, that I'll share here.Pragmatic Works. Whether you develop SSIS packages, build SSAS cubes, or author SSRS reports (or all of the above), you really must take a look at BI Documenter. Brian Knight (twitter) walked me through the key features, and I must say I was impressed. Once you've seen what this product can do, you won't want to document your BI projects any other way. You can download a free single-user database edition, or choose from more feature-rich standard or professional editions.Microsoft Press ebooks. I also stopped by the O'Reilly Media booth to meet some folks that one of my acquisitions editors at Microsoft Press recommended. In case you haven't heard, Microsoft Press has partnered with O'Reilly Media for distribution and publishing. Apart from my interest in learning more about O'Reilly Media as an author, an advertisement in their booth caught me eye which I think is a really great move. When you buy Microsoft Press ebooks through the O'Reilly web site, you can receive it in any (or all) of the following formats where possible: PDF, epub, .mobi for Kindle and .apk for Android. You also have lifetime DRM-free access to the ebooks. As someone who is an avid collector of books, I fnd myself running out of room for storage. In addition, I travel a lot, and it's hard to lug my reference library with me. Today's e-reader options make the move to digital books a more viable way to grow my library. Having a variety of formats means I am not limited to a single device, and lifetime access means I don't have to worry about keeping track of where I've stored my files. Because the e-books are DRM-free, I can copy and paste when I'm compiling notes, and I can print pages when necessary. That's a winning combination in my mind!Overall, I was pleased with the BI conference. There were many more sessions that I couldn't attend, either because the room was full when I got there or there were multiple sessions running concurrently that I wanted to see. Fortunately, many of the sessions are accessible for viewing online at http://www.msteched.com/2010/NorthAmerica along with the TechEd sessions. You can spot the BI sessions by the yellow skyline on the title slide of the presentation as shown below. Share this post: email it! | bookmark it! | digg it! | reddit! | kick it! | live it!

Read the article
How to Load Oracle Tables From Hadoop Tutorial (Part 5 - Leveraging Parallelism in OSCH)

- by Bob Hanckel

Normal 0 false false false EN-US X-NONE X-NONE MicrosoftInternetExplorer4 Using OSCH: Beyond Hello World In the previous post we discussed a “Hello World” example for OSCH focusing on the mechanics of getting a toy end-to-end example working. In this post we are going to talk about how to make it work for big data loads. We will explain how to optimize an OSCH external table for load, paying particular attention to Oracle’s DOP (degree of parallelism), the number of external table location files we use, and the number of HDFS files that make up the payload. We will provide some rules that serve as best practices when using OSCH. The assumption is that you have read the previous post and have some end to end OSCH external tables working and now you want to ramp up the size of the loads. Using OSCH External Tables for Access and Loading OSCH external tables are no different from any other Oracle external tables. They can be used to access HDFS content using Oracle SQL: SELECT * FROM my_hdfs_external_table; or use the same SQL access to load a table in Oracle. INSERT INTO my_oracle_table SELECT * FROM my_hdfs_external_table; To speed up the load time, you will want to control the degree of parallelism (i.e. DOP) and add two SQL hints. ALTER SESSION FORCE PARALLEL DML PARALLEL 8; ALTER SESSION FORCE PARALLEL QUERY PARALLEL 8; INSERT /*+ append pq_distribute(my_oracle_table, none) */ INTO my_oracle_table SELECT * FROM my_hdfs_external_table; There are various ways of either hinting at what level of DOP you want to use. The ALTER SESSION statements above force the issue assuming you (the user of the session) are allowed to assert the DOP (more on that in the next section). Alternatively you could embed additional parallel hints directly into the INSERT and SELECT clause respectively. /*+ parallel(my_oracle_table,8) *//*+ parallel(my_hdfs_external_table,8) */ Note that the "append" hint lets you load a target table by reserving space above a given "high watermark" in storage and uses Direct Path load. In other doesn't try to fill blocks that are already allocated and partially filled. It uses unallocated blocks. It is an optimized way of loading a table without incurring the typical resource overhead associated with run-of-the-mill inserts. The "pq_distribute" hint in this context unifies the INSERT and SELECT operators to make data flow during a load more efficient. Finally your target Oracle table should be defined with "NOLOGGING" and "PARALLEL" attributes. The combination of the "NOLOGGING" and use of the "append" hint disables REDO logging, and its overhead. The "PARALLEL" clause tells Oracle to try to use parallel execution when operating on the target table. Determine Your DOP It might feel natural to build your datasets in Hadoop, then afterwards figure out how to tune the OSCH external table definition, but you should start backwards. You should focus on Oracle database, specifically the DOP you want to use when loading (or accessing) HDFS content using external tables. The DOP in Oracle controls how many PQ slaves are launched in parallel when executing an external table. Typically the DOP is something you want to Oracle to control transparently, but for loading content from Hadoop with OSCH, it's something that you will want to control. Oracle computes the maximum DOP that can be used by an Oracle user. The maximum value that can be assigned is an integer value typically equal to the number of CPUs on your Oracle instances, times the number of cores per CPU, times the number of Oracle instances. For example, suppose you have a RAC environment with 2 Oracle instances. And suppose that each system has 2 CPUs with 32 cores. The maximum DOP would be 128 (i.e. 2*2*32). In point of fact if you are running on a production system, the maximum DOP you are allowed to use will be restricted by the Oracle DBA. This is because using a system maximum DOP can subsume all system resources on Oracle and starve anything else that is executing. Obviously on a production system where resources need to be shared 24x7, this can’t be allowed to happen. The use cases for being able to run OSCH with a maximum DOP are when you have exclusive access to all the resources on an Oracle system. This can be in situations when your are first seeding tables in a new Oracle database, or there is a time where normal activity in the production database can be safely taken off-line for a few hours to free up resources for a big incremental load. Using OSCH on high end machines (specifically Oracle Exadata and Oracle BDA cabled with Infiniband), this mode of operation can load up to 15TB per hour. The bottom line is that you should first figure out what DOP you will be allowed to run with by talking to the DBAs who manage the production system. You then use that number to derive the number of location files, and (optionally) the number of HDFS data files that you want to generate, assuming that is flexible. Rule 1: Find out the maximum DOP you will be allowed to use with OSCH on the target Oracle system Determining the Number of Location Files Let’s assume that the DBA told you that your maximum DOP was 8. You want the number of location files in your external table to be big enough to utilize all 8 PQ slaves, and you want them to represent equally balanced workloads. Remember location files in OSCH are metadata lists of HDFS files and are created using OSCH’s External Table tool. They also represent the workload size given to an individual Oracle PQ slave (i.e. a PQ slave is given one location file to process at a time, and only it will process the contents of the location file.) Rule 2: The size of the workload of a single location file (and the PQ slave that processes it) is the sum of the content size of the HDFS files it lists For example, if a location file lists 5 HDFS files which are each 100GB in size, the workload size for that location file is 500GB. The number of location files that you generate is something you control by providing a number as input to OSCH’s External Table tool. Rule 3: The number of location files chosen should be a small multiple of the DOP Each location file represents one workload for one PQ slave. So the goal is to keep all slaves busy and try to give them equivalent workloads. Obviously if you run with a DOP of 8 but have 5 location files, only five PQ slaves will have something to do and the other three will have nothing to do and will quietly exit. If you run with 9 location files, then the PQ slaves will pick up the first 8 location files, and assuming they have equal work loads, will finish up about the same time. But the first PQ slave to finish its job will then be rescheduled to process the ninth location file, potentially doubling the end to end processing time. So for this DOP using 8, 16, or 32 location files would be a good idea. Determining the Number of HDFS Files Let’s start with the next rule and then explain it: Rule 4: The number of HDFS files should try to be a multiple of the number of location files and try to be relatively the same size In our running example, the DOP is 8. This means that the number of location files should be a small multiple of 8. Remember that each location file represents a list of unique HDFS files to load, and that the sum of the files listed in each location file is a workload for one Oracle PQ slave. The OSCH External Table tool will look in an HDFS directory for a set of HDFS files to load. It will generate N number of location files (where N is the value you gave to the tool). It will then try to divvy up the HDFS files and do its best to make sure the workload across location files is as balanced as possible. (The tool uses a greedy algorithm that grabs the biggest HDFS file and delegates it to a particular location file. It then looks for the next biggest file and puts in some other location file, and so on). The tools ability to balance is reduced if HDFS file sizes are grossly out of balance or are too few. For example suppose my DOP is 8 and the number of location files is 8. Suppose I have only 8 HDFS files, where one file is 900GB and the others are 100GB. When the tool tries to balance the load it will be forced to put the singleton 900GB into one location file, and put each of the 100GB files in the 7 remaining location files. The load balance skew is 9 to 1. One PQ slave will be working overtime, while the slacker PQ slaves are off enjoying happy hour. If however the total payload (1600 GB) were broken up into smaller HDFS files, the OSCH External Table tool would have an easier time generating a list where each workload for each location file is relatively the same. Applying Rule 4 above to our DOP of 8, we could divide the workload into160 files that were approximately 10 GB in size. For this scenario the OSCH External Table tool would populate each location file with 20 HDFS file references, and all location files would have similar workloads (approximately 200GB per location file.) As a rule, when the OSCH External Table tool has to deal with more and smaller files it will be able to create more balanced loads. How small should HDFS files get? Not so small that the HDFS open and close file overhead starts having a substantial impact. For our performance test system (Exadata/BDA with Infiniband), I compared three OSCH loads of 1 TiB. One load had 128 HDFS files living in 64 location files where each HDFS file was about 8GB. I then did the same load with 12800 files where each HDFS file was about 80MB size. The end to end load time was virtually the same. However when I got ridiculously small (i.e. 128000 files at about 8MB per file), it started to make an impact and slow down the load time. What happens if you break rules 3 or 4 above? Nothing draconian, everything will still function. You just won’t be taking full advantage of the generous DOP that was allocated to you by your friendly DBA. The key point of the rules articulated above is this: if you know that HDFS content is ultimately going to be loaded into Oracle using OSCH, it makes sense to chop them up into the right number of files roughly the same size, derived from the DOP that you expect to use for loading. Next Steps So far we have talked about OLH and OSCH as alternative models for loading. That’s not quite the whole story. They can be used together in a way that provides for more efficient OSCH loads and allows one to be more flexible about scheduling on a Hadoop cluster and an Oracle Database to perform load operations. The next lesson will talk about Oracle Data Pump files generated by OLH, and loaded using OSCH. It will also outline the pros and cons of using various load methods. This will be followed up with a final tutorial lesson focusing on how to optimize OLH and OSCH for use on Oracle's engineered systems: specifically Exadata and the BDA. /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin-top:0in; mso-para-margin-right:0in; mso-para-margin-bottom:10.0pt; mso-para-margin-left:0in; line-height:115%; mso-pagination:widow-orphan; font-size:11.0pt; font-family:"Calibri","sans-serif"; mso-ascii-font-family:Calibri; mso-ascii-theme-font:minor-latin; mso-fareast-font-family:"Times New Roman"; mso-fareast-theme-font:minor-fareast; mso-hansi-font-family:Calibri; mso-hansi-theme-font:minor-latin;}

Read the article
Multiple routers, subnets, gateways etc

- by allentown

My current setup is: Cable modem dishes out 13 static IP's (/28), a GB switch is plugged into the cable modem, and has access to those 13 static IP's, I have about 6 "servers" in use right now. The cable modem is also a firewall, DHCP server, and 3 port 10/100 switch. I am using it as a firewall, but not currently as a DHCP server. I have plugged into the cable modem, two network cables, one which goes to the WAN port of a Linksys Dual Band Wireless 10/100/1000 router/switch. Into the linksys are a few workstations, a few printers, and some laptops connecting to wifi. I set the Linksys to use take static IP, and enabled DHCP for the workstations, printers, etc in 192.168.1.1/24. The network for the Linksys is mostly self contained, backups go to a SAN, on that network, it all happens through that switch, over GB. But I also get internet access from it as well via the cable modem using one static IP. This all works, however, I can not "see" the static IP machines when I am on the Linksys. I can get to them via ssh and other protocols, and if I want to from "outside", I open holes, like 80, 25, 587, 143, 22, etc. The second wire, from the cable modem/fireall/switch just uplinks to the managed GB switch. What are the pros and cons of this? I do not like giving up the static IP to the Linksys. I basically have a mixed network of public servers, and internal workstations. I want the public servers on public IP's because I do not want to mess with port forwarding and mappings. Is it correct also, that if someone breaches the Linksys wifi, they still would have a hard time getting to the static IP range, just by nature of the network topology? Today, just for a test, I toggled on the DHCP in the firewall/cable modem at 10.1.10.1/24 range, the Linksys is n the 192.168.1.100/24 range. At that point, all the static IP machines still had in and out access, but Linksys was unreachable. The cable modem only has 10/100 ports, so I will not plug anything but the network drop into it, which is 50Mb/10Mb. Which makes me think this could be less than ideal, as transfers from the workstation network to the server network will be bottlenecked at 100Mb when I have 1000Mb available. I may not need to solve that, if isolation is better though. I do not move a lot of data, if any, from Linsys network to server network, so for it to pretend to be remote is ok. Should I approach this any different? I could enable DHCP on the cable modem/firewall, it should still send out the statics to the GB switch, but will also be a DHCP in 10.1.10.1/24 range? I can then plug the Linksys into the GB switch, which is now picking up statics and the 10.1.10.1/24 ranges, tell the Linksys to use 10.1.10.5 or so. Now, do I disable DHCP on the Linksys, and the cable modem/firewall will pass through the statics and 10.0.10.1/24 ranges as well? Or, could I open a second DHCP pool on the Linksys? I guess doing so gives me network isolation again, but it is just the reverse of what I have now. But I get out of the bottleneck, not that the Linksys could ever really touch real GB speeds anyway, but the managed switch certainly can. This is all because 13 statics are not that many. Right now, 6 "servers", the Linksys, a managed switch, a few SSL certs, and I am running out. I do not want to waste a static IP on the managed GB switch, or the Linksys, unless it provides me some type of benefit. Final question, under my current setup, if I am on a workstation, sitting at 192.168.1.109, the Linksys, with GB, and I send a file over ssh to the static IP machine, is that literally leaving the internet, and coming back in, or does it stay local? To me it seems like: Workstation (192.168.1.109) -> Linksys DHCP -> Linksys Static IP -> Cable Modem -> Server ( and it hits the 10/100 ports on the cable modem, slowing me down. But does it round trip the network, leave and come back in, limiting me to the 50/10 internet speeds? *These are all made up numbers, I do not use default router IP's as I will one day add a VPN, and do not want collisions. I need some recommendations, do I want one big network, or two isolated ones. Printers these days need an IP, everything does, I can not get autoconf/bonjour to be reliable on most printers. but I am also not sure I want the "server" side of my operation to be polluted by the workstation side of my operation. Unless there is some magic subetting I have not learned yet, here is what I am thinking: Cable modem 10/100, has 13 static IP, publicly accessible -> Enable DHCP on the cable modem -> Cable modem plugs into managed switch -> Managed switch gets 10.1.10.1 ssh, telnet, https admin management address -> Managed switch sends static IP's to to servers -> Plug Linksys into managed switch, giving it 10.1.10.2 static internally in Linksys admin -> Linksys gets assigned 10.1.10.x as its DHCP sending range -> Local printers, workstations, iPhones etc, connect to this -> ( Do I enable DHCP or disable it on the Linksys, just define a non over lapping range, or create an entirely new DHCP at 10.1.50.0/24, I think I am back isolated again with that method too? ) Thank you for any suggestions. This is the first time I have had to deal with less than a /24, and most are larger than that, but it is just a drop to a cabinet. Otherwise, it's a router, a few repeaters, and soho stuff that is simple, with one IP. I know a few may suggest going all DHCP on the servers, and I may one day, just not now, there has been too much moving of gear for me to be interested in that, and I would want something in the Catalyst series to deal with that.

Read the article

Search Results

Search found 1281 results on 52 pages for 'joes 2 pros'.

Page 51/52 | < Previous Page | 47 48 49 50 51 52 | Next Page >

- by Michelle Kimihira

- by Kyle Burns

- by Your DisplayName here!

- by D'Arcy Lussier

- by Timothy Klenke

- by ToStringTheory

- by DoomStone

- by Stefan Hinker

- by Wayne Molina

- by James Michael Hare

- by BuckWoody

- by Jakub Holý

- by Grant Fritchey

- by James Michael Hare

- by davidholmes

- by FunctorSalad

- by fayer

- by sprugman

- by smisner

- by jamesmorris

- by James Michael Hare

- by Piotr Rodak

- by smisner

- by Bob Hanckel

- by allentown

< Previous Page | 47 48 49 50 51 52 | Next Page >