Search Results

Search found 109 results on 5 pages for 'acid grim'.


  • Splitting Nucleotide Sequences in JS with Regexp

    - by TEmerson
    I'm trying to split up a nucleotide sequence into amino acid strings using a regular expression. I have to start a new string at each occurrence of the string "ATG", but I don't want to actually stop the first match at the "ATG". Valid input is any ordering of a string of As, Cs, Gs, and Ts. For example, given the input string: ATGAACATAGGACATGAGGAGTCA I should get two strings: ATGAACATAGGACATGAGGAGTCA (the whole thing) and ATGAGGAGTCA (the first match of "ATG" onward). A string that contains "ATG" n times should result in n results. I thought the expression /(?:[ACGT]*)(ATG)[ACGT]*/g would work, but it doesn't. If this can't be done with a regexp it's easy enough to just write out the code for, but I always prefer an elegant solution if one is available.
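    One possible approach, sketched here in Python rather than JavaScript: the pattern in the question consumes the whole input in a single match, so the global flag has nothing left to find. Putting a capturing group inside a zero-width lookahead lets the matches overlap. The function name orf_candidates is made up for this illustration. The same pattern should also work in JavaScript with String.prototype.matchAll and the g flag.

```python
import re
from typing import List

# A zero-width lookahead consumes no characters, so the scan advances one
# position at a time and every "ATG" gets its own match. The capturing
# group inside the lookahead holds the text from that "ATG" to the end
# of the run of valid nucleotide letters.
START_CODON_SUFFIX = re.compile(r"(?=(ATG[ACGT]*))")


def orf_candidates(sequence: str) -> List[str]:
    """Return one string per occurrence of "ATG", each running to the end of the input."""
    return [match.group(1) for match in START_CODON_SUFFIX.finditer(sequence)]


if __name__ == "__main__":
    for candidate in orf_candidates("ATGAACATAGGACATGAGGAGTCA"):
        print(candidate)
```

    For the example input this yields the whole sequence and then ATGAGGAGTCA, and an input containing "ATG" n times yields n strings.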

    Read the article

  • Need to Know

    - by Tony Davis
    Sometimes, I wonder whether writers of documentation, tutorials and articles stop to ask themselves one very important question: Does the reader really need to know this? I recently took on the task of writing a concise series of articles about the transaction log: what it is, how it works and why it's important. It was an enjoyable task, rather like peering inside a giant, complex clock mechanism. Initially, one sees only the basic components, which work to guarantee the integrity of database transactions, and preserve these transactions so that data can be restored to a previous point in time. On closer inspection, one notices all of the small, arcane mechanisms that are necessary to make this happen: LSNs, virtual log files, log chains, database checkpoints, and so on. It was engrossing, escapist stuff; what I'd written looked weighty and steeped in mysterious significance. Suddenly, however, I jolted myself back to reality with the awful thought "does anyone really need to know all this?" The driver of a car needs only to be dimly aware of what goes on under the hood, however exciting the mechanism is to the engineer. Similarly, while everyone who uses SQL Server ought to be aware of the transaction log, its role in guaranteeing the ACID properties, and how to control its growth, the intricate mechanisms ticking away under its clock face are a world away from the daily work of the harassed developer. The DBA needs to know more, such as the correct rituals for ensuring optimal performance and data integrity, setting the appropriate growth characteristics, backup routines, restore procedures, and so on. However, even then, the average DBA only needs to understand enough about the arcane processes to spot problems and react appropriately, or to know how to Google for the best way of dealing with them. The art of technical writing is tied up in intimate knowledge of your audience and what they need to know at any point.
It means serving up just enough at each point to help the reader in a practical way, but not to overcook it, or stuff the reader with information that does them no good. When I think of the books and articles that have helped me the most, they have been full of brief, practical, and well-informed guidance, based on experience. This seems far-removed from the 900-page "beginner's guides" that one now sees everywhere. The more I write and edit, the more I become convinced that the real art of technical communication lies in knowing what to leave out. In what areas do the SQL Server technical materials suffer from "information overload"? Where else does it seem that concise, practical advice is drowned out by endless discussion of the "clock mechanisms"? Cheers, Tony.

    Read the article

  • MySQL and Hadoop Integration - Unlocking New Insight

    - by Mat Keep
    “Big Data” offers the potential for organizations to revolutionize their operations. With the volume of business data doubling every 1.2 years, analysts and business users are discovering very real benefits when integrating and analyzing data from multiple sources, enabling deeper insight into their customers, partners, and business processes. As the world’s most popular open source database, and the most deployed database in the web and cloud, MySQL is a key component of many big data platforms, with Hadoop vendors estimating 80% of deployments are integrated with MySQL. The new Guide to MySQL and Hadoop presents the tools enabling integration between the two data platforms, supporting the data lifecycle from acquisition and organisation to analysis and visualisation / decision. The Guide details each of these stages and the technologies supporting them:
    Acquire: Through new NoSQL APIs, MySQL is able to ingest high volume, high velocity data, without sacrificing ACID guarantees, thereby ensuring data quality. Real-time analytics can also be run against newly acquired data, enabling immediate business insight, before data is loaded into Hadoop. In addition, sensitive data can be pre-processed, for example healthcare or financial services records can be anonymized, before transfer to Hadoop.
    Organize: Data is transferred from MySQL tables to Hadoop using Apache Sqoop. With the MySQL Binlog (Binary Log) API, users can also invoke real-time change data capture processes to stream updates to HDFS.
    Analyze: Multi-structured data ingested from multiple sources is consolidated and processed within the Hadoop platform.
    Decide: The results of the analysis are loaded back to MySQL via Apache Sqoop where they inform real-time operational processes or provide source data for BI analytics tools.
    So how are companies taking advantage of this today?
As an example, on-line retailers can use big data from their web properties to better understand site visitors’ activities, such as paths through the site, pages viewed, and comments posted. This knowledge can be combined with user profiles and purchasing history to gain a better understanding of customers, and the delivery of highly targeted offers. Of course, it is not just in the web that big data can make a difference. Every business activity can benefit, with other common use cases including: - Sentiment analysis; - Marketing campaign analysis; - Customer churn modeling; - Fraud detection; - Research and Development; - Risk Modeling; - And more. As the guide discusses, Big Data is promising a significant transformation of the way organizations leverage data to run their businesses. MySQL can be seamlessly integrated within a Big Data lifecycle, enabling the unification of multi-structured data into common data platforms, taking advantage of all new data sources and yielding more insight than was ever previously imaginable. Download the guide to MySQL and Hadoop integration to learn more. I'd also be interested in hearing about how you are integrating MySQL with Hadoop today, and your requirements for the future, so please use the comments on this blog to share your insights.

    Read the article

  • How does I/O work for large graph databases?

    - by tjb1982
    I should preface this by saying that I'm mostly a front end web developer, trained as a musician, but over the past few years I've been getting more and more into computer science. So one idea I have as a fun toy project to learn about data structures and C programming was to design and implement my own very simple database that would manage an adjacency list of posts. I don't want SQL (maybe I'll do my own query language? I'm just having fun). It should support ACID. It should be capable of storing 1TB let's say. So with that, I was trying to think of how a database even stores data, without regard to data structures necessarily. I'm working on linux, and I've read that in that world "everything is a file," including hardware (like /dev/*), so I think that that obviously has to apply to a database, too, and it clearly does--whether it's MySQL or PostgreSQL or Neo4j, the database itself is a collection of files you can see in the filesystem. That said, there would come a point in scale where loading the entire database into primary memory just wouldn't work, so it doesn't make sense to design it with that mindset (I assume). However, reading from secondary memory would be much slower and regardless some portion of the database has to be in primary memory in order for you to be able to do anything with it. I read this post: Why use a database instead of just saving your data to disk? And I found it difficult to understand how other databases, like SQLite or Neo4j, read and write from secondary memory and are still very fast (faster, it would seem, than simply writing files to the filesystem as the above question suggests). It seems the key is indexing. But even indexes need to be stored in secondary memory. They are inherently smaller than the database itself, but indexes in a very large database might be prohibitively large, too. 
So my question is how is I/O generally done with large databases like the one I described above that would be at least 1TB storing a big adjacency list? If indexing is more or less the answer, how exactly does indexing work--what data structures should be involved?
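    As a rough illustration of the idea in the question (only part of the file lives in primary memory at any time), here is a minimal sketch of page-based reads with a small least-recently-used cache. The page size, the 16-byte edge record layout, and the names PagedFile and write_edges are all assumptions made for this example; real databases layer indexes (often B-trees), write-ahead logging and much more on top of something like this.

```python
import os
import struct
import tempfile
from collections import OrderedDict

PAGE_SIZE = 4096                    # bytes per page (assumed for this sketch)
RECORD = struct.Struct("<QQ")       # one edge: (source id, target id), 16 bytes
RECORDS_PER_PAGE = PAGE_SIZE // RECORD.size


def write_edges(path, edges):
    """Write edges back to back as fixed-size records."""
    with open(path, "wb") as handle:
        for source, target in edges:
            handle.write(RECORD.pack(source, target))


class PagedFile:
    """Reads fixed-size pages from disk and keeps only a few in memory (LRU)."""

    def __init__(self, path, cache_pages=8):
        self._file = open(path, "rb")
        self._cache = OrderedDict()     # page number -> page bytes
        self._cache_pages = cache_pages
        self.disk_reads = 0             # counts cache misses

    def page(self, number):
        if number in self._cache:
            self._cache.move_to_end(number)     # mark as most recently used
            return self._cache[number]
        self._file.seek(number * PAGE_SIZE)
        data = self._file.read(PAGE_SIZE)
        self.disk_reads += 1
        self._cache[number] = data
        if len(self._cache) > self._cache_pages:
            self._cache.popitem(last=False)     # evict least recently used page
        return data

    def edge(self, index):
        """Return edge number `index` without loading the rest of the file."""
        page_number, slot = divmod(index, RECORDS_PER_PAGE)
        return RECORD.unpack_from(self.page(page_number), slot * RECORD.size)

    def close(self):
        self._file.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "edges.bin")
    write_edges(path, [(i, i + 1) for i in range(10_000)])
    db = PagedFile(path, cache_pages=2)
    print(db.edge(0), db.edge(9_999), db.disk_reads)
    db.close()
```

    Because every record has the same size, the position of any edge can be computed rather than searched for. An index would play the same role for lookups by key: it maps a key to the page that holds it, so only a handful of pages need to be read.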

    Read the article

  • Big Data – Buzz Words: What is NewSQL – Day 10 of 21

    - by Pinal Dave
    In yesterday’s blog post we learned the importance of the relational database. In this article we will take a quick look at what NewSQL is.
    What is NewSQL?
    NewSQL stands for new, scalable, high-performance SQL database vendors. The products sold by NewSQL vendors are horizontally scalable. NewSQL is not a kind of database; it refers to vendors who support emerging data products that combine relational database properties (such as ACID and transactions) with high performance. Products from NewSQL vendors usually keep data in memory for speedy access and offer immediate scalability. The term NewSQL was coined by 451 Group analyst Matthew Aslett in this particular blog post. On the definition of NewSQL, Aslett writes: “NewSQL” is our shorthand for the various new scalable/high performance SQL database vendors. We have previously referred to these products as ‘ScalableSQL‘ to differentiate them from the incumbent relational database products. Since this implies horizontal scalability, which is not necessarily a feature of all the products, we adopted the term ‘NewSQL’ in the new report. And to clarify, like NoSQL, NewSQL is not to be taken too literally: the new thing about the NewSQL vendors is the vendor, not the SQL. In other words, NewSQL incorporates the concepts and principles of Structured Query Language (SQL) and NoSQL languages. It combines the reliability of SQL with the speed and performance of NoSQL.
    Categories of NewSQL
    There are three major categories of NewSQL:
    New Architecture – In this framework each node owns a subset of the data, and queries are split into smaller queries that are sent to the nodes to process the data. E.g. NuoDB, Clustrix, VoltDB
    MySQL Engines – Highly optimized storage engines for SQL with the MySQL interface are examples of this category. E.g. InnoDB, Akiban
    Transparent Sharding – These systems automatically split the database across multiple nodes. E.g. Scalearc
    Summary
    In simple words, NewSQL is a kind of database that follows relational database principles and provides scalability like NoSQL.
    Tomorrow
    In tomorrow’s blog post we will discuss the role of Cloud Computing in Big Data.
    Reference: Pinal Dave (http://blog.sqlauthority.com)
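    To make the "each node owns a subset of the data" idea concrete, here is a toy scatter-gather sketch. It is not how any of the named products work internally; the class ShardedTable, the CRC-based placement and the in-process "nodes" are all assumptions chosen to keep the example small.

```python
import zlib


class ShardedTable:
    """Toy scatter-gather: each 'node' owns the rows whose key hashes to it."""

    def __init__(self, node_count):
        # Each node is just a dict here; in a real system it would be a server.
        self.nodes = [dict() for _ in range(node_count)]

    def _owner(self, key):
        # A stable hash, so the same key always lands on the same node.
        return zlib.crc32(str(key).encode("utf-8")) % len(self.nodes)

    def insert(self, key, row):
        self.nodes[self._owner(key)][key] = row

    def get(self, key):
        # A single-key lookup is routed to exactly one node.
        return self.nodes[self._owner(key)].get(key)

    def total(self, column):
        # Scatter: every node sums only the rows it owns.
        partial_sums = [sum(row[column] for row in node.values()) for node in self.nodes]
        # Gather: the coordinator combines the partial results.
        return sum(partial_sums)


if __name__ == "__main__":
    table = ShardedTable(node_count=3)
    for order_id in range(100):
        table.insert(order_id, {"amount": order_id})
    print(table.get(7), table.total("amount"))
```

    The aggregate is computed as smaller per-node queries whose results are merged, which is the pattern the "New Architecture" category describes.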

    Read the article

  • What Counts For A DBA: Foresight

    - by drsql
    Of all the valuable attributes of a DBA covered so far in this series, ranging from passion to humility to practicality, one of the most important may turn out to be the most seemingly nebulous: foresight. According to the Free Dictionary, foresight is the "perception of the significance and nature of events before they have occurred". Foresight does not come naturally to most people, as the parent of any teenager will attest. No matter how clearly you see their problems coming they won't listen, and have to fail before eventually (hopefully) learning for themselves. Having graduated from the school of hard knocks, the DBA, the naive teenager no longer, acquires the ability to foretell how events will unfold in response to certain actions or attitudes with the unerring accuracy of a doom-laden prophet. Like Simba in the Lion King, after a few blows to the head, we foretell that a sore head will be the inevitable consequence of a swing of Rafiki's stick, and we take evasive action. However, foresight is about more than simply learning when to duck. It's about taking the time to understand and prevent the habits that caused the stick to swing in the first place. And based on this definition, I often think there is a lot less foresight on display in my industry than there ought to be. Most DBAs reading this blog will spot a line such as the following in a piece of "working" code, understand immediately why it is less than optimal, and take evasive action. …WHERE CAST (columnName as int) = 1 However, the programmers who regularly write this sort of code clearly lack that foresight, and this and numerous other examples of similarly malodorous code prevail throughout our industry (and provide premium-grade fertilizer for the healthy growth of many a consultant's bank account). Sometimes, perhaps harried by impatient managers and painfully tight deadlines, everyone makes mistakes.
Yes, I too occasionally write code that "works", but basically stinks. When the problems manifest, it is sometimes accompanied by a sense of grim recognition that somewhere in me existed the foresight to know that that approach would lead to this problem. However, in the headlong rush, warning signs got overlooked, lessons learned previously, which could supply the foresight to the current project, were lost and not applied.   Of course, the problem often is a simple lack of skills, training and knowledge in the relevant technology and/or business space; programmers and DBAs forced to do their best in the face of inadequate training, or to apply their skills in areas where they lack experience. However, often the problem goes deeper than this; I detect in some DBAs and programmers a certain laziness of attitude.   They veer from one project to the next, going with "whatever works", unwilling or unable to take the time to understand where their actions are leading them. Of course, the whole "Agile" mindset is often interpreted to favor flexibility and rapid production over aiming to get things right the first time. The faster you try to travel in the dark, frequently changing direction, the more important it is to have someone who has the foresight to know at least roughly where you are heading. This is doubly true for the data tier which, no matter how you try to deny it, simply cannot be "redone" every month as you learn aspects of the world you are trying to model that, with a little bit of foresight, you would have seen coming.   Sometimes, when as a DBA you can glance briefly at 200 lines of working SQL code and know instinctively why it will cause problems, foresight can feel like magic, but it isn't; it's more like muscle memory. It is acquired as the consequence of good experience, useful communication with those around you, and a willingness to learn continually, through continued education as well as from failure. 
Foresight can be deployed only by finding time to understand how the lessons learned from other DBAs, and other projects, can help steer the current project in the right direction.   C.S. Lewis once said "The future is something which everyone reaches at the rate of sixty minutes an hour, whatever he does, whoever he is." It cannot be avoided; the quality of what you build now is going to affect you, and others, at some point in the future. Take the time to acquire foresight; it is a love letter to your future self, to say you cared.
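    For readers wondering why the CAST line mentioned above is a problem: wrapping a column in a function or cast generally stops the optimizer from seeking on an index over that column. The sketch below shows the effect using SQLite from Python as a stand-in, since it runs anywhere; the article itself is about SQL Server, whose plans look different, and the table and index names here are invented for the example.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("CREATE INDEX idx_orders_status ON orders (status)")

# The predicate is on an expression over the column, so the index cannot be
# used to jump straight to the matching rows: every row has to be examined.
wrapped = query_plan(conn, "SELECT id FROM orders WHERE CAST(status AS INTEGER) = 1")

# The predicate is on the bare column, so the index can be searched directly.
direct = query_plan(conn, "SELECT id FROM orders WHERE status = '1'")

if __name__ == "__main__":
    print("wrapped:", wrapped)
    print("direct: ", direct)
```

    In SQLite the first plan is reported as a SCAN and the second as a SEARCH using idx_orders_status; the exact wording varies between SQLite versions.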

    Read the article

  • What scalability problems have you solved using a NoSQL data store?

    - by knorv
    NoSQL refers to non-relational data stores that break with the history of relational databases and ACID guarantees. Popular open source NoSQL data stores include: Cassandra (tabular, written in Java, used by Facebook, Twitter, Digg, Rackspace, Mahalo and Reddit) CouchDB (document, written in Erlang, used by Engine Yard and BBC) Dynomite (key-value, written in C++, used by Powerset) HBase (key-value, written in Java, used by Bing) Hypertable (tabular, written in C++, used by Baidu) Kai (key-value, written in Erlang) MemcacheDB (key-value, written in C, used by Reddit) MongoDB (document, written in C++, used by Sourceforge, Github, Electronic Arts and NY Times) Neo4j (graph, written in Java, used by Swedish Universities) Project Voldemort (key-value, written in Java, used by LinkedIn) Redis (key-value, written in C, used by Engine Yard, Github and Craigslist) Riak (key-value, written in Erlang, used by Comcast and Mochi Media) Ringo (key-value, written in Erlang, used by Nokia) Scalaris (key-value, written in Erlang, used by OnScale) ThruDB (document, written in C++, used by JunkDepot.com) Tokyo Cabinet/Tokyo Tyrant (key-value, written in C, used by Mixi.jp (Japanese social networking site)) I'd like to know about specific problems you - the SO reader - have solved using data stores and what NoSQL data store you used. Questions: What scalability problems have you used NoSQL data stores to solve? What NoSQL data store did you use? What database did you use before switching to a NoSQL data store? I'm looking for first-hand experiences, so please do not answer unless you have that.

    Read the article

  • Where are tables in Mnesia located?

    - by Sanoj
    I am trying to compare Mnesia with more traditional databases. As I understand it, tables in Mnesia can be located as follows:
    ram_copies - tables are stored in RAM only, so no durability as in ACID.
    disc_copies - tables are located on disc and a copy is located in RAM, so the table cannot be bigger than the available memory?
    disc_only_copies - tables are located on disc only, so no caching in memory and worse performance? And the size of the table is limited to the size of dets, or the table has to be fragmented.
    So if I want the performance of doing reads from RAM and the durability of writes to disc, then the size of the tables is very limited compared to a traditional RDBMS like MySQL or PostgreSQL. I know that Mnesia isn't meant to replace traditional RDBMSs, but can it be used as a big RDBMS or do I have to look for another database? The server I will use is a VPS with a limited amount of memory, around 512MB, but I want good database performance. Are disc_copies and the other types of tables in Mnesia as limited as I have understood?

    Read the article

  • Pitfalls and practical Use-Cases: Toplink, Hibernate, Eclipse Link, Ibatis ...

    - by Martin K.
    I have worked a lot with Hibernate as my JPA implementation. In most cases it works fine! But I have also seen a lot of pitfalls:
    - Remoting with persisted objects is difficult, because Hibernate replaces the Java collections with its own collection implementation, so every client must have the Hibernate .jar libraries. You have to take care of LazyLoading exceptions etc. One way to get around this problem is the use of web services.
    - Dirty checking is done against the database without any lock.
    - "Delayed SQL" means that the data access isn't ACID compliant. (Lost data...)
    - Implicit updates, so we don't know if an object is modified or not (commit causes updates).
    Are there similar issues with Toplink, Eclipse Link and Ibatis? When should I use them? Do they have similar performance? Are there reasons to choose Eclipse Link/Toplink... over Hibernate?

    Read the article

  • insert data to table based on another table C#

    - by user1017315
    I wrote some code that takes values from one table and inserts them, together with some other values, into another table, and I get this error: System.Data.OleDb.OleDbException (0x80040E10): value wasn't given for one or more of the required parameters. Here's the code. I don't know what I've missed.
        string selectedItem = comboBox1.SelectedItem.ToString();
        Codons cdn = new Codons(selectedItem);
        string codon1;
        int index;
        if (this.i != this.counter)
        {
            // take from the database the codonsCodon1 matching codonsFullName
            codon1 = cdn.GetCodon1();
            // take the serial number of the last protein
            string connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;" + "Data Source=C:\\Projects_2012\\Project_Noam\\Access\\myProject.accdb";
            OleDbConnection conn = new OleDbConnection(connectionString);
            conn.Open();
            string last = "SELECT proInfoSerialNum FROM tblProInfo WHERE proInfoScienceName = " + this.name;
            OleDbCommand getSerial = new OleDbCommand(last, conn);
            OleDbDataReader dr = getSerial.ExecuteReader();
            dr.Read();
            index = dr.GetInt32(0);
            // add the amino acid to tblOrderAA
            using (OleDbConnection connection = new OleDbConnection(connectionString))
            {
                string insertCommand = "INSERT INTO tblOrderAA(orderAASerialPro, orderAACodon1) " + " values (?, ?)";
                using (OleDbCommand command = new OleDbCommand(insertCommand, connection))
                {
                    connection.Open();
                    command.Parameters.AddWithValue("orderAASerialPro", index);
                    command.Parameters.AddWithValue("orderAACodon1", codon1);
                    command.ExecuteNonQuery();
                }
            }
        }
    EDIT: I put a message box after the line index = dr.GetInt32(0); to see where the problem is, and I get the error before that. I don't see the message box.
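    A likely cause, given that the error appears before the message box: the SELECT is built by concatenating this.name into the SQL text without quotes, so the engine reads the value as an identifier (Access treats an unknown identifier as a parameter that was never supplied). Passing the value as a parameter, as the INSERT already does, avoids this. The sketch below reproduces the same mistake and the fix using Python's sqlite3 module as a stand-in for OLE DB; the table and column names come from the question, while the sample row and the function names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblProInfo (proInfoSerialNum INTEGER, proInfoScienceName TEXT)")
conn.execute("INSERT INTO tblProInfo VALUES (?, ?)", (7, "insulin"))  # made-up sample row


def serial_by_concatenation(name):
    # The value ends up in the SQL text unquoted, so the engine parses it as
    # an identifier instead of a string. SQLite reports "no such column".
    sql = "SELECT proInfoSerialNum FROM tblProInfo WHERE proInfoScienceName = " + name
    return conn.execute(sql).fetchone()


def serial_by_parameter(name):
    # The placeholder keeps the value out of the SQL text entirely.
    sql = "SELECT proInfoSerialNum FROM tblProInfo WHERE proInfoScienceName = ?"
    return conn.execute(sql, (name,)).fetchone()


if __name__ == "__main__":
    print(serial_by_parameter("insulin"))
    try:
        serial_by_concatenation("insulin")
    except sqlite3.OperationalError as error:
        print("concatenated query failed:", error)
```

    In the C# code the equivalent change would be a ? placeholder in the SELECT plus a matching parameter added to getSerial before ExecuteReader is called.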

    Read the article

  • Create Chemistry Equations and Diagrams in Word

    - by Matthew Guay
    Microsoft Word is a great tool for formatting text, but what if you want to insert a chemistry formula or diagram?  Thanks to a new free add-in for Word, you can now insert high-quality chemistry formulas and diagrams directly from the Ribbon in Word. Microsoft’s new Education Labs has recently released the new Chemistry Add-in for Word 2007 and 2010.  This free download adds support for entering and editing chemistry symbols, diagrams, and formulas using the standard XML based Chemical Markup Language.  You can convert any chemical name, such as benzene, or formula, such as H2O, into a chemical diagram, standard name, or formula.  Whether you’re a professional chemist, just taking chemistry in school, or simply curious about the makeup of Citric Acid, this add-in is an exciting way to bring chemistry to your computer. This add-in works great on Word 2007 and 2010, including the 64 bit version of Word 2010.  Please note that the current version is still in beta, so only run it if you are comfortable running beta products. Getting Started Download the Chemistry add-in from Microsoft Education Labs (link below), and unzip the file.  Then, run the ChemistryAddinforWordBeta2.Setup.msi. It may inform you that you need to install the Visual Studio Tools for Office 3.0.  Simply click Yes to download these tools. This will open the download in your default browser.  Simply click run, or save and then run it when it is downloaded. Now, click next to install the Visual Studio Tools for Office as usual. When this is finished, run the ChemistryAddinforWordBeta2.Setup.msi again.  This time, you can easily install it with the default options. Once it’s finished installing, open Word to try out the Chemistry Add-in.  You will be asked if you want to install this customization, so click Install to enable it. Now you will have a new Chemistry tab in your Word ribbon.  Here’s the ribbon in Word 2010… And here it is in Word 2007.   
Using the Chemistry Add-in It’s very easy to insert nice chemistry diagrams and formulas in Word with the Chemistry add-in.  You can quickly insert a premade diagram from the Chemistry Gallery, or you can insert a formula from a file.  Simply click “From File” and choose any Chemical Markup Language (.cml) formatted file to insert the chemical formula. You can also convert any chemical name to its chemical form.  Simply select the word, right-click, select “Convert to Chemistry Zone” and then click on its name. Now you can see the chemical form in the sidebar if you click the Chemistry Navigator button, and can choose to insert the diagram into the document.  Some chemicals will automatically convert to the diagram in the document, while others simply link to it in the sidebar.  Either way, you can display exactly what you want. You can also convert a chemical formula directly to its chemical diagram.  For example, entering H2O and converting it to a Chemistry Zone turns it into the diagram directly in the document. You can click the Edit button on the top, and from there choose to either edit the 2D model of the chemical, or edit the labels. When you click Edit Labels, you may be asked which form you wish to display; potassium permanganate, for example, offers several options. You can then edit the names and formulas, and add or remove any you wish. If you choose to edit the chemical in 2D, you can even edit the individual atoms and change the chemical you’re diagramming.  This 2D editor has a lot of options, so you can get your chemical diagram to look just like you want. And, if you need any help or want to learn more about the Chemistry add-in and its features, simply click the help button in the Chemistry Ribbon.  This will open a Word document containing examples and explanations which can be helpful in mastering all the features of this add-in. All of this works perfectly, whether you’re running it in Word 2007 or 2010, 32 or 64 bit editions.
Conclusion Whether you’re using chemistry formulas every day or simply want to investigate a chemical makeup occasionally, this is a great way to do it with tools you already have on your computer.  It will also help make homework a bit easier if you’re struggling with it in high school or college. Links: Download the Chemistry Add-in for Word; Introducing Chemistry Add-in for Word – MSDN blogs; Chemistry Markup Language – Wikipedia

    Read the article

  • SQL – Quick Start with Explorer Sections of NuoDB – Query NuoDB Database

    - by Pinal Dave
    This is the third post in the series of blog posts I am writing about NuoDB. NuoDB is a very innovative and easy-to-use product. I can clearly see how one can scale out NuoDB with ease and confidence. In my very first blog post we discussed how we can install NuoDB (link), and in my second post I discussed how we can manage the NuoDB database transaction engines and storage managers with a few clicks (link). Note: You can download NuoDB from here. In this post, we will learn how we can use the Explorer feature of NuoDB to do various SQL operations. NuoDB has a browser-based Explorer, which is very powerful and has many of the features any IDE would normally have. Let us see how it works in the following step-by-step tutorial. Let us go to the NuoDB Console by typing the following URL in your browser: http://localhost:8080/ It will bring you to the QuickStart screen. Make sure that you have created the sample database. If you have not created the sample database, click on Create Database and create it. Now go to the NuoDB Explorer by clicking on the main tab, and it will ask you for your domain username and password. Enter domain as the username and bird as the password. Alternatively, you can enter quickstart as the username and quickstart as the password. Once you enter the password you will be able to see the databases. In our example we have installed the sample database, hence you will see the Test database in the Database Hierarchy screen. When you click on the database it will ask for the database login. Note that the database login is different from the domain login, and you will have to enter your database login here. In our case the database username is dba and the password is goalie. Once you enter a valid username and password it will display your database. Expand your database further and you will notice the various objects in it. Once you have explored the various objects, select any database and click on Open.
When you click on execute, it will display the SQL script to select the data from the table. The autogenerated script displays entire result set from the database. The NuoDB Explorer is very powerful and makes the life of developers very easy. If you click on List SQL Statements it will list all the available SQL statements right away in Query Editor. You can see the popup window in following image. Here is the cool thing for geeks. You can even click on Query Plan and it will display the text based query plan as well. In case of a SELECT, the query plan will be much simpler, however, when we write complex queries it will be very interesting. We can use the query plan tab for performance tuning of the database. Here is another feature, when we click on List Tables in NuoDB Explorer.  It lists all the available tables in the query editor. This is very helpful when we are writing a long complex query. Here is a relatively complex example I have built using Inner Join syntax. Right below I have displayed the Query Plan. The query plan displays all the little details related to the query. Well, we just wrote multi-table query and executed it against the NuoDB database. You can use the NuoDB Admin section and do various analyses of the query and its performance. NuoDB is a distributed database built on a patented emergent architecture with full support for SQL and ACID guarantees.  It allows you to add Transaction Engine processes to a running system to improve the performance of your system.  You can also add a second Storage Engine to your running system for redundancy purposes.  Conversely, you can shut down processes when you don’t need the extra database resources. NuoDB also provides developers and administrators with a single intuitive interface for centrally monitoring deployments. If you have read my blog posts and have not tried out NuoDB, I strongly suggest that you download it today and catch up with the learnings with me. 
Trust me: though the product is very powerful, it is extremely easy to learn and use. Reference: Pinal Dave (http://blog.sqlauthority.com)

    Read the article

  • Big Data – Operational Databases Supporting Big Data – RDBMS and NoSQL – Day 12 of 21

    - by Pinal Dave
    In yesterday’s blog post we learned the importance of the Cloud in the Big Data story. In this article we will look at the role of operational databases in supporting the Big Data story. Even though we keep talking about Big Data architecture, it is crucial to understand that a Big Data system can’t exist in isolation. Many business needs can only be fulfilled with the help of operational databases. Just having a system which can analyze big data may not solve every single data problem. Real World Example Think about it this way: you are using Facebook and you have just updated your relationship status. In the next few seconds the same information is also reflected in the timeline of your partner as well as a few of your immediate friends. After a while you will notice that the same information is now also available to your remote friends. Later on, when someone searches for all the relationship changes among their friends, your change of relationship will also show up in the same list. Now here is the question: do you think Big Data architecture is making every single one of these changes? Do you think that the immediate reflection of your relationship change with your family members is also because of the technology used in Big Data? Actually, the answer is that Facebook uses MySQL to do various updates in the timeline as well as various events we do on their homepage. It is really difficult to part from operational databases in any real-world business. Now we will see a few examples of operational databases.
Relational Databases (This blog post); NoSQL Databases (This blog post); Key-Value Pair Databases (Tomorrow’s post); Document Databases (Tomorrow’s post); Columnar Databases (The Day After’s post); Graph Databases (The Day After’s post); Spatial Databases (The Day After’s post). Relational Databases We have earlier discussed the role of the RDBMS in the Big Data story in detail, so we will not cover it extensively here. Relational databases are pretty much everywhere in businesses that have been around for many years. Relational databases will remain important for as long as there is meaningful structured data around. There are many different relational databases, for example Oracle, SQL Server, MySQL and many others. If you are looking for an open source and widely accepted database, I suggest trying MySQL, as it has been very popular in the last few years. I also suggest you try out PostgreSQL. Besides many other essential qualities, PostgreSQL has a very interesting licensing policy. The PostgreSQL license allows modification and distribution of the application in open or closed (source) form. One can make any modifications and keep them private, as well as contribute them to the community. I believe this one quality makes it much more interesting to use, and it will play a very important role in the future. Nonrelational Databases (NoSQL) We have also covered nonrelational databases in earlier blog posts. NoSQL actually stands for Not Only SQL. There are plenty of NoSQL databases on the market, and selecting the right one is always very challenging. Here are a few of the properties which are essential to consider when selecting the right NoSQL database for operational purposes. 
Data and Query Model; Persistence of Data and Design; Eventual Consistency; Scalability. Though all of the above properties are valuable in any NoSQL database, the one which attracts me most is Eventual Consistency. Eventual Consistency An RDBMS uses ACID (Atomicity, Consistency, Isolation, Durability) as a key mechanism for ensuring data consistency, whereas a nonrelational DBMS uses BASE for the same purpose. BASE stands for Basically Available, Soft state and Eventual consistency. Eventual consistency is widely deployed in distributed systems. It is a consistency model used in distributed computing which expects the unexpected often. In a large distributed system built on commodity servers, nodes are constantly joining and being removed. This happens either intentionally or accidentally. Even though one or more nodes are down, the entire system is expected to keep functioning normally. Applications should be able to update and retrieve data successfully without any issue. Additionally, this means that the system is expected, in time, to return the same updated data from all the functioning nodes. Irrespective of when a node joins the system, if it is marked to hold some data it should eventually contain the same updated data. As per Wikipedia - Eventual consistency is a consistency model used in distributed computing that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. In other words -  Informally, if no additional updates are made to a given data item, all reads to that item will eventually return the same value. Tomorrow In tomorrow’s blog post we will discuss various other operational databases supporting Big Data. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL
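
    The eventual-consistency behaviour described in the post above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the Replica class, the version numbers, the last-write-wins merge rule and the all-pairs sync round are hypothetical choices made for illustration, not how any particular NoSQL product replicates data. It shows replicas that disagree right after two writes and agree once no new updates arrive and state has been exchanged.

```python
class Replica:
    """One node holding its own copy of the data (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

    def write(self, key, value, version):
        # Assumed conflict rule: last write wins, i.e. the higher version is kept.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

    def merge_from(self, other):
        # Pull every entry from another replica, keeping the newer versions.
        for key, (version, value) in other.data.items():
            self.write(key, value, version)


def sync_round(replicas):
    """Each replica pulls state from every other replica once."""
    for target in replicas:
        for source in replicas:
            if target is not source:
                target.merge_from(source)


replicas = [Replica(name) for name in ("n1", "n2", "n3")]

# Two updates land on different nodes; the third node has seen nothing yet.
replicas[0].write("status", "single", version=1)
replicas[1].write("status", "in a relationship", version=2)
stale = [r.read("status") for r in replicas]

# No further updates are made; after state is exchanged, all reads agree.
sync_round(replicas)
converged = [r.read("status") for r in replicas]

print(stale)
print(converged)
```

    Before the sync round, each node answers from its own copy, so readers can see different values. After it, every node returns the value with the highest version, which matches the informal guarantee quoted from Wikipedia.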

    Read the article

  • MySQL Connect Only 10 Days Away - Focus on InnoDB Sessions

    - by Bertrand Matthelié
    Time flies and MySQL Connect is only 10 days away! You can check out the full program here as well as in the September edition of the MySQL newsletter. Mat recently blogged about the MySQL Cluster sessions you’ll have the opportunity to attend, and below are those focused on InnoDB. Remember you can plan your schedule with Schedule Builder. Saturday, 1.00 pm, Room Golden Gate 3: 10 Things You Should Know About InnoDB—Calvin Sun, Oracle InnoDB is the default storage engine for Oracle’s MySQL as of MySQL Release 5.5. It provides the standard ACID-compliant transactions, row-level locking, multiversion concurrency control, and referential integrity. InnoDB also implements several innovative technologies to improve its performance and reliability. This presentation gives a brief history of InnoDB; its main features; and some recent enhancements for better performance, scalability, and availability. Saturday, 5.30 pm, Room Golden Gate 4: Demystified MySQL/InnoDB Performance Tuning—Dimitri Kravtchuk, Oracle This session covers performance tuning with MySQL and the InnoDB storage engine for MySQL and explains the main improvements made in MySQL Release 5.5 and Release 5.6. Which setting for which workload? Which value will be better for my system? How can I avoid potential bottlenecks from the beginning? Do I need a purge thread? Is it true that InnoDB doesn't need thread concurrency anymore? These and many other questions are asked by DBAs and developers. Things are changing quickly and constantly, and there is no “silver bullet.” But understanding the configuration setting’s impact is already a huge step in performance improvement. Bring your ideas and problems to share them with others—the discussion is open, just moderated by a speaker. Sunday, 10.15 am, Room Golden Gate 4: Better Availability with InnoDB Online Operations—Calvin Sun, Oracle Many top Web properties rely on Oracle’s MySQL as a critical piece of infrastructure for serving millions of users. 
Database availability has become increasingly important. One way to enhance availability is to give users full access to the database during data definition language (DDL) operations. The online DDL operations in recent MySQL releases offer users the flexibility to perform schema changes while having full access to the database—that is, with minimal delay of operations on a table and without rebuilding the entire table. These enhancements provide better responsiveness and availability in busy production environments. This session covers these improvements in the InnoDB storage engine for MySQL for online DDL operations such as add index, drop foreign key, and rename column. Sunday, 11.45 am, Room Golden Gate 7: Developing High-Throughput Services with NoSQL APIs to InnoDB and MySQL Cluster—Andrew Morgan and John Duncan, Oracle Ever-increasing performance demands of Web-based services have generated significant interest in providing NoSQL access methods to MySQL (MySQL Cluster and the InnoDB storage engine of MySQL), enabling users to maintain all the advantages of their existing relational databases while providing blazing-fast performance for simple queries. Get the best of both worlds: persistence; consistency; rich SQL queries; high availability; scalability; and simple, flexible APIs and schemas for agile development. This session describes the memcached connectors and examines some use cases for how MySQL and memcached fit together in application architectures. It does the same for the newest MySQL Cluster native connector, an easy-to-use, fully asynchronous connector for Node.js. Sunday, 1.15 pm, Room Golden Gate 4: InnoDB Performance Tuning—Inaam Rana, Oracle The InnoDB storage engine has always been highly efficient and includes many unique architectural elements to ensure high performance and scalability. 
In MySQL 5.5 and MySQL 5.6, InnoDB includes many new features that take better advantage of recent advances in operating systems and hardware platforms than previous releases did. This session describes unique InnoDB architectural elements for performance, new features, and how to tune InnoDB to achieve better performance. Sunday, 4.15 pm, Room Golden Gate 3: InnoDB Compression for OLTP—Nizameddin Ordulu, Facebook and Inaam Rana, Oracle Data compression is an important capability of the InnoDB storage engine for Oracle’s MySQL. Compressed tables reduce the size of the database on disk, resulting in fewer reads and writes and better throughput by reducing the I/O workload. Facebook pushes the limit of InnoDB compression and has made several enhancements to InnoDB, making this technology ready for online transaction processing (OLTP). In this session, you will learn the fundamentals of InnoDB compression. You will also learn the enhancements the Facebook team has made to improve InnoDB compression, such as reducing compression failures, not logging compressed page images, and allowing changes of compression level. Not registered yet? You can still save US$ 300 over the on-site fee – Register Now!

    Read the article

  • NoSQL Java API for MySQL Cluster: Questions & Answers

    - by Mat Keep
    The MySQL Cluster engineering team recently ran a live webinar, now available on demand, demonstrating the ClusterJ and ClusterJPA NoSQL APIs for MySQL Cluster, and how these can be used in building real-time, high-scale Java-based services that require continuous availability. Attendees asked a number of great questions during the webinar, and I thought it would be useful to share those here, so others are also able to learn more about the Java NoSQL APIs. First, a little bit about why we developed these APIs and why they are interesting to Java developers. ClusterJ and ClusterJPA ClusterJ is a Java interface to MySQL Cluster that provides either a static or dynamic domain object model, similar to the data model used by JDO, JPA, and Hibernate. A simple API gives users extremely high performance for common operations: insert, delete, update, and query. ClusterJPA works with ClusterJ to extend functionality, including - Persistent classes - Relationships - Joins in queries - Lazy loading - Table and index creation from object model By eliminating data transformations via SQL, users get lower data access latency and higher throughput. In addition, Java developers have a more natural programming method to directly manage their data, with a complete, feature-rich solution for Object/Relational Mapping. As a result, the development of Java applications is simplified with faster development cycles resulting in accelerated time to market for new services. MySQL Cluster offers multiple NoSQL APIs alongside Java: - Memcached for a persistent, high performance, write-scalable Key/Value store, - HTTP/REST via an Apache module - C++ via the NDB API for the lowest absolute latency. 
Developers can use SQL as well as NoSQL APIs for access to the same data set via multiple query patterns – from simple Primary Key lookups or inserts to complex cross-shard JOINs using Adaptive Query Localization. Marrying NoSQL and SQL access to an ACID-compliant database offers developers a number of benefits. MySQL Cluster’s distributed, shared-nothing architecture with auto-sharding and real time performance makes it a great fit for workloads requiring high volume OLTP. Users also get the added flexibility of being able to run real-time analytics across the same OLTP data set for real-time business insight. OK – hopefully you now have a better idea of why ClusterJ and JPA are available. Now, for the Q&A. Q & A Q. Why would I use Connector/J vs. ClusterJ? A. Partly it's a question of whether you prefer to work with SQL (Connector/J) or objects (ClusterJ). Performance of ClusterJ will be better as there is no need to pass through the MySQL Server. A ClusterJ operation can only act on a single table (e.g. no joins) - ClusterJPA extends that capability. Q. Can I mix different APIs (i.e. ClusterJ, Connector/J) in our application for different query types? A. Yes. You can mix and match all of the API types: SQL, JDBC, ODBC, ClusterJ, Memcached, REST, C++. They all access the exact same data in the data nodes. Update through one API and new data is instantly visible to all of the others. Q. How many TCP connections would a SessionFactory instance create for a cluster of 8 data nodes? A. SessionFactory has a connection to the mgmd (management node) but otherwise is just a vehicle to create Sessions. Without using connection pooling, a SessionFactory will have one connection open with each data node. Using optional connection pooling allows multiple connections from the SessionFactory to increase throughput. Q. Can you give details of how ClusterJ optimizes sharding to enhance performance of distributed query processing? A. 
Each data node in a cluster runs a Transaction Coordinator (TC), which begins and ends the transaction, but also serves as a resource to operate on the result rows. While an API node (such as a ClusterJ process) can send queries to any TC/data node, there are performance gains if the TC is where most of the result data is stored. ClusterJ computes the shard (partition) key to choose the data node where the row resides as the TC. Q. What happens if we perform two primary key lookups within the same transaction? Are they sent to the data node in one transaction? A. ClusterJ will send identical PK lookups to the same data node. Q. How is distributed query processing handled by MySQL Cluster? A. If the data is split between data nodes then all of the information will be transparently combined and passed back to the application. The session will connect to a data node - typically by hashing the primary key - which then interacts with its neighboring nodes to collect the data needed to fulfil the query. Q. Can I use Foreign Keys with MySQL Cluster? A. Support for Foreign Keys is included in the MySQL Cluster 7.3 Early Access release. Summary The NoSQL Java APIs are packaged with MySQL Cluster, available for download here so feel free to take them for a spin today! Key Resources MySQL Cluster on-line demo  MySQL ClusterJ and JPA On-demand webinar  MySQL ClusterJ and JPA documentation MySQL ClusterJ and JPA whitepaper and tutorial
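
    The Q&A answers above about choosing a data node by hashing the primary (partition) key can be illustrated with a short sketch. Everything in it is an assumption made for illustration: the pick_data_node function, the node names and the use of SHA-256 are hypothetical and are not the hashing scheme MySQL Cluster or ClusterJ actually uses. The point is only that a deterministic hash routes identical keys to the same node, so that node can act as the Transaction Coordinator for the row it stores.

```python
import hashlib


def pick_data_node(partition_key, data_nodes):
    """Map a partition key to one node; equal keys always map to the same node."""
    # Hypothetical hash choice: SHA-256 is used only because it is stable
    # across runs, not because MySQL Cluster uses it.
    digest = hashlib.sha256(str(partition_key).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(data_nodes)
    return data_nodes[index]


# Hypothetical node names for a cluster of 8 data nodes.
DATA_NODES = ["ndbd-%d" % i for i in range(1, 9)]

first = pick_data_node("customer:1001", DATA_NODES)
second = pick_data_node("customer:1001", DATA_NODES)

# Identical primary-key lookups are routed to the same data node.
print(first == second)
```

    A modulo over the node list is the simplest possible placement rule; it would reshuffle most keys if the number of nodes changed, which is one reason real systems map keys to partitions first and partitions to nodes second.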

    Read the article

  • Source-control 'wet-work'?

    - by Phil Factor
    When a design or creative work is flawed beyond remedy, it is often best to destroy it and start again. The other day, I lost the code to a long and intricate SQL batch I was working on. I’d thought it was impossible, but it happened. With all the technology around that is designed to prevent this occurring, this sort of accident has become a rare event. If it weren’t for a deranged laptop, and my distraction, the code wouldn’t have been lost this time. As always, I sighed, had a soothing cup of tea, and typed it all in again. The new code I hastily tapped in was much better: I’d held in my head the essence of how the code should work rather than the details: I now knew for certain the start point, the end, and how it should be achieved. Instantly the detritus of half-baked thoughts fell away and I was able to write logical code that performed better. Because I could work so quickly, I was able to hold the details of all the columns and variables in my head, and the dynamics of the flow of data. It was, in fact, easier and quicker to start from scratch rather than tidy up and refactor the existing code with its inevitable fumbling and half-baked ideas. What a shame that technology is now so good that developers rarely experience the cleansing shock of losing one’s code and having to rewrite it from scratch. If you’ve never accidentally lost your code, then it is worth doing it deliberately once for the experience. Creative people have, until Technology mistakenly prevented it, torn up their drafts or sketches, thrown them in the bin, and started again from scratch. Leonardo’s obsessive reworking of the Mona Lisa was renowned because it was so unusual: most artists have been utterly ruthless in destroying work that didn’t quite make it. Authors are particularly keen on writing afresh, and the results are generally positive. 
Lawrence of Arabia actually lost the entire 250,000 word manuscript of ‘The Seven Pillars of Wisdom’ by accidentally leaving it on a train at Reading station, before rewriting a much better version. Now, any writer or artist is seduced by technology into altering or refining their work rather than casting it dramatically in the bin or setting a light to it on a bonfire, and rewriting it from the blank page. It is easy to pick away at a flawed work, but the real creative process is far more brutal. Once, many years ago whilst running a software house that supplied commercial software to local businesses, I’d been supervising an accounting system for a farming cooperative. No packaged system met their needs, and it was all hand-cut code. For us, it represented a breakthrough as it was for a government organisation, and success would guarantee more contracts. As you’ve probably guessed, the code got mangled in a disk crash just a week before the deadline for delivery, and the many backups all proved to be entirely corrupted by a faulty tape drive. There were some fragments left on individual machines, but they were all of different versions. The developers were in despair. Strangely, I managed to re-write the bulk of a three-month project in a manic and caffeine-soaked weekend. Sure, that elegant universally-applicable input-form routine wasn’t quite so elegant, but it didn’t really need to be as we knew what forms it needed to support. Yes, the code lacked architectural elegance and reusability. By dawn on Monday, the application passed its integration tests. The developers rose to the occasion after I’d collapsed, and tidied up what I’d done, though they were reproachful that some of the style and elegance had gone out of the application. By the delivery date, we were able to install it. 
It was a smaller, faster application than the beta they’d seen and the user-interface had a new, rather Spartan, appearance that we swore was done to conform to the latest in user-interface guidelines. (We switched to Helvetica font to look more ‘Bauhaus’.) The client was so delighted that he forgave the new bugs that had crept in. I still have the disk that crashed, up in the attic. In IT, we have had mixed experiences from complete re-writes. Lotus 1-2-3 never really recovered from a complete rewrite from assembler into C, Borland made the mistake with Arago and Quattro Pro, and Netscape’s complete rewrite of their Navigator 4 browser was a white-knuckle ride. In all cases, the decision to rewrite was a result of extreme circumstances where no other course of action seemed possible. The rewrite didn’t come out of the blue. I prefer to remember the rewrite of Minix by young Linus Torvalds, or the rewrite of Bitkeeper by a slightly older Linus. The rewrite of CP/M didn’t do too badly either, did it? Come to think of it, the guy who decided to rewrite the windowing system of the Xerox Star never regretted the decision. I’ll agree that one should often resist calls for a rewrite. One of the worst habits of the more inexperienced programmer is to denigrate whatever code he or she inherits, and then call loudly for a complete rewrite. They are buoyed up by the mistaken belief that they can do better. This, however, is a different psychological phenomenon, more related to the idea of some motorcyclists that they are operating on infinite lives, or the occasional squaddies that if they charge the machine-guns determinedly enough all will be well. Grim experience brings out the humility in any experienced programmer. I’m referring to quite different circumstances here. Where a team knows the requirements perfectly, is of one mind on methodology and coding standards, and already has a solution, then what is wrong with considering a complete rewrite? 
Rewrites are so painful in the early stages, until that point where one realises the payoff, that even I quail at the thought. One needs a natural disaster to push one over the edge. The trouble is that source-control systems, and disaster recovery systems, are just too good nowadays. If I were to lose this draft of this very blog post, I know I’d rewrite it much better. However, if you read this, you’ll know I didn’t have the nerve to delete it and start again. There was a time that one prayed that unreliable hardware would deliver you from an unmaintainable mess of a codebase, but now technology has made us almost entirely immune to such a merciful act of God. An old friend of mine with long experience in the software industry has long had the idea of the ‘source-control wet-work’, where one hires a malicious hacker in some wild eastern country to hack into one’s own source control system to destroy all trace of the source to an application. Alas, backup systems are just too good to make this any more than a pipedream. Somehow, it would be difficult to promote the idea. As an alternative, could one construct a source control system that, on doing all the code-quality metrics, would systematically destroy all trace of source code that failed the quality test? Alas, I can’t see many managers buying into the idea. Reading the full story of the near-loss of Toy Story 2 set me thinking. It turned out that the lucky restoration of the code wasn’t the happy ending one first imagined it to be, because they eventually came to the conclusion that the plot was fundamentally flawed and it all had to be rewritten anyway. Was this an early case of the ‘source-control wet-job’? It is very hard nowadays to do a rapid U-turn in a development project because we are far too prone to cling to our existing source-code.

    Read the article

  • Agile Testing Days 2012 – Day 3 – Agile or agile?

    - by Chris George
    Another early start for my last Lean Coffee of the conference, and again it was not wasted. We had some really interesting discussions around how to determine what test automation is useful, why do agile if it is not faster, and a rather existential discussion on whether unicorns exist! First keynote of the day was entitled “Fast Feedback Teams” by Ola Ellnestam. Again this relates nicely to the releasing faster talk on day 2, and something that we are looking at and some teams are actively trying. Introducing the notion of feedback, Ola described a game he wrote for his eldest child. It was a simple game where every time he clicked a button, it displayed “You’ve Won!”. He then changed it to be a Win-Lose-Win-Lose pattern and watched the feedback from his son who then twigged the pattern and got his younger brother to play, alternating turns… genius! (must do that with my children). The idea behind this was that you need that feedback loop to learn and progress. If you are not getting the feedback, you need to close that loop. An interesting point Ola made was to solve problems BEFORE writing software. It may be that you don’t have to write anything at all; perhaps it’s a communication/training issue? Perhaps the problem can be solved another way. Writing software, although it’s the business we are in, is expensive, and this should be taken into account. He again mentioned frequent releases, and how they should be made as soon as stuff is ready to be released; don’t leave stuff on the shelf because it’s not earning you anything, money or data. I totally agree with this and it’s something that we will be aiming for moving forwards. “Exceptions, Assumptions and Ambiguity: Finding the truth behind the story” by David Evans started off very promisingly by making references to ‘Grim up North’, referring to the north of England. Not sure it was appreciated by most of the audience, but it made me laugh! 
David explained how there are always risks associated with exceptions, giving the example of a one-way road near where he lives, with an exception sign giving rights to coaches to go the wrong way. Therefore you could merrily swing around the corner of the one way road straight into a coach! David showed the danger in making assumptions with lyrical quotes from Lola by The Kinks “I’m glad I’m a man, and so is Lola” and with a picture of a toilet flush that needed instructions to operate the full and half flush. With this particular flush, you pulled the handle all the way down to half flush, and half way down to full flush! hmmm, a bit of a crappy user experience methinks! Then through a clever use of a passage from the Jabberwocky, David went on to show how mis-translation/ambiguity can completely distort the original meaning of something, and this is a real enemy of software development. This was all helping to demonstrate that the term Story is often heavily overloaded in the Agile world, and should really be stripped back to what it is really for: stating a business problem, and offering a technical solution. Therefore a story could be worded as “In order to {make some improvement}, we will {do something}”. The first ‘in order to’ statement is stakeholder neutral, and states the problem through requesting an improvement to the software/process etc. The second part of the story is the verb, the doing bit. So to achieve the ‘improvement’ which is not currently true, we will do something to make this true in the future. My PM is very interested in this, and he’s observed some of the problems of overloading stories so I’m hoping between us we can use some of David’s suggestions to help clarify our stories better. The second keynote of the day (and our last) proved to be the most entertaining and exhausting of the conference for me. “The ongoing evolution of testing in agile development” by Scott Barber. 
I’ve never had the pleasure of seeing Scott before… OMG I would love to have even half of the energy he has! What struck me during this presentation was Scott’s explanation of how testing has become the role/job that it is (largely) today, and how this has led to the need for ‘methodologies’ to make dev and test work! The argument that we should be trying to converge the roles again is a very valid one, and one that a couple of the teams at work are actively doing with great results. Making developers as responsible for quality as testers is something that has been lost over the years, but something that we are now striving to achieve. The idea that we (testers) should be testing experts/specialists, not testing ‘union members’, supports this idea so the entire team works on all aspects of a feature/product, with the ‘specialists’ taking the lead and advising/coaching the others. This leads to better propagation of information around the team, a greater holistic understanding of the project and it allows the team to continue functioning if some of its members are off sick, for example. Feeling somewhat drained from Scott’s keynote (but at the same time excited that a lot of the points he raised supported actions we are taking at work), I headed into my last presentation for Agile Testing Days 2012 before having to make my way to Tegel to catch the flight home. “Thinking and working agile in an unbending world” with Pete Walen was a talk I was not going to miss! Having spoken to Pete several times during the past few days, I was looking forward to hearing what he was going to say, and I was not disappointed. Pete started off by trying to separate the definitions of ‘Agile’ as in the methodology, and ‘agile’ as in the adjective by pronouncing them the ‘English’ and ‘American’ ways. So Agile pronounced (Ajyle) and agile pronounced (ajul). There was much confusion around what the hell he was talking about, although I thought it was quite clear. 
Agile: software development methodology. agile: marked by ready ability to move with quick easy grace; having a quick, resourceful and adaptable character. Anyway, that aside (although it provided a few laughs during the presentation), the point was that many teams claim to be ‘Agile’ but are not, in fact, ‘agile’ by nature. Implementing ‘Agile’ methodologies that are so prescriptive actually goes against the very nature of Agile development where a team should anticipate, adapt and explore. Pete made a valid point that very few companies intentionally put up roadblocks to impede work, so if work is being blocked/delayed, why? This is where being agile as a team pays off because the team can inspect what’s going on, explore options and adapt their processes. It is through experimentation (and that means trying and failing as well as trying and succeeding) that a team will improve and grow, leading to a focus on what really needs to be done to achieve X. So, that was it, the last talk of our conference. I was gutted that we had to miss the closing keynote from Matt Heusser, as Matt was another person I had spoken to a few times during the conference, but the flight would not wait, and just as well we left when we did because the traffic was a nightmare! My Takeaway Triple from Day 3: release often and release small (don’t leave stuff on the shelf); keep the meaning of the word ‘agile’ in mind when working in ‘Agile’; look at testing as more of a skill than a role.

    Read the article

  • Source-control 'wet-work'?

    - by Phil Factor
    When a design or creative work is flawed beyond remedy, it is often best to destroy it and start again. The other day, I lost the code to a long and intricate SQL batch I was working on. I’d thought it was impossible, but it happened. With all the technology around that is designed to prevent this occurring, this sort of accident has become a rare event.  If it weren’t for a deranged laptop, and my distraction, the code wouldn’t have been lost this time.  As always, I sighed, had a soothing cup of tea, and typed it all in again.  The new code I hastily tapped in  was much better: I’d held in my head the essence of how the code should work rather than the details: I now knew for certain  the start point, the end, and how it should be achieved. Instantly the detritus of half-baked thoughts fell away and I was able to write logical code that performed better.  Because I could work so quickly, I was able to hold the details of all the columns and variables in my head, and the dynamics of the flow of data. It was, in fact, easier and quicker to start from scratch rather than tidy up and refactor the existing code with its inevitable fumbling and half-baked ideas. What a shame that technology is now so good that developers rarely experience the cleansing shock of losing one’s code and having to rewrite it from scratch.  If you’ve never accidentally lost  your code, then it is worth doing it deliberately once for the experience. Creative people have, until Technology mistakenly prevented it, torn up their drafts or sketches, threw them in the bin, and started again from scratch.  Leonardo’s obsessive reworking of the Mona Lisa was renowned because it was so unusual:  Most artists have been utterly ruthless in destroying work that didn’t quite make it. Authors are particularly keen on writing afresh, and the results are generally positive. 
Lawrence of Arabia actually lost the entire 250,000 word manuscript of ‘The Seven Pillars of Wisdom’ by accidentally leaving it on a train at Reading station, before rewriting a much better version.  Now, any writer or artist is seduced by technology into altering or refining their work rather than casting it dramatically in the bin or setting a light to it on a bonfire, and rewriting it from the blank page.  It is easy to pick away at a flawed work, but the real creative process is far more brutal. Once, many years ago whilst running a software house that supplied commercial software to local businesses, I’d been supervising an accounting system for a farming cooperative. No packaged system met their needs, and it was all hand-cut code.  For us, it represented a breakthrough as it was for a government organisation, and success would guarantee more contracts. As you’ve probably guessed, the code got mangled in a disk crash just a week before the deadline for delivery, and the many backups all proved to be entirely corrupted by a faulty tape drive.  There were some fragments left on individual machines, but they were all of different versions.  The developers were in despair.  Strangely, I managed to re-write the bulk of a three-month project in a manic and caffeine-soaked weekend.  Sure, that elegant universally-applicable input-form routine was‘nt quite so elegant, but it didn’t really need to be as we knew what forms it needed to support.  Yes, the code lacked architectural elegance and reusability. By dawn on Monday, the application passed its integration tests. The developers rose to the occasion after I’d collapsed, and tidied up what I’d done, though they were reproachful that some of the style and elegance had gone out of the application. By the delivery date, we were able to install it. 
It was a smaller, faster application than the beta they’d seen and the user-interface had a new, rather Spartan, appearance that we swore was done to conform to the latest in user-interface guidelines. (We switched to the Helvetica font to look more ‘Bauhaus’.) The client was so delighted that he forgave the new bugs that had crept in. I still have the disk that crashed, up in the attic. In IT, we have had mixed experiences from complete re-writes. Lotus 1-2-3 never really recovered from a complete rewrite from assembler into C, Borland made the mistake with Arago and Quattro Pro, and Netscape’s complete rewrite of their Navigator 4 browser was a white-knuckle ride. In all cases, the decision to rewrite was a result of extreme circumstances where no other course of action seemed possible. The rewrite didn’t come out of the blue. I prefer to remember the rewrite of Minix by young Linus Torvalds, or the rewrite of BitKeeper by a slightly older Linus. The rewrite of CP/M didn’t do too badly either, did it? Come to think of it, the guy who decided to rewrite the windowing system of the Xerox Star never regretted the decision. I’ll agree that one should often resist calls for a rewrite. One of the worst habits of the more inexperienced programmer is to denigrate whatever code he or she inherits, and then call loudly for a complete rewrite. They are buoyed up by the mistaken belief that they can do better. This, however, is a different psychological phenomenon, more related to the idea of some motorcyclists that they are operating on infinite lives, or the occasional squaddies that if they charge the machine-guns determinedly enough all will be well. Grim experience brings out the humility in any experienced programmer. I’m referring to quite different circumstances here. Where a team knows the requirements perfectly, is of one mind on methodology and coding standards, and already has a solution, then what is wrong with considering a complete rewrite? 
Rewrites are so painful in the early stages, until that point where one realises the payoff, that even I quail at the thought. One needs a natural disaster to push one over the edge. The trouble is that source-control systems, and disaster recovery systems, are just too good nowadays. If I were to lose this draft of this very blog post, I know I’d rewrite it much better. However, if you read this, you’ll know I didn’t have the nerve to delete it and start again. There was a time when one prayed that unreliable hardware would deliver one from an unmaintainable mess of a codebase, but now technology has made us almost entirely immune to such a merciful act of God. An old friend of mine with long experience in the software industry has long had the idea of the ‘source-control wet-work’, where one hires a malicious hacker in some wild eastern country to hack into one’s own source control system to destroy all trace of the source to an application. Alas, backup systems are just too good to make this any more than a pipedream. Somehow, it would be difficult to promote the idea. As an alternative, could one construct a source control system that, on doing all the code-quality metrics, would systematically destroy all trace of source code that failed the quality test? Alas, I can’t see many managers buying into the idea. Reading the full story of the near-loss of Toy Story 2 set me thinking. It turned out that the lucky restoration of the code wasn’t the happy ending one first imagined it to be, because they eventually came to the conclusion that the plot was fundamentally flawed and it all had to be rewritten anyway. Was this an early case of the ‘source-control wet-job’? It is very hard nowadays to do a rapid U-turn in a development project because we are far too prone to cling to our existing source-code.

    Read the article

  • NoSQL DB for .Net document-based database (ECM)

    - by Dane
    I'm halfway through coding a basic multi-tenant SaaS ECM solution. Each client has its own instance of the database / datastore, but the .Net app is single instance. The documents are pretty much read only (i.e. an image archive of TIFFs or PDFs). I've used MSSQL so far, but then started thinking this might be viable in a NoSQL DB (e.g. MongoDB, CouchDB). The basic premise is that it stores documents, each with their own particular indexes. Each tenant can have multiple document types, e.g. one tenant might have an invoice type, which has Customer ID, Invoice Number and Invoice Date. Another tenant might have an application form, which has Member Number, Application Number, Member Name, and Application Date. So far I've used the old method that SharePoint uses (or used to use), and created a document table which has int_field_1, int_field_2, date_field_1, date_field_2, etc. Then, I've got a "mapping" table which stores the customer-specific index name and the database field that it maps to. I've avoided the key-value pair model in the DB due to the volume of documents. This way, we can support multiple document types in the one table, get reasonably high performance out of it, and allow for custom document type searches (i.e. the user selects a document type, then they're presented with a list of search fields). However, a NoSQL DB might make this a lot simpler, as I don't need to worry about denormalizing the document. I do have concerns, though, about the rest of the data around a document. We store an "action history" against the document. This tracks views, whether someone emails the document from within the system, and other "future" functionality (e.g. faxing). We have control over the document load process, so we can manipulate the data however it needs to be to get it in the document store (e.g. assign unique IDs). Users will not be adding in their own documents, so we shouldn't need to worry about ACID compliance, as the documents are relatively static. 
So, my questions, I guess:
- Is a NoSQL DB a good fit?
- Is MongoDB the best for ASP.NET? (I saw Raven and Velocity, but they're still kinda beta.)
- Can I store a key for each document, and then store the action history in an MSSQL DB with this key? I don't need to do joins; it would only be used if a person clicks "View History" against a document.
- How would performance compare between the two (NoSQL DB vs denormalized "document" table)?
Volumes would be up to 200,000 new documents per month for a single tenant. My current scaling plan with the SQL DB involves moving the SQL DB into a cluster when certain thresholds are reached, and then reviewing partitioning and indexing structures.

    Read the article

  • Persistent (purely functional) Red-Black trees on disk performance

    - by Waneck
    I'm studying the best data structures to implement a simple open-source object temporal database, and currently I'm very fond of using persistent red-black trees to do it. My main reason for using persistent data structures is, first of all, to minimize the use of locks, so the database can be as parallel as possible. It will also be easier to implement ACID transactions, and even to abstract the database to work in parallel on a cluster of some kind. The great thing about this approach is that it makes implementing temporal databases possible almost for free. And this is something quite nice to have, especially for the web and for data analysis (e.g. trends). All of this is very cool, but I'm a little suspicious about the overall performance of using a persistent data structure on disk. Even though there are some very fast disks available today, and all writes can be done asynchronously, so a response is always immediate, I don't want to build the whole application on a false premise, only to realize it isn't really a good way to do it. Here's my line of thought:
- Since all writes are done asynchronously, and using a persistent data structure means the previous (and currently valid) structure is not invalidated, the write time isn't really a bottleneck.
- There is some literature on structures like this that are designed specifically for disk usage. But it seems to me that these techniques add read overhead to achieve faster writes, and I think exactly the opposite is preferable. Also, many of these techniques really do end up with multi-versioned trees, but they aren't strictly immutable, which is crucial to justify the persistence overhead.
- I know there will still have to be some kind of locking when appending values to the database, and I also know there should be good garbage-collection logic if not all versions are to be maintained (otherwise the file size will surely rise dramatically). A delta compression system could also be considered.
- Of all search tree structures, I really think red-black trees are the closest to what I need, since they offer the least number of rotations.
But there are some possible pitfalls along the way:
- Asynchronous writes -could- affect applications that need the data in real time. But I don't think that is the case with web applications, most of the time. Also, when real-time data is needed, other solutions could be devised, like a check-in/check-out system for specific data that needs to be worked on in a more real-time manner.
- They could also lead to some commit conflicts, though I fail to think of a good example of when it could happen. Commit conflicts can also occur in a normal RDBMS, if two threads are working with the same data, right?
- The overhead of having an immutable interface like this will grow exponentially and everything is doomed to fail soon, so this all is a bad idea.
Any thoughts? Thanks! edit: There seems to be a misunderstanding of what a persistent data structure is: http://en.wikipedia.org/wiki/Persistent_data_structure
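To make the term "persistent" concrete, here is a minimal in-memory sketch of path copying, the usual way immutable search trees are updated. It is an illustration under stated assumptions, not the poster's design: the class name `PersistentTree` is hypothetical, the tree is a plain unbalanced binary search tree (red-black recoloring and rotations are deliberately left out), and nothing here addresses on-disk layout. What it shows is that an insert allocates new nodes only along the path from the root to the change, while every other subtree is shared with the previous version.

```java
// Hypothetical illustration: an immutable (persistent) binary search tree.
// Red-black balancing is deliberately omitted to keep the sketch short.
final class PersistentTree {

    final int key;
    final PersistentTree left;
    final PersistentTree right;

    PersistentTree(int key, PersistentTree left, PersistentTree right) {
        this.key = key;
        this.left = left;
        this.right = right;
    }

    // Returns the root of a new version that contains 'key'.
    // The version rooted at 'root' is never modified.
    static PersistentTree insert(PersistentTree root, int key) {
        if (root == null) {
            return new PersistentTree(key, null, null);
        }
        if (key < root.key) {
            // Copy this node, rebuild only the left path, share the right subtree.
            return new PersistentTree(root.key, insert(root.left, key), root.right);
        }
        if (key > root.key) {
            // Copy this node, rebuild only the right path, share the left subtree.
            return new PersistentTree(root.key, root.left, insert(root.right, key));
        }
        return root; // key already present: reuse the existing subtree as-is
    }

    static boolean contains(PersistentTree root, int key) {
        while (root != null) {
            if (key == root.key) {
                return true;
            }
            root = key < root.key ? root.left : root.right;
        }
        return false;
    }

    public static void main(String[] args) {
        PersistentTree v1 = insert(insert(insert(null, 50), 30), 70);
        PersistentTree v2 = insert(v1, 60);
        System.out.println(contains(v1, 60));   // the old version is unaffected
        System.out.println(contains(v2, 60));   // the new version sees the key
        System.out.println(v1.left == v2.left); // the untouched subtree is shared
    }
}
```

The number of nodes allocated per insert is bounded by the height of the tree, so with a balanced variant each new version costs a logarithmic number of new nodes rather than a full copy. Old roots stay valid for readers, which is what makes lock-free reads and "temporal" queries against earlier versions straightforward in this style.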

    Read the article

  • Java looping through array - Optimization

    - by oudouz
    I've got some Java code that runs as expected, but it takes some time (several seconds) even though the job is just looping through an array. The input file is a FASTA file as shown in the image below. The file I'm using is 2.9 MB, and there are other FASTA files that can be up to 20 MB. In the code I'm trying to loop through it in groups of three, e.g. AGC TTT TCA ... etc. The code makes no functional sense for now, but what I want is to append each amino acid to its equivalent group of bases. Example: AGC - Ser / CUG - Leu / ... etc. So what's wrong with the code? Is there any way to do it better? Any optimization? Looping through the whole String is taking some time, maybe just seconds, but I need to find a better way to do it.

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.FileReader;
    import java.io.IOException;

    public class fasta {
        public static void main(String[] args) throws IOException {
            File fastaFile;
            FileReader fastaReader;
            BufferedReader fastaBuffer = null;
            StringBuilder fastaString = new StringBuilder();
            try {
                fastaFile = new File("res/NC_017108.fna");
                fastaReader = new FileReader(fastaFile);
                fastaBuffer = new BufferedReader(fastaReader);
                String fastaDescription = fastaBuffer.readLine();
                String line = fastaBuffer.readLine();
                while (line != null) {
                    fastaString.append(line);
                    line = fastaBuffer.readLine();
                }
                System.out.println(fastaDescription);
                System.out.println();
                String currentFastaAcid;
                for (int i = 0; i < fastaString.length(); i+=3) {
                    currentFastaAcid = fastaString.toString().substring(i, i + 3);
                    System.out.println(currentFastaAcid);
                }
            } catch (NullPointerException e) {
                System.out.println(e.getMessage());
            } catch (FileNotFoundException e) {
                System.out.println(e.getMessage());
            } catch (IOException e) {
                System.out.println(e.getMessage());
            } finally {
                fastaBuffer.close();
            }
        }
    }
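One likely cost in the loop above is `fastaString.toString()`, which copies the whole buffer into a new String on every iteration; printing each group with its own `println` call adds further overhead. The loop can also throw a `StringIndexOutOfBoundsException` when the sequence length is not a multiple of three, because `substring(i, i + 3)` then reads past the end. The sketch below is one possible rework, not a measured fix: the class `CodonSplitter` and the method `splitCodons` are hypothetical names introduced for illustration, and dropping an incomplete trailing group is an assumption about the desired behaviour.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original question's code.
final class CodonSplitter {

    // Splits a sequence into consecutive three-base groups (codons).
    // Assumption: trailing bases that do not fill a whole codon are ignored.
    static List<String> splitCodons(CharSequence sequence) {
        // Materialize the String once, outside the loop, so the sequence
        // is not copied again on every iteration.
        String bases = sequence.toString();
        List<String> codons = new ArrayList<>(bases.length() / 3);
        // The bound i + 3 <= length guarantees substring never reads past the end.
        for (int i = 0; i + 3 <= bases.length(); i += 3) {
            codons.add(bases.substring(i, i + 3));
        }
        return codons;
    }

    public static void main(String[] args) {
        StringBuilder fasta = new StringBuilder("AGCTTTTCAGG");
        // Join once and print once instead of calling println per codon.
        System.out.println(String.join("\n", splitCodons(fasta)));
    }
}
```

Returning a list rather than printing inside the loop also leaves room for the next step the question mentions, mapping each group to its amino acid, for example with a lookup table keyed by the three-letter string.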

    Read the article

  • Is Berkeley DB a NoSQL solution?

    - by Gregory Burd
    Berkeley DB is a library. To use it to store data you must link the library into your application. You can use most programming languages to access the API; the calls across these APIs generally mimic the Berkeley DB C API, which makes perfect sense because Berkeley DB is written in C. The inspiration for Berkeley DB was the DBM library, a part of the earliest versions of UNIX written by AT&T's Ken Thompson in 1979. DBM was a simple key/value hashtable-based storage library. In the early 1990s, as BSD UNIX was transitioning from version 4.3 to 4.4 and replacing commercial code owned by AT&T with unencumbered code, it was the future founders of Sleepycat Software who wrote libdb (aka Berkeley DB) as the replacement for DBM. The problem it addressed was fast, reliable local key/value storage. At that time databases almost always lived on a single node; even the most sophisticated databases only had simple two-node fail-over solutions. If you had a lot of data to store you would choose between the few commercial RDBMS solutions and writing your own custom solution. Berkeley DB took the headache out of the custom approach. These basic market forces inspired other DBM implementations. There was the "New DBM" (ndbm) and the "GNU DBM" (GDBM) and a few others, but the theme was the same. Even today TokyoCabinet calls itself "a modern implementation of DBM" mimicking, and improving on, something first created over thirty years ago. In the mid-1990s, DBM was the name for what you needed if you were looking for fast, reliable local storage. Fast forward to today. What's changed? Systems are connected over fast, very reliable networks. Disks are cheap, fast, and capable of storing huge amounts of data. CPUs continued to follow Moore's Law; processing power that filled a room in 1990 now fits in your pocket. PCs, servers, and other computers proliferated both in business and the personal markets. 
In addition to the new hardware, entire markets, social systems, and new modes of interpersonal communication moved onto the web and started evolving rapidly. These changes caused a massive explosion of data and a need to analyze and understand that data. Taken together, this resulted in an entirely different landscape for database storage; new solutions were needed. A number of novel solutions stepped up and eventually a category called NoSQL emerged. The new market forces inspired the CAP theorem and the heated debate of BASE vs. ACID. But in essence this was simply the market looking at what to trade off to meet these new demands. These new database systems shared many qualities. They were designed to address massive amounts of data and millions of requests per second, and to scale out across multiple systems. The first large-scale and successful solution was Dynamo, Amazon's distributed key/value database. Dynamo essentially took the next logical step and added a twist. Dynamo was to be the database of record; it would be distributed, data would be partitioned across many nodes, and it would tolerate failure by avoiding single points of failure. Amazon did this because they recognized that the majority of the dynamic content they provided to customers visiting their web store front didn't require the services of an RDBMS. The queries were simple key/value look-ups or simple range queries, with only a few queries that required more complex joins. They set about to use relational technology only in places where it was the best solution for the task, places like accounting and order fulfillment, but not in the myriad of other situations. The success of Dynamo, and its design, inspired the next generation of Non-SQL, distributed database solutions including Cassandra, Riak and Voldemort. The problem their designers set out to solve was "reliability at massive scale", so the first focal point was distributed database algorithms. 
Underneath Dynamo there is a local transactional database: either Berkeley DB, Berkeley DB Java Edition, MySQL or an in-memory key/value data structure. Dynamo was an evolution of local key/value storage onto networks. Cassandra, Riak, and Voldemort all faced similar design decisions and one, Voldemort, chose Berkeley DB Java Edition for its node-local storage. Riak at first was entirely in-memory, but has recently added write-once, append-only, log-based on-disk storage, a type of storage similar to Berkeley DB's except that it is based on a hash table, which must reside entirely in memory, rather than a btree, which can live in memory or on disk. Berkeley DB evolved too: we added high availability (HA) and a replication manager that makes it easy to set up replica groups. Berkeley DB's replication doesn't partition the data; every node keeps an entire copy of the database. For consistency, there is a single node where writes are committed first - a master - then those changes are delivered to the replica nodes as log records. Applications can choose to wait until all nodes are consistent, or fire and forget, allowing Berkeley DB to eventually become consistent. Berkeley DB's HA scales out quite well for read-intensive applications and also effectively eliminates the central point of failure by allowing replica nodes to be elected (using a Paxos algorithm) to mastership if the master should fail. This implementation covers a wide variety of use cases. MemcacheDB is a server that implements the Memcache network protocol but uses Berkeley DB for storage and HA to replicate the cache state across all the nodes in the cache group. Google Accounts, the user authentication layer for all Google properties, was until recently running Berkeley DB HA. That scaled to a globally distributed system. That said, most NoSQL solutions try to partition (shard) data across nodes in the replication group and some allow writes as well as reads at any node; Berkeley DB HA does not. 
So, is Berkeley DB a "NoSQL" solution? Not really, but it certainly is a component of many of the existing NoSQL solutions out there. Forgetting all the noise about how NoSQL solutions are complex distributed databases, when you boil them down to a single node you still have to store the data to some form of stable local storage. DBMs solved that problem a long time ago. NoSQL has more to do with the layers on top of the DBM: the distributed, sometimes-consistent, partitioned, scale-out storage that manages key/value or document sets and generally has some form of simple HTTP/REST-style network API. Does Berkeley DB do that? Not really. Is Berkeley DB a "NoSQL" solution today? Nope, but it's the most robust solution on which to build such a system. Re-inventing the node-local data storage isn't easy. A lot of people are starting to come to appreciate the sophisticated features found in Berkeley DB, and even to mimic them in some cases. Could Berkeley DB grow into a NoSQL solution? Absolutely. Our key/value API could be extended over the net using any of a number of existing network protocols such as memcache or HTTP/REST. We could adapt our node-local data partitioning out over replicated nodes. We even have a nice query language and cost-based query optimizer in our BDB XML product that we could reuse were we to build out a document-based NoSQL-style product. XML and JSON are not so different that we couldn't adapt one to work with the other interchangeably. Without too much effort we could add what's missing; we could jump into this NoSQL market within a single product development cycle. Why isn't Berkeley DB already a NoSQL solution? Why aren't we working on it? Why indeed...

    Read the article

  • MySQL Cluster 7.3 Labs Release – Foreign Keys Are In!

    - by Mat Keep
    Summary (aka TL/DR): Support for Foreign Key constraints has been one of the most requested feature enhancements for MySQL Cluster. We are therefore extremely excited to announce that Foreign Keys are part of the first Labs Release of MySQL Cluster 7.3 – available for download, evaluation and feedback now! (Select the mysql-cluster-7.3-labs-June-2012 build) In this blog, I will attempt to discuss the design rationale, implementation, configuration and steps to get started in evaluating the first MySQL Cluster 7.3 Labs Release. Pace of Innovation It was only a couple of months ago that we announced the General Availability (GA) of MySQL Cluster 7.2, delivering 1 billion Queries per Minute, with 70x higher cross-shard JOIN performance, Memcached NoSQL key-value API and cross-data center replication.  This release has been a huge hit, with downloads and deployments quickly reaching record levels. The announcement of the first MySQL Cluster 7.3 Early Access lab release at today's MySQL Innovation Day event demonstrates the continued pace in Cluster development, and provides an opportunity for the community to evaluate and feedback on new features they want to see. What’s the Plan for MySQL Cluster 7.3? Well, Foreign Keys, as you may have gathered by now (!), and this is the focus of this first Labs Release. 
As with MySQL Cluster 7.2, we plan to publish a series of preview releases for 7.3 that will incrementally add new candidate features for a final GA release (subject to usual safe harbor statement below*), including: - New NoSQL APIs; - Features to automate the configuration and provisioning of multi-node clusters, on premise or in the cloud; - Performance and scalability enhancements; - Taking advantage of features in the latest MySQL 5.x Server GA. Design Rationale MySQL Cluster is designed as a “Not-Only-SQL” database. It combines attributes that enable users to blend the best of both relational and NoSQL technologies into solutions that deliver web scalability with 99.999% availability and real-time performance, including: Concurrent NoSQL and SQL access to the database; Auto-sharding with simple scale-out across commodity hardware; Multi-master replication with failover and recovery both within and across data centers; Shared-nothing architecture with no single point of failure; Online scaling and schema changes; ACID compliance and support for complex queries, across shards. Native support for Foreign Key constraints enables users to extend the benefits of MySQL Cluster into a broader range of use-cases, including: - Packaged applications in areas such as eCommerce and Web Content Management that prescribe databases with Foreign Key support. - In-house developments benefiting from Foreign Key constraints to simplify data models and eliminate the additional application logic needed to maintain data consistency and integrity between tables. Implementation The Foreign Key functionality is implemented directly within MySQL Cluster’s data nodes, allowing any client API accessing the cluster to benefit from them – whether using SQL or one of the NoSQL interfaces (Memcached, C++, Java, JPA or HTTP/REST.) 
The core referential actions defined in the SQL:2003 standard are implemented: CASCADE, RESTRICT, NO ACTION and SET NULL. In addition, the MySQL Cluster implementation supports the online adding and dropping of Foreign Keys, ensuring the Cluster continues to serve both read and write requests during the operation. An important difference from the Foreign Key implementation in InnoDB is that MySQL Cluster does not support the updating of Primary Keys from within the Data Nodes themselves - instead the UPDATE is emulated with a DELETE followed by an INSERT operation. Therefore an UPDATE operation will return an error if the parent reference is using a Primary Key, unless using CASCADE action, in which case the delete operation will result in the corresponding rows in the child table being deleted. The Engineering team plans to change this behavior in a subsequent preview release. Also note that when using InnoDB, "NO ACTION" is identical to "RESTRICT". In the case of MySQL Cluster, “NO ACTION” means “deferred check”, i.e. the constraint is checked before commit, allowing user-defined triggers to automatically make changes in order to satisfy the Foreign Key constraints. Configuration There is nothing special you have to do here – Foreign Key constraint checking is enabled by default. If you intend to migrate existing tables from another database or storage engine, for example from InnoDB, there are a couple of best practices to observe: 1. Analyze the structure of the Foreign Key graph and run the ALTER TABLE ENGINE=NDB in the correct sequence to ensure constraints are enforced. 2. Alternatively, drop the Foreign Key constraints prior to the import process and then recreate them when complete. Getting Started Read this blog for a demonstration of using Foreign Keys with MySQL Cluster.  
You can download MySQL Cluster 7.3 Labs Release with Foreign Keys today (select the mysql-cluster-7.3-labs-June-2012 build). If you are new to MySQL Cluster, the Getting Started guide will walk you through installing an evaluation cluster on a single host (these guides reflect MySQL Cluster 7.2, but apply equally well to 7.3). Post any questions to the MySQL Cluster forum where our Engineering team will attempt to assist you. Post any bugs you find to the MySQL bug tracking system (select MySQL Cluster from the Category drop-down menu). And if you have any feedback, please post it to the Comments section of this blog. Summary MySQL Cluster 7.2 is the GA, production-ready release of MySQL Cluster. This first Labs Release of MySQL Cluster 7.3 gives you the opportunity to preview and evaluate future developments in the MySQL Cluster database, and we are very excited to be able to share that with you. Let us know how you get along with MySQL Cluster 7.3, and other features that you want to see in future releases. * Safe Harbor Statement This information is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

    Read the article

  • The Data Scientist

    - by BuckWoody
    A new term - well, perhaps not that new - has come up and I’m actually very excited about it. The term is Data Scientist, and since it’s new, it’s fairly undefined. I’ll explain what I think it means, and why I’m excited about it. In general, I’ve found the term deals at its most basic with analyzing data. Of course, we all do that, and the term itself in that definition is redundant. There is no science that I know of that does not work with analyzing lots of data. But the term seems to refer to more than the common practices of looking at data visually, putting it in a spreadsheet or report, or even using simple coding to examine data sets. The term Data Scientist (as far as I can make out this early in its use) refers to someone who has a strong understanding of data sources, relevance (statistical and otherwise) and processing methods as well as front-end displays of large sets of complicated data. Some - but not all - Business Intelligence professionals have these skills. In other cases, senior developers, database architects or others fill these needs, but in my experience, many lack the strong mathematical skills needed to make these choices properly. I’ve divided the knowledge base for someone that would wear this title into three large segments. It remains to be seen if a given Data Scientist would be responsible for knowing all these areas or would specialize. There are pretty high requirements on the math side, specifically in graduate-degree level statistics, but in my experience a company will only have a few of these folks, so they are expected to know quite a bit in each of these areas. Persistence The first area is finding, cleaning and storing the data. In some cases, no cleaning is done prior to storage - it’s just identified and the cleansing is done in a later step. 
This area is where the professional would be able to tell if a particular data set should be stored in a Relational Database Management System (RDBMS), across a set of key/value pair storage (NoSQL) or in a file system like HDFS (part of the Hadoop landscape) or other methods. Or do you examine the stream of data without storing it in another system at all? This is an important decision - it’s a foundation choice that deals not only with a lot of expense of purchasing systems or even using Cloud Computing (PaaS, SaaS or IaaS) to source it, but also the skillsets and other resources needed to care and feed the system for a long time. The Data Scientist sets something into motion that will probably outlast his or her career at a company or organization. Often these choices are made by senior developers, database administrators or architects in a company. But sometimes each of these has a certain bias towards making a decision one way or another. The Data Scientist would examine these choices in light of the data itself, starting perhaps even before the business requirements are created. The business may not even be aware of all the strategic and tactical data sources that they have access to. Processing Once the decision is made to store the data, the next set of decisions are based around how to process the data. An RDBMS scales well to a certain level, and provides a high degree of ACID compliance as well as offering a well-known set-based language to work with this data. In other cases, scale should be spread among multiple nodes (as in the case of Hadoop landscapes or NoSQL offerings) or even across a Cloud provider like Windows Azure Table Storage. In fact, in many cases - most of the ones I’m dealing with lately - the data should be split among multiple types of processing environments. This is a newer idea. Many data professionals simply pick a methodology (RDBMS with Star Schemas, NoSQL, etc.) 
and put all data there, regardless of its shape, processing needs and so on. A Data Scientist is familiar not only with the various processing methods, but how they work, so that they can choose the right one for a given need. This is a huge time commitment, hence the need for a dedicated title like this one. Presentation This is where the need for a Data Scientist is most often already being filled, sometimes with more or less success. The latest Business Intelligence systems are quite good at allowing you to create amazing graphics - but it’s the data behind the graphics that is the most important component of truly effective displays. This is where the mathematics requirement of the Data Scientist title is the most unforgiving. In fact, someone without a good foundation in statistics is not a good candidate for creating reports. Even a basic level of statistics can be dangerous. Anyone who works in analyzing data will tell you that there are multiple errors possible when data just seems right - and basic statistics bears out that you’re on the right track - that are only solvable when you understand why the statistical formula works the way it does. And there are lots of ways of presenting data. Sometimes all you need is a “yes” or “no” answer that can only come after heavy analysis work. In that case, a simple e-mail might be all the reporting you need. In others, complex relationships and multiple components require a deep understanding of the various graphical methods of presenting data. Knowing which kind of chart, color, graphic or shape conveys a particular datum best is essential knowledge for the Data Scientist. Why I’m excited I love this area of study. I like math, stats, and computing technologies, but it goes beyond that. I love what data can do - how it can help an organization. 
I’ve been fortunate enough in my professional career these past two decades to work with lots of folks who perform this role at companies from aerospace to medical firms, from manufacturing to retail. Interestingly, the size of the company really isn’t germane here. I worked with one very small bio-tech (cryogenics) company that worked deeply with analysis of complex interrelated data.

So watch this space. No, I’m not leaving Azure or distributed computing or Microsoft. In fact, I think I’m perfectly situated to investigate this role further. We have a huge set of tools, from RDBMS to Hadoop, that allow me to explore. And I’m happy to share what I learn along the way.

    Read the article

  • CodePlex Daily Summary for Friday, February 25, 2011

    Popular Releases

    Mono.Addins: Mono.Addins 0.6: The 0.6 release of Mono.Addins includes many improvements, bug fixes and new features. Add-in engine: Add-in name and description can now be localized. There are new custom attributes for defining them, and they can also be specified as XML elements in an add-in manifest instead of attributes. Support for custom add-in properties. It is now possible to specify arbitrary properties in add-ins, which can be queried at install time (using the Mono.Addins.Setup API) or at run-time. Custom extensio...
    patterns & practices: Project Silk: Project Silk Community Drop 3 - 25 Feb 2011: Introduction: Welcome to the third community drop of Project Silk. For this drop we are requesting feedback on overall application architecture, code review of the JavaScript Conductor and Widgets, and general direction of the application. Project Silk provides guidance and sample implementations that describe and illustrate recommended practices for building modern web applications using technologies such as HTML5, jQuery, CSS3 and Internet Explorer 9. This guidance is intended for experien...
    PhoneyTools: Initial Release (0.1): This is the 0.1 version for preview of the features.
    Minemapper: Minemapper v0.1.5: Now supports the new Minecraft beta v1.3 map format, thanks to updated mcmap. Disabled biomes until Minecraft Biome Extractor supports the new format.
    Smartkernel: Smartkernel
    Document.Editor: 2011.7: What's new for Document.Editor 2011.7: new Find dialog; improved Email dialog; improved Home tab; improved Format tab; minor bug fixes, improvements and speed-ups.
    Chiave File Encryption: Chiave 0.9.1: Application for file encryption and decryption using the 512 Bit Rijndael encryption algorithm with a simple-to-use UI. It's written in C# and compiled in .Net version 3.5. It incorporates features of Windows 7 like Jumplists, Taskbar progress and Aero Glass. Change log from 0.9 Beta to 0.9.1: added option for system shutdown, sleep, hibernate after operation completed; minor changes to the UI; numerous bug fixes. Feedback is welcome!
    Coding4Fun Tools: Coding4Fun.Phone.Toolkit v1.2: New control, Toast Prompt! Removed progress bar since Silverlight Toolkit Feb 2010 has it.
    Umbraco CMS: Umbraco 4.7: Service release fixing 31 issues. A full changelog will be available with the final stable release of 4.7. Important when upgrading: Upgrade as if it was a patch release (update /bin, /umbraco and /umbraco_client). For general upgrade information follow the guide found at http://our.umbraco.org/wiki/install-and-setup/upgrading-an-umbraco-installation. 4.7 requires the .NET 4.0 framework. Web.Config changes: Update the web.config to include the 4 changes found in (they're clearly marked in...
    HubbleDotNet - Open source full-text search engine: V1.1.0.0: Add Sqlite3 DBAdapter. Add App Report when Query Cache is collecting. Improve the performance of indexing through Synchronize. Add top 0 feature so that only the count of the result can be fetched. Improve the score calculating algorithm of match: records that match all items now score higher than others. Add MySql DBAdapter. Improve performance for multi-field sort. Using hash table to access the Payload data; previous versions used binary search. Using heap sort instead of qui...
    Silverlight????[???]: silverlight????[???]2.0
    DBSourceTools: DBSourceTools_1.3.0.0: Release 1.3.0.0. Changed editors from FireEdit to ICSharpCode.TextEditor. Complete revamp of Intellisense (further testing needed). Highlight field and table names in SQL scripts. Added field dropdown on all tables and views in DBExplorer. Added data option for viewing data in tables. Fixed comment / uncomment bug as reported by tareq. Included Synonyms in scripting engine (nickt_ch).
    IronPython: 2.7 Release Candidate 1: We are pleased to announce the first Release Candidate for IronPython 2.7. This release contains over two dozen bugs fixed in preparation for 2.7 Final. See the release notes for 60193 for details and what has already been fixed in the earlier 2.7 prereleases. - IronPython Team
    Caliburn Micro: A Micro-Framework for WPF, Silverlight and WP7: Caliburn.Micro 1.0 RC: This is the official Release Candidate for Caliburn.Micro 1.0. The download contains the binaries, samples and VS templates. VS Templates: The templates included are designed for situations where the Caliburn.Micro source needs to be embedded within a single project solution. This was targeted at government and other organizations that expressed specific requirements around using an open source project like this. NuGet: This release does not have a corresponding NuGet package. The NuGet pack...
    Caliburn: A Client Framework for WPF and Silverlight: Caliburn 2.0 RC: This is the official Release Candidate for Caliburn 2.0. It contains all binaries, samples and generated code docs.
    Rawr: Rawr 4.0.20 Beta: Rawr is now web-based. The link to use Rawr4 is: http://elitistjerks.com/rawr.php This is the Cataclysm Beta Release. More details can be found at the following link: http://rawr.codeplex.com/Thread/View.aspx?ThreadId=237262 As of the 4.0.16 release, you can now also begin using the new downloadable WPF version of Rawr! This is a pre-alpha release of the WPF version, so there are likely to be a lot of issues. If you have a problem, please follow the Posting Guidelines and put it into the Issue Trac...
    Azure Storage Samples: Version 1.0 (February 2011): These downloads contain source code. Each is a complete sample that fully exercises Windows Azure Storage across blobs, queues, and tables. The difference between the downloads is implementation approach. Storage DotNet CS.zip is a .NET StorageClient library implementation in the C# language. This library comes with the Windows Azure SDK and contains helper classes for accessing blobs, queues, and tables. Storage REST CS.zip is a REST implementation in the C# language. The code to implement R...
    PowerGUI Visual Studio Extension: PowerGUI VSX 1.3.2: New features: PowerGUI Console Tool Window; PowerShell Project Type; PowerGUI 2.4 support.
    MiniTwitter: 1.66
    Windows Phone 7 Isolated Storage Explorer: WP7 Isolated Storage Explorer v1.0 Beta: Current release features: WPF desktop explorer client. Visual Studio integrated tool window explorer client (Visual Studio 2010 Professional and above). Supported operations: Refresh (isolated storage information), Add Folder, Add Existing Item, Download File, Delete Folder, Delete File. Explorer supports operations running on multiple remote applications at the same time. Explorer detects application disconnect (1-2 second delay). Explorer confirms operation completed status. Explorer d...

    New Projects

    Agriscope: This is an open information visualization tool used to assist RADA and other agriculture officers in retrieving and analyzing data in day-to-day tasks.
    AVCampos NF-e: Issue and manage NF-e from mobile environments.
    Babel Obfuscator NAnt Tasks: This is an NAnt task for Babel Obfuscator. Babel Obfuscator protects software components built with the Microsoft .NET Framework in order to make reverse engineering difficult. Babel Obfuscator can be downloaded at http://www.babelfor.net
    Concurrent Programming Library: Concurrent Programming Library provides an opportunity to develop parallel programs using .NET Framework 2.0 and above. It includes an implementation of various parallel algorithms, thread-safe collections and patterns.
    EOrg: Work I have done for development purposes.
    Extend Grid View: Extend Grid View is a user control. It helps page a dataset that is set on a GridView.
    FinlogiK ReSharper Contrib: FinlogiK ReSharper Contrib is a plugin for ReSharper 5.1 which adds code cleanup and inspection options for static qualifiers.
    Game development with Playstation Move and Ogre3D: This project is research aiming to develop a program which can handle the Playstation Move on PC. After that, we will implement a game based on it. The programming language is C++. The graphics are handled by Ogre3D.
    JAD: Software project.
    JSARP: This tool allows describing and verifying Petri Nets with the support of a graphical interface. It is being developed in Java.
    KangmoDB - A replacement for the storage engine of SQLite: KangmoDB claims to be a real-time storage engine that replaces the one in SQLite. KangmoDB tries to achieve the lowest latency for a transaction with ACID properties. It will be mainly used for the stock market, which requires the lowest latency with the highest stability.
    MetaprogrammingInDotNetBook: This project will contain code and other artifacts related to the "Metaprogramming in .NET" book that should be available in October 2011.
    munix workstation: The µnix project is an endeavour to create a complete workstation and UNIX-like OS using standard logic ICs and 8-bit AVR microcontrollers. The goal isn't to make something that will compete with a traditional workstation in computation but instead to have a great DIY project.
    PhoneyTools: Set of controls and utilities for WP7 development.
    Plist Builder: Serialize non-circular-referencing .NET objects to plist in .NET.
    Quake3.NET: A port of the Quake 3 engine to C#. This is not merely a port of Quake 3 to run in a managed environment, but a complete rewrite of the engine using C# 4.0's powerful language features.
    SecViz: Web server security attack graph alert correlation IDS.
    SerialNome: This is a multiport serial application.
    sprout sms: A WP7 cabbage client.
    Using external assembly in Biztalk 2009 map: Using external assembly in Biztalk 2009 map.

    Read the article
