Data Quality Services is a very important concept of SQL Server. I have recently started to explore the same and I am really learning some good concepts. Here are two very important blog posts which one should go over before continuing this blog post.
Installing Data Quality Services (DQS) on SQL Server 2012
Connecting Error to Data Quality Services (DQS) on SQL Server 2012
This article is introduction to Data Quality Services for beginners. We will be using an Excel file
Click on the image to enlarge the it.
In the first article we learned to install DQS. In this article we will see how we can learn about building Knowledge Base and using it to help us identify the quality of the data as well help correct the bad quality of the data.
Here are the two very important steps we will be learning in this tutorial.
Building a New Knowledge Base
Creating a New Data Quality Project
Let us start the building the Knowledge Base. Click on New Knowledge Base.
In our project we will be using the Excel as a knowledge base. Here is the Excel which we will be using. There are two columns. One is Colors and another is Shade. They are independent columns and not related to each other. The point which I am trying to show is that in Column A there are unique data and in Column B there are duplicate records.
Clicking on New Knowledge Base will bring up the following screen. Enter the name of the new knowledge base.
Clicking NEXT will bring up following screen where it will allow to select the EXCE file and it will also let users select the source column. I have selected Colors and Shade both as a source column.
Creating a domain is very important. Here you can create a unique domain or domain which is compositely build from Colors and Shade.
As this is the first example, I will create unique domain – for Colors I will create domain Colors and for Shade I will create domain Shade.
Here is the screen which will demonstrate how the screen will look after creating domains.
Clicking NEXT it will bring you to following screen where you can do the data discovery. Clicking on the START will start the processing of the source data provided.
Pre-processed data will show various information related to the source data. In our case it shows that Colors column have unique data whereas Shade have non-unique data and unique data rows are only two.
In the next screen you can actually add more rows as well see the frequency of the data as the values are listed unique.
Clicking next will publish the knowledge base which is just created.
Now the knowledge base is created. We will try to take any random data and attempt to do DQS implementation over it. I am using another excel sheet here for simplicity purpose. In reality you can easily use SQL Server table for the same.
Click on New Data Quality Project to see start DQS Project.
In the next screen it will ask which knowledge base to use. We will be using our Colors knowledge base which we have recently created.
In the Colors knowledge base we had two columns – 1) Colors and 2) Shade. In our case we will be using both of the mappings here. User can select one or multiple column mapping over here.
Now the most important phase of the complete project. Click on Start and it will make the cleaning process and shows various results.
In our case there were two columns to be processed and it completed the task with necessary information. It demonstrated that in Colors columns it has not corrected any value by itself but in Shade value there is a suggestion it has. We can train the DQS to correct values but let us keep that subject for future blog posts.
Now click next and keep the domain Colors selected left side. It will demonstrate that there are two incorrect columns which it needs to be corrected. Here is the place where once corrected value will be auto-corrected in future.
I manually corrected the value here and clicked on Approve radio buttons. As soon as I click on Approve buttons the rows will be disappeared from this tab and will move to Corrected Tab. If I had rejected tab it would have moved the rows to Invalid tab as well.
In this screen you can see how the corrected 2 rows are demonstrated. You can click on Correct tab and see previously validated 6 rows which passed the DQS process.
Now let us click on the Shade domain on the left side of the screen. This domain shows very interesting details as there DQS system guessed the correct answer as Dark with the confidence level of 77%. It is quite a high confidence level and manual observation also demonstrate that Dark is the correct answer. I clicked on Approve and the row moved to corrected tab.
On the next screen DQS shows the summary of all the activities. It also demonstrates how the correction of the quality of the data was performed. The user can explore their data to a SQL Server Table, CSV file or Excel.
The user also has an option to either explore data and all the associated cleansing info or data only. I will select Data only for demonstration purpose.
Clicking explore will generate the files.
Let us open the generated file. It will look as following and it looks pretty complete and corrected.
Well, we have successfully completed DQS Process. The process is indeed very easy. I suggest you try this out yourself and you will find it very easy to learn. In future we will go over advanced concepts.
Are you using this feature on your production server? If yes, would you please leave a comment with your environment and business need. It will be indeed interesting to see where it is implemented.
Reference: Pinal Dave (http://blog.SQLAuthority.com)
Filed under: Business Intelligence, Data Warehousing, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL, Technology Tagged: Data Quality Services, DQS