Search Results

Search found 58774 results on 2351 pages for 'ssis data tranformations'.

Page 10/2351 | < Previous Page | 6 7 8 9 10 11 12 13 14 15 16 17  | Next Page >

  • Search for Variable Usage In SSIS tasks

    - by yoni.s
    Hi all: As what seems to be some sort of penance for sins in a prior life, I have been tasked with maintaining some SSIS packages. (NO! NO BADMOUTHING SSIS!! BAD PROGRAMMMER! NO DOUGHNUT!). Anyhoo, I many of the packages have variables, defined in an outer container, which are used in multiple inner containers, in script tasks. What I want to do, is find out all the places in a package a variable being used; in other words, search for instances of variable usage in all tasks of a package. This would be a huge help, but I cannot for the life of me find out how this can be done in BIDS. (this is SSIS/BIDS 2008) Thanks for any help, YS

    Read the article

  • transforming binary data using ssis and sql server 2008

    - by Rick
    Hello All - I have a task to import/transform and extract zipped binary files that contain both text data as well as embeded binary data. Within the data is data that is relational in nature and needs to be processed into a defined database structure. Currently I have a C# single threaded app that essentially grabs all the files from the directory (currently there is 13K files of varying sizes) and extracts the data on a single thread line by line inserts to the database. As you could imagine this is a very slow process and unacceptable. There are several different parsing routines used depending on the header record in the file. There are potentially upto a million rows per file when all the data is extracted to the row level of detail. Follow on task is to parse those rows into their appropriate tables based on is content. i.e. the textual content has to be parsed further into "buckets" of like data in the database. That about sums up the big picture. Now for the problem task list. How do i iterate through a packet of data using SSIS? In the app the file is decompressed and then is parsed using streams data type and byte arrays and is routed to the required parsing routine based on the header data of each packet. There is bit swapping involved as well. Should i wrap up the app code into a script task(s) and let it do the custom processing? The data is seperated by year and the sql server tables is partitioned by year as well. I need to be able to "catch" bad file data as well and process by hand most likely. Should i simply load the zipped file to sql as a blob and parse the file with T-SQL? Would that be multi threaded if done that way? Not sure how to do the parsing in tsql that is involved here. Which do you think would be faster? Potentially the data that is currently processed via files could come to us via a socket. Can SSIS collect that data in real time? How would i go about setting that up? Processing these new files from the directorys will become a daily task. I can manage the data once i get it to sql server. Getting it there in a timely fashion seems to be the long pole in the tent for me. I would appreciate any comments or suggestions from the group. Rick

    Read the article

  • How to retrieve a bigint from a database and place it into an Int64 in SSIS

    - by b0fh
    I ran into this problem a couple years back and am hoping there has been a fix and I just don't know about it. I am using an 'Execute SQL Task' in the Control Flow of an SSIS package to retrieve a 'bigint' ID value. The task is supposed to place this in an Int64 SSIS variable but I getting the error: "The type of the value being assigned to variable "User::AuditID" differs from the current variable type. Variables may not change type during execution. Variable types are strict, except for variables of type Object." When I brought this to MS' attention a couple years back they stated that I had to 'work around' this by placing the bigint into an SSIS object variable and then converting the value to Int64 as needed. Does anyone know if this has been fixed or do I still have to 'work around' this mess? edit: Server Stats Product: Microsoft SQL Server Enterprise Edition Operating System: Microsoft Windows NT 5.2 (3790) Platform: NT INTEL X86 Version: 9.00.1399.06

    Read the article

  • SSIS process files from folder

    - by RT
    Background: I've a folder that gets pumped with files continuously. My SSIS package needs to process the files and delete them. The SSIS package is scheduled to run once every minute. I'm picking up the files in ascending order of file creation time. I'm building an array of files and then processing-deleting them one at a time. Problem: If an instance of my package takes longer than one minute to run, the next instance of the SSIS package will pick up some of the files the previous instance has in its buffer. By the time the second instance of teh package gets around to processing a file, it may already have been deleted by the first instance, creating an exception condition. I was wondering whether there was a way to avoid the exception condition. Thanks.

    Read the article

  • Convert date from access to SQL Server with SSIS

    - by Arne
    Hi, I want to convert a database from access to SQL Server using SSIS. I cannot convert the date/time columns of the access db. SSIS says something like: conversion between DT_Date and DT_DBTIMESTAMP is not supported. (Its translated from my German version, might be different in English version). In Access I have Date/Time column, in SQL Server I have datetime. In the dataflow chart of the SSIS I have a OLE DB source for the access db, an sql server target and a data conversion. In the data conversion I convert the columns to date[DT_DATE]. They are connected like this: AccessDB -> conversion -> SQL DB What am I doing wrong? How can I convert the Access date columns to SQL Server date columns?

    Read the article

  • Run SSIS Package from T-SQL

    - by Dr. Zim
    I noticed you can use the following stored procedures (in order) to schedule a SSIS package: msdb.dbo.sp_add_category @class=N'JOB', @type=N'LOCAL', @name=N'[Uncategorized (Local)]' msdb.dbo.sp_add_job ... msdb.dbo.sp_add_jobstep ... msdb.dbo.sp_update_job ... msdb.dbo.sp_add_jobschedule ... msdb.dbo.sp_add_jobserver ... (You can see an example by right clicking a scheduled job and selecting "Script Job as- Create To".) AND you can use sp_start_job to execute the job immediately, effectively running SSIS packages on demand. Question: does anyone know of any msdb.dbo.[...] stored procedures that simply allow you to run SSIS packages on the fly without using sp_cmdshell directly, or some easier approach?

    Read the article

  • SSIS 2005 How to add a zip task?

    - by swapna
    Hi , I have a ssis 2005 project. I have to add a zipping task.ie, i have to zip a text file. But i have to follow DB standards of the ORG. strictly. 1) No 3rd party software in the DB server.Not even resourec kit tools 2) No exe's in the DB server. so i cant use(windows resource kit tools,7-zip, and a execute process task to invoke a c# exe.) please suggest me a way to achieve zipping in ssis without compromising the DB standards.Any zip task is available as an addin to SSIS 2005? Thanks SNA

    Read the article

  • SQL SERVER – List of Article on Expressor Data Integration Platform

    - by pinaldave
    The ability to transform data into meaningful and actionable information is the most important information in current business world. In this fast growing and changing business needs effective data integration is single most important thing in making proper decision making. I have been following expressor software since November 2010, when I met expressor team in Seattle. Here are my posts on their innovative data integration platform and expressor Studio, a free desktop ETL tool: 4 Tips for ETL Software IDE Developers Introduction to Adaptive ETL Tool – How adaptive is your ETL? Sharing your ETL Resources Across Applications with Ease expressor Studio Includes Powerful Scripting Capabilities expressor 3.2 Release Review 5 Tips for Improving Your Data with expressor Studio As I had mentioned in some of my blog posts on them, I encourage you to download and test-drive their Studio product – it’s free. Reference: Pinal Dave (http://blog.SQLAuthority.com) Filed under: PostADay, SQL, SQL Authority, SQL Documentation, SQL Download, SQL Query, SQL Server, SQL Tips and Tricks, T SQL, Technology Tagged: SSIS

    Read the article

  • Performance problems loading XML with SSIS, an alternative way!

    - by AtulThakor
    I recently needed to load several thousand XML files into a SQL database, I created an SSIS package which was created as followed: Using a foreach container to loop through a directory and load each file path into a variable, the “Import XML” dataflow would then load each XML file into a SQL table.       Running this, it took approximately 1 second to load each file which seemed a massive amount of time to parse the XML and load the data, speaking to my colleague Martin Croft, he suggested the use of T-SQL Bulk Insert and OpenRowset, so we adjusted the package as followed:     The same foreach container was used but instead the following SQL command was executed (this is an expression):     "INSERT INTO MyTable(FileDate) SELECT   CAST(bulkcolumn AS XML)     FROM OPENROWSET(         BULK         '" + @[User::CurrentFile]  + "',         SINGLE_BLOB ) AS x"     Using this method we managed to load approximately 20 records per second, much faster…for data loading! For what we wanted to achieve this was perfect but I’ll leave you with the following points when making your own decision on which solution you decide to choose!      Openrowset Method Much faster to get the data into SQL You’ll need to parse or create a view over the XML data to allow the data to be more usable(another post on this!) Not able to apply validation/transformation against the data when loading it The SQL Server service account will need permission to the file No schema validation when loading files SSIS Slower (in our case) Schema validation Allows you to apply transformations/joins to the data Permissions should be less of a problem Data can be loaded into the final form through the package When using a schema validation errors can fail the package (I’ll do another post on this)

    Read the article

  • Big Data – Final Wrap and What Next – Day 21 of 21

    - by Pinal Dave
    In yesterday’s blog post we explored various resources related to learning Big Data and in this blog post we will wrap up this 21 day series on Big Data. I have been exploring various terms and technology related to Big Data this entire month. It was indeed fun to write about Big Data in 21 days but the subject of Big Data is much bigger and larger than someone can cover it in 21 days. My first goal was to write about the basics and I think we have got that one covered pretty well. During this 21 days I have received many questions and answers related to Big Data. I have covered a few of the questions in this series and a few more I will be covering in the next coming months. Now after understanding Big Data basics. I am personally going to do a list of the things next. I thought I will share the same with you as this will give you a good idea how to continue the journey of the Big Data. Build a schedule to read various Apache documentations Watch all Pluralsight Courses Explore HortonWorks Sandbox Start building presentation about Big Data – this is a great way to learn something new Present in User Groups Meetings on Big Data Topics Write more blog posts about Big Data I am going to continue learning about Big Data – I want you to continue learning Big Data. Please leave a comment how you are going to continue learning about Big Data. I will publish all the informative comments on this blog with due credit. I want to end this series with the infographic by UMUC. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

    Read the article

  • Partner Webcast - Focus on Oracle Data Profiling and Data Quality 11g

    - by lukasz.romaszewski(at)oracle.com
    Normal 0 false false false EN-US X-NONE X-NONE /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0cm 5.4pt 0cm 5.4pt; mso-para-margin-top:0cm; mso-para-margin-right:0cm; mso-para-margin-bottom:10.0pt; mso-para-margin-left:0cm; line-height:115%; mso-pagination:widow-orphan; font-size:11.0pt; font-family:"Calibri","sans-serif"; mso-ascii-font-family:Calibri; mso-ascii-theme-font:minor-latin; mso-hansi-font-family:Calibri; mso-hansi-theme-font:minor-latin; mso-bidi-font-family:"Times New Roman"; mso-bidi-theme-font:minor-bidi; mso-ansi-language:RO;} Partner Webcast Focus on Oracle Data Profiling and Data Quality 11g February 24th, 12am  CET   Oracle offers an integrated suite Data Quality software architected to discover and correct today's data quality problems and establish a platform prepared for tomorrow's yet unknown data challenges. Oracle Data Profiling provides data investigation, discovery, and profiling in support of quality, migration, integration, stewardship, and governance initiatives. It includes a broad range of features that expand upon basic profiling, including automated monitoring, business-rule validation, and trend analysis. Oracle Data Quality for Data Integrator provides cleansing, standardization, matching, address validation, location enrichment, and linking functions for global customer data and operational business data. It ensures that data adheres to established standards that are adaptable to fit each organization's specific needs.  Both single - and double - byte data are processed in local languages to provide a unique and centralized view of customers, products and services.   During this in-person briefing, Data Integration Solution Specialists will be providing a technical overview and a walkthrough.   Agenda ·         Oracle Data Integration Strategy overview ·         A focus on Oracle Data Profiling and Oracle Data Quality for Data Integrator: o   Oracle Data Profiling o   Oracle Data Quality for Data Integrator o   Live demoo   Q&A Delivery Format  This FREE online LIVE eSeminar will be delivered over the Web and Conference Call. Registrations   received less than 24hours  prior to start time may not receive confirmation to attend. To register , click here. For any questions please contact [email protected]

    Read the article

  • How to build a data flow?

    - by salvationishere
    I am running Visual Studio 2008, the SSIS Tutorial described on: http://msdn.microsoft.com/en-us/library/ms167106.aspx I finished all of the tasks but am getting following errors: Error 1 Validation error. Extract Sample Currency Data: Extract Sample Currency Data: input column "CurrencyAlternateKey" (123) has lineage ID 55 that was not previously used in the Data Flow task. Lesson 1.dtsx 0 0 Error 2 Validation error. Extract Sample Currency Data SSIS.Pipeline: input column "CurrencyAlternateKey" (123) has lineage ID 55 that was not previously used in the Data Flow task. Lesson 1.dtsx 0 0 Can you tell what I need to do to make this build without errors?

    Read the article

  • SSIS pckage run from file system

    - by Pramodtech
    Initially I deployed packages on SQL server but since my machine is not having SSIS installed I faced issue of version while executing the packages. Then I deployed packages to file system on server which has SQL server enterprise edition with SSIS installed on it. I access the folder on server where I have deployed packages from my system and execute the package but I get error saying "cannot run on this edition of integration services, need higher version." Do I need to rmote login (RDP) to execute package?

    Read the article

  • Loading Hybrid Dimension Table with SCD1 and SCD2 attributes + SSIS

    - by Nev_Rahd
    Hello I am just in a process of starting a new task, wherein in i need to load Hybrid Dimension Table with SCD1 and SCD2. This need to be achieved as a SSIS Package. Can someone guide what would be the best way dealing this in SSIS, should i used SCD component or there is other way? What are the best practices for this. For SCD2 type, am using Merge statement. Thanks

    Read the article

  • Unzip password protected Zip file in SSIS

    - by MarcoF
    Does anyone know how to Unzip password protected files in an SSIS package? We have an SSIS package that currently use java.util.zip to unzip zip file and has been working perfectly for some time now. They now want to password protect the files and unfortunetly this library cannot do it. We are using a SQL server 2008 with .Net 3.5 on the windows server 2008 R2.

    Read the article

  • Using version control with SSIS packages (saving 'sensitive' data)

    - by sandesh247
    Hi. We are a team working on a bunch of SSIS packages, which we share using version control (SVN). We have three ways of saving sensitive data in these packages : not storing them at all storing them with a user key storing them with a password However, each of these options is inconvenient while testing packages saved and committed by an other developer. For each such package, one has to update the credentials, no matter how the sensitive data was persisted. Is there a better way to collaborate on SSIS packages?

    Read the article

  • SSIS Package deployed - Fails when executed from schedule

    - by alex
    I've deployed a SSIS package to my SQL server. I can run the package fine by connecting to Integration Services in SSMS and right clicking on it and choosing "Run Package" However, if I schedule the package, it fails. It tells me to check the logs for information on why, but there is nothing in there... Any ideas? (this is my first SSIS package by the way)

    Read the article

  • SSIS Data Transformation

    - by bbbbb
    Hi, I am new to SSIS, so please bare with me. I am trying to transfer data from one db to a new one. i am fetching data from one table say i fetch name of person, then i insert this into say Table Person. this will generate a personID which i want to insert into say Address Table. What should be the approach using SSIS. Any suggestions.

    Read the article

  • Big Data – Buzz Words: What is HDFS – Day 8 of 21

    - by Pinal Dave
    In yesterday’s blog post we learned what is MapReduce. In this article we will take a quick look at one of the four most important buzz words which goes around Big Data – HDFS. What is HDFS ? HDFS stands for Hadoop Distributed File System and it is a primary storage system used by Hadoop. It provides high performance access to data across Hadoop clusters. It is usually deployed on low-cost commodity hardware. In commodity hardware deployment server failures are very common. Due to the same reason HDFS is built to have high fault tolerance. The data transfer rate between compute nodes in HDFS is very high, which leads to reduced risk of failure. HDFS creates smaller pieces of the big data and distributes it on different nodes. It also copies each smaller piece to multiple times on different nodes. Hence when any node with the data crashes the system is automatically able to use the data from a different node and continue the process. This is the key feature of the HDFS system. Architecture of HDFS The architecture of the HDFS is master/slave architecture. An HDFS cluster always consists of single NameNode. This single NameNode is a master server and it manages the file system as well regulates access to various files. In additional to NameNode there are multiple DataNodes. There is always one DataNode for each data server. In HDFS a big file is split into one or more blocks and those blocks are stored in a set of DataNodes. The primary task of the NameNode is to open, close or rename files and directory and regulate access to the file system, whereas the primary task of the DataNode is read and write to the file systems. DataNode is also responsible for the creation, deletion or replication of the data based on the instruction from NameNode. In reality, NameNode and DataNode are software designed to run on commodity machine build in Java language. Visual Representation of HDFS Architecture Let us understand how HDFS works with the help of the diagram. Client APP or HDFS Client connects to NameSpace as well as DataNode. Client App access to the DataNode is regulated by NameSpace Node. NameSpace Node allows Client App to connect to the DataNode based by allowing the connection to the DataNode directly. A big data file is divided into multiple data blocks (let us assume that those data chunks are A,B,C and D. Client App will later on write data blocks directly to the DataNode. Client App does not have to directly write to all the node. It just has to write to any one of the node and NameNode will decide on which other DataNode it will have to replicate the data. In our example Client App directly writes to DataNode 1 and detained 3. However, data chunks are automatically replicated to other nodes. All the information like in which DataNode which data block is placed is written back to NameNode. High Availability During Disaster Now as multiple DataNode have same data blocks in the case of any DataNode which faces the disaster, the entire process will continue as other DataNode will assume the role to serve the specific data block which was on the failed node. This system provides very high tolerance to disaster and provides high availability. If you notice there is only single NameNode in our architecture. If that node fails our entire Hadoop Application will stop performing as it is a single node where we store all the metadata. As this node is very critical, it is usually replicated on another clustered as well as on another data rack. Though, that replicated node is not operational in architecture, it has all the necessary data to perform the task of the NameNode in the case of the NameNode fails. The entire Hadoop architecture is built to function smoothly even there are node failures or hardware malfunction. It is built on the simple concept that data is so big it is impossible to have come up with a single piece of the hardware which can manage it properly. We need lots of commodity (cheap) hardware to manage our big data and hardware failure is part of the commodity servers. To reduce the impact of hardware failure Hadoop architecture is built to overcome the limitation of the non-functioning hardware. Tomorrow In tomorrow’s blog post we will discuss the importance of the relational database in Big Data. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

    Read the article

  • SSIS DSN Not Showing as ODBC Data Source

    - by user1114330
    I have been following the directions here: http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/05ccd778-b78c-4a83-a10a-c4ae412cc6e4 And ran into a problem where my System DSN is not showing up as a ODBC provider. I found this which seemed promising: http://support.microsoft.com/kb/2000277 I was not able to delete the key but followed but I did what was suggested: "If unable to delete the key, double-click the key and erase the Data value entered. Once done, the value should read ' (value not set) '" However, after following the instructions my System DSN still does not appear as an option. The USER DSN however does show...has shown but does not work as I get a permissions error. Any ideas?

    Read the article

  • Oracle Data Integration 12c: Perspectives of Industry Experts, Customers and Partners

    - by Irem Radzik
    Normal 0 false false false EN-US X-NONE X-NONE MicrosoftInternetExplorer4 As you may have seen from our recent blog posts on Oracle Data Integrator 12c and Oracle GoldenGate 12c, we are very excited to share with you the great new features the 12c release brings to Oracle’s data integration solutions. And, fortunately we are not alone in this sentiment. Since the press announcement October 17th, which incorporates our customers' and experts' testimonials, we have seen positive comments in leading technology publications and social media as well. Here are some examples: In CIO and PCWorld you can find Joab Jackson’s article, Oracle Data Integrator 12c ready for real-time analysis, where wrote about the tight integration between Oracle Data Integrator and Oracle GoldenGate . He noted “Heeding the call from enterprise customers who clamor for more immediacy in their data-driven reports, Oracle has updated its data-integration software portfolio so that it can more rapidly deliver data to data warehouses and analysis applications.” Integration Developer News’ Vance McCarthy wrote the article Oracle Ships ‘Future Proofs’ Integration Tools for Traditional, Cloud, Big Data, Real-Time Projects and mentioned that “Oracle Data Integrator 12c and Oracle GoldenGate 12c sport a wide range of improvements to let devs more easily deliver data integration for cloud, analytics, big data and other new projects that leverage multiple datasets for business.“ InformationWeek’s Doug Henschen gave a great overview to several key features including the new flow-based UI in Oracle Data Integrator. Doug said “Oracle Data Integrator 12c introduces a complete makeover of the job-building experience, while real-time oriented GoldenGate 12c introduces performance gains “. In Database Trends and Applications’ article Oracle Strengthens Data Integration with Release of Oracle Data Integrator 12c and Oracle GoldenGate 12c highlighted the productivity aspect of the new solution with his remarks: “tight integration between Oracle Data Integrator 12c and Oracle GoldenGate 12c enables developers to leverage Oracle GoldenGate’s low overhead, real-time change data capture completely within the Oracle Data Integrator Studio without additional training”. We are also thrilled about what our customers and partners have to say about our products and the new release. And we are equally excited to share those perspectives with you in our upcoming launch video webcast on November 12th. SolarWorld Industries America’s Senior Database Manager, Russ Toyama will join our executives in our studio in Redwood Shores to discuss GoldenGate’s core benefits and the new release, while Surren Partharb, CTO of Strategic Technology Services for BT, and Mark Rittman, CTO of Rittman Mead, will provide their comments via the interviews conducted in the UK. This interactive panel discussion in the video webcast will unveil the new release with the expertise of our development executives and the great insight from our customers and partners. In addition, our product experts will be available online to answer chat questions. This is really a great opportunity to learn how Oracle's data integration offering has changed the integration and replication technology space with the new release, and established itself as the new leader. If you have not registered for this free event yet, you can do so via this link. We will run the live event at 8am PT/4pm GMT, followed by a replay of the event with live chat for Q&A  at 10am PT/6pm GMT. The replay will be available on-demand for those who register but cannot attend either session on November 12th. /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin-top:0in; mso-para-margin-right:0in; mso-para-margin-bottom:10.0pt; mso-para-margin-left:0in; line-height:115%; mso-pagination:widow-orphan; font-size:11.0pt; font-family:"Times New Roman","serif"; mso-fareast-font-family:"Times New Roman";}

    Read the article

  • New Big Data Appliance Security Features

    - by mgubar
    The Oracle Big Data Appliance (BDA) is an engineered system for big data processing.  It greatly simplifies the deployment of an optimized Hadoop Cluster – whether that cluster is used for batch or real-time processing.  The vast majority of BDA customers are integrating the appliance with their Oracle Databases and they have certain expectations – especially around security.  Oracle Database customers have benefited from a rich set of security features:  encryption, redaction, data masking, database firewall, label based access control – and much, much more.  They want similar capabilities with their Hadoop cluster.    Unfortunately, Hadoop wasn’t developed with security in mind.  By default, a Hadoop cluster is insecure – the antithesis of an Oracle Database.  Some critical security features have been implemented – but even those capabilities are arduous to setup and configure.  Oracle believes that a key element of an optimized appliance is that its data should be secure.  Therefore, by default the BDA delivers the “AAA of security”: authentication, authorization and auditing. Security Starts at Authentication A successful security strategy is predicated on strong authentication – for both users and software services.  Consider the default configuration for a newly installed Oracle Database; it’s been a long time since you had a legitimate chance at accessing the database using the credentials “system/manager” or “scott/tiger”.  The default Oracle Database policy is to lock accounts thereby restricting access; administrators must consciously grant access to users. Default Authentication in Hadoop By default, a Hadoop cluster fails the authentication test. For example, it is easy for a malicious user to masquerade as any other user on the system.  Consider the following scenario that illustrates how a user can access any data on a Hadoop cluster by masquerading as a more privileged user.  In our scenario, the Hadoop cluster contains sensitive salary information in the file /user/hrdata/salaries.txt.  When logged in as the hr user, you can see the following files.  Notice, we’re using the Hadoop command line utilities for accessing the data: $ hadoop fs -ls /user/hrdataFound 1 items-rw-r--r--   1 oracle supergroup         70 2013-10-31 10:38 /user/hrdata/salaries.txt$ hadoop fs -cat /user/hrdata/salaries.txtTom Brady,11000000Tom Hanks,5000000Bob Smith,250000Oprah,300000000 User DrEvil has access to the cluster – and can see that there is an interesting folder called “hrdata”.  $ hadoop fs -ls /user Found 1 items drwx------   - hr supergroup          0 2013-10-31 10:38 /user/hrdata However, DrEvil cannot view the contents of the folder due to lack of access privileges: $ hadoop fs -ls /user/hrdata ls: Permission denied: user=drevil, access=READ_EXECUTE, inode="/user/hrdata":oracle:supergroup:drwx------ Accessing this data will not be a problem for DrEvil. He knows that the hr user owns the data by looking at the folder’s ACLs. To overcome this challenge, he will simply masquerade as the hr user. On his local machine, he adds the hr user, assigns that user a password, and then accesses the data on the Hadoop cluster: $ sudo useradd hr $ sudo passwd $ su hr $ hadoop fs -cat /user/hrdata/salaries.txt Tom Brady,11000000 Tom Hanks,5000000 Bob Smith,250000 Oprah,300000000 Hadoop has not authenticated the user; it trusts that the identity that has been presented is indeed the hr user. Therefore, sensitive data has been easily compromised. Clearly, the default security policy is inappropriate and dangerous to many organizations storing critical data in HDFS. Big Data Appliance Provides Secure Authentication The BDA provides secure authentication to the Hadoop cluster by default – preventing the type of masquerading described above. It accomplishes this thru Kerberos integration. Figure 1: Kerberos Integration The Key Distribution Center (KDC) is a server that has two components: an authentication server and a ticket granting service. The authentication server validates the identity of the user and service. Once authenticated, a client must request a ticket from the ticket granting service – allowing it to access the BDA’s NameNode, JobTracker, etc. At installation, you simply point the BDA to an external KDC or automatically install a highly available KDC on the BDA itself. Kerberos will then provide strong authentication for not just the end user – but also for important Hadoop services running on the appliance. You can now guarantee that users are who they claim to be – and rogue services (like fake data nodes) are not added to the system. It is common for organizations to want to leverage existing LDAP servers for common user and group management. Kerberos integrates with LDAP servers – allowing the principals and encryption keys to be stored in the common repository. This simplifies the deployment and administration of the secure environment. Authorize Access to Sensitive Data Kerberos-based authentication ensures secure access to the system and the establishment of a trusted identity – a prerequisite for any authorization scheme. Once this identity is established, you need to authorize access to the data. HDFS will authorize access to files using ACLs with the authorization specification applied using classic Linux-style commands like chmod and chown (e.g. hadoop fs -chown oracle:oracle /user/hrdata changes the ownership of the /user/hrdata folder to oracle). Authorization is applied at the user or group level – utilizing group membership found in the Linux environment (i.e. /etc/group) or in the LDAP server. For SQL-based data stores – like Hive and Impala – finer grained access control is required. Access to databases, tables, columns, etc. must be controlled. And, you want to leverage roles to facilitate administration. Apache Sentry is a new project that delivers fine grained access control; both Cloudera and Oracle are the project’s founding members. Sentry satisfies the following three authorization requirements: Secure Authorization:  the ability to control access to data and/or privileges on data for authenticated users. Fine-Grained Authorization:  the ability to give users access to a subset of the data (e.g. column) in a database Role-Based Authorization:  the ability to create/apply template-based privileges based on functional roles. With Sentry, “all”, “select” or “insert” privileges are granted to an object. The descendants of that object automatically inherit that privilege. A collection of privileges across many objects may be aggregated into a role – and users/groups are then assigned that role. This leads to simplified administration of security across the system. Figure 2: Object Hierarchy – granting a privilege on the database object will be inherited by its tables and views. Sentry is currently used by both Hive and Impala – but it is a framework that other data sources can leverage when offering fine-grained authorization. For example, one can expect Sentry to deliver authorization capabilities to Cloudera Search in the near future. Audit Hadoop Cluster Activity Auditing is a critical component to a secure system and is oftentimes required for SOX, PCI and other regulations. The BDA integrates with Oracle Audit Vault and Database Firewall – tracking different types of activity taking place on the cluster: Figure 3: Monitored Hadoop services. At the lowest level, every operation that accesses data in HDFS is captured. The HDFS audit log identifies the user who accessed the file, the time that file was accessed, the type of access (read, write, delete, list, etc.) and whether or not that file access was successful. The other auditing features include: MapReduce:  correlate the MapReduce job that accessed the file Oozie:  describes who ran what as part of a workflow Hive:  captures changes were made to the Hive metadata The audit data is captured in the Audit Vault Server – which integrates audit activity from a variety of sources, adding databases (Oracle, DB2, SQL Server) and operating systems to activity from the BDA. Figure 4: Consolidated audit data across the enterprise.  Once the data is in the Audit Vault server, you can leverage a rich set of prebuilt and custom reports to monitor all the activity in the enterprise. In addition, alerts may be defined to trigger violations of audit policies. Conclusion Security cannot be considered an afterthought in big data deployments. Across most organizations, Hadoop is managing sensitive data that must be protected; it is not simply crunching publicly available information used for search applications. The BDA provides a strong security foundation – ensuring users are only allowed to view authorized data and that data access is audited in a consolidated framework.

    Read the article

  • How to present a stable data model in a public API that allows internal data structures to be changed without breaking the public view of the data?

    - by Max Palmer
    I am in the process of developing an application that allows users to write C# scripts. These scripts allow users to call selected methods and to access and manipulate data in a document. This works well, however, in the development version, scripts access the document's (internal) data structures directly. This means that if we were to change the internal data model/structure, there is a good chance that someone's script will no longer compile. We obviously want to prevent this breaking change from happening, but still want to allow the user to write sensible C# code (whilst not restricting how we develop our internal data model as a result). We therefore need to decouple our scripting API and its data structures from our internal methods and data structures. We've a few ideas as to how we might allow the user to access a what is effectively a stable public version of the document's internal data*, but I wanted to throw the question out there to someone who might have some real experience of this problem. NB our internal document's data structure is quite complex and it could be quite difficult to wrap. We know we want to expose as little as possible in our public API, especially as once it's out there, it's out there for good. Can anyone help? How do scripting languages / APIs decouple their public API and data structures from their internal data structures? Is there no real alternative to having to write a complex interaction layer? If we need to do this, what's a good approach or pattern for wrapping complex data structures that include nested objects, including collections? I've looked at the API facade pattern, which looks like it's trying to address these kinds of issues, but are there alternatives? *One idea is to build a data facade that is kept stable across versions of our application. The facade exposes a set of facade data objects that are used in the script code. These maintain backwards compatibility and wrap access to our internal document's data model.

    Read the article

< Previous Page | 6 7 8 9 10 11 12 13 14 15 16 17  | Next Page >