Search Results

Search found 59041 results on 2362 pages for 'data replication'.

Page 1/2362 | 1 2 3 4 5 6 7 8 9 10 11 12  | Next Page >

  • Data Quality and Master Data Management Resources

    - by Dejan Sarka
    Many companies or organizations do regular data cleansing. When you cleanse the data, the data quality goes up to some higher level. The data quality level is determined by the amount of work invested in the cleansing. As time passes, the data quality deteriorates, and you need to repeat the cleansing process. If you spend an equal amount of effort as you did with the previous cleansing, you can expect the same level of data quality as you had after the previous cleansing. And then the data quality deteriorates over time again, and the cleansing process starts over and over again. The idea of Data Quality Services is to mitigate the cleansing process. While the amount of time you need to spend on cleansing decreases, you will achieve higher and higher levels of data quality. While cleansing, you learn what types of errors to expect, discover error patterns, find domains of correct values, etc. You don’t throw away this knowledge. You store it and use it to find and correct the same issues automatically during your next cleansing process. The following figure shows this graphically. The idea of master data management, which you can perform with Master Data Services (MDS), is to prevent data quality from deteriorating. Once you reach a particular quality level, the MDS application—together with the defined policies, people, and master data management processes—allow you to maintain this level permanently. This idea is shown in the following picture. OK, now you know what DQS and MDS are about. You can imagine the importance on maintaining the data quality. Here are some resources that help you preparing and executing the data quality (DQ) and master data management (MDM) activities. Books Dejan Sarka and Davide Mauri: Data Quality and Master Data Management with Microsoft SQL Server 2008 R2 – a general introduction to MDM, MDS, and data profiling. Matching explained in depth. Dejan Sarka, Matija Lah and Grega Jerkic: MCTS Self-Paced Training Kit (Exam 70-463): Building Data Warehouses with Microsoft SQL Server 2012 – I wrote quite a few chapters about DQ and MDM, and introduced also SQL Server 2012 DQS. Thomas Redman: Data Quality: The Field Guide – you should start with this book. Thomas Redman is the father of DQ and MDM. Tyler Graham: Microsoft SQL Server 2012 Master Data Services – MDS in depth from a product team mate. Arkady Maydanchik: Data Quality Assessment – data profiling in depth. Tamraparni Dasu, Theodore Johnson: Exploratory Data Mining and Data Cleaning – advanced data profiling with data mining. Forthcoming presentations I am presenting a DQS and MDM seminar at PASS SQL Rally Amsterdam 2013: Wednesday, November 6th, 2013: Enterprise Information Management with SQL Server 2012 – a good kick start to your first DQ and / or MDM project. Courses Data Quality and Master Data Management with SQL Server 2012 – I wrote a 2-day course for SolidQ. If you are interested in this course, which I could also deliver in a shorter seminar way, you can contact your closes SolidQ subsidiary, or, of course, me directly on addresses [email protected] or [email protected]. This course could also complement the existing courseware portfolio of training providers, which are welcome to contact me as well. Start improving the quality of your data now!

    Read the article

  • Coherence - How to develop a custom push replication publisher

    - by cosmin.tudor(at)oracle.com
    CoherencePushReplicationDB.zipIn the example bellow I'm describing a way of developing a custom push replication publisher that publishes data to a database via JDBC. This example can be easily changed to publish data to other receivers (JMS,...) by performing changes to step 2 and small changes to step 3, steps that are presented bellow. I've used Eclipse as the development tool. To develop a custom push replication publishers we will need to go through 6 steps: Step 1: Create a custom publisher scheme class Step 2: Create a custom publisher class that should define what the publisher is doing. Step 3: Create a class data is performing the actions (publish to JMS, DB, etc ) for the custom publisher. Step 4: Register the new publisher against a ContentHandler. Step 5: Add the new custom publisher in the cache configuration file. Step 6: Add the custom publisher scheme class to the POF configuration file. All these steps are detailed bellow. The coherence project is attached and conclusions are presented at the end. Step 1: In the Coherence Eclipse project create a class called CustomPublisherScheme that should implement com.oracle.coherence.patterns.pushreplication.publishers.AbstractPublisherScheme. In this class define the elements of the custom-publisher-scheme element. For instance for a CustomPublisherScheme that looks like that: <sync:publisher> <sync:publisher-name>Active2-JDBC-Publisher</sync:publisher-name> <sync:publisher-scheme> <sync:custom-publisher-scheme> <sync:jdbc-string>jdbc:oracle:thin:@machine-name:1521:XE</sync:jdbc-string> <sync:username>hr</sync:username> <sync:password>hr</sync:password> </sync:custom-publisher-scheme> </sync:publisher-scheme> </sync:publisher> the code is: package com.oracle.coherence; import java.io.DataInput; import java.io.DataOutput; import java.io.IOException; import com.oracle.coherence.patterns.pushreplication.Publisher; import com.oracle.coherence.configuration.Configurable; import com.oracle.coherence.configuration.Mandatory; import com.oracle.coherence.configuration.Property; import com.oracle.coherence.configuration.parameters.ParameterScope; import com.oracle.coherence.environment.Environment; import com.tangosol.io.pof.PofReader; import com.tangosol.io.pof.PofWriter; import com.tangosol.util.ExternalizableHelper; @Configurable public class CustomPublisherScheme extends com.oracle.coherence.patterns.pushreplication.publishers.AbstractPublisherScheme { /** * */ private static final long serialVersionUID = 1L; private String jdbcString; private String username; private String password; public String getJdbcString() { return this.jdbcString; } @Property("jdbc-string") @Mandatory public void setJdbcString(String jdbcString) { this.jdbcString = jdbcString; } public String getUsername() { return username; } @Property("username") @Mandatory public void setUsername(String username) { this.username = username; } public String getPassword() { return password; } @Property("password") @Mandatory public void setPassword(String password) { this.password = password; } public Publisher realize(Environment environment, ClassLoader classLoader, ParameterScope parameterScope) { return new CustomPublisher(getJdbcString(), getUsername(), getPassword()); } public void readExternal(DataInput in) throws IOException { super.readExternal(in); this.jdbcString = ExternalizableHelper.readSafeUTF(in); this.username = ExternalizableHelper.readSafeUTF(in); this.password = ExternalizableHelper.readSafeUTF(in); } public void writeExternal(DataOutput out) throws IOException { super.writeExternal(out); ExternalizableHelper.writeSafeUTF(out, this.jdbcString); ExternalizableHelper.writeSafeUTF(out, this.username); ExternalizableHelper.writeSafeUTF(out, this.password); } public void readExternal(PofReader reader) throws IOException { super.readExternal(reader); this.jdbcString = reader.readString(100); this.username = reader.readString(101); this.password = reader.readString(102); } public void writeExternal(PofWriter writer) throws IOException { super.writeExternal(writer); writer.writeString(100, this.jdbcString); writer.writeString(101, this.username); writer.writeString(102, this.password); } } Step 2: Define what the CustomPublisher should basically do by creating a new java class called CustomPublisher that implements com.oracle.coherence.patterns.pushreplication.Publisher package com.oracle.coherence; import com.oracle.coherence.patterns.pushreplication.EntryOperation; import com.oracle.coherence.patterns.pushreplication.Publisher; import com.oracle.coherence.patterns.pushreplication.exceptions.PublisherNotReadyException; import java.io.BufferedWriter; import java.util.Iterator; public class CustomPublisher implements Publisher { private String jdbcString; private String username; private String password; private transient BufferedWriter bufferedWriter; public CustomPublisher() { } public CustomPublisher(String jdbcString, String username, String password) { this.jdbcString = jdbcString; this.username = username; this.password = password; this.bufferedWriter = null; } public String getJdbcString() { return this.jdbcString; } public String getUsername() { return username; } public String getPassword() { return password; } public void publishBatch(String cacheName, String publisherName, Iterator<EntryOperation> entryOperations) { DatabasePersistence databasePersistence = new DatabasePersistence( jdbcString, username, password); while (entryOperations.hasNext()) { EntryOperation entryOperation = (EntryOperation) entryOperations .next(); databasePersistence.databasePersist(entryOperation); } } public void start(String cacheName, String publisherName) throws PublisherNotReadyException { System.err .printf("Started: Custom JDBC Publisher for Cache %s with Publisher %s\n", new Object[] { cacheName, publisherName }); } public void stop(String cacheName, String publisherName) { System.err .printf("Stopped: Custom JDBC Publisher for Cache %s with Publisher %s\n", new Object[] { cacheName, publisherName }); } } In the publishBatch method from above we inform the publisher that he is supposed to persist data to a database: DatabasePersistence databasePersistence = new DatabasePersistence( jdbcString, username, password); while (entryOperations.hasNext()) { EntryOperation entryOperation = (EntryOperation) entryOperations .next(); databasePersistence.databasePersist(entryOperation); } Step 3: The class that deals with the persistence is a very basic one that uses JDBC to perform inserts/updates against a database. package com.oracle.coherence; import com.oracle.coherence.patterns.pushreplication.EntryOperation; import java.sql.*; import java.text.SimpleDateFormat; import com.oracle.coherence.Order; public class DatabasePersistence { public static String INSERT_OPERATION = "INSERT"; public static String UPDATE_OPERATION = "UPDATE"; public Connection dbConnection; public DatabasePersistence(String jdbcString, String username, String password) { this.dbConnection = createConnection(jdbcString, username, password); } public Connection createConnection(String jdbcString, String username, String password) { Connection connection = null; System.err.println("Connecting to: " + jdbcString + " Username: " + username + " Password: " + password); try { // Load the JDBC driver String driverName = "oracle.jdbc.driver.OracleDriver"; Class.forName(driverName); // Create a connection to the database connection = DriverManager.getConnection(jdbcString, username, password); System.err.println("Connected to:" + jdbcString + " Username: " + username + " Password: " + password); } catch (ClassNotFoundException e) { e.printStackTrace(); } // driver catch (SQLException e) { e.printStackTrace(); } return connection; } public void databasePersist(EntryOperation entryOperation) { if (entryOperation.getOperation().toString() .equalsIgnoreCase(INSERT_OPERATION)) { insert(((Order) entryOperation.getPublishableEntry().getValue())); } else if (entryOperation.getOperation().toString() .equalsIgnoreCase(UPDATE_OPERATION)) { update(((Order) entryOperation.getPublishableEntry().getValue())); } } public void update(Order order) { String update = "UPDATE Orders set QUANTITY= '" + order.getQuantity() + "', AMOUNT='" + order.getAmount() + "', ORD_DATE= '" + (new SimpleDateFormat("dd-MMM-yyyy")).format(order .getOrdDate()) + "' WHERE SYMBOL='" + order.getSymbol() + "'"; System.err.println("UPDATE = " + update); try { Statement stmt = getDbConnection().createStatement(); stmt.execute(update); stmt.close(); } catch (SQLException ex) { System.err.println("SQLException: " + ex.getMessage()); } } public void insert(Order order) { String insert = "insert into Orders values('" + order.getSymbol() + "'," + order.getQuantity() + "," + order.getAmount() + ",'" + (new SimpleDateFormat("dd-MMM-yyyy")).format(order .getOrdDate()) + "')"; System.err.println("INSERT = " + insert); try { Statement stmt = getDbConnection().createStatement(); stmt.execute(insert); stmt.close(); } catch (SQLException ex) { System.err.println("SQLException: " + ex.getMessage()); } } public Connection getDbConnection() { return dbConnection; } public void setDbConnection(Connection dbConnection) { this.dbConnection = dbConnection; } } Step 4: Now we need to register our publisher against a ContentHandler. In order to achieve that we need to create in our eclipse project a new class called CustomPushReplicationNamespaceContentHandler that should extend the com.oracle.coherence.patterns.pushreplication.configuration.PushReplicationNamespaceContentHandler. In the constructor of the new class we define a new handler for our custom publisher. package com.oracle.coherence; import com.oracle.coherence.configuration.Configurator; import com.oracle.coherence.environment.extensible.ConfigurationContext; import com.oracle.coherence.environment.extensible.ConfigurationException; import com.oracle.coherence.environment.extensible.ElementContentHandler; import com.oracle.coherence.patterns.pushreplication.PublisherScheme; import com.oracle.coherence.environment.extensible.QualifiedName; import com.oracle.coherence.patterns.pushreplication.configuration.PushReplicationNamespaceContentHandler; import com.tangosol.run.xml.XmlElement; public class CustomPushReplicationNamespaceContentHandler extends PushReplicationNamespaceContentHandler { public CustomPushReplicationNamespaceContentHandler() { super(); registerContentHandler("custom-publisher-scheme", new ElementContentHandler() { public Object onElement(ConfigurationContext context, QualifiedName qualifiedName, XmlElement xmlElement) throws ConfigurationException { PublisherScheme publisherScheme = new CustomPublisherScheme(); Configurator.configure(publisherScheme, context, qualifiedName, xmlElement); return publisherScheme; } }); } } Step 5: Now we should define our CustomPublisher in the cache configuration file according to the following documentation. <cache-config xmlns:sync="class:com.oracle.coherence.CustomPushReplicationNamespaceContentHandler" xmlns:cr="class:com.oracle.coherence.environment.extensible.namespaces.InstanceNamespaceContentHandler"> <caching-schemes> <sync:provider pof-enabled="false"> <sync:coherence-provider /> </sync:provider> <caching-scheme-mapping> <cache-mapping> <cache-name>publishing-cache</cache-name> <scheme-name>distributed-scheme-with-publishing-cachestore</scheme-name> <autostart>true</autostart> <sync:publisher> <sync:publisher-name>Active2 Publisher</sync:publisher-name> <sync:publisher-scheme> <sync:remote-cluster-publisher-scheme> <sync:remote-invocation-service-name>remote-site1</sync:remote-invocation-service-name> <sync:remote-publisher-scheme> <sync:local-cache-publisher-scheme> <sync:target-cache-name>publishing-cache</sync:target-cache-name> </sync:local-cache-publisher-scheme> </sync:remote-publisher-scheme> <sync:autostart>true</sync:autostart> </sync:remote-cluster-publisher-scheme> </sync:publisher-scheme> </sync:publisher> <sync:publisher> <sync:publisher-name>Active2-Output-Publisher</sync:publisher-name> <sync:publisher-scheme> <sync:stderr-publisher-scheme> <sync:autostart>true</sync:autostart> <sync:publish-original-value>true</sync:publish-original-value> </sync:stderr-publisher-scheme> </sync:publisher-scheme> </sync:publisher> <sync:publisher> <sync:publisher-name>Active2-JDBC-Publisher</sync:publisher-name> <sync:publisher-scheme> <sync:custom-publisher-scheme> <sync:jdbc-string>jdbc:oracle:thin:@machine_name:1521:XE</sync:jdbc-string> <sync:username>hr</sync:username> <sync:password>hr</sync:password> </sync:custom-publisher-scheme> </sync:publisher-scheme> </sync:publisher> </cache-mapping> </caching-scheme-mapping> <!-- The following scheme is required for each remote-site when using a RemoteInvocationPublisher --> <remote-invocation-scheme> <service-name>remote-site1</service-name> <initiator-config> <tcp-initiator> <remote-addresses> <socket-address> <address>localhost</address> <port>20001</port> </socket-address> </remote-addresses> <connect-timeout>2s</connect-timeout> </tcp-initiator> <outgoing-message-handler> <request-timeout>5s</request-timeout> </outgoing-message-handler> </initiator-config> </remote-invocation-scheme> <!-- END: com.oracle.coherence.patterns.pushreplication --> <proxy-scheme> <service-name>ExtendTcpProxyService</service-name> <acceptor-config> <tcp-acceptor> <local-address> <address>localhost</address> <port>20002</port> </local-address> </tcp-acceptor> </acceptor-config> <autostart>true</autostart> </proxy-scheme> </caching-schemes> </cache-config> As you can see in the red-marked text from above I've:       - set new Namespace Content Handler       - define the new custom publisher that should work together with other publishers like: stderr and remote publishers in our case. Step 6: Add the com.oracle.coherence.CustomPublisherScheme to your custom-pof-config file: <pof-config> <user-type-list> <!-- Built in types --> <include>coherence-pof-config.xml</include> <include>coherence-common-pof-config.xml</include> <include>coherence-messagingpattern-pof-config.xml</include> <include>coherence-pushreplicationpattern-pof-config.xml</include> <!-- Application types --> <user-type> <type-id>1901</type-id> <class-name>com.oracle.coherence.Order</class-name> <serializer> <class-name>com.oracle.coherence.OrderSerializer</class-name> </serializer> </user-type> <user-type> <type-id>1902</type-id> <class-name>com.oracle.coherence.CustomPublisherScheme</class-name> </user-type> </user-type-list> </pof-config> CONCLUSIONSThis approach allows for publishers to publish data to almost any other receiver (database, JMS, MQ, ...). The only thing that needs to be changed is the DatabasePersistence.java class that should be adapted to the chosen receiver. Only minor changes are needed for the rest of the code (to publishBatch method from CustomPublisher class).

    Read the article

  • Is there any replication standard or concept for application server data replication

    - by Naga
    Hi friends, Say, I have a server that handles file based mass data and can process thousands of read requests and hundreds of provisioning requests(Add, modify, delete) per second. This is not SQL based database. Now i planned to implement replication. There should be master- master replication, master slave replication, partial replication(entries matching a criteria) and fractional replication(part of an entry). Is there any standard for implementing such replication mechanisms? Can you please suggest some solution overview? Thanks, Naga

    Read the article

  • Replication Services in a BI environment

    - by jorg
    In this blog post I will explain the principles of SQL Server Replication Services without too much detail and I will take a look on the BI capabilities that Replication Services could offer in my opinion. SQL Server Replication Services provides tools to copy and distribute database objects from one database system to another and maintain consistency afterwards. These tools basically copy or synchronize data with little or no transformations, they do not offer capabilities to transform data or apply business rules, like ETL tools do. The only “transformations” Replication Services offers is to filter records or columns out of your data set. You can achieve this by selecting the desired columns of a table and/or by using WHERE statements like this: SELECT <published_columns> FROM [Table] WHERE [DateTime] >= getdate() - 60 There are three types of replication: Transactional Replication This type replicates data on a transactional level. The Log Reader Agent reads directly on the transaction log of the source database (Publisher) and clones the transactions to the Distribution Database (Distributor), this database acts as a queue for the destination database (Subscriber). Next, the Distribution Agent moves the cloned transactions that are stored in the Distribution Database to the Subscriber. The Distribution Agent can either run at scheduled intervals or continuously which offers near real-time replication of data! So for example when a user executes an UPDATE statement on one or multiple records in the publisher database, this transaction (not the data itself) is copied to the distribution database and is then also executed on the subscriber. When the Distribution Agent is set to run continuously this process runs all the time and transactions on the publisher are replicated in small batches (near real-time), when it runs on scheduled intervals it executes larger batches of transactions, but the idea is the same. Snapshot Replication This type of replication makes an initial copy of database objects that need to be replicated, this includes the schemas and the data itself. All types of replication must start with a snapshot of the database objects from the Publisher to initialize the Subscriber. Transactional replication need an initial snapshot of the replicated publisher tables/objects to run its cloned transactions on and maintain consistency. The Snapshot Agent copies the schemas of the tables that will be replicated to files that will be stored in the Snapshot Folder which is a normal folder on the file system. When all the schemas are ready, the data itself will be copied from the Publisher to the snapshot folder. The snapshot is generated as a set of bulk copy program (BCP) files. Next, the Distribution Agent moves the snapshot to the Subscriber, if necessary it applies schema changes first and copies the data itself afterwards. The application of schema changes to the Subscriber is a nice feature, when you change the schema of the Publisher with, for example, an ALTER TABLE statement, that change is propagated by default to the Subscriber(s). Merge Replication Merge replication is typically used in server-to-client environments, for example when subscribers need to receive data, make changes offline, and later synchronize changes with the Publisher and other Subscribers, like with mobile devices that need to synchronize one in a while. Because I don’t really see BI capabilities here, I will not explain this type of replication any further. Replication Services in a BI environment Transactional Replication can be very useful in BI environments. In my opinion you never want to see users to run custom (SSRS) reports or PowerPivot solutions directly on your production database, it can slow down the system and can cause deadlocks in the database which can cause errors. Transactional Replication can offer a read-only, near real-time database for reporting purposes with minimal overhead on the source system. Snapshot Replication can also be useful in BI environments, if you don’t need a near real-time copy of the database, you can choose to use this form of replication. Next to an alternative for Transactional Replication it can be used to stage data so it can be transformed and moved into the data warehousing environment afterwards. In many solutions I have seen developers create multiple SSIS packages that simply copies data from one or more source systems to a staging database that figures as source for the ETL process. The creation of these packages takes a lot of (boring) time, while Replication Services can do the same in minutes. It is possible to filter out columns and/or records and it can even apply schema changes automatically so I think it offers enough features here. I don’t know how the performance will be and if it really works as good for this purpose as I expect, but I want to try this out soon!

    Read the article

  • re-enabling a table for mysql replication

    - by jessieE
    We were able to setup mysql master-slave replication with the following version on both master/slave: mysqld Ver 5.5.28-29.1-log for Linux on x86_64 (Percona Server (GPL), Release 29.1) One day, we noticed that replication has stopped, we tried skipping over the entries that caused the replication errors. The errors persisted so we decided to skip replication for the 4 problematic tables. The slave has now caught up with the master except for the 4 tables. What is the best way to enable replication again for the 4 tables? This is what I have in mind but I don't know if it will work: 1) Modify slave config to enable replication again for the 4 tables 2) stop slave replication 3) for each of the 4 tables, use pt-table-sync --execute --verbose --print --sync-to-master h=localhost,D=mydb,t=mytable 4) restart slave database to reload replication configuration 5) start slave replication

    Read the article

  • Manage and Monitor Identity Ranges in SQL Server Transactional Replication

    - by Yaniv Etrogi
    Problem When using transactional replication to replicate data in a one way topology from a publisher to a read-only subscriber(s) there is no need to manage identity ranges. However, when using  transactional replication to replicate data in a two way replication topology - between two or more servers there is a need to manage identity ranges in order to prevent a situation where an INSERT commands fails on a PRIMARY KEY violation error  due to the replicated row being inserted having a value for the identity column which already exists at the destination database. Solution There are two ways to address this situation: Assign a range of identity values per each server. Work with parallel identity values. The first method requires some maintenance while the second method does not and so the scripts provided with this article are very useful for anyone using the first method. I will explore this in more detail later in the article. In the first solution set server1 to work in the range of 1 to 1,000,000,000 and server2 to work in the range of 1,000,000,001 to 2,000,000,000.  The ranges are set and defined using the DBCC CHECKIDENT command and when the ranges in this example are well maintained you meet the goal of preventing the INSERT commands to fall due to a PRIMARY KEY violation. The first insert at server1 will get the identity value of 1, the second insert will get the value of 2 and so on while on server2 the first insert will get the identity value of 1000000001, the second insert 1000000002 and so on thus avoiding a conflict. Be aware that when a row is inserted the identity value (seed) is generated as part of the insert command at each server and the inserted row is replicated. The replicated row includes the identity column’s value so the data remains consistent across all servers but you will be able to tell on what server the original insert took place due the range that  the identity value belongs to. In the second solution you do not manage ranges but enforce a situation in which identity values can never get overlapped by setting the first identity value (seed) and the increment property one time only during the CREATE TABLE command of each table. So a table on server1 looks like this: CREATE TABLE T1 (  c1 int NOT NULL IDENTITY(1, 5) PRIMARY KEY CLUSTERED ,c2 int NOT NULL ); And a table on server2 looks like this: CREATE TABLE T1(  c1 int NOT NULL IDENTITY(2, 5) PRIMARY KEY CLUSTERED ,c2 int NOT NULL ); When these two tables are inserted the results of the identity values look like this: Server1:  1, 6, 11, 16, 21, 26… Server2:  2, 7, 12, 17, 22, 27… This assures no identity values conflicts while leaving a room for 3 additional servers to participate in this same environment. You can go up to 9 servers using this method by setting an increment value of 9 instead of 5 as I used in this example. Continues…

    Read the article

  • mysql replication 1x master, 1x slave

    - by clarkk
    I have just setup one master and one slave server, but its not working.. On my website I connect to the slave server and I insert some rows, but they do not appear on the master and vice versa.. What is wrong? This is what I did: Master: -> /etc/mysql/my.cnf [mysqld] log-bin = mysql-master-bin server-id=1 # bind-address = 127.0.0.1 binlog-do-db = test_db Slave: -> /etc/mysql/my.cnf [mysqld] log-bin = mysql-slave-bin server-id=2 # bind-address = 127.0.0.1 replicate-do-db = test_db Slave: terminal 0 > mysql> STOP SLAVE; // and drop tables Master: terminal 1 > mysql> CREATE USER 'repl_slave'@'slave_ip' IDENTIFIED BY 'repl_pass'; mysql> GRANT REPLICATION SLAVE ON *.* TO 'repl_slave'@'slave_ip'; mysql> FLUSH PRIVILEGES; mysql> FLUSH TABLES WITH READ LOCK; -- leave terminal open terminal 2 > shell> mysqldump -u root -pPASSWORD test_db --lock-all-tables > dump.sql mysql> SHOW MASTER STATUS; Slave: terminal 3 > shell> mysql -u root -pPASSWORD test_db < dump.sql terminal 0 > mysql> CHANGE MASTER TO mysql> MASTER_HOST='master_ip', mysql> MASTER_USER='repl_slave', mysql> MASTER_PASSWORD='repl_pass', mysql> MASTER_PORT=3306, mysql> MASTER_LOG_FILE='mysql-master-bin.000003', // terminal 2 > SHOW MASTER STATUS mysql> MASTER_LOG_POS=4, // terminal 2 > SHOW MASTER STATUS mysql> MASTER_CONNECT_RETRY=10; mysql> START SLAVE; mysql> SHOW SLAVE STATUS; Here is the slave status: Array ( [Slave_IO_State] => Waiting for master to send event [Master_Host] => xx.xx.xx.xx [Master_User] => repl_slave [Master_Port] => 3306 [Connect_Retry] => 10 [Master_Log_File] => mysql-master-bin.000003 [Read_Master_Log_Pos] => 106 [Relay_Log_File] => mysqld-relay-bin.000002 [Relay_Log_Pos] => 258 [Relay_Master_Log_File] => mysql-master-bin.000003 [Slave_IO_Running] => Yes [Slave_SQL_Running] => Yes [Replicate_Do_DB] => test_db [Replicate_Ignore_DB] => [Replicate_Do_Table] => [Replicate_Ignore_Table] => [Replicate_Wild_Do_Table] => [Replicate_Wild_Ignore_Table] => [Last_Errno] => 0 [Last_Error] => [Skip_Counter] => 0 [Exec_Master_Log_Pos] => 106 [Relay_Log_Space] => 414 [Until_Condition] => None [Until_Log_File] => [Until_Log_Pos] => 0 [Master_SSL_Allowed] => No [Master_SSL_CA_File] => [Master_SSL_CA_Path] => [Master_SSL_Cert] => [Master_SSL_Cipher] => [Master_SSL_Key] => [Seconds_Behind_Master] => 0 [Master_SSL_Verify_Server_Cert] => No [Last_IO_Errno] => 0 [Last_IO_Error] => [Last_SQL_Errno] => 0 [Last_SQL_Error] => )

    Read the article

  • Replication Services as ETL extraction tool

    - by jorg
    In my last blog post I explained the principles of Replication Services and the possibilities it offers in a BI environment. One of the possibilities I described was the use of snapshot replication as an ETL extraction tool: “Snapshot Replication can also be useful in BI environments, if you don’t need a near real-time copy of the database, you can choose to use this form of replication. Next to an alternative for Transactional Replication it can be used to stage data so it can be transformed and moved into the data warehousing environment afterwards. In many solutions I have seen developers create multiple SSIS packages that simply copies data from one or more source systems to a staging database that figures as source for the ETL process. The creation of these packages takes a lot of (boring) time, while Replication Services can do the same in minutes. It is possible to filter out columns and/or records and it can even apply schema changes automatically so I think it offers enough features here. I don’t know how the performance will be and if it really works as good for this purpose as I expect, but I want to try this out soon!” Well I have tried it out and I must say it worked well. I was able to let replication services do work in a fraction of the time it would cost me to do the same in SSIS. What I did was the following: Configure snapshot replication for some Adventure Works tables, this was quite simple and straightforward. Create an SSIS package that executes the snapshot replication on demand and waits for its completion. This is something that you can’t do with out of the box functionality. While configuring the snapshot replication two SQL Agent Jobs are created, one for the creation of the snapshot and one for the distribution of the snapshot. Unfortunately these jobs are  asynchronous which means that if you execute them they immediately report back if the job started successfully or not, they do not wait for completion and report its result afterwards. So I had to create an SSIS package that executes the jobs and waits for their completion before the rest of the ETL process continues. Fortunately I was able to create the SSIS package with the desired functionality. I have made a step-by-step guide that will help you configure the snapshot replication and I have uploaded the SSIS package you need to execute it. Configure snapshot replication   The first step is to create a publication on the database you want to replicate. Connect to SQL Server Management Studio and right-click Replication, choose for New.. Publication…   The New Publication Wizard appears, click Next Choose your “source” database and click Next Choose Snapshot publication and click Next   You can now select tables and other objects that you want to publish Expand Tables and select the tables that are needed in your ETL process In the next screen you can add filters on the selected tables which can be very useful. Think about selecting only the last x days of data for example. Its possible to filter out rows and/or columns. In this example I did not apply any filters. Schedule the Snapshot Agent to run at a desired time, by doing this a SQL Agent Job is created which we need to execute from a SSIS package later on. Next you need to set the Security Settings for the Snapshot Agent. Click on the Security Settings button.   In this example I ran the Agent under the SQL Server Agent service account. This is not recommended as a security best practice. Fortunately there is an excellent article on TechNet which tells you exactly how to set up the security for replication services. Read it here and make sure you follow the guidelines!   On the next screen choose to create the publication at the end of the wizard Give the publication a name (SnapshotTest) and complete the wizard   The publication is created and the articles (tables in this case) are added Now the publication is created successfully its time to create a new subscription for this publication.   Expand the Replication folder in SSMS and right click Local Subscriptions, choose New Subscriptions   The New Subscription Wizard appears   Select the publisher on which you just created your publication and select the database and publication (SnapshotTest)   You can now choose where the Distribution Agent should run. If it runs at the distributor (push subscriptions) it causes extra processing overhead. If you use a separate server for your ETL process and databases choose to run each agent at its subscriber (pull subscriptions) to reduce the processing overhead at the distributor. Of course we need a database for the subscription and fortunately the Wizard can create it for you. Choose for New database   Give the database the desired name, set the desired options and click OK You can now add multiple SQL Server Subscribers which is not necessary in this case but can be very useful.   You now need to set the security settings for the Distribution Agent. Click on the …. button Again, in this example I ran the Agent under the SQL Server Agent service account. Read the security best practices here   Click Next   Make sure you create a synchronization job schedule again. This job is also necessary in the SSIS package later on. Initialize the subscription at first synchronization Select the first box to create the subscription when finishing this wizard Complete the wizard by clicking Finish The subscription will be created In SSMS you see a new database is created, the subscriber. There are no tables or other objects in the database available yet because the replication jobs did not ran yet. Now expand the SQL Server Agent, go to Jobs and search for the job that creates the snapshot:   Rename this job to “CreateSnapshot” Now search for the job that distributes the snapshot:   Rename this job to “DistributeSnapshot” Create an SSIS package that executes the snapshot replication We now need an SSIS package that will take care of the execution of both jobs. The CreateSnapshot job needs to execute and finish before the DistributeSnapshot job runs. After the DistributeSnapshot job has started the package needs to wait until its finished before the package execution finishes. The Execute SQL Server Agent Job Task is designed to execute SQL Agent Jobs from SSIS. Unfortunately this SSIS task only executes the job and reports back if the job started succesfully or not, it does not report if the job actually completed with success or failure. This is because these jobs are asynchronous. The SSIS package I’ve created does the following: It runs the CreateSnapshot job It checks every 5 seconds if the job is completed with a for loop When the CreateSnapshot job is completed it starts the DistributeSnapshot job And again it waits until the snapshot is delivered before the package will finish successfully Quite simple and the package is ready to use as standalone extract mechanism. After executing the package the replicated tables are added to the subscriber database and are filled with data:   Download the SSIS package here (SSIS 2008) Conclusion In this example I only replicated 5 tables, I could create a SSIS package that does the same in approximately the same amount of time. But if I replicated all the 70+ AdventureWorks tables I would save a lot of time and boring work! With replication services you also benefit from the feature that schema changes are applied automatically which means your entire extract phase wont break. Because a snapshot is created using the bcp utility (bulk copy) it’s also quite fast, so the performance will be quite good. Disadvantages of using snapshot replication as extraction tool is the limitation on source systems. You can only choose SQL Server or Oracle databases to act as a publisher. So if you plan to build an extract phase for your ETL process that will invoke a lot of tables think about replication services, it would save you a lot of time and thanks to the Extract SSIS package I’ve created you can perfectly fit it in your usual SSIS ETL process.

    Read the article

  • Big Data – Learning Basics of Big Data in 21 Days – Bookmark

    - by Pinal Dave
    Earlier this month I had a great time to write Bascis of Big Data series. This series received great response and lots of good comments I have received, I am going to follow up this basics series with further in-depth series in near future. Here is the consolidated blog post where you can find all the 21 days blog posts together. Bookmark this page for future reference. Big Data – Beginning Big Data – Day 1 of 21 Big Data – What is Big Data – 3 Vs of Big Data – Volume, Velocity and Variety – Day 2 of 21 Big Data – Evolution of Big Data – Day 3 of 21 Big Data – Basics of Big Data Architecture – Day 4 of 21 Big Data – Buzz Words: What is NoSQL – Day 5 of 21 Big Data – Buzz Words: What is Hadoop – Day 6 of 21 Big Data – Buzz Words: What is MapReduce – Day 7 of 21 Big Data – Buzz Words: What is HDFS – Day 8 of 21 Big Data – Buzz Words: Importance of Relational Database in Big Data World – Day 9 of 21 Big Data – Buzz Words: What is NewSQL – Day 10 of 21 Big Data – Role of Cloud Computing in Big Data – Day 11 of 21 Big Data – Operational Databases Supporting Big Data – RDBMS and NoSQL – Day 12 of 21 Big Data – Operational Databases Supporting Big Data – Key-Value Pair Databases and Document Databases – Day 13 of 21 Big Data – Operational Databases Supporting Big Data – Columnar, Graph and Spatial Database – Day 14 of 21 Big DataData Mining with Hive – What is Hive? – What is HiveQL (HQL)? – Day 15 of 21 Big Data – Interacting with Hadoop – What is PIG? – What is PIG Latin? – Day 16 of 21 Big Data – Interacting with Hadoop – What is Sqoop? – What is Zookeeper? – Day 17 of 21 Big Data – Basics of Big Data Analytics – Day 18 of 21 Big Data – How to become a Data Scientist and Learn Data Science? – Day 19 of 21 Big Data – Various Learning Resources – How to Start with Big Data? – Day 20 of 21 Big Data – Final Wrap and What Next – Day 21 of 21 Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

    Read the article

  • MySQL Connect 8 Days Away - Replication Sessions

    - by Mat Keep
    Following on from my post about MySQL Cluster sessions at the forthcoming Connect conference, its now the turn of MySQL Replication - another technology at the heart of scaling and high availability for MySQL. Unless you've only just returned from a 6-month alien abduction, you will know that MySQL 5.6 includes the largest set of replication enhancements ever packaged into a single new release: - Global Transaction IDs + HA utilities for self-healing cluster..(yes both automatic failover and manual switchover available!) - Crash-safe slaves and binlog - Binlog Group Commit and Multi-Threaded Slaves for high performance - Replication Event Checksums and Time-Delayed replication - and many more There are a number of sessions dedicated to learn more about these important new enhancements, delivered by the same engineers who developed them. Here is a summary Saturday 29th, 13.00 Replication Tips and Tricks, Mats Kindahl In this session, the developers of MySQL Replication present a bag of useful tips and tricks related to the MySQL 5.5 GA and MySQL 5.6 development milestone releases, including multisource replication, using logs for auditing, handling filtering, examining the binary log, using relay slaves, splitting the replication stream, and handling failover. Saturday 29th, 17.30 Enabling the New Generation of Web and Cloud Services with MySQL 5.6 Replication, Lars Thalmann This session showcases the new replication features, including • High performance (group commit, multithreaded slave) • High availability (crash-safe slaves, failover utilities) • Flexibility and usability (global transaction identifiers, annotated row-based replication [RBR]) • Data integrity (event checksums) Saturday 29th, 1900 MySQL Replication Birds of a Feather In this session, the MySQL Replication engineers discuss all the goodies, including global transaction identifiers (GTIDs) with autofailover; multithreaded, crash-safe slaves; checksums; and more. The team discusses the design behind these enhancements and how to get started with them. You will get the opportunity to present your feedback on how these can be further enhanced and can share any additional replication requirements you have to further scale your critical MySQL-based workloads. Sunday 30th, 10.15 Hands-On Lab, MySQL Replication, Luis Soares and Sven Sandberg But how do you get started, how does it work, and what are the best practices and tools? During this hands-on lab, you will learn how to get started with replication, how it works, architecture, replication prerequisites, setting up a simple topology, and advanced replication configurations. The session also covers some of the new features in the MySQL 5.6 development milestone releases. Sunday 30th, 13.15 Hands-On Lab, MySQL Utilities, Chuck Bell Would you like to learn how to more effectively manage a host of MySQL servers and manage high-availability features such as replication? This hands-on lab addresses these areas and more. Participants will get familiar with all of the MySQL utilities, using each of them with a variety of options to configure and manage MySQL servers. Sunday 30th, 14.45 Eliminating Downtime with MySQL Replication, Luis Soares The presentation takes a deep dive into new replication features such as global transaction identifiers and crash-safe slaves. It also showcases a range of Python utilities that, combined with the Release 5.6 feature set, results in a self-healing data infrastructure. By the end of the session, attendees will be familiar with the new high-availability features in the whole MySQL 5.6 release and how to make use of them to protect and grow their business. Sunday 30th, 17.45 Scaling for the Web and the Cloud with MySQL Replication, Luis Soares In a Replication topology, high performance directly translates into improving read consistency from slaves and reducing the risk of data loss if a master fails. MySQL 5.6 introduces several new replication features to enhance performance. In this session, you will learn about these new features, how they work, and how you can leverage them in your applications. In addition, you will learn about some other best practices that can be used to improve performance. So how can you make sure you don't miss out - the good news is that registration is still open ;-) And just to whet your appetite, listen to the On-Demand webinar that presents an overview of MySQL 5.6 Replication.  

    Read the article

  • PostgreSQL 9.1 Database Replication Between Two Production Environments with Load Balancer

    - by littleK
    I'm investigating different solutions for database replication between two PostgreSQL 9.1 databases. The setup will include two production servers on the cloud (Amazon EC2 X-Large Instances), with an elastic load balancer. What is the typical database implementation for for this type of setup? A master-master replication (with Bucardo or rubyrep)? Or perhaps use only one shared database between the two environments, with a shared disk failover? I've been getting some ideas from http://www.postgresql.org/docs/9.0/static/different-replication-solutions.html. Since I don't have a lot of experience in database replication, I figured I would ask the experts. What would you recommend for the described setup?

    Read the article

  • <solved> MySQL Replication A->B->C

    - by nonus25
    I was setting the MySQL Replication for master - slave/master - slave and Replication for master - slave its works fine but when i have enable this option in my.cnf log-slave-updates=1 for updating the master bin log my replications is starting be slower and the time Seconds_Behind_Master is growing. I use innodb engine but the DB is big. Any idea how i can improve the replication, looks like the network is not the issue. Also i was think to use binlog_format=ROW but master is using default setting for replication 'statement' and i cant reset master ;) Thanks ...

    Read the article

  • Data Quality Through Data Governance

    Data Quality Governance Data quality is very important to every organization, bad data cost an organization time, money, and resources that could be prevented if the proper governance was put in to place.  Data Governance Program Criteria: Support from Executive Management and all Business Units Data Stewardship Program  Cross Functional Team of Data Stewards Data Governance Committee Quality Structured Data It should go without saying but any successful project in today’s business world must get buy in from executive management and all stakeholders involved with the project. If management does not fully support a project because they see it is in there and the company’s best interest then they will remove/eliminate funding, resources and allocated time to work on the project. In essence they can render a project dead until it is official killed by the business. In addition, buy in from stake holders is also very important because they can cause delays increased spending in time, money and resources because they do not support a project. Data Stewardship programs are administered by a data steward manager who primary focus is to support, train and manage a cross functional data stewards team. A cross functional team of data stewards are pulled from various departments act to ensure that all systems work to ensure that an organization’s goals are achieved. Typically, data stewards are subject matter experts that act as mediators between their respective departments and IT. Data Quality Procedures Data Governance Committees are composed of data stewards, Upper management, IT Leadership and various subject matter experts depending on a company. The primary goal of this committee is to define strategic goals, coordinate activities, set data standards and offer data guidelines for the business. Data Quality Policies In 1997, Claudia Imhoff defined a Data Stewardship’s responsibility as to approve business naming standards, develop consistent data definitions, determine data aliases, develop standard calculations and derivations, document the business rules of the corporation, monitor the quality of the data in the data warehouse, define security requirements, and so forth. She further explains data stewards responsible for creating and enforcing polices on the following but not limited to issues. Resolving Data Integration Issues Determining Data Security Documenting Data Definitions, Calculations, Summarizations, etc. Maintaining/Updating Business Rules Analyzing and Improving Data Quality

    Read the article

  • SQL SERVER – Advanced Data Quality Services with Melissa Data – Azure Data Market

    - by pinaldave
    There has been much fanfare over the new SQL Server 2012, and especially around its new companion product Data Quality Services (DQS). Among the many new features is the addition of this integrated knowledge-driven product that enables data stewards everywhere to profile, match, and cleanse data. In addition to the homegrown rules that data stewards can design and implement, there are also connectors to third party providers that are hosted in the Azure Datamarket marketplace.  In this review, I leverage SQL Server 2012 Data Quality Services, and proceed to subscribe to a third party data cleansing product through the Datamarket to showcase this unique capability. Crucial Questions For the purposes of the review, I used a database I had in an Excel spreadsheet with name and address information. Upon a cursory inspection, there are miscellaneous problems with these records; some addresses are missing ZIP codes, others missing a city, and some records are slightly misspelled or have unparsed suites. With DQS, I can easily add a knowledge base to help standardize my values, such as for state abbreviations. But how do I know that my address is correct? And if my address is not correct, what should it be corrected to? The answer lies in a third party knowledge base by the acknowledged USPS certified address accuracy experts at Melissa Data. Reference Data Services Within DQS there is a handy feature to actually add reference data from many different third-party Reference Data Services (RDS) vendors. DQS simplifies the processes of cleansing, standardizing, and enriching data through custom rules and through service providers from the Azure Datamarket. A quick jump over to the Datamarket site shows me that there are a handful of providers that offer data directly through Data Quality Services. Upon subscribing to these services, one can attach a DQS domain or composite domain (fields in a record) to a reference data service provider, and begin using it to cleanse, standardize, and enrich that data. Besides what I am looking for (address correction and enrichment), it is possible to subscribe to a host of other services including geocoding, IP address reference, phone checking and enrichment, as well as name parsing, standardization, and genderization.  These capabilities extend the data quality that DQS has natively by quite a bit. For my current address correction review, I needed to first sign up to a reference data provider on the Azure Data Market site. For this example, I used Melissa Data’s Address Check Service. They offer free one-month trials, so if you wish to follow along, or need to add address quality to your own data, I encourage you to sign up with them. Once I subscribed to the desired Reference Data Provider, I navigated my browser to the Account Keys within My Account to view the generated account key, which I then inserted into the DQS Client – Configuration under the Administration area. Step by Step to Guide That was all it took to hook in the subscribed provider -Melissa Data- directly to my DQS Client. The next step was for me to attach and map in my Reference Data from the newly acquired reference data provider, to a domain in my knowledge base. On the DQS Client home screen, I selected “New Knowledge Base” under Knowledge Base Management on the left-hand side of the home screen. Under New Knowledge Base, I typed a Name and description of my new knowledge base, then proceeded to the Domain Management screen. Here I established a series of domains (fields) and then linked them all together as a composite domain (record set). Using the Create Domain button, I created the following domains according to the fields in my incoming data: Name Address Suite City State Zip I added a Suite column in my domain because Melissa Data has the ability to return missing Suites based on last name or company. And that’s a great benefit of using these third party providers, as they have data that the data steward would not normally have access to. The bottom line is, with these third party data providers, I can actually improve my data. Next, I created a composite domain (fulladdress) and added the (field) domains into the composite domain. This essentially groups our address fields together in a record to facilitate the full address cleansing they perform. I then selected my newly created composite domain and under the Reference Data tab, added my third party reference data provider –Melissa Data’s Address Check- and mapped in each domain that I had to the provider’s Schema. Now that my composite domain has been married to the Reference Data service, I can take the newly published knowledge base and create a project to cleanse and enrich my data. My next task was to create a new Data Quality project, mapping in my data source and matching it to the appropriate domain column, and then kick off the verification process. It took just a few minutes with some progress indicators indicating that it was working. When the process concluded, there was a helpful set of tabs that place the response records into categories: suggested; new; invalid; corrected (automatically); and correct. Accepting the suggestions provided by  Melissa Data allowed me to clean up all the records and flag the invalid ones. It is very apparent that DQS makes address data quality simplistic for any IT professional. Final Note As I have shown, DQS makes data quality very easy. Within minutes I was able to set up a data cleansing and enrichment routine within my data quality project, and ensure that my address data was clean, verified, and standardized against real reference data. As reviewed here, it’s easy to see how both SQL Server 2012 and DQS work to take what used to require a highly skilled developer, and empower an average business or database person to consume external services and clean data. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, SQL Utility, T SQL, Technology Tagged: DQS

    Read the article

  • SQL 2008 R2 replication error: The process could not connect to Distributor

    - by Lance Lefebure
    I have two servers running SQL 2008 R2 Standard, each with an instance named "MAIN". I have a small test database on my primary server (one table, 13 rows) that I want to replicate to a second server as a proof-of-concept for some larger databases that I want to replicate. I set up the primary server to be a publisher and distributor, and set the database to do transactional replication. I copied the data to the second server via a backup/restore, not via a snapshot (which I'll have to do with the larger databases due to database size and limited bandwidth). I followed the instructions here: http://gnawgnu.blogspot.com/2009/11/sql-2008-transactional-replication-and.html Now on the subscriber, I go under Replication / Local Subscriptions / Right click / Properties on my subscription to the DB. The status of the last synchronization shows a status of: "The process could not connect to Distributor 'PRIMARYSERVER\MAIN'." Data IS replicating from the primary to the secondary. Any record I add on the primary shows up on the secondary server within seconds. Is the Distributor part of the Snapshot system that I'm not using, or is it part of the transaction replication stuff? Thanks, Lance

    Read the article

  • why use mixed-based replication for mysql

    - by Alistair Prestidge
    I am in the process of configuring MySQL replication and am intending to use row-based-replication but I was also reading up about mixed-based replication. This is where statement-based is the default and then for certain circumstances (http://dev.mysql.com/doc/refman/5.1/en/binary-log-mixed.html) MySQL will switch to row-based. The list is quit vast on when it will switch to row-based. My questions are: Does any one use mixed? If yes why did you chose this over just using one or the other? Thanks in advance

    Read the article

  • Big Data – Is Big Data Relevant to me? – Big Data Questionnaires – Guest Post by Vinod Kumar

    - by Pinal Dave
    This guest post is by Vinod Kumar. Vinod Kumar has worked with SQL Server extensively since joining the industry over a decade ago. Working on various versions of SQL Server 7.0, Oracle 7.3 and other database technologies – he now works with the Microsoft Technology Center (MTC) as a Technology Architect. Let us read the blog post in Vinod’s own voice. I think the series from Pinal is a good one for anyone planning to start on Big Data journey from the basics. In my daily customer interactions this buzz of “Big Data” always comes up, I react generally saying – “Sir, do you really have a ‘Big Data’ problem or do you have a big Data problem?” Generally, there is a silence in the air when I ask this question. Data is everywhere in organizations – be it big data, small data, all data and for few it is bad data which is same as no data :). Wow, don’t discount me as someone who opposes “Big Data”, I am a big supporter as much as I am a critic of the abuse of this term by the people. In this post, I wanted to let my mind flow so that you can also think in the direction I want you to see these concepts. In any case, this is not an exhaustive dump of what is in my mind – but you will surely get the drift how I am going to question Big Data terms from customers!!! Is Big Data Relevant to me? Many of my customers talk to me like blank whiteboard with no idea – “why Big Data”. They want to jump into the bandwagon of technology and they want to decipher insights from their unexplored data a.k.a. unstructured data with structured data. So what are these industry scenario’s that come to mind? Here are some of them: Financials Fraud detection: Banks and Credit cards are monitoring your spending habits on real-time basis. Customer Segmentation: applies in every industry from Banking to Retail to Aviation to Utility and others where they deal with end customer who consume their products and services. Customer Sentiment Analysis: Responding to negative brand perception on social or amplify the positive perception. Sales and Marketing Campaign: Understand the impact and get closer to customer delight. Call Center Analysis: attempt to take unstructured voice recordings and analyze them for content and sentiment. Medical Reduce Re-admissions: How to build a proactive follow-up engagements with patients. Patient Monitoring: How to track Inpatient, Out-Patient, Emergency Visits, Intensive Care Units etc. Preventive Care: Disease identification and Risk stratification is a very crucial business function for medical. Claims fraud detection: There is no precise dollars that one can put here, but this is a big thing for the medical field. Retail Customer Sentiment Analysis, Customer Care Centers, Campaign Management. Supply Chain Analysis: Every sensors and RFID data can be tracked for warehouse space optimization. Location based marketing: Based on where a check-in happens retail stores can be optimize their marketing. Telecom Price optimization and Plans, Finding Customer churn, Customer loyalty programs Call Detail Record (CDR) Analysis, Network optimizations, User Location analysis Customer Behavior Analysis Insurance Fraud Detection & Analysis, Pricing based on customer Sentiment Analysis, Loyalty Management Agents Analysis, Customer Value Management This list can go on to other areas like Utility, Manufacturing, Travel, ITES etc. So as you can see, there are obviously interesting use cases for each of these industry verticals. These are just representative list. Where to start? A lot of times I try to quiz customers on a number of dimensions before starting a Big Data conversation. Are you getting the data you need the way you want it and in a timely manner? Can you get in and analyze the data you need? How quickly is IT to respond to your BI Requests? How easily can you get at the data that you need to run your business/department/project? How are you currently measuring your business? Can you get the data you need to react WITHIN THE QUARTER to impact behaviors to meet your numbers or is it always “rear-view mirror?” How are you measuring: The Brand Customer Sentiment Your Competition Your Pricing Your performance Supply Chain Efficiencies Predictive product / service positioning What are your key challenges of driving collaboration across your global business?  What the challenges in innovation? What challenges are you facing in getting more information out of your data? Note: Garbage-in is Garbage-out. Hold good for all reporting / analytics requirements Big Data POCs? A number of customers get into the realm of setting a small team to work on Big Data – well it is a great start from an understanding point of view, but I tend to ask a number of other questions to such customers. Some of these common questions are: To what degree is your advanced analytics (natural language processing, sentiment analysis, predictive analytics and classification) paired with your Big Data’s efforts? Do you have dedicated resources exploring the possibilities of advanced analytics in Big Data for your business line? Do you plan to employ machine learning technology while doing Advanced Analytics? How is Social Media being monitored in your organization? What is your ability to scale in terms of storage and processing power? Do you have a system in place to sort incoming data in near real time by potential value, data quality, and use frequency? Do you use event-driven architecture to manage incoming data? Do you have specialized data services that can accommodate different formats, security, and the management requirements of multiple data sources? Is your organization currently using or considering in-memory analytics? To what degree are you able to correlate data from your Big Data infrastructure with that from your enterprise data warehouse? Have you extended the role of Data Stewards to include ownership of big data components? Do you prioritize data quality based on the source system (that is Facebook/Twitter data has lower quality thresholds than radio frequency identification (RFID) for a tracking system)? Do your retention policies consider the different legal responsibilities for storing Big Data for a specific amount of time? Do Data Scientists work in close collaboration with Data Stewards to ensure data quality? How is access to attributes of Big Data being given out in the organization? Are roles related to Big Data (Advanced Analyst, Data Scientist) clearly defined? How involved is risk management in the Big Data governance process? Is there a set of documented policies regarding Big Data governance? Is there an enforcement mechanism or approach to ensure that policies are followed? Who is the key sponsor for your Big Data governance program? (The CIO is best) Do you have defined policies surrounding the use of social media data for potential employees and customers, as well as the use of customer Geo-location data? How accessible are complex analytic routines to your user base? What is the level of involvement with outside vendors and third parties in regard to the planning and execution of Big Data projects? What programming technologies are utilized by your data warehouse/BI staff when working with Big Data? These are some of the important questions I ask each customer who is actively evaluating Big Data trends for their organizations. These questions give you a sense of direction where to start, what to use, how to secure, how to analyze and more. Sign off Any Big data is analysis is incomplete without a compelling story. The best way to understand this is to watch Hans Rosling – Gapminder (2:17 to 6:06) videos about the third world myths. Don’t get overwhelmed with the Big Data buzz word, the destination to what your data speaks is important. In this blog post, we did not particularly look at any Big Data technologies. This is a set of questionnaire one needs to keep in mind as they embark their journey of Big Data. I did write some of the basics in my blog: Big Data – Big Hype yet Big Opportunity. Do let me know if these questions make sense?  Reference: Pinal Dave (http://blog.sqlauthority.com)Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

    Read the article

  • Big Data – Basics of Big Data Architecture – Day 4 of 21

    - by Pinal Dave
    In yesterday’s blog post we understood how Big Data evolution happened. Today we will understand basics of the Big Data Architecture. Big Data Cycle Just like every other database related applications, bit data project have its development cycle. Though three Vs (link) for sure plays an important role in deciding the architecture of the Big Data projects. Just like every other project Big Data project also goes to similar phases of the data capturing, transforming, integrating, analyzing and building actionable reporting on the top of  the data. While the process looks almost same but due to the nature of the data the architecture is often totally different. Here are few of the question which everyone should ask before going ahead with Big Data architecture. Questions to Ask How big is your total database? What is your requirement of the reporting in terms of time – real time, semi real time or at frequent interval? How important is the data availability and what is the plan for disaster recovery? What are the plans for network and physical security of the data? What platform will be the driving force behind data and what are different service level agreements for the infrastructure? This are just basic questions but based on your application and business need you should come up with the custom list of the question to ask. As I mentioned earlier this question may look quite simple but the answer will not be simple. When we are talking about Big Data implementation there are many other important aspects which we have to consider when we decide to go for the architecture. Building Blocks of Big Data Architecture It is absolutely impossible to discuss and nail down the most optimal architecture for any Big Data Solution in a single blog post, however, we can discuss the basic building blocks of big data architecture. Here is the image which I have built to explain how the building blocks of the Big Data architecture works. Above image gives good overview of how in Big Data Architecture various components are associated with each other. In Big Data various different data sources are part of the architecture hence extract, transform and integration are one of the most essential layers of the architecture. Most of the data is stored in relational as well as non relational data marts and data warehousing solutions. As per the business need various data are processed as well converted to proper reports and visualizations for end users. Just like software the hardware is almost the most important part of the Big Data Architecture. In the big data architecture hardware infrastructure is extremely important and failure over instances as well as redundant physical infrastructure is usually implemented. NoSQL in Data Management NoSQL is a very famous buzz word and it really means Not Relational SQL or Not Only SQL. This is because in Big Data Architecture the data is in any format. It can be unstructured, relational or in any other format or from any other data source. To bring all the data together relational technology is not enough, hence new tools, architecture and other algorithms are invented which takes care of all the kind of data. This is collectively called NoSQL. Tomorrow Next four days we will answer the Buzz Words – Hadoop. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

    Read the article

  • Big Data – Evolution of Big Data – Day 3 of 21

    - by Pinal Dave
    In yesterday’s blog post we answered what is the Big Data. Today we will understand why and how the evolution of Big Data has happened. Though the answer is very simple, I would like to tell it in the form of a history lesson. Data in Flat File In earlier days data was stored in the flat file and there was no structure in the flat file.  If any data has to be retrieved from the flat file it was a project by itself. There was no possibility of retrieving the data efficiently and data integrity has been just a term discussed without any modeling or structure around. Database residing in the flat file had more issues than we would like to discuss in today’s world. It was more like a nightmare when there was any data processing involved in the application. Though, applications developed at that time were also not that advanced the need of the data was always there and there was always need of proper data management. Edgar F Codd and 12 Rules Edgar Frank Codd was a British computer scientist who, while working for IBM, invented the relational model for database management, the theoretical basis for relational databases. He presented 12 rules for the Relational Database and suddenly the chaotic world of the database seems to see discipline in the rules. Relational Database was a promising land for all the unstructured database users. Relational Database brought into the relationship between data as well improved the performance of the data retrieval. Database world had immediately seen a major transformation and every single vendors and database users suddenly started to adopt the relational database models. Relational Database Management Systems Since Edgar F Codd proposed 12 rules for the RBDMS there were many different vendors who started them to build applications and tools to support the relationship between database. This was indeed a learning curve for many of the developer who had never worked before with the modeling of the database. However, as time passed by pretty much everybody accepted the relationship of the database and started to evolve product which performs its best with the boundaries of the RDBMS concepts. This was the best era for the databases and it gave the world extreme experts as well as some of the best products. The Entity Relationship model was also evolved at the same time. In software engineering, an Entity–relationship model (ER model) is a data model for describing a database in an abstract way. Enormous Data Growth Well, everything was going fine with the RDBMS in the database world. As there were no major challenges the adoption of the RDBMS applications and tools was pretty much universal. There was a race at times to make the developer’s life much easier with the RDBMS management tools. Due to the extreme popularity and easy to use system pretty much every data was stored in the RDBMS system. New age applications were built and social media took the world by the storm. Every organizations was feeling pressure to provide the best experience for their users based the data they had with them. While this was all going on at the same time data was growing pretty much every organization and application. Data Warehousing The enormous data growth now presented a big challenge for the organizations who wanted to build intelligent systems based on the data and provide near real time superior user experience to their customers. Various organizations immediately start building data warehousing solutions where the data was stored and processed. The trend of the business intelligence becomes the need of everyday. Data was received from the transaction system and overnight was processed to build intelligent reports from it. Though this is a great solution it has its own set of challenges. The relational database model and data warehousing concepts are all built with keeping traditional relational database modeling in the mind and it still has many challenges when unstructured data was present. Interesting Challenge Every organization had expertise to manage structured data but the world had already changed to unstructured data. There was intelligence in the videos, photos, SMS, text, social media messages and various other data sources. All of these needed to now bring to a single platform and build a uniform system which does what businesses need. The way we do business has also been changed. There was a time when user only got the features what technology supported, however, now users ask for the feature and technology is built to support the same. The need of the real time intelligence from the fast paced data flow is now becoming a necessity. Large amount (Volume) of difference (Variety) of high speed data (Velocity) is the properties of the data. The traditional database system has limits to resolve the challenges this new kind of the data presents. Hence the need of the Big Data Science. We need innovation in how we handle and manage data. We need creative ways to capture data and present to users. Big Data is Reality! Tomorrow In tomorrow’s blog post we will try to answer discuss Basics of Big Data Architecture. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

    Read the article

  • Working with Temporal Data in SQL Server

    - by Dejan Sarka
    My third Pluralsight course, Working with Temporal Data in SQL Server, is published. I am really proud on the second part of the course, where I discuss optimization of temporal queries. This was a nearly impossible task for decades. First solutions appeared only lately. I present all together six solutions (and one more that is not a solution), and I invented four of them. http://pluralsight.com/training/Courses/TableOfContents/working-with-temporal-data-sql-server

    Read the article

  • MySQL Database Replication and Server Load

    - by Willy
    Hi Everyone, I have an online service with around 5000 MySQL databases. Now, I am interested in building a development area of the exactly same environment in my office, therefore, I am about to setup MySQL replication between my live MySQL server and development MySQL server. But my concern is the load which will occur on my live MySQL server once replication is started. Do you have any experience? Will this process cause extra load on my production server? Thanks, have a nice weekend.

    Read the article

  • New Communications Industry Data Model with "Factory Installed" Predictive Analytics using Oracle Da

    - by charlie.berger
    Oracle Introduces Oracle Communications Data Model to Provide Actionable Insight for Communications Service Providers   We've integrated pre-installed analytical methodologies with the new Oracle Communications Data Model to deliver automated, simple, yet powerful predictive analytics solutions for customers.  Churn, sentiment analysis, identifying customer segments - all things that can be anticipated and hence, preconcieved and implemented inside an applications.  Read on for more information! TM Forum Management World, Nice, France - 18 May 2010 News Facts To help communications service providers (CSPs) manage and analyze rapidly growing data volumes cost effectively, Oracle today introduced the Oracle Communications Data Model. With the Oracle Communications Data Model, CSPs can achieve rapid time to value by quickly implementing a standards-based enterprise data warehouse that features communications industry-specific reporting, analytics and data mining. The combination of the Oracle Communications Data Model, Oracle Exadata and the Oracle Business Intelligence (BI) Foundation represents the most comprehensive data warehouse and BI solution for the communications industry. Also announced today, Hong Kong Broadband Network enhanced their data warehouse system, going live on Oracle Communications Data Model in three months. The leading provider increased its subscriber base by 37 percent in six months and reduced customer churn to less than one percent. Product Details Oracle Communications Data Model provides industry-specific schema and embedded analytics that address key areas such as customer management, marketing segmentation, product development and network health. CSPs can efficiently capture and monitor critical data and transform it into actionable information to support development and delivery of next-generation services using: More than 1,300 industry-specific measurements and key performance indicators (KPIs) such as network reliability statistics, provisioning metrics and customer churn propensity. Embedded OLAP cubes for extremely fast dimensional analysis of business information. Embedded data mining models for sophisticated trending and predictive analysis. Support for multiple lines of business, such as cable, mobile, wireline and Internet, which can be easily extended to support future requirements. With Oracle Communications Data Model, CSPs can jump start the implementation of a communications data warehouse in line with communications-industry standards including the TM Forum Information Framework (SID), formerly known as the Shared Information Model. Oracle Communications Data Model is optimized for any Oracle Database 11g platform, including Oracle Exadata, which can improve call data record query performance by 10x or more. Supporting Quotes "Oracle Communications Data Model covers a wide range of business areas that are relevant to modern communications service providers and is a comprehensive solution - with its data model and pre-packaged templates including BI dashboards, KPIs, OLAP cubes and mining models. It helps us save a great deal of time in building and implementing a customized data warehouse and enables us to leverage the advanced analytics quickly and more effectively," said Yasuki Hayashi, executive manager, NTT Comware Corporation. "Data volumes will only continue to grow as communications service providers expand next-generation networks, deploy new services and adopt new business models. They will increasingly need efficient, reliable data warehouses to capture key insights on data such as customer value, network value and churn probability. With the Oracle Communications Data Model, Oracle has demonstrated its commitment to meeting these needs by delivering data warehouse tools designed to fill communications industry-specific needs," said Elisabeth Rainge, program director, Network Software, IDC. "The TM Forum Conformance Mark provides reassurance to customers seeking standards-based, and therefore, cost-effective and flexible solutions. TM Forum is extremely pleased to work with Oracle to certify its Oracle Communications Data Model solution. Upon successful completion, this certification will represent the broadest and most complete implementation of the TM Forum Information Framework to date, with more than 130 aggregate business entities," said Keith Willetts, chairman and chief executive officer, TM Forum. Supporting Resources Oracle Communications Oracle Communications Data Model Data Sheet Oracle Communications Data Model Podcast Oracle Data Warehousing Oracle Communications on YouTube Oracle Communications on Delicious Oracle Communications on Facebook Oracle Communications on Twitter Oracle Communications on LinkedIn Oracle Database on Twitter The Data Warehouse Insider Blog

    Read the article

  • Big Data – How to become a Data Scientist and Learn Data Science? – Day 19 of 21

    - by Pinal Dave
    In yesterday’s blog post we learned the importance of the analytics in Big Data Story. In this article we will understand how to become a Data Scientist for Big Data Story. Data Scientist is a new buzz word, everyone seems to be wanting to become Data Scientist. Let us go over a few key topics related to Data Scientist in this blog post. First of all we will understand what is a Data Scientist. In the new world of Big Data, I see pretty much everyone wants to become Data Scientist and there are lots of people I have already met who claims that they are Data Scientist. When I ask what is their role, I have got a wide variety of answers. What is Data Scientist? Data scientists are the experts who understand various aspects of the business and know how to strategies data to achieve the business goals. They should have a solid foundation of various data algorithms, modeling and statistics methodology. What do Data Scientists do? Data scientists understand the data very well. They just go beyond the regular data algorithms and builds interesting trends from available data. They innovate and resurrect the entire new meaning from the existing data. They are artists in disguise of computer analyst. They look at the data traditionally as well as explore various new ways to look at the data. Data Scientists do not wait to build their solutions from existing data. They think creatively, they think before the data has entered into the system. Data Scientists are visionary experts who understands the business needs and plan ahead of the time, this tremendously help to build solutions at rapid speed. Besides being data expert, the major quality of Data Scientists is “curiosity”. They always wonder about what more they can get from their existing data and how to get maximum out of future incoming data. Data Scientists do wonders with the data, which goes beyond the job descriptions of Data Analysist or Business Analysist. Skills Required for Data Scientists Here are few of the skills a Data Scientist must have. Expert level skills with statistical tools like SAS, Excel, R etc. Understanding Mathematical Models Hands-on with Visualization Tools like Tableau, PowerPivots, D3. j’s etc. Analytical skills to understand business needs Communication skills On the technology front any Data Scientists should know underlying technologies like (Hadoop, Cloudera) as well as their entire ecosystem (programming language, analysis and visualization tools etc.) . Remember that for becoming a successful Data Scientist one require have par excellent skills, just having a degree in a relevant education field will not suffice. Final Note Data Scientists is indeed very exciting job profile. As per research there are not enough Data Scientists in the world to handle the current data explosion. In near future Data is going to expand exponentially, and the need of the Data Scientists will increase along with it. It is indeed the job one should focus if you like data and science of statistics. Courtesy: emc Tomorrow In tomorrow’s blog post we will discuss about various Big Data Learning resources. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

    Read the article

  • Unstructured Data - The future of Data Administration

    Some have claimed that there is a problem with the way data is currently managed using the relational paradigm do to the rise of unstructured data in modern business. PCMag.com defines unstructured data as data that does not reside in a fixed location. They further explain that unstructured data refers to data in a free text form that is not bound to any specific structure. With the rise of unstructured data in the form of emails, spread sheets, images and documents the critics have a right to argue that the relational paradigm is not as effective as the object oriented data paradigm in managing this type of data. The relational paradigm relies heavily on structure and relationships in and between items of data. This type of paradigm works best in a relation database management system like Microsoft SQL, MySQL, and Oracle because data is forced to conform to a structure in the form of tables and relations can be derived from the existence of one or more tables. These critics also claim that database administrators have not kept up with reality because their primary focus in regards to data administration deals with structured data and the relational paradigm. The relational paradigm was developed in the 1970’s as a way to improve data management when compared to standard flat files. Little has changed since then, and modern database administrators need to know more than just how to handle structured data. That is why critics claim that today’s data professionals do not have the proper skills in order to store and maintain data for modern systems when compared to the skills of system designers, programmers , software engineers, and data designers  due to the industry trend of object oriented design and development. I think that they are wrong. I do not disagree that the industry is moving toward an object oriented approach to development with the potential to use more of an object oriented approach to data.   However, I think that it is business itself that is limiting database administrators from changing how data is stored because of the potential costs, and impact that might occur by altering any part of stored data. Furthermore, database administrators like all technology workers constantly are trying to improve their technical skills in order to excel in their job, so I think that accusing data professional is not just when the root cause of the lack of innovation is controlled by business, and it is business that will suffer for their inability to keep up with technology. One way for database professionals to better prepare for the future of database management is start working with data in the form of objects and so that they can extract data from the objects so that the stored information within objects can be used in relation to the data stored in a using the relational paradigm. Furthermore, I think the use of pattern matching will increase with the increased use of unstructured data because object can be selected, filtered and altered based on the existence of a pattern found within an object.

    Read the article

1 2 3 4 5 6 7 8 9 10 11 12  | Next Page >