In developing Schema Compare for Oracle, one of the issues we came across was the size of the databases. As detailed in my last blog post, we had to allow schema pre-filtering due to the number of objects in a standard Oracle database. Unfortunately, this leads to some quite tricky situations regarding object dependencies. This post explains how we deal with these dependencies.
1. Cross-schema dependencies
Say you're comparing the following source and target databases, populating only SchemaA, and synchronizing SchemaA.Table1:
SOURCE:
CREATE TABLE SchemaA.Table1 ( Col1 NUMBER REFERENCES SchemaB.Table1(Col1));
CREATE TABLE SchemaB.Table1 ( Col1 NUMBER PRIMARY KEY);

TARGET:
CREATE TABLE SchemaA.Table1 ( Col1 VARCHAR2(100) REFERENCES SchemaB.Table1(Col1));
CREATE TABLE SchemaB.Table1 ( Col1 VARCHAR2(100) PRIMARY KEY);
We need to do a rebuild of SchemaA.Table1 to change Col1 from a VARCHAR2(100) to a NUMBER. This consists of:
Creating a table with the new schema
Inserting data from the old table into the new table, with appropriate conversion functions (in this case, TO_NUMBER)
Dropping the old table
Renaming the new table to the old table's name
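As an illustrative sketch, the naive four-step rebuild script might be assembled like this. The `rebuild_statements` helper and the `_REBUILD` temporary name are hypothetical; the real tool generates its script differently:

```python
# Hypothetical sketch of the four-step table rebuild for the example above.
# The helper name, temporary table name, and exact DDL are illustrative.

def rebuild_statements(owner, table, new_cols_ddl, select_list):
    tmp = f"{table}_REBUILD"  # hypothetical temporary name for the new table
    return [
        # 1. Create a table with the new schema
        f"CREATE TABLE {owner}.{tmp} ({new_cols_ddl})",
        # 2. Copy the data across, applying conversion functions
        f"INSERT INTO {owner}.{tmp} SELECT {select_list} FROM {owner}.{table}",
        # 3. Drop the old table
        f"DROP TABLE {owner}.{table}",
        # 4. Rename the new table to the old table's name
        f"ALTER TABLE {owner}.{tmp} RENAME TO {table}",
    ]

stmts = rebuild_statements(
    "SchemaA", "Table1",
    "Col1 NUMBER REFERENCES SchemaB.Table1(Col1)",
    "TO_NUMBER(Col1)",
)
print("\n".join(stmts))
```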
Unfortunately, in this situation, the rebuild will fail at step 1, as we're trying to create a NUMBER column with a foreign key reference to a VARCHAR2(100) column. As we're only populating SchemaA, the naive implementation of the object population prefiltering (sticking a WHERE owner = 'SCHEMAA' on all the data dictionary queries) will generate an incorrect sync script. What we actually have to do is:
Drop foreign key constraint on SchemaA.Table1
Rebuild SchemaB.Table1
Rebuild SchemaA.Table1, adding the foreign key constraint to the new table
This means that in order to generate a correct synchronization script for SchemaA.Table1 we have to know what SchemaB.Table1 is, and that it also needs to be rebuilt to successfully rebuild SchemaA.Table1. SchemaB isn't the schema that the user wants to synchronize, but we still have to load the table and column information for SchemaB.Table1 the same way as any table in SchemaA.
Fortunately, Oracle provides (mostly) complete dependency information in its dictionary views. Before we read the full information on all the tables and columns in the database, we can find all the objects that are either pointed at by objects in the schemas we're populating, or that point to objects in those schemas (think about what would happen if SchemaB was being explicitly populated instead), with a suitable query on all_constraints (for foreign key relationships) and all_dependencies (for most other kinds of dependency, e.g. a function calling another function).
The extra objects found can then be included in the actual object population, and the sync wizard then has enough information to figure out the right thing to do when we get to actually synchronize the objects. Unfortunately, this isn’t enough.
2. Dependency chains
The solution above will only get the immediate dependencies of objects in populated schemas. What if there’s a chain of dependencies?
A.tbl1 -> B.tbl1 -> C.tbl1 -> D.tbl1
If we’re only populating SchemaA, the implementation above will only include B.tbl1 in the dependent objects list, whereas we might need to know about C.tbl1 and D.tbl1 as well, in order to ensure a modification on A.tbl1 can succeed. What we actually need is a graph traversal on the dependency graph that all_dependencies represents.
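What we need amounts to a standard graph traversal. A minimal Python sketch over toy edge data (not the engine's actual code) shows that a transitive traversal starting from SchemaA finds all three downstream tables, where the immediate-dependency query alone would only find B.tbl1:

```python
# Toy illustration of the transitive dependency traversal. Each edge is a
# (referencing object, referenced object) pair; the data is the A -> B -> C -> D
# chain from the text.
from collections import deque

def transitive_deps(edges, start_schemas):
    """Breadth-first traversal starting from every object owned by one of
    start_schemas; returns all transitively referenced objects."""
    seen = set()
    queue = deque(obj for obj, _ in edges
                  if obj.split(".")[0] in start_schemas)
    while queue:
        obj = queue.popleft()
        for src, dst in edges:
            if src == obj and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen

chain = [("A.tbl1", "B.tbl1"), ("B.tbl1", "C.tbl1"), ("C.tbl1", "D.tbl1")]
deps = transitive_deps(chain, {"A"})   # finds B.tbl1, C.tbl1 and D.tbl1
print(deps)
```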
Fortunately, we don’t have to read all the database dependency information from the server and run the graph traversal on the client computer, as Oracle provides a method of doing this in SQL – CONNECT BY. So, we can put all the dependencies we want to include together in one big bag with UNION ALL, then run a SELECT ... CONNECT BY on it, starting with objects in the schemas we’re populating. We should end up with all the objects that might be affected by modifications in the initial schemas we’re populating.
Good solution? Well, no. For one thing, it’s sloooooow. all_dependencies, on my test databases, has got over 110,000 rows in it, and the entire query, for which Oracle was creating a temporary table to hold the big bag of graph edges, was often taking upwards of two minutes. This is too long, and would only get worse for large databases. But it had some more fundamental problems than just performance.
3. Comparison dependencies
Consider the following schema:
SOURCE:
CREATE TABLE SchemaA.Table1 ( Col1 NUMBER REFERENCES SchemaB.Table1(col1));
CREATE TABLE SchemaB.Table1 ( Col1 NUMBER PRIMARY KEY);

TARGET:
CREATE TABLE SchemaA.Table1 ( Col1 VARCHAR2(100));
CREATE TABLE SchemaB.Table1 ( Col1 VARCHAR2(100));
What happens if we use the dependency algorithm above on this source and target?
Well, SchemaA.Table1 has a foreign key reference to SchemaB.Table1, so that will be included in the source database population.
On the target, SchemaA.Table1 has no such reference. Therefore SchemaB.Table1 will not be included in the target database population.
In the resulting comparison of the two object models, what you end up with is:

SOURCE            TARGET
SchemaA.Table1 -> SchemaA.Table1
SchemaB.Table1 -> (no object exists)
When this comparison is synchronized, we will see that SchemaB.Table1 does not exist, so we will try the following sequence of actions:
Create SchemaB.Table1
Rebuild SchemaA.Table1, with foreign key to SchemaB.Table1
Oops.
Because the dependencies are only followed within a single database, we’ve tried to create an object that already exists.
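A toy Python model makes the asymmetry explicit. The `populate` helper is hypothetical; it just models "load the selected schemas plus their immediate FK dependencies, per database":

```python
# Toy model of why single-database dependency population goes wrong: the
# source follows its foreign key and pulls in SchemaB.Table1; the target,
# which has no such foreign key, does not. The comparison then thinks the
# table is missing on the target and schedules a CREATE for an object that
# already exists.

def populate(schemas, fk_edges, all_tables):
    """Tables loaded when only `schemas` are explicitly populated, plus
    immediate FK dependencies (following edges in one database only)."""
    loaded = {t for t in all_tables if t.split(".")[0] in schemas}
    loaded |= {ref for tab, ref in fk_edges if tab in loaded}
    return loaded

tables = {"SchemaA.Table1", "SchemaB.Table1"}
source = populate({"SchemaA"}, [("SchemaA.Table1", "SchemaB.Table1")], tables)
target = populate({"SchemaA"}, [], tables)  # no FK on the target, no edge

spurious_creates = source - target  # SchemaB.Table1 looks absent on target
print(spurious_creates)
```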
To fix this we can include any objects found as dependencies in the source or target databases in the object population of both databases. SchemaB.Table1 will then be included in the target database population, and we won’t try and create objects that already exist. All good? Well, consider the following schema (again, only explicitly populating SchemaA, and synchronizing SchemaA.Table1):
SOURCE:
CREATE TABLE SchemaA.Table1 ( Col1 NUMBER REFERENCES SchemaB.Table1(col1));
CREATE TABLE SchemaB.Table1 ( Col1 NUMBER PRIMARY KEY);
CREATE TABLE SchemaC.Table1 ( Col1 NUMBER);

TARGET:
CREATE TABLE SchemaA.Table1 ( Col1 VARCHAR2(100));
CREATE TABLE SchemaB.Table1 ( Col1 VARCHAR2(100) PRIMARY KEY);
CREATE TABLE SchemaC.Table1 ( Col1 VARCHAR2(100) REFERENCES SchemaB.Table1);
Although we’re now including SchemaB.Table1 on both sides of the comparison, there’s a third table (SchemaC.Table1) that we don’t know about, and it will cause the rebuild of SchemaB.Table1 to fail if we try to synchronize SchemaA.Table1. That’s because we’re only running the dependency query on the schemas we’re explicitly populating; to solve this, we have to run the dependency query again, this time starting the graph traversal from the objects found in the other database. Furthermore, this dependency chain could be arbitrarily extended. This leads us to the following algorithm for finding all the dependencies of a comparison:
1. Find the initial dependencies of the schemas the user has selected to compare, on both the source and the target
2. Include these objects in both the source and target object populations
3. Run the dependency query on the source, starting with the objects found as dependencies on the target, and vice versa
4. Repeat steps 2 & 3 until no more objects are found
For the schema above, this will result in the following sequence of actions:
1. Find initial dependencies:
SchemaA.Table1 -> SchemaB.Table1 found on source
No objects found on target
2. Include objects in both source and target:
SchemaB.Table1 included in source and target
3. Run the dependency query, starting with the found objects:
No objects to start with on source
SchemaB.Table1 -> SchemaC.Table1 found on target
4. Include objects in both source and target:
SchemaC.Table1 included in source and target
5. Run the dependency query on the found objects:
No objects found on source
No objects to start with on target
6. Stop
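The back-and-forth can be sketched in Python. The edge lists and the `ping_pong` helper are illustrative stand-ins for the real dependency queries, and edges are followed in both directions (objects pointed at, and objects pointing in), as described above:

```python
# Sketch of the alternating source/target dependency expansion: keep
# growing the included set with dependency edges from either database
# until no new objects turn up (a fixpoint).

def ping_pong(source_edges, target_edges, initial_schemas):
    def neighbours(edges, objs):
        """Objects one dependency edge away, in either direction."""
        found = set()
        for a, b in edges:
            if a in objs:
                found.add(b)
            if b in objs:
                found.add(a)
        return found

    all_objs = {o for edge in source_edges + target_edges for o in edge}
    included = {o for o in all_objs if o.split(".")[0] in initial_schemas}
    frontier = included
    while frontier:
        found = (neighbours(source_edges, frontier)
                 | neighbours(target_edges, frontier))
        frontier = found - included  # only genuinely new objects
        included |= frontier
    return included

# The example above: on the source, A.Table1 references B.Table1; on the
# target, C.Table1 references B.Table1.
src = [("SchemaA.Table1", "SchemaB.Table1")]
tgt = [("SchemaC.Table1", "SchemaB.Table1")]
print(ping_pong(src, tgt, {"SchemaA"}))  # all three tables get included
```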
This will ensure that we include all the necessary objects to make any synchronization work. However, there is still the issue of query performance; the CONNECT BY on the entire database dependency graph is still too slow.
After much sitting down and drawing complicated diagrams, we decided to move the graph traversal algorithm from the server onto the client (it turned out to run much faster there). To avoid reading the entire dependency graph onto the client, we also pull the graph across in bits: we start off with the dependency edges involving the schemas selected for explicit population, and whenever the graph traversal comes across a dependency reference to a schema we don’t yet know about, a thunk is hit that pulls in the dependency information for that schema from the database.
We continue passing more dependent objects back and forth between the source and target until no more dependency references are found. This gives us the list of all the extra objects to populate in the source and target, and object population can then proceed.
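A minimal sketch of the lazy, per-schema loading, assuming a `fetch_edges` callback that stands in for the database round trip (names and data are illustrative, not the engine's code):

```python
# Client-side traversal that loads dependency edges one schema at a time.
# fetch_edges(schema) models the query that pulls that schema's rows from
# all_dependencies / all_constraints on demand.

def lazy_traverse(fetch_edges, start_objects):
    edges, loaded_schemas = [], set()

    def ensure_loaded(schema):
        if schema not in loaded_schemas:   # the "thunk": load on first touch
            loaded_schemas.add(schema)
            edges.extend(fetch_edges(schema))

    seen, stack = set(start_objects), list(start_objects)
    while stack:
        obj = stack.pop()
        ensure_loaded(obj.split(".")[0])
        for a, b in edges:  # follow edges in either direction
            nxt = b if a == obj else a if b == obj else None
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen, loaded_schemas

per_schema = {
    "A": [("A.tbl1", "B.tbl1")],
    "B": [("B.tbl1", "C.tbl1")],
    "C": [("C.tbl1", "D.tbl1")],
    "D": [],
    "E": [("E.tbl1", "E.tbl2")],   # unrelated schema: never fetched
}
seen, loaded = lazy_traverse(lambda s: per_schema[s], ["A.tbl1"])
print(seen, loaded)
```

The point of the design is visible in the output: schema E's dependency rows are never read from the "database" at all, because the traversal never reaches an object owned by E.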
4. Object blacklists and fast dependencies
When we tested this solution, we were puzzled to find that in some of our databases most of the system schemas (WMSYS, ORDSYS, EXFSYS, XDB, etc.) were being pulled in, which increased the database registration and comparison time quite significantly. After debugging, we discovered that the culprits were database tables that used one of the Oracle-supplied PL/SQL types (e.g. the SDO_GEOMETRY spatial type). These created a dependency chain from the tables we were populating to the system schemas, and hence pulled in most of the system objects in those schemas.
To solve this we introduced blacklists of objects we wouldn’t follow any dependency chain through. As well as the Oracle-supplied PL/SQL types (MDSYS.SDO_GEOMETRY, ORDSYS.SI_COLOR, among others) we also decided to blacklist the entire PUBLIC and SYS schemas, as any references to those would likely lead to a blow up in the dependency graph that would massively increase the database registration time, and could result in the client running out of memory.
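One possible cut-off policy, sketched in Python. The blacklist contents come from the text; the exact pruning rule (here, not recording the blacklisted object at all) is a simplification of whatever the real engine does:

```python
# Dependency traversal with a blacklist cut-off: edges into blacklisted
# owners or objects are never expanded, so a table using SDO_GEOMETRY no
# longer drags an entire system schema into the population.

BLACKLISTED_OWNERS = {"SYS", "PUBLIC"}
BLACKLISTED_OBJECTS = {"MDSYS.SDO_GEOMETRY", "ORDSYS.SI_COLOR"}

def traverse_with_blacklist(edges, start):
    def blocked(obj):
        return (obj.split(".")[0] in BLACKLISTED_OWNERS
                or obj in BLACKLISTED_OBJECTS)

    seen, stack = set(start), list(start)
    while stack:
        obj = stack.pop()
        for a, b in edges:
            if a == obj and b not in seen and not blocked(b):
                seen.add(b)
                stack.append(b)
    return seen

edges = [
    ("A.Spatial", "MDSYS.SDO_GEOMETRY"),        # blacklisted type: stop here
    ("MDSYS.SDO_GEOMETRY", "MDSYS.Internal1"),  # never reached
    ("A.Spatial", "B.Lookup"),                  # ordinary dependency: followed
]
print(traverse_with_blacklist(edges, ["A.Spatial"]))
```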
Even with these improvements, each dependency query was taking upwards of a minute. From the Oracle execution plans, we discovered that some of the clauses fetching dependency information we required were querying system tables with no indexes on them! To cut a long story short, running the following query:
SELECT * FROM all_tab_cols WHERE data_type_owner = 'XDB';
results in a full table scan of the SYS.COL$ system table! This single clause was responsible for over half the execution time of the dependency query. Hence, the ‘Ignore slow dependencies’ option was born: it skips this clause and a couple of similar ones, drastically speeding up the dependency query at the expense of producing incorrect sync scripts in rare edge cases.
Needless to say, along with the sync script action ordering, the dependency code in the database registration is one of the most complicated and most rewritten parts of the Schema Compare for Oracle engine. The beta of Schema Compare for Oracle is out now; if you find a bug in it, please do tell us so we can get it fixed!