Developing Schema Compare for Oracle (Part 2): Dependencies
Posted by Simon Cooper on Simple Talk
Published on Tue, 20 Apr 2010 09:49:00 GMT
In developing Schema Compare for Oracle, one of the issues we came across was the size of the databases. As detailed in my last blog post, we had to allow schema pre-filtering due to the number of objects in a standard Oracle database. Unfortunately, this leads to some quite tricky situations regarding object dependencies. This post explains how we deal with these dependencies.
1. Cross-schema dependencies
Say, in the following database, you're populating `SchemaA`, and synchronizing `SchemaA.Table1`:
| SOURCE | TARGET |
|---|---|
| `CREATE TABLE SchemaA.Table1 (…` | `CREATE TABLE SchemaA.Table1 (…` |
| `CREATE TABLE SchemaB.Table1 (…` | `CREATE TABLE SchemaB.Table1 (…` |
We need to do a rebuild of `SchemaA.Table1` to change `Col1` from a `VARCHAR2(100)` to a `NUMBER`. This consists of:
1. Create a table with the new schema
2. Insert data from the old table into the new table, with appropriate conversion functions (in this case, `TO_NUMBER`)
3. Drop the old table
4. Rename the new table to the same name as the old table
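As a rough sketch in Oracle DDL (the column name `Col1` comes from the example above; the temporary table name, and the exact script the tool generates, are assumptions for illustration):

```sql
-- 1. Create a table with the new schema (the _new suffix is illustrative)
CREATE TABLE SchemaA.Table1_new (Col1 NUMBER);

-- 2. Insert data from the old table, converting as we go
INSERT INTO SchemaA.Table1_new (Col1)
  SELECT TO_NUMBER(Col1) FROM SchemaA.Table1;

-- 3. Drop the old table
DROP TABLE SchemaA.Table1;

-- 4. Rename the new table to the old name
ALTER TABLE SchemaA.Table1_new RENAME TO Table1;
```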
Unfortunately, in this situation, the rebuild will fail at step 1, as we're trying to create a `NUMBER` column with a foreign key reference to a `VARCHAR2(100)` column. As we're only populating `SchemaA`, the naive implementation of the object population pre-filtering (sticking a `WHERE owner = 'SCHEMAA'` on all the data dictionary queries) will generate an incorrect sync script. What we actually have to do is:
1. Drop the foreign key constraint on `SchemaA.Table1`
2. Rebuild `SchemaB.Table1`
3. Rebuild `SchemaA.Table1`, adding the foreign key constraint to the new table
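The bracketing steps might look like this in SQL (the constraint and column names are assumed for illustration; the rebuilds themselves follow the create/insert/drop/rename pattern described earlier):

```sql
-- 1. Drop the foreign key constraint on SchemaA.Table1
--    (the constraint name table1_fk is assumed)
ALTER TABLE SchemaA.Table1 DROP CONSTRAINT table1_fk;

-- 2. Rebuild SchemaB.Table1 (create new table / insert / drop / rename)

-- 3. Rebuild SchemaA.Table1, then re-add the foreign key against the new table
ALTER TABLE SchemaA.Table1
  ADD CONSTRAINT table1_fk FOREIGN KEY (Col1)
  REFERENCES SchemaB.Table1 (Col1);
```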
This means that in order to generate a correct synchronization script for `SchemaA.Table1`, we have to know what `SchemaB.Table1` is, and that it also needs to be rebuilt to successfully rebuild `SchemaA.Table1`. `SchemaB` isn't a schema that the user wants to synchronize, but we still have to load the table and column information for `SchemaB.Table1` the same way as any table in `SchemaA`.
Fortunately, Oracle provides (mostly) complete dependency information in the dictionary views. Before we actually read the information on all the tables and columns in the database, we can get dependency information on all the objects that are either pointed at by objects in the schemas we're populating, or point to objects in the schemas we're populating (think about what would happen if `SchemaB` was being explicitly populated instead), with a suitable query on `all_constraints` (for foreign key relationships) and `all_dependencies` (for most other types of dependencies, e.g. a function using another function).
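As a sketch of the kind of query involved (assuming a single populated schema, `SCHEMAA`; the product's real queries are considerably more involved):

```sql
-- Foreign key relationships that cross into or out of the populated schema
SELECT c.owner, c.table_name,
       r.owner AS referenced_owner, r.table_name AS referenced_table
FROM   all_constraints c
       JOIN all_constraints r
         ON c.r_owner = r.owner
        AND c.r_constraint_name = r.constraint_name
WHERE  c.constraint_type = 'R'   -- 'R' = referential (foreign key)
AND   (c.owner = 'SCHEMAA' OR r.owner = 'SCHEMAA');

-- Most other dependency types (e.g. a function using another function)
SELECT owner, name, referenced_owner, referenced_name
FROM   all_dependencies
WHERE  owner = 'SCHEMAA' OR referenced_owner = 'SCHEMAA';
```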
The extra objects found can then be included in the actual object population, and the sync wizard has enough information to figure out the right thing to do when we come to synchronize the objects. Unfortunately, this isn't enough.
2. Dependency chains
The solution above will only get the immediate dependencies of objects in populated schemas. What if there’s a chain of dependencies?
A.tbl1 -> B.tbl1 -> C.tbl1 -> D.tbl1
If we're only populating `SchemaA`, the implementation above will only include `B.tbl1` in the dependent objects list, whereas we might need to know about `C.tbl1` and `D.tbl1` as well, in order to ensure a modification on `A.tbl1` can succeed. What we actually need is a graph traversal of the dependency graph that `all_dependencies` represents.
Fortunately, we don't have to read all the database dependency information from the server and run the graph traversal on the client computer, as Oracle provides a method of doing this in SQL – `CONNECT BY`. So, we can put all the dependencies we want to include together in one big bag with `UNION ALL`, then run a `SELECT ... CONNECT BY` on it, starting with objects in the schema we're populating. We should end up with all the objects that might be affected by modifications in the initial schema we're populating.
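In outline, such a query might look something like this (a sketch, assuming a single populated schema, `SCHEMAA`; the edge subquery is the 'big bag', abbreviated here to `all_dependencies` alone, without the foreign key edges derived from `all_constraints`):

```sql
-- Walk the dependency graph outwards from objects in SCHEMAA
SELECT DISTINCT referenced_owner, referenced_name
FROM  (-- the 'big bag' of dependency edges; the real query UNION ALLs
       -- in foreign key edges derived from all_constraints as well
       SELECT owner, name, referenced_owner, referenced_name
       FROM   all_dependencies)
START WITH owner = 'SCHEMAA'
CONNECT BY NOCYCLE owner = PRIOR referenced_owner
               AND name  = PRIOR referenced_name;
```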
Good solution? Well, no. For one thing, it's sloooooow. `all_dependencies`, on my test databases, has over 110,000 rows in it, and the entire query, for which Oracle was creating a temporary table to hold the big bag of graph edges, often took upwards of two minutes. This is too long, and would only get worse for large databases. But it had some more fundamental problems than just performance.
3. Comparison dependencies
Consider the following schema:
| SOURCE | TARGET |
|---|---|
| `CREATE TABLE SchemaA.Table1 (…` | `CREATE TABLE SchemaA.Table1 (…` |
| `CREATE TABLE SchemaB.Table1 (…` | `CREATE TABLE SchemaB.Table1 (…` |
What will happen if we use the dependency algorithm above on the source & target databases? Well, `SchemaA.Table1` has a foreign key reference to `SchemaB.Table1` on the source, so that will be included in the source database population. On the target, `SchemaA.Table1` has no such reference, so `SchemaB.Table1` will not be included in the target database population.
In the resulting comparison of the two object models, what you will end up with is:
| SOURCE | | TARGET |
|---|---|---|
| `SchemaA.Table1` | -> | `SchemaA.Table1` |
| `SchemaB.Table1` | -> | (no object exists) |
When this comparison is synchronized, we will see that `SchemaB.Table1` does not exist, so we will try the following sequence of actions:
1. Create `SchemaB.Table1`
2. Rebuild `SchemaA.Table1`, with a foreign key to `SchemaB.Table1`
Oops.
Because the dependencies are only followed within a single database, we’ve tried to create an object that already exists.
To fix this, we can include any objects found as dependencies in the source or target databases in the object population of both databases. `SchemaB.Table1` will then be included in the target database population, and we won't try to create objects that already exist. All good? Well, consider the following schema (again, only explicitly populating `SchemaA`, and synchronizing `SchemaA.Table1`):
| SOURCE | TARGET |
|---|---|
| `CREATE TABLE SchemaA.Table1 (…` | `CREATE TABLE SchemaA.Table1 (…` |
| `CREATE TABLE SchemaB.Table1 (…` | `CREATE TABLE SchemaB.Table1 (…` |
| `CREATE TABLE SchemaC.Table1 (…` | `CREATE TABLE SchemaC.Table1 (…` |
Although we're now including `SchemaB.Table1` on both sides of the comparison, there's a third table (`SchemaC.Table1`) that we don't know about that will cause the rebuild of `SchemaB.Table1` to fail if we try to synchronize `SchemaA.Table1`. That's because we're only running the dependency query on the schemas we're explicitly populating; to solve this issue, we would have to run the dependency query again, but this time starting the graph traversal from the objects found in the other database. Furthermore, this dependency chain could be arbitrarily extended.
This leads us to the following algorithm for finding all the dependencies of a comparison:
1. Find the initial dependencies of the schemas the user has selected to compare, on the source and target
2. Include these objects in both the source and target object populations
3. Run the dependency query on the source, starting with the objects found as dependents on the target, and vice versa
4. Repeat 2 & 3 until no more objects are found
For the schema above, this will result in the following sequence of actions:

1. Find initial dependencies
   - `SchemaA.Table1 -> SchemaB.Table1` found on source
   - No objects found on target
2. Include objects in both source and target
   - `SchemaB.Table1` included in source and target
3. Run dependency query, starting with found objects
   - No objects to start with on source
   - `SchemaB.Table1 -> SchemaC.Table1` found on target
4. Include objects in both source and target
   - `SchemaC.Table1` included in source and target
5. Run dependency query on found objects
   - No objects found on source
   - No objects to start with on target
6. Stop
This will ensure that we include all the necessary objects to make any synchronization work. However, there is still the issue of query performance; the `CONNECT BY` on the entire database dependency graph is still too slow.
After much sitting down and drawing complicated diagrams, we decided to move the graph traversal algorithm from the server onto the client (it turned out to run much faster on the client than on the server). To ensure we don't read the entire dependency graph onto the client, we also pull the graph across in bits – we start off with the dependency edges involving the schemas selected for explicit population, and whenever the graph traversal comes across a dependency reference to a schema we don't yet know about, a thunk is hit that pulls in the dependency information for that schema from the database.
We continue passing more dependent objects back and forth between the source and target until no more dependency references are found. This gives us the list of all the extra objects to populate in the source and target, and object population can then proceed.
4. Object blacklists and fast dependencies
When we tested this solution, we were puzzled to find that, in some of our databases, most of the system schemas (WMSYS, ORDSYS, EXFSYS, XDB, etc.) were being pulled in, and this was increasing the database registration and comparison time quite significantly. After debugging, we discovered that the culprits were database tables that used one of the Oracle PL/SQL types (e.g. the SDO_GEOMETRY spatial type). These were creating a dependency chain from the database tables we were populating to the system schemas, and hence pulling in most of the system objects in those schemas.
To solve this we introduced blacklists of objects we wouldn’t follow any dependency chain through. As well as the Oracle-supplied PL/SQL types (MDSYS.SDO_GEOMETRY, ORDSYS.SI_COLOR, among others) we also decided to blacklist the entire PUBLIC and SYS schemas, as any references to those would likely lead to a blow up in the dependency graph that would massively increase the database registration time, and could result in the client running out of memory.
Even with these improvements, each dependency query was taking upwards of a minute. We discovered from Oracle execution plans that some of the clauses fetching dependency information we required were querying system tables with no indexes on them! To cut a long story short, running the following query:

`SELECT * FROM all_tab_cols WHERE data_type_owner = 'XDB';`

results in a full table scan of the `SYS.COL$` system table! This single clause was responsible for over half the execution time of the dependency query. Hence, the 'Ignore slow dependencies' option was born – skipping this and a couple of similar clauses drastically speeds up the dependency query, at the expense of producing incorrect sync scripts in rare edge cases.
Needless to say, along with the sync script action ordering, the dependency code in the database registration is one of the most complicated and most rewritten parts of the Schema Compare for Oracle engine. The beta of Schema Compare for Oracle is out now; if you find a bug in it, please do tell us so we can get it fixed!
© Simple Talk or respective owner