Handling inheritance with overriding efficiently
- by Fyodor Soikin
I have the following two data structures.
First, a list of properties applied to object triples:
Object1  Object2  Object3  Property  Value
O1       O2       O3       P1        "abc"
O1       O2       O3       P2        "xyz"
O1       O3       O4       P1        "123"
O2       O4       O5       P1        "098"
Second, an inheritance tree:
O1
    O2
        O4
    O3
        O5
Or viewed as a relation:
Object  Parent
O2      O1
O4      O2
O3      O1
O5      O3
O1      null
The semantics are that O2 inherits properties from O1; O4 from O2 and O1; O3 from O1; and O5 from O3 and O1, in that order of precedence.
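To make the precedence order concrete, here is a minimal Python sketch that walks the Parent relation above. The names `PARENT` and `ancestors` are only illustrative and are not part of my actual schema.

```python
# Parent relation copied from the table above; None marks the root.
PARENT = {"O2": "O1", "O4": "O2", "O3": "O1", "O5": "O3", "O1": None}


def ancestors(obj, parent=PARENT):
    """Return the ancestors of obj, nearest first (the order of precedence)."""
    chain = []
    current = parent[obj]
    while current is not None:
        chain.append(current)
        current = parent[current]
    return chain


print(ancestors("O4"))  # ['O2', 'O1']
print(ancestors("O5"))  # ['O3', 'O1']
```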
NOTE 1: I have an efficient way to select all children or all parents of a given object. This is currently implemented with left and right indexes, but hierarchyid could also work. This does not seem important right now.
NOTE 2: I have triggers in place that make sure that the "Object" column always contains all possible objects, even when they do not really have to be there (i.e. have no parent or children defined). This makes it possible to use inner joins rather than severely less efficient outer joins.
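For readers unfamiliar with the left and right indexes mentioned in NOTE 1, the sketch below shows the general idea in Python: number the tree in a depth-first walk, and "is parent of" becomes an interval containment test. The numbering scheme and the names `CHILDREN`, `number_tree` and `is_parent_or_self` are assumptions made for illustration; my real tables may number things differently.

```python
# The inheritance tree from above, as a child list per object.
CHILDREN = {"O1": ["O2", "O3"], "O2": ["O4"], "O3": ["O5"], "O4": [], "O5": []}


def number_tree(root, children=CHILDREN):
    """Assign (left, right) bounds to every node with a depth-first walk."""
    bounds = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter
        for child in children[node]:
            visit(child)
        counter += 1
        bounds[node] = (left, counter)

    visit(root)
    return bounds


BOUNDS = number_tree("O1")


def is_parent_or_self(x, a, bounds=BOUNDS):
    """True when x == a or x is an ancestor of a (interval containment)."""
    left_x, right_x = bounds[x]
    left_a, right_a = bounds[a]
    return left_x <= left_a and right_a <= right_x
```

With this numbering, O1 gets the bounds (1, 10) and O4 gets (3, 4), so the test is a pair of range comparisons that an ordinary index can serve.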
The objective is: Given a pair of (Property, Value), return all object triples that have that property with that value either defined explicitly or inherited from a parent.
NOTE 1: An object triple (X,Y,Z) is considered a "parent" of triple (A,B,C) when it is true that either X = A or X is a parent of A, and the same is true for (Y,B) and (Z,C).
NOTE 2: A property defined on a closer parent "overrides" the same property defined on a more distant parent.
NOTE 3: When (A,B,C) has two parents - (X1,Y1,Z1) and (X2,Y2,Z2), then (X1,Y1,Z1) is considered a "closer" parent when:
(a) X2 is a parent of X1, or
(b) X2 = X1 and Y2 is a parent of Y1, or
(c) X2 = X1 and Y2 = Y1 and Z2 is a parent of Z1
In other words, the "closeness" in ancestry for triples is defined based on the first components of the triples first, then on the second components, then on the third components.
This rule establishes an unambiguous partial order for triples in terms of ancestry.
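The three notes can be sketched in Python as follows. This is only a model of the rules, not the real implementation, and the function names are made up for illustration. It relies on one observation: all parents of the same child component lie on a single path to the root, so among them the deeper object is the closer one, and comparing depth tuples left to right reproduces rules (a) to (c).

```python
# Parent relation from the question; None marks the root.
PARENT = {"O2": "O1", "O4": "O2", "O3": "O1", "O5": "O3", "O1": None}


def depth(obj):
    """Number of steps from obj up to the root."""
    steps = 0
    while PARENT[obj] is not None:
        obj = PARENT[obj]
        steps += 1
    return steps


def is_parent_or_self(x, a):
    """True when x == a or x is an ancestor of a."""
    while a is not None:
        if a == x:
            return True
        a = PARENT[a]
    return False


def is_parent_triple(parent, child):
    """NOTE 1: component-wise 'equal to or parent of'."""
    return all(is_parent_or_self(x, a) for x, a in zip(parent, child))


def closest_parent(child, candidates):
    """NOTE 3: among the parent triples of child, pick the closest one."""
    parents = [t for t in candidates if is_parent_triple(t, child)]
    # Deeper first component wins, then second, then third.
    return max(parents, key=lambda t: tuple(depth(o) for o in t))


defined = [("O1", "O2", "O3"), ("O1", "O3", "O4"), ("O2", "O4", "O5")]
print(closest_parent(("O4", "O4", "O5"), defined))  # ('O2', 'O4', 'O5')
```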
For example, given the pair of (P1, "abc"), the result set of triples will be:
O1, O2, O3 -- Defined explicitly
O1, O2, O5 -- Because O5 inherits from O3
O1, O4, O3 -- Because O4 inherits from O2
O1, O4, O5 -- Because O4 inherits from O2 and O5 inherits from O3
O2, O2, O3 -- Because O2 inherits from O1
O2, O2, O5 -- Because O2 inherits from O1 and O5 inherits from O3
O2, O4, O3 -- Because O2 inherits from O1 and O4 inherits from O2
O3, O2, O3 -- Because O3 inherits from O1
O3, O2, O5 -- Because O3 inherits from O1 and O5 inherits from O3
O3, O4, O3 -- Because O3 inherits from O1 and O4 inherits from O2
O3, O4, O5 -- Because O3 inherits from O1 and O4 inherits from O2 and O5 inherits from O3
O4, O2, O3 -- Because O4 inherits from O1
O4, O2, O5 -- Because O4 inherits from O1 and O5 inherits from O3
O4, O4, O3 -- Because O4 inherits from O1 and O4 inherits from O2
O5, O2, O3 -- Because O5 inherits from O1
O5, O2, O5 -- Because O5 inherits from O1 and O5 inherits from O3
O5, O4, O3 -- Because O5 inherits from O1 and O4 inherits from O2
O5, O4, O5 -- Because O5 inherits from O1 and O4 inherits from O2 and O5 inherits from O3
Note that the triple (O2, O4, O5) is absent from this list. This is because property P1 is defined explicitly for the triple (O2, O4, O5) and this prevents that triple from inheriting that property from (O1, O2, O3).
Note too that the triple (O4, O4, O5) is absent. This is because that triple inherits its value of P1="098" from (O2, O4, O5), which is a closer parent than (O1, O2, O3).
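The expected result above can be cross-checked with a brute-force reference in Python. It is deliberately naive (it tries every possible triple) and is meant only to pin down the semantics; `ROWS`, `effective_value` and `triples_with` are illustrative names, not anything from my database.

```python
from itertools import product

# Sample data copied from the two tables in the question.
PARENT = {"O2": "O1", "O4": "O2", "O3": "O1", "O5": "O3", "O1": None}
ROWS = [
    ("O1", "O2", "O3", "P1", "abc"),
    ("O1", "O2", "O3", "P2", "xyz"),
    ("O1", "O3", "O4", "P1", "123"),
    ("O2", "O4", "O5", "P1", "098"),
]


def self_and_ancestors(obj):
    """obj followed by its ancestors, nearest first."""
    chain = []
    while obj is not None:
        chain.append(obj)
        obj = PARENT[obj]
    return chain


def effective_value(triple, prop):
    """Value of prop on triple: explicit, or inherited from the closest parent."""
    defined = {(a, b, c): v for a, b, c, p, v in ROWS if p == prop}
    # product() varies the last component fastest, so candidates arrive
    # ordered by closeness on the first component, then second, then third.
    for candidate in product(*(self_and_ancestors(o) for o in triple)):
        if candidate in defined:
            return defined[candidate]
    return None


def triples_with(prop, value):
    """All triples whose effective value of prop equals value."""
    every_triple = product(sorted(PARENT), repeat=3)
    return [t for t in every_triple if effective_value(t, prop) == value]


result = triples_with("P1", "abc")
print(len(result))  # 18
```

For the sample data this yields the 18 triples listed above, with (O2, O4, O5) and (O4, O4, O5) excluded.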
The straightforward way to do it is the following.
First, for every triple that a property is defined on, select all possible child triples:
select Children1.Id as O1, Children2.Id as O2, Children3.Id as O3, tp.Property, tp.Value
from TriplesAndProperties tp
-- Select corresponding objects of the triple
inner join Objects as Objects1 on Objects1.Id = tp.O1
inner join Objects as Objects2 on Objects2.Id = tp.O2
inner join Objects as Objects3 on Objects3.Id = tp.O3
-- Then add all possible children of all those objects
inner join Objects as Children1 on Objects1.Id [isparentof] Children1.Id
inner join Objects as Children2 on Objects2.Id [isparentof] Children2.Id
inner join Objects as Children3 on Objects3.Id [isparentof] Children3.Id
But this is not the whole story: if some triple inherits the same property from several parents, this query will yield conflicting results.
Therefore, the second step is to select just one of those conflicting results:
select * from
(
    select
        Children1.Id as O1, Children2.Id as O2, Children3.Id as O3, tp.Property, tp.Value,
        row_number() over(
            partition by Children1.Id, Children2.Id, Children3.Id, tp.Property
            order by Objects1.[depthInTheTree] desc, Objects2.[depthInTheTree] desc, Objects3.[depthInTheTree] desc
        )
        as InheritancePriority
    from
    ... (see above)
) as Ranked
where InheritancePriority = 1
The window function row_number() over( ... ) does the following: for every unique combination of object triple and property, it sorts the candidate values by the ancestral distance from the triple to the parent that each value is inherited from, and I then keep only the first value in each sorted list.
A similar effect can be achieved with GROUP BY and ORDER BY clauses, but I find the window function semantically cleaner (the execution plans they yield are identical).
The point is, I need to select the closest of contributing ancestors, and for that I need to group and then sort within the group.
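To show exactly what the two steps compute, here is a small in-memory Python model of them: expand each defined row to all child triples, sort within each (triple, property) partition by depth descending, keep the first row, then filter. It mirrors the logic of the queries above but says nothing about how SQL Server executes them; all names are illustrative.

```python
from itertools import groupby, product

PARENT = {"O2": "O1", "O4": "O2", "O3": "O1", "O5": "O3", "O1": None}
ROWS = [
    ("O1", "O2", "O3", "P1", "abc"),
    ("O1", "O2", "O3", "P2", "xyz"),
    ("O1", "O3", "O4", "P1", "123"),
    ("O2", "O4", "O5", "P1", "098"),
]


def self_and_ancestors(obj):
    chain = []
    while obj is not None:
        chain.append(obj)
        obj = PARENT[obj]
    return chain


def depth(obj):
    return len(self_and_ancestors(obj)) - 1


def self_and_descendants(obj):
    return [o for o in PARENT if obj in self_and_ancestors(o)]


def expand():
    """Step 1: every defined row paired with all of its child triples."""
    return [
        (c1, c2, c3, prop, value, (depth(o1), depth(o2), depth(o3)))
        for o1, o2, o3, prop, value in ROWS
        for c1, c2, c3 in product(
            self_and_descendants(o1),
            self_and_descendants(o2),
            self_and_descendants(o3),
        )
    ]


def select_triples(prop, value):
    """Steps 2 and 3: keep the closest parent per partition, then filter."""
    rows = expand()
    # Partition key first, then the parent depths in descending order.
    rows.sort(key=lambda r: (r[0], r[1], r[2], r[3], tuple(-d for d in r[5])))
    winners = [
        next(group)  # the row that would get InheritancePriority = 1
        for _, group in groupby(rows, key=lambda r: r[:4])
    ]
    return [r[:3] for r in winners if r[3] == prop and r[4] == value]


print(len(expand()))                      # 52
print(len(select_triples("P1", "abc")))   # 18
```

Even on this tiny data set the sort runs over all 52 expanded rows, although only 18 rows survive the final filter.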
Finally, I can simply filter the result set by Property and Value.
This scheme works. Very reliably and predictably.
It has proven to be very powerful for the business task it implements.
The only trouble is, it is awfully slow.
One might point out that the join of seven tables might be slowing things down, but that is not actually the bottleneck.
According to the actual execution plan I'm getting from SQL Server Management Studio (as well as SQL Profiler), the bottleneck is the sorting.
The problem is, in order to satisfy my window function, the server has to sort by Children1.Id, Children2.Id, Children3.Id, tp.Property, Objects1.[depthInTheTree] descending, Objects2.[depthInTheTree] descending, Objects3.[depthInTheTree] descending, and there is no index it can use, because the values come from a cross join of several tables.
EDIT: Per Michael Buen's suggestion (thank you, Michael), I have posted the whole puzzle to sqlfiddle here. One can see in the execution plan that the Sort operation accounts for 32% of the whole query, and that is going to grow with the number of total rows, because all the other operations use indexes.
Usually in such cases I would use an indexed view, but not in this case, because indexed views cannot contain self-joins, of which there are six.
The only way that I can think of so far is to create six copies of the Objects table and then use them for the joins, thus enabling an indexed view.
Has the time come when I am reduced to this kind of hack? Despair sets in.