I've been analyzing a recurring "bug report" (perf issue) in one of our systems related to a particularly slow delete operation. Long story short: It seems that the CASCADE DELETE keys were largely responsible, and I'd like to know (a) if this makes sense, and (b) why it's the case.
We have a schema of, let's say, widgets, those being at the root of a large graph of related tables and related-to-related tables and so on. To be perfectly clear, deleting from this table is actively discouraged; it is the "nuclear option" and users are under no illusions to the contrary. Nevertheless, it sometimes just has to be done.
The schema looks something like this:
Widgets
|
+--- Anvils (1:1)
| |
| +--- AnvilTestData (1:N)
|
+--- WidgetHistory (1:N)
|
+--- WidgetHistoryDetails (1:N)
Nothing too scary, really. A Widget can be different types, an Anvil is a special type, so that relationship is 1:1 (or more accurately 1:0..1). Then there's a large amount of data - perhaps thousands of rows of AnvilTestData per Anvil collected over time, dealing with hardness, corrosion, exact weight, hammer compatibility, usability issues, and impact tests with cartoon heads.
Then every Widget has a long, boring history of various types of transactions - production, inventory moves, sales, defect investigations, RMAs, repairs, customer complaints, etc. There might be 10-20k details for a single widget, or none at all, depending on its age.
So, unsurprisingly, there's a CASCADE DELETE relationship at every level here. If a Widget needs to be deleted, it means something's gone terribly wrong and we need to erase any records of that widget ever existing, including its history, test data, etc. Again, nuclear option.
Relations are all indexed, statistics are up to date. Normal queries are fast. The system tends to hum along pretty smoothly for everything except deletes.
Getting to the point here, finally, for various reasons we only allow deleting one widget at a time, so a delete statement would look like this:
DELETE FROM Widgets
WHERE WidgetID = @WidgetID
Pretty simple, innocuous looking delete... that takes over 2 minutes to run, for a widget with no data!
After slogging through execution plans I was finally able to pick out the AnvilTestData and WidgetHistoryDetails deletes as the sub-operations with the highest cost. So I experimented with turning off the CASCADE (but keeping the actual FK, just setting it to NO ACTION) and rewriting the script as something very much like the following:
DECLARE @AnvilID int
SELECT @AnvilID = AnvilID FROM Anvils WHERE WidgetID = @WidgetID
DELETE FROM AnvilTestData
WHERE AnvilID = @AnvilID
DELETE FROM WidgetHistory
WHERE HistoryID IN (
SELECT HistoryID
FROM WidgetHistory
WHERE WidgetID = @WidgetID)
DELETE FROM Widgets WHERE WidgetID = @WidgetID
Both of these "optimizations" resulted in significant speedups, each one shaving nearly a full minute off the execution time, so that the original 2-minute deletion now takes about 5-10 seconds - at least for new widgets, without much history or test data.
Just to be absolutely clear, there is still a CASCADE from WidgetHistory to WidgetHistoryDetails, where the fanout is highest, I only removed the one originating from Widgets.
Further "flattening" of the cascade relationships resulted in progressively less dramatic but still noticeable speedups, to the point where deleting a new widget was almost instantaneous once all of the cascade deletes to larger tables were removed and replaced with explicit deletes.
I'm using DBCC DROPCLEANBUFFERS and DBCC FREEPROCCACHE before each test. I've disabled all triggers that might be causing further slowdowns (although those would show up in the execution plan anyway). And I'm testing against older widgets, too, and noticing a significant speedup there as well; deletes that used to take 5 minutes now take 20-40 seconds.
Now I'm an ardent supporter of the "SELECT ain't broken" philosophy, but there just doesn't seem to be any logical explanation for this behaviour other than crushing, mind-boggling inefficiency of the CASCADE DELETE relationships.
So, my questions are:
Is this a known issue with DRI in SQL Server? (I couldn't seem to find any references to this sort of thing on Google or here in SO; I suspect the answer is no.)
If not, is there another explanation for the behaviour I'm seeing?
If it is a known issue, why is it an issue, and are there better workarounds I could be using?