How to index a table with a Type 2 slowly changing dimension for optimal performance
Posted
by The Lazy DBA
on Stack Overflow
See other posts from Stack Overflow
or by The Lazy DBA
Published on 2010-01-08T19:24:59Z
Indexed on
2010/05/18
2:40 UTC
Read the original article
Hit count: 266
Suppose you have a table with a Type 2 slowly-changing dimension.
Let's express this table as follows, with the following columns:
* [Key]
* [Value1]
* ...
* [ValueN]
* [StartDate]
* [ExpiryDate]
In this example, let's suppose that [StartDate] is effectively the date in which the values for a given [Key] become known to the system. So our primary key would be composed of both [StartDate] and [Key].
When a new set of values arrives for a given [Key], we assign [ExpiryDate] to some pre-defined high surrogate value such as '12/31/9999'. We then set the existing "most recent" records for that [Key] to have an [ExpiryDate] that is equal to the [StartDate] of the new value. A simple update based on a join.
So if we always wanted to get the most recent records for a given [Key], we know we could create a clustered index that is:
* [ExpiryDate] ASC
* [Key] ASC
Although the keyspace may be very wide (say, a million keys), we can minimize the number of pages between reads by initially ordering them by [ExpiryDate]. And since we know the most recent record for a given key will always have an [ExpiryDate] of '12/31/9999', we can use that to our advantage.
However... what if we want to get a point-in-time snapshot of all [Key]s at a given time? Theoretically, the entirety of the keyspace isn't all being updated at the same time. Therefore for a given point-in-time, the window between [StartDate] and [ExpiryDate] is variable, so ordering by either [StartDate] or [ExpiryDate] would never yield a result in which all the records you're looking for are contiguous. Granted, you can immediately throw out all records in which the [StartDate] is greater than your defined point-in-time.
In essence, in a typical RDBMS, what indexing strategy affords the best way to minimize the number of reads to retrieve the values for all keys for a given point-in-time? I realize I can at least maximize IO by partitioning the table by [Key], however this certainly isn't ideal.
Alternatively, is there a different type of slowly-changing-dimension that solves this problem in a more performant manner?
© Stack Overflow or respective owner