How to index a table with a Type 2 slowly changing dimension for optimal performance

Posted by The Lazy DBA on Stack Overflow
Published on 2010-01-08

Suppose you have a table with a Type 2 slowly-changing dimension.

Let's express this table as follows, with the following columns:

* [Key]
* [Value1]
* ...
* [ValueN]
* [StartDate]
* [ExpiryDate]

In this example, let's suppose that [StartDate] is effectively the date on which the values for a given [Key] become known to the system. So our primary key would be composed of both [StartDate] and [Key].
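A minimal sketch of that table (SQLite via Python; the column names are the post's hypothetical ones, and ISO-8601 text dates compare correctly):

```python
import sqlite3

# Hypothetical Type 2 SCD table from the post. ([Key], [StartDate]) is the
# composite primary key; the '9999-12-31' sentinel marks the current row.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Dim (
        "Key"      TEXT NOT NULL,
        Value1     TEXT,                                -- ... through ValueN
        StartDate  TEXT NOT NULL,                       -- when the values became known
        ExpiryDate TEXT NOT NULL DEFAULT '9999-12-31',  -- sentinel = still current
        PRIMARY KEY ("Key", StartDate)
    )
""")

# Omitting ExpiryDate picks up the sentinel default.
conn.execute("""INSERT INTO Dim ("Key", Value1, StartDate)
                VALUES ('A', 'x', '2010-01-01')""")
print(conn.execute("SELECT ExpiryDate FROM Dim").fetchone())  # ('9999-12-31',)
```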

When a new set of values arrives for a given [Key], we insert the new record with its [ExpiryDate] set to some pre-defined high sentinel value such as '12/31/9999'. We then update the existing "most recent" record for that [Key] so that its [ExpiryDate] equals the [StartDate] of the new record. A simple update based on a join.
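That expire-then-insert step can be sketched like this (SQLite via Python; same hypothetical schema as above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Dim (
    "Key" TEXT NOT NULL, Value1 TEXT,
    StartDate TEXT NOT NULL, ExpiryDate TEXT NOT NULL DEFAULT '9999-12-31',
    PRIMARY KEY ("Key", StartDate))""")

# Existing "current" row for key A.
conn.execute("INSERT INTO Dim VALUES ('A', 'old', '2010-01-01', '9999-12-31')")

# A new value arrives for key A, effective 2010-06-01.
new_key, new_value, new_start = 'A', 'new', '2010-06-01'

# Expire the previous current row: its ExpiryDate becomes the new StartDate.
conn.execute("""
    UPDATE Dim SET ExpiryDate = ?
    WHERE "Key" = ? AND ExpiryDate = '9999-12-31'
""", (new_start, new_key))

# Insert the new current row with the sentinel expiry date.
conn.execute("INSERT INTO Dim VALUES (?, ?, ?, '9999-12-31')",
             (new_key, new_value, new_start))

rows = conn.execute("SELECT Value1, ExpiryDate FROM Dim ORDER BY StartDate").fetchall()
print(rows)  # [('old', '2010-06-01'), ('new', '9999-12-31')]
```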


So if we always wanted to get the most recent records for a given [Key], we know we could create a clustered index that is:

* [ExpiryDate] ASC
* [Key] ASC

Although the keyspace may be very wide (say, a million keys), ordering rows by [ExpiryDate] first keeps the most recent records physically contiguous, minimizing the number of pages we have to read. And since we know the most recent record for a given key will always have an [ExpiryDate] of '12/31/9999', we can use that to our advantage.
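SQLite has no clustered indexes, but the shape of the idea can be sketched with an ordinary composite index on ([ExpiryDate], [Key]) — an approximation of the physical clustering, not the real thing:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Dim (
    "Key" TEXT NOT NULL, Value1 TEXT,
    StartDate TEXT NOT NULL, ExpiryDate TEXT NOT NULL,
    PRIMARY KEY ("Key", StartDate))""")

# Analogue of the clustered index ([ExpiryDate], [Key]): every current row
# (ExpiryDate = '9999-12-31') clusters at one end of this index.
conn.execute('CREATE INDEX IX_Expiry_Key ON Dim (ExpiryDate, "Key")')

conn.executemany("INSERT INTO Dim VALUES (?, ?, ?, ?)", [
    ('A', 'a1', '2010-01-01', '2010-06-01'),
    ('A', 'a2', '2010-06-01', '9999-12-31'),
    ('B', 'b1', '2010-03-01', '9999-12-31'),
])

# "Most recent row per key" becomes a single index range scan on the sentinel.
current = conn.execute("""
    SELECT "Key", Value1 FROM Dim
    WHERE ExpiryDate = '9999-12-31'
    ORDER BY "Key"
""").fetchall()
print(current)  # [('A', 'a2'), ('B', 'b1')]
```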

However... what if we want a point-in-time snapshot of all [Key]s at a given time? The keyspace isn't all updated at once, so at any given point in time the window between [StartDate] and [ExpiryDate] varies from key to key. Ordering by either [StartDate] or [ExpiryDate] alone would therefore never yield a result in which all the records you're looking for are contiguous. Granted, you can immediately throw out every record whose [StartDate] is greater than your chosen point in time.
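The snapshot query itself is simple — treating each row as current during the half-open interval [StartDate, ExpiryDate), the predicate is `StartDate <= @pit AND ExpiryDate > @pit` (sketched below in SQLite; the data is made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Dim (
    "Key" TEXT NOT NULL, Value1 TEXT,
    StartDate TEXT NOT NULL, ExpiryDate TEXT NOT NULL,
    PRIMARY KEY ("Key", StartDate))""")
conn.executemany("INSERT INTO Dim VALUES (?, ?, ?, ?)", [
    ('A', 'a1', '2010-01-01', '2010-06-01'),
    ('A', 'a2', '2010-06-01', '9999-12-31'),
    ('B', 'b1', '2010-03-01', '9999-12-31'),
])

# Each row is current during [StartDate, ExpiryDate), so the snapshot at
# @pit is the set of rows whose window straddles @pit.
snapshot = conn.execute("""
    SELECT "Key", Value1 FROM Dim
    WHERE StartDate <= :pit AND ExpiryDate > :pit
    ORDER BY "Key"
""", {"pit": "2010-04-15"}).fetchall()
print(snapshot)  # [('A', 'a1'), ('B', 'b1')]
```

The trouble the question describes is that neither half of that predicate is selective on its own, so no single-column date ordering makes the qualifying rows contiguous.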


In essence, in a typical RDBMS, what indexing strategy minimizes the number of reads needed to retrieve the values for all keys as of a given point in time? I realize I can at least improve I/O throughput by partitioning the table by [Key], however this certainly isn't ideal.
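One common mitigation — my suggestion, not something from the post — is to lean on the primary key's own ordering: with the table keyed by ([Key], [StartDate]), the snapshot becomes one short b-tree seek per key (the greatest [StartDate] not after the point in time), rather than hoping the rows land contiguously under any single date ordering. A sketch in SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Dim (
    "Key" TEXT NOT NULL, Value1 TEXT,
    StartDate TEXT NOT NULL, ExpiryDate TEXT NOT NULL,
    PRIMARY KEY ("Key", StartDate))""")
conn.executemany("INSERT INTO Dim VALUES (?, ?, ?, ?)", [
    ('A', 'a1', '2010-01-01', '2010-06-01'),
    ('A', 'a2', '2010-06-01', '9999-12-31'),
    ('B', 'b1', '2010-03-01', '9999-12-31'),
])

# For each key, seek the single row with the greatest StartDate <= @pit.
# Keys with no row before @pit simply drop out (the subquery returns NULL).
snapshot = conn.execute("""
    SELECT d."Key", d.Value1
    FROM Dim AS d
    WHERE d.StartDate = (SELECT MAX(StartDate) FROM Dim
                         WHERE "Key" = d."Key" AND StartDate <= :pit)
    ORDER BY d."Key"
""", {"pit": "2010-04-15"}).fetchall()
print(snapshot)  # [('A', 'a1'), ('B', 'b1')]
```

Whether the optimizer turns the correlated subquery into per-key seeks depends on the engine; on large tables it's worth checking the plan.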

Alternatively, is there a different type of slowly-changing-dimension that solves this problem in a more performant manner?

© Stack Overflow or respective owner
