How to index a table with a Type 2 slowly changing dimension for optimal performance

Posted by The Lazy DBA on Stack Overflow See other posts from Stack Overflow or by The Lazy DBA
Published on 2010-01-08T19:24:59Z Indexed on 2010/05/18 2:40 UTC
Read the original article Hit count: 288

Filed under:

sql-server-2005

Suppose you have a table with a Type 2 slowly-changing dimension.

Let's express this table as follows, with the following columns:

* [Key]
* [Value1]
* ...
* [ValueN]
* [StartDate]
* [ExpiryDate]

In this example, let's suppose that [StartDate] is effectively the date in which the values for a given [Key] become known to the system. So our primary key would be composed of both [StartDate] and [Key].

When a new set of values arrives for a given [Key], we assign [ExpiryDate] to some pre-defined high surrogate value such as '12/31/9999'. We then set the existing "most recent" records for that [Key] to have an [ExpiryDate] that is equal to the [StartDate] of the new value. A simple update based on a join.

So if we always wanted to get the most recent records for a given [Key], we know we could create a clustered index that is:

* [ExpiryDate] ASC
* [Key] ASC

Although the keyspace may be very wide (say, a million keys), we can minimize the number of pages between reads by initially ordering them by [ExpiryDate]. And since we know the most recent record for a given key will always have an [ExpiryDate] of '12/31/9999', we can use that to our advantage.

However... what if we want to get a point-in-time snapshot of all [Key]s at a given time? Theoretically, the entirety of the keyspace isn't all being updated at the same time. Therefore for a given point-in-time, the window between [StartDate] and [ExpiryDate] is variable, so ordering by either [StartDate] or [ExpiryDate] would never yield a result in which all the records you're looking for are contiguous. Granted, you can immediately throw out all records in which the [StartDate] is greater than your defined point-in-time.

In essence, in a typical RDBMS, what indexing strategy affords the best way to minimize the number of reads to retrieve the values for all keys for a given point-in-time? I realize I can at least maximize IO by partitioning the table by [Key], however this certainly isn't ideal.

Alternatively, is there a different type of slowly-changing-dimension that solves this problem in a more performant manner?

Developer IT

How to index a table with a Type 2 slowly changing dimension for optimal performance - Developer IT

How to index a table with a Type 2 slowly changing dimension for optimal performance

sql

data-warehouse

indexing

sql-server

sql-server-2005

Related posts about sql

SQL SERVER – Concat Strings in SQL Server using T-SQL – SQL in Sixty Seconds #035 – Video

SQL SERVER – Concat Function in SQL Server – SQL Concatenation

Error with SQL Server Setup 2012 on Windows 2012

How can I detect which version of SQL (eg SQL 2008 or SQL Azure)

Nested SQL Select statement fails on SQL Server 2000, ok on SQL Server 2005

Related posts about data-warehouse

PostgreSQL to Data-Warehouse: Best approach for near-real-time ETL / extraction of data

Time and date dimension in data warehouse

Oracle Healthcare Data Warehouse Foundations RELEASED!

BIWA Wednesday TechCast Series - Opposition to Data Warehouse Initiatives

OpenWorld on your iPad and iPhone

Categories cloud