How can I improve my select query for storing large versioned data sets?
- by Jason Francis
At work, we build large multi-page web applications, consisting mostly of radio buttons and check boxes. The primary purpose of each application is to gather data, but as users return to a page they have previously visited, we report back their previous responses. In the worst case, we might have up to 900 distinct variables and around 1.5 million users.
For several reasons, it makes sense to use an insert-only approach to storing the data (as opposed to update-in-place) so that we can capture historical data about repeated interactions with variables. The net result is that we might have several responses per user per variable.
Our table to collect the responses looks something like this:
CREATE TABLE [dbo].[results](
    [id] [bigint] IDENTITY(1,1) NOT NULL,
    [userid] [int] NULL,
    [variable] [varchar](8) NULL,
    [value] [tinyint] NULL,
    [submitted] [smalldatetime] NULL)
where id serves as the primary key.
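To illustrate the insert-only approach: a repeated interaction with a variable simply appends a new row rather than overwriting the old one, so the row with the highest id is the current response. (The sample values here are hypothetical.)

```sql
-- User 2111846 answers "veteran" twice; both rows are kept.
INSERT INTO [dbo].[results] ([userid], [variable], [value], [submitted])
VALUES (2111846, 'veteran', 1, '2013-04-01 09:15');

INSERT INTO [dbo].[results] ([userid], [variable], [value], [submitted])
VALUES (2111846, 'veteran', 0, '2013-04-02 14:30');
-- The second row (higher id) is the one we report back to the user.
```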
Virtually every request results in a series of insert statements (one per variable submitted), and then we run a select to produce previous responses for the next page (something like this):
SELECT t.id, t.variable, t.value
FROM results t WITH (NOLOCK)
WHERE t.userid = 2111846
  AND t.variable IN ('internat', 'veteran', 'athlete')
  AND t.id IN (SELECT MAX(id)
               FROM results WITH (NOLOCK)
               WHERE userid = 2111846
                 AND variable IN ('internat', 'veteran', 'athlete')
               GROUP BY variable)
Which, in this case, would return the most recent responses for the variables "internat", "veteran", and "athlete" for user 2111846.
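A common alternative for this "latest row per group" pattern on SQL Server 2005 and later is ROW_NUMBER(), which numbers each user's responses per variable in a single pass instead of re-scanning the table for the MAX(id) subquery. A sketch against the schema above:

```sql
SELECT id, variable, value
FROM (SELECT id, variable, value,
             ROW_NUMBER() OVER (PARTITION BY variable
                                ORDER BY id DESC) AS rn
      FROM results WITH (NOLOCK)
      WHERE userid = 2111846
        AND variable IN ('internat', 'veteran', 'athlete')) AS latest
WHERE rn = 1;
```

Whether this actually beats the IN (SELECT MAX(id) ...) form depends on the indexes and the optimizer, so it is worth comparing the two plans against production-sized data.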
We have followed the advice of the database tuning tools in indexing the tables, and against our data, this is the best-performing version of the select query that we have been able to come up with. Even so, there is significant performance degradation as the table approaches 1 million records (and we might have about 150x that). We have a fairly elegant solution in place for sharding the data across multiple tables, which has been working quite well, but I am open to any advice on how I might construct a better version of the select query. We use this structure frequently for storing lots of independent data points, and we like the benefits it provides.
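Since the query always seeks on userid, then variable, then the highest id per group, a covering index keyed in that order lets both the outer query and the MAX(id) subquery be satisfied by index seeks alone, without touching the base table. A sketch (the index name is hypothetical):

```sql
CREATE NONCLUSTERED INDEX IX_results_userid_variable_id
    ON [dbo].[results] ([userid], [variable], [id] DESC)
    INCLUDE ([value]);
```

The INCLUDE column carries value at the leaf level so the SELECT list is covered; if the tuning advisor already proposed something equivalent, this may be what is in place.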
So the question is, how can I improve the performance of the select query? I assume the nested select statement is a bad idea, but I have yet to find an alternative that performs as well.
Thanks in advance.
NB: Since we emphasize creating over reading in this case, and since we never update in place, there doesn't seem to be any penalty (and some advantage) to using the NOLOCK hint here.