SQL query - choosing 'last updated' record in a group, better db design?
- by Jimmy
Hi,
Let's say I have a MySQL database with 3 tables:
table 1: Persons, with 1 column ID (int)
table 2: Newsletters, with 1 column ID (int)
table 3: Subscriptions, with columns Person_ID (int), Newsletter_ID (int), Subscribed (bool), Updated (Datetime)
Subscriptions.Person_ID points to a Person, and Subscription.Newsletter_ID points to a Newsletter. Thus, each person may have 0 or more subscriptions to 0 or more magazines at once. The table Subscriptions will also store the entire history of each person's subscriptions to each newsletter. If a particular Person_ID-Newsletter_ID pair doesn't have a row in the Subscriptions table, then it's equivalent to that pair having a subscription status of 'false'.
Here is a sample dataset
Persons
ID
1
2
3
Newsletters
ID
4
5
6
Subscriptions
Person_ID Newsletter_ID Subscribed Updated
2 4 true 2010-05-01
3 4 true 2010-05-01
3 5 true 2010-05-10
3 4 false 2010-05-15
Thus, as of 2010-05-16, Person 1 has no subscription, Person 2 has a subscription to Newsletter 4, and Person 3 has a subscription to Newsletter 5. Person 3 had a subscription to Newsletter 4 for a while, but not anymore.
I'm trying to do 2 kinds of query.
A query that shows everyone's active subscriptions as of query time (we can assume that updated will never be in the future -- thus, this means returning the record with the latest 'updated' value for each Person_ID-Newsletter_ID pair, as long as Subscribed is true (if the latest record for a Person_ID-Newsletter_ID pair has a Subscribed status of false, then I don't want that record returned)).
A query that returns all active subscriptions for a specific newsletter - same qualification as in 1. regarding records with 'false' in the Subscribed column.
I don't use SQL/databases often enough to tell if this design is good, or if the SQL queries needed would be slow on a database with, say, 1M records in the Subscriptions table.
I was using the Visual query builder tool in Visual Studio 2010 but I can't even get the query to return the latest updated record for each Person_ID-Newsletter_ID pair.
Is it possible to come up with SQL queries that don't involve using subqueries (presumably because they would become too slow with a larger data set)? If not, would it be a better design to have a separate Subscriptions_History table, and every time a subscription status for a Person_ID-Newsletter-ID pair is added to Subscriptions, any existing record for that pair is moved to Subscriptions_History (that way the Subscriptions table only ever contains the latest status update for any Person_ID-Newsletter_ID pair)?
I'm using .net on Windows, so would it be easier (or the same, or harder) to do this kind of queries using Linq? Entity Framework?
Thanks!