Database schema for simple stats project
Posted
by
Bubnoff
on Stack Overflow
See other posts from Stack Overflow
or by Bubnoff
Published on 2010-12-24T00:18:18Z
Indexed on
2010/12/24
0:54 UTC
Read the original article
Hit count: 237
database-design
Backdrop:
I have a file hierarchy of cvs files for multiple locations named by dates they cover ...by month specifically. Each cvs file in the folder is named after the location.
eg', folder name: 2010-feb
contains: location1.csv location2.csv
Each CSV file holds records like this:
2010-06-28, 20:30:00 , 0
2010-06-29, 08:30:00 , 0
2010-06-29, 09:30:00 , 0
2010-06-29, 10:30:00 , 0
2010-06-29, 11:30:00 , 0
meaning of record columns ( column names ):
Date, time, # of sessions
I have a perl script that pulls the data from this mess and originally I was going to store it as json files, but am thinking a database might be more appropriate long term ...comparing year to year trends ...fun stuff like that.
Pt 2 - My question/problem:
So I now have a REST service that coughs up json with a test database. My question is [ I suck at db design ], how best to design a database backend for this?
I am thinking the following tables would suffice and keep it simple:
Location: (PK)location_code, name
session: (PK)id, (FK)location_code, month, hour, num_sessions
I need to be able to average sessions (plus min and max) for each hour across days of week in addition to days of week in a given month or months. I've been using perl hashes to do this and am trying to decide how best to implement this with a database.
Do you think stored procedures should be used?
As to the database, depending on info gathered here, it will be postgresql or sqlite. If there is no compelling reason for postgresql I'll stick with sqlite.
How and where should I compare the data to hours of operation. I am storing the hours of operation in a yaml file. I currently 'match' the hour in the data to a hash from the yaml to do this. Would a database open simpler methods? I am thinking I would do this comparison as I do now then insert the data. Can be recalled with:
SELECT hour, num_sessions FROM session WHERE location_code=LOC1
Since only hours of operation are present, I do not need to worry about it. Should I calculate all results as I do now then store as a stats table for different 'reports'? This, rather than processing on demand? How would this look?
Anyway ...I ramble.
Thanks for reading!
Bubnoff
© Stack Overflow or respective owner