Database schema for simple stats project
- by Bubnoff
Backdrop:
I have a file hierarchy of CSV files for multiple locations, organized into folders named for the month they cover. Each CSV file in a folder is named after a location.
e.g.,
folder name: 2010-feb
contains:
location1.csv
location2.csv
Each CSV file holds records like this:
2010-06-28, 20:30:00 , 0
2010-06-29, 08:30:00 , 0
2010-06-29, 09:30:00 , 0
2010-06-29, 10:30:00 , 0
2010-06-29, 11:30:00 , 0
Meaning of the record columns (column names):
date, time, number of sessions
I have a Perl script that pulls the data out of this mess. Originally I was going to store the results as JSON files, but I'm thinking a database might be more appropriate long term: comparing year-to-year trends, fun stuff like that.
Pt 2 - My question/problem:
So I now have a REST service that coughs up JSON from a test database. My question (I'm not great at DB design) is: how best to design a database backend for this?
I am thinking the following tables would suffice and keep it simple (see the sketch just below):
Location: (PK) location_code, name
Session: (PK) id, (FK) location_code, month, hour, num_sessions
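In DDL terms, roughly this is what I picture (a SQLite-flavored sketch; storing the full date instead of just the month is my own guess at what I'd need in order to group by day of week later):

CREATE TABLE location (
    location_code TEXT PRIMARY KEY,
    name          TEXT NOT NULL
);

CREATE TABLE session (
    id            INTEGER PRIMARY KEY,
    location_code TEXT NOT NULL REFERENCES location(location_code),
    session_date  TEXT NOT NULL,    -- ISO date, e.g. 2010-06-29
    hour          INTEGER NOT NULL, -- 0-23
    num_sessions  INTEGER NOT NULL
);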
I need to be able to average sessions (plus min and max) for each hour across the days of the week, and also by day of week within a given month or months. I've been using Perl hashes to do this and am trying to decide how best to implement it with a database.
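For example, with the sketch above I imagine the per-weekday, per-hour averages would look something like this in SQLite (strftime('%w', ...) gives 0-6 for Sunday-Saturday; the table and column names come from my sketch, not from anything built yet):

SELECT strftime('%w', session_date) AS day_of_week,
       hour,
       AVG(num_sessions) AS avg_sessions,
       MIN(num_sessions) AS min_sessions,
       MAX(num_sessions) AS max_sessions
FROM session
WHERE location_code = 'LOC1'
  AND strftime('%m', session_date) IN ('06', '07')  -- optional: restrict to given months
GROUP BY day_of_week, hour
ORDER BY day_of_week, hour;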
Do you think stored procedures should be used?
As to the database: depending on the info gathered here, it will be PostgreSQL or SQLite.
If there is no compelling reason for PostgreSQL, I'll stick with SQLite.
How and where should I compare the data to hours of operation? I am storing the hours
of operation in a YAML file, and I currently match the hour in each record against a hash built from that YAML. Would a database open up simpler methods? I am thinking I would do this comparison as I do now and then insert only the in-hours data, which can be recalled with:
SELECT hour, num_sessions FROM session WHERE location_code = 'LOC1';
Since only hours of operation are present in the table, I do not need to worry about it at query time.
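If it would be simpler, I could also imagine moving the hours of operation into the database instead of YAML — something like this (purely hypothetical table and column names), so out-of-hours rows could be filtered with a join rather than a Perl hash:

CREATE TABLE hours_of_operation (
    location_code TEXT NOT NULL REFERENCES location(location_code),
    day_of_week   INTEGER NOT NULL,  -- 0-6, Sunday = 0
    open_hour     INTEGER NOT NULL,  -- 0-23
    close_hour    INTEGER NOT NULL,  -- 0-23
    PRIMARY KEY (location_code, day_of_week)
);

-- keep only rows that fall inside opening hours
SELECT s.hour, s.num_sessions
FROM session s
JOIN hours_of_operation h
  ON h.location_code = s.location_code
 AND h.day_of_week = CAST(strftime('%w', s.session_date) AS INTEGER)
WHERE s.hour BETWEEN h.open_hour AND h.close_hour - 1
  AND s.location_code = 'LOC1';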
Should I calculate all the results as I do now and then store them in a stats table for the
different 'reports', rather than processing on demand? How would that look?
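Roughly what I picture for a precomputed stats table (names are just placeholders), refreshed whenever a new month of CSVs is loaded — versus just a view over session if on-demand turns out to be fast enough:

CREATE TABLE hourly_stats (
    location_code TEXT NOT NULL,
    day_of_week   INTEGER NOT NULL,
    hour          INTEGER NOT NULL,
    avg_sessions  REAL NOT NULL,
    min_sessions  INTEGER NOT NULL,
    max_sessions  INTEGER NOT NULL,
    PRIMARY KEY (location_code, day_of_week, hour)
);

INSERT INTO hourly_stats (location_code, day_of_week, hour, avg_sessions, min_sessions, max_sessions)
SELECT location_code,
       CAST(strftime('%w', session_date) AS INTEGER) AS day_of_week,
       hour,
       AVG(num_sessions),
       MIN(num_sessions),
       MAX(num_sessions)
FROM session
GROUP BY location_code, day_of_week, hour;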
Anyway ...I ramble.
Thanks for reading!
Bubnoff