Database schema for simple stats project

Posted by Bubnoff on Stack Overflow See other posts from Stack Overflow or by Bubnoff
Published on 2010-12-24T00:18:18Z Indexed on 2010/12/24 0:54 UTC
Read the original article Hit count: 237

Filed under:

Backdrop:

I have a file hierarchy of cvs files for multiple locations named by dates they cover ...by month specifically. Each cvs file in the folder is named after the location.

eg', folder name: 2010-feb

contains: location1.csv location2.csv

Each CSV file holds records like this:

2010-06-28, 20:30:00 , 0
2010-06-29, 08:30:00 , 0
2010-06-29, 09:30:00 , 0
2010-06-29, 10:30:00 , 0
2010-06-29, 11:30:00 , 0

meaning of record columns ( column names ):

Date, time, # of sessions

I have a perl script that pulls the data from this mess and originally I was going to store it as json files, but am thinking a database might be more appropriate long term ...comparing year to year trends ...fun stuff like that.

Pt 2 - My question/problem:

So I now have a REST service that coughs up json with a test database. My question is [ I suck at db design ], how best to design a database backend for this?

I am thinking the following tables would suffice and keep it simple:

Location: (PK)location_code, name 
session: (PK)id, (FK)location_code, month, hour, num_sessions

I need to be able to average sessions (plus min and max) for each hour across days of week in addition to days of week in a given month or months. I've been using perl hashes to do this and am trying to decide how best to implement this with a database.

Do you think stored procedures should be used?

As to the database, depending on info gathered here, it will be postgresql or sqlite. If there is no compelling reason for postgresql I'll stick with sqlite.

How and where should I compare the data to hours of operation. I am storing the hours of operation in a yaml file. I currently 'match' the hour in the data to a hash from the yaml to do this. Would a database open simpler methods? I am thinking I would do this comparison as I do now then insert the data. Can be recalled with:

SELECT hour, num_sessions FROM session WHERE location_code=LOC1

Since only hours of operation are present, I do not need to worry about it. Should I calculate all results as I do now then store as a stats table for different 'reports'? This, rather than processing on demand? How would this look?

Anyway ...I ramble.

Thanks for reading!

Bubnoff

© Stack Overflow or respective owner

Related posts about database-design