Survey Data Model - How to avoid EAV and excessive denormalization?
- by AlexDPC
Hi everyone,
My database skills are mediocre at best, and I have to design a data model for survey data. I have given this some thought, and right now I feel stuck between some kind of EAV model and a design involving hundreds of tables, each with hundreds of columns (and thousands of records). There must be a better way to do this, and I hope the wise folks on this forum can help me.
I have already searched various forums, but I couldn't really find a solution. If it has already been given elsewhere, please excuse me and provide me with a link so I can read up on it.
Some assumptions about the data I have to deal with (a rough sketch of how I picture the core entities follows this list):
Each survey consists of 1 to n questionnaires
Each questionnaire consists of 100-2,000 questions (please ignore that 2,000 questions really sounds like a lot to answer...)
Questions can be of various types: multiple-choice, free text, a number (like age, income, percentages, ...)
Each survey involves 10-200 countries (These are not the respondents. The respondents are actually people in the countries.)
Depending on the type of questionnaire, each questionnaire is answered by 100-20,000 respondents per country.
A country can adapt the questionnaires for a survey, i.e. add, remove or edit questions
The data for one country is gathered in a separate database in that country. There is no possibility for online integration from the start.
The data for all countries has to be integrated later. For example, if a country has deleted a question, the data for that question must somehow be derived from what they sent in order to achieve a uniform design across all countries.
I will have to write the integration and cleaning software, which will need to work with every country's data
In the end the data needs to be exported to flat files, one rectangular grid per country and questionnaire.
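To make the structure above a bit more concrete, this is roughly how I picture the core entities. It is only a sketch; all table, column, and type names are placeholders, not an actual design:

```sql
-- Rough sketch of the core entities (placeholder names, SQL Server-style types).
CREATE TABLE Survey (
    SurveyID        int PRIMARY KEY,
    SurveyName      varchar(200) NOT NULL
);

CREATE TABLE Questionnaire (
    QuestionnaireID int PRIMARY KEY,
    SurveyID        int NOT NULL REFERENCES Survey (SurveyID),
    Title           varchar(200) NOT NULL
);

CREATE TABLE Question (
    QuestionID      int PRIMARY KEY,
    QuestionnaireID int NOT NULL REFERENCES Questionnaire (QuestionnaireID),
    QuestionText    varchar(2000) NOT NULL,
    QuestionType    varchar(20)  NOT NULL  -- 'choice', 'text', 'number', ...
);

CREATE TABLE Respondent (
    RespondentID    int PRIMARY KEY,
    CountryCode     char(2) NOT NULL       -- respondents belong to a country
);
```

The open question is how the answers (respondent x question) should be stored, which is what the rest of this post is about.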
I have already discussed this topic with people from various backgrounds but have not come to a good solution yet. I have mainly gotten two kinds of opinions.
The domain experts, who are used to working with flat files (spreadsheet-style) for data processing and analysis, vote for a denormalized structure with loads of tables and columns, as I described above (one table per country and questionnaire). This sounds terrible to me, because I learned that wide tables are to be avoided, it will be annoying to determine which columns a table actually contains when working with it, the database will become cluttered with hundreds of tables (or I would even need to set up multiple databases, each with a similar yet slightly different design), etc.
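Just so it is clear what they are proposing, the result would be something like this for each country/questionnaire combination (the table and column names are made up, and a real table would have 100-2,000 question columns):

```sql
-- One wide table per country and questionnaire, one column per question
-- (made-up example; names are placeholders).
CREATE TABLE Answers_DE_Household2011 (
    RespondentID   int PRIMARY KEY,
    Q001_Age       int NULL,
    Q002_Income    decimal(12,2) NULL,
    Q003_Choice    int NULL,
    Q004_Comments  varchar(4000) NULL
    -- ... hundreds more columns, differing slightly per country ...
);
```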
The O-O programmers vote for a strongly "normalized" design, which would effectively lead to a central table containing all the answers from all respondents to all questions. This table would either need a column of type sql_variant or multiple answer columns with different types to store answers of different kinds (multiple choice, free text, ...). The former would essentially be an EAV model. I tend to follow Joe Celko here, who strongly discourages generic designs like this (EAV and its cousin, the OTLT or "One True Lookup Table"). The latter would imply that each row contains NULLs in the non-applicable answer columns by design.
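To illustrate the two variants (again only a sketch, not a finished design):

```sql
-- Variant 1: EAV-style, one generic value column (sql_variant on SQL Server).
CREATE TABLE Answer_EAV (
    RespondentID int NOT NULL,
    QuestionID   int NOT NULL,
    AnswerValue  sql_variant NULL,
    PRIMARY KEY (RespondentID, QuestionID)
);

-- Variant 2: one typed column per answer type; all but one are NULL in every row.
CREATE TABLE Answer_Typed (
    RespondentID int NOT NULL,
    QuestionID   int NOT NULL,
    AnswerChoice int NULL,             -- multiple choice
    AnswerText   varchar(4000) NULL,   -- free text
    AnswerNumber decimal(18,4) NULL,   -- numbers (age, income, percentages, ...)
    PRIMARY KEY (RespondentID, QuestionID)
);
```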
Another alternative I could think of would be to create one table per answer type, i.e. one for multiple-choice questions, one for free-text questions, etc. That's not very generic, it would lead to a lot of UNION queries, I think, and I would have to add a table whenever a new answer type is invented.
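Sketched out, that third alternative would look roughly like this, and pulling all answers for one respondent already needs a UNION over every type (names are placeholders again):

```sql
-- One answer table per question type (sketch).
CREATE TABLE AnswerChoice (
    RespondentID int NOT NULL,
    QuestionID   int NOT NULL,
    ChoiceID     int NOT NULL,
    PRIMARY KEY (RespondentID, QuestionID)
);

CREATE TABLE AnswerText (
    RespondentID int NOT NULL,
    QuestionID   int NOT NULL,
    AnswerText   varchar(4000) NOT NULL,
    PRIMARY KEY (RespondentID, QuestionID)
);

-- ... plus AnswerNumber, and a new table for every future answer type ...

-- All answers of respondent 42, forced into one common (string) type:
SELECT QuestionID, CAST(ChoiceID AS varchar(4000)) AS Answer
FROM   AnswerChoice
WHERE  RespondentID = 42
UNION ALL
SELECT QuestionID, AnswerText
FROM   AnswerText
WHERE  RespondentID = 42;
```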
Sorry for boring you with all this text and thank you for your input!
Cheers,
Alex
PS: I asked the same question here: http://www.eggheadcafe.com/community/aspnet/13/10242616/survey-data-model--how-to-avoid-eav-and-excessive-denormalization.aspx