How to overcome shortcomings in reporting from EAV database?
- by David Archer
The major shortcomings with Entity-Attribute-Value database designs in SQL all seem to be related to being able to query and report on the data efficiently and quickly. Most of the information I read on the subject warn against implementing EAV due to these problems and the commonality of querying/reporting for almost all applications.
I am currently designing a system where almost all the fields necessary for data storage are not known at design/compile time and are defined by the end-user of the system. EAV seems like a good fit for this requirement but due to the problems I've read about, I am hesitant in implementing it as there are also some pretty heavy reporting requirements for this system as well. I think I've come up with a way around this but would like to pose the question to the SO community.
Given that typical normalized database (OLTP) still isn't always the best option for running reports, a good practice seems to be having a "reporting" database (OLAP) where the data from the normalized database is copied to, indexed extensively, and possibly denormalized for easier querying. Could the same idea be used to work around the shortcomings of an EAV design?
The main downside I see are the increased complexity of transferring the data from the EAV database to reporting as you may end up having to alter the tables in the reporting database as new fields are defined in the EAV database. But that is hardly impossible and seems to be an acceptable tradeoff for the increased flexibility given by the EAV design. This downside also exists if I use a non-SQL data store (i.e. CouchDB or similar) for the main data storage since all the standard reporting tools are expecting a SQL backend to query against.
Do the issues with EAV systems mostly go away if you have a seperate reporting database for querying?
EDIT: Thanks for the comments so far. One of the important things about the system I'm working on it that I'm really only talking about using EAV for one of the entities, not everything in the system.
The whole gist of the system is to be able to pull data from multiple disparate sources that are not known ahead of time and crunch the data to come up with some "best known" data about a particular entity. So every "field" I'm dealing with is multi-valued and I'm also required to track history for each. The normalized design for this ends up being 1 table per field which makes querying it kind of painful anyway.
Here are the table schemas and sample data I'm looking at (obviously changed from what I'm working on but I think it illustrates the point well):
EAV Tables
Person
-------------------
- Id - Name -
-------------------
- 123 - Joe Smith -
-------------------
Person_Value
-------------------------------------------------------------------
- PersonId - Source - Field - Value - EffectiveDate -
-------------------------------------------------------------------
- 123 - CIA - HomeAddress - 123 Cherry Ln - 2010-03-26 -
- 123 - DMV - HomeAddress - 561 Stoney Rd - 2010-02-15 -
- 123 - FBI - HomeAddress - 676 Lancas Dr - 2010-03-01 -
-------------------------------------------------------------------
Reporting Table
Person_Denormalized
----------------------------------------------------------------------------------------
- Id - Name - HomeAddress - HomeAddress_Confidence - HomeAddress_EffectiveDate -
----------------------------------------------------------------------------------------
- 123 - Joe Smith - 123 Cherry Ln - 0.713 - 2010-03-26 -
----------------------------------------------------------------------------------------
Normalized Design
Person
-------------------
- Id - Name -
-------------------
- 123 - Joe Smith -
-------------------
Person_HomeAddress
------------------------------------------------------
- PersonId - Source - Value - Effective Date -
------------------------------------------------------
- 123 - CIA - 123 Cherry Ln - 2010-03-26 -
- 123 - DMV - 561 Stoney Rd - 2010-02-15 -
- 123 - FBI - 676 Lancas Dr - 2010-03-01 -
------------------------------------------------------
The "Confidence" field here is generated using logic that cannot be expressed easily (if at all) using SQL so my most common operation besides inserting new values will be pulling ALL data about a person for all fields so I can generate the record for the reporting table. This is actually easier in the EAV model as I can do a single query. In the normalized design, I end up having to do 1 query per field to avoid a massive cartesian product from joining them all together.