I have a data-intensive application that desperately needs a database make-over.
The general data model:
There are records with RIDs, grouped together by group IDs (GID). The records have arbitrary data fields, (maybe 5-15) with a few of them mandatory and the rest optional, and thus sparse.
The general use model:
There are LOTS and LOTS of Writes. Millions to Billions of records are stored. Very often, they are associated with new GIDs, but sometimes, they are associated with existing GIDs.
There aren't as many reads, but when they happen, they need to be pretty fast or at least constant speed regardless of the database size. And when the reads happen, it will need to retrieve all the records/RIDs with a certain GID.
I don't have a need to search by the record field values. Primarily, I will need to query by the GID and maybe RID.
What database implementation should I use?
I did some initial research between document-oriented and column-oriented databases and it seems the document-oriented ones are a good fit, model-wise. I could store all the records together under the same document key using the GID. But I don't really have any use for their ability to search the document contents itself.
I like the simplicity and scalability of column-oriented databases like Cassandra, but how should I model my data in this paradigm for optimal performance? Should my key be the GID and should I create a column for each record/RID? (there maybe thousands or hundreds of thousands of records in a group/GID). Or should my key be the RID and ensure each row has a column for the GID value? What results in faster writes and reads under this model?