Can MySQL reasonably perform queries on billions of rows?
- by haxney
I am planning on storing scans from a mass spectrometer in a MySQL database and
would like to know whether storing and analyzing this amount of data is remotely
feasible. I know performance varies wildly depending on the environment, but I'm
looking for the rough order of magnitude: will queries take 5 days or 5
milliseconds?
Input format
Each input file contains a single run of the spectrometer; each run is comprised
of a set of scans, and each scan has an ordered array of datapoints. There is a
bit of metadata, but the majority of the file is comprised of arrays 32- or
64-bit ints or floats.
Host system
|----------------+-------------------------------|
| OS | Windows 2008 64-bit |
| MySQL version | 5.5.24 (x86_64) |
| CPU | 2x Xeon E5420 (8 cores total) |
| RAM | 8GB |
| SSD filesystem | 500 GiB |
| HDD RAID | 12 TiB |
|----------------+-------------------------------|
There are some other services running on the server using negligible processor
time.
File statistics
|------------------+--------------|
| number of files | ~16,000 |
| total size | 1.3 TiB |
| min size | 0 bytes |
| max size | 12 GiB |
| mean | 800 MiB |
| median | 500 MiB |
| total datapoints | ~200 billion |
|------------------+--------------|
The total number of datapoints is a very rough estimate.
Proposed schema
I'm planning on doing things "right" (i.e. normalizing the data like crazy) and
so would have a runs table, a spectra table with a foreign key to runs,
and a datapoints table with a foreign key to spectra.
The 200 Billion datapoint question
I am going to be analyzing across multiple spectra and possibly even multiple
runs, resulting in queries which could touch millions of rows. Assuming I index
everything properly (which is a topic for another question) and am not trying to
shuffle hundreds of MiB across the network, is it remotely plausible for MySQL
to handle this?
UPDATE: additional info
The scan data will be coming from files in the XML-based
mzML format. The meat of this format is in the
<binaryDataArrayList> elements where the data is stored. Each scan produces =
2 <binaryDataArray> elements which, taken together, form a 2-dimensional (or
more) array of the form [[123.456, 234.567, ...], ...].
These data are write-once, so update performance and transaction safety are not
concerns.
My naïve plan for a database schema is:
runs table
| column name | type |
|-------------+-------------|
| id | PRIMARY KEY |
| start_time | TIMESTAMP |
| name | VARCHAR |
|-------------+-------------|
spectra table
| column name | type |
|----------------+-------------|
| id | PRIMARY KEY |
| name | VARCHAR |
| index | INT |
| spectrum_type | INT |
| representation | INT |
| run_id | FOREIGN KEY |
|----------------+-------------|
datapoints table
| column name | type |
|-------------+-------------|
| id | PRIMARY KEY |
| spectrum_id | FOREIGN KEY |
| mz | DOUBLE |
| num_counts | DOUBLE |
| index | INT |
|-------------+-------------|
Is this reasonable?