Can MySQL reasonably perform queries on billions of rows?
Posted
by
haxney
on Stack Overflow
See other posts from Stack Overflow
or by haxney
Published on 2012-07-02T19:36:13Z
Indexed on
2012/07/03
15:16 UTC
Read the original article
Hit count: 222
I am planning on storing scans from a mass spectrometer in a MySQL database and would like to know whether storing and analyzing this amount of data is remotely feasible. I know performance varies wildly depending on the environment, but I'm looking for the rough order of magnitude: will queries take 5 days or 5 milliseconds?
Input format
Each input file contains a single run of the spectrometer; each run is comprised of a set of scans, and each scan has an ordered array of datapoints. There is a bit of metadata, but the majority of the file is comprised of arrays 32- or 64-bit ints or floats.
Host system
|----------------+-------------------------------| | OS | Windows 2008 64-bit | | MySQL version | 5.5.24 (x86_64) | | CPU | 2x Xeon E5420 (8 cores total) | | RAM | 8GB | | SSD filesystem | 500 GiB | | HDD RAID | 12 TiB | |----------------+-------------------------------|
There are some other services running on the server using negligible processor time.
File statistics
|------------------+--------------| | number of files | ~16,000 | | total size | 1.3 TiB | | min size | 0 bytes | | max size | 12 GiB | | mean | 800 MiB | | median | 500 MiB | | total datapoints | ~200 billion | |------------------+--------------|
The total number of datapoints is a very rough estimate.
Proposed schema
I'm planning on doing things "right" (i.e. normalizing the data like crazy) and
so would have a runs
table, a spectra
table with a foreign key to runs
,
and a datapoints
table with a foreign key to spectra
.
The 200 Billion datapoint question
I am going to be analyzing across multiple spectra and possibly even multiple runs, resulting in queries which could touch millions of rows. Assuming I index everything properly (which is a topic for another question) and am not trying to shuffle hundreds of MiB across the network, is it remotely plausible for MySQL to handle this?
UPDATE: additional info
The scan data will be coming from files in the XML-based
mzML format. The meat of this format is in the
<binaryDataArrayList>
elements where the data is stored. Each scan produces >=
2 <binaryDataArray>
elements which, taken together, form a 2-dimensional (or
more) array of the form [[123.456, 234.567, ...], ...]
.
These data are write-once, so update performance and transaction safety are not concerns.
My naïve plan for a database schema is:
runs
table
| column name | type | |-------------+-------------| | id | PRIMARY KEY | | start_time | TIMESTAMP | | name | VARCHAR | |-------------+-------------|
spectra
table
| column name | type | |----------------+-------------| | id | PRIMARY KEY | | name | VARCHAR | | index | INT | | spectrum_type | INT | | representation | INT | | run_id | FOREIGN KEY | |----------------+-------------|
datapoints
table
| column name | type | |-------------+-------------| | id | PRIMARY KEY | | spectrum_id | FOREIGN KEY | | mz | DOUBLE | | num_counts | DOUBLE | | index | INT | |-------------+-------------|
Is this reasonable?
© Stack Overflow or respective owner