MySQL Binary Storage using BLOB VS OS File System: large files, large quantities, large problems.
- by Quantico773
Hi Guys,
Versions I am running (basically
latest of everything):
PHP:    5.3.1
MySQL:  5.1.41
Apache: 2.2.14
OS: CentOS (latest)
Here is the situation.
I have thousands of very important documents, ranging from customer contracts to voice signatures (recordings of customer authorisation for contracts), with file types including, but not limited to jpg, gif, png, tiff, doc, docx, xls, wav, mp3, pdf, etc.
All of these documents are currently stored on several servers including Windows 32 bit, CentOS and Mac, among others. Some files are also stored on employees desktop computers and laptops, and some are still hard copies stored in hundreds of boxes and filing cabinets.
Now because customers or lawyers could demand evidence of contracts at any time, my company has to be able to search and locate the correct document(s) effectively, for this reason ALL of these files have to be digitised (if not already) and correlated into some sort of order for searching and accessing.
As the programmer, I have created a full Customer Relations Management tool that the whole company uses. This includes Customer Profiles management, Order and job Tracking tools, Job/sale creation and management modules, etc, and at the moment any file that is needed at a customer profile level (drivers licence, credit authority, etc) or at a job/sale level (contracts, voice signatures, etc) can be uploaded to the server and sits in a parent/child hierarchy structure, just like Windows Explorer or any other typical file managment model.
The structure appears as such:
drivers_license
|- DL_123.jpg
voice_signatures
|- VS_123.wav
|- VS_4567.wav
contracts  
So the files are uplaoded using PHP and Apache, and are stored in the file system of the OS. At the time of uploading, certain information about the file(s) is stored in a MySQL database. Some of the information stored is:
TABLE: FileUploads
FileID
CustomerID (the customer id that the file belongs to, they all have this.)
JobID/SaleID (the id of the job/sale associated, if any.)
FileSize
FileType
UploadedDateTime
UploadedBy
FilePath (the directory path the file is stored in.)
FileName (current file name of uploaded file, combination of CustomerID and JobID/SaleID if applicable.)
FileDescription
OriginalFileName (original name of the source file when uploaded, including extension.)  
So as you can see, the file is linked to the database by the File Name. When I want to provide a customers' files for download to a user all I have to do is "SELECT * FROM FileUploads WHERE CustomerID = 123 OR JobID = 2345;" and this will output all the file details I require, and with the FilePath and FileName I can provide the link for download.
http... server / FilePath / FileName
There are a number of problems with this method:
Storing files in this "database unconcious" environment means data integrity is not kept. If a record is deleted, the file may not be deleted also, or vice versa.
Files are strewn all over the place, different servers, computers, etc.
The file name is the ONLY thing matching the binary to the database and customer profile and customer records.
etc, etc. There are so many reasons, some of which are described here: http://www.dreamwerx.net/site/article01 . Also there is an interesting article here too: sietch.net/ViewNewsItem.aspx?NewsItemID=124 .
SO, after much research I have pretty much decided I am going to store ALL of these files in the database, as a BLOB or LONGBLOB, but there are still many considerations before I do this.
I know that storing them in the database is a viable option, however there are a number of methods of storing them. I also know storing them is one thing; correlating and accessing them in a manageable way is another thing entirely.
The article provided at this link:  dreamwerx.net/site/article01  describes a way of splitting the uploaded binary files into 64kb chunks and storing each chunk with the FileID, and then streaming the actual binary file to the client using headers. This is a really cool idea since it alleviates preassure on the servers memory; instead of loading an entire 100mb file into the RAM and then sending it to the client, it is doing it 64kb at a time. I have tried this (and updated his scripts) and this is totally successful, in a very small frame of testing.
So if you are in agreeance that this method is a viable, stable and robust long-term option to store moderately large files (1kb to couple hundred megs), and large quantities of these files, let me know what other considerations or ideas you have.
Also, I am considering getting a current "File Management" PHP script that gives an interface for managing files stored in the File System and converting it to manage files stored in the database. If there is already any software out there that does this, please let me know.
I guess there are many questions I could ask, and all the information is up there ^^ so please, discuss all aspects of this and we can pass ideas back and forth and teach each other.
Cheers,
Quantico773