Am I correctly extracting JPEG binary data from this mysqldump?
- by Glenn
I have a very old .sql backup of a vBulletin site that I ran around 8 years ago. I am trying to view the file attachments that are stored in the DB. The script below extracts them all, and I verified that the results are JPEGs by hex dumping them and checking for the SOI (start of image) and EOI (end of image) bytes (FFD8 and FFD9, respectively), as described on the JPEG wiki page.
But when I try to open them with evince, I get this message: "Error interpreting JPEG image file (JPEG datastream contains no image)".
What could be going on here?
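For reference, the SOI/EOI check described above can be sketched roughly as follows. This is a minimal sketch, not the exact procedure used with the hex dump; the names `has_jpeg_framing` and `check_file` are illustrative. It only looks at the first and last two bytes, so it says nothing about whether the data in between is intact.

```python
SOI = b"\xff\xd8"  # start-of-image marker
EOI = b"\xff\xd9"  # end-of-image marker


def has_jpeg_framing(data):
    """Return True if data starts with SOI and ends with EOI."""
    return data.startswith(SOI) and data.endswith(EOI)


def check_file(path):
    """Read a file in binary mode and test its first and last two bytes."""
    with open(path, "rb") as fh:
        return has_jpeg_framing(fh.read())
```

A file can pass this check and still fail to decode, which matches the behaviour seen with evince.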
Some background info:
The SQL dump is around 8 years old
vBulletin 2.x was the software that stored the data
Most likely PHP 4 was used
Most likely MySQL 4.0, possibly even 3.x
The column that stores the attachments has the datatype mediumtext
My Python 3.1 script:
#!/usr/bin/env python3.1
import re

# Strip the leading "INSERT INTO attachment VALUES('n', 'n', 'n', '" prefix.
trim_l = re.compile(br"""^INSERT INTO attachment VALUES\('\d+', '\d+', '\d+', '(.+)""")
# Strip the trailing "', 'n', 'n');" suffix.
trim_r = re.compile(br"""(.+)', '\d+', '\d+'\);$""")
# Split what remains into the file name and the attachment data.
extractor = re.compile(br"""^(.*(?:\.jpe?g|\.gif|\.bmp))', '(.+)$""")

with open('attachments.sql', 'rb') as fh:
    for line in fh:
        data = trim_l.findall(line)[0]
        data = trim_r.findall(data)[0]
        data = extractor.findall(data)
        if data:
            name, data = data[0]
            try:
                filename = 'files/%s' % str(name, 'UTF-8')
            except UnicodeDecodeError:
                # Skip attachments whose file name is not valid UTF-8.
                continue
            # Write the captured bytes exactly as they appear in the dump.
            with open(filename, 'wb') as ah:
                ah.write(data)
Update:
The JPEG wiki page says FF bytes are section markers, with the next byte indicating the section type. I see some that are not listed in the wiki page (specifically, I see a lot of 5C bytes, so FF5C). But the list is of "common markers" so I'm trying to find a more complete list. Any guidance here would also be appreciated.
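To see which bytes follow FF in an extracted file, a rough tally like the one below can help. It is a naive sketch with illustrative names (`marker_histogram`, `print_histogram`): it counts the byte after every FF anywhere in the file, including FF bytes that happen to sit inside segment payloads, so not every entry is necessarily a real marker. In the compressed image data itself, a literal FF is normally followed by 00 (byte stuffing), so FF00 is expected to be common.

```python
from collections import Counter


def marker_histogram(data):
    """Count the byte that follows each 0xFF byte in data."""
    counts = Counter()
    for i in range(len(data) - 1):
        if data[i] == 0xFF:
            counts[data[i + 1]] += 1
    return counts


def print_histogram(path):
    """Print each FFxx pair found in the file, most frequent first."""
    with open(path, "rb") as fh:
        counts = marker_histogram(fh.read())
    for byte, n in counts.most_common():
        print("FF%02X: %d" % (byte, n))
```

Calling `print_histogram('files/example.jpg')` (the path is a placeholder) on one of the extracted files and on a known-good JPEG makes it easier to spot pairs such as FF5C that appear only in the extracted ones.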