Importing a large dataset into a database
Posted
by peaceful
on Stack Overflow
See other posts from Stack Overflow
or by peaceful
Published on 2010-03-15T17:35:41Z
Indexed on
2010/03/15
17:39 UTC
Read the original article
Hit count: 627
I'm a beginning programmer in the relevant areas to this question, so if possible, it'd be helpful to avoid assuming I know a lot already.
I'm trying to import the OpenLibrary dataset into a local Postgres database. After it's imported, I plan to use it as a starting seed for a Ruby on Rails application that will include information on books.
The OpenLibrary datasets are available here, in a modified JSON format: http://openlibrary.org/dev/docs/jsondump
I only need very basic information for my application, much less than what is provided in the dumps. I'm only trying to get out book titles, author names, and relationships between books and authors.
Below are two typical entries from their dataset, the first for an author, and the second for a book (they seem to have an entry for each edition of a book). The entries seem to lead off with a primary key, and then with a type, before including the actual JSON database dump.
/a/OL2A /type/author {"name": "U. Venkatakrishna Rao", "personal_name": "U. Venkatakrishna Rao", "last_modified": {"type": "/type/datetime", "value": "2008-09-10 08:44:01.978456"}, "key": "/a/OL2A", "birth_date": "1904", "type": {"key": "/type/author"}, "id": 99, "revision": 3}
/b/OL345M /type/edition {"publishers": ["Social Science Research Project, Dept. of Geography, University of Dacca"], "pagination": "ii, 54 p.", "title": "Land use in Fayadabad area", "lccn": ["sa 65000491"], "subject_place": ["East Pakistan", "Dacca region."], "number_of_pages": 54, "languages": [{"comment": "initial import", "code": "eng", "name": "English", "key": "/l/eng"}], "lc_classifications": ["S471.P162 E23"], "publish_date": "1963", "publish_country": "pk ", "key": "/b/OL345M", "authors": [{"birth_date": "1911", "name": "Nafis Ahmad", "key": "/a/OL302A", "personal_name": "Nafis Ahmad"}], "publish_places": ["Dacca, East Pakistan"], "by_statement": "[by] Nafis Ahmad and F. Karim Khan.", "oclc_numbers": ["4671066"], "contributions": ["Khan, Fazle Karim, joint author."], "subjects": ["Land use -- East Pakistan -- Dacca region."]}
The size of the uncompressed dumps are enormous, about 2GB for the authors list, and 18GB for the book editions list. OpenLibrary does not provide any tools for this themselves, they provide a simple unoptimized Python script for reading in sample data (which unlike the actual dumps comes in pure JSON format), but they estimate if that was modified for use on their actual data it would take 2 months (!) to finish loading the data.
How can I read this into the database? I assume I'll need to write a program to do this. What language and any guidance on how I should do it to finish in a reasonable amount of time? The only scripting language I have any experience with is Ruby.
© Stack Overflow or respective owner