hadoop - large database query

Posted by Mastergeek on Stack Overflow
Situation: I have a Postgres DB that contains a table with several million rows, and I'm trying to query all of those rows for a MapReduce job.

From the research I've done on DBInputFormat, it looks like Hadoop might try to run the same query again for each new mapper. Since these queries take a considerable amount of time, I'd like to prevent this in one of two ways I've thought of:

1) Limit the job to only run 1 mapper that queries the whole table and call it 
   good.

or

2) Somehow incorporate an offset in the query so that if Hadoop does try to use
   a new mapper it won't grab the same stuff.

Option (1) seems more promising, but I don't know if such a configuration is possible. Option (2) sounds nice in theory, but I have no idea how I would keep track of the mappers being created, or whether it's even possible to detect that and reconfigure the query.
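To make option (1) concrete, here is a rough sketch of the kind of job configuration I have in mind (the connection URL, credentials, table name, columns, and the MyRecord class are all placeholders; the relevant idea is that DBInputFormat sizes its splits from the configured number of map tasks, so forcing that to 1 should mean a single mapper running a single query):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;

    public class SingleMapperJob {

        // Placeholder record type; DBInputFormat values must implement DBWritable.
        public static class MyRecord implements Writable, DBWritable {
            long id;
            String payload;

            public void readFields(ResultSet rs) throws SQLException {
                id = rs.getLong("id");
                payload = rs.getString("payload");
            }
            public void write(PreparedStatement ps) throws SQLException {
                ps.setLong(1, id);
                ps.setString(2, payload);
            }
            public void readFields(DataInput in) throws IOException {
                id = in.readLong();
                payload = in.readUTF();
            }
            public void write(DataOutput out) throws IOException {
                out.writeLong(id);
                out.writeUTF(payload);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Point DBInputFormat at the Postgres instance (placeholder URL/credentials).
            DBConfiguration.configureDB(conf,
                "org.postgresql.Driver",
                "jdbc:postgresql://localhost:5432/mydb",
                "dbuser", "dbpassword");

            // DBInputFormat derives its split count from the requested number of
            // map tasks, so asking for exactly one should leave a single mapper
            // issuing a single query. Older releases read "mapred.map.tasks" and
            // newer ones "mapreduce.job.maps", so set both to be safe.
            conf.setInt("mapred.map.tasks", 1);
            conf.setInt("mapreduce.job.maps", 1);

            Job job = new Job(conf, "full-table-scan");
            job.setJarByClass(SingleMapperJob.class);
            job.setInputFormatClass(DBInputFormat.class);

            // Table and column names are placeholders.
            DBInputFormat.setInput(job, MyRecord.class,
                "my_table",        // table name
                null,              // WHERE conditions
                "id",              // ORDER BY column
                "id", "payload");  // fields to select

            // ... set mapper, reducer, and output classes as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

(I'm assuming here that limiting the map task count is the supported way to get one split out of DBInputFormat; if there's a cleaner knob for this, that's exactly the kind of thing I'm hoping to learn.)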

Help is appreciated. Mainly, I'm looking for a way to pull all of the table's data without having several copies of the same query running, because that would be a waste of time.
