Search Results

Search found 1 result on 1 page for 'mastergeek'.

  • hadoop - large database query

    - by Mastergeek
    Situation: I have a Postgres DB containing a table with several million rows, and I'm trying to read all of those rows as input to a MapReduce job. From the research I've done on DBInputFormat, Hadoop might run the same query again for each new mapper. Since these queries take a considerable amount of time, I'd like to prevent that in one of two ways: (1) limit the job to a single mapper that queries the whole table, or (2) somehow incorporate an offset into the query so that if Hadoop does start a new mapper, it won't fetch the same rows. Option (1) seems more promising, but I don't know whether such a configuration is possible. Option (2) sounds nice in theory, but I have no idea how I would keep track of the mappers being created, or whether it is possible to detect a new mapper and adjust the query for it. Any help is appreciated. I'm mainly looking for a way to pull all of the table's data without several copies of the same query running, because that would be a waste of time.
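
    To make option (2) concrete, here is a minimal sketch in plain Java (no Hadoop dependency) of how a known row count could be divided into non-overlapping LIMIT/OFFSET slices, one per mapper. As far as I understand, DBInputFormat divides its input along similar lines, but treat that as an assumption and check it against your Hadoop version. The class name OffsetSplits, the method splitQueries, and the table and column names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetSplits {

    /**
     * Builds one query per mapper. Each query covers a disjoint slice of
     * the result set, so no row is fetched twice. The base query should
     * end with a deterministic ORDER BY, otherwise OFFSET slices are not
     * guaranteed to be stable between executions.
     */
    static List<String> splitQueries(String baseQuery, long rowCount, int mappers) {
        if (mappers < 1) {
            throw new IllegalArgumentException("mappers must be >= 1");
        }
        List<String> queries = new ArrayList<>();
        long chunk = rowCount / mappers;
        for (int i = 0; i < mappers; i++) {
            long offset = i * chunk;
            // The last slice also absorbs the remainder of the division.
            long limit = (i == mappers - 1) ? rowCount - offset : chunk;
            queries.add(baseQuery + " LIMIT " + limit + " OFFSET " + offset);
        }
        return queries;
    }

    public static void main(String[] args) {
        // Hypothetical table and columns; in practice rowCount would come
        // from a SELECT COUNT(*) on the same table.
        String base = "SELECT id, payload FROM big_table ORDER BY id";
        for (String q : splitQueries(base, 10, 3)) {
            System.out.println(q);
        }
    }
}
```

    Passing mappers = 1 yields a single query that covers the whole table, which corresponds to option (1). In the old mapred API, JobConf.setNumMapTasks(1) is the usual way to request a single map task, though whether that request is honored depends on the input format, so verify it for your setup.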
