hadoop - large database query

Posted by Mastergeek on Stack Overflow
Situation: I have a Postgres DB that contains a table with several million rows, and I'm trying to query all of those rows for a MapReduce job.

From the research I've done on DBInputFormat, it looks like Hadoop might try to run the same query again for each new mapper. Since these queries take a considerable amount of time, I'd like to prevent this in one of two ways I've thought of:

1) Limit the job to only run 1 mapper that queries the whole table and call it 
   good.

or

2) Somehow incorporate an offset in the query so that if Hadoop does try to use
   a new mapper it won't grab the same stuff.

Option (1) seems more promising, but I don't know if such a configuration is possible. Option (2) sounds nice in theory, but I have no idea how I would keep track of the mappers being created, or whether it's even possible to detect that and reconfigure the query.
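To make option (1) concrete, here is a rough sketch of the kind of job configuration I have in mind (the connection URL, credentials, table name, columns, and the MyRecord class are all placeholders; the relevant idea is that DBInputFormat sizes its splits from the configured number of map tasks, so forcing that to 1 should mean a single mapper running a single query):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;

    public class SingleMapperJob {

        // Placeholder record type; DBInputFormat values must implement DBWritable.
        public static class MyRecord implements Writable, DBWritable {
            long id;
            String payload;

            public void readFields(ResultSet rs) throws SQLException {
                id = rs.getLong("id");
                payload = rs.getString("payload");
            }
            public void write(PreparedStatement ps) throws SQLException {
                ps.setLong(1, id);
                ps.setString(2, payload);
            }
            public void readFields(DataInput in) throws IOException {
                id = in.readLong();
                payload = in.readUTF();
            }
            public void write(DataOutput out) throws IOException {
                out.writeLong(id);
                out.writeUTF(payload);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Point DBInputFormat at the Postgres instance (placeholder URL/credentials).
            DBConfiguration.configureDB(conf,
                "org.postgresql.Driver",
                "jdbc:postgresql://localhost:5432/mydb",
                "dbuser", "dbpassword");

            // DBInputFormat derives its split count from the requested number of
            // map tasks, so asking for exactly one should leave a single mapper
            // issuing a single query. Older releases read "mapred.map.tasks" and
            // newer ones "mapreduce.job.maps", so set both to be safe.
            conf.setInt("mapred.map.tasks", 1);
            conf.setInt("mapreduce.job.maps", 1);

            Job job = new Job(conf, "full-table-scan");
            job.setJarByClass(SingleMapperJob.class);
            job.setInputFormatClass(DBInputFormat.class);

            // Table and column names are placeholders.
            DBInputFormat.setInput(job, MyRecord.class,
                "my_table",        // table name
                null,              // WHERE conditions
                "id",              // ORDER BY column
                "id", "payload");  // fields to select

            // ... set mapper, reducer, and output classes as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

(I'm assuming here that limiting the map task count is the supported way to get one split out of DBInputFormat; if there's a cleaner knob for this, that's exactly the kind of thing I'm hoping to learn.)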

Help is appreciated. Mainly, I'm looking for a way to pull all of the table's data without having several copies of the same query running, because that would be a waste of time.
