replication jobs job - Page 64

SQL replication - collecting data

- by Cicik

Hi, I have a master SQL server with DB Central and a lot of satellite SQL servers with DB Client. I need to collect data from the log tables (LogTable) on each Client (each client has its own ID in the log table) into one big table on Central (LogTableCentral). Data must go only from Client to Central. On each Client I want to have only the data for that Client. I need a solution with a minimal amount of work on the client side because of the number of clients. Central is MS SQL Server Enterprise; Clients are MS SQL Server 2005 and 2008. Thanks a lot. EDIT: data can be collected periodically (for example, every day at 01:00).

Read the article

How to find two most distant points?

- by depesz

This is a question that I was asked on a job interview some time ago. And I still can't figure out a sensible answer. Question is: you are given a set of points (x,y). Find the 2 most distant points. Distant from each other. For example, for points: (0,0), (1,1), (-8, 5) - the most distant are: (1,1) and (-8,5) because the distance between them is larger than both (0,0)-(1,1) and (0,0)-(-8,5). The obvious approach is to calculate all distances between all points, and find the maximum. The problem is that it is O(n^2), which makes it prohibitively expensive for large datasets. There is an approach with first tracking points that are on the boundary, and then calculating distances for them, on the premise that there will be fewer points on the boundary than "inside", but it's still expensive, and will fail in the worst case scenario. Tried to search the web, but didn't find any sensible answer - although this might be simply my lack of search skills.
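A sketch of one commonly cited approach, not something taken from the original post: the two most distant points must both lie on the convex hull, so one can build the hull in O(n log n) (Andrew's monotone chain here) and then walk opposite hull vertices with rotating calipers, which is linear in the hull size. The function names are illustrative, and the code assumes Python 3.8+ for math.dist.

```python
from math import dist


def cross(o, a, b):
    """Z component of (a - o) x (b - o); positive means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def farthest_pair(points):
    """Return the two most distant points using rotating calipers on the hull."""
    hull = convex_hull(points)
    n = len(hull)
    if n < 2:
        raise ValueError("need at least two distinct points")
    if n == 2:
        return hull[0], hull[1]
    best = (0.0, hull[0], hull[1])
    j = 1
    for i in range(n):
        nxt = hull[(i + 1) % n]
        # Advance j while the next vertex is farther from edge (hull[i], nxt).
        while abs(cross(hull[i], nxt, hull[(j + 1) % n])) > abs(cross(hull[i], nxt, hull[j])):
            j = (j + 1) % n
        for p in (hull[i], nxt):
            d = dist(p, hull[j])
            if d > best[0]:
                best = (d, p, hull[j])
    return best[1], best[2]
```

The worst case mentioned in the question (every point on the boundary) is still O(n log n) here, because the calipers step never compares all pairs of hull vertices.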

Read the article

Surgical slave reads for Ruby on Rails, multiple databases.

- by Daniel

Greetings, I'm currently working on a multiple database rails application. I want to off load the SELECT queries on to the slave databases for only SOME of the databases or specific models. The issue is that in places, we swap out the current database connection and put in a different one for a short time; to load fixtures or to handle sharding. Does anyone have any recommendations on a ruby gem that 1. will split select/(sql writes) with a considerable amount of control. We want to handle just some models and we are looking for a neat surgical fix. 2. does not monkey around with activerecord. 3. is still being maintained. TIA -daniel

Read the article

rails wiki site - article edit highlighting/strikethrough with htmldiff maxes cpu

- by mark

Hi I'm implementing a wiki style site and want to highlight changes made to articles between successive versions. Using htmldiff to highlight changes works great, except it is rather cpu intensive. I'm using the awesome vestal_versions plugin for versioning. So how best to handle this? I considered having an on_create callback on version creation create a delayed job that processes and then stores the htmldiff processed article (in the version table row). If this is a good approach, how can I extend vestal_versions without touching the gem? Or maybe there would be a better approach. Any advice is much appreciated. :)

Read the article

How to achieve high availability?

- by tanyehzheng

My boss wants a system that accounts for a continent-wide catastrophic event. He wants to have two servers in the US and two servers in Asia (1 login server and 1 worker server on each continent). In the event that an earthquake breaks the connection between the two continents, both should work alone. When the connection is revived, they should sync each other back to normal. External cloud systems are not allowed, as he has no confidence in them. The system should be scalable, which means adding new servers should be easy to configure. The servers should be load balanced. The connection between the servers should be very secure (encrypted and sent over SSL, although SSL already takes care of encryption). The system should let one and only one user log in with a given account (beware of latency between continents: two users sharing an account may reach both login servers at the same time). Please help. I'm already at my wits' end. Thank you in advance.

Read the article

Tracking Hadoop job status via web interface? (Exposing Hadoop to internal clients in the company)

- by Eran Kampf

I want to develop a website that will allow analysts within the company to run Hadoop jobs (choose from a set of defined jobs) and see their job's status\progress. Is there an easy way to do this (get running jobs statuses etc.) via Ruby\Python? How do you expose your Hadoop cluster to internal clients on your company?

Read the article

Search filenames in MySQL database table restricted by filetype?

- by ju

Hello, I have a MySQL database that I replicate from another server. The database contains a table with these columns: ID, FileName and FileSize. The table has more than 4'000'000 records. I want to make a fast search on the FileName (varchar) column, and I found that I can use the Sphinx search engine for this. The problem is that I want to restrict searches by filetype. Do I have to extract file extensions for all rows, and if so, how (triggers?)? Maybe I have to create another table (because this one is replicated) and join them in a 1:1 relation? Can you give me some advice please :)

Read the article

Harvesting Dynamic HTTP Content to produce Replicating HTTP Static Content

- by Neil Pitman

I have a slowly evolving dynamic website served from J2EE. The response time and load capacity of the server are inadequate for client needs. Moreover, ad hoc requests can unexpectedly affect other services running on the same application server/database. I know the reasons and can't address them in the short term. I understand HTTP caching hints (expiry, etags....) and for the purpose of this question, please assume that I have maxed out the opportunities to reduce load. I am thinking of doing a brute force traversal of all URLs in the system to prime a cache and then copying the cache contents to geodispersed cache servers near the clients. I'm thinking of Squid or Apache HTTPD mod_disk_cache. I want to prime one copy and (manually) replicate the cache contents. I don't need a federation or intelligence amongst the slaves. When the data changes, invalidating the cache, I will refresh my master cache and update the slave versions, probably once a night. Has anyone done this? Is it a good idea? Are there other technologies that I should investigate? I can program this, but I would prefer a configuration of open source technologies solution Thanks

Read the article

Are there any frameworks for data subscription and update?

- by Timothy Pratley

There is one server with multiple clients. The clients are viewing subsets of the servers entire data. If the data that a client is viewing changes, the client should be informed of the changes so that it displays the current data. Example: Two clients are viewing a list of users in an administration screen. One client adds a new user to the list and modifies the permissions of another user. The other client sees the changes propagated to their view.

Read the article

How can I force a subscriber to be synchronized from a local snapshot?

- by Brian

Hello, I have a SQL 2005 server replicating (merge\push) to SQL 2005 and SQL 2000 servers. I have multiple subscribers spread throughout the United States. I have set @snapshot_in_defaultfolder = N'false', @alt_snapshot_folder = N'c:\snapshots\Merge\' (sample location). I take the snapshot from the publisher that is in the same location, 'c:\snapshots\Merge\', and copy it to the subscribers. I wanted to avoid applying the snapshot over the WAN, but judging from the performance I am getting, the synchronization is going over the WAN. Does anybody have any ideas how to make sure that I am using the local copy of the snapshot and not the copy at the publisher? Thanks

Read the article

What frameworks exist for data subscription and update?

- by Timothy Pratley

There is one server with multiple clients. The clients are viewing subsets of the servers entire data. If the data that a client is viewing changes, the client should be informed of the changes so that it displays the current data. Example: Two clients are viewing a list of users in an administration screen. One client adds a new user to the list and modifies the permissions of another user. The other client sees the changes propagated to their view. In the client side code I would like the users list to be updated by the framework itself, raising changed events such that it will be redrawn - similar to 'cells' or dataflow. I am looking specifically for a .NET or java implementation.

Read the article

Git push from post-receive

- by meka

I have two servers, let's call them first and second. First one is where the real development is done, and second one should be the replica. What I would like to do is put "git push" in post-receive, but there is one problem. Post-receive is executed as the user doing git push to first server, so I can't chmod 600 ssh key with no pass. What is the best practice for this? Thanx!

Read the article

New replicaset resident memory is larger than the existing sets

- by eded

Following the mongodb tutorial on how to resync a set, I wiped all the files in /data/db and restarted the mongod process to resync the data. Everything looks ok, I get the same number of documents as the existing two sets (primary and one secondary). However, when I check the memory on MMS, it shows me my new resynced set/mongod process has a different memory status value than the other two. For the existing two, db.serverStatus().mem shows the following: "mem" : { "bits" : 64, "resident" : 239, "virtual" : 66348, "supported" : true, "mapped" : 32865, "mappedWithJournal" : 65730 } However, the new resynced set shows: "mem" : { "bits" : 64, "resident" : 1239, "virtual" : 52447, "supported" : true, "mapped" : 25700, "mappedWithJournal" : 51400 } The resynced resident memory is 6-10 times more than the existing ones. I wonder if it is normal because all the data comes in suddenly during the resync? And even the virtual and mapped values are different too. Can anyone explain? Thanks

Read the article

Jenkins to not allow the same job to run concurrently on the same node?

- by Marek Gimza

I have 4 nodes and 2 jobs. Any node can run 2 jobs concurrently and any job can be executed concurrently. I want to be able to restrict running the same job concurrently on the same machine. For example: Jobs: J1 and J2 nodes: N1,N2,N3 and N4 I can run J1 and J2 on the same node at the same time. I can run J1 on N1 and N3 at the same time. BUT I do not want to run J1 and another build of J1 on the same node at the same time. I have tried "Locks and Latches", "Jenkins Exclusive Execution", "Exclusion Plugin" plugins, and these will work well when trying to coordinate different jobs. But my case is trying to manage different build-instances of the same job.

Read the article

Would you tell your prospective boss your SO username?

- by Sebi

Today I met a friend who is also using stackoverflow. He had a job interview today at a small business, and during the interview the prospective boss asked him how he makes sure that he's always up-to-date on technical questions and what he does to find a solution for a problem he can't solve on his own. Besides some magazines, journals, books and blogs my friend also mentioned stackoverflow. The prospective boss seemed very interested in that and asked him if he could tell him his username. It appears that was the most difficult question of the whole interview ;) Would you tell your prospective boss your username? On the pro side, one can mention that the boss sees that you're very involved in your business and community, but on the other hand it is a really private thing and you can't post anymore in threads like "what was the worst working environment?" My friend circumnavigated this question with a rather lame answer (more or less: I use autologin, that's why I have to check the username later at home, I'll maybe send you an email)

Read the article

How do you know when to change jobs? [closed]

- by dustyprogrammer

Possible Duplicate: When do you know it's time to move on from your current job? I have been working for a couple of years now. I just want to know what people think about leaving one company for another, or starting to look around for other positions. I tend to use people's resumes as a guideline for when to change from one company to another. I am approaching the time in my life where most of the people I look to move away from their first position to pursue others. I know it isn't good to base my decisions on what others do. I was wondering when it is time to move companies. I am currently happy in my position, and I am learning tons. It's just something I have been seeing a lot, and I would like to get a feel for what people think. Thanks.

Read the article

What does N years of experience with a language really mean?

- by marcgg

I've been looking at jobs descriptions since I'm graduating soon and looking for a job and what's always coming back - I'm not teaching you anything - is the "N years of experience in this language". It has been discussed in this question that if you work professionally with let's say Ruby for 2 years, but during these two years you also did some C# and PHP and were actually coding in Ruby 50% of the time. Do you say you have 1 year of experience in Ruby? 2 years? Another issue that hasn't been reviewed in the other post is for "non-professional experience". I'll give you a personal example: I've been working with Ruby on Rails since 2004 while at school. I did a lot of personal projects and school projects using this technology. I also used Rails in 2 6-month internships. Do I have 5 years of Rails experience (2004-now)? Do I have 1 year(2 internships)? Do I have nothing? I feel like I don't deserve the credit for 5 years, because the first years I wasn't working a lot with rails, but since last year I launched some websites and invested myself a lot in this technology and just saying 1 year doesn't really reflect how much I know the technology... Another example: I Learned C++ at school and did 1 big project with it (2-3 month of work and a semester of classes). I never used it in a company but I'd be able to be productive fairly quickly if I had to work on a C++ project and I have a good grasp of the concepts. Do I have no experience? 3 months? 6 months? ... something else? What I'm really trying to do is to find a way to present my skill set in a way that is compliant to what recruiters expect. I also don't want to end up at an interview that would go something like this... Recruiter (finding out the horrible truth): Oh but you said that you had 2 years of experience with this when you have none! / slaps me in the face / Me (in pain): Oh! The irony! Recruiter (yelling): Get out of my office / calls security, punches me in the throat /

Read the article

What's the advantage of using a bash script for cron jobs?

- by AlxVallejo

From my understanding you can write your crons by editing crontab -e. I've found several sources that instead refer to a bash script in the cron job, rather than writing a job line for line. Is the only benefit that you can consolidate many tasks into one cron job using a bash script? Additional question for a newbie: editing crontab -e refers to one file, correct? I've noticed that if I open crontab -e and close without editing, when I open the file again there is a different numerical extension such as: "/tmp/crontab.XXXXk1DEaM" 0L, 0C. I thought the crontab is stored in /var/spool/cron or /etc/crontab? Why would it store the cron in the tmp folder?

Read the article

Mgmt wants to re-title my position: Any help...? [closed]

- by JohnFlyTN

Management here wants to re-title my position, since I'm doing quite a bit of different work than was originally planned. They want my input. After a quick glance over my skill set and job duties, what would we need to describe this position as? I'll just list things I'm at least proficient in, I will not list things I have a passing knowledge of. About me : ~10 years software development. Languages : C, C++, Perl, PHP, C#, TCL, Unix shell scripting, SQL (TSQL, PLSQL) Systems : MS-Dos, Windows 3.1 to 7 for client, NT 4 to 2008 for server, OS/2, IBM MVS & z/OS, Linux ( multiple distros), AIX Current position: I do all sorts of in-house software. The range is single user apps to large systems spanning multiple OS's. One of the larger projects I've designed and coded is about 100k lines of C#, and a database where I have been the sole designer and maintainer. I have near total freedom to design as I see fit, restraints are usually budgetary. Skills required to replace me in my current role: Windows and Unix admin, Database design, .NET up to 3.5 (C#, ASP.NET), C++, Perl, good skills in designing large and efficient data processing systems. Given this small level of information what would you see this as being titled? (is more information required to render a decision?)

Read the article

Why learning new things is not important on a job hunt? [closed]

- by IAdapter

I have just finished my job hunt. I think it was about 40 job interviews; I like to travel and get to know many companies. One thing I did not like is that they don't care about new technologies. I think only 2 people asked me about new stuff in the Java world. Most of them care whether I know Java (certification and many years of experience are not enough for them, they need to test me). For example, at IBM they only cared which IBM products I know. Have I ever used any custom extensions of WebSphere? I don't understand those questions. If I learn new frameworks every day, then I can learn whatever technology they have very fast. So why does it matter if I have ever used those "great" custom extensions of WebSphere? After those 40 interviews I have no reason to learn any new framework, because I see that they don't care. Why don't those "developers" ask questions about new technologies? Have they been at those companies so long that they don't care about new stuff?

Read the article

How does a CS student negotiate in/after a job interview?

- by Billy ONeal

Alright, I've gotten to the second step in the interview process. At this point I'm working under the assumption that I might be offered a position -- flying my butt to Redmond would be quite an expense if they weren't at least considering me for something (*crosses fingers*). So, if one is offered a position, how should a CS student negotiate? I've heard a few strategies about dealing with software companies when you are being considered for a hire, but most of them are considering the developer in a powerful position. In such examinations, (s)he has lots of job experience, and may even be overqualified for what the employer is looking for. (s)he is part of a small job market of qualified developers, because 99% of applications companies receive are from those who are woefully under qualified. I'm in a completely different position. I think I compare favorably to most of my fellow students, and I have been a programmer for almost 10 years, but often I still feel green compared to most of my coworkers. I'm in a position where the employer holds most of the chips; they'd be doing me quite a favor by hiring me. I think this scenario is considerably different than the targets for most of the advice I've seen. Above all, I don't want to be such a prick negotiating that it damages my chances to actually operate in a position, even if it means not negotiating at all. How should one approach a scenario like this? P.S. If this is off topic feel free to close it -- I think it's borderline and I'm of the opinion that it's better to ask and be closed than not ask at all ;)

Read the article

Will an online degree get you a job that requires "CS or equivalent 4-year degree"? [on hold]

- by qel

I'm a nerdy slacker type who didn't get my life together till I was 30. I've had a real job for a couple years doing C#/SQL. I've gotten several raises, but I'm making less than most developers, and the atmosphere is ... not positive. Looking for a new job, I think my applications get thrown out because I don't have a degree. And I want to finish a Bachelor's just to feel like less of a loser. I have a lot of college credits from 1996-2003 and a low GPA, so I don't know if that's worth much. An online degree looks like a good option, but I just don't know what I should be looking at for online schools because they all look like fake degrees. If they had programs equivalent to a real Comp Sci degree, I don't think they would have weird sounding names like they do. University of Phoenix has a B.S./Information Technology-Software Engineering. DeVry has a B.S./Computer Engineering Technology program. But that's not CS, and most other things I see have even more fake-sounding names. Are these useless degrees? Some people say DeVry and UoP are acceptable, some people say they're a joke. I have enough experience now, though, that maybe all I'm missing is being able to check the box that I have a 4-year degree. Harvard Extension seems like a real degree, even if it isn't a real Harvard degree, but I'd have to live there at least 3 months, which kinda defeats the purpose of an online degree fitting around work.

Read the article

Is there a website that scrapes job postings to determine the popularity of web technologies? [closed]

- by dB'

I'm often in a position where I need to choose between a number of web technologies. These technologies might be programming languages, or web application frameworks, or types of databases, or some other kind of toolkit used by programmers. More often than not, after doing some research, I end up with a list of contenders that are all equally viable. They're all powerful enough to solve my problem, they're all popular and well supported, and they're all equally familiar/unfamiliar to me. There's no obvious rationale by which to choose between them. Still, I need to pick one, so at this point I usually ask myself a hypothetical question: which one of these technologies, if I invest in learning it, would be most helpful to me in a job search? Where can I go on the internet to answer this question? Is there a website/service that scrapes the texts of worldwide job postings and would allow me to compare, say, the number of employers looking for expertise in technology x vs. technology y? (Where x and y are Rails vs. Django, Java vs. Python, Brainfuck vs. LOLCode, etc.)

Read the article

How do I tell my parents that landing a job is what actually counts?

- by shovonr

On one side, I just want to get a degree with a 3.0 GPA. On the other side, my parents want more than just a 3. Now here's the thing. I program with a passion. I spend day and night programming. And I ace all my programming courses. However, I do terrible on all my elective courses -- such as writing, history, and all that stuff -- which only leaves me with a 3.1 to 3.2 GPA. And my parents want more. They think that university is like high school, where you need super-stellar grades to get to the next level. But they don't realize that good enough grades will land me a job. And they don't realize that a programmer needs to practice to become good at programming, and that having good skills is what will land a job in a nice software development company. Thankfully, though, they don't threaten to beat me with a baseball bat or anything like that. They just occasionally give me the little "tsk-tsk". But even that little "tsk-tsk" makes me feel guilty for opening up an IDE. And on top of that, I procrastinate because of that feeling of guilt. So now, I want to come clean with them. I want to know what's a good way to do that. [Edit] OK, so now, I realized, I should aim for higher grades, as some have suggested below.

Read the article

Building Simple Workflows in Oozie

- by dan.mcclary

Introduction

More often than not, data doesn't come packaged exactly as we'd like it for analysis. Transformation, match-merge operations, and a host of data munging tasks are usually needed before we can extract insights from our Big Data sources. Few people find data munging exciting, but it has to be done. Once we've suffered that boredom, we should take steps to automate the process. We want to codify our work into repeatable units and create workflows which we can leverage over and over again without having to write new code. In this article, we'll look at how to use Oozie to create a workflow for the parallel machine learning task I described on Cloudera's site.

Hive Actions: Prepping for Pig

In my parallel machine learning article, I use data from the National Climatic Data Center to build weather models on a state-by-state basis. NCDC makes the data freely available as gzipped files of day-over-day observations stretching from the 1930s to today. In reading that post, one might get the impression that the data came in handy, ready-to-model files with convenient delimiters. The truth of it is that I need to perform some parsing and projection on the dataset before it can be modeled. If I get more observations, I'll want to retrain and test those models, which will require more parsing and projection. This is a good opportunity to start building up a workflow with Oozie.

I store the data from the NCDC in HDFS and create an external Hive table partitioned by year. This gives me the flexibility of Hive's query language when I want it, but lets me put the dataset in a directory of my choosing in case I want to treat the same data with Pig or MapReduce code.

    CREATE EXTERNAL TABLE IF NOT EXISTS historic_weather(column 1, column2)
    PARTITIONED BY (yr string)
    STORED AS ...
    LOCATION '/user/oracle/weather/historic';

As new weather data comes in from NCDC, I'll need to add partitions to my table. That's an action I should put in the workflow.

Similarly, the weather data requires parsing in order to be useful as a set of columns. Because of their long history, the weather data is broken up into fields of specific byte lengths: x bytes for the station ID, y bytes for the dew point, and so on. The delimiting is consistent from year to year, so writing a SerDe or a parser for transformation is simple. Once that's done, I want to select columns on which to train, classify certain features, and place the training data in an HDFS directory for my Pig script to access.

    ALTER TABLE historic_weather ADD IF NOT EXISTS
    PARTITION (yr='2010')
    LOCATION '/user/oracle/weather/historic/yr=2010';

    INSERT OVERWRITE DIRECTORY '/user/oracle/weather/cleaned_history'
    SELECT w.stn, w.wban, w.weather_year, w.weather_month, w.weather_day, w.temp, w.dewp, w.weather
    FROM (
      FROM historic_weather
      SELECT TRANSFORM(...)
      USING '/path/to/hive/filters/ncdc_parser.py'
      as stn, wban, weather_year, weather_month, weather_day, temp, dewp, weather
    ) w;

Since I'm going to prepare training directories with at least the same frequency that I add partitions, I should also add that to my workflow. Oozie is going to invoke these Hive statements using what's somewhat obviously referred to as a Hive action. Hive actions amount to Oozie running a script file containing our query language statements, so we can place them in a file called ncdc_parse.hql.

Starting Our Workflow

Oozie offers two types of jobs: workflows and coordinator jobs. Workflows are straightforward: they define a set of actions to perform as a sequence or directed acyclic graph. Coordinator jobs can take all the same actions as workflow jobs, but they can be automatically started either periodically or when new data arrives in a specified location. To keep things simple we'll make a workflow job; coordinator jobs simply require another XML file for scheduling.
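
The fixed-width parsing step described above (the script handed to Hive's TRANSFORM clause) could be sketched roughly as follows. Hive streams rows to the script on stdin and reads tab-separated columns back from stdout. The article does not give the actual byte widths, so the offsets in FIELDS are purely hypothetical placeholders, and this is not the author's ncdc_parser.py.

```python
import sys

# HYPOTHETICAL layout: (column name, start offset, end offset).
# The real NCDC byte widths are not given in the article; replace these.
FIELDS = [
    ("stn", 0, 6),
    ("wban", 6, 11),
    ("weather_year", 11, 15),
    ("weather_month", 15, 17),
    ("weather_day", 17, 19),
    ("temp", 19, 25),
    ("dewp", 25, 31),
    ("weather", 31, 37),
]


def parse_line(line):
    """Slice one fixed-width record into a list of stripped column values."""
    return [line[start:end].strip() for _, start, end in FIELDS]


def main(stream_in=sys.stdin, stream_out=sys.stdout):
    """Read fixed-width rows and emit tab-separated rows, as TRANSFORM expects."""
    for raw in stream_in:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines rather than emitting empty rows
        stream_out.write("\t".join(parse_line(line)) + "\n")

# When used as a TRANSFORM script, the file would end with a call to main().
```

Keeping the layout in one table means a change in field widths only touches FIELDS, and the column order matches the alias list in the query.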

The bare minimum for workflow XML defines a name, a starting point, and an end point:

    <workflow-app name="WeatherMan" xmlns="uri:oozie:workflow:0.1">
      <start to="ParseNCDCData"/>
      <end name="end"/>
    </workflow-app>

To this we need to add an action, and within that we'll specify the Hive parameters. Also, keep in mind that actions require <ok> and <error> tags to direct the next action on success or failure.

    <action name="ParseNCDCData">
      <hive xmlns="uri:oozie:hive-action:0.2">
        <job-tracker>localhost:8021</job-tracker>
        <name-node>localhost:8020</name-node>
        <configuration>
          <property>
            <name>oozie.hive.defaults</name>
            <value>/user/oracle/weather_ooze/hive-default.xml</value>
          </property>
        </configuration>
        <script>ncdc_parse.hql</script>
      </hive>
      <ok to="WeatherMan"/>
      <error to="end"/>
    </action>

There are a couple of things to note here:

- I have to give the FQDN (or IP) and port of my JobTracker and NameNode.
- I have to include a hive-default.xml file.
- I have to include a script file.
- The hive-default.xml and script file must be stored in HDFS.

That last point is particularly important. Oozie doesn't make assumptions about where a given workflow is being run. You might submit workflows against different clusters, or have a different hive-default.xml on different clusters (e.g. MySQL or Postgres-backed metastores). A quick way to ensure that all the assets end up in the right place in HDFS is just to make a working directory locally, build your workflow.xml in it, and copy the assets you'll need to it as you add actions to workflow.xml. At this point, our local directory should contain:

- workflow.xml
- hive-default.xml (make sure this file contains your metastore connection data)
- ncdc_parse.hql

Adding Pig to the Ooze

Adding our Pig script as an action is slightly simpler from an XML standpoint.

All we do is add an action to workflow.xml as follows:

    <action name="WeatherMan">
      <pig>
        <job-tracker>localhost:8021</job-tracker>
        <name-node>localhost:8020</name-node>
        <script>weather_train.pig</script>
      </pig>
      <ok to="end"/>
      <error to="end"/>
    </action>

Once we've done this, we'll copy weather_train.pig to our working directory. However, there's a bit of a "gotcha" here. My pig script registers the Weka Jar and a chunk of jython. If those aren't also in HDFS, our action will fail from the outset -- but where do we put them? The Jython script goes into the working directory at the same level as the pig script, because pig attempts to load Jython files in the directory from which the script executes. However, that's not where our Weka jar goes. While Oozie doesn't assume much, it does make an assumption about the Pig classpath. Anything under working_directory/lib gets automatically added to the Pig classpath and no longer requires a REGISTER statement in the script. Anything that uses a REGISTER statement cannot be in the working_directory/lib directory. Instead, it needs to be in a different HDFS directory and attached to the pig action with an <archive> tag. Yes, that's as confusing as you think it is. You can get the exact rules for adding Jars to the distributed cache from Oozie's Pig Cookbook.

Making the Workflow Work

We've got a workflow defined and have collected all the components we'll need to run. But we can't run anything yet, because we still have to define some properties about the job and submit it to Oozie. We need to start with the job properties, as this is essentially the "request" we'll submit to the Oozie server.

In the same working directory, we'll make a file called job.properties as follows:

    nameNode=hdfs://localhost:8020
    jobTracker=localhost:8021
    queueName=default
    weatherRoot=weather_ooze
    mapreduce.jobtracker.kerberos.principal=foo
    dfs.namenode.kerberos.principal=foo
    oozie.libpath=${nameNode}/user/oozie/share/lib
    oozie.wf.application.path=${nameNode}/user/${user.name}/${weatherRoot}
    outputDir=weather-ooze

While some of the pieces of the properties file are familiar (e.g., JobTracker address), others take a bit of explaining. The first is weatherRoot: this is essentially an environment variable for the script (as are jobTracker and queueName). We're simply using them to simplify the directives for the Oozie job. The oozie.libpath piece is extremely important. This is a directory in HDFS which holds Oozie's shared libraries: a collection of Jars necessary for invoking Hive, Pig, and other actions. It's a good idea to make sure this has been installed and copied up to HDFS. The last two lines are straightforward: run the application defined by workflow.xml at the application path listed and write the output to the output directory.

We're finally ready to submit our job! After all that work we only need to do a few more things:

1. Validate our workflow.xml
2. Copy our working directory to HDFS
3. Submit our job to the Oozie server
4. Run our workflow

Let's do them in order. First validate the workflow:

    oozie validate workflow.xml

Next, copy the working directory up to HDFS:

    hadoop fs -put working_dir /user/oracle/working_dir

Now we submit the job to the Oozie server. We need to ensure that we've got the correct URL for the Oozie server, and we need to specify our job.properties file as an argument.

    oozie job -oozie http://url.to.oozie.server:port_number/ -config /path/to/working_dir/job.properties -submit

We've submitted the job, but we don't see any activity on the JobTracker?

All I got was this funny bit of output:

    14-20120525161321-oozie-oracle

This is because submitting a job to Oozie creates an entry for the job and places it in PREP status. What we got back, in essence, is a ticket for our workflow to ride the Oozie train. We're responsible for redeeming our ticket and running the job.

    oozie job -oozie http://url.to.oozie.server:port_number/ -start 14-20120525161321-oozie-oracle

Of course, if we really want to run the job from the outset, we can change the "-submit" argument above to "-run". This will prep and run the workflow immediately.

Takeaway

So, there you have it: the somewhat laborious process of building an Oozie workflow. It's a bit tedious the first time out, but it does present a pair of real benefits to those of us who spend a great deal of time data munging. First, when new data arrives that requires the same processing, we already have the workflow defined and ready to run. Second, as we build up a set of useful action definitions over time, creating new workflows becomes quicker and quicker.
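
As a complement to oozie validate, a small local sanity check can catch a workflow whose transitions point at a node that was never defined (easy to do when renaming actions). The sketch below uses only Python's standard library. The embedded XML is an abbreviated copy of the workflow from this article, and the helper name dangling_transitions is an illustrative choice, not part of Oozie.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the article's workflow (job-tracker, name-node and
# configuration elements are omitted to keep the sample short).
WORKFLOW_XML = """<workflow-app name="WeatherMan" xmlns="uri:oozie:workflow:0.1">
  <start to="ParseNCDCData"/>
  <action name="ParseNCDCData">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <script>ncdc_parse.hql</script>
    </hive>
    <ok to="WeatherMan"/>
    <error to="end"/>
  </action>
  <action name="WeatherMan">
    <pig>
      <script>weather_train.pig</script>
    </pig>
    <ok to="end"/>
    <error to="end"/>
  </action>
  <end name="end"/>
</workflow-app>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def dangling_transitions(xml_text):
    """Return (tag, target) pairs whose 'to' attribute names no defined node."""
    root = ET.fromstring(xml_text)
    # Named direct children of <workflow-app> are the workflow's nodes.
    defined = {child.get("name") for child in root if child.get("name")}
    missing = []
    for elem in root.iter():
        target = elem.get("to")
        if target is not None and target not in defined:
            missing.append((local_name(elem.tag), target))
    return missing


print(dangling_transitions(WORKFLOW_XML))
```

This only checks that transition targets exist; schema validation and everything cluster-related are still left to the oozie command line.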

Search Results

Search found 9154 results on 367 pages for 'replication jobs job'.
