Search Results

Search found 28 results on 2 pages for 'zookeeper'.

Page 2/2 | < Previous Page | 1 2 

  • Big Data – Interacting with Hadoop – What is PIG? – What is PIG Latin? – Day 16 of 21

    - by Pinal Dave
    In yesterday’s blog post we learned the importance of the HIVE in Big Data Story. In this article we will understand what is PIG and PIG Latin in Big Data Story. Yahoo started working on Pig for their application deployment on Hadoop. The goal of Yahoo to manage their unstructured data. What is Pig and What is Pig Latin? Pig is a high level platform for creating MapReduce programs used with Hadoop and the language we use for this platform is called PIG Latin. The pig was designed to make Hadoop more user-friendly and approachable by power-users and nondevelopers. PIG is an interactive execution environment supporting Pig Latin language. The language Pig Latin has supported loading and processing of input data with series of transforming to produce desired results. PIG has two different execution environments 1) Local Mode – In this case all the scripts run on a single machine. 2) Hadoop – In this case all the scripts run on Hadoop Cluster. Pig Latin vs SQL Pig essentially creates set of map and reduce jobs under the hoods. Due to same users does not have to now write, compile and build solution for Big Data. The pig is very similar to SQL in many ways. The Ping Latin language provide an abstraction layer over the data. It focuses on the data and not the structure under the hood. Pig Latin is a very powerful language and it can do various operations like loading and storing data, streaming data, filtering data as well various data operations related to strings. The major difference between SQL and Pig Latin is that PIG is procedural and SQL is declarative. In simpler words, Pig Latin is very similar to SQ Lexecution plan and that makes it much easier for programmers to build various processes. Whereas SQL handles trees naturally, Pig Latin follows directed acyclic graph (DAG). DAGs is used to model several different kinds of structures in mathematics and computer science. DAG Tomorrow In tomorrow’s blog post we will discuss about very important components of the Big Data Ecosystem – Zookeeper. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

    Read the article

  • Push-Based Events in a Services Oriented Architecture

    - by Colin Morelli
    I have come to a point, in building a services oriented architecture (on top of Thrift), that I need to expose events and allow listeners. My initial thought was, "create an EventService" to handle publishing and subscribing to events. That EventService can use whatever implementation it desires to actually distribute the events. My client automatically round-robins service requests to available service hosts which are determined using Zookeeper-based service discovery. So, I'd probably use JMS inside of EventService mainly for the purpose of persisting messages (in the event that a service host for EventService goes down before it can distribute the message to all of the available listeners). When I started considering this, I began looking into the differences between Queues and Topics. Topics unfortunately won't work for me, because (at least for now), all listeners must receive the message (even if they were down at the time the event was pushed, or hadn't made a subscription yet because they haven't completed startup (during deployment, for example) - messages should be queued until the service is available). However, I don't want EventService to be responsible for handling all of the events. I don't think it should have the code to react to events inside of it. Each of the services should do what it needs with a given event. This would indicate that each service would need a JMS connection, which questions the value of having EventService at all (as the services could individually publish and subscribe to JMS directly). However, it also couples all of the services to JMS (when I'd rather that there be a single service that's responsible for determining how to distribute events). What I had thought was to publish an event to EventService, which pulls a configuration of listeners from some configuration source (database, flat file, irrelevant for now). It replicates the message and pushes each one back into a queue with information specific to that listener (so, if there are 3 listeners, 1 event would become 3 events in JMS). Then, another thread in EventService (which is replicated, running on multiple hots) would be pulling from the queue, attempting to make the service call to the "listener", and returning the message to the queue (if the service is down), or discarding the message (if the listener completed successfully). tl;dr If I have an EventService that is responsible for receiving events and delegating service calls to "event listeners," (which are really just endpoints on other services), how should it know how to craft the service call? Should I create a generic "Event" object that is shared among all services? Then, the EventService can just construct this object and pass it to the service call. Or is there a better answer to this problem entirely?

    Read the article

  • YouTube Scalability Lessons

    - by Bertrand Matthelié
    @font-face { font-family: "Arial"; }@font-face { font-family: "Courier New"; }@font-face { font-family: "Wingdings"; }@font-face { font-family: "Calibri"; }@font-face { font-family: "Cambria"; }p.MsoNormal, li.MsoNormal, div.MsoNormal { margin: 0cm 0cm 0.0001pt; font-size: 12pt; font-family: "Times New Roman"; }h2 { margin: 12pt 0cm 3pt; page-break-after: avoid; font-size: 14pt; font-family: "Times New Roman"; font-style: italic; }a:link, span.MsoHyperlink { color: blue; text-decoration: underline; }a:visited, span.MsoHyperlinkFollowed { color: purple; text-decoration: underline; }span.Heading2Char { font-family: Calibri; font-weight: bold; font-style: italic; }div.Section1 { page: Section1; }ol { margin-bottom: 0cm; }ul { margin-bottom: 0cm; } Very interesting blog post by Todd Hoff at highscalability.com presenting “7 Years of YouTube Scalability Lessons in 30 min” based on a presentation from Mike Solomon, one of the original engineers at YouTube: …. The key takeaway away of the talk for me was doing a lot with really simple tools. While many teams are moving on to more complex ecosystems, YouTube really does keep it simple. They program primarily in Python, use MySQL as their database, they’ve stuck with Apache, and even new features for such a massive site start as a very simple Python program. That doesn’t mean YouTube doesn’t do cool stuff, they do, but what makes everything work together is more a philosophy or a way of doing things than technological hocus pocus. What made YouTube into one of the world’s largest websites? Read on and see... Stats @font-face { font-family: "Arial"; }@font-face { font-family: "Cambria"; }p.MsoNormal, li.MsoNormal, div.MsoNormal { margin: 0cm 0cm 0.0001pt; font-size: 12pt; font-family: "Times New Roman"; }div.Section1 { page: Section1; } 4 billion Views a day 60 hours of video is uploaded every minute 350+ million devices are YouTube enabled Revenue double in 2010 The number of videos has gone up 9 orders of magnitude and the number of developers has only gone up two orders of magnitude. 1 million lines of Python code Stack @font-face { font-family: "Arial"; }@font-face { font-family: "Cambria"; }p.MsoNormal, li.MsoNormal, div.MsoNormal { margin: 0cm 0cm 0.0001pt; font-size: 12pt; font-family: "Times New Roman"; }div.Section1 { page: Section1; } Python - most of the lines of code for YouTube are still in Python. Everytime you watch a YouTube video you are executing a bunch of Python code. Apache - when you think you need to get rid of it, you don’t. Apache is a real rockstar technology at YouTube because they keep it simple. Every request goes through Apache. Linux - the benefit of Linux is there’s always a way to get in and see how your system is behaving. No matter how bad your app is behaving, you can take a look at it with Linux tools like strace and tcpdump. MySQL - is used a lot. When you watch a video you are getting data from MySQL. Sometime it’s used a relational database or a blob store. It’s about tuning and making choices about how you organize your data. Vitess- a  new project released by YouTube, written in Go, it’s a frontend to MySQL. It does a lot of optimization on the fly, it rewrites queries and acts as a proxy. Currently it serves every YouTube database request. It’s RPC based. Zookeeper - a distributed lock server. It’s used for configuration. Really interesting piece of technology. Hard to use correctly so read the manual Wiseguy - a CGI servlet container. Spitfire - a templating system. It has an abstract syntax tree that let’s them do transformations to make things go faster. Serialization formats - no matter which one you use, they are all expensive. Measure. Don’t use pickle. Not a good choice. Found protocol buffers slow. They wrote their own BSON implementation, which is 10-15 time faster than the one you can download. ...Contiues. Read the blog Watch the video

    Read the article

< Previous Page | 1 2