Search Results

Search found 1914 results on 77 pages for 'mongrel cluster'.

Page 48/77 | < Previous Page | 44 45 46 47 48 49 50 51 52 53 54 55  | Next Page >

  • Implementing database redundancy with sharded tables

    - by ensnare
    We're looking to implement load balancing by horizontally sharding our tables across a cluster of servers. What are some options to implement live redundancy should a server fail? Would it be effective to do (2) INSERTS instead of one ... one to the target shard, and another to a secondary shard which could be accessed should the primary shard not respond? Or is there a better way? Thanks.

    Read the article

  • Migrating From SQL Server Server 7 To 2005, What should I get excited about?

    - by Jon P
    The company I work for has decided to join the 21st century and upgrade our main database cluster from SQL Server 7 to SQL Server 2005. As a web developer what new whiz-bang features of SQL Server 2005 should I get excited about or get to know? Currently I'm mainly writing CRUD style queries, pretty much exclusively using Stored Procdures for a mixed ASP.net and Classic ASP environment.

    Read the article

  • ASP.NET MVC, Webform hybrid

    - by Greg Ogle
    We (me and my team) have a ASP.NET MVC application and we are integrating a page or two that are Web Forms. We are trying to reuse the Master Page from our MVC part of the app in the WebForms part. We have found a way of rendering an MVC partial view in web forms, which works great, until we try and do a postback, which is the reason for using a WebForm. The Error: Validation of viewstate MAC failed. If this application is hosted by a Web Farm or cluster, ensure that configuration specifies the same validationKey and validation algorithm. AutoGenerate cannot be used in a cluster. The Code to render the partial view from a WebForm (credited to "How to include a partial view inside a webform"): public static class WebFormMVCUtil { public static void RenderPartial(string partialName, object model) { //get a wrapper for the legacy WebForm context var httpCtx = new HttpContextWrapper(System.Web.HttpContext.Current); //create a mock route that points to the empty controller var rt = new RouteData(); rt.Values.Add("controller", "WebFormController"); //create a controller context for the route and http context var ctx = new ControllerContext( new RequestContext(httpCtx, rt), new WebFormController()); //find the partial view using the viewengine var view = ViewEngines.Engines.FindPartialView(ctx, partialName).View; //create a view context and assign the model var vctx = new ViewContext(ctx, view, new ViewDataDictionary { Model = model }, new TempDataDictionary()); //ERROR OCCURS ON THIS LINE view.Render(vctx, System.Web.HttpContext.Current.Response.Output); } } My only experience with this error is in context of a web farm, which is not the case. Also, I understand that the machine key is used for decrypting the ViewState. Any information on how to diagnose this issue would be appreciated. A Work-around: So far the work-around is to move the header content to a PartialView, then use an AJAX call to call a page with just the Partial View from the WebForms, and then using the PartialView directly on the MVC Views. Also, we are still able to share non-tech-specific parts of the Master Page, i.e. anything that is not MVC specific. Still yet, this is not an ideal solution, a server-side solution is still desired. Also, this solutino has issues when working with controls that have more sophisticated controls, using JavaScript, particularly dynamically generated script as used by 3rd party controls.

    Read the article

  • running a python script where dependencies are not avail: distributed computing

    - by sadhu_
    Hi, I have access to a grid (running condor) that would (potentially) allow to very substantially reduce how long by nltk based nlp tasks take. unfortunately, i dont have root access on the cluster so cannot install new packages, only run whatever is available on the linux boxes. python is of course available, but nltk isnt - i was wondering however, if there might be a way around this somehow ? is there a way i can somehow still distribute the task in a self-contained 'package' of some sort? Thanks for your hel

    Read the article

  • Amazon EC2 - network issues

    - by Algorist
    Hi, We are launching hadoop cluster on amazon ec2 and recently we are having network issues like master unable to connect to slave. We thought the reason is due to amazon throttling the network connections over a limit. So, we tried to establish a connection after a random delay from each slave node. But, that didn't help. Are there any other suggestions? Thank you Bala

    Read the article

  • Amazon Web Services : Fault tolerant solution

    - by Algorist
    Hi, I am using Boto library to write scripts for automating our jobs on AWS. My script actually starts a hadoop cluster using cloudera scripts and then does some customization. I am having a problem with retries. Seems like very command in my script fails once couple of days. I started adding retry to all the commands, but then the code is very clumsy and difficult to maintain. what do people do in general. Thank you Bala

    Read the article

  • NoHostAvailableException With Cassandra & DataStax Java Driver If Large ResultSet

    - by hughj
    The setup: 2-node Cassandra 1.2.6 cluster replicas=2 very large CQL3 table with no secondary index Rowkey is a UUID.randomUUID().toString() read consistency set to ONE Using DataStax java driver 1.0 The request: Attempting to do a table scan by "SELECT some-col from schema.table LIMIT nnn;" The fail: Once I go beyond a certain nnn LIMIT, I start to get NoHostAvailableExceptions from the driver. It reads like this: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.181.13.239 ([/10.181.13.239] Unexpected exception triggered)) at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:64) at com.datastax.driver.core.ResultSetFuture.extractCauseFromExecutionException(ResultSetFuture.java:214) at com.datastax.driver.core.ResultSetFuture.getUninterruptibly(ResultSetFuture.java:169) at com.jpmc.es.rtm.storage.impl.EventExtract.main(EventExtract.java:36) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120) Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.181.13.239 ([/10.181.13.239] Unexpected exception triggered)) at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:98) at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:165) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) Given: This is probably not the most enlightened thing to do to a large table with millions of rows, but this is how I learn what not to do, so I would really appreciate someone who could volunteer how this kind of error can be debugged. For example, when this happens, there are no indications that the nodes in the cluster ever had an issue with the request (there is nothing in the logs on either node that indicate any timeout or failure). Also, I enabled the trace on the driver, which gives you some nice autotrace (ala Oracle) info as long as the query succeeds. But in this case, the driver blows a NoHostAvailableException and no ExecutionInfo is available, so tracing has not provided any benefit in this case. I also find it interesting that this does not seem to be recorded as a timeout (my JMX consoles tell me no timeouts have occurred). So, I am left not understanding WHERE the failure is actually occurring. I am left with the idea that it is the driver that is having a problem, but I don't know how to debug it (and I would really like to). I have read several posts from folks that state that query'g for resultSets 10000 rows is probably not a good idea, and I am willing to accept this, but I would like to understand what is causing the exception and where the exception is happening. FWIW, I also tried bumping the timeout properties in the cassandra.yaml, but this made no difference whatsoever. I welcome any suggestions, anecdotes, insults, or monetary contributions for my registration in the house of moron-developers. Regards!!

    Read the article

  • How do I solve an AntiForgeryToken exception that occurs after an iisreset in my ASP.Net MVC app?

    - by Colin Newell
    I’m having problems with the AntiForgeryToken in ASP.Net MVC. If I do an iisreset on my web server and a user continues with their session they get bounced to a login page. Not terrible but then the AntiForgery token blows up and the only way to get going again is to blow away the cookie on the browser. With the beta version of version 1 it used to go wrong when reading the cookie back in for me so I used to scrub it before asking for a validation token but that was fixed when it was released. For now I think I’ll roll back to my code that fixed the beta problem but I can’t help but think I’m missing something. Is there a simpler solution, heck should I just drop their helper and create a new one from scratch? I get the feeling that a lot of the problem is the fact that it’s tied so deeply into the old ASP.Net pipeline and is trying to kludge it into doing something it wasn’t really designed to do. I had a look in the source code for the ASP.Net MVC 2 RC and it doesn't look like the code has changed much so while I haven't tried it, I don't think there are any answers there. Here is the relevant part of the stack trace of the exception. Edit: I just realised I didn't mention that this is just trying to insert the token on the GET request. This isn't the validation that occurs when you do a POST kicking off. System.Web.Mvc.HttpAntiForgeryException: A required anti-forgery token was not supplied or was invalid. ---> System.Web.HttpException: Validation of viewstate MAC failed. If this application is hosted by a Web Farm or cluster, ensure that <machineKey> configuration specifies the same validationKey and validation algorithm. AutoGenerate cannot be used in a cluster. ---> System.Web.UI.ViewStateException: Invalid viewstate. Client IP: 127.0.0.1 Port: 4991 User-Agent: scrubbed ViewState: scrubbed Referer: blah Path: /oursite/Account/Login ---> System.Security.Cryptography.CryptographicException: Padding is invalid and cannot be removed. at System.Security.Cryptography.RijndaelManagedTransform.DecryptData(Byte[] inputBuffer, Int32 inputOffset, Int32 inputCount, Byte[]& outputBuffer, Int32 outputOffset, PaddingMode paddingMode, Boolean fLast) at System.Security.Cryptography.RijndaelManagedTransform.TransformFinalBlock(Byte[] inputBuffer, Int32 inputOffset, Int32 inputCount) at System.Security.Cryptography.CryptoStream.FlushFinalBlock() at System.Web.Configuration.MachineKeySection.EncryptOrDecryptData(Boolean fEncrypt, Byte[] buf, Byte[] modifier, Int32 start, Int32 length, IVType ivType, Boolean useValidationSymAlgo) at System.Web.UI.ObjectStateFormatter.Deserialize(String inputString) --- End of inner exception stack trace --- --- End of inner exception stack trace --- at System.Web.UI.ViewStateException.ThrowError(Exception inner, String persistedState, String errorPageMessage, Boolean macValidationError) at System.Web.UI.ViewStateException.ThrowMacValidationError(Exception inner, String persistedState) at System.Web.UI.ObjectStateFormatter.Deserialize(String inputString) at System.Web.UI.ObjectStateFormatter.System.Web.UI.IStateFormatter.Deserialize(String serializedState) at System.Web.Mvc.AntiForgeryDataSerializer.Deserialize(String serializedToken) --- End of inner exception stack trace --- at System.Web.Mvc.AntiForgeryDataSerializer.Deserialize(String serializedToken) at System.Web.Mvc.HtmlHelper.GetAntiForgeryTokenAndSetCookie(String salt, String domain, String path) at System.Web.Mvc.HtmlHelper.AntiForgeryToken(String salt, String domain, String path)

    Read the article

  • Scipy Negative Distance? What?

    - by disappearedng
    I have a input file which are all floating point numbers to 4 decimal place. i.e. 13359 0.0000 0.0000 0.0001 0.0001 0.0002` 0.0003 0.0007 ... (the first is the id). My class uses the loadVectorsFromFile method which multiplies it by 10000 and then int() these numbers. On top of that, I also loop through each vector to ensure that there are no negative values inside. However, when I perform _hclustering, I am continually seeing the error, "Linkage Z contains negative values". I seriously think this is a bug because: I checked my values, the values are no where small enough or big enough to approach the limits of the floating point numbers and the formula that I used to derive the values in the file uses absolute value (my input is DEFINITELY right). Can someone enligten me as to why I am seeing this weird error? What is going on that is causing this negative distance error? ===== def loadVectorsFromFile(self, limit, loc, assertAllPositive=True, inflate=True): """Inflate to prevent "negative" distance, we use 4 decimal points, so *10000 """ vectors = {} self.winfo("Each vector is set to have %d limit in length" % limit) with open( loc ) as inf: for line in filter(None, inf.read().split('\n')): l = line.split('\t') if limit: scores = map(float, l[1:limit+1]) else: scores = map(float, l[1:]) if inflate: vectors[ l[0]] = map( lambda x: int(x*10000), scores) #int might save space else: vectors[ l[0]] = scores if assertAllPositive: #Assert that it has no negative value for dirID, l in vectors.iteritems(): if reduce(operator.or_, map( lambda x: x < 0, l)): self.werror( "Vector %s has negative values!" % dirID) return vectors def main( self, inputDir, outputDir, limit=0, inFname="data.vectors.all", mappingFname='all.id.features.group.intermediate'): """ Loads vector from a file and start clustering INPUT vectors is { featureID: tfidfVector (list), } """ IDFeatureDic = loadIdFeatureGroupDicFromIntermediate( pjoin(self.configDir, mappingFname)) if not os.path.exists(outputDir): os.makedirs(outputDir) vectors = self.loadVectorsFromFile( limit, pjoin( inputDir, inFname)) for threshold in map( lambda x:float(x)/30, range(20,30)): clusters = self._hclustering(threshold, vectors) if clusters: outputLoc = pjoin(outputDir, "threshold.%s.result" % str(threshold)) with open(outputLoc, 'w') as outf: for clusterNo, cluster in clusters.iteritems(): outf.write('%s\n' % str(clusterNo)) for featureID in cluster: feature, group = IDFeatureDic[featureID] outline = "%s\t%s\n" % (feature, group) outf.write(outline.encode('utf-8')) outf.write("\n") else: continue def _hclustering(self, threshold, vectors): """function which you should call to vary the threshold vectors: { featureID: [ tfidf scores, tfidf score, .. ] """ clusters = defaultdict(list) if len(vectors) > 1: try: results = hierarchy.fclusterdata( vectors.values(), threshold, metric='cosine') except ValueError, e: self.werror("_hclustering: %s" % str(e)) return False for i, featureID in enumerate( vectors.keys()):

    Read the article

  • sqlite over network share for failover

    - by reinier
    As a follow-up of this question: sqlite-over-a-network-share If I put the SQlite DB on a network share, but will not access it concurrently from different machines. I only have the SQLite db stored on a share so a cluster of failover computers can take over where one machine left off. Are there any inherent problems with that approach?

    Read the article

  • Run Linux command as predefined user

    - by vijay.shad
    Hi all, I have created a shell script to start a server program. startup.sh start When above command will executes, it will try starts the server as adminuser. To achieve this my script has written like this. SUBIT="su - adminuser -c " SERVER_BOX_COMMAND_A="Server" ############## # Function to start cluster function start(){ $SUBIT "$SERVER_BOX_COMMAND_A" } When i execute the command it asks for password. Is there any other way to do this so, it will not ask for password.

    Read the article

  • Can I use an NSDecimalNumber anywhere that an NSNumber is expected?

    - by Nick Forge
    NSDecimalNumber is a subclass of NSNumber, and from what I can tell, it implements all of the NSNumber methods as expected for an NSNumber instance. Given that, is it ok to give NSDecimalNumbers to any code that is expecting an NSNumber? The only possible issue might be code that checks that an argument is an instance of NSNumber, but since NSNumber is a class-cluster, code like this would have to check that the instance is a subclass of NSNumber, and NSDecimalNumber instances should pass the same tests.

    Read the article

  • Jboss logging issue

    - by balaji
    I'm Working as deployer and server administrator. We use Jboss 4.0x AS to deploy our applications. The issue I'm facing is, Whenever we redeploy/restart the server, server.log is getting created but after sometime the logging goes off. Yes it is not at all updating the server.log file. Due to this, we could not trace the other critical issues we have. Actually we have two separate nodes and we do deploy/restarting the server separately on two nodes. We are facing the issue in both of our test and production environment. I could not trace out where exactly the issue is. Could you please help me in resolving the issue? If we have any other issues, we can check the log files. If log itself is not getting updated/logged, how can we move further in analyzing the issues without the recent/updated logs? Below are the logs found in the stdout.log: 18:55:50,303 INFO [Server] Core system initialized 18:55:52,296 INFO [WebService] Using RMI server codebase: http://kl121tez.is.klmcorp.net:8083/ 18:55:52,313 INFO [Log4jService$URLWatchTimerTask] Configuring from URL: resource:log4j.xml 18:55:52,860 ERROR [STDERR] LOG0026E The Log Manager cannot create the object AmasRBPFTraceLogger without a class name. 18:55:52,860 ERROR [STDERR] LOG0026E The Log Manager cannot create the object AmasRBPFMessageLogger without a class name. 18:55:54,273 ERROR [STDERR] LOG0026E The Log Manager cannot create the object AmasCacheTraceLogger without a class name. 18:55:54,274 ERROR [STDERR] LOG0026E The Log Manager cannot create the object AmasCacheMessageLogger without a class name. 18:55:54,334 ERROR [STDERR] LOG0026E The Log Manager cannot create the object JACCTraceLogger without a class name. 18:55:54,334 ERROR [STDERR] LOG0026E The Log Manager cannot create the object JACCMessageLogger without a class name. 18:55:56,059 INFO [ServiceEndpointManager] WebServices: jbossws-1.0.3.SP1 (date=200609291417) 18:55:56,635 INFO [Embedded] Catalina naming disabled 18:55:56,671 INFO [ClusterRuleSetFactory] Unable to find a cluster rule set in the classpath. Will load the default rule set. 18:55:56,672 INFO [ClusterRuleSetFactory] Unable to find a cluster rule set in the classpath. Will load the default rule set. 18:55:56,843 INFO [Http11BaseProtocol] Initializing Coyote HTTP/1.1 on http-0.0.0.0-8180 18:55:56,844 INFO [Catalina] Initialization processed in 172 ms 18:55:56,844 INFO [StandardService] Starting service jboss.web

    Read the article

  • Using Java for 'nslookup -type=srv'

    - by user321524
    Hi: Is there a way in Java to do a Nslookup search on SRV records? I need to bind to Active Directory and would like to use the SRV records to help determine which of the ADs in the cluster are live. In command line the nslookup search string is: 'nslookup -type=srv _ ldap._tcp.ActiveDirectory domain name' Thanks.

    Read the article

  • Enthought Python, Sage, or others (in Unix clusters)

    - by vailen
    I am currently get access to a cluster of Unix machines, but they don't have the software I need (numpy, scipy, matplotlib, etc), and I have to install them by myself (I don't have the root permission, either, so commands like apt-get or yast doesn't work). In the worst case, I have to compile them all from source. Is there any better way to do so? I hear something about Enthought Python and Sage, but not sure what is the best way to do so. Any suggestion?

    Read the article

  • How do I print out objects in an array in python?

    - by Jonathan
    I'm writing a code which performs a k-means clustering on a set of data. I'm actually using the code from a book called collective intelligence by O'Reilly. Everything works, but in his code he uses the command line and i want to write everything in notepad++. As a reference his line is >>>kclust=clusters.kcluster(data,k=10) >>>[rownames[r] for r in k[0]] Here is my code: from PIL import Image,ImageDraw def readfile(filename): lines=[line for line in file(filename)] # First line is the column titles colnames=lines[0].strip( ).split('\t')[1:] rownames=[] data=[] for line in lines[1:]: p=line.strip( ).split('\t') # First column in each row is the rowname rownames.append(p[0]) # The data for this row is the remainder of the row data.append([float(x) for x in p[1:]]) return rownames,colnames,data from math import sqrt def pearson(v1,v2): # Simple sums sum1=sum(v1) sum2=sum(v2) # Sums of the squares sum1Sq=sum([pow(v,2) for v in v1]) sum2Sq=sum([pow(v,2) for v in v2]) # Sum of the products pSum=sum([v1[i]*v2[i] for i in range(len(v1))]) # Calculate r (Pearson score) num=pSum-(sum1*sum2/len(v1)) den=sqrt((sum1Sq-pow(sum1,2)/len(v1))*(sum2Sq-pow(sum2,2)/len(v1))) if den==0: return 0 return 1.0-num/den class bicluster: def __init__(self,vec,left=None,right=None,distance=0.0,id=None): self.left=left self.right=right self.vec=vec self.id=id self.distance=distance def hcluster(rows,distance=pearson): distances={} currentclustid=-1 # Clusters are initially just the rows clust=[bicluster(rows[i],id=i) for i in range(len(rows))] while len(clust)>1: lowestpair=(0,1) closest=distance(clust[0].vec,clust[1].vec) # loop through every pair looking for the smallest distance for i in range(len(clust)): for j in range(i+1,len(clust)): # distances is the cache of distance calculations if (clust[i].id,clust[j].id) not in distances: distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec) #print 'i' #print i #print #print 'j' #print j #print d=distances[(clust[i].id,clust[j].id)] if d<closest: closest=d lowestpair=(i,j) # calculate the average of the two clusters mergevec=[ (clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0 for i in range(len(clust[0].vec))] # create the new cluster newcluster=bicluster(mergevec,left=clust[lowestpair[0]], right=clust[lowestpair[1]], distance=closest,id=currentclustid) # cluster ids that weren't in the original set are negative currentclustid-=1 del clust[lowestpair[1]] del clust[lowestpair[0]] clust.append(newcluster) return clust[0] def kcluster(rows,distance=pearson,k=4): # Determine the minimum and maximum values for each point ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows])) for i in range(len(rows[0]))] # Create k randomly placed centroids clusters=[[random.random( )*(ranges[i][1]-ranges[i][0])+ranges[i][0] for i in range(len(rows[0]))] for j in range(k)] lastmatches=None for t in range(100): print 'Iteration %d' % t bestmatches=[[] for i in range(k)] # Find which centroid is the closest for each row for j in range(len(rows)): row=rows[j] bestmatch=0 for i in range(k): d=distance(clusters[i],row) if d<distance(clusters[bestmatch],row): bestmatch=i bestmatches[bestmatch].append(j) # If the results are the same as last time, this is complete if bestmatches==lastmatches: break lastmatches=bestmatches # Move the centroids to the average of their members for i in range(k): avgs=[0.0]*len(rows[0]) if len(bestmatches[i])>0: for rowid in bestmatches[i]: for m in range(len(rows[rowid])): avgs[m]+=rows[rowid][m] for j in range(len(avgs)): avgs[j]/=len(bestmatches[i]) clusters[i]=avgs return bestmatches

    Read the article

  • Choosing a distributed shared memory solution

    - by mindas
    I have a task to build a prototype for a massively scalable distributed shared memory (DSM) app. The prototype would only serve as a proof-of-concept, but I want to spend my time most effectively by picking the components which would be used in the real solution later on. The aim of this solution is to take data input from an external source, churn it and make the result available for a number of frontends. Those "frontends" would just take the data from the cache and serve it without extra processing. The amount of frontend hits on this data can literally be millions per second. The data itself is very volatile; it can (and does) change quite rapidly. However the frontends should see "old" data until the newest has been processed and cached. The processing and writing is done by a single (redundant) node while other nodes only read the data. In other words: no read-through behaviour. I was looking into solutions like memcached however this particular one doesn't fulfil all our requirements which are listed below: The solution must at least have Java client API which is reasonably well maintained as the rest of app is written in Java and we are seasoned Java developers; The solution must be totally elastic: it should be possible to add new nodes without restarting other nodes in the cluster; The solution must be able to handle failover. Yes, I realize this means some overhead, but the overall served data size isn't big (1G max) so this shouldn't be a problem. By "failover" I mean seamless execution without hardcoding/changing server IP address(es) like in memcached clients when a node goes down; Ideally it should be possible to specify the degree of data overlapping (e.g. how many copies of the same data should be stored in the DSM cluster); There is no need to permanently store all the data but there might be a need of post-processing of some of the data (e.g. serialization to the DB). Price. Obviously we prefer free/open source but we're happy to pay a reasonable amount if a solution is worth it. In any way, paid 24hr/day support contract is a must. The whole thing has to be hosted in our data centers so SaaS offerings like Amazon SimpleDB are out of scope. We would only consider this if no other options would be available. Ideally the solution would be strictly consistent (as in CAP); however, eventual consistence can be considered as an option. Thanks in advance for any ideas.

    Read the article

  • Scalability comparison between different DBMSs

    - by Björn Lindfors
    By what factor does the performance (read queries/sec) increase when a machine is added to a cluster of machines running either: a Bigtable-like database MySQL? Google's research paper on Bigtable suggests that "near-linear" scaling is achieved can be achieved with Bigtable. This page here featuring MySQL's marketing jargon suggests that MySQL is capable of scaling linearly. Where is the truth?

    Read the article

  • Very basic question about Hadoop and compressed input files

    - by Luis Sisamon
    I have started to look into Hadoop. If my understanding is right i could process a very big file and it would get split over different nodes, however if the file is compressed then the file could not be split and wold need to be processed by a single node (effectively destroying the advantage of running a mapreduce ver a cluster of parallel machines). My question is, assuming the above is correct, is it possible to split a large file manually in fixed-size chunks, or daily chunks, compress them and then pass a list of compressed input files to perform a mapreduce?

    Read the article

< Previous Page | 44 45 46 47 48 49 50 51 52 53 54 55  | Next Page >