Search Results

Search found 6716 results on 269 pages for 'distributed algorithm'.

Page 49/269 | < Previous Page | 45 46 47 48 49 50 51 52 53 54 55 56  | Next Page >

  • Need help implementing this algorithm with map reduce(hadoop)

    - by Julia
    Hi all! i have algorithm that will go through a large data set read some text files and search for specific terms in those lines. I have it implemented in Java, but I didnt want to post code so that it doesnt look i am searching for someone to implement it for me, but it is true i really need a lot of help!!! This was not planned for my project, but data set turned out to be huge, so teacher told me I have to do it like this. I was reading about MapReduce and thaught that i first do the standard implementation and then it will be more/less easier to do it with mapreduce. But didnt happen, since algorithm is quite stupid and nothing special, and map reduce...i cant wrap my mind around it. So here is shortly pseudo code of my algorithm LIST termList (there is method that creates this list from lucene index) FOLDER topFolder INPUT topFolder IF it is folder and not empty list files (there are 30 sub folders inside) FOR EACH sub folder GET file "CheckedFile.txt" analyze(CheckedFile) ENDFOR END IF Method ANALYZE(CheckedFile) read CheckedFile WHILE CheckedFile has next line GET line FOR(loops through termList) GET third word from line IF third word = term from list append whole line to string buffer ENDIF ENDFOR END WHILE OUTPUT string buffer to file Also, as you can see, each time when "analyze" is called, new file has to be created, i understood that map reduce is difficult to write to many outputs??? I understand mapreduce intuition, and my example seems perfectly suited for mapreduce, but when it comes to do this, obviously I do not know enough and i am STUCK! Please please help.

    Read the article

  • On counting pairs of words that differ by one letter

    - by Quintofron
    Let us consider n words, each of length k. Those words consist of letters over an alphabet (whose cardinality is n) with defined order. The task is to derive an O(nk) algorithm to count the number of pairs of words that differ by one position (no matter which one exactly, as long as it's only a single position). For instance, in the following set of words (n = 5, k = 4): abcd, abdd, adcb, adcd, aecd there are 5 such pairs: (abcd, abdd), (abcd, adcd), (abcd, aecd), (adcb, adcd), (adcd, aecd). So far I've managed to find an algorithm that solves a slightly easier problem: counting the number of pairs of words that differ by one GIVEN position (i-th). In order to do this I swap the letter at the ith position with the last letter within each word, perform a Radix sort (ignoring the last position in each word - formerly the ith position), linearly detect words whose letters at the first 1 to k-1 positions are the same, eventually count the number of occurrences of each letter at the last (originally ith) position within each set of duplicates and calculate the desired pairs (the last part is simple). However, the algorithm above doesn't seem to be applicable to the main problem (under the O(nk) constraint) - at least not without some modifications. Any idea how to solve this?

    Read the article

  • Vacancy Tracking Algorithm implementation in C++

    - by Dave
    I'm trying to use the vacancy tracking algorithm to perform transposition of multidimensional arrays in C++. The arrays come as void pointers so I'm using address manipulation to perform the copies. Basically, there is an algorithm that starts with an offset and works its way through the whole 1-d representation of the array like swiss cheese, knocking out other offsets until it gets back to the original one. Then, you have to start at the next, untouched offset and do it again. You repeat until all offsets have been touched. Right now, I'm using a std::set to just fill up all possible offsets (0 up to the multiplicative fold of the dimensions of the array). Then, as I go through the algorithm, I erase from the set. I figure this would be fastest because I need to randomly access offsets in the tree/set and delete them. Then I need to quickly find the next untouched/undeleted offset. First of all, filling up the set is very slow and it seems like there must be a better way. It's individually calling new[] for every insert. So if I have 5 million offsets, there's 5 million news, plus re-balancing the tree constantly which as you know is not fast for a pre-sorted list. Second, deleting is slow as well. Third, assuming 4-byte data types like int and float, I'm using up actually the same amount of memory as the array itself to store this list of untouched offsets. Fourth, determining if there are any untouched offsets and getting one of them is fast -- a good thing. Does anyone have suggestions for any of these issues?

    Read the article

  • License key pattern detection?

    - by Ricket
    This is not a real situation; please ignore legal issues that you might think apply, because they don't. Let's say I have a set of 200 known valid license keys for a hypothetical piece of software's licensing algorithm, and a license key consists of 5 sets of 5 alphanumeric case-insensitive (all uppercase) characters. Example: HXDY6-R3DD7-Y8FRT-UNPVT-JSKON Is it possible (or likely) to extrapolate other possible keys for the system? What if the set was known to be consecutive; how do the methods change for this situation, and what kind of advantage does this give? I have heard of "keygens" before, but I believe they are probably made by decompiling the licensing software rather than examining known valid keys. In this case, I am only given the set of keys and I must determine the algorithm. I'm also told it is an industry standard algorithm, so it's probably not something basic, though the chance is always there I suppose. If you think this doesn't belong in Stack Overflow, please at least suggest an alternate place for me to look or ask the question. I honestly don't know where to begin with a problem like this. I don't even know the terminology for this kind of problem.

    Read the article

  • Find existence of number in a sorted list in constant time? (Interview question)

    - by Rich
    I'm studying for upcoming interviews and have encountered this question several times (written verbatim) Find or determine non existence of a number in a sorted list of N numbers where the numbers range over M, M N and N large enough to span multiple disks. Algorithm to beat O(log n); bonus points for constant time algorithm. First of all, I'm not sure if this is a question with a real solution. My colleagues and I have mused over this problem for weeks and it seems ill formed (of course, just because we can't think of a solution doesn't mean there isn't one). A few questions I would have asked the interviewer are: Are there repeats in the sorted list? What's the relationship to the number of disks and N? One approach I considered was to binary search the min/max of each disk to determine the disk that should hold that number, if it exists, then binary search on the disk itself. Of course this is only an order of magnitude speedup if the number of disks is large and you also have a sorted list of disks. I think this would yield some sort of O(log log n) time. As for the M N hint, perhaps if you know how many numbers are on a disk and what the range is, you could use the pigeonhole principle to rule out some cases some of the time, but I can't figure out an order of magnitude improvement. Also, "bonus points for constant time algorithm" makes me a bit suspicious. Any thoughts, solutions, or relevant history of this problem?

    Read the article

  • Distributed datastore

    - by Julien Genestoux
    We're trying to add some kind of persistence in our app. The app generates about 250 entries per second. Each of these entries belong to one of 2M files. For each file, we want to keep the last 10 entries, so we can look them up later. The way our client application works : it gets a stream of all the data it fetches the right file (GET) it adds the new content it saves the file back (PUT) We're looking for an efficient way to store this data that can scale horizontally as the amount of data we're getting is doubling every few weeks. We initially looked at S3. It works fine, but becomes very expensive very fast ($1000 monthly just in PUT operations!) We then gave a shot at Riak. But it seems we can't get more than 60 write/sec on each node, which is very very slow. Any other solution out there?

    Read the article

  • Git on windows, is it truly distributed?

    - by Noel Kennedy
    I am just starting out with git on the Windows platform. I have mysygit installed and bar a few hiccups I am 'git'ing away nicely. However, I must be missing something because I don't understand how two msysgit clients on different Windows machines can push and pull to each other directly? I am a complete linux noob but I think I can see that the ssh thing allows distribution on linux. However, the msysgit client appears just to be additional commands in the windows cmd prompt and there is no windows service element. If I try git clone 'MyMatesPc' who is going to be listening to this request at the other end? I can see that if you have a 'central' server running git on linux (or cygwin), you can share commits by pushing them onto the 'central' repo from one machine, then pulling them down onto another. This effectively means that you are having to use a central server. I don't have a problem with this, but wanted to check that I am not missing anything!

    Read the article

  • Longitudinal Redundancy Check fails

    - by PaulH
    I have an application that decodes data from a magnetic stripe reader. But, I'm having difficulty getting my calculated LRC check byte to match the one on the cards. If I were to grab 3 cards each with 3 tracks, I would guess the algorithm below would work on 4 of the 9 tracks in those cards. The algorithm I'm using looks like this (C#): private static char GetLRC(string s, int start, int end) { int result = 0; for (int i = start; i <= end; i++) { result ^= Convert.ToByte(s[i]); } return Convert.ToChar(result); } This is an example of track 3 data that fails the check. On this card, track 2 matched, but track 1 also failed. 0 1 2 3 4 5 6 7 8 9 A B C D E F 00 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 10 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 7 20 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 30 8 8 8 9 9 9 9 9 9 9 9 9 9 0 0 0 40 0 0 0 0 0 0 0 1 2 3 4 1 1 1 1 1 50 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 60 3 3 3 3 3 3 3 3 The sector delimiter is ';' and it ends with a '?'. The LRC byte from this track is 0x30. Unfortunately, the algorithm above computes an LRC of 0x00 per the following calculation (apologies for its length. I want to be thorough): 00 ^ 3b = 3b ';' 3b ^ 33 = 08 08 ^ 34 = 3c 3c ^ 34 = 08 08 ^ 34 = 3c 3c ^ 34 = 08 08 ^ 34 = 3c 3c ^ 34 = 08 08 ^ 34 = 3c 3c ^ 34 = 08 08 ^ 34 = 3c 3c ^ 34 = 08 08 ^ 35 = 3d 3d ^ 35 = 08 08 ^ 35 = 3d 3d ^ 35 = 08 08 ^ 35 = 3d 3d ^ 35 = 08 08 ^ 35 = 3d 3d ^ 35 = 08 08 ^ 35 = 3d 3d ^ 35 = 08 08 ^ 36 = 3e 3e ^ 36 = 08 08 ^ 36 = 3e 3e ^ 36 = 08 08 ^ 36 = 3e 3e ^ 36 = 08 08 ^ 36 = 3e 3e ^ 36 = 08 08 ^ 36 = 3e 3e ^ 36 = 08 08 ^ 37 = 3f 3f ^ 37 = 08 08 ^ 37 = 3f 3f ^ 37 = 08 08 ^ 37 = 3f 3f ^ 37 = 08 08 ^ 37 = 3f 3f ^ 37 = 08 08 ^ 37 = 3f 3f ^ 37 = 08 08 ^ 38 = 30 30 ^ 38 = 08 08 ^ 38 = 30 30 ^ 38 = 08 08 ^ 38 = 30 30 ^ 38 = 08 08 ^ 38 = 30 30 ^ 38 = 08 08 ^ 38 = 30 30 ^ 38 = 08 08 ^ 39 = 31 31 ^ 39 = 08 08 ^ 39 = 31 31 ^ 39 = 08 08 ^ 39 = 31 31 ^ 39 = 08 08 ^ 39 = 31 31 ^ 39 = 08 08 ^ 39 = 31 31 ^ 39 = 08 08 ^ 30 = 38 38 ^ 30 = 08 08 ^ 30 = 38 38 ^ 30 = 08 08 ^ 30 = 38 38 ^ 30 = 08 08 ^ 30 = 38 38 ^ 30 = 08 08 ^ 30 = 38 38 ^ 30 = 08 08 ^ 31 = 39 39 ^ 32 = 0b 0b ^ 33 = 38 38 ^ 34 = 0c 0c ^ 31 = 3d 3d ^ 31 = 0c 0c ^ 31 = 3d 3d ^ 31 = 0c 0c ^ 31 = 3d 3d ^ 31 = 0c 0c ^ 31 = 3d 3d ^ 31 = 0c 0c ^ 31 = 3d 3d ^ 31 = 0c 0c ^ 32 = 3e 3e ^ 32 = 0c 0c ^ 32 = 3e 3e ^ 32 = 0c 0c ^ 32 = 3e 3e ^ 32 = 0c 0c ^ 32 = 3e 3e ^ 32 = 0c 0c ^ 32 = 3e 3e ^ 32 = 0c 0c ^ 33 = 3f 3f ^ 33 = 0c 0c ^ 33 = 3f 3f ^ 33 = 0c 0c ^ 33 = 3f 3f ^ 33 = 0c 0c ^ 33 = 3f 3f ^ 33 = 0c 0c ^ 33 = 3f 3f ^ 3f = 00 '?' If anybody can point out how to fix my algorithm, I would appreciate it. Thanks, PaulH

    Read the article

  • Sell me Distributed revision control

    - by ring bearer
    I know 1000s of similar topics floating around. I read at lest 5 threads here in SO But why am I still not convinced about DVCS? I have only following questions (note that I am selfishly worried only about Java projects) What is the advantage or value of committing locally? What? really? All modern IDEs allows you to keep track of your changes? and if required you can restore a particular change. Also, they have a feature to label your changes/versions at IDE level!? what if I crash my hard drive? where did my local repository go? (so how is it cool compared to checking in to a central repo?) Working offline or in an air plane. What is the big deal?In order for me to build a release with my changes, I must eventually connect to the central repository. Till then it does not matter how I track my changes locally. Ok Linus Torvalds gives his life to Git and hates everything else. Is that enough to blindly sing praises? Linus lives in a different world compared to offshore developers in my mid-sized project? Pitch me!

    Read the article

  • Estimating the boundary of arbitrarily distributed data

    - by Dave
    I have two dimensional discrete spatial data. I would like to make an approximation of the spatial boundaries of this data so that I can produce a plot with another dataset on top of it. Ideally, this would be an ordered set of (x,y) points that matplotlib can plot with the plt.Polygon() patch. My initial attempt is very inelegant: I place a fine grid over the data, and where data is found in a cell, a square matplotlib patch is created of that cell. The resolution of the boundary thus depends on the sampling frequency of the grid. Here is an example, where the grey region are the cells containing data, black where no data exists. OK, problem solved - why am I still here? Well.... I'd like a more "elegant" solution, or at least one that is faster (ie. I don't want to get on with "real" work, I'd like to have some fun with this!). The best way I can think of is a ray-tracing approach - eg: from xmin to xmax, at y=ymin, check if data boundary crossed in intervals dx y=ymin+dy, do 1 do 1-2, but now sample in y An alternative is defining a centre, and sampling in r-theta space - ie radial spokes in dtheta increments. Both would produce a set of (x,y) points, but then how do I order/link neighbouring points them to create the boundary? A nearest neighbour approach is not appropriate as, for example (to borrow from Geography), an isthmus (think of Panama connecting N&S America) could then close off and isolate regions. This also might not deal very well with the holes seen in the data, which I would like to represent as a different plt.Polygon. The solution perhaps comes from solving an area maximisation problem. For a set of points defining the data limits, what is the maximum contiguous area contained within those points To form the enclosed area, what are the neighbouring points for the nth point? How will the holes be treated in this scheme - is this erring into topology now? Apologies, much of this is me thinking out loud. I'd be grateful for some hints, suggestions or solutions. I suspect this is an oft-studied problem with many solution techniques, but I'm looking for something simple to code and quick to run... I guess everyone is, really! Cheers, David

    Read the article

  • MSDTC Distributed Transaction Coordinator Enabling

    - by Curtis White
    I've a web server and a separate SQL server. I'm trying to use transaction scope to ensure that SQL queries are completed with my linq queries. I wrap everything with this using (TransactionScope scope = new TransactionScope()) I want to know where I need to install DTC. Do I need to install it on the IIS 7.5 box AND the SQL server? Do I need to unblock some ports? Are there any security risk in doing so? I've setup this up once before but don't remember how. If I can't get access to DTC then is there any other way to ensure a lINQ and sql query is atomic?

    Read the article

  • Embed a database in the .apk of a distributed application [Android]

    - by Sephy
    Hi everybody, My question is I think quite simple but I don't think the answer will be... I have quite a lot of content to have in my application to make it run properly, and I'm thinking of putting all of it in a database, and distribute it in an embeded database with the application in the market. The only trouble is that I have no idea of how to do that. I know that I can extract a file .db from Eclipse DDMS with the content of my database, and I suppose I need to put it in the asset folder of my application, but then how to make the application use it to regenerate the application database? If you have any link to some code or help, that would be great. Thanks

    Read the article

  • Merge Sort issue when removing the array copy step

    - by Ime Prezime
    I've been having an issue that I couldn't debug for quite some time. I am trying to implement a MergeSort algorithm with no additional steps of array copying by following Robert Sedgewick's algorithm in "Algorithm's in C++" book. Short description of the algorithm: The recursive program is set up to sort b, leaving results in a. Thus, the recursive calls are written to leave their result in b, and we use the basic merge program to merge those files from b into a. In this way, all the data movement is done during the course of the merges. The problem is that I cannot find any logical errors but the sorting isn't done properly. Data gets overwritten somewhere and I cannot determine what logical error causes this. The data is sorted when the program is finished but it is not the same data any more. For example, Input array: { A, Z, W, B, G, C } produces the array: { A, G, W, W, Z, Z }. I can obviously see that it must be a logical error somewhere, but I have been trying to debug this for a pretty long time and I think a fresh set of eyes could maybe see what I'm missing cause I really can't find anything wrong. My code: static const int M = 5; void insertion(char** a, int l, int r) { int i,j; char * temp; for (i = 1; i < r + 1; i++) { temp = a[i]; j = i; while (j > 0 && strcmp(a[j-1], temp) > 0) { a[j] = a[j-1]; j = j - 1; } a[j] = temp; } } //merging a and b into c void merge(char ** c,char ** a, int N, char ** b, int M) { for (int i = 0, j = 0, k = 0; k < N+M; k++) { if (i == N) { c[k] = b[j++]; continue; } if (j == M) { c[k] = a[i++]; continue; } c[k] = strcmp(a[i], b[j]) < 0 ? a[i++] : b[j++]; } } void mergesortAux(char ** a, char ** b, int l, int r) { if(r - l <= M) { insertion(a, l, r); return; } int m = (l + r)/2; mergesortAux(b, a, l, m); //merge sort left mergesortAux(b, a, m+1, r); //merge sort right merge(a+l, b+l, m-l+1, b+m+1, r-m); //merge } void mergesort(char ** a,int l, int r, int size) { static char ** aux = (char**)malloc(size * sizeof(char*)); for(int i = l; i < size; i++) aux[i] = a[i]; mergesortAux(a, aux, l, r); free(aux); }

    Read the article

  • Generating a perfectly distributed grid from array

    - by zath
    I'm looking for a formula or rule that will allow me to distribute n characters into a n*n grid with as perfect of a distribution as possible. Let's say we have an array of 5 characters, A through E. Here's an example of how it definitely shouldn't turn out: A B C D E B C D E A C D E A B D E A B C E A B C D The pattern is very clear here, it doesn't look "random". It would look better this way: A B C D E D E A B C B C D E A E A B C D C D E A B What I basically did here was place the A B C D E on the first row, then shift it by 2 on the second row, by 4 on the third row, 1 on the fourth row and 3 on the fifth row. Compared to the very bad example, this one shows no clear pattern. Though I'm certainly hoping there is a pattern, so I can use it to calculate not only small arrays such as this one, but arrays of any size. Any ideas as to how this can be accomplished?

    Read the article

  • How to install ceph on EC2 Amazon Linux AMI

    - by takaomag
    I want to test Ceph (a distributed network storage and file system) on some EC2 hosts which is derived from Amazon Linux AMI (amzn-ami-2011.09.2.x86_64-ebs). The kernel version is 3.2 and btrfs is enabled. But kernel config options related to Ceph (CONFIG_CEPH_FS and CONFIG_BLK_DEV_RBD) seems to be disabled. I have to make a new kernel and register it to amazon ? Or, does someone know more easy way ?

    Read the article

  • Is there any technical info about the youtube network?

    - by Allen
    I'm an IT graduate student trying to get my head around distributed content distribution systems, much like what I assume Youtube uses. I have read Google Research stuff like Bigtable and Google File System academic papers. Is there any such thing for Youtube? Can anyone point me at stuff to learn about the Youtube network and the underlying technology? thanks dbaman

    Read the article

  • Oracle Coherence w/ ASP.NET application

    - by frankadelic
    Is it possible to use Oracle Coherence to provide distributed caching to an ASP.NET application? We would like to use Coherence to scale out an ASP.NET application which does not have distributed caching. Alternatives would be memcached, etc. However, we are considering Coherence since we already have licensing/expertise in that area.

    Read the article

  • Are there any benefits to using a Distributed vSwitch for iSCSI?

    - by dunxd
    I am designing our vSphere farm - we'll be migrating from ESX 3.5 to 4.1. I plan to set up a new farm using ESXi 4.1, and move the Virtual Machines on the 3.5 farm into it by shutdown, then import. In ESX 3.5 there is no distributed networking, so each host has a vSwitch connected to my SAN NICs, and a port group for the vmkernel. In vSphere (ESXi 4.1) I have the extra option to set up a distributed vSwitch and distributed port groups for vmkernel to access iSCSI storage. Is there any benefit to this, or should I stick to non-distributed networking for iSCSI.

    Read the article

  • Adaboost algorithm and its usage in face detection

    - by Hani
    I am trying to understand Adaboost algorithm but i have some troubles. After reading about Adaboost i realized that it is a classification algorithm(somehow like neural network). But i could not know how the weak classifiers are chosen (i think they are haar-like features for face detection) and how finally the H result which is the final strong classifier can be used. I mean if i found the alpha values and compute the H ,how am i going to benefit from it as a value (one or zero) for new images. Please is there an example describes it in a perfect way? i found the plus and minus example that is found in most adaboost tutorials but i did not know how exactly hi is chosen and how to adopt the same concept on face detection. I read many papers and i had many ideas but until now my ideas are not well arranged. Thanks....

    Read the article

< Previous Page | 45 46 47 48 49 50 51 52 53 54 55 56  | Next Page >