Search Results

Search found 4291 results on 172 pages for 'cluster analysis'.

Page 95/172 | < Previous Page | 91 92 93 94 95 96 97 98 99 100 101 102  | Next Page >

  • Writing csv files with python with exact formatting parameters

    - by Ben Harrison
    I'm having trouble processing some csv data files for a project. The project's programmer has moved on to greener pastures, and now I'm trying to finish the data analysis up (I did/do the statistical analysis.) The programmer suggested using Python's csv reader to help break down the files, which I've had some success with, but not in a way I can use. This code is a little different from what I was trying before. I am essentially attempting to create an array. In the raw data format, the first 7 rows contain no data, and then each column contains 50 experiments, each with 4000 rows, for some 200,000 rows total. What I want to do is take each column and make it an individual csv file, with each experiment in its own column. So it would be an array of 50 columns and 4000 rows for each data type. The code here does break down the correct values, and I think the logic is okay, but the output is quoted the opposite of how I want it. I want the separators without quotes (the commas and spaces) and I want the element values in quotes. Right now it is doing just the opposite for both: element values with no quotes, and the separators in quotes. I've spent several hours trying to figure out how to do this to no avail.

        import csv
        ifile = open('00_follow_maverick.csv')
        epistemicfile = open('00_follower_maverick_EP.csv', 'w')
        reader = csv.reader(ifile)
        colnum = 0
        rownum = 0
        y = 0
        z = 8
        for column in reader:
            rownum = 4000 * y + z
            for element in column:
                writer = csv.writer(epistemicfile)
                if y <= 50:
                    y = y + 1
                    writer.writerow([element])
                    writer.writerow(',')
                    rownum = x * y + z
                if y > 50:
                    y = 0
                    z = z + 1
                    writer.writerow(' ')
                    rownum = x * y + z
                if z >= 4008:
                    break

    What is going on: I am taking each row in the raw data file in iterations of 4000, so that I can separate them with commas for the 50 experiments. When y, the experiment indicator here, reaches 50, it resets back to experiment 0 and adds 1 to z, which tells it which row to look at, by the formula 4000 * y + z. When it completes the rows for all 50 experiments, it is finished. The problem here is that I don't know how to get Python to write the actual values in quotes, and my separators outside of quotes. Any help will be most appreciated. Apologies if this seems a stupid question, I have no programming experience, this is my first attempt ever. Thank you.
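
    A minimal sketch of the quoting behaviour in question (not the project's code; the file names are just reused from the post for illustration). The quoted commas in the current output appear to come from writing the separator as a field of its own (writer.writerow(',')), which the writer then has to quote; csv.writer inserts the delimiters itself, and its quoting parameter controls whether field values are quoted:

        import csv

        ifile = open('00_follow_maverick.csv')
        ofile = open('00_follower_maverick_EP.csv', 'w')
        reader = csv.reader(ifile)
        # quoting=csv.QUOTE_ALL wraps every field value in quotes; the commas
        # the writer puts between fields stay unquoted. QUOTE_NONNUMERIC is an
        # alternative that leaves numbers bare.
        writer = csv.writer(ofile, quoting=csv.QUOTE_ALL)
        for row in reader:
            writer.writerow(row)   # e.g. "1.23","4.56","7.89"
        ofile.close()
        ifile.close()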

    Read the article

  • Possible to download entire whois database / list of registered domains?

    - by Parand
    I wanted to do some analysis on registered domain names. Looks like I can hit whois.internic.net to get information about each domain, but it also looks like there are rate limits that prevent me from doing large numbers of queries. Is there a way to periodically (say daily) grab the entire whois database? I really only care about whether a domain is registered or not, so I don't need the full whois information.
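
    For the single "registered or not" check mentioned above, a rough sketch over the WHOIS protocol (a plain TCP query on port 43) against the whois.internic.net server from the post is below. Treating a "No match for" reply as "not registered" is an assumption about that server's output format, and this does nothing about the rate limits or the bulk-download question:

        import socket

        def is_registered(domain):
            # WHOIS: send the domain name terminated by CRLF, read the reply.
            s = socket.create_connection(('whois.internic.net', 43), timeout=10)
            s.sendall(domain + '\r\n')
            reply = ''
            while True:
                chunk = s.recv(4096)
                if not chunk:
                    break
                reply += chunk
            s.close()
            return 'No match for' not in reply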

    Read the article

  • Why is it still so hard to write software?

    - by nornagon
    Writing software, I find, is composed of two parts: the Idea, and the Implementation. The Idea is about thinking: "I have this problem; how do I solve it?" and further, "how do I solve it elegantly?" The answers to these questions are obtainable by thinking about algorithms and architecture. The ideas come partially through analysis and partially through insight and intuition. The Idea is usually the easy part. You talk to your friends and co-workers and you nut it out in a meeting or over coffee. It takes an hour or two, plus revisions as you implement and find new problems.

    The Implementation phase of software development is so difficult that we joke about it. "Oh," we say, "the rest is a Simple Matter of Code." Because it should be simple, but it never is. We used to write our code on punch cards, and that was hard: mistakes were very difficult to spot, so we had to spend extra effort making sure every line was perfect. Then we had serial terminals: we could see all our code at once, search through it, organise it hierarchically and create things abstracted from raw machine code. First we had assemblers, one level up from machine code. Mnemonics freed us from remembering the machine code. Then we had compilers, which freed us from remembering the instructions. We had virtual machines, which let us step away from machine-specific details. And now we have advanced tools like Eclipse and Xcode that perform analysis on our code to help us write code faster and avoid common pitfalls.

    But writing code is still hard. Writing code is about understanding large, complex systems, and the tools we have today simply don't go very far to help us with that. When I click "find all references" in Eclipse, I get a list of them at the bottom of the window. I click on one, and I'm torn away from what I was looking at, forced to context switch. Java architecture is usually several levels deep, so I have to switch and switch and switch until I find what I'm really looking for -- by which time I've forgotten where I came from. And I do that all day until I've understood a system. It's taxing mentally, and Eclipse doesn't do much that couldn't be done in 1985 with grep, except eat hundreds of megs of RAM.

    Writing code has barely changed since we were staring at amber on black. We have the theoretical groundwork for much more advanced tools, tools that actually work to help us comprehend and extend the complex systems we work with every day. So why is writing code still so hard?

    Read the article

  • Programmatically printing git revision and checking for uncommitted changes

    - by Andrew Grimm
    To ensure that my scientific analysis is reproducible, I'd like to programmatically check if there are any modifications to the code base that aren't checked in, and if not, print out what commit is being used. For example, if there are uncommitted changes, it should output:

        Warning: uncommitted changes made. This output may not be reproducible.

    Else, produce:

        Current commit: d27ec73cf2f1df89cbccd41494f579e066bad6fe

    Ideally, it should use "plumbing", not "porcelain".
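
    A minimal sketch of that check using git plumbing commands from Python. git diff-index reports changes to tracked files (staged or not) relative to HEAD but ignores untracked files, which is an assumption about what "modifications" means here; git rev-parse prints the current commit:

        import subprocess

        def report_revision():
            # exit status 0 means no differences between HEAD and the work tree/index
            dirty = subprocess.call(['git', 'diff-index', '--quiet', 'HEAD', '--'])
            if dirty:
                print 'Warning: uncommitted changes made. This output may not be reproducible.'
            else:
                commit = subprocess.Popen(['git', 'rev-parse', 'HEAD'],
                                          stdout=subprocess.PIPE).communicate()[0].strip()
                print 'Current commit: ' + commit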

    Read the article

  • Enthought Python, Sage, or others (in Unix clusters)

    - by vailen
    I currently have access to a cluster of Unix machines, but they don't have the software I need (numpy, scipy, matplotlib, etc.), and I have to install it myself. I don't have root permission either, so commands like apt-get or yast don't work. In the worst case, I have to compile everything from source. Is there a better way? I've heard of Enthought Python and Sage, but I'm not sure which is the best route. Any suggestions?

    Read the article

  • Running out of memory while analyzing a Java Heap Dump

    - by Abel Morelos
    Hi, I have a curious problem. I need to analyze a Java heap dump (from an IBM JRE) that is 1.5 GB in size. The problem is that while analyzing the dump (I've tried HeapAnalyzer and the IBM Memory Analyzer 0.5), the tools run out of memory and I can't really analyze the dump. I have 3 GB of RAM in my machine, but it seems that's not enough for the 1.5 GB dump. My question is: do you know of a heap dump analysis tool (supporting IBM JRE dumps) that I could run with the amount of memory I have? Thanks.

    Read the article

  • How do I print out objects in an array in python?

    - by Jonathan
    I'm writing code which performs k-means clustering on a set of data. I'm actually using the code from the O'Reilly book Programming Collective Intelligence. Everything works, but in his code he uses the command line, and I want to write everything in Notepad++. For reference, his lines are:

        >>> kclust=clusters.kcluster(data,k=10)
        >>> [rownames[r] for r in kclust[0]]

    Here is my code:

        from PIL import Image,ImageDraw

        def readfile(filename):
            lines=[line for line in file(filename)]
            # First line is the column titles
            colnames=lines[0].strip( ).split('\t')[1:]
            rownames=[]
            data=[]
            for line in lines[1:]:
                p=line.strip( ).split('\t')
                # First column in each row is the rowname
                rownames.append(p[0])
                # The data for this row is the remainder of the row
                data.append([float(x) for x in p[1:]])
            return rownames,colnames,data

        from math import sqrt

        def pearson(v1,v2):
            # Simple sums
            sum1=sum(v1)
            sum2=sum(v2)
            # Sums of the squares
            sum1Sq=sum([pow(v,2) for v in v1])
            sum2Sq=sum([pow(v,2) for v in v2])
            # Sum of the products
            pSum=sum([v1[i]*v2[i] for i in range(len(v1))])
            # Calculate r (Pearson score)
            num=pSum-(sum1*sum2/len(v1))
            den=sqrt((sum1Sq-pow(sum1,2)/len(v1))*(sum2Sq-pow(sum2,2)/len(v1)))
            if den==0: return 0
            return 1.0-num/den

        class bicluster:
            def __init__(self,vec,left=None,right=None,distance=0.0,id=None):
                self.left=left
                self.right=right
                self.vec=vec
                self.id=id
                self.distance=distance

        def hcluster(rows,distance=pearson):
            distances={}
            currentclustid=-1
            # Clusters are initially just the rows
            clust=[bicluster(rows[i],id=i) for i in range(len(rows))]
            while len(clust)>1:
                lowestpair=(0,1)
                closest=distance(clust[0].vec,clust[1].vec)
                # loop through every pair looking for the smallest distance
                for i in range(len(clust)):
                    for j in range(i+1,len(clust)):
                        # distances is the cache of distance calculations
                        if (clust[i].id,clust[j].id) not in distances:
                            distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec)
                        #print 'i'
                        #print i
                        #print
                        #print 'j'
                        #print j
                        #print
                        d=distances[(clust[i].id,clust[j].id)]
                        if d<closest:
                            closest=d
                            lowestpair=(i,j)
                # calculate the average of the two clusters
                mergevec=[
                    (clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0
                    for i in range(len(clust[0].vec))]
                # create the new cluster
                newcluster=bicluster(mergevec,left=clust[lowestpair[0]],
                                     right=clust[lowestpair[1]],
                                     distance=closest,id=currentclustid)
                # cluster ids that weren't in the original set are negative
                currentclustid-=1
                del clust[lowestpair[1]]
                del clust[lowestpair[0]]
                clust.append(newcluster)
            return clust[0]

        import random   # needed for the random initial centroids in kcluster()

        def kcluster(rows,distance=pearson,k=4):
            # Determine the minimum and maximum values for each point
            ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows]))
                    for i in range(len(rows[0]))]
            # Create k randomly placed centroids
            clusters=[[random.random( )*(ranges[i][1]-ranges[i][0])+ranges[i][0]
                       for i in range(len(rows[0]))] for j in range(k)]
            lastmatches=None
            for t in range(100):
                print 'Iteration %d' % t
                bestmatches=[[] for i in range(k)]
                # Find which centroid is the closest for each row
                for j in range(len(rows)):
                    row=rows[j]
                    bestmatch=0
                    for i in range(k):
                        d=distance(clusters[i],row)
                        if d<distance(clusters[bestmatch],row): bestmatch=i
                    bestmatches[bestmatch].append(j)
                # If the results are the same as last time, this is complete
                if bestmatches==lastmatches: break
                lastmatches=bestmatches
                # Move the centroids to the average of their members
                for i in range(k):
                    avgs=[0.0]*len(rows[0])
                    if len(bestmatches[i])>0:
                        for rowid in bestmatches[i]:
                            for m in range(len(rows[rowid])):
                                avgs[m]+=rows[rowid][m]
                        for j in range(len(avgs)):
                            avgs[j]/=len(bestmatches[i])
                        clusters[i]=avgs
            return bestmatches
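
    A minimal usage sketch of the lines the book runs at the interactive prompt, adapted so they can sit at the bottom of the same script and be run as a file ('mydata.txt' is a placeholder for the tab-separated data file):

        # kcluster returns a list of k lists of row indices; map them back to
        # the row names read from the file and print each cluster.
        rownames, colnames, data = readfile('mydata.txt')
        kclust = kcluster(data, k=10)
        for i in range(len(kclust)):
            print 'Cluster %d:' % i
            print [rownames[r] for r in kclust[i]]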

    Read the article

  • Explicit disable MySQL query cache in some parts of program

    - by jack
    In a Django project, some cronjob programs are used mainly for administrative or analysis purposes, e.g. generating site usage stats, rotating the user activity log, etc. We would rather MySQL did not cache the queries those programs run, to save memory and improve query cache efficiency. Is it possible to turn off the MySQL query cache explicitly in those programs while keeping it enabled for the other parts, including all of views.py?
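
    A minimal sketch of the per-connection approach, assuming a MySQL version where the session query_cache_type variable can be changed (the global setting is untouched); the table name in the comment is a made-up example:

        # Inside the cronjob / management command: disable the query cache for
        # this database connection only.
        from django.db import connection

        cursor = connection.cursor()
        cursor.execute("SET SESSION query_cache_type = OFF")
        # ... the cronjob's analysis queries run here, uncached ...
        # Individual statements can also opt out with the SQL_NO_CACHE hint, e.g.:
        # cursor.execute("SELECT SQL_NO_CACHE COUNT(*) FROM app_activitylog")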

    Read the article

  • How can I debug a Windows service that crashes?

    - by Christopher
    I have a .NET Windows service that appears to be crashing due to C0000005 (access violation--according to Dr Watson). When I attach the VS debugger to it--whether I build it with or without symbols--the VS debugger simply exits when the service crashes, instead of breaking in to give me a chance to do any investigation. Is that to be expected, or am I doing something wrong? Will using WinDbg let me do something more in real time (obviously, WinDbg lets me do crash dump analysis)? Thanks!

    Read the article

  • Choosing a distributed shared memory solution

    - by mindas
    I have a task to build a prototype for a massively scalable distributed shared memory (DSM) app. The prototype would only serve as a proof of concept, but I want to spend my time most effectively by picking the components that would be used in the real solution later on. The aim of this solution is to take data input from an external source, churn it and make the result available to a number of frontends. Those "frontends" would just take the data from the cache and serve it without extra processing. The number of frontend hits on this data can literally be millions per second. The data itself is very volatile; it can (and does) change quite rapidly. However the frontends should see "old" data until the newest has been processed and cached. The processing and writing is done by a single (redundant) node while other nodes only read the data. In other words: no read-through behaviour. I was looking into solutions like memcached, but this particular one doesn't fulfil all our requirements, which are listed below:

    - The solution must at least have a Java client API which is reasonably well maintained, as the rest of the app is written in Java and we are seasoned Java developers;
    - The solution must be totally elastic: it should be possible to add new nodes without restarting other nodes in the cluster;
    - The solution must be able to handle failover. Yes, I realize this means some overhead, but the overall served data size isn't big (1G max) so this shouldn't be a problem. By "failover" I mean seamless execution without hardcoding/changing server IP address(es) like in memcached clients when a node goes down;
    - Ideally it should be possible to specify the degree of data overlapping (e.g. how many copies of the same data should be stored in the DSM cluster);
    - There is no need to permanently store all the data, but there might be a need for post-processing of some of the data (e.g. serialization to the DB);
    - Price. Obviously we prefer free/open source, but we're happy to pay a reasonable amount if a solution is worth it. In any case, a paid 24hr/day support contract is a must;
    - The whole thing has to be hosted in our data centers, so SaaS offerings like Amazon SimpleDB are out of scope. We would only consider this if no other options were available;
    - Ideally the solution would be strictly consistent (as in CAP); however, eventual consistency can be considered as an option.

    Thanks in advance for any ideas.

    Read the article

  • Learning the Introspection API (used by FxCop)

    - by Anand Patel
    Microsoft's FxCop tool uses the Introspection API. This API could be used to develop new code analysis tools, but it is not well documented, and I was not able to find any blogs that explain it in breadth and depth. The knowledge gained by understanding the API can also be used for writing custom FxCop rules. Does anybody know of any blogs or resources that explain it?

    Read the article

  • Very basic question about Hadoop and compressed input files

    - by Luis Sisamon
    I have started to look into Hadoop. If my understanding is right, I could process a very big file and it would get split over different nodes; however, if the file is compressed then it cannot be split and would need to be processed by a single node (effectively destroying the advantage of running MapReduce over a cluster of parallel machines). My question is: assuming the above is correct, is it possible to split a large file manually into fixed-size chunks, or daily chunks, compress them, and then pass a list of compressed input files to a MapReduce job?
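
    A rough sketch of the manual pre-splitting idea (plain Python, not Hadoop-specific): cut the big file into fixed-size chunks of whole lines and gzip each chunk, so every compressed part can be handed to the job as a separate input file. The chunk size and file name are placeholders:

        import gzip

        CHUNK_BYTES = 64 * 1024 * 1024   # illustrative target chunk size

        def split_and_compress(path):
            # Write whole lines into numbered .gz parts, rolling over to a new
            # part once the current one exceeds CHUNK_BYTES of raw text.
            part, written, out = 0, 0, None
            for line in open(path):
                if out is None or written >= CHUNK_BYTES:
                    if out is not None:
                        out.close()
                    out = gzip.open('%s.part%04d.gz' % (path, part), 'wb')
                    part, written = part + 1, 0
                out.write(line)
                written += len(line)
            if out is not None:
                out.close()

        split_and_compress('biglog.txt')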

    Read the article

  • Scalability comparison between different DBMSs

    - by Björn Lindfors
    By what factor does the performance (read queries/sec) increase when a machine is added to a cluster of machines running either a Bigtable-like database or MySQL? Google's research paper on Bigtable suggests that "near-linear" scaling can be achieved with Bigtable. This page here, featuring MySQL's marketing jargon, suggests that MySQL is capable of scaling linearly. Where is the truth?

    Read the article

  • Exception when indexing text documents with Lucene, using SnowballAnalyzer for cleaning up

    - by Julia
    Hello! I am indexing documents with Lucene and am trying to apply the SnowballAnalyzer for punctuation and stopword removal from text. I keep getting the following error:

        IllegalAccessError: tried to access method
        org.apache.lucene.analysis.Tokenizer.<init>(Ljava/io/Reader;)V
        from class org.apache.lucene.analysis.snowball.SnowballAnalyzer

    Here is the code; I would very much appreciate help! I am new to this.

        public class Indexer {

            private Indexer(){};

            private String[] stopWords = {....};
            private String indexName;
            private IndexWriter iWriter;
            private static String FILES_TO_INDEX = "/Users/ssi/forindexing";

            public static void main(String[] args) throws Exception {
                Indexer m = new Indexer();
                m.index("./newindex");
            }

            public void index(String indexName) throws Exception {
                this.indexName = indexName;
                final File docDir = new File(FILES_TO_INDEX);
                if(!docDir.exists() || !docDir.canRead()){
                    System.err.println("Something wrong... " + docDir.getPath());
                    System.exit(1);
                }
                Date start = new Date();
                PerFieldAnalyzerWrapper analyzers = new PerFieldAnalyzerWrapper(new SimpleAnalyzer());
                analyzers.addAnalyzer("text", new SnowballAnalyzer("English", stopWords));
                Directory directory = FSDirectory.open(new File(this.indexName));
                IndexWriter.MaxFieldLength maxLength = IndexWriter.MaxFieldLength.UNLIMITED;
                iWriter = new IndexWriter(directory, analyzers, true, maxLength);
                System.out.println("Indexing to dir..........." + indexName);
                if(docDir.isDirectory()){
                    File[] files = docDir.listFiles();
                    if(files != null){
                        for (int i = 0; i < files.length; i++) {
                            try {
                                indexDocument(files[i]);
                            }catch (FileNotFoundException fnfe){
                                fnfe.printStackTrace();
                            }
                        }
                    }
                }
                System.out.println("Optimizing...... ");
                iWriter.optimize();
                iWriter.close();
                Date end = new Date();
                System.out.println("Time to index was " + (end.getTime()-start.getTime()) + " milliseconds");
            }

            private void indexDocument(File someDoc) throws IOException {
                Document doc = new Document();
                Field name = new Field("name", someDoc.getName(), Field.Store.YES, Field.Index.ANALYZED);
                Field text = new Field("text", new FileReader(someDoc), Field.TermVector.WITH_POSITIONS_OFFSETS);
                doc.add(name);
                doc.add(text);
                iWriter.addDocument(doc);
            }
        }

    Read the article

  • can i do multiple things in one command on linux?

    - by Jason94
    I'm testing something where I'm compiling some code and analysing the output with a Perl script. So first I run make, manually copy and paste the output into errors.txt, and then run my Perl script (running: perl analysis.pl) in the terminal. Is there a way I can do this with just one line in bash?

    Read the article

  • A strategy to troubleshoot/ fix application crashes in Windows?

    - by Manav Sharma
    All, over a period of time I have observed that fixing issues related to application crashes is a discipline in itself. Some people have a nice way of attacking such problems: ranging from viewing the Event Viewer to running static/dynamic memory analysis tools to some of their personal favorites, these people have developed this into an art. Can we share articles, links, or personal approaches that we use to understand, troubleshoot, and fix such issues? Thanks

    Read the article

  • Meaning of parameters in a Google query?

    - by blinry
    Are there any resources on what the parameters in a Google query mean? Any analysis of how the Google search pages work internally? Examples would be http://www.google.com/#hl=en&source=hp&q=lol&aq=f&aqi=&aql=&oq=&fp=45675624562456 or http://www.google.com/url?sa=t&source=web&ct=res&cd=11&ved=KJSGHFKSDJF&url=sfdgagasdgasdgasgasg&rct=j&q=fghthwrteghedgf&ei=asdfasdfsa&usg=asdfasdfasf

    Read the article

  • Using Hadoop, are my reducers guaranteed to get all the records with the same key?

    - by samg
    I'm running a Hadoop job (using Hive, actually) which is supposed to de-duplicate lines across a lot of text files. More specifically, it chooses the most recently timestamped record for each key in the reduce step. Does Hadoop guarantee that every record with the same key, output by the map step, will go to a single reducer, even if there are many reducers running across a cluster? I'm worried that the mapper output might be split after the shuffle happens, in the middle of a set of records with the same key.
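
    For what it's worth, a reducer written against that guarantee looks like the sketch below (a Hadoop Streaming style Python reducer, not the Hive job itself): it assumes every record for a given key reaches the same reducer, sorted and grouped by key, so keeping the latest timestamp per key needs no cross-reducer coordination. The "key<TAB>timestamp<TAB>value" layout is an assumed input format for illustration:

        import sys

        # Input lines arrive already partitioned and sorted by key, so all
        # records for one key are consecutive. Timestamp comparison is done as
        # strings, which works for zero-padded / ISO-style timestamps.
        current_key, best = None, None
        for line in sys.stdin:
            key, ts, value = line.rstrip('\n').split('\t', 2)
            if key != current_key:
                if best is not None:
                    print '%s\t%s\t%s' % best
                current_key, best = key, None
            if best is None or ts > best[1]:
                best = (key, ts, value)
        if best is not None:
            print '%s\t%s\t%s' % best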

    Read the article
