Search Results

Search found 92 results on 4 pages for 'readlines'.

Page 4/4 | < Previous Page | 1 2 3 4 

  • Coding the Python way

    - by Aaron Moodie
    I've just spent the last half semester at Uni learning Python. I've really enjoyed it, and was hoping for a few tips on how to write more 'pythonic' code. This is the __init__ method from a recent assignment I did. At the time I wrote it, I was trying to work out how I could rewrite this using lambdas, or in a neater, more efficient way, but ran out of time.

        def __init__(self, dir):

            def _read_files(_, dir, files):

                for file in files:

                    if file == "classes.txt":
                        class_list = readtable(dir+"/"+file)
                        for item in class_list:
                            Enrol.class_info_dict[item[0]] = item[1:]
                            if item[1] in Enrol.classes_dict:
                                Enrol.classes_dict[item[1]].append(item[0])
                            else:
                                Enrol.classes_dict[item[1]] = [item[0]]

                    elif file == "subjects.txt":
                        subject_list = readtable(dir+"/"+file)
                        for item in subject_list:
                            Enrol.subjects_dict[item[0]] = item[1]

                    elif file == "venues.txt":
                        venue_list = readtable(dir+"/"+file)
                        for item in venue_list:
                            Enrol.venues_dict[item[0]] = item[1:]

                    elif file.endswith('.roll'):
                        roll_list = readlines(dir+"/"+file)
                        file = os.path.splitext(file)[0]
                        Enrol.class_roll_dict[file] = roll_list
                        for item in roll_list:
                            if item in Enrol.enrolled_dict:
                                Enrol.enrolled_dict[item].append(file)
                            else:
                                Enrol.enrolled_dict[item] = [file]

            try:
                os.path.walk(dir, _read_files, None)
            except:
                print "There was a problem reading the directory"

    As you can see, it's a little bulky. If anyone has the time or inclination, I'd really appreciate a few tips on some Python best practices. Thanks.
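One possible tidy-up for the "append or create the list" pattern in the classes.txt branch is a `defaultdict`. This is only a sketch: `group_classes` and the sample rows are made up for illustration, and it assumes `readtable()` returns a list of rows shaped like `[class_id, subject, ...]`.

```python
from collections import defaultdict


def group_classes(class_rows):
    """Build the two lookups that the 'classes.txt' branch fills in.

    class_rows is assumed to look like readtable() output:
    a list of rows, each [class_id, subject, ...other fields].
    """
    class_info = {}
    classes_by_subject = defaultdict(list)  # missing keys start as []
    for row in class_rows:
        class_id, subject = row[0], row[1]
        class_info[class_id] = row[1:]
        classes_by_subject[subject].append(class_id)
    return class_info, dict(classes_by_subject)


# Hypothetical sample rows, not taken from the assignment
rows = [["c1", "math", "mon"], ["c2", "math", "tue"], ["c3", "art", "wed"]]
info, by_subject = group_classes(rows)
print(by_subject)
```

The same idea would apply to the `.roll` branch, where each student maps to a list of classes.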

  • Ruby : UTF-8 IO

    - by subtenante
    I use Ruby 1.8.7. I am trying to parse some text files containing Greek sentences, encoded in UTF-8. (I can't really paste sample files here, because they are subject to copyright. It is really just some Greek text encoded in UTF-8.) For each file, I want to parse the file, extract all the words, and make a list of each new word found in this file, with all of that saved to one big index file. Here is my code:

        #!/usr/bin/ruby -KU

        def prepare_line(l)
            l.gsub(/^\s*[ST]\d+\s*:\s*|\s+$|\(\d+\)\s*/u, "")
        end

        def tokenize(l)
            l.split /['·.;!:\s]+/u
        end

        $dict = {}
        $cpt = 0
        $out = File.new 'out.txt', 'w'

        def lesson(file)
            $cpt = $cpt + 1
            file.readlines.each do |l|
                $out.puts l
                l = prepare_line l
                tokenize(l).each do |t|
                    unless $dict[t]
                        $dict[t] = $cpt
                        $out.puts " #{t}\n"
                    end
                end
            end
        end

        Dir.new('etc/').each do |filename|
            f = File.new("etc/#{filename}")
            unless File.directory? f
                lesson f
            end
        end

    Here is part of my output:

        ?@???†?†?????????? ?...[snip very long hangul/hanzi mishmash]... ????????†? ???N2 : ?e?te?? (2) µ???µa

    (Note that the puts l part seems to work fine, at the end of the given output line.) Any idea what is wrong with my code? (General comments about Ruby idioms I could use are very welcome; I'm really a beginner.)

  • How to change a Linux user password from python

    - by Vaulor
    I'm having problems with changing a Linux user's password from python. I've tried so many things, but I couldn't manage to solve the issue, here is the sample of things I've already tried: sudo_password is the password for sudo, sudo_command is the command I want the system to run, user is get from a List and is the user who I want to change the password for, and newpass is the pass I want to assign to 'user' user = list.get(ANCHOR) sudo_command = 'passwd' f = open("passwordusu.tmp", "w") f.write("%s\n%s" % (newpass, newpass)) f.close() A=os.system('echo -e %s|sudo -S %s < %s %s' % (sudo_password, sudo_command,'passwordusu.tmp', user)) print A windowpass.destroy() 'A' is the return value for the execution of os.system, in this case 256. I tried also A=os.system('echo %s|sudo -S %s < %s %s' % (sudo_password, sudo_command,'passwordusu.tmp', user)) but it returns the same error code. I tried several other ways with 'passwd' command, but whithout succes. With 'chpasswd' command I 've tried this: user = list.get(ANCHOR) sudo_command = 'chpasswd' f = open("passwordusu.tmp", "w") f.write("%s:%s" % (user, newpass)) f.close() A=os.system('echo %s|sudo -S %s < %s %s' % (sudo_password, sudo_command,'passwordusu.tmp', user)) print A windowpass.destroy() also with: A=os.system('echo %s|sudo -S %s:%s|%s' % (sudo_password, user, newpass, sudo_command)) @;which returns 32512 A=os.system("echo %s | sudo -S %s < \"%s\"" % (sudo_password, sudo_command, "passwordusu.tmp")) @;which returns 256 I tried 'mkpasswd' and 'usermod' too like this: user = list.get(ANCHOR) sudo_command = 'mkpasswd -m sha-512' os.system("echo %s | sudo -S %s %s > passwd.tmp" % (sudo_password,sudo_command, newpass)) sudo_command="usermod -p" f = open('passwd.tmp', 'r') for line in f.readlines(): newpassencryp=line f.close() A=os.system("echo %s | sudo -S %s %s %s" % (sudo_password, sudo_command, newpassencryp, user)) @;which returns 32512 but, if you go to https://www.mkpasswd.net , hash the 'newpass' and 
substitute it for 'newpassencryp', it returns 0, which theoretically means it has gone right, but so far it doesn't change the password. I've searched the internet and Stack Overflow for this issue or similar ones and tried the solutions offered, but again, without success. I would really appreciate any help, and of course, if you need more info I'll be glad to supply it! Thanks in advance.

  • Sending Email over VPN SmtpException net_io_connectionclosed

    - by Holy Christ
    I am sending an email from a WPF application. When sending as a domain user on the network, the emails sends as expected. However, when I attempt to send email over a VPN connection, I get the following exception: Exception: System.Net.Mail.SmtpException: Failure sending mail. --- System.IO.IOException: Unable to read data from the transport connection: net_io_connectionclosed. at System.Net.Mail.SmtpReplyReaderFactory.ProcessRead(Byte[] buffer, Int32 offset, Int32 read, Boolean readLine) at System.Net.Mail.SmtpReplyReaderFactory.ReadLines(SmtpReplyReader caller, Boolean oneLine) at System.Net.Mail.SmtpReplyReaderFactory.ReadLine(SmtpReplyReader caller) at System.Net.Mail.SmtpConnection.GetConnection(String host, Int32 port) at System.Net.Mail.SmtpTransport.GetConnection(String host, Int32 port) at System.Net.Mail.SmtpClient.GetConnection() at System.Net.Mail.SmtpClient.Send(MailMessage message) I have tried using impersonation as well as setting the Credentials on the SmtpClient. Neither seem to work: using (new ImpersonateUser("myUser", "MYDOMAIN", "myPass")) { var client = new SmtpClient("myhost.com"); client.UseDefaultCredentials = true; client.Credentials = new NetworkCredential("myUser", "myPass", "MYDOMAIN"); client.Send(mailMessage); } I've also tried using Wireshark to view the message over the wire, but I don't know enough about SMTP to know what I'm looking for. One other variable is that the machine I'm using on the VPN is Vista Business and the machine on the network is Win7. I don't think it's related, but then I wouldn't be asking if I knew the issue! :) Any ideas?

  • accessing list sent from server as JSON object

    - by tazim
    How do I access a list that Django sends as a JSON object, once it is received in the Ajax callback function in the template? The code is as follows:

    views.py:

        def showfiledata(request):
            with open("/home/tazim/webexample/test.txt") as f:
                list = f.readlines()
            f.closed
            return_dict = {'filedata':list}
            json = simplejson.dumps(return_dict)
            HttpResponse(json,mimetype="application/json")

    In the template showfile.html:

        <html>
        <head>
        <script type="text/javascript" src="/jquerycall/"></script>
        <script type="text/javascript">
        $(document).ready(function()
        {
            $("button").click(function()
            {
                $.ajax({
                    type:"POST",
                    url:"/showfiledata/",
                    datatype:"json",
                    success:function(data)
                    {
                        var s = data.filedata;
                        $("#someid").html(s);
                    }
                });
            });
        });
        </script>
        </head>
        <body>
        <form method="post">
        <button type="button">Click Me</button>
        <div id="someid"></div>
        </form>
        </body>
        </html>

  • Python opening a file and putting list of names on separate lines

    - by Jeremy Borton
    I am trying to write a Python program using Python 3. I have to open a text file and read a list of names, print the list, sort the list in alphabetical order and then finally re-print the list. There's a little more to it than that, BUT the problem I am having is that I'm supposed to print the list of names with each name on a separate line. Instead of printing each name on a separate line, it prints the list all on one line. How can I fix this?

        def main():

            #create control loop
            keep_going = 'y'

            #Open name file
            name_file = open('names.txt', 'r')

            names = name_file.readlines()

            name_file.close()

            #Open outfile
            outfile = open('sorted_names.txt', 'w')

            index = 0
            while index < len(names):
                names[index] = names[index].rstrip('\n')
                index += 1

            #sort names
            print('original order:', names)
            names.sort()
            print('sorted order:', names)

            #write names to outfile
            for item in names:
                outfile.write(item + '\n')
            #close outfile
            outfile.close()

            #search names
            while keep_going == 'y' or keep_going == 'Y':

                search = input('Enter a name to search: ')

                if search in names:
                    print(search, 'was found in the list.')
                    keep_going = input('Would you like to do another search Y for yes: ')
                else:
                    print(search, 'was not found.')
                    keep_going = input('Would you like to do another search Y for yes: ')

        main()
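Passing the whole list to `print()` shows its one-line list representation, which is probably why everything lands on one line. A minimal sketch of printing one name per line follows; the sample names and the `format_names` helper are made up for illustration.

```python
# Stand-in for name_file.readlines(); these names are invented sample data
names = ["Carol\n", "alice\n", "Bob\n"]
names = [n.rstrip("\n") for n in names]


def format_names(title, items):
    """Return the title followed by one item per line."""
    return "\n".join([title] + list(items))


print(format_names("original order:", names))
names.sort()  # default sort is case-sensitive: uppercase sorts before lowercase
print(format_names("sorted order:", names))
```

In Python 3, `print(*names, sep="\n")` or a plain `for name in names: print(name)` loop would do the same job.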

  • Python: Serial Transmission

    - by Silent Elektron
    I have an image stack of 500 images (jpeg) of 640x480. I intend to make 500 pixels (1st pixels of all images) as a list and then send that via COM1 to FPGA where I do my further processing. I have a couple of questions here: How do I import all the 500 images at a time into python and how do i store it? How do I send the 500 pixel list via COM1 to FPGA? I tried the following: Converted the jpeg image to intensity values (each pixel is denoted by a number between 0 and 255) in MATLAB, saved the intensity values in a text file, read that file using readlines(). But it became too cumbersome to make the intensity value files for all the 500 images! Used NumPy to put the read files in a matrix and then pick the first pixel of all images. But when I send it, its coming like: [56, 61, 78, ... ,71, 91]. Is there a way to eliminate the [ ] and , while sending the data serially? Thanks in Advance! :)
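The brackets and commas appear when the list's text representation is sent instead of the values themselves. A small sketch, assuming Python 3 and intensity values in the range 0 to 255; the pySerial call is shown only as a comment, and the port settings there are guesses.

```python
# First pixel of each image; sample values only
pixels = [56, 61, 78, 71, 91]

# One raw byte per value, with no "[", "]" or "," in the stream
payload = bytes(pixels)

# Alternative if the receiver expects text: comma-separated decimal values
text_line = ",".join(str(p) for p in pixels)

print(len(payload), text_line)

# With pySerial (assumed to be installed; baud rate is a guess) sending would be:
#   import serial
#   with serial.Serial("COM1", 9600) as port:
#       port.write(payload)
```

Raw bytes are usually the easier format to consume on the FPGA side, since each received byte is already one pixel value.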

  • Could I do this blind relative to absolute path conversion (for perforce depot paths) better?

    - by wonderfulthunk
    I need to "blindly" (i.e. without access to the filesystem, in this case the source control server) convert some relative paths to absolute paths. So I'm playing with dotdots and indices. For those that are curious, I have a log file produced by someone else's tool that sometimes outputs relative paths, and for performance reasons I don't want to access the source control server where the paths are located to check if they're valid and more easily convert them to their absolute path equivalents.

    I've gone through a number of (probably foolish) iterations trying to get it to work, mostly a few variations of iterating over the array of folders and trying delete_at(index) and delete_at(index-1), but my index kept incrementing while I was deleting elements of the array out from under myself, which didn't work for cases with multiple dotdots. Any tips on improving it in general, or specifically on the lack of non-consecutive dotdot support, would be welcome.

    Currently this is working with my limited examples, but I think it could be improved. It can't handle non-consecutive '..' directories, and I am probably doing a lot of wasteful (and error-prone) things that I don't need to do because I'm a bit of a hack.

    I've found a lot of examples of converting other types of relative paths using other languages, but none of them seemed to fit my situation.

    These are my example paths that I need to convert, from:

        //depot/foo/../bar/single.c
        //depot/foo/docs/../../other/double.c
        //depot/foo/usr/bin/../../../else/more/triple.c

    to:

        //depot/bar/single.c
        //depot/other/double.c
        //depot/else/more/triple.c

    And my script:

        begin
            paths = File.open(ARGV[0]).readlines

            puts(paths)

            new_paths = Array.new

            paths.each { |path|
                folders = path.split('/')
                if ( folders.include?('..') )
                    num_dotdots = 0
                    first_dotdot = folders.index('..')
                    last_dotdot = folders.rindex('..')
                    folders.each { |item|
                        if ( item == '..' )
                            num_dotdots += 1
                        end
                    }
                    if ( first_dotdot and ( num_dotdots > 0 ) ) # this might be redundant?
                        folders.slice!(first_dotdot - num_dotdots..last_dotdot) # dependent on consecutive dotdots only
                    end
                end

                folders.map! { |elem|
                    if ( elem !~ /\n/ )
                        elem = elem + '/'
                    else
                        elem = elem
                    end
                }
                new_paths << folders.to_s
            }

            puts(new_paths)
        end
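One way to handle non-consecutive dotdots is to build the result as a stack: push each folder, and pop the last one whenever a '..' comes up. The sketch below is in Python rather than Ruby, and `resolve_dotdots` is a hypothetical name; it assumes depot paths start with `//` and never touches the filesystem.

```python
def resolve_dotdots(path):
    """Collapse 'name/..' pairs without touching the filesystem."""
    prefix = "//" if path.startswith("//") else ""
    resolved = []
    for part in path[len(prefix):].split("/"):
        if part == "..":
            if resolved:
                resolved.pop()  # step back over the previous folder
        elif part and part != ".":
            resolved.append(part)  # ordinary folder or file name
    return prefix + "/".join(resolved)


examples = [
    "//depot/foo/../bar/single.c",
    "//depot/foo/docs/../../other/double.c",
    "//depot/foo/usr/bin/../../../else/more/triple.c",
    "//depot/a/../b/c/../d.c",  # non-consecutive dotdots
]
for p in examples:
    print(resolve_dotdots(p))
```

Note that this sketch will happily pop the depot name itself if a path has too many dotdots; a guard for that case may be worth adding. The same push/pop loop should translate directly to a Ruby array.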

  • re.sub emptying list

    - by jmau5
        def process_dialect_translation_rules():

            # Read in lines from the text file specified in sys.argv[1], stripping away
            # excess whitespace and discarding comments (lines that start with '##').
            f_lines = [line.strip() for line in open(sys.argv[1], 'r').readlines()]
            f_lines = filter(lambda line: not re.match(r'##', line), f_lines)

            # Remove any occurrences of the pattern '\s*<=>\s*'. This leaves us with a
            # list of lists. Each 2nd level list has two elements: the value to be
            # translated from and the value to be translated to. Use the sub function
            # from the re module to get rid of those pesky asterisks.
            f_lines = [re.split(r'\s*<=>\s*', line) for line in f_lines]
            f_lines = [re.sub(r'"', '', elem) for elem in line for line in f_lines]

    This function should take the lines from a file and perform some operations on them, such as removing any lines that begin with ##. Another operation that I wish to perform is to remove the quotation marks around the words in the line. However, when the final line of this script runs, f_lines becomes an empty list. What happened?

    Requested lines of the original file:

        ## English-Geek Reversible Translation File #1
        ## (Moderate Geek)
        ## Created by Todd WAreham, October 2009
        "TV show" <=> "STAR TREK"
        "food" <=> "pizza"
        "drink" <=> "Red Bull"
        "computer" <=> "TRS 80"
        "girlfriend" <=> "significant other"
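In a flat comprehension the `for` clauses run left to right like nested loops, so `for elem in line for line in f_lines` reads `line` before the second clause binds it. In Python 2 it likely picks up whatever `line` leaked out of the previous comprehension. A sketch of one way to write the last step, keeping each from/to pair together (the sample lines come from the file excerpt; `pairs` and `cleaned` are illustrative names):

```python
import re

lines = ['"TV show" <=> "STAR TREK"', '"food" <=> "pizza"']

# Split each rule into its [from, to] pair
pairs = [re.split(r'\s*<=>\s*', line) for line in lines]

# Outer loop over pairs, inner loop over the elements of each pair;
# the nested list keeps the two halves of every rule together.
cleaned = [[elem.replace('"', '') for elem in pair] for pair in pairs]

print(cleaned)
```

If a flat list is wanted instead, the clauses would go in loop order: `[... for pair in pairs for elem in pair]`.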

  • How to wrap Ruby strings in HTML tags

    - by Jason H.
    Hi all: I'm looking for help on two things. 1) I'm looking for a way for Ruby to wrap strings in HTML. I have a program I'm writing that generates a Hash of word frequencies for a text file, and I want to take the results and place them into an HTML file rather than print to STDOUT. I'm thinking each string needs to be wrapped in an HTML paragraph tag using readlines() or something, but I can't quite figure it out. Then, once I've wrapped the strings in HTML, 2) I want to write to an empty HTML file.

    Right now my program looks like:

        filename = File.new(ARGV[0]).read().downcase().scan(/[\w']+/)
        frequency = Hash.new(0)
        words.each { |word| frequency[word] +=1 }
        frequency.sort_by { |x,y| y }.reverse().each{ |w,f| puts "#{f}, #{w}" }

    So if we ran a text file through this and received:

        35, the
        27, of
        20, to
        16, in
        # . . .

    I'd want to export to an HTML file that wraps the lines like:

        <p>35, the</p>
        <p>27, of</p>
        <p>20, to</p>
        <p>16, in</p>
        # . . .

    Thanks for any tips in advance!

  • Unable to write to a text file

    - by chrissygormley
    Hello, I am running some tests and need to write to a file. When I run the tests, open(file, 'r+') does not write to the file. The test script is below:

        class GetDetailsIP(TestGet):

            def runTest(self):

                self.category = ['PTZ']

                try:
                    # This runs and returns a value
                    result = self.client.service.Get(self.category)

                    mylogfile = open("test.txt", "r+")
                    print >>mylogfile, result
                    result = ("".join(mylogfile.readlines()[2]))
                    result = str(result.split(':')[1].lstrip("//").split("/")[0])
                    mylogfile.close()
                except suds.WebFault, e:
                    assert False
                except Exception, e:
                    pass
                finally:
                    if 'result' in locals():
                        self.assertEquals(result, self.camera_ip)
                    else:
                        assert False

    When this test runs, no value has been written to the text file, yet a value is returned in the variable result. I have also tried mylogfile.write(result). If the file does not exist, it claims the file does not exist and doesn't create one.

    Could this be a permission problem where Python is not allowed to create a file? I have made sure that all other reads of this file are closed, so the file should not be locked.

    Can anyone suggest why this is happening?

    Thanks
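Two things may be worth checking here. Mode 'r+' never creates a missing file, and after a write the file position sits at the end, so `readlines()` returns nothing until the file is rewound. The broad `except Exception: pass` would also hide either error. A minimal sketch of the write-then-read sequence; the temp path and the three sample lines are invented for illustration.

```python
import os
import tempfile

# Hypothetical location so the sketch does not touch a real test.txt
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# "w+" creates the file if it is missing (and truncates it); "r+" never creates one
with open(path, "w+") as log:
    log.write("line one\nline two\nrtsp://192.168.0.10/stream\n")  # made-up content
    log.flush()
    log.seek(0)  # rewind: after writing, the position is at end of file
    lines = log.readlines()

# Same parsing idea as in the test: take the host part of the third line
camera_ip = lines[2].split(":")[1].lstrip("/").split("/")[0]
print(camera_ip)
```

Narrowing the `except` clauses, or at least logging the exception, should make it clear which of the two problems is being hit.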

  • Faster or more memory-efficient solution in Python for this Codejam problem.

    - by jeroen.vangoey
    I tried my hand at this Google Codejam Africa problem (the contest is already finished, I just did it to improve my programming skills).

    The Problem:
    You are hosting a party with G guests and notice that there is an odd number of guests! When planning the party you deliberately invited only couples and gave each couple a unique number C on their invitation. You would like to single out whoever came alone by asking all of the guests for their invitation numbers.

    The Input:
    The first line of input gives the number of cases, N. N test cases follow. For each test case there will be:
    One line containing the value G, the number of guests.
    One line containing a space-separated list of G integers. Each integer C indicates the invitation code of a guest.

    Output:
    For each test case, output one line containing "Case #x: " followed by the number C of the guest who is alone.

    The Limits:
    1 ≤ N ≤ 50
    0 < C ≤ 2147483647
    Small dataset: 3 ≤ G < 100
    Large dataset: 3 ≤ G < 1000

    Sample Input:

        3
        3
        1 2147483647 2147483647
        5
        3 4 7 4 3
        5
        2 10 2 10 5

    Sample Output:

        Case #1: 1
        Case #2: 7
        Case #3: 5

    This is the solution that I came up with:

        with open('A-large-practice.in') as f:
            lines = f.readlines()

        with open('A-large-practice.out', 'w') as output:
            N = int(lines[0])
            for testcase, i in enumerate(range(1,2*N,2)):
                G = int(lines[i])
                for guest in range(G):
                    codes = map(int, lines[i+1].split(' '))
                    alone = (c for c in codes if codes.count(c)==1)
                output.write("Case #%d: %d\n" % (testcase+1, alone.next()))

    It runs in 12 seconds on my machine with the large input.

    Now, my question is, can this solution be improved in Python to run in a shorter time or use less memory? The analysis of the problem gives some pointers on how to do this in Java and C++, but I can't translate those solutions back to Python.
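One approach that should be both faster and lighter on memory: since every code except one appears exactly twice, XOR-ing all the codes cancels the pairs and leaves the lone code. A sketch that works on the sample input as a string instead of the practice files; `find_alone` and `solve` are illustrative names.

```python
from functools import reduce
from operator import xor


def find_alone(codes):
    """XOR of all codes: paired codes cancel out, the lone one remains."""
    return reduce(xor, codes)


def solve(text):
    lines = text.split("\n")
    n = int(lines[0])
    out = []
    for case in range(n):
        # Each case takes two lines: G, then the codes (G itself is not needed)
        codes = map(int, lines[2 + 2 * case].split())
        out.append("Case #%d: %d" % (case + 1, find_alone(codes)))
    return out


sample = "3\n3\n1 2147483647 2147483647\n5\n3 4 7 4 3\n5\n2 10 2 10 5\n"
print("\n".join(solve(sample)))
```

This does one pass per test case, where the `codes.count(c)` approach rescans the list for every element and the inner `for guest in range(G)` loop repeats that work G times.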

  • Using classes for the first time,help in debugging

    - by kaushik
    Here is my code. This is not the entire code, but it is enough to explain my doubt; please disregard any line you find irrelevant.

        saving_tree={}
        isLeaf=False

        class tree:
            global saving_tree
            rootNode=None
            lispTree=None
            def __init__(self,x):
                file=x
                string=file.readlines()
                #print string
                self.lispTree=S_expression(string)
                self.rootNode=BinaryDecisionNode(0,'Root',self.lispTree)

        class BinaryDecisionNode:
            global saving_tree
            def __init__(self,ind,name,lispTree,parent=None):
                self.parent=parent
                nodes=lispTree.getNodes(ind)
                print nodes
                self.isLeaf=(nodes[0]==1)
                nodes=nodes[1]#Nodes are stored
                self.name=name
                self.children=[]
                if self.isLeaf: #Leaf Node
                    print nodes
                    #Set the leaf data
                    self.attribute=nodes
                    print "LeafNode is ",nodes
                else:
                    #Set the question
                    self.attribute=lispTree.getString(nodes[0])
                    self.attribute=self.attribute.split()
                    print "Question: ",self.attribute,self.name
                    tree={}
                    tree={str(self.name):self.attribute}
                    saving_tree=tree
                    #Add the children
                    for i in range(1,len(nodes)):#Since node 0 is a question
                        # print "Adding child ",nodes[i]," who has ",len(nodes)-1," siblings"
                        self.children.append(BinaryDecisionNode(nodes[i],self.name+str(i),lispTree,self))

        print saving_tree

    I want to save some data in saving_tree, which I declared earlier, and use it in another function outside the class. When I print saving_tree inside the class it prints, but only for that instance. I want saving_tree to store the data of all instances and to be accessible outside the class. When I print saving_tree outside the class, it prints an empty {}. Please tell me what modification is needed to get the required output and use saving_tree outside the class.
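A `global` statement in the class body does not carry into `__init__`, so `saving_tree=tree` inside the method creates a new local name and the module-level dict stays empty. A minimal sketch of one fix: add entries to the existing dict instead of rebinding it. The `Node` class and its sample data are stand-ins, not the real BinaryDecisionNode.

```python
saving_tree = {}  # module-level dict shared by every node


class Node:
    def __init__(self, name, attribute, children=()):
        self.name = name
        self.attribute = attribute
        # Mutating the existing dict (no rebinding) needs no 'global' statement
        saving_tree[name] = attribute
        # Child names follow the post's scheme: parent name + child index
        self.children = [
            Node(name + str(i), attr) for i, attr in enumerate(children, 1)
        ]


# Hypothetical tree: one question node with two leaves
Node("Root", ["is", "vowel"], children=[["leaf", "a"], ["leaf", "b"]])
print(saving_tree)
```

If rebinding really is needed, the `global saving_tree` line would have to go inside `__init__` itself.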

  • How to send email from an EC2 instance using GoDaddy's SMTP server?

    - by Matt Greer
    SMTP is a whole new ballgame for me, but I am reading up on it. I am attempting to send email from my EC2 instance using GoDaddy's SMTP server. My domain name is registered through GoDaddy and I have 2 email accounts with them. I can successfully send the email from my dev box no problem. my web.config <system.net> <mailSettings> <smtp from="[email protected]" deliveryMethod="Network"> <network host="smtpout.secureserver.net" clientDomain="mydomain.com" port="25" userName="[email protected]" password="mypassword" defaultCredentials="false" /> </smtp> </mailSettings> </system.net> In my ASP.NET app: MailMessage mailMessage = new MailMessage("[email protected]", recipientEmail, emailSubject, body); mailMessage.IsBodyHtml = false; SmtpClient mailClient = new SmtpClient(); mailClient.Send(mailMessage); Very typical, simple use of System.Net.Mail.SmtpClient. The mail client is picking up the settings from my web.config as expected. From the EC2 instance, the same setup yields: System.Net.Mail.SmtpException: Failure sending mail. ---> System.IO.IOException: Unable to read data from the transport connection: net_io_connectionclosed. at System.Net.Mail.SmtpReplyReaderFactory.ProcessRead(Byte[] buffer, Int32 offset, Int32 read, Boolean readLine) at System.Net.Mail.SmtpReplyReaderFactory.ReadLines(SmtpReplyReader caller, Boolean oneLine) at System.Net.Mail.SmtpReplyReaderFactory.ReadLine(SmtpReplyReader caller) at System.Net.Mail.SmtpConnection.GetConnection(ServicePoint servicePoint) at System.Net.Mail.SmtpClient.Send(MailMessage message) --- End of inner exception stack trace --- I have searched high and low and not found anyone else attempting this. All GoDaddy smtp situations I have found involve people being hosted by GoDaddy using their relay server. Some more info: My EC2 instance is Windows Server 2008 with IIS 7. 
The app is running in .NET 4 I can successfully use Gmail's SMTP server on the EC2 instance by using their port, setting SmtpClient.EnableSsl to true, and sending the mail through a gmail account. But we want to send the email from an account on our domain. I have port 25 open on both the Windows firewall and Amazon's Security group based firewall. I have played with Wireshark and noticed my SMTP related traffic was talking to ports in the 5,000s, so out of desperation I opened them all up to no avail (then closed them back down) As far as I know my EC2 instance's IP address is not black listed by GoDaddy. I have a feeling I'm just missing something fundamental. I also have a feeling someone is going to recommend I use AuthSmtp or something similar, I'll agree, and have had wasted the past 6 hours :)

  • Generate a list of file names based on month and year arithmetic

    - by MacUsers
    How can I list the numbers 01 to 12 (one for each of the 12 months) so that the current month always comes last and the oldest one first? In other words, if the number is greater than the current month, it's from the previous year. For example, 02 is Feb, 2011 (the current month right now), 03 is March, 2010 and 09 is Sep, 2010, but 01 is Jan, 2011. In this case, I'd like to have [09, 03, 01, 02]. This is what I'm doing to determine the year:

        for inFile in os.listdir('.'):
            if inFile.isdigit():
                month = months[int(inFile)]
                if int(inFile) <= int(strftime("%m")):
                    year = strftime("%Y")
                else:
                    year = int(strftime("%Y"))-1
                mnYear = month + ", " + str(year)

    I don't have a clue what to do next. What should I do here?

    Update: I think I'd better upload the entire script for better understanding.

        #!/usr/bin/env python

        import os, sys
        from time import strftime
        from calendar import month_abbr

        vGroup = {}
        vo = "group_lhcb"
        SI00_fig = float(2.478)
        months = tuple(month_abbr)

        print "\n%-12s\t%10s\t%8s\t%10s" % ('VOs','CPU-time','CPU-time','kSI2K-hrs')
        print "%-12s\t%10s\t%8s\t%10s" % ('','(in Sec)','(in Hrs)','(*2.478)')
        print "=" * 58

        for inFile in os.listdir('.'):
            if inFile.isdigit():
                readFile = open(inFile, 'r')
                lines = readFile.readlines()
                readFile.close()

                month = months[int(inFile)]
                if int(inFile) <= int(strftime("%m")):
                    year = strftime("%Y")
                else:
                    year = int(strftime("%Y"))-1
                mnYear = month + ", " + str(year)

                for line in lines[2:]:
                    if line.find(vo)==0:
                        g, i = line.split()
                        s = vGroup.get(g, 0)
                        vGroup[g] = s + int(i)

                sumHrs = ((vGroup[g]/60)/60)
                sumSi2k = sumHrs*SI00_fig
                print "%-12s\t%10s\t%8s\t%10.2f" % (mnYear,vGroup[g],sumHrs,sumSi2k)
                del vGroup[g]

    When I run the script, I get this:

        [root@serv07 usage]# ./test.py

        VOs             CPU-time    CPU-time    kSI2K-hrs
                        (in Sec)    (in Hrs)    (*2.478)
        ==================================================
        Jan, 2011      211201372       58667    145376.83
        Dec, 2010        5064337        1406      3484.07
        Feb, 2011       17506049        4862     12048.04
        Sep, 2010      210874275       58576    145151.33

    As I said in the original post, I'd like the result to be in this order instead:

        Sep, 2010      210874275       58576    145151.33
        Dec, 2010        5064337        1406      3484.07
        Jan, 2011      211201372       58667    145376.83
        Feb, 2011       17506049        4862     12048.04

    The files in the source directory look like this:

        [root@serv07 usage]# ls -l
        total 3632
        -rw-r--r-- 1 root root 1144972 Feb  9 19:23 01
        -rw-r--r-- 1 root root  556630 Feb 13 09:11 02
        -rw-r--r-- 1 root root  443782 Feb 11 17:23 02.bak
        -rw-r--r-- 1 root root 1144556 Feb 14 09:30 09
        -rw-r--r-- 1 root root  370822 Feb  9 19:24 12

    Did I give a better picture now? Sorry for not being very clear in the first place. Cheers!!

    Update @Mark Ransom: This is the result from Mark's suggestion:

        [root@serv07 usage]# ./test.py

        VOs             CPU-time    CPU-time    kSI2K-hrs
                        (in Sec)    (in Hrs)    (*2.478)
        ==========================================================
        Dec, 2010        5064337        1406      3484.07
        Sep, 2010      210874275       58576    145151.33
        Feb, 2011       17506049        4862     12048.04
        Jan, 2011      211201372       58667    145376.83

    As I said before, I'm looking for the result to be printed in this order: Sep, 2010 - Dec, 2010 - Jan, 2011 - Feb, 2011. Cheers!!
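One way to get the oldest-first order is to sort the month numbers by how far they are from "next month", wrapping around the year. A sketch under that assumption; `months_oldest_first` is a hypothetical helper, and the current month is passed in explicitly instead of coming from `strftime("%m")` so the result is reproducible.

```python
def months_oldest_first(month_numbers, current_month):
    """Sort month numbers so the current month comes last.

    Months after the current one belong to last year, so they sort first:
    (m - current_month - 1) % 12 is 0 for next month's number and 11 for
    the current month.
    """
    return sorted(month_numbers, key=lambda m: (m - current_month - 1) % 12)


files = ["01", "02", "09", "12"]  # file names as in the directory listing
ordered = months_oldest_first([int(f) for f in files], current_month=2)
print(["%02d" % m for m in ordered])
```

In the script, the `os.listdir('.')` loop could iterate over the digit-named files in this sorted order instead of directory order.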

  • Searching for duplicate records within a text file where the duplicate is determined by only two fie

    - by plg
    First, Python Newbie; be patient/kind. Next, once a month I receive a large text file (think 7 Million records) to test for duplicate values. This is catalog information. I get 7 fields, but the two I'm interested in are a supplier code and a full orderable part number. To determine if the record is dupliacted, I compress all special characters from the part number (except . and #) and create a compressed part number. The test for duplicates becomes the supplier code and compressed part number combination. This part is fairly straight forward. Currently, I am just copying the original file with 2 new columns (compressed part and duplicate indicator). If the part is a duplicate, I put a "YES" in the last field. Now that this is done, I want to be able to go back (or better yet, at the same time) to get the previous record where there was a supplier code/compressed part number match. So far, my code looks like this: Compress Full Part to a Compressed Part and Check for Duplicates on Supplier Code and Compressed Part combination import sys import re import time ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ start=time.time() try: file1 = open("C:\Accounting\May Accounting\May.txt", "r") except IOError: print sys.stderr, "Cannot Open Read File" sys.exit(1) try: file2 = open(file1.name[0:len(file1.name)-4] + "_" + "COMPRESSPN.txt", "a") except IOError: print sys.stderr, "Cannot Open Write File" sys.exit(1) hdrList="CIGSUPPLIER|FULL_PART|PART_STATUS|ALIAS_FLAG|ACQUISITION_FLAG|COMPRESSED_PART|DUPLICATE_INDICATOR" file2.write(hdrList+chr(10)) lines_seen=set() affirm="YES" records = file1.readlines() for record in records: fields = record.split(chr(124)) if fields[0]=="CIGSupplier": continue #If incoming file has a header line, skip it file2.write(fields[0]+"|"), #Supplier Code file2.write(fields[1]+"|"), #Full_Part file2.write(fields[2]+"|"), #Part Status file2.write(fields[3]+"|"), #Alias Flag file2.write(re.sub("[$\r\n]", "", fields[4])+"|"), 
#Acquisition Flag file2.write(re.sub("[^0-9a-zA-Z.#]", "", fields[1])+"|"), #Compressed_Part dupechk=fields[0]+"|"+re.sub("[^0-9a-zA-Z.#]", "", fields[1]) if dupechk not in lines_seen: file2.write(chr(10)) lines_seen.add(dupechk) else: file2.write(affirm+chr(10)) print "it took", time.time() - start, "seconds." ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ file2.close() file1.close() It runs in less than 6 minutes, so I am happy with this part, even if it is not elegant. Right now, when I get my results, I import the results into Access and do a self join to locate the duplicates. Loading/querying/exporting results in Access a file this size takes around an hour, so I would like to be able to export the matched duplicates to another text file or an Excel file. Confusing enough? Thanks.
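To get the earlier matching record in the same pass, the set could become a dict that maps each supplier/compressed-part key to the first record seen with it. A sketch only: `find_duplicates` and the sample records are made up, and holding 7 million keys plus their records in memory is assumed to be acceptable.

```python
import re


def compress(part):
    """Strip everything except letters, digits, '.' and '#', as in the post."""
    return re.sub(r"[^0-9a-zA-Z.#]", "", part)


def find_duplicates(records):
    first_seen = {}  # (supplier, compressed part) -> first record with that key
    matches = []     # (duplicate record, earlier record it collides with)
    for record in records:
        fields = record.rstrip("\r\n").split("|")
        key = (fields[0], compress(fields[1]))
        if key in first_seen:
            matches.append((record, first_seen[key]))
        else:
            first_seen[key] = record
    return matches


# Invented sample records: supplier|full part|status|alias flag|acquisition flag
sample = ["ACME|AB-123|A|N|N", "ACME|AB 123|A|N|N", "BOLT|AB-123|A|N|N"]
dupes = find_duplicates(sample)
print(dupes)
```

Each pair in `matches` could then be written to a second output file, which may remove the need for the Access self join.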

  • MapReduce in DryadLINQ and PLINQ

    - by JoshReuben
    MapReduce

    See http://en.wikipedia.org/wiki/Mapreduce

    The MapReduce pattern aims to handle large-scale computations across a cluster of servers, often involving massive amounts of data.

    "The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The developer expresses the computation as two Func delegates: Map and Reduce.

    Map - takes a single input pair and produces a set of intermediate key/value pairs. The MapReduce function groups results by key and passes them to the Reduce function.

    Reduce - accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's Reduce function via an iterator."

    The canonical MapReduce example is counting word frequency in a text file.

    MapReduce using DryadLINQ

    See http://research.microsoft.com/en-us/projects/dryadlinq/ and http://connect.microsoft.com/Dryad

    DryadLINQ provides a simple and straightforward way to implement MapReduce operations. The implementation has two primary components:

    A Pair structure, which serves as a data container.
    A MapReduce method, which counts word frequency and returns the k most frequent words.

    The Pair Structure - Pair has two properties: Word is a string that holds a word or key. Count is an int that holds the word count. The structure also overrides ToString to simplify printing the results. The following example shows the Pair implementation.

        public struct Pair
        {
            private string word;
            private int count;

            public Pair(string w, int c)
            {
                word = w;
                count = c;
            }

            public int Count { get { return count; } }
            public string Word { get { return word; } }

            public override string ToString()
            {
                return word + ":" + count.ToString();
            }
        }

    The MapReduce function gets the results. The input data could be partitioned and distributed across the cluster. The function:

    1. Creates a DryadTable<LineRecord> object, inputTable, to represent the lines of input text. For partitioned data, use GetPartitionedTable<T> instead of GetTable<T> and pass the method a metadata file.

    2. Applies the SelectMany operator to inputTable to transform the collection of lines into a collection of words. The String.Split method converts the line into a collection of words. SelectMany concatenates the collections created by Split into a single IQueryable<string> collection named words, which represents all the words in the file.

    3. Performs the Map part of the operation by applying GroupBy to the words object. The GroupBy operation groups elements with the same key, which is defined by the selector delegate. This creates a higher order collection, whose elements are groups. In this case, the delegate is an identity function, so the key is the word itself and the operation creates a groups collection that consists of groups of identical words.

    4. Performs the Reduce part of the operation by applying Select to groups. This operation reduces the groups of words from Step 3 to an IQueryable<Pair> collection named counts that represents the unique words in the file and how many instances there are of each word. Each key value in groups represents a unique word, so Select creates one Pair object for each unique word. IGrouping.Count returns the number of items in the group, so each Pair object's Count member is set to the number of instances of the word.

    5. Applies OrderByDescending to counts. This operation sorts the input collection in descending order of frequency and creates an ordered collection named ordered.

    6. Applies Take to ordered to create an IQueryable<Pair> collection named top, which contains the k most common words in the input file (100 in the test below) and their frequency. Test then uses the Pair object's ToString implementation to print the top one hundred words and their frequency.

        public static IQueryable<Pair> MapReduce(
            string directory,
            string fileName,
            int k)
        {
            DryadDataContext ddc = new DryadDataContext("file://" + directory);
            DryadTable<LineRecord> inputTable = ddc.GetTable<LineRecord>(fileName);
            IQueryable<string> words = inputTable.SelectMany(x => x.line.Split(' '));
            IQueryable<IGrouping<string, string>> groups = words.GroupBy(x => x);
            IQueryable<Pair> counts = groups.Select(x => new Pair(x.Key, x.Count()));
            IQueryable<Pair> ordered = counts.OrderByDescending(x => x.Count);
            IQueryable<Pair> top = ordered.Take(k);

            return top;
        }

    To test:

        IQueryable<Pair> results = MapReduce(@"c:\DryadData\input", "TestFile.txt", 100);
        foreach (Pair words in results)
            Debug.Print(words.ToString());

    Note: DryadLINQ applications can use a more compact way to represent the query:

        return inputTable
            .SelectMany(x => x.line.Split(' '))
            .GroupBy(x => x)
            .Select(x => new Pair(x.Key, x.Count()))
            .OrderByDescending(x => x.Count)
            .Take(k);

    MapReduce using PLINQ

    The pattern is relevant even for a single multi-core machine, however. We can write our own PLINQ MapReduce in a few lines.

    The Map function takes a single input value and returns a set of mapped values -> LINQ's SelectMany operator.
    These are then grouped according to an intermediate key -> LINQ's GroupBy operator.
    The Reduce function takes each intermediate key and a set of values for that key, and produces any number of outputs per key -> LINQ's SelectMany again.

    We can put all of this together to implement MapReduce in PLINQ that returns a ParallelQuery<T>:

        public static ParallelQuery<TResult> MapReduce<TSource, TMapped, TKey, TResult>(
            this ParallelQuery<TSource> source,
            Func<TSource, IEnumerable<TMapped>> map,
            Func<TMapped, TKey> keySelector,
            Func<IGrouping<TKey, TMapped>, IEnumerable<TResult>> reduce)
        {
            return source
                .SelectMany(map)
                .GroupBy(keySelector)
                .SelectMany(reduce);
        }

    The map function takes in an input document and outputs all of the words in that document. The grouping phase groups all of the identical words together, such that the reduce phase can then count the words in each group and output a word/count pair for each grouping:

        var files = Directory.EnumerateFiles(dirPath, "*.txt").AsParallel();
        var counts = files.MapReduce(
            path => File.ReadLines(path).SelectMany(line => line.Split(delimiters)),
            word => word,
            group => new[] { new KeyValuePair<string, int>(group.Key, group.Count()) });
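For readers who want to trace the three phases outside .NET, here is a small sequential Python sketch of the same map, group, reduce shape. It is not parallel and it is not part of DryadLINQ or PLINQ; `map_reduce` and the two sample documents are made up for illustration.

```python
from collections import defaultdict


def map_reduce(source, map_fn, key_fn, reduce_fn):
    """Sequential stand-in for SelectMany -> GroupBy -> SelectMany."""
    groups = defaultdict(list)
    for item in source:
        for mapped in map_fn(item):           # map: one input, many outputs
            groups[key_fn(mapped)].append(mapped)  # group by intermediate key
    results = []
    for key, values in groups.items():
        results.extend(reduce_fn(key, values))     # reduce: any number of outputs
    return results


docs = ["the cat and the hat", "the end"]  # invented sample documents
counts = map_reduce(
    docs,
    lambda doc: doc.split(),                  # map: document -> words
    lambda word: word,                        # key: the word itself
    lambda key, values: [(key, len(values))], # reduce: word/count pair
)
top = sorted(counts, key=lambda kv: kv[1], reverse=True)[:2]
print(top[0])
```

The final `sorted(...)[:2]` plays the role of OrderByDescending followed by Take in the DryadLINQ version.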
