How can I quickly parse large (>10GB) files?

Posted by Andrew on Stack Overflow
Published on 2009-12-17T01:56:48Z Indexed on 2010/03/19 16:01 UTC

Filed under: awk | perl | large-files

Hi - I have to process text files 10-20 GB in size, with lines of the format:

field1 field2 field3 field4 field5

I would like to parse the data from field2 of each line into one of several files; the file it goes into is determined line by line by the value in field4. There are 25 different possible values in field4 and hence 25 different files the data can be written to.

I have tried using Perl (slow) and awk (faster but still slow) - does anyone have any suggestions or pointers toward alternative approaches?

FYI, here is the awk code I was trying to use. Note that I had to resort to reading through the large file 25 times because I wasn't able to keep 25 files open at once in awk:

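chromosomes=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25)
for chr in "${chromosomes[@]}"
do
    # One full pass over the input per chromosome: for lines whose field4 matches,
    # print the 53 integers from field2 to field2+52, one per line.
    awk -v pat="$chr" '{if ($4 == pat) for (i = $2; i <= $2+52; i++) print i}' < my_in_file_here >> my_out_file_"$chr".query
done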
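
For comparison, here is a minimal sketch of a single-pass variant, in which awk picks the output file from field4 on each line instead of rereading the input 25 times. It assumes an awk that can hold 25 output files open at once (gawk is documented to handle this; some older awks have a lower limit, which may be what I ran into). The function name split_by_field4 and its prefix argument are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the input once and route each line's
# field2..field2+52 range to a file named after that line's field4.
split_by_field4() {
    local infile=$1   # input file with whitespace-separated fields
    local prefix=$2   # output files are named <prefix><field4>.query
    awk -v prefix="$prefix" '
    {
        # Build the file name in a variable first; an unparenthesized
        # concatenation after ">" is ambiguous in awk.
        out = prefix $4 ".query"
        start = $2 + 0
        # The first write to a given name truncates that file; awk then
        # keeps it open, so later writes in the same run append to it.
        for (i = start; i <= start + 52; i++)
            print i > out
    }' "$infile"
}
```

It would be called as split_by_field4 my_in_file_here my_out_file_ to produce the same file names as the loop above. Unlike the >> in the loop, each run starts the output files from scratch rather than appending to existing ones.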

© Stack Overflow or respective owner
