Parsing multiple files at a time in Perl

Posted by sfactor on Stack Overflow See other posts from Stack Overflow or by sfactor
Published on 2010-12-31T14:23:06Z Indexed on 2010/12/31 14:54 UTC
Read the original article Hit count: 310

Filed under:

perl

|

Performance

|

parsing

|

file

|

memory-management

I have a large data set (around 90GB) to work with. There are data files (tab delimited) for each hour of each day and I need to perform operations in the entire data set. For example, get the share of OSes which are given in one of the columns. I tried merging all the files into one huge file and performing the simple count operation but it was simply too huge for the server memory.

So, I guess I need to perform the operation each file at a time and then add up in the end. I am new to perl and am especially naive about the performance issues. How do I do such operations in a case like this.

As an example two columns of the file are.

ID      OS
1       Windows
2       Linux
3       Windows
4       Windows

Lets do something simple, counting the share of the OSes in the data set. So, each .txt file has millions of these lines and there are many such files. What would be the most efficient way to operate on the entire files.

© Stack Overflow or respective owner

Related posts about perl

Munin on Centos 6 - missing perl MODULE_COMPAT_5.8.8

as seen on Server Fault - Search for 'Server Fault'
I'm trying to install Munin on a new VPS through yum install munin but I keep getting an error about a missing perl module: Requires: perl(:MODULE_COMPAT_5.8.8). This is the perl version currently installed: v5.10.1. I've searched all around and still haven't found a solution for this. Here's the… >>> More
Pain removing a perl rootkit

as seen on Server Fault - Search for 'Server Fault'
So, we host a geoservice webserver thing at the office. Someone apparently broke into this box (probably via ftp or ssh), and put some kind of irc-managed rootkit thing. Now I'm trying to clean the whole thing up, I found the process pid who tries to connect via irc, but i can't figure out who's… >>> More
How To Avoid a Perl script calling an Another Perl Script

as seen on Stack Overflow - Search for 'Stack Overflow'
Hello, i am calling a perl script client.pl from a main script to capture the output of client.pl in @output. is there anyway to avoid the use of these two files so i can use the output of client.pl in main.pl itself here is my code.... main.pl ======= my @output = readpipe("client.pl"); client… >>> More
Perl :how to sort dates in perl

as seen on Stack Overflow - Search for 'Stack Overflow'
Hi, How can I sort the dates in perl. my @dates = ( "02/11/2009" , "12/20/2001" , "11/21/2010" ) ; I have above dates in my array . How can I sort those dates... ? My date format is dd/mm/YYYY. >>> More
please suggest a perl book exclusively for perl programs

as seen on Stack Overflow - Search for 'Stack Overflow'
I want tha name of a perl book for only PERL PROGRAMS. The reason behind is I want to improve my programming skill in perl >>> More

Related posts about Performance

Improving VPN performance - stronger encryption = more performance?

as seen on Server Fault - Search for 'Server Fault'
I have a site-to-site VPN set up with two SonicWall's (a TZ170 and a Pro1260). It was suggested to me that turning off encryption (so the VPN is tunneling only) would improve performance. (I'm not concerned with security, because the VPN is running over a trusted line.) Using FTP and HTTP transfers… >>> More
Inaccurate performance counter timer values in Windows Performance Monitor

as seen on Stack Overflow - Search for 'Stack Overflow'
I am implementing instrumentation within an application and have encountered an issue where the value that is displayed in Windows Performance Monitor from a PerformanceCounter is incongruent with the value that is recorded. I am using a Stopwatch to record the duration of a method execution, then… >>> More
Excel-based Performance Reviews transformed into Web Application for Performance Management

as seen on Geeks with Blogs - Search for 'Geeks with Blogs'
HR TMS provides enterprise talent management solutions for healthcare, retail and corporate customers, focusing on performance management, compensation management and succession planning. As the competency of nurses and other healthcare workers is critical, the government, via the Joint Commission… >>> More
How to save a perfmon Performance Counter as a textfile (Reliability and Performance Monitor Version

as seen on Server Fault - Search for 'Server Fault'
Now the file gets saved as blg, but I would like a txt versin to import in Excel. >>> More
SQLAuthority News – A Successful Performance Tuning Seminar at Pune – Dec 4-5, 2010

as seen on SQL Authority - Search for 'SQL Authority'
This is report to my third of very successful seminar event on SQL Server Performance Tuning. SQL Server Performance Tuning Seminar in Colombo was oversubscribed with total of 35 attendees. You can read the details over here SQLAuthority News – SQL Server Performance Optimizations Seminar – Grand… >>> More