Python: Most efficient way to concatenate and rearrange files

Posted by user300890 on Stack Overflow See other posts from Stack Overflow or by user300890
Published on 2010-03-24T14:46:45Z Indexed on 2010/03/24 15:03 UTC
Read the original article Hit count: 270

Filed under:
|
|
|

Hi,

I am reading from several files, each file is divided into 2 pieces, first a header section of a few thousand lines followed by a body of a few thousand. My problem is I need to concatenate these files into one file where all the headers are on the top followed by the body.

Currently I am using two loops; one to pull out all the headers and write them, and the second to write the body of each file (I also include a tmp_count variable to limit the number of lines to be loading into memory before dumping to file).

This is pretty slow - about 6min for 13gb file. Can anyone tell me how to optimize this or if there is a faster way to do this in python ?

Thanks!

Here is my code:

def cat_files_sam(final_file_name,work_directory_master,file_count):

    final_file = open(final_file_name,"w")

if len(file_count) > 1:
    file_count=sort_output_files(file_count)

# only for @ headers    
    for bowtie_file in file_count:
        #print bowtie_file
        tmp_list = []

        tmp_count = 0
        for line in open(os.path.join(work_directory_master,bowtie_file)):
            if line.startswith("@"):

        if tmp_count == 1000000:
            final_file.writelines(tmp_list)
            tmp_list = []
            tmp_count = 0

        tmp_list.append(line)
        tmp_count += 1

    else:
        final_file.writelines(tmp_list)
        break

for bowtie_file in file_count:
        #print bowtie_file
        tmp_list = []

        tmp_count = 0
        for line in open(os.path.join(work_directory_master,bowtie_file)):
            if line.startswith("@"):
        continue
    if tmp_count == 1000000:
        final_file.writelines(tmp_list)
        tmp_list = []
        tmp_count = 0

    tmp_list.append(line)
    tmp_count += 1
    final_file.writelines(tmp_list)

final_file.close()

© Stack Overflow or respective owner

Related posts about python

Related posts about concatenate